Web Data Commons - RDFa, Microdata, Embedded JSON-LD, and Microformats Data Sets - November 2017

This document provides statistics about the Web Data Commons RDFa, Microdata, Embedded JSON-LD and Microformats data sets, which have been extracted from the November 2017 release of the Common Crawl.

In summary, we found structured data within 1.2 billion HTML pages out of the 3.2 billion pages contained in the crawl (38.9%). These pages originate from 7.4 million different pay-level-domains out of the 26 million pay-level-domains covered by the crawl (28.3%). Altogether, the extracted data sets consist of 38.7 billion RDF quads.

Instructions on how to download the RDFa, Microdata, Embedded JSON-LD and Microformats data sets are given on the page "How to get the data".

Please note that in the following, the term Domains refers to pay-level-domains. Subdomains are not counted as separate domains.
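As an illustration of what counting by pay-level-domain means, the following minimal sketch maps a host name to its pay-level-domain. The function name `pay_level_domain` and the small `KNOWN_SUFFIXES` set are hypothetical; this is not the method used to produce the statistics, and a real implementation would need a complete public suffix list.

```python
# Hypothetical, deliberately tiny stand-in for a full public suffix list.
KNOWN_SUFFIXES = {"com", "net", "org", "de", "ru", "ua", "ca", "fr",
                  "co.uk", "com.es", "com.br"}


def pay_level_domain(host, suffixes=KNOWN_SUFFIXES):
    """Return the registrable part of `host`, or None if it cannot be determined."""
    labels = host.lower().rstrip(".").split(".")
    # Scan from the longest candidate suffix to the shortest, so that
    # "co.uk" is matched before "uk" would be.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            if i == 0:
                return None  # the host is itself a bare suffix
            # Keep exactly one label in front of the matched suffix.
            return ".".join(labels[i - 1:])
    return None  # suffix not covered by the toy list


for host in ("www.hotels.com", "en.wikipedia.org", "foo.blogspot.co.uk"):
    print(host, "->", pay_level_domain(host))
```

Under this sketch, all subdomains of a site collapse to a single entry, which is why subdomains do not appear separately in the domain counts.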

Overall


Crawl Date: November 2017
Total Data: 66 Terabyte (compressed)
Parsed HTML URLs: 3,155,601,774
URLs with Triples: 1,228,129,002
Domains in Crawl: 26,271,491
Domains with Triples: 7,422,886
Typed Entities: 9,430,164,323
Triples: 38,721,044,133
Size of Extracted Data: 855 GB (compressed)
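The coverage shares quoted in the summary can be recomputed from the counts above. The sketch below is only a convenience check; the variable and function names are illustrative and not part of the data set.

```python
# Counts copied from the Overall table.
parsed_urls = 3_155_601_774
urls_with_triples = 1_228_129_002
domains_in_crawl = 26_271_491
domains_with_triples = 7_422_886


def share(part, whole):
    """Percentage of `whole` that is covered by `part`."""
    return 100.0 * part / whole


url_share = share(urls_with_triples, parsed_urls)
domain_share = share(domains_with_triples, domains_in_crawl)

print(f"URLs with triples:    {url_share:.1f}% of parsed HTML URLs")
print(f"Domains with triples: {domain_share:.1f}% of domains in the crawl")
```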

Results per Format


Format                 Domains     URLs            Typed Entities   Triples
html-microdata         3,743,822   646,409,625     4,837,635,224    24,359,443,316
html-embedded-jsonld   2,685,738   190,890,906     818,557,558      3,623,025,088
html-mf-hcard          3,645,662   418,095,860     3,186,672,022    8,371,745,745
html-rdfa              1,209,430   220,889,867     430,349,620      1,629,581,643
html-mf-xfn            392,035     27,320,114      69,259,620       401,275,671
html-mf-adr            192,390     17,895,411      41,729,827       143,728,079
html-mf-geo            37,807      4,646,171       7,965,632        21,674,355
html-mf-hcalendar      40,257      2,343,185       14,251,603       62,666,600
html-mf-hreview        27,181      3,702,554       10,153,540       59,026,065
html-mf-hlisting       7,162       314,164         9,148,197        31,778,446
html-mf-hrecipe        5,179       543,865         3,681,013        15,228,823
html-mf2-h-adr         1,880       161,842         415,221          1,061,536
html-mf-hresume        223         16,902          37,970           53,951
html-mf-species        225         99,301          307,276          754,815
OVERALL                7,422,886   1,228,129,002   9,430,164,323    38,721,044,133
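A quick consistency check on the table above: the per-format triple and typed-entity counts add up exactly to the OVERALL row, while the per-format URL counts add up to more than the overall URL count. A plausible reading, which is an assumption and not stated by the source, is that a single page can carry more than one format and is then counted once per format. The sketch below reproduces the check; the dictionary layout is illustrative.

```python
# (URLs, typed entities, triples) per extractor, copied from the table above.
per_format = {
    "html-microdata":       (646_409_625, 4_837_635_224, 24_359_443_316),
    "html-embedded-jsonld": (190_890_906, 818_557_558, 3_623_025_088),
    "html-mf-hcard":        (418_095_860, 3_186_672_022, 8_371_745_745),
    "html-rdfa":            (220_889_867, 430_349_620, 1_629_581_643),
    "html-mf-xfn":          (27_320_114, 69_259_620, 401_275_671),
    "html-mf-adr":          (17_895_411, 41_729_827, 143_728_079),
    "html-mf-geo":          (4_646_171, 7_965_632, 21_674_355),
    "html-mf-hcalendar":    (2_343_185, 14_251_603, 62_666_600),
    "html-mf-hreview":      (3_702_554, 10_153_540, 59_026_065),
    "html-mf-hlisting":     (314_164, 9_148_197, 31_778_446),
    "html-mf-hrecipe":      (543_865, 3_681_013, 15_228_823),
    "html-mf2-h-adr":       (161_842, 415_221, 1_061_536),
    "html-mf-hresume":      (16_902, 37_970, 53_951),
    "html-mf-species":      (99_301, 307_276, 754_815),
}
# OVERALL row in the same column order.
overall = (1_228_129_002, 9_430_164_323, 38_721_044_133)

# Column-wise sums over all extractors.
url_sum, entity_sum, triple_sum = (sum(col) for col in zip(*per_format.values()))

print("triples add up:       ", triple_sum == overall[2])
print("typed entities add up:", entity_sum == overall[1])
print("URL sum exceeds total:", url_sum > overall[0])
```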

Top Domains by Extracted Triples


  1. blogspot.com (2,326,407,936 triples)
  2. wordpress.com (836,982,484 triples)
  3. theclothdiaperwhisperer.com (258,310,473 triples)
  4. skyrock.com (232,476,462 triples)
  5. moosejaw.com (147,690,597 triples)
  6. canalblog.com (110,509,475 triples)
  7. hotels.com (109,331,631 triples)
  8. peternyssen.com (106,756,630 triples)
  9. untitled-magazine.com (106,468,802 triples)
  10. justia.com (104,903,804 triples)
  11. leadferret.com (92,088,394 triples)
  12. drom.ru (80,352,891 triples)
  13. tutete.com (74,136,046 triples)
  14. prom.ua (73,993,196 triples)
  15. epicsports.com (72,744,802 triples)
  16. repairpal.com (68,768,136 triples)
  17. callersmart.com (68,117,826 triples)
  18. lyst.com (67,716,741 triples)
  19. kidsroom.de (66,900,221 triples)

Top Domains by URLs with Triples


  1. blogspot.com (182,764,220 urls)
  2. wordpress.com (32,201,971 urls)
  3. pixnet.net (8,476,081 urls)
  4. skyrock.com (5,038,350 urls)
  5. hatenablog.com (3,385,203 urls)
  6. blogspot.com.es (2,842,337 urls)
  7. blogspot.co.uk (2,673,792 urls)
  8. hotels.com (2,217,535 urls)
  9. canalblog.com (1,915,842 urls)
  10. blogspot.ca (1,805,282 urls)
  11. freelancer.com (1,656,272 urls)
  12. wikipedia.org (1,585,426 urls)
  13. blogspot.com.br (1,585,173 urls)
  14. google.com (1,390,552 urls)
  15. blogspot.de (1,288,835 urls)
  16. typepad.com (1,286,654 urls)
  17. blogspot.fr (1,226,235 urls)
  18. livejournal.com (1,112,299 urls)
  19. forumotion.com (1,103,585 urls)

Extractor html-microdata


Triples Extracted: 24,359,443,316
URLs with Triples: 646,409,625
Average Triples per URL: 37.68
Domains with Triples: 3,743,822
Average Triples per Domain: 6,506.57
Typed Entities: 4,837,635,224
Detailed Statistics as Excel-File html-microdata.xlsx (450kb)
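The two averages in each extractor section follow directly from the counts: triples divided by URLs with triples, and triples divided by domains with triples. Judging from the reported figures, the values appear to be truncated to two decimals rather than rounded; that is an observation from the numbers, not something the source states. A minimal sketch, with a hypothetical helper name:

```python
def truncated_average(total, count, decimals=2):
    """Average of `total` over `count`, cut off (not rounded) after `decimals` places."""
    scale = 10 ** decimals
    # Integer floor division avoids floating-point rounding before truncation.
    return (total * scale // count) / scale


# html-microdata counts from the section above.
triples = 24_359_443_316
urls_with_triples = 646_409_625
domains_with_triples = 3_743_822

per_url = truncated_average(triples, urls_with_triples)
per_domain = truncated_average(triples, domains_with_triples)

print("Average triples per URL:   ", per_url)
print("Average triples per domain:", per_domain)
```

The same helper can be applied to the counts of any other extractor section.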

Extractor html-embedded-jsonld


Triples Extracted: 3,623,025,088
URLs with Triples: 190,890,906
Average Triples per URL: 18.97
Domains with Triples: 2,685,738
Average Triples per Domain: 1,348.98
Typed Entities: 818,557,558
Detailed Statistics as Excel-File html-embedded-jsonld.xlsx (75kb)

Extractor html-mf-hcard


Triples Extracted: 8,371,745,745
URLs with Triples: 418,095,860
Average Triples per URL: 20.02
Domains with Triples: 3,645,662
Average Triples per Domain: 2,296.35
Typed Entities: 3,186,672,022

Extractor html-rdfa


Triples Extracted: 1,629,581,643
URLs with Triples: 220,889,867
Average Triples per URL: 7.37
Domains with Triples: 1,209,430
Average Triples per Domain: 1,347.39
Typed Entities: 430,349,620
Detailed Statistics as Excel-File html-rdfa.xlsx (120kb)

Extractor html-mf-xfn


Triples Extracted: 401,275,671
URLs with Triples: 27,320,114
Average Triples per URL: 14.68
Domains with Triples: 392,035
Average Triples per Domain: 1,023.57
Typed Entities: 69,259,620

Extractor html-mf-adr


Triples Extracted: 143,728,079
URLs with Triples: 17,895,411
Average Triples per URL: 8.03
Domains with Triples: 192,390
Average Triples per Domain: 747
Typed Entities: 41,729,827

Extractor html-mf-geo


Triples Extracted: 21,674,355
URLs with Triples: 4,646,171
Average Triples per URL: 4.66
Domains with Triples: 37,807
Average Triples per Domain: 573.28
Typed Entities: 7,965,632

Extractor html-mf-hcalendar


Triples Extracted: 62,666,600
URLs with Triples: 2,343,185
Average Triples per URL: 26.74
Domains with Triples: 40,257
Average Triples per Domain: 1,556.66
Typed Entities: 14,251,603

Extractor html-mf-hreview


Triples Extracted: 59,026,065
URLs with Triples: 3,702,554
Average Triples per URL: 15.94
Domains with Triples: 27,181
Average Triples per Domain: 2,171.59
Typed Entities: 10,153,540

Extractor html-mf-hlisting


Triples Extracted: 31,778,446
URLs with Triples: 314,164
Average Triples per URL: 101.15
Domains with Triples: 7,162
Average Triples per Domain: 4,437.09
Typed Entities: 9,148,197

Extractor html-mf-hrecipe


Triples Extracted: 15,228,823
URLs with Triples: 543,865
Average Triples per URL: 28
Domains with Triples: 5,179
Average Triples per Domain: 2,940.49
Typed Entities: 3,681,013

Extractor html-mf-hresume


Triples Extracted: 53,951
URLs with Triples: 16,902
Average Triples per URL: 3.19
Domains with Triples: 223
Average Triples per Domain: 241.93
Typed Entities: 37,970

Extractor html-mf-species


Triples Extracted: 754,815
URLs with Triples: 99,301
Average Triples per URL: 7.6
Domains with Triples: 225
Average Triples per Domain: 3,354.73
Typed Entities: 307,276