Web Data Commons - RDFa, Microdata, Embedded JSON-LD, and Microformats Data Sets - October 2016

This document provides statistics about the Web Data Commons RDFa, Microdata, Embedded JSON-LD, and Microformats data sets, which have been extracted from the October 2016 release of the Common Crawl.

In summary, we found structured data within 1.24 billion HTML pages out of the 3.2 billion pages contained in the crawl (39%). These pages originate from 5.64 million different pay-level domains out of the 34 million pay-level domains covered by the crawl (16.5%). Altogether, the extracted data sets consist of 44.2 billion RDF quads.

Instructions on how to download the RDFa, Microdata, Embedded JSON-LD, and Microformats data sets are given on the "How to Get the Data" page.

Please note that, in the following, the term "Domains" refers to pay-level domains. Subdomains are not counted as separate domains.
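
To illustrate what counting by pay-level domain means, here is a minimal Python sketch that reduces a hostname to its pay-level domain. The names `TOY_SUFFIXES` and `pay_level_domain` are hypothetical and introduced only for this example, and the suffix set is a tiny hand-picked subset. A real extraction would need the complete Public Suffix List, and this sketch is not necessarily how the Web Data Commons pipeline does it.

```python
# ASSUMPTION: a toy subset of public suffixes, for illustration only.
# A real implementation would load the full Public Suffix List.
TOY_SUFFIXES = {"com", "org", "ru", "it", "jp", "ne.jp", "uk", "co.uk"}


def pay_level_domain(host, suffixes=TOY_SUFFIXES):
    """Return the public suffix plus one label, or None if no suffix matches."""
    labels = host.lower().rstrip(".").split(".")
    # Walk from the longest candidate suffix to the shortest,
    # so that "ne.jp" is preferred over "jp".
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            if i == 0:
                return None  # the host is itself a public suffix
            return ".".join(labels[i - 1:])
    return None


for host in ["en.wikipedia.org", "news.goo.ne.jp", "www.hotels.com"]:
    print(host, "->", pay_level_domain(host))
```

Under this scheme, all subdomains of one registered domain collapse into a single entry, which is why the domain counts in this document are much smaller than the number of distinct hostnames would be.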

Overall


Crawl Date: October 2016
Size of Crawl: 54 terabytes (compressed)
Parsed HTML URLs: 3,181,199,447
URLs with Triples: 1,242,727,852
Domains in Crawl: 34,076,469
Domains with Triples: 5,638,796
Typed Entities: 9,590,731,005
Triples: 44,242,655,138
Size of Extracted Data: 967 GB (compressed)
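
The coverage figures quoted in the summary can be recomputed from the totals in the Overall table. The following Python sketch does that arithmetic. The constant and function names are illustrative choices for this example, not part of any Web Data Commons tooling.

```python
# Totals copied from the Overall table.
PARSED_URLS = 3_181_199_447
URLS_WITH_TRIPLES = 1_242_727_852
DOMAINS_IN_CRAWL = 34_076_469
DOMAINS_WITH_TRIPLES = 5_638_796
TYPED_ENTITIES = 9_590_731_005
TRIPLES = 44_242_655_138


def share(part, whole):
    """Return part / whole as a percentage."""
    return 100.0 * part / whole


url_coverage = share(URLS_WITH_TRIPLES, PARSED_URLS)
domain_coverage = share(DOMAINS_WITH_TRIPLES, DOMAINS_IN_CRAWL)
triples_per_url = TRIPLES / URLS_WITH_TRIPLES
triples_per_entity = TRIPLES / TYPED_ENTITIES

print(f"URLs with structured data:    {url_coverage:.1f}%")
print(f"Domains with structured data: {domain_coverage:.1f}%")
print(f"Average triples per URL:      {triples_per_url:.2f}")
print(f"Average triples per entity:   {triples_per_entity:.2f}")
```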

Results per Format


Format                 Domains     URLs            Typed Entities   Triples
html-microdata         2,537,539   901,118,191     6,872,341,887    34,637,805,559
html-embedded-jsonld   2,116,755   111,411,049     385,731,201      1,880,721,886
html-mf-hcard          1,668,039   159,748,255     1,614,688,960    4,600,477,456
html-rdfa              938,830     311,533,110     511,555,208      2,216,933,416
html-mf-xfn            195,595     24,242,546      48,011,285       300,764,344
html-mf-adr            188,755     27,697,569      80,039,476       259,718,235
html-mf-geo            23,800      6,151,013       14,644,289       25,733,274
html-mf-hcalendar      22,313      3,450,075       33,962,568       177,931,362
html-mf-hreview        16,984      4,551,011       13,680,480       79,631,745
html-mf-hlisting       4,710       374,180         9,578,853        36,521,676
html-mf-hrecipe        2,923       755,544         5,695,917        24,347,685
html-mf2-h-adr         1,415       136,200         266,590          726,401
html-mf-hresume        168         2,961           7,106            22,262
html-mf-species        95          170,516         527,185          1,319,837
OVERALL                5,638,796   1,242,727,852   9,590,731,005    44,242,655,138
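
As a rough way to compare the formats, the sketch below derives each format's share of all triples and its average number of triples per URL from the table above. The `FORMATS` mapping and the `summarize` helper are hypothetical names used only for this illustration.

```python
# (domains, urls, typed_entities, triples) per format, copied from the table.
FORMATS = {
    "html-microdata": (2_537_539, 901_118_191, 6_872_341_887, 34_637_805_559),
    "html-embedded-jsonld": (2_116_755, 111_411_049, 385_731_201, 1_880_721_886),
    "html-mf-hcard": (1_668_039, 159_748_255, 1_614_688_960, 4_600_477_456),
    "html-rdfa": (938_830, 311_533_110, 511_555_208, 2_216_933_416),
    "html-mf-xfn": (195_595, 24_242_546, 48_011_285, 300_764_344),
    "html-mf-adr": (188_755, 27_697_569, 80_039_476, 259_718_235),
    "html-mf-geo": (23_800, 6_151_013, 14_644_289, 25_733_274),
    "html-mf-hcalendar": (22_313, 3_450_075, 33_962_568, 177_931_362),
    "html-mf-hreview": (16_984, 4_551_011, 13_680_480, 79_631_745),
    "html-mf-hlisting": (4_710, 374_180, 9_578_853, 36_521_676),
    "html-mf-hrecipe": (2_923, 755_544, 5_695_917, 24_347_685),
    "html-mf2-h-adr": (1_415, 136_200, 266_590, 726_401),
    "html-mf-hresume": (168, 2_961, 7_106, 22_262),
    "html-mf-species": (95, 170_516, 527_185, 1_319_837),
}


def summarize(formats):
    """Return per-format triple share (%) and triples per URL, largest share first."""
    total_triples = sum(row[3] for row in formats.values())
    rows = []
    for name, (_domains, urls, _entities, triples) in formats.items():
        rows.append({
            "format": name,
            "triple_share": 100.0 * triples / total_triples,
            "triples_per_url": triples / urls,
        })
    return sorted(rows, key=lambda r: r["triple_share"], reverse=True)


for row in summarize(FORMATS):
    print(f"{row['format']:<22}{row['triple_share']:>8.2f}%{row['triples_per_url']:>10.2f}")
```

Note that only the triple counts are summed here. The per-format URL and domain counts are not added up, because a single page or domain can use more than one format.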

Top Domains by Extracted Triples


  1. blogspot.com (613,297,268 triples)
  2. ticketprocess.com (470,790,528 triples)
  3. moosejaw.com (410,021,577 triples)
  4. hallmark.com (351,800,787 triples)
  5. theclothdiaperwhisperer.com (314,888,334 triples)
  6. cnbc.com (298,241,469 triples)
  7. hotels.com (288,853,467 triples)
  8. repairpal.com (266,250,224 triples)
  9. uncommongoods.com (261,956,686 triples)
  10. wordpress.com (219,485,949 triples)
  11. justia.com (207,590,139 triples)
  12. leadferret.com (207,038,617 triples)
  13. propartner.ru (180,976,524 triples)
  14. callersmart.com (162,779,369 triples)
  15. gigmasters.com (151,575,208 triples)
  16. epicsports.com (150,282,774 triples)
  17. unitiki.com (125,549,640 triples)
  18. drom.ru (122,572,723 triples)
  19. zap2it.com (122,118,800 triples)
  20. caasa.it (108,318,384 triples)

Top Domains by URLs with Triples


  1. blogspot.com (25,188,223 urls)
  2. wordpress.com (4,422,380 urls)
  3. hotels.com (2,767,694 urls)
  4. wikipedia.org (2,581,621 urls)
  5. oclc.org (2,390,862 urls)
  6. made-in-china.com (1,703,736 urls)
  7. epicsports.com (1,599,468 urls)
  8. yahoo.com (1,532,144 urls)
  9. mlb.com (1,506,542 urls)
  10. dreamstime.com (1,255,992 urls)
  11. google.com (1,246,373 urls)
  12. pinterest.com (1,107,381 urls)
  13. wsj.com (1,083,232 urls)
  14. academic.ru (1,034,009 urls)
  15. flightaware.com (1,031,041 urls)
  16. polyvore.com (895,665 urls)
  17. foodily.com (886,440 urls)
  18. goo.ne.jp (878,598 urls)
  19. nj.com (857,635 urls)
  20. typepad.com (785,875 urls)

Extractor html-microdata


Triples Extracted: 34,637,805,559
URLs with Triples: 901,118,191
Average Triples per URL: 38.44
Domains with Triples: 2,537,539
Average Triples per Domain: 13,650.16
Typed Entities: 6,872,341,887
Detailed Statistics as Excel File: html-microdata.xlsx (1,180 KB)

Extractor html-embedded-jsonld


Triples Extracted: 1,880,721,886
URLs with Triples: 111,411,049
Average Triples per URL: 16.88
Domains with Triples: 2,116,755
Average Triples per Domain: 888.49
Typed Entities: 385,731,201
Detailed Statistics as Excel File: html-embedded-jsonld.xlsx (48 KB)

Extractor html-mf-hcard


Triples Extracted: 4,600,477,456
URLs with Triples: 159,748,255
Average Triples per URL: 28.8
Domains with Triples: 1,668,039
Average Triples per Domain: 2,758.02
Typed Entities: 1,614,688,960

Extractor html-rdfa


Triples Extracted: 2,216,933,416
URLs with Triples: 311,533,110
Average Triples per URL: 7.12
Domains with Triples: 938,830
Average Triples per Domain: 2,361.38
Typed Entities: 511,555,208
Detailed Statistics as Excel File: html-rdfa.xlsx (159 KB)

Extractor html-mf-xfn


Triples Extracted: 300,764,344
URLs with Triples: 24,242,546
Average Triples per URL: 12.41
Domains with Triples: 195,595
Average Triples per Domain: 1,537.69
Typed Entities: 48,011,285

Extractor html-mf-geo


Triples Extracted: 25,733,274
URLs with Triples: 6,151,013
Average Triples per URL: 4.18
Domains with Triples: 23,800
Average Triples per Domain: 1,081.23
Typed Entities: 14,644,289

Extractor html-mf-hcalendar


Triples Extracted: 177,931,362
URLs with Triples: 3,450,075
Average Triples per URL: 51.57
Domains with Triples: 22,313
Average Triples per Domain: 7,974.34
Typed Entities: 33,962,568

Extractor html-mf-hreview


Triples Extracted: 79,631,745
URLs with Triples: 4,551,011
Average Triples per URL: 17.5
Domains with Triples: 16,984
Average Triples per Domain: 4,688.63
Typed Entities: 13,680,480

Extractor html-mf-hlisting


Triples Extracted: 36,521,676
URLs with Triples: 374,180
Average Triples per URL: 97.6
Domains with Triples: 4,710
Average Triples per Domain: 7,754.07
Typed Entities: 9,578,853

Extractor html-mf-hrecipe


Triples Extracted: 24,347,685
URLs with Triples: 755,544
Average Triples per URL: 32.23
Domains with Triples: 2,923
Average Triples per Domain: 8,329.69
Typed Entities: 5,695,917

Extractor html-mf-hresume


Triples Extracted: 22,262
URLs with Triples: 2,961
Average Triples per URL: 7.52
Domains with Triples: 168
Average Triples per Domain: 132.51
Typed Entities: 7,106

Extractor html-mf-species


Triples Extracted: 1,319,837
URLs with Triples: 170,516
Average Triples per URL: 7.74
Domains with Triples: 95
Average Triples per Domain: 13,893.02
Typed Entities: 527,185