Web Data Commons - RDFa, Microdata, Embedded JSON-LD, and Microformats Data Sets - December 2020

This document provides statistics about the Web Data Commons RDFa, Microdata, Embedded JSON-LD, and Microformats data sets, which have been extracted from the September 2020 release of the Common Crawl.

In summary, we found structured data within 1.7 billion HTML pages out of the 3.4 billion pages contained in the crawl (49.9%). These pages originate from 15.3 million different pay-level-domains out of the 34.6 million pay-level-domains covered by the crawl (44.3%). Altogether, the extracted data sets consist of 86.4 billion RDF quads.
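The coverage shares quoted in the summary can be recomputed from the raw counts in the Overall table. The following is a minimal sketch; the constant and function names are illustrative and not part of the Web Data Commons tooling.

```python
# Counts copied from the "Overall" table of this document.
PARSED_URLS = 3_410_268_379
URLS_WITH_TRIPLES = 1_701_573_394
DOMAINS_IN_CRAWL = 34_596_585
DOMAINS_WITH_TRIPLES = 15_316_527


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Share of crawled pages and of pay-level-domains that carry structured data.
url_share = share(URLS_WITH_TRIPLES, PARSED_URLS)
domain_share = share(DOMAINS_WITH_TRIPLES, DOMAINS_IN_CRAWL)

print(f"URLs with structured data: {url_share}%")
print(f"Domains with structured data: {domain_share}%")
```

Keeping the counts as integers and rounding only at the end avoids accumulating rounding error from the abbreviated "billion" and "million" figures.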

Instructions on how to download the RDFa, Microdata, Embedded JSON-LD, and Microformats data sets are given on the page How to Get the Data.

In addition, we have extracted schema.org class-specific data sets from the Microdata and JSON-LD corpora.

Please note that in the following, the term Domains refers to pay-level-domains. Subdomains are not counted as separate domains.
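To illustrate what collapsing subdomains into a pay-level-domain means, here is a rough sketch. It assumes the usual definition (a public suffix plus one label to its left) and uses a tiny hard-coded suffix set as a hypothetical stand-in for the full Public Suffix List. It is not the method used to produce the statistics in this document.

```python
# Hypothetical, deliberately incomplete suffix set for illustration only.
# A real implementation would consult the full Public Suffix List.
KNOWN_SUFFIXES = {"com", "org", "net", "gov", "co.uk"}


def pay_level_domain(host: str) -> str:
    """Return the public suffix plus one label to its left."""
    labels = host.lower().rstrip(".").split(".")
    # Candidates run from longest to shortest, so "co.uk" is matched before "uk".
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in KNOWN_SUFFIXES:
            if i == 0:
                return suffix  # the host is itself a bare suffix
            return ".".join(labels[i - 1:])
    # Unknown suffix: fall back to the last two labels.
    return ".".join(labels[-2:])


# All subdomains of one site map to the same pay-level-domain.
hosts = ["en.wikipedia.org", "de.wikipedia.org", "shop.example.co.uk"]
distinct_domains = {pay_level_domain(h) for h in hosts}
print(sorted(distinct_domains))
```

Under this sketch, `en.wikipedia.org` and `de.wikipedia.org` both count once, as `wikipedia.org`.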

Overall


Crawl Date September 2020
Total Data 81.8 Terabyte (compressed)
Parsed HTML URLs 3,410,268,379
URLs with Triples 1,701,573,394
Domains in Crawl 34,596,585
Domains with Triples 15,316,527
Typed Entities 21,636,494,250
Triples 86,381,005,124
Size of Extracted Data 1.9 Terabyte (compressed)

Results per Format


Format                Domains     URLs           Typed Entities  Triples
html-microdata        7,809,978   892,571,812    7,374,224,188   35,612,247,646
html-embedded-jsonld  7,659,585   767,626,882    6,684,258,935   32,078,019,309
html-mf-hcard         4,309,621   371,420,305    6,462,214,971   12,102,799,203
html-rdfa             3,354,300   408,439,900    981,589,696     5,893,793,116
html-mf-xfn           412,477     30,140,863     63,863,815      419,542,263
html-mf-adr           181,242     12,750,690     26,453,736      87,046,820
html-mf-geo           71,749      3,567,479      6,495,007       17,417,545
html-mf-hcalendar     38,130      2,142,316      15,891,355      70,428,605
html-mf-hreview       30,675      2,501,767      6,916,213       49,204,380
html-mf-hlisting      10,987      298,608        11,788,682      38,710,735
html-mf-hrecipe       5,044       473,051        2,393,939       9,830,473
html-mf2-h-adr        11,595      305,724        379,293         1,541,329
html-mf-hresume       126         6,604          12,210          30,825
html-mf-species       256         54,723         12,210          392,875
OVERALL               15,316,527  1,701,573,394  21,636,494,250  86,381,005,124
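The "Average Triples per URL" and "Average Triples per Domain" figures reported for each extractor follow directly from the rows of this table. The sketch below recomputes them for three formats. The dictionary layout and function name are illustrative assumptions.

```python
# (domains, urls, triples) copied from three rows of the "Results per Format" table.
ROWS = {
    "html-microdata": (7_809_978, 892_571_812, 35_612_247_646),
    "html-embedded-jsonld": (7_659_585, 767_626_882, 32_078_019_309),
    "html-mf-hcard": (4_309_621, 371_420_305, 12_102_799_203),
}


def averages(domains: int, urls: int, triples: int) -> dict:
    """Average number of triples per URL and per pay-level-domain."""
    return {
        "triples_per_url": round(triples / urls, 2),
        "triples_per_domain": round(triples / domains, 2),
    }


summary = {name: averages(*row) for name, row in ROWS.items()}
for name, stats in summary.items():
    print(name, stats)
```

Note that the Domains and URLs columns cannot simply be summed to obtain the OVERALL row, since a single page or domain may use several formats at once.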



Top Domains by Extracted Triples


  1. blogspot.com (798,950,384 triples)
  2. wordpress.com (165,188,273 triples)
  3. google.com (106,901,948 triples)
  4. livejournal.com (91,580,297 triples)
  5. wixsite.com (84,547,604 triples)
  6. nasa.gov (74,548,167 triples)
  7. wikipedia.org (71,140,967 triples)
  8. pixnet.net (61,709,979 triples)
  9. leonardoboff.org (61,682,596 triples)
  10. momstart.com (58,408,559 triples)
  11. royalcanin.com (57,194,757 triples)
  12. webcindario.com (57,004,997 triples)
  13. vidal.fr (55,247,556 triples)
  14. yahoo.com (48,241,745 triples)
  15. aljazeera.net (46,945,518 triples)
  16. smittenkitchen.com (46,820,512 triples)
  17. alibaba.com (43,725,806 triples)
  18. hotels.com (43,257,893 triples)
  19. elpais.com (43,213,932 triples)
  20. made-in-china.com (42,868,323 triples)

Top Domains by URLs with Triples


  1. wordpress.com (19,908,794 urls)
  2. blogspot.com (19,396,448 urls)
  3. fandom.com (3,223,269 urls)
  4. wikipedia.org (2,883,670 urls)
  5. livejournal.com (2,472,508 urls)
  6. pixnet.net (2,417,572 urls)
  7. fc2.com (1,659,169 urls)
  8. drom.ru (1,274,567 urls)
  9. blog.jp (1,270,870 urls)
  10. hatenablog.com (1,159,907 urls)
  11. stackexchange.com (1,066,485 urls)
  12. herokuapp.com (921,070 urls)
  13. tistory.com (872,667 urls)
  14. whatsapp.com (845,006 urls)
  15. mos.ru (839,469 urls)
  16. cocolog-nifty.com (830,249 urls)
  17. altervista.org (785,463 urls)
  18. yahoo.com (770,848 urls)
  19. weebly.com (747,256 urls)
  20. ning.com (741,574 urls)

Extractor html-microdata


Triples Extracted 35,612,247,646
URLs with Triples 892,571,812
Average Triples per URL 39.9
Domains with Triples 7,809,978
Average Triples per Domain 4,559.84
Typed Entities 7,374,224,188
Detailed Statistics as Excel-File html-microdata.xlsx

Extractor html-embedded-jsonld


Triples Extracted 32,078,019,309
URLs with Triples 767,626,882
Average Triples per URL 41.79
Domains with Triples 7,659,585
Average Triples per Domain 4,187.96
Typed Entities 6,684,258,935
Detailed Statistics as Excel-File html-embedded-jsonld.xlsx

Extractor html-mf-hcard


Triples Extracted 12,102,799,203
URLs with Triples 371,420,305
Average Triples per URL 32.59
Domains with Triples 4,309,621
Average Triples per Domain 2,808.32
Typed Entities 6,462,214,971

Extractor html-rdfa


Triples Extracted 5,893,793,116
URLs with Triples 408,439,900
Average Triples per URL 14.43
Domains with Triples 3,354,300
Average Triples per Domain 1,757.09
Typed Entities 981,589,696
Detailed Statistics as Excel-File html-rdfa.xlsx

Extractor html-mf-xfn


Triples Extracted 419,542,263
URLs with Triples 30,140,863
Average Triples per URL 13.92
Domains with Triples 412,477
Average Triples per Domain 1,017.13
Typed Entities 63,863,815

Extractor html-mf-adr


Triples Extracted 87,046,820
URLs with Triples 12,750,690
Average Triples per URL 6.83
Domains with Triples 181,242
Average Triples per Domain 480.28
Typed Entities 26,453,736

Extractor html-mf-geo


Triples Extracted 17,417,545
URLs with Triples 3,567,479
Average Triples per URL 4.88
Domains with Triples 71,749
Average Triples per Domain 242.76
Typed Entities 6,495,007

Extractor html-mf-hcalendar


Triples Extracted 70,428,605
URLs with Triples 2,142,316
Average Triples per URL 32.87
Domains with Triples 38,130
Average Triples per Domain 1,847.07
Typed Entities 15,891,355

Extractor html-mf-hreview


Triples Extracted 49,204,380
URLs with Triples 2,501,767
Average Triples per URL 19.67
Domains with Triples 30,675
Average Triples per Domain 1,604.05
Typed Entities 6,916,213

Extractor html-mf-hlisting


Triples Extracted 38,710,735
URLs with Triples 298,608
Average Triples per URL 129.64
Domains with Triples 10,987
Average Triples per Domain 3,523.32
Typed Entities 11,788,682

Extractor html-mf-hrecipe


Triples Extracted 9,830,473
URLs with Triples 473,051
Average Triples per URL 20.78
Domains with Triples 5,044
Average Triples per Domain 1,948.94
Typed Entities 2,393,939

Extractor html-mf-hresume


Triples Extracted 30,825
URLs with Triples 6,604
Average Triples per URL 4.67
Domains with Triples 126
Average Triples per Domain 244.64
Typed Entities 12,210

Extractor html-mf-species


Triples Extracted 392,875
URLs with Triples 54,723
Average Triples per URL 7.18
Domains with Triples 256
Average Triples per Domain 1,534.67
Typed Entities 159,030