Web Data Commons - RDFa, Microdata, Embedded JSON-LD, and Microformats Data Sets - October 2022

This document provides statistics about the Web Data Commons RDFa, Microdata, Embedded JSON-LD and Microformats data sets, which have been extracted from the September/October 2022 release of the Common Crawl.

In summary, we found structured data within 1.5 billion HTML pages out of the 3.05 billion pages contained in the crawl (49.81%). These pages originate from 14.2 million different pay-level-domains out of the 33.8 million pay-level-domains covered by the crawl (42.09%). Altogether, the extracted data sets consist of 86 billion RDF quads.

Instructions on how to download the RDFa, Microdata, Embedded JSON-LD and Microformats data sets are given on the page "How to Get the Data".

In addition, we have extracted schema.org class-specific data sets from the Microdata and JSON-LD corpora.

Please note that in the following, the term Domains refers to pay-level-domains. Subdomains are not counted as separate domains.

Overall


Crawl Date October 2022
Total Data 82.71 Terabyte (compressed)
Parsed HTML URLs 3,048,746,652
URLs with Triples 1,518,609,988
Domains in Crawl 33,820,102
Domains with Triples 14,235,035
Typed Entities 19,089,102,457
Triples 86,462,816,435
Size of Extracted Data 1.6 Terabyte (compressed)
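
The coverage shares of pages and pay-level-domains follow directly from the counts above. As a minimal sketch in Python (the helper name coverage_percent is our own choice for illustration and not part of any Web Data Commons tooling), they can be recomputed as follows:

```python
def coverage_percent(part: int, total: int) -> float:
    """Return part/total as a percentage, rounded to two decimals."""
    return round(100 * part / total, 2)

# Counts copied from the Overall figures above.
parsed_urls = 3_048_746_652
urls_with_triples = 1_518_609_988
domains_in_crawl = 33_820_102
domains_with_triples = 14_235_035

url_coverage = coverage_percent(urls_with_triples, parsed_urls)
domain_coverage = coverage_percent(domains_with_triples, domains_in_crawl)

print(f"URLs with structured data:    {url_coverage}%")
print(f"Domains with structured data: {domain_coverage}%")
```

Rounding to two decimals is only a presentation choice; the underlying counts are exact.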

Results per Format


Format                   Domains           URLs  Typed Entities         Triples
html-embedded-jsonld   8,596,990    877,812,654   9,388,554,696  45,840,052,993
html-microdata         7,471,628    801,909,298   6,030,281,759  28,824,001,037
html-mf-hcard          3,880,989    318,625,913   3,300,807,398  10,566,510,286
html-rdfa                594,018     91,100,238     275,563,980     768,304,041
html-mf-xfn              349,876     21,005,003      46,359,958     286,042,505
html-mf-adr              139,998      8,915,594      18,437,160      59,925,902
html-mf-geo               54,550      2,661,528       4,927,722      13,595,239
html-mf-hcalendar         20,810      1,319,116       9,617,912      42,673,659
html-mf-hreview           17,303      1,279,142       3,334,576      23,239,453
html-mf-hlisting           7,481        159,266       9,234,428      31,201,851
html-mf-hrecipe            3,350        258,785       1,540,806       5,850,754
html-mf2-h-adr            13,431        215,769         267,819       1,004,652
html-mf-hresume               92          2,140           6,203          14,014
html-mf-species              586         67,477         168,040         400,049
overall               14,235,035  1,518,609,988  19,089,102,457  86,462,816,435
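
The per-format triple counts add up to the overall triple count, so each format's share of the extracted data can be derived from the table. The following is a rough sketch in Python; the variable names are ours, and the values are copied from the table above. The Domains and URLs columns are deliberately left out, since they do not sum to the overall row (presumably because one page or domain can carry several formats).

```python
# Triples per format, copied from the Results per Format table.
triples = {
    "html-embedded-jsonld": 45_840_052_993,
    "html-microdata": 28_824_001_037,
    "html-mf-hcard": 10_566_510_286,
    "html-rdfa": 768_304_041,
    "html-mf-xfn": 286_042_505,
    "html-mf-adr": 59_925_902,
    "html-mf-geo": 13_595_239,
    "html-mf-hcalendar": 42_673_659,
    "html-mf-hreview": 23_239_453,
    "html-mf-hlisting": 31_201_851,
    "html-mf-hrecipe": 5_850_754,
    "html-mf2-h-adr": 1_004_652,
    "html-mf-hresume": 14_014,
    "html-mf-species": 400_049,
}
overall_triples = 86_462_816_435

# Sanity check: the per-format rows account for every extracted triple.
total = sum(triples.values())
print("Sum matches overall row:", total == overall_triples)

# Share of all triples contributed by each format, largest first.
shares = {fmt: 100 * n / overall_triples for fmt, n in triples.items()}
for fmt, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{fmt:<22} {share:6.2f}%")
```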



Top Domains by Extracted Triples


  1. blogspot.com (879,564,145 triples)
  2. wordpress.com (458,770,038 triples)
  3. wikipedia.org (190,087,065 triples)
  4. yummly.com (87,112,540 triples)
  5. hotels.com (81,991,039 triples)
  6. boohoo.com (79,884,394 triples)
  7. kayak.com (77,623,248 triples)
  8. google.com (73,729,078 triples)
  9. yahoo.com (65,317,838 triples)
  10. southleedslife.com (63,758,451 triples)
  11. indiatimes.com (58,899,559 triples)
  12. freepik.com (56,124,447 triples)
  13. airbnb.com (51,964,983 triples)
  14. pinterest.com (47,251,484 triples)
  15. soundcloud.com (45,745,317 triples)
  16. apple.com (42,410,414 triples)
  17. hostadvice.com (42,309,867 triples)
  18. elpais.com (42,136,136 triples)
  19. vsemayki.ru (38,167,517 triples)
  20. smugmug.com (38,031,434 triples)

Top Domains by URLs with Triples


  1. blogspot.com (16,922,662 urls)
  2. wordpress.com (10,597,620 urls)
  3. wikipedia.org (4,889,490 urls)
  4. hatenablog.com (2,587,662 urls)
  5. aif.ru (1,130,698 urls)
  6. yahoo.com (1,040,836 urls)
  7. pinterest.com (1,001,806 urls)
  8. airbnb.com (992,869 urls)
  9. typepad.com (835,957 urls)
  10. altervista.org (835,698 urls)
  11. europa.eu (817,761 urls)
  12. tistory.com (712,833 urls)
  13. nih.gov (701,915 urls)
  14. usatoday.com (648,668 urls)
  15. google.com (618,655 urls)
  16. cnn.com (599,683 urls)
  17. mit.edu (552,262 urls)
  18. over-blog.com (549,491 urls)
  19. apple.com (534,948 urls)
  20. threadless.com (507,944 urls)

Extractor html-embedded-jsonld


Triples Extracted 45,840,052,993
URLs with Triples 877,812,654
Average Triples per URL 52.22
Domains with Triples 8,596,990
Average Triples per Domain 5,332.1
Typed Entities 9,388,554,696
Detailed Statistics as Excel-File html-embedded-jsonld.xlsx
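
The two averages reported for each extractor are plain ratios of the triple count to the URL and domain counts. A minimal sketch in Python, using the html-embedded-jsonld figures above (the function name average_triples is hypothetical and only used for illustration):

```python
def average_triples(triples: int, count: int) -> float:
    """Average number of triples per URL or per domain, two decimals."""
    return round(triples / count, 2)

# html-embedded-jsonld counts copied from the figures above.
triples_extracted = 45_840_052_993
urls_with_triples = 877_812_654
domains_with_triples = 8_596_990

avg_per_url = average_triples(triples_extracted, urls_with_triples)
avg_per_domain = average_triples(triples_extracted, domains_with_triples)

print("Average Triples per URL:   ", avg_per_url)
print("Average Triples per Domain:", avg_per_domain)
```

The same calculation applies to the other extractors by substituting their counts.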

Extractor html-microdata


Triples Extracted 28,824,001,037
URLs with Triples 801,909,298
Average Triples per URL 35.94
Domains with Triples 7,471,628
Average Triples per Domain 3,857.79
Typed Entities 6,030,281,759
Detailed Statistics as Excel-File html-microdata.xlsx

Extractor html-mf-hcard


Triples Extracted 10,566,510,286
URLs with Triples 318,625,913
Average Triples per URL 33.16
Domains with Triples 3,880,989
Average Triples per Domain 2,722.63
Typed Entities 3,300,807,398

Extractor html-rdfa


Triples Extracted 768,304,041
URLs with Triples 91,100,238
Average Triples per URL 8.43
Domains with Triples 594,018
Average Triples per Domain 1,293.4
Typed Entities 275,563,980
Detailed Statistics as Excel-File html-rdfa.xlsx

Extractor html-mf-xfn


Triples Extracted 286,042,505
URLs with Triples 21,005,003
Average Triples per URL 13.62
Domains with Triples 349,876
Average Triples per Domain 817.55
Typed Entities 46,359,958

Extractor html-mf-adr


Triples Extracted 59,925,902
URLs with Triples 8,915,594
Average Triples per URL 6.72
Domains with Triples 139,998
Average Triples per Domain 428.05
Typed Entities 18,437,160

Extractor html-mf-geo


Triples Extracted 13,595,239
URLs with Triples 2,661,528
Average Triples per URL 5.11
Domains with Triples 54,550
Average Triples per Domain 249.23
Typed Entities 4,927,722

Extractor html-mf-hcalendar


Triples Extracted 42,673,659
URLs with Triples 1,319,116
Average Triples per URL 32.35
Domains with Triples 20,810
Average Triples per Domain 2,050.63
Typed Entities 9,617,912

Extractor html-mf-hreview


Triples Extracted 23,239,453
URLs with Triples 1,279,142
Average Triples per URL 18.17
Domains with Triples 17,303
Average Triples per Domain 1,343.09
Typed Entities 3,334,576

Extractor html-mf-hlisting


Triples Extracted 31,201,851
URLs with Triples 159,266
Average Triples per URL 195.91
Domains with Triples 7,481
Average Triples per Domain 4,170.81
Typed Entities 9,234,428

Extractor html-mf-hrecipe


Triples Extracted 5,850,754
URLs with Triples 258,785
Average Triples per URL 22.61
Domains with Triples 3,350
Average Triples per Domain 1,746.49
Typed Entities 1,540,806

Extractor html-mf-hresume


Triples Extracted 14,014
URLs with Triples 2,140
Average Triples per URL 6.55
Domains with Triples 92
Average Triples per Domain 152.33
Typed Entities 6,203

Extractor html-mf-species


Triples Extracted 400,049
URLs with Triples 67,477
Average Triples per URL 5.93
Domains with Triples 586
Average Triples per Domain 682.68
Typed Entities 168,040