Web Data Commons - RDFa, Microdata, Embedded JSON-LD, and Microformats Data Sets - October 2023

This document provides statistics about the Web Data Commons RDFa, Microdata, Embedded JSON-LD, and Microformats data sets, which were extracted from the September/October 2023 release of the Common Crawl.

In summary, we found structured data within 1.7 billion HTML pages out of the 3.4 billion pages contained in the crawl (50.60%). These pages originate from 15 million different pay-level-domains out of the 34 million pay-level-domains covered by the crawl (42.89%). Altogether, the extracted data sets consist of 98 billion RDF quads.

Instructions on how to download the RDFa, Microdata, Embedded JSON-LD, and Microformats data sets are given on the page "How to get the data".

In addition, we have extracted schema.org class-specific data sets from the Microdata and JSON-LD corpora.

Please note that in the following, the term Domains refers to pay-level-domains. Subdomains are not counted as separate domains.
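To illustrate what counting by pay-level-domain means, here is a minimal Python sketch that reduces a hostname to its public suffix plus one label. The suffix set and the function name pay_level_domain are assumptions made for this example only. A real implementation would load the full Public Suffix List, and this page does not describe how the extraction framework actually performs the step.

```python
# Hypothetical, deliberately tiny suffix set for illustration only;
# a real implementation would load the full Public Suffix List.
PUBLIC_SUFFIXES = {"com", "org", "de", "uk", "ac.uk", "co.uk"}


def pay_level_domain(hostname: str) -> str:
    """Return the public suffix plus the one label in front of it."""
    host = hostname.lower().rstrip(".")
    labels = host.split(".")
    # Candidates are tried from longest to shortest, so "ac.uk" wins over "uk".
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in PUBLIC_SUFFIXES:
            if i == 0:
                return host  # the hostname is itself a bare suffix
            return ".".join(labels[i - 1:])
    return host  # unknown suffix: leave the hostname unchanged


for name in ["www.cs.ox.ac.uk", "myblog.blogspot.com", "en.wikipedia.org"]:
    print(name, "->", pay_level_domain(name))
```

Under this sketch, all subdomains of a site collapse into one entry, which is why subdomains do not appear separately in the domain counts below.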

Overall


Crawl Date October 2023
Total Data 98.38 Terabyte (compressed)
Parsed HTML URLs 3,353,090,410
URLs with Triples 1,696,953,312
Domains in Crawl 34,144,094
Domains with Triples 14,646,081
Typed Entities 20,890,660,670
Triples 97,689,391,384
Size of Extracted Data 1.8 Terabyte (compressed)
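The coverage percentages quoted in the summary can be recomputed from the counts in this table. The following Python sketch does so; the variable and function names are chosen for this example only.

```python
# Figures copied from the "Overall" table above.
parsed_urls = 3_353_090_410
urls_with_triples = 1_696_953_312
domains_in_crawl = 34_144_094
domains_with_triples = 14_646_081


def share(part: int, whole: int) -> float:
    """Return part / whole as a percentage."""
    return 100 * part / whole


url_share = share(urls_with_triples, parsed_urls)
domain_share = share(domains_with_triples, domains_in_crawl)

print(f"URLs with triples:    {url_share:.2f}%")
print(f"Domains with triples: {domain_share:.2f}%")
```

The domain ratio rounds to 42.89%, as quoted in the summary. The URL ratio is about 50.609%, which rounds to 50.61%; the 50.60% given in the summary therefore appears to be truncated rather than rounded.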

Results per Format


Format                   Domains           URLs  Typed Entities         Triples
html-embedded-jsonld   9,475,997  1,118,671,588  12,338,692,731  61,138,405,091
html-microdata         7,378,889    822,244,329   6,197,009,822  29,209,301,356
html-mf-hcard          3,791,191    294,014,781   2,020,101,223   6,170,695,047
html-rdfa                535,458     79,466,405     253,110,346     725,044,445
html-mf-xfn              349,366     25,798,293      40,691,517     293,463,163
html-mf-adr              118,566      8,311,710      16,791,318      54,550,623
html-mf-geo               26,138      1,562,561       3,264,614       8,557,213
html-mf-hcalendar         18,446      1,233,159       7,368,223      31,219,954
html-mf-hreview           15,531      1,052,689       3,234,612      22,460,252
html-mf-hlisting           6,897        148,345       8,775,335      29,519,246
html-mf-hrecipe            2,536        186,074       1,139,196       4,663,876
html-mf2-h-adr            15,133        239,368         287,194       1,036,614
html-mf-hresume               89          3,202          10,045          21,451
html-mf-species              263         64,145         184,494         453,053
overall               14,646,081  1,696,953,312  20,890,660,670  97,689,391,384
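As a quick consistency check on the table above, the per-format triple counts can be summed and compared with the overall row, and each format's share of all triples can be derived. This is a small Python sketch; the dictionary name and layout are choices made for this example.

```python
# Triple counts per extraction format, copied from the table above.
triples = {
    "html-embedded-jsonld": 61_138_405_091,
    "html-microdata": 29_209_301_356,
    "html-mf-hcard": 6_170_695_047,
    "html-rdfa": 725_044_445,
    "html-mf-xfn": 293_463_163,
    "html-mf-adr": 54_550_623,
    "html-mf-geo": 8_557_213,
    "html-mf-hcalendar": 31_219_954,
    "html-mf-hreview": 22_460_252,
    "html-mf-hlisting": 29_519_246,
    "html-mf-hrecipe": 4_663_876,
    "html-mf2-h-adr": 1_036_614,
    "html-mf-hresume": 21_451,
    "html-mf-species": 453_053,
}
overall = 97_689_391_384

# The per-format triple counts add up exactly to the overall figure.
assert sum(triples.values()) == overall

# Share of all extracted triples contributed by each format, in percent.
shares = {fmt: 100 * count / overall for fmt, count in triples.items()}

for fmt, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{fmt:<22}{pct:6.2f}%")
```

The Triples and Typed Entities columns sum exactly to the overall row. The Domains and URLs columns do not: their per-format sums exceed the overall values, presumably because a single page or domain can carry more than one format.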



Top Domains by Extracted Triples


  1. blogspot.com (912,719,572 triples)
  2. airbnb.com (106,740,816 triples)
  3. wikipedia.org (87,443,747 triples)
  4. yummly.com (73,832,565 triples)
  5. yahoo.com (63,441,160 triples)
  6. google.com (56,476,771 triples)
  7. kayak.com (54,114,781 triples)
  8. indiatimes.com (50,318,471 triples)
  9. hotels.com (50,234,633 triples)
  10. freepik.com (48,147,148 triples)
  11. apple.com (46,114,350 triples)
  12. pinkoi.com (45,503,174 triples)
  13. boohoo.com (42,607,579 triples)
  14. bezformata.com (41,618,549 triples)
  15. uni-trier.de (37,951,939 triples)
  16. euronews.com (35,594,580 triples)
  17. clinicsoftware.com (35,155,602 triples)
  18. trivago.com (34,325,610 triples)
  19. justia.com (33,202,469 triples)
  20. hoavouu.com (31,152,702 triples)

Top Domains by URLs with Triples


  1. blogspot.com (17,973,412 urls)
  2. wikipedia.org (4,083,649 urls)
  3. yahoo.com (1,138,675 urls)
  4. aif.ru (980,565 urls)
  5. altervista.org (903,358 urls)
  6. ox.ac.uk (815,837 urls)
  7. pinterest.com (786,471 urls)
  8. hatenablog.com (744,488 urls)
  9. wordpress.org (680,355 urls)
  10. iso.org (647,963 urls)
  11. stackexchange.com (594,783 urls)
  12. airbnb.com (562,830 urls)
  13. nih.gov (533,912 urls)
  14. google.com (494,492 urls)
  15. tistory.com (469,729 urls)
  16. threadless.com (454,264 urls)
  17. oregonstate.edu (433,632 urls)
  18. indiatimes.com (417,364 urls)
  19. europa.eu (415,573 urls)
  20. apple.com (408,547 urls)

Extractor html-embedded-jsonld


Triples Extracted 61,138,405,091
URLs with Triples 1,118,671,588
Average Triples per URL 54.65
Domains with Triples 9,475,997
Average Triples per Domain 6,451.92
Typed Entities 12,338,692,731
Detailed Statistics as Excel-File html-embedded-jsonld.xlsx
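The two averages reported in each extractor block are the triple count divided by the URL count and by the domain count, respectively. A short Python sketch using the html-embedded-jsonld figures (variable names are illustrative):

```python
# Figures copied from the html-embedded-jsonld block above.
triples = 61_138_405_091
urls_with_triples = 1_118_671_588
domains_with_triples = 9_475_997

avg_per_url = triples / urls_with_triples
avg_per_domain = triples / domains_with_triples

print(f"Average triples per URL:    {avg_per_url:.2f}")       # 54.65
print(f"Average triples per domain: {avg_per_domain:,.2f}")   # 6,451.92
```

The same two divisions reproduce the averages listed for the other extractors.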

Extractor html-microdata


Triples Extracted 29,209,301,356
URLs with Triples 822,244,329
Average Triples per URL 35.52
Domains with Triples 7,378,889
Average Triples per Domain 3,958.5
Typed Entities 6,197,009,822
Detailed Statistics as Excel-File html-microdata.xlsx

Extractor html-mf-hcard


Triples Extracted 6,170,695,047
URLs with Triples 294,014,781
Average Triples per URL 20.99
Domains with Triples 3,791,191
Average Triples per Domain 1,627.64
Typed Entities 2,020,101,223

Extractor html-rdfa


Triples Extracted 725,044,445
URLs with Triples 79,466,405
Average Triples per URL 9.12
Domains with Triples 535,458
Average Triples per Domain 1,354.06
Typed Entities 253,110,346
Detailed Statistics as Excel-File html-rdfa.xlsx

Extractor html-mf-xfn


Triples Extracted 293,463,163
URLs with Triples 25,798,293
Average Triples per URL 11.38
Domains with Triples 349,366
Average Triples per Domain 839.99
Typed Entities 40,691,517

Extractor html-mf-adr


Triples Extracted 54,550,623
URLs with Triples 8,311,710
Average Triples per URL 6.56
Domains with Triples 118,566
Average Triples per Domain 460.09
Typed Entities 16,791,318

Extractor html-mf-geo


Triples Extracted 8,557,213
URLs with Triples 1,562,561
Average Triples per URL 5.48
Domains with Triples 26,138
Average Triples per Domain 327.39
Typed Entities 3,264,614

Extractor html-mf-hcalendar


Triples Extracted 31,219,954
URLs with Triples 1,233,159
Average Triples per URL 25.32
Domains with Triples 18,446
Average Triples per Domain 1,692.51
Typed Entities 7,368,223

Extractor html-mf-hreview


Triples Extracted 22,460,252
URLs with Triples 1,052,689
Average Triples per URL 21.34
Domains with Triples 15,531
Average Triples per Domain 1,446.16
Typed Entities 3,234,612

Extractor html-mf-hlisting


Triples Extracted 29,519,246
URLs with Triples 148,345
Average Triples per URL 198.99
Domains with Triples 6,897
Average Triples per Domain 4,280.01
Typed Entities 8,775,335

Extractor html-mf-hrecipe


Triples Extracted 4,663,876
URLs with Triples 186,074
Average Triples per URL 25.06
Domains with Triples 2,536
Average Triples per Domain 1,839.07
Typed Entities 1,139,196

Extractor html-mf-hresume


Triples Extracted 21,451
URLs with Triples 3,202
Average Triples per URL 6.7
Domains with Triples 89
Average Triples per Domain 241.02
Typed Entities 10,045

Extractor html-mf-species


Triples Extracted 453,053
URLs with Triples 64,145
Average Triples per URL 7.06
Domains with Triples 263
Average Triples per Domain 1,722.63
Typed Entities 184,494