Web Data Commons - RDFa, Microdata, Embedded JSON-LD, and Microformats Data Sets - November 2018

This document provides statistics about the Web Data Commons RDFa, Microdata, Embedded JSON-LD and Microformats data sets, which were extracted from the November 2018 release of the Common Crawl.

In summary, we found structured data within 0.9 billion HTML pages out of the 2.5 billion pages contained in the crawl (37.1%). These pages originate from 9.6 million different pay-level-domains out of the 32.8 million pay-level-domains covered by the crawl (29.3%). Altogether, the extracted data sets consist of 31.5 billion RDF quads.
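The percentages in this summary follow directly from the counts reported in the Overall section of this document. A minimal Python sketch that reproduces them (the variable and function names are illustrative only):

```python
# Counts copied from the Overall section of this document.
parsed_urls = 2_544_381_895
urls_with_triples = 944_883_841
domains_in_crawl = 32_884_530
domains_with_triples = 9_650_571


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


url_share = share(urls_with_triples, parsed_urls)
domain_share = share(domains_with_triples, domains_in_crawl)
print(f"{url_share}% of pages, {domain_share}% of pay-level domains")
```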

Instructions on how to download the RDFa, Microdata, Embedded JSON-LD and Microformats data sets are given on the "How to Get the Data" page.

In addition, we have extracted schema.org class-specific data sets from the Microdata and JSON-LD corpora.

Please note that in the following, the term Domains refers to pay-level-domains (PLDs). Subdomains are not counted as separate domains.

Overall


Crawl Date November 2018
Total Data 54 Terabyte (compressed)
Parsed HTML URLs 2,544,381,895
URLs with Triples 944,883,841
Domains in Crawl 32,884,530
Domains with Triples 9,650,571
Typed Entities 7,096,578,647
Triples 31,563,782,097
Size of Extracted Data 739GB (compressed)

Results per Format


Format Domains URLs Typed Entities Triples
html-microdata 5,183,199 556,088,266 3,927,363,100 20,318,437,064
html-embedded-jsonld 3,835,046 194,648,550 925,744,293 4,159,835,616
html-mf-hcard 3,399,902 248,130,845 1,793,926,360 5,374,115,412
html-rdfa 1,382,497 154,260,198 338,787,245 1,207,733,576
html-mf-xfn 390,343 19,292,875 44,797,779 263,545,886
html-mf-adr 209,086 12,627,048 28,993,043 90,430,790
html-mf-geo 58,280 3,149,847 6,076,110 15,783,942
html-mf-hcalendar 44,238 1,940,859 12,727,005 56,583,642
html-mf-hreview 32,408 2,699,433 6,643,239 38,699,354
html-mf-hlisting 9,397 252,161 8,352,956 27,250,871
html-mf-hrecipe 5,123 417,518 2,343,367 9,149,734
html-mf2-h-adr 6,146 208,181 606,161 1,717,382
html-mf-hresume 205 31,305 72,274 141,356
html-mf-species 245 50,249 145,715 357,472
OVERALL 9,650,571 933,285,197 7,096,578,647 31,563,782,097
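One way to read the OVERALL row: the per-format triple counts add up exactly to the overall triple count, whereas the per-format domain counts sum to more than the overall domain count. This is presumably because a single domain can use several formats and is then counted once per format. The following Python sketch checks this with the figures from the table; the dictionary layout is an illustrative choice, not part of the data release.

```python
# (domains, triples) per format, copied from the table above.
per_format = {
    "html-microdata": (5_183_199, 20_318_437_064),
    "html-embedded-jsonld": (3_835_046, 4_159_835_616),
    "html-mf-hcard": (3_399_902, 5_374_115_412),
    "html-rdfa": (1_382_497, 1_207_733_576),
    "html-mf-xfn": (390_343, 263_545_886),
    "html-mf-adr": (209_086, 90_430_790),
    "html-mf-geo": (58_280, 15_783_942),
    "html-mf-hcalendar": (44_238, 56_583_642),
    "html-mf-hreview": (32_408, 38_699_354),
    "html-mf-hlisting": (9_397, 27_250_871),
    "html-mf-hrecipe": (5_123, 9_149_734),
    "html-mf2-h-adr": (6_146, 1_717_382),
    "html-mf-hresume": (205, 141_356),
    "html-mf-species": (245, 357_472),
}

# Values from the OVERALL row.
overall_domains = 9_650_571
overall_triples = 31_563_782_097

sum_domains = sum(domains for domains, _ in per_format.values())
sum_triples = sum(triples for _, triples in per_format.values())

# Triples are additive across formats; domains are not.
triples_match = sum_triples == overall_triples
domain_overlap = sum_domains - overall_domains
print(triples_match, domain_overlap)
```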

Short Discussion

As a first observation, we see that the Common Crawl corpus used for this extraction is shallower but wider than the November 2017 corpus, as it covers fewer webpages (URLs) from more websites (PLDs). Concerning the annotation formats, Microdata and JSON-LD are the most prominent ones, followed by Microformats h-card and RDFa. We observe that even though Microdata is the most dominant markup format, JSON-LD is preferred for annotating data of certain topical domains, such as Organization, Person and location information. In addition, we analyze the adoption of new terms introduced into the schema.org vocabulary during the last two years and see that they are well adopted despite their recent release. We present a comparison of the two prominent formats, Microdata and JSON-LD, in terms of the number of PLDs for some selected classes and the coverage of new terms in the Formats Comparison File (md-jsonld-comparison.xlsx (14kB)).



Top Domains by Extracted Triples


  1. blogspot.com (441,766,940 triples)
  2. wordpress.com (319,760,823 triples)
  3. google.com (270,098,238 triples)
  4. bezformata.com (109,596,352 triples)
  5. skyrock.com (97,031,258 triples)
  6. ezlocal.com (85,547,857 triples)
  7. canalblog.com (84,956,593 triples)
  8. tunein.com (79,598,015 triples)
  9. shopstyle.com (76,303,978 triples)
  10. elpais.com (63,985,467 triples)
  11. maxpreps.com (63,020,341 triples)
  12. parenting.com (54,381,871 triples)
  13. makaan.com (50,473,517 triples)
  14. apple.com (46,795,182 triples)
  15. more.com (46,113,855 triples)
  16. homehardware.ca (44,098,027 triples)
  17. gigmasters.com (39,829,875 triples)
  18. partycity.com (38,909,389 triples)
  19. foroactivo.com (38,102,896 triples)
  20. kp.ru (36,996,200 triples)

Top Domains by URLs with Triples


  1. blogspot.com (37,543,780 urls)
  2. wordpress.com (8,410,578 urls)
  3. pixnet.net (3,289,415 urls)
  4. skyrock.com (2,067,968 urls)
  5. hatenablog.com (2,011,952 urls)
  6. google.com (1,558,826 urls)
  7. canalblog.com (1,538,302 urls)
  8. footeo.com (1,468,636 urls)
  9. thefreedictionary.com (1,324,036 urls)
  10. foroactivo.com (1,232,016 urls)
  11. wikipedia.org (1,108,570 urls)
  12. forumotion.com (1,003,114 urls)
  13. yoo7.com (976,762 urls)
  14. freelancer.com (954,741 urls)
  15. blog.cz (906,768 urls)
  16. forumactif.com (868,933 urls)
  17. hotels.com (844,546 urls)
  18. exblog.jp (755,481 urls)
  19. livejournal.com (753,243 urls)
  20. apple.com (715,628 urls)

Extractor html-microdata


Triples Extracted 20,318,437,064
URLs with Triples 556,088,266
Average Triples per URL 36.54
Domains with Triples 5,183,199
Average Triples per Domain 3,920.06
Typed Entities 3,927,363,100
Detailed Statistics as Excel-File html-microdata.xlsx (335kb)
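The two averages reported for this extractor can be reproduced from its raw counts. A short Python sketch, assuming the averages are simple ratios rounded to two decimal places (an assumption that matches the reported values):

```python
# Counts copied from the html-microdata statistics above.
triples = 20_318_437_064
urls_with_triples = 556_088_266
domains_with_triples = 5_183_199

# Average number of triples per URL and per pay-level domain.
avg_per_url = round(triples / urls_with_triples, 2)
avg_per_domain = round(triples / domains_with_triples, 2)
print(avg_per_url, avg_per_domain)
```

The same calculation applies to the other extractor sections by substituting their counts.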

Extractor html-embedded-jsonld


Triples Extracted 4,159,835,616
URLs with Triples 194,648,550
Average Triples per URL 21.37
Domains with Triples 3,835,046
Average Triples per Domain 1,084.69
Typed Entities 925,744,293
Detailed Statistics as Excel-File html-embedded-jsonld.xlsx (103kb)

Extractor html-mf-hcard


Triples Extracted 5,374,115,412
URLs with Triples 248,130,845
Average Triples per URL 21.66
Domains with Triples 3,399,902
Average Triples per Domain 1,580.67
Typed Entities 1,793,926,360

Extractor html-rdfa


Triples Extracted 1,207,733,576
URLs with Triples 154,260,198
Average Triples per URL 7.83
Domains with Triples 1,382,497
Average Triples per Domain 873.59
Typed Entities 338,787,245
Detailed Statistics as Excel-File html-rdfa.xlsx (594kb)

Extractor html-mf-xfn


Triples Extracted 263,545,886
URLs with Triples 19,292,875
Average Triples per URL 13.66
Domains with Triples 390,343
Average Triples per Domain 675.16
Typed Entities 44,797,779

Extractor html-mf-adr


Triples Extracted 90,430,790
URLs with Triples 12,627,048
Average Triples per URL 7.16
Domains with Triples 209,086
Average Triples per Domain 432.5
Typed Entities 28,993,043

Extractor html-mf-geo


Triples Extracted 15,783,942
URLs with Triples 3,149,847
Average Triples per URL 5.01
Domains with Triples 58,280
Average Triples per Domain 270.83
Typed Entities 6,076,110

Extractor html-mf-hcalendar


Triples Extracted 56,583,642
URLs with Triples 1,940,859
Average Triples per URL 29.15
Domains with Triples 44,238
Average Triples per Domain 1,279.07
Typed Entities 12,727,005

Extractor html-mf-hreview


Triples Extracted 38,699,354
URLs with Triples 2,699,433
Average Triples per URL 14.34
Domains with Triples 32,408
Average Triples per Domain 1,194.13
Typed Entities 6,643,239

Extractor html-mf-hlisting


Triples Extracted 27,250,871
URLs with Triples 252,161
Average Triples per URL 108.07
Domains with Triples 9,397
Average Triples per Domain 2,899.95
Typed Entities 8,352,956

Extractor html-mf-hrecipe


Triples Extracted 9,149,734
URLs with Triples 417,518
Average Triples per URL 21.91
Domains with Triples 5,123
Average Triples per Domain 1,786.01
Typed Entities 2,343,367

Extractor html-mf-hresume


Triples Extracted 141,356
URLs with Triples 31,305
Average Triples per URL 4.52
Domains with Triples 205
Average Triples per Domain 689.54
Typed Entities 72,274

Extractor html-mf-species


Triples Extracted 357,472
URLs with Triples 50,249
Average Triples per URL 7.11
Domains with Triples 245
Average Triples per Domain 1,459.07
Typed Entities 145,715