Class-Specific Subsets of the Schema.org Data contained in the October 2024 Corpus

This page provides access to and statistics about class-specific subsets of the Schema.org data contained in the October 2024 version of the Web Data Commons Microdata and JSON-LD corpus. The datasets are part of the Web Data Commons Schema.org Data Set Series.

Introduction

As many users are only interested in specific types of Schema.org data (like product data, event data, job postings, or data describing local businesses), we have created class-specific subsets from the complete, merged Microdata and JSON-LD corpora for a selection of Schema.org classes. Each subset contains all instances of a specific class in either format, as well as all other data found on the webpages containing these instances. For example, a page containing data about a product might also contain reviews and offers for this product; a page containing data about an event might also contain data about the location of the event and the persons involved in the event. The data is represented in N-Quads format, meaning that the fourth element of each quad contains the URL of the webpage from which the data was extracted. To facilitate downloading and accessing the class-specific data, we provide the Schema.org subsets in chunks. Each chunk contains the quads of specific pay-level domains (PLDs), i.e. all quads of one PLD, e.g. yummly.com, are stored in the same chunk file. Additionally, we provide lookup files containing the mappings between PLDs and their corresponding chunks, as well as CSV files with PLD-specific statistics.
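As a minimal sketch of how the fourth element of each quad can be used, the Python snippet below parses N-Quads lines with a simplified regular expression and groups the resulting triples by the webpage they were extracted from. The function names, the regular expression, and the example.com sample quads are illustrative assumptions and not part of the Web Data Commons tooling; real chunk files may contain terms that this simplified pattern does not handle, in which case a full N-Quads parser would be the safer choice.

```python
import re
from collections import defaultdict

# Simplified pattern for one N-Quads statement:
# subject, predicate, object, graph label (page URL), terminating dot.
# Assumption: the graph label is an IRI in angle brackets; terms are not validated.
QUAD_RE = re.compile(r'^(\S+)\s+(\S+)\s+(.+)\s+<([^<>\s]+)>\s*\.\s*$')


def parse_quad(line):
    """Return (subject, predicate, object, page_url), or None if the line does not match."""
    match = QUAD_RE.match(line.strip())
    if match is None:
        return None
    return match.group(1), match.group(2), match.group(3), match.group(4)


def group_by_page(lines):
    """Group (subject, predicate, object) triples by the page URL in the fourth element."""
    pages = defaultdict(list)
    for line in lines:
        quad = parse_quad(line)
        if quad is None:
            continue  # skip blank or malformed lines
        subject, predicate, obj, page_url = quad
        pages[page_url].append((subject, predicate, obj))
    return dict(pages)


# Hypothetical sample quads, made up for illustration only.
sample_lines = [
    '_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> '
    '<http://schema.org/Product> <https://example.com/p/1> .',
    '_:b0 <http://schema.org/name> "Red shoe"@en <https://example.com/p/1> .',
]

for url, triples in group_by_page(sample_lines).items():
    print(url, len(triples))
```

Grouping by the page URL keeps a class instance together with the other data found on the same page, such as the offers or reviews that accompany a product.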

Please note that:

If you want to refer to the datasets in your scientific publications, please cite the following poster: The Web Data Commons Schema.org Data Set Series by Alexander Brinkmann, Anna Primpeli and Christian Bizer in Companion Proceedings of the ACM Web Conference 2023 (WWW ’23 Companion), Austin, Texas, USA, April 2023.

Class-Specific Subsets of the Schema.org Data

Columns: Schema.org Subset; General Stats; Related Classes; Size (# Files); Download (Sample); PLD to File Look-up; PLD-Specific Stats
AdministrativeArea Quads: 96,084,887
URLs: 521,570
Hosts: 4,933
http://schema.org/ListItem (1,499,745)
http://schema.org/ImageObject (1,454,619)
http://schema.org/AdministrativeArea (1,301,541)
http://schema.org/Person (976,864)
http://schema.org/PostalAddress (966,277)
1.13 GB
(8)
AdministrativeArea (sample) lookup_file
pld_stats_file
Airport Quads: 53,683,384
URLs: 173,690
Hosts: 1,003
http://schema.org/Airport (3,562,446)
http://schema.org/GeoCoordinates (2,546,608)
http://schema.org/Flight (1,331,723)
http://schema.org/Airline (1,258,369)
http://schema.org/Offer (1,139,953)
455.06 MB
(5)
Airport (sample) lookup_file
pld_stats_file
Answer Quads: 1,617,339,307
URLs: 14,297,841
Hosts: 414,210
http://schema.org/Answer (60,187,175)
http://schema.org/Question (51,844,637)
http://schema.org/ListItem (32,575,196)
https://schema.org/Answer (22,033,793)
http://schema.org/ImageObject (20,709,326)
23.75 GB
(118)
Answer (sample) lookup_file
pld_stats_file
Book Quads: 249,577,347
URLs: 4,207,738
Hosts: 18,993
http://schema.org/Book (10,288,815)
http://schema.org/Country (6,776,472)
http://schema.org/Person (5,754,742)
http://schema.org/Offer (3,590,095)
http://schema.org/ListItem (3,350,101)
3.66 GB
(19)
Book (sample) lookup_file
pld_stats_file
City Quads: 235,102,338
URLs: 1,155,997
Hosts: 16,149
http://schema.org/City (5,772,774)
http://schema.org/ImageObject (4,144,520)
http://schema.org/Person (4,069,973)
http://schema.org/PostalAddress (3,790,669)
http://schema.org/OpeningHoursSpecification (2,991,182)
2.33 GB
(17)
City (sample) lookup_file
pld_stats_file
ClaimReview Quads: 3,919,703
URLs: 49,707
Hosts: 343
http://schema.org/Organization (123,300)
http://schema.org/ImageObject (95,783)
http://schema.org/ListItem (93,535)
http://schema.org/Person (66,621)
http://schema.org/ClaimReview (59,709)
56.3 MB
(1)
ClaimReview (sample) lookup_file
pld_stats_file
CollegeOrUniversity Quads: 112,774,933
URLs: 1,001,573
Hosts: 5,121
http://schema.org/ImageObject (4,911,779)
http://schema.org/CollegeOrUniversity (3,891,936)
http://schema.org/Person (3,167,873)
http://schema.org/PostalAddress (2,714,267)
http://schema.org/GeoCoordinates (1,995,601)
1.06 GB
(6)
CollegeOrUniversity (sample) lookup_file
pld_stats_file
Continent Quads: 759,731
URLs: 6,752
Hosts: 66
http://schema.org/City (57,883)
http://schema.org/AdministrativeArea (42,597)
http://schema.org/Country (10,423)
http://schema.org/Continent (7,337)
http://schema.org/GeoCoordinates (5,692)
7.28 MB
(1)
Continent (sample) lookup_file
pld_stats_file
Country Quads: 950,449,123
URLs: 7,110,729
Hosts: 35,296
http://schema.org/Country (31,979,875)
http://schema.org/ListItem (23,422,062)
http://schema.org/Organization (15,663,467)
http://schema.org/PostalAddress (11,083,809)
http://schema.org/Offer (11,006,875)
10.21 GB
(61)
Country (sample) lookup_file
pld_stats_file
CreativeWork Quads: 2,063,764,239
URLs: 45,267,975
Hosts: 1,325,582
https://schema.org/CreativeWork (80,055,282)
https://schema.org/SiteNavigationElement (55,881,471)
https://schema.org/Person (40,207,650)
https://schema.org/WPHeader (32,242,351)
https://schema.org/WPFooter (30,334,529)
68.24 GB
(162)
CreativeWork (sample) lookup_file
pld_stats_file
Dataset Quads: 58,626,384
URLs: 694,106
Hosts: 2,024
http://schema.org/DataDownload (2,584,051)
http://schema.org/Dataset (1,559,636)
http://schema.org/Organization (1,056,184)
http://schema.org/PropertyValue (744,074)
http://schema.org/Person (737,414)
655.18 MB
(5)
Dataset (sample) lookup_file
pld_stats_file
EducationalOrganization Quads: 67,326,432
URLs: 830,228
Hosts: 11,630
http://schema.org/EducationalOrganization (1,393,304)
http://schema.org/ListItem (1,202,332)
http://schema.org/ImageObject (983,688)
http://schema.org/PostalAddress (955,379)
http://schema.org/Person (627,436)
810.6 MB
(6)
EducationalOrganization (sample) lookup_file
pld_stats_file
Event Quads: 1,959,166,969
URLs: 14,077,443
Hosts: 399,466
http://schema.org/Event (62,976,813)
http://schema.org/Place (47,078,655)
http://schema.org/PostalAddress (36,842,200)
http://schema.org/Person (23,766,388)
http://schema.org/ListItem (19,233,928)
20.83 GB
(133)
Event (sample) lookup_file
pld_stats_file
FAQPage Quads: 1,416,284,547
URLs: 11,599,660
Hosts: 385,247
http://schema.org/Answer (48,925,434)
http://schema.org/Question (48,641,444)
http://schema.org/ListItem (30,143,784)
http://schema.org/ImageObject (20,603,824)
https://schema.org/Answer (17,418,044)
19.89 GB
(104)
FAQPage (sample) lookup_file
pld_stats_file
GeoCoordinates Quads: 3,183,190,155
URLs: 25,257,059
Hosts: 567,265
http://schema.org/ListItem (73,477,298)
http://schema.org/PostalAddress (53,035,222)
http://schema.org/GeoCoordinates (50,513,043)
http://schema.org/OpeningHoursSpecification (32,388,897)
http://schema.org/Offer (31,582,635)
33.28 GB
(237)
GeoCoordinates (sample) lookup_file
pld_stats_file
GovernmentOrganization Quads: 25,785,444
URLs: 389,490
Hosts: 1,940
http://schema.org/ListItem (1,425,244)
http://schema.org/GovernmentOrganization (547,285)
http://schema.org/ImageObject (478,529)
http://schema.org/PropertyValue (289,526)
http://schema.org/PostalAddress (228,164)
305.5 MB
(3)
GovernmentOrganization (sample) lookup_file
pld_stats_file
Hospital Quads: 17,743,553
URLs: 178,158
Hosts: 2,489
http://schema.org/PostalAddress (408,816)
http://schema.org/Hospital (341,523)
https://schema.org/MedicalProcedure (265,300)
http://schema.org/GeoCoordinates (230,027)
http://schema.org/ListItem (193,692)
185.01 MB
(2)
Hospital (sample) lookup_file
pld_stats_file
Hotel Quads: 244,111,716
URLs: 1,961,598
Hosts: 24,641
http://schema.org/ImageObject (12,124,401)
http://schema.org/Hotel (4,413,099)
http://schema.org/PostalAddress (4,118,923)
http://schema.org/ListItem (4,004,074)
http://schema.org/AggregateRating (2,332,130)
3.19 GB
(17)
Hotel (sample) lookup_file
pld_stats_file
JobPosting Quads: 175,205,867
URLs: 3,606,092
Hosts: 63,320
http://schema.org/PostalAddress (6,753,848)
http://schema.org/Place (6,688,964)
http://schema.org/Organization (4,451,973)
http://schema.org/JobPosting (4,068,348)
http://schema.org/ListItem (2,519,108)
4.86 GB
(14)
JobPosting (sample) lookup_file
pld_stats_file
LakeBodyOfWater Quads: 35,276
URLs: 689
Hosts: 100
http://schema.org/ImageObject (1,060)
http://schema.org/Organization (765)
http://schema.org/WebPage (687)
http://schema.org/LakeBodyOfWater (681)
http://schema.org/Person (562)
0.63 MB
(1)
LakeBodyOfWater (sample) lookup_file
pld_stats_file
LandmarksOrHistoricalBuildings Quads: 3,005,418
URLs: 33,100
Hosts: 460
http://schema.org/ImageObject (112,997)
http://schema.org/LandmarksOrHistoricalBuildings (95,367)
http://schema.org/PostalAddress (64,910)
http://schema.org/CreativeWork (50,722)
http://schema.org/OpeningHoursSpecification (49,374)
49.11 MB
(1)
LandmarksOrHistoricalBuildings (sample) lookup_file
pld_stats_file
Language Quads: 586,551,994
URLs: 4,742,085
Hosts: 11,556
http://schema.org/Person (25,797,771)
http://schema.org/Comment (19,971,596)
http://schema.org/ListItem (10,307,076)
http://schema.org/Language (9,360,716)
http://schema.org/InteractionCounter (7,608,122)
9.82 GB
(46)
Language (sample) lookup_file
pld_stats_file
Library Quads: 7,343,688
URLs: 206,299
Hosts: 938
http://schema.org/Library (220,963)
http://schema.org/Place (115,805)
http://schema.org/CreativeWork (108,818)
http://schema.org/ListItem (95,132)
http://schema.org/PostalAddress (90,187)
75.56 MB
(1)
Library (sample) lookup_file
pld_stats_file
LocalBusiness Quads: 2,245,941,658
URLs: 27,184,047
Hosts: 1,456,650
http://schema.org/ListItem (68,902,449)
http://schema.org/LocalBusiness (42,248,382)
http://schema.org/PostalAddress (39,579,198)
http://schema.org/ImageObject (16,996,637)
http://schema.org/Offer (16,958,169)
23.23 GB
(176)
LocalBusiness (sample) lookup_file
pld_stats_file
Mountain Quads: 232,960
URLs: 11,296
Hosts: 63
http://schema.org/Mountain (20,970)
http://schema.org/GeoCoordinates (13,074)
http://schema.org/propertyValue (5,749)
http://schema.org/ListItem (1,101)
http://schema.org/Place (712)
2.56 MB
(1)
Mountain (sample) lookup_file
pld_stats_file
Movie Quads: 150,224,569
URLs: 1,849,096
Hosts: 8,969
http://schema.org/Person (9,033,142)
http://schema.org/Movie (3,785,443)
http://schema.org/ListItem (2,092,306)
http://schema.org/AggregateRating (1,498,480)
http://schema.org/Place (1,232,017)
1.82 GB
(12)
Movie (sample) lookup_file
pld_stats_file
Museum Quads: 5,066,100
URLs: 81,577
Hosts: 653
http://schema.org/PostalAddress (108,570)
http://schema.org/ListItem (81,923)
http://schema.org/Museum (81,127)
http://schema.org/ImageObject (72,825)
http://schema.org/OpeningHoursSpecification (63,146)
47.21 MB
(1)
Museum (sample) lookup_file
pld_stats_file
MusicAlbum Quads: 81,151,787
URLs: 582,435
Hosts: 2,812
http://schema.org/Country (6,016,664)
http://schema.org/Offer (2,290,150)
http://schema.org/MusicRecording (2,229,256)
http://schema.org/MusicAlbum (1,964,884)
http://schema.org/MusicGroup (1,252,134)
755.87 MB
(5)
MusicAlbum (sample) lookup_file
pld_stats_file
MusicRecording Quads: 115,955,562
URLs: 879,727
Hosts: 5,314
http://schema.org/MusicRecording (6,361,413)
http://schema.org/Country (4,576,815)
http://schema.org/Offer (2,499,674)
http://schema.org/MusicAlbum (1,372,400)
https://schema.org/MusicRecording (1,357,950)
1.08 GB
(7)
MusicRecording (sample) lookup_file
pld_stats_file
Organization Quads: 40,063,217,202
URLs: 612,866,985
Hosts: 4,318,211
http://schema.org/ListItem (1,116,144,976)
http://schema.org/ImageObject (837,488,023)
http://schema.org/Organization (825,414,748)
http://schema.org/Offer (451,019,145)
http://schema.org/BreadcrumbList (390,102,953)
488.41 GB
(3072)
Organization (sample) lookup_file
pld_stats_file
Painting Quads: 10,557,775
URLs: 62,179
Hosts: 530
http://schema.org/Person (2,199,905)
http://schema.org/Offer (478,440)
http://schema.org/Painting (264,229)
http://schema.org/Product (154,817)
http://schema.org/ListItem (90,303)
75.59 MB
(1)
Painting (sample) lookup_file
pld_stats_file
Park Quads: 645,285
URLs: 8,015
Hosts: 337
http://schema.org/PostalAddress (25,328)
http://schema.org/Organization (15,538)
http://schema.org/Park (8,571)
http://schema.org/ListItem (7,464)
http://schema.org/GeoCoordinates (7,251)
6.27 MB
(1)
Park (sample) lookup_file
pld_stats_file
Person Quads: 25,755,663,162
URLs: 332,374,298
Hosts: 5,567,680
http://schema.org/ImageObject (603,863,544)
http://schema.org/Person (553,126,279)
http://schema.org/ListItem (552,466,104)
http://schema.org/Organization (273,891,715)
http://schema.org/WebPage (271,622,375)
401.48 GB
(1953)
Person (sample) lookup_file
pld_stats_file
Place Quads: 3,314,637,936
URLs: 26,959,041
Hosts: 536,276
http://schema.org/Place (84,439,411)
http://schema.org/ListItem (69,600,800)
http://schema.org/PostalAddress (68,404,034)
http://schema.org/Event (51,433,809)
http://schema.org/Person (34,850,605)
38.3 GB
(246)
Place (sample) lookup_file
pld_stats_file
Product Quads: 21,539,828,659
URLs: 279,715,051
Hosts: 3,309,209
http://schema.org/Offer (749,382,740)
http://schema.org/ListItem (500,258,027)
http://schema.org/Product (492,076,060)
http://schema.org/Organization (279,065,071)
http://schema.org/ImageObject (153,492,459)
242.98 GB
(1658)
Product (sample) lookup_file
pld_stats_file
QAPage Quads: 150,385,974
URLs: 2,328,487
Hosts: 11,113
http://schema.org/Person (8,305,793)
http://schema.org/Answer (6,534,539)
http://schema.org/ListItem (2,161,061)
http://schema.org/Question (2,116,916)
http://schema.org/QAPage (2,000,004)
2.76 GB
(12)
QAPage (sample) lookup_file
pld_stats_file
Question Quads: 1,632,186,128
URLs: 15,016,691
Hosts: 418,451
http://schema.org/Answer (59,457,402)
http://schema.org/Question (52,839,646)
http://schema.org/ListItem (32,374,542)
https://schema.org/Answer (21,589,599)
http://schema.org/ImageObject (21,100,233)
23.99 GB
(120)
Question (sample) lookup_file
pld_stats_file
RadioStation Quads: 11,699,488
URLs: 236,850
Hosts: 862
http://schema.org/ListItem (318,036)
http://schema.org/RadioStation (285,586)
http://schema.org/NewsArticle (201,586)
http://schema.org/ImageObject (161,882)
http://schema.org/WPSideBar (123,784)
161.49 MB
(1)
RadioStation (sample) lookup_file
pld_stats_file
Recipe Quads: 258,349,284
URLs: 2,746,545
Hosts: 37,304
http://schema.org/HowToStep (8,610,659)
http://schema.org/ListItem (5,355,018)
http://schema.org/ImageObject (3,430,763)
http://schema.org/Person (3,051,861)
http://schema.org/Recipe (2,922,378)
3.84 GB
(21)
Recipe (sample) lookup_file
pld_stats_file
Restaurant Quads: 158,662,564
URLs: 1,186,870
Hosts: 84,256
http://schema.org/Offer (6,208,346)
http://schema.org/MenuItem (3,963,413)
http://schema.org/Restaurant (2,969,736)
http://schema.org/Product (2,780,205)
http://schema.org/ListItem (2,372,360)
1.59 GB
(11)
Restaurant (sample) lookup_file
pld_stats_file
RiverBodyOfWater Quads: 170,020
URLs: 1,418
Hosts: 25
https://schema.org/Canal (16,992)
https://schema.org/Service (5,580)
http://schema.org/ImageObject (2,198)
http://schema.org/ListItem (2,022)
http://schema.org/TouristDestination (1,746)
1.38 MB
(1)
RiverBodyOfWater (sample) lookup_file
pld_stats_file
School Quads: 10,071,921
URLs: 187,087
Hosts: 2,099
http://schema.org/School (291,497)
http://schema.org/ListItem (194,016)
http://schema.org/PostalAddress (180,523)
http://schema.org/Organization (106,718)
http://schema.org/ImageObject (95,256)
113.51 MB
(1)
School (sample) lookup_file
pld_stats_file
SearchAction Quads: 27,878,062,181
URLs: 417,720,816
Hosts: 6,756,347
http://schema.org/ListItem (1,052,351,965)
http://schema.org/ImageObject (653,529,667)
http://schema.org/WebSite (433,191,069)
http://schema.org/SearchAction (422,553,707)
http://schema.org/BreadcrumbList (408,755,536)
265.53 GB
(2194)
SearchAction (sample) lookup_file
pld_stats_file
ShoppingCenter Quads: 15,255,169
URLs: 135,249
Hosts: 1,345
http://schema.org/Offer (363,660)
http://schema.org/ListItem (251,172)
http://schema.org/PostalAddress (249,166)
http://schema.org/Organization (238,757)
http://schema.org/ShoppingCenter (180,907)
157.32 MB
(2)
ShoppingCenter (sample) lookup_file
pld_stats_file
SkiResort Quads: 1,173,165
URLs: 28,128
Hosts: 245
http://schema.org/ListItem (42,596)
http://schema.org/SkiResort (38,305)
http://schema.org/PostalAddress (24,781)
http://schema.org/Person (21,854)
http://schema.org/Review (21,440)
15.61 MB
(1)
SkiResort (sample) lookup_file
pld_stats_file
SportsEvent Quads: 118,751,716
URLs: 801,065
Hosts: 7,213
http://schema.org/SportsTeam (6,022,233)
http://schema.org/SportsEvent (5,823,577)
http://schema.org/Place (5,054,488)
http://schema.org/PostalAddress (4,570,509)
http://schema.org/Organization (1,017,313)
901.17 MB
(9)
SportsEvent (sample) lookup_file
pld_stats_file
SportsTeam Quads: 99,701,119
URLs: 754,061
Hosts: 4,063
http://schema.org/SportsTeam (7,166,129)
http://schema.org/SportsEvent (2,995,518)
http://schema.org/Place (2,387,843)
http://schema.org/PostalAddress (2,094,582)
http://schema.org/Person (1,310,001)
810.39 MB
(8)
SportsTeam (sample) lookup_file
pld_stats_file
StadiumOrArena Quads: 14,431,742
URLs: 57,177
Hosts: 256
http://schema.org/SportsTeam (937,935)
http://schema.org/StadiumOrArena (322,752)
http://schema.org/SportsEvent (247,964)
http://schema.org/SportsMatchCompetitor (247,746)
http://schema.org/Organization (231,215)
108.79 MB
(2)
StadiumOrArena (sample) lookup_file
pld_stats_file
TVEpisode Quads: 29,569,610
URLs: 220,849
Hosts: 1,065
http://schema.org/Country (3,253,437)
http://schema.org/TVEpisode (974,857)
http://schema.org/Person (505,723)
https://schema.org/TVEpisode (299,969)
http://schema.org/TVSeries (213,849)
249.69 MB
(3)
TVEpisode (sample) lookup_file
pld_stats_file
TelevisionStation Quads: 1,927,220
URLs: 22,720
Hosts: 89
http://schema.org/ListItem (44,890)
http://schema.org/ImageObject (41,683)
http://schema.org/TelevisionStation (39,376)
http://schema.org/Person (26,370)
http://schema.org/WebPage (24,915)
21.03 MB
(1)
TelevisionStation (sample) lookup_file
pld_stats_file
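The related-classes figures above list class URIs with instance counts, and some classes appear under both the http://schema.org/ and https://schema.org/ namespaces (for example Answer). The following Python sketch shows one way a reader might tally class instances from parsed triples while merging the two namespace variants. It is an assumption-laden illustration, not the method used to compute the statistics above: the function names are hypothetical, the input is assumed to be (subject, predicate, object) tuples with IRIs in angle brackets, and whether merging the variants is appropriate depends on the analysis.

```python
from collections import Counter

# Predicate that assigns a class to an entity in RDF.
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def normalize_class(iri):
    """Strip angle brackets and map https://schema.org/ to http://schema.org/."""
    iri = iri.strip("<>")
    prefix = "https://schema.org/"
    if iri.startswith(prefix):
        return "http://schema.org/" + iri[len(prefix):]
    return iri


def count_classes(triples):
    """Count rdf:type statements per class, treating both namespace variants as one."""
    counts = Counter()
    for _subject, predicate, obj in triples:
        if predicate == RDF_TYPE:
            counts[normalize_class(obj)] += 1
    return counts


# Hypothetical triples, made up for illustration only.
example_triples = [
    ("_:a", RDF_TYPE, "<http://schema.org/Answer>"),
    ("_:b", RDF_TYPE, "<https://schema.org/Answer>"),
    ("_:c", RDF_TYPE, "<http://schema.org/Question>"),
]

for class_iri, count in count_classes(example_triples).most_common():
    print(class_iri, count)
```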


If you are interested in a particular class or set of classes that is not listed above, please contact the Web Data Commons team via the mailing list or our Google Group.

Conversion to Other Formats

We provide the extracted data for download using a variation of the N-Quads format. For users who prefer other formats, we provide code for converting the download files into CSV and JSON formats, which are supported by a wide range of spreadsheet applications, relational databases, and data mining frameworks such as the Python data analysis library pandas. Please find further details on how to convert the download files to other formats on the main page.
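To illustrate what such a conversion involves, the following Python sketch writes N-Quads lines to a four-column CSV using only the standard library. It is not the conversion code referred to above: the function name, the column names, and the simplified regular expression are assumptions made for this example, and the sample quads are invented.

```python
import csv
import io
import re

# Simplified pattern for one N-Quads statement (assumption: graph label is an IRI).
QUAD_RE = re.compile(r'^(\S+)\s+(\S+)\s+(.+)\s+<([^<>\s]+)>\s*\.\s*$')


def quads_to_csv(lines, out):
    """Write matching quads to `out` as CSV rows and return the number of rows written."""
    writer = csv.writer(out)
    writer.writerow(["subject", "predicate", "object", "page_url"])
    written = 0
    for line in lines:
        match = QUAD_RE.match(line.strip())
        if match is None:
            continue  # skip blank or malformed lines
        writer.writerow(match.groups())
        written += 1
    return written


# Hypothetical sample quads, made up for illustration only.
sample_lines = [
    '_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> '
    '<http://schema.org/Event> <https://example.com/e/1> .',
    '_:b0 <http://schema.org/name> "Open day"@en <https://example.com/e/1> .',
]

buffer = io.StringIO()
rows_written = quads_to_csv(sample_lines, buffer)
print(rows_written)
print(buffer.getvalue())
```

The resulting CSV keeps one statement per row, so it can be loaded into a spreadsheet or a data frame and filtered by the `page_url` column.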

Get the Code

The Jupyter notebooks used to create the Schema.org subsets from the Microdata and JSON-LD corpus can be checked out from our Git repository.

The extraction of December 2024 was done with version 1.5 of the extractor. For more information about the framework and a detailed description of how to run your own extraction, visit the framework page.

Get Support

Please send questions and feedback to the Web Data Commons mailing list or post them in our Web Data Commons Google Group.