Class-Specific Subsets of the Schema.org Data contained in the October 2021 Corpus

This page provides access to and statistics about class-specific subsets of the Schema.org data contained in the October 2021 version of the Web Data Commons Microdata and JSON-LD corpus. The datasets are part of the Web Data Commons Schema.org Data Set Series.

Introduction

As many users are only interested in specific types of Schema.org data (like product data, event data, job postings, or data describing local businesses), we have created class-specific subsets out of the complete and merged Microdata and JSON-LD corpora for a selection of schema.org classes. The subsets contain all instances of a specific class in either format as well as all other data that is found on the webpages containing these instances. For example, a page containing data about a product might also contain reviews and offers for this product; a page containing data about an event might also contain data about the location of the event and the persons involved in the event. The data is represented in N-Quads format, meaning that the fourth element of each quad contains the URL of the webpage from which the data was extracted. To facilitate download of and access to the class-specific data, we provide the schema.org subsets in chunks. Each chunk contains quads of specific pay-level domains (PLDs), i.e. all quads of one PLD, e.g. yummly.com, are organized within the same chunk file. Additionally, we provide lookup files containing the mappings between PLDs and their corresponding chunks as well as CSV files with PLD-specific statistics.
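As a minimal sketch of how the fourth element of each quad can be used, the following Python snippet parses N-Quads lines with a simplified regular expression and counts quads per source page. The function names `parse_quad` and `count_quads_per_page` and the sample lines are hypothetical, and the pattern covers only the common term shapes (IRIs, blank nodes, and literals with an optional language tag or datatype); it is an assumption that the download files follow this shape closely enough, not a full N-Quads parser.

```python
import re
from collections import Counter

# Simplified pattern for one term: IRI, blank node, or literal with an
# optional language tag or datatype. Assumption: not the full N-Quads grammar.
_TERM = r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)'
_QUAD = re.compile(r"^\s*" + r"\s+".join([_TERM] * 4) + r"\s*\.\s*$")


def parse_quad(line):
    """Return (subject, predicate, object, graph) or None if the line does not match."""
    match = _QUAD.match(line)
    return match.groups() if match else None


def count_quads_per_page(lines):
    """Count quads per source page, using the fourth element (the page URL)."""
    counts = Counter()
    for line in lines:
        quad = parse_quad(line)
        if quad is not None:
            counts[quad[3].strip("<>")] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines, for illustration only.
    sample = [
        '_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> '
        '<http://schema.org/Recipe> <https://www.yummly.com/recipe/x> .',
        '_:b0 <http://schema.org/name> "Lava Cake"@en '
        '<https://www.yummly.com/recipe/x> .',
    ]
    for url, n in count_quads_per_page(sample).items():
        print(url, n)
```

Lines that do not match the pattern are skipped rather than raising an error, which seems a reasonable default when scanning large chunk files.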

Please note that:

You are welcome to use the datasets and to share your findings. If you find our datasets useful for your research, please cite the poster: The Web Data Commons Schema.org Data Set Series by Alexander Brinkmann, Anna Primpeli and Christian Bizer in Companion Proceedings of the ACM Web Conference 2023 (WWW ’23 Companion), Austin, Texas, USA, April 2023.

Class-Specific Subsets of the Schema.org Data

Schema.org Subset | General Stats | Related Classes | Size (# Files) | Download (Sample) | PLD to File Look-up | PLD-Specific Stats
AdministrativeArea Quads: 60,765,271
URLs: 447,918
PLDs: 2,622
http://schema.org/ImageObject (944,007)
http://schema.org/AdministrativeArea (850,664)
http://schema.org/ListItem (802,971)
http://schema.org/City (576,671)
http://schema.org/Person (562,119)
842.04 MB
(1 file(s))
AdministrativeArea (sample) lookup_file
pld_stats_file
Airport Quads: 64,083,243
URLs: 202,332
PLDs: 617
http://schema.org/Airport (5,406,690)
http://schema.org/GeoCoordinates (2,454,639)
http://schema.org/Flight (1,835,551)
http://schema.org/Airline (1,662,291)
http://schema.org/Offer (1,473,869)
486.08 MB
(1 file(s))
Airport (sample) lookup_file
pld_stats_file
Book Quads: 360,291,528
URLs: 6,444,929
PLDs: 17,998
http://schema.org/Book (18,731,029)
http://schema.org/Country (8,274,215)
http://schema.org/Person (7,899,124)
http://schema.org/Offer (7,376,178)
http://schema.org/ListItem (3,784,768)
6.2 GB
(10 file(s))
Book (sample) lookup_file
pld_stats_file
City Quads: 199,894,118
URLs: 1,276,457
PLDs: 11,260
http://schema.org/City (4,662,681)
http://schema.org/ImageObject (3,526,546)
http://schema.org/PostalAddress (3,375,680)
http://schema.org/Person (3,215,783)
http://schema.org/ListItem (3,091,827)
2.1 GB
(2 file(s))
City (sample) lookup_file
pld_stats_file
CollegeOrUniversity Quads: 118,551,289
URLs: 1,219,624
PLDs: 3,007
http://schema.org/CollegeOrUniversity (4,566,888)
http://schema.org/Person (3,332,672)
http://schema.org/ImageObject (3,149,952)
http://schema.org/PostalAddress (2,949,686)
http://schema.org/ListItem (1,589,546)
1.5 GB
(2 file(s))
CollegeOrUniversity (sample) lookup_file
pld_stats_file
Continent Quads: 1,403,157
URLs: 10,604
PLDs: 55
http://schema.org/City (176,404)
http://schema.org/AdministrativeArea (89,871)
http://schema.org/Continent (12,635)
http://schema.org/Country (11,200)
http://schema.org/GeoCoordinates (9,053)
13.57 MB
(1 file(s))
Continent (sample) lookup_file
pld_stats_file
Country Quads: 790,910,810
URLs: 6,673,935
PLDs: 40,409
http://schema.org/Country (56,150,693)
http://schema.org/ListItem (23,345,865)
http://schema.org/Organization (14,653,383)
http://schema.org/Offer (10,637,184)
http://schema.org/ContactPoint (10,181,864)
10.1 GB
(3 file(s))
Country (sample) lookup_file
pld_stats_file
CreativeWork Quads: 2,858,229,005
URLs: 47,768,961
PLDs: 1,033,049
https://schema.org/CreativeWork (79,159,623)
https://schema.org/Person (75,522,631)
https://schema.org/Comment (44,714,553)
https://schema.org/SiteNavigationElement (42,802,162)
https://schema.org/WPHeader (29,937,233)
97.9 GB
(160 file(s))
CreativeWork (sample) lookup_file
pld_stats_file
Dataset Quads: 38,953,280
URLs: 895,386
PLDs: 1,555
http://schema.org/Dataset (1,286,271)
http://schema.org/Organization (961,307)
http://schema.org/PropertyValue (878,478)
http://schema.org/DataDownload (790,598)
http://schema.org/ImageObject (517,210)
574.78 MB
(1 file(s))
Dataset (sample) lookup_file
pld_stats_file
EducationalOrganization Quads: 76,596,728
URLs: 1,102,570
PLDs: 8,375
http://schema.org/EducationalOrganization (2,206,149)
http://schema.org/PostalAddress (1,558,213)
http://schema.org/ListItem (1,526,639)
http://schema.org/ImageObject (706,628)
http://schema.org/Course (672,205)
1.2 GB
(1 file(s))
EducationalOrganization (sample) lookup_file
pld_stats_file
Event Quads: 1,633,842,997
URLs: 13,433,021
PLDs: 261,160
http://schema.org/Event (72,764,507)
http://schema.org/Place (52,668,977)
http://schema.org/PostalAddress (38,480,951)
http://schema.org/Person (20,414,748)
http://schema.org/Offer (15,676,301)
22.1 GB
(12 file(s))
Event (sample) lookup_file
pld_stats_file
GeoCoordinates Quads: 3,226,742,821
URLs: 28,650,982
PLDs: 469,253
http://schema.org/PostalAddress (62,911,798)
http://schema.org/GeoCoordinates (59,146,834)
http://schema.org/ListItem (42,608,148)
http://schema.org/ImageObject (32,868,276)
http://schema.org/Offer (29,585,584)
42.3 GB
(20 file(s))
GeoCoordinates (sample) lookup_file
pld_stats_file
GovernmentOrganization Quads: 16,628,646
URLs: 342,837
PLDs: 1,304
http://schema.org/ImageObject (575,166)
http://schema.org/GovernmentOrganization (530,788)
http://schema.org/ListItem (343,372)
http://schema.org/PostalAddress (296,180)
http://schema.org/Organization (231,342)
269.92 MB
(1 file(s))
GovernmentOrganization (sample) lookup_file
pld_stats_file
Hospital Quads: 21,693,491
URLs: 286,317
PLDs: 2,154
http://schema.org/PostalAddress (538,221)
http://schema.org/Hospital (443,902)
http://schema.org/ListItem (277,373)
http://schema.org/GeoCoordinates (238,584)
https://schema.org/MedicalCondition (230,568)
294.66 MB
(1 file(s))
Hospital (sample) lookup_file
pld_stats_file
Hotel Quads: 359,533,927
URLs: 2,609,123
PLDs: 23,487
http://schema.org/ImageObject (8,530,131)
http://schema.org/Hotel (7,448,020)
http://schema.org/LocationFeatureSpecification (6,061,145)
http://schema.org/Rating (5,313,944)
http://schema.org/ListItem (5,108,429)
4.6 GB
(4 file(s))
Hotel (sample) lookup_file
pld_stats_file
JobPosting Quads: 159,344,116
URLs: 3,698,141
PLDs: 43,357
http://schema.org/JobPosting (4,883,181)
http://schema.org/Place (4,881,113)
http://schema.org/PostalAddress (4,770,776)
http://schema.org/Organization (4,450,457)
http://schema.org/ListItem (2,213,865)
6.2 GB
(4 file(s))
JobPosting (sample) lookup_file
pld_stats_file
LakeBodyOfWater Quads: 1,007,227
URLs: 5,515
PLDs: 85
https://schema.org/AdministrativeArea (63,970)
https://schema.org/Place (39,398)
https://schema.org/Map (21,306)
https://schema.org/LakeBodyOfWater (17,822)
https://schema.org/ListItem (15,398)
10.26 MB
(1 file(s))
LakeBodyOfWater (sample) lookup_file
pld_stats_file
LandmarksOrHistoricalBuildings Quads: 2,499,817
URLs: 31,107
PLDs: 367
http://schema.org/LandmarksOrHistoricalBuildings (98,037)
http://schema.org/ImageObject (38,520)
http://schema.org/CreativeWork (37,995)
http://schema.org/PostalAddress (34,395)
http://schema.org/PropertyValue (30,455)
46.52 MB
(1 file(s))
LandmarksOrHistoricalBuildings (sample) lookup_file
pld_stats_file
Language Quads: 905,313,908
URLs: 7,490,791
PLDs: 11,793
http://schema.org/Person (44,425,861)
http://schema.org/Comment (37,520,118)
http://schema.org/ListItem (15,781,306)
http://schema.org/Language (11,051,639)
http://schema.org/InteractionCounter (10,489,920)
16.9 GB
(10 file(s))
Language (sample) lookup_file
pld_stats_file
Library Quads: 6,316,189
URLs: 189,051
PLDs: 617
http://schema.org/Library (211,533)
http://schema.org/OpeningHoursSpecification (200,112)
http://schema.org/Book (63,575)
http://schema.org/ListItem (60,836)
http://schema.org/PostalAddress (59,573)
103.42 MB
(1 file(s))
Library (sample) lookup_file
pld_stats_file
LocalBusiness Quads: 2,133,052,253
URLs: 36,545,099
PLDs: 727,613
http://schema.org/LocalBusiness (56,481,369)
http://schema.org/PostalAddress (49,720,404)
http://schema.org/ListItem (37,017,817)
http://schema.org/ImageObject (24,011,725)
http://schema.org/Rating (14,248,632)
29.6 GB
(40 file(s))
LocalBusiness (sample) lookup_file
pld_stats_file
Mountain Quads: 2,113,514
URLs: 33,646
PLDs: 60
https://schema.org/AdministrativeArea (101,191)
https://schema.org/Place (59,246)
http://schema.org/Mountain (46,130)
http://schema.org/GeoCoordinates (32,748)
https://schema.org/Map (31,990)
25.74 MB
(1 file(s))
Mountain (sample) lookup_file
pld_stats_file
Movie Quads: 188,319,450
URLs: 2,589,342
PLDs: 8,372
http://schema.org/Person (11,715,266)
http://schema.org/Movie (4,900,155)
https://schema.org/Person (3,242,657)
http://schema.org/ListItem (2,082,100)
https://schema.org/Comment (2,065,826)
3.2 GB
(5 file(s))
Movie (sample) lookup_file
pld_stats_file
Museum Quads: 7,104,754
URLs: 128,841
PLDs: 645
http://schema.org/OpeningHoursSpecification (173,199)
http://schema.org/PostalAddress (124,607)
http://schema.org/Museum (121,880)
http://schema.org/Person (94,311)
https://schema.org/Museum (90,757)
91.81 MB
(1 file(s))
Museum (sample) lookup_file
pld_stats_file
MusicAlbum Quads: 144,932,250
URLs: 1,325,222
Hosts: 15,569
http://schema.org/MusicRecording (10,531,635)
http://schema.org/Country (8,875,501)
http://schema.org/MusicAlbum (3,006,915)
http://schema.org/Offer (1,984,069)
http://schema.org/MusicGroup (1,974,991)
1.27 GB
(2 file(s))
MusicAlbum (sample) lookup_file
pld_stats_file
MusicRecording Quads: 224,396,466
URLs: 2,347,138
PLDs: 23,401
http://schema.org/Country (17,125,378)
http://schema.org/MusicRecording (16,930,443)
http://schema.org/MusicGroup (2,942,496)
http://schema.org/Offer (2,274,844)
http://schema.org/MusicAlbum (1,991,504)
2.0 GB
(4 file(s))
MusicRecording (sample) lookup_file
pld_stats_file
Organization Quads: 33,849,515,925
URLs: 586,167,976
PLDs: 5,590,365
http://schema.org/ListItem (868,673,730)
http://schema.org/Organization (768,124,229)
http://schema.org/ImageObject (723,967,119)
http://schema.org/WebPage (395,618,415)
http://schema.org/Person (388,122,526)
614 GB
(500 file(s))
Organization (sample) lookup_file
pld_stats_file
Painting Quads: 20,541,804
URLs: 165,580
PLDs: 451
http://schema.org/Person (5,608,152)
http://schema.org/Painting (594,433)
http://schema.org/ListItem (320,224)
http://schema.org/Offer (193,129)
http://schema.org/Organization (154,693)
161.87 MB
(1 file(s))
Painting (sample) lookup_file
pld_stats_file
Park Quads: 1,992,346
URLs: 18,013
PLDs: 295
https://schema.org/AdministrativeArea (50,733)
https://schema.org/Place (42,126)
http://schema.org/GeoCoordinates (36,681)
http://schema.org/Park (30,678)
http://schema.org/PostalAddress (27,463)
26.31 MB
(1 file(s))
Park (sample) lookup_file
pld_stats_file
Person Quads: 25,370,826,738
URLs: 379,276,912
PLDs: 4,162,621
http://schema.org/Person (683,705,984)
http://schema.org/ImageObject (645,796,885)
http://schema.org/ListItem (488,479,836)
http://schema.org/WebPage (358,768,733)
http://schema.org/Organization (294,732,412)
492.0 GB
(527 file(s))
Person (sample) lookup_file
pld_stats_file
Place Quads: 3,265,055,697
URLs: 29,633,238
PLDs: 378,270
http://schema.org/Place (97,713,014)
http://schema.org/PostalAddress (76,701,627)
http://schema.org/Event (57,270,391)
http://schema.org/ListItem (36,550,268)
http://schema.org/Person (34,451,494)
47.1 GB
(25 file(s))
Place (sample) lookup_file
pld_stats_file
Product Quads: 17,301,144,036
URLs: 271,813,425
PLDs: 2,583,228
http://schema.org/Offer (610,738,978)
http://schema.org/Product (590,894,883)
http://schema.org/ListItem (437,803,391)
http://schema.org/Organization (212,004,989)
http://schema.org/BreadcrumbList (117,297,771)
274.7 GB
(300 file(s))
Product (sample) lookup_file
pld_stats_file
RadioStation Quads: 14,952,891
URLs: 337,200
PLDs: 663
http://schema.org/ListItem (693,857)
http://schema.org/RadioStation (373,590)
http://schema.org/NewsArticle (323,919)
http://schema.org/ImageObject (153,898)
http://schema.org/PostalAddress (144,450)
227.1 MB
(1 file(s))
RadioStation (sample) lookup_file
pld_stats_file
Recipe Quads: 367,757,913
URLs: 4,521,389
PLDs: 42,495
http://schema.org/HowToStep (10,567,769)
http://schema.org/ListItem (6,030,978)
http://schema.org/Person (5,353,168)
http://schema.org/Recipe (4,937,223)
http://schema.org/ImageObject (4,499,447)
6.8 GB
(5 file(s))
Recipe (sample) lookup_file
pld_stats_file
Restaurant Quads: 233,766,701
URLs: 1,556,445
PLDs: 57,710
http://schema.org/Offer (11,319,253)
http://schema.org/MenuItem (10,559,825)
http://schema.org/Product (4,257,912)
http://schema.org/Restaurant (4,144,203)
http://schema.org/ListItem (3,457,459)
2.5 GB
(2 file(s))
Restaurant (sample) lookup_file
pld_stats_file
RiverBodyOfWater Quads: 367,856
URLs: 3,945
PLDs: 24
https://schema.org/AdministrativeArea (14,587)
https://schema.org/Place (8,679)
https://schema.org/BodyOfWater (6,838)
http://schema.org/ListItem (5,234)
http://schema.org/ImageObject (5,138)
4.93 MB
(1 file(s))
RiverBodyOfWater (sample) lookup_file
pld_stats_file
School Quads: 15,037,483
URLs: 280,034
PLDs: 1,734
http://schema.org/ListItem (455,460)
http://schema.org/School (401,765)
http://schema.org/PostalAddress (262,017)
http://schema.org/WebPage (201,885)
http://schema.org/ImageObject (135,657)
218.82 MB
(1 file(s))
School (sample) lookup_file
pld_stats_file
ShoppingCenter Quads: 13,571,658
URLs: 166,061
PLDs: 1,316
http://schema.org/Offer (313,733)
http://schema.org/PostalAddress (296,514)
http://schema.org/Organization (282,880)
http://schema.org/ShoppingCenter (258,638)
http://schema.org/Product (147,719)
163.38 MB
(1 file(s))
ShoppingCenter (sample) lookup_file
pld_stats_file
SkiResort Quads: 1,034,270
URLs: 30,349
PLDs: 220
http://schema.org/SkiResort (31,231)
http://schema.org/PostalAddress (28,630)
http://schema.org/ListItem (27,387)
http://schema.org/AggregateRating (21,499)
http://schema.org/Organization (15,414)
18.51 MB
(1 file(s))
SkiResort (sample) lookup_file
pld_stats_file
SportsEvent Quads: 141,738,795
URLs: 957,983
PLDs: 6,844
http://schema.org/SportsTeam (7,329,202)
http://schema.org/SportsEvent (6,290,522)
http://schema.org/Place (5,328,600)
http://schema.org/PostalAddress (4,300,669)
http://schema.org/Organization (1,676,829)
1.2 GB
(3 file(s))
SportsEvent (sample) lookup_file
pld_stats_file
SportsTeam Quads: 132,540,282
URLs: 936,635
PLDs: 4,770
http://schema.org/SportsTeam (9,126,369)
http://schema.org/Place (4,777,298)
http://schema.org/SportsEvent (3,758,544)
http://schema.org/PostalAddress (2,995,648)
http://schema.org/Person (1,736,128)
1.1 GB
(3 file(s))
SportsTeam (sample) lookup_file
pld_stats_file
StadiumOrArena Quads: 26,788,192
URLs: 95,667
PLDs: 235
http://schema.org/Place (1,331,291)
http://schema.org/SportsTeam (720,717)
http://schema.org/Organization (637,179)
http://schema.org/ImageObject (577,567)
http://schema.org/StadiumOrArena (328,592)
221.26 MB
(1 file(s))
StadiumOrArena (sample) lookup_file
pld_stats_file
TVEpisode Quads: 85,209,605
URLs: 460,809
PLDs: 1,284
http://schema.org/Country (7,956,609)
http://schema.org/TVEpisode (3,579,380)
https://schema.org/TVEpisode (1,303,687)
http://schema.org/Person (1,253,586)
http://schema.org/OnDemandEvent (732,639)
801.1 MB
(1 file(s))
TVEpisode (sample) lookup_file
pld_stats_file
TelevisionStation Quads: 1,184,918
URLs: 16,879
PLDs: 103
http://schema.org/ListItem (25,527)
http://schema.org/ImageObject (21,425)
http://schema.org/TelevisionStation (20,807)
http://schema.org/SiteNavigationElement (19,267)
http://schema.org/CreativeWorkSeries (17,466)
17.63 MB
(1 file(s))
TelevisionStation (sample) lookup_file
pld_stats_file


In case you are interested in a particular class or set of classes which is not listed above, please get in contact with the WebDataCommons team via Mailing List or our Google Group.

Conversion to Other Formats

We provide the extracted data for download using a variation of the N-Quads format. For users who prefer other formats, we provide code for converting the download files into CSV and JSON formats, which are supported by a wide range of spreadsheet applications, relational databases, and data mining frameworks such as the Python data analysis library pandas. Please find further details on how to convert the download files to other formats on the main page.
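To illustrate the idea of such a conversion, here is a small Python sketch that writes N-Quads lines as CSV rows with one column per quad element. It is not the conversion code referred to above; the function name `quads_to_csv`, the column names, and the simplified term pattern are assumptions made for this example only.

```python
import csv
import io
import re

# Simplified pattern for one term (IRI, blank node, or literal).
# Assumption: this does not cover every detail of the N-Quads grammar.
_TERM = r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)'
_QUAD = re.compile(r"^\s*" + r"\s+".join([_TERM] * 4) + r"\s*\.\s*$")


def quads_to_csv(lines, out):
    """Write matching quads to `out` as CSV and return the number of rows written."""
    writer = csv.writer(out)
    # Hypothetical column names chosen for this sketch.
    writer.writerow(["subject", "predicate", "object", "source_url"])
    written = 0
    for line in lines:
        match = _QUAD.match(line)
        if match is None:
            continue  # skip lines that do not look like a quad
        writer.writerow(match.groups())
        written += 1
    return written


if __name__ == "__main__":
    # Hypothetical sample input, for illustration only.
    sample = [
        '_:b1 <http://schema.org/name> "Lava Cake" <https://www.yummly.com/recipe/x> .',
        "not a quad",
    ]
    buffer = io.StringIO()
    quads_to_csv(sample, buffer)
    print(buffer.getvalue())
```

The resulting CSV can then be loaded with any tool that reads CSV, for example `pandas.read_csv`.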

Get the Code

The Jupyter notebooks used to create the schema.org subsets from the Microdata and JSON-LD corpus can be checked out from our Git repository.

The December 2021 extraction was done with version 1.5 of the extractor. For more information about the framework and a detailed description of how to run your own extraction, visit the framework page.

Get Support

Please send questions and feedback to the Web Data Commons mailing list or post them in our Web Data Commons Google Group.