Class-Specific Subsets of the Schema.org Data contained in the November 2017 Corpus

This page provides access to and statistics about class-specific subsets of the Schema.org data contained in the November 2017 version of the Web Data Commons Microdata corpus.

Introduction

As many users are only interested in specific types of Schema.org data (such as product data, event data, or address data), we have extracted class-specific subsets from the complete Microdata corpus for a selection of Schema.org classes. Each subset contains all instances of a specific class as well as all other data found on the webpages that contain these instances. For example, a page containing data about a product might also contain reviews and offers for this product; a page containing data about an event might also contain data about the location of the event and the persons involved in it. The data is represented in N-Quads format, meaning that the fourth element of each quad contains the URL of the webpage from which the data was extracted.
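
To illustrate the quad layout described above, the following is a minimal sketch of how a single N-Quads line could be split into its four elements, with the fourth one being the URL of the source webpage. The regular expression is a simplification rather than a full N-Quads parser, and the names `parse_quad` and `Quad` as well as the sample line are hypothetical and not part of the Web Data Commons tooling.

```python
import re
from collections import namedtuple

# Hypothetical container for the four elements of one quad.
Quad = namedtuple("Quad", "subject predicate object page_url")

# Simplified pattern (an assumption, not a complete N-Quads grammar):
# subject is an IRI or blank node, predicate is an IRI, the object is
# whatever lies in between, and the last <...> before the final dot is
# the graph label, i.e. the URL of the page the data came from.
QUAD_RE = re.compile(
    r'^(?P<subject><[^>]*>|_:\S+)\s+'
    r'(?P<predicate><[^>]*>)\s+'
    r'(?P<object>.+?)\s+'
    r'<(?P<page_url>[^>]*)>\s*\.\s*$'
)


def parse_quad(line):
    """Return a Quad for one N-Quads line, or None if it does not match."""
    match = QUAD_RE.match(line)
    if match is None:
        return None
    return Quad(
        match.group("subject"),
        match.group("predicate"),
        match.group("object"),
        match.group("page_url"),
    )


# Illustrative line; the blank node and page URL are made up.
sample = (
    '_:n1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> '
    '<http://schema.org/Product> <http://example.com/p/1> .'
)
quad = parse_quad(sample)
```

Here `quad.page_url` holds the fourth element without its angle brackets, which makes it easy to group all quads that were extracted from the same page.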

Please note that

You are welcome to use the datasets and to report your findings. If you find our datasets useful for your research, please cite the paper: The WebDataCommons Microdata, RDFa and Microformat Dataset Series by Robert Meusel, Petar Petrovski, and Christian Bizer in the Proceedings of the 13th International Semantic Web Conference: Replication, Benchmark, Data and Software Track (ISWC2014).

Contents

  • Class-Specific Subsets of the Schema.org Data
  • Extended Analysis
  • Code
  • Get Support

    Class-Specific Subsets of the Schema.org Data

    Class Name | Total Number of (Quads, URLs, Hosts) | Top Classes (Entity Count) | Total File Size | Quad File
    http://schema.org/AdministrativeArea Quads: 15,915,728
    URLs: 206,143
    Hosts: 301
    http://schema.org/City (822,377)
    http://schema.org/AdministrativeArea (501,176)
    http://schema.org/CityHall (162,397)
    http://schema.org/ListItem (161,751)
    http://schema.org/Service (128,219)
    266.3 MB schema_AdministrativeArea.gz (sample)
    http://schema.org/Airport Quads: 11,889,956
    URLs: 164,664
    Hosts: 132
    http://schema.org/Airport (2,846,502)
    http://schema.org/Place (97,194)
    http://schema.org/Flight (65,685)
    http://schema.org/GeoCoordinates (32,079)
    http://schema.org/ListItem (22,230)
    161.2 MB schema_Airport.gz (sample)
    http://schema.org/Book Quads: 355,749,038
    URLs: 8,612,144
    Hosts: 7,402
    http://schema.org/Book (21,437,601)
    http://schema.org/Offer (9,480,071)
    http://schema.org/Person (9,217,519)
    http://schema.org/ListItem (4,224,099)
    http://schema.org/AggregateRating (2,064,487)
    8.3 GB schema_Book.gz (sample)
    http://schema.org/City Quads: 134,336,690
    URLs: 626,881
    Hosts: 850
    http://schema.org/GeoCoordinates (5,486,532)
    http://schema.org/PostalAddress (5,413,186)
    http://schema.org/LocalBusiness (4,915,558)
    http://schema.org/Person (4,504,999)
    http://schema.org/City (3,784,668)
    1.9 GB schema_City.gz (sample)
    http://schema.org/CollegeOrUniversity Quads: 197,217,532
    URLs: 746,745
    Hosts: 664
    http://schema.org/Organization (11,372,563)
    http://schema.org/Person (9,662,369)
    http://schema.org/CollegeOrUniversity (8,759,942)
    http://schema.org/PostalAddress (5,582,453)
    http://schema.org/GeoCoordinates (5,349,424)
    2.9 GB schema_CollegeOrUniversity.gz (sample)
    http://schema.org/Continent Quads: 5,781,203
    URLs: 73,939
    Hosts: 17
    http://schema.org/City (731,339)
    http://schema.org/AdministrativeArea (303,342)
    http://schema.org/Place (99,039)
    http://schema.org/GeoCoordinates (78,425)
    http://schema.org/Country (76,246)
    71.1 MB schema_Continent.gz (sample)
    http://schema.org/Country Quads: 91,785,020
    URLs: 469,394
    Hosts: 804
    http://schema.org/Country (4,445,475)
    http://schema.org/Person (3,258,729)
    http://schema.org/PostalAddress (2,920,998)
    http://schema.org/Rating (2,537,527)
    http://schema.org/Review (2,526,032)
    1.6 GB schema_Country.gz (sample)
    http://schema.org/CreativeWork Quads: 616,442,965
    URLs: 11,851,424
    Hosts: 171,817
    http://schema.org/CreativeWork (24,486,210)
    http://schema.org/Person (20,397,070)
    http://schema.org/Comment (9,554,245)
    http://schema.org/TelevisionChannel (6,818,740)
    http://schema.org/SiteNavigationElement (6,400,561)
    27.2 GB schema_CreativeWork.gz (sample)
    http://schema.org/EducationalOrganization Quads: 9,385,781
    URLs: 214,844
    Hosts: 3,142
    http://schema.org/EducationalOrganization (371,542)
    http://schema.org/Person (284,633)
    http://schema.org/PostalAddress (267,969)
    http://schema.org/Place (135,195)
    http://schema.org/ListItem (79,756)
    172.7 MB schema_EducationalOrganization.gz (sample)
    http://schema.org/Event Quads: 263,504,427
    URLs: 4,565,851
    Hosts: 65,114
    http://schema.org/Event (21,359,059)
    http://schema.org/Place (12,927,463)
    http://schema.org/PostalAddress (7,759,732)
    http://schema.org/Offer (1,374,495)
    http://schema.org/Person (1,122,117)
    5.6 GB schema_Event.gz (sample)
    http://schema.org/GeoCoordinates Quads: 858,240,131
    URLs: 8,799,691
    Hosts: 50,871
    http://schema.org/GeoCoordinates (34,893,781)
    http://schema.org/PostalAddress (32,573,487)
    http://schema.org/LocalBusiness (14,230,960)
    http://schema.org/Place (12,920,066)
    http://schema.org/Offer (11,308,063)
    14 GB schema_GeoCoordinates.gz (sample)
    http://schema.org/GovernmentOrganization Quads: 2,849,188
    URLs: 91,640
    Hosts: 331
    http://schema.org/GovernmentOrganization (136,583)
    http://schema.org/ListItem (82,208)
    http://schema.org/PostalAddress (43,621)
    http://schema.org/WebPage (18,349)
    http://schema.org/Event (14,767)
    64.1 MB schema_GovernmentOrganization.gz (sample)
    http://schema.org/Hospital Quads: 3,277,313
    URLs: 90,644
    Hosts: 361
    http://schema.org/PostalAddress (165,337)
    http://schema.org/Hospital (149,765)
    http://schema.org/ListItem (63,968)
    http://schema.org/Place (58,126)
    http://schema.org/GeoCoordinates (48,088)
    57.9 MB schema_Hospital.gz (sample)
    http://schema.org/Hotel Quads: 161,254,476
    URLs: 5,594,793
    Hosts: 7,494
    http://schema.org/Hotel (10,302,877)
    http://schema.org/PostalAddress (4,919,871)
    http://schema.org/ListItem (4,756,155)
    http://schema.org/AggregateRating (3,811,134)
    http://schema.org/ImageObject (1,392,820)
    3.5 GB schema_Hotel.gz (sample)
    http://schema.org/JobPosting Quads: 266,933,002
    URLs: 2,295,403
    Hosts: 7,023
    http://schema.org/JobPosting (23,597,716)
    http://schema.org/Place (16,792,907)
    http://schema.org/PostalAddress (12,561,300)
    http://schema.org/Organization (5,982,465)
    http://schema.org/ListItem (1,903,144)
    6.2 GB schema_JobPosting.gz (sample)
    http://schema.org/LakeBodyOfWater Quads: 90,108
    URLs: 633
    Hosts: 21
    http://schema.org/GeoCoordinates (3,674)
    http://schema.org/PostalAddress (3,648)
    http://schema.org/LakeBodyOfWater (1,697)
    http://schema.org/PropertyValue (1,514)
    http://schema.org/City (1,269)
    1.5 MB schema_LakeBodyOfWater.gz (sample)
    http://schema.org/LandmarksOrHistoricalBuildings Quads: 2,208,810
    URLs: 115,230
    Hosts: 194
    http://schema.org/LandmarksOrHistoricalBuildings (149,824)
    http://schema.org/PostalAddress (128,601)
    http://schema.org/GeoCoordinates (32,760)
    http://schema.org/ImageObject (30,982)
    http://schema.org/Offer (18,081)
    45.3 MB schema_LandmarksOrHistoricalBuildings.gz (sample)
    http://schema.org/Language Quads: 7,327,523
    URLs: 95,606
    Hosts: 454
    http://schema.org/SiteNavigationElement (194,315)
    http://schema.org/Language (124,807)
    http://schema.org/PostalAddress (110,926)
    http://schema.org/GeoCoordinates (107,459)
    http://schema.org/Organization (69,782)
    158.7 MB schema_Language.gz (sample)
    http://schema.org/Library Quads: 794,835
    URLs: 16,471
    Hosts: 188
    http://schema.org/Library (31,396)
    http://schema.org/PostalAddress (27,365)
    http://schema.org/Organization (16,693)
    http://schema.org/Offer (10,098)
    http://schema.org/SiteNavigationElement (8,449)
    12.9 MB schema_Library.gz (sample)
    http://schema.org/LocalBusiness Quads: 1,144,571,235
    URLs: 20,364,380
    Hosts: 230,844
    http://schema.org/LocalBusiness (66,826,984)
    http://schema.org/PostalAddress (52,820,628)
    http://schema.org/Person (32,456,801)
    http://schema.org/GeoCoordinates (13,414,986)
    http://schema.org/ListItem (11,676,565)
    19.9 GB schema_LocalBusiness.gz (sample)
    http://schema.org/Mountain Quads: 536,575
    URLs: 15,901
    Hosts: 38
    http://schema.org/Mountain (31,219)
    http://schema.org/GeoCoordinates (19,691)
    http://schema.org/PostalAddress (4,716)
    http://schema.org/Review (3,321)
    http://schema.org/City (575)
    8.5 MB schema_Mountain.gz (sample)
    http://schema.org/Movie Quads: 164,589,867
    URLs: 3,719,152
    Hosts: 6,010
    http://schema.org/Person (12,496,472)
    http://schema.org/Movie (8,499,513)
    http://schema.org/AggregateRating (2,600,319)
    http://schema.org/ImageObject (898,326)
    http://schema.org/Organization (716,639)
    3.6 GB schema_Movie.gz (sample)
    http://schema.org/Museum Quads: 2,057,104
    URLs: 33,132
    Hosts: 234
    http://schema.org/GeoCoordinates (61,202)
    http://schema.org/Review (57,210)
    http://schema.org/Museum (54,193)
    http://schema.org/PostalAddress (48,488)
    http://schema.org/AggregateRating (26,501)
    43.4 MB schema_Museum.gz (sample)
    http://schema.org/MusicAlbum Quads: 94,295,777
    URLs: 1,194,310
    Hosts: 7,992
    http://schema.org/MusicRecording (8,843,652)
    http://schema.org/Country (3,074,299)
    http://schema.org/MusicAlbum (2,768,809)
    http://schema.org/ListItem (1,579,226)
    http://schema.org/MusicGroup (1,434,431)
    1.4 GB schema_MusicAlbum.gz (sample)
    http://schema.org/MusicRecording Quads: 145,978,352
    URLs: 2,175,324
    Hosts: 4,749
    http://schema.org/MusicRecording (16,436,479)
    http://schema.org/Country (3,731,689)
    http://schema.org/MusicGroup (2,191,117)
    http://schema.org/ListItem (1,840,130)
    http://schema.org/MusicAlbum (1,592,186)
    2.3 GB schema_MusicRecording.gz (sample)
    http://schema.org/Organization Quads: 839,872,521
    URLs: 68,187,331
    Hosts: 352,669
    http://schema.org/Organization (148,575,835)
    http://schema.org/Offer (53,476,589)
    http://schema.org/Product (48,834,225)
    http://schema.org/PostalAddress (48,200,710)
    http://schema.org/ListItem (38,635,887)
    87.1 GB schema_Organization.gz (sample)
    http://schema.org/Painting Quads: 2,656,918
    URLs: 70,899
    Hosts: 210
    http://schema.org/Painting (282,229)
    http://schema.org/Person (109,642)
    http://schema.org/Offer (33,257)
    http://schema.org/Comment (28,458)
    http://schema.org/UserComments (18,940)
    67.6 MB schema_Painting.gz (sample)
    http://schema.org/Park Quads: 491,176
    URLs: 5,350
    Hosts: 103
    http://schema.org/GeoCoordinates (29,150)
    http://schema.org/PostalAddress (15,272)
    http://schema.org/Park (14,867)
    http://schema.org/Museum (5,481)
    http://schema.org/City (3,890)
    9 MB schema_Park.gz (sample)
    http://schema.org/Person Quads: 1,494,835,164
    URLs: 41,801,740
    Hosts: 289,232
    http://schema.org/Person (205,659,159)
    http://schema.org/ImageObject (30,957,223)
    http://schema.org/Comment (27,154,959)
    http://schema.org/Organization (25,225,064)
    http://schema.org/Article (20,419,037)
    84.2 GB schema_Person.gz (sample)
    http://schema.org/Place Quads: 888,169,505
    URLs: 10,411,701
    Hosts: 78,010
    http://schema.org/Place (52,139,212)
    http://schema.org/PostalAddress (37,829,439)
    http://schema.org/JobPosting (16,693,597)
    http://schema.org/Event (13,393,794)
    http://schema.org/Offer (12,674,341)
    18 GB schema_Place.gz (sample)
    http://schema.org/Product Quads: 6,321,909,578
    URLs: 112,695,547
    Hosts: 581,482
    http://schema.org/Product (444,760,713)
    http://schema.org/Offer (365,577,281)
    http://schema.org/AggregateRating (46,793,199)
    http://schema.org/Organization (32,839,969)
    http://schema.org/Review (23,361,605)
    135 GB schema_Product.gz (sample)
    http://schema.org/RadioStation Quads: 1,314,016
    URLs: 96,153
    Hosts: 138
    http://schema.org/RadioStation (109,761)
    http://schema.org/PostalAddress (34,133)
    http://schema.org/ImageObject (13,622)
    http://schema.org/MusicVideoObject (13,552)
    http://schema.org/VideoObject (13,528)
    29 MB schema_RadioStation.gz (sample)
    http://schema.org/Recipe Quads: 108,908,798
    URLs: 2,757,523
    Hosts: 25,111
    http://schema.org/Recipe (4,415,586)
    http://schema.org/Person (1,141,445)
    http://schema.org/ListItem (1,126,462)
    http://schema.org/AggregateRating (1,091,320)
    http://schema.org/NutritionInformation (503,663)
    3.4 GB schema_Recipe.gz (sample)
    http://schema.org/Restaurant Quads: 82,228,482
    URLs: 677,878
    Hosts: 11,979
    http://schema.org/Review (4,294,997)
    http://schema.org/Rating (4,137,198)
    http://schema.org/Person (3,888,557)
    http://schema.org/Restaurant (1,676,161)
    http://schema.org/Product (1,150,037)
    1.7 GB schema_Restaurant.gz (sample)
    http://schema.org/RiverBodyOfWater Quads: 71,740
    URLs: 763
    Hosts: 13
    http://schema.org/GeoCoordinates (3,301)
    http://schema.org/PostalAddress (3,150)
    http://schema.org/RiverBodyOfWater (1,513)
    http://schema.org/NewsArticle (998)
    http://schema.org/City (546)
    1.1 MB schema_RiverBodyOfWater.gz (sample)
    http://schema.org/School Quads: 2,601,454
    URLs: 78,274
    Hosts: 365
    http://schema.org/School (201,657)
    http://schema.org/PostalAddress (66,513)
    http://schema.org/Review (52,890)
    http://schema.org/Rating (51,050)
    http://schema.org/Person (43,465)
    65.3 MB schema_School.gz (sample)
    http://schema.org/ShoppingCenter Quads: 3,136,544
    URLs: 30,079
    Hosts: 151
    http://schema.org/PostalAddress (157,508)
    http://schema.org/ShoppingCenter (120,387)
    http://schema.org/Product (112,094)
    http://schema.org/Offer (111,799)
    http://schema.org/ClothingStore (64,668)
    49.6 MB schema_ShoppingCenter.gz (sample)
    http://schema.org/SkiResort Quads: 347,463
    URLs: 35,869
    Hosts: 54
    http://schema.org/SkiResort (37,568)
    http://schema.org/AggregateRating (23,652)
    http://schema.org/Review (5,215)
    http://schema.org/Person (3,613)
    http://schema.org/Rating (1,618)
    8.5 MB schema_SkiResort.gz (sample)
    http://schema.org/SportsEvent Quads: 43,627,478
    URLs: 400,719
    Hosts: 1,571
    http://schema.org/SportsEvent (2,651,794)
    http://schema.org/Place (1,884,042)
    http://schema.org/SportsTeam (1,506,391)
    http://schema.org/PostalAddress (1,160,975)
    http://schema.org/SiteNavigationElement (353,692)
    660.4 MB schema_SportsEvent.gz (sample)
    http://schema.org/SportsTeam Quads: 25,217,786
    URLs: 306,609
    Hosts: 1,085
    http://schema.org/SportsTeam (2,078,054)
    http://schema.org/Person (818,624)
    http://schema.org/SportsEvent (813,244)
    http://schema.org/Place (587,103)
    http://schema.org/SportsMatchCompetitor (397,568)
    396.6 MB schema_SportsTeam.gz (sample)
    http://schema.org/StadiumOrArena Quads: 2,530,400
    URLs: 21,937
    Hosts: 96
    http://schema.org/Person (240,227)
    http://schema.org/StadiumOrArena (64,721)
    http://schema.org/PostalAddress (62,983)
    http://schema.org/SportsTeam (45,687)
    http://schema.org/SportsEvent (30,982)
    38 MB schema_StadiumOrArena.gz (sample)
    http://schema.org/TVEpisode Quads: 49,090,967
    URLs: 756,921
    Hosts: 508
    http://schema.org/TVEpisode (3,812,335)
    http://schema.org/Person (1,365,567)
    http://schema.org/TVSeries (817,184)
    http://schema.org/SiteNavigationElement (776,574)
    http://schema.org/TVSeason (470,736)
    889.9 MB schema_TVEpisode.gz (sample)
    http://schema.org/TelevisionStation Quads: 679,578
    URLs: 26,765
    Hosts: 31
    http://schema.org/TelevisionStation (37,016)
    http://schema.org/PostalAddress (20,412)
    http://schema.org/Article (19,783)
    http://schema.org/AggregateRating (19,308)
    http://schema.org/GeoCoordinates (1,031)
    9.6 MB schema_TelevisionStation.gz (sample)
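
The per-subset figures listed above (quads, URLs, hosts, and the most frequent classes) could be recomputed from a downloaded subset file roughly as sketched below. This is an illustration under stated assumptions, not the procedure used to produce the published numbers: it assumes that "Hosts" means distinct hostnames of the page URLs, that entity counts correspond to the number of `rdf:type` quads per class, and that each file is a gzip-compressed N-Quads text file. The function names `subset_statistics` and `read_subset` and the simplified regular expression are hypothetical.

```python
import gzip
import re
from collections import Counter
from urllib.parse import urlsplit

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

# Simplified N-Quads pattern (assumption): subject, predicate, object,
# then the page URL as the last <...> before the terminating dot.
QUAD_RE = re.compile(
    r'^(<[^>]*>|_:\S+)\s+(<[^>]*>)\s+(.+?)\s+<([^>]*)>\s*\.\s*$'
)


def read_subset(path):
    """Yield the lines of a gzip-compressed N-Quads file one by one."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        yield from handle


def subset_statistics(lines, top_n=5):
    """Count quads, distinct page URLs, distinct hosts, and typed entities."""
    quads = 0
    urls = set()
    hosts = set()
    classes = Counter()
    for line in lines:
        match = QUAD_RE.match(line)
        if match is None:
            continue  # skip lines the simplified pattern cannot read
        _subject, predicate, obj, page_url = match.groups()
        quads += 1
        if page_url not in urls:
            urls.add(page_url)
            try:
                host = urlsplit(page_url).hostname
            except ValueError:
                host = None  # malformed URL, ignore for host counting
            if host:
                hosts.add(host)
        # Assumption: one rdf:type quad corresponds to one entity.
        if predicate == RDF_TYPE and obj.startswith("<"):
            classes[obj[1:-1]] += 1
    return {
        "quads": quads,
        "urls": len(urls),
        "hosts": len(hosts),
        "top_classes": classes.most_common(top_n),
    }
```

With a subset stored locally, a call such as `subset_statistics(read_subset("schema_Mountain.gz"))` would process the file as a stream. The sets of URLs and hosts are kept in memory, so the largest subsets may need a disk-backed or approximate counting approach instead.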

    If you are interested in a class or set of classes that is not listed above, please contact the Web Data Commons team via the mailing list or our Google Group.

    Extended Analysis

    We analyzed the adoption of important properties for the classes schema.org/Product, schema.org/JobPosting, schema.org/Hotel, and schema.org/LocalBusiness over a period of three years (2015-2017). In general, we observe that an increasing number of websites use structured data to describe content in these four domains. You can find the detailed statistics in the schema.org_SubsetsAnalysis Excel file (33 KB).
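
As a rough illustration of how property adoption could be measured on a single subset, the sketch below counts the number of distinct hosts that use each property. Measuring adoption by distinct hosts is an assumption made for this example; the methodology behind the Excel statistics is not described on this page. The function name `hosts_per_property` and the simplified regular expression are hypothetical.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Simplified N-Quads pattern (assumption): subject, predicate, object,
# then the page URL as the last <...> before the terminating dot.
QUAD_RE = re.compile(
    r'^(<[^>]*>|_:\S+)\s+(<[^>]*>)\s+(.+?)\s+<([^>]*)>\s*\.\s*$'
)


def hosts_per_property(lines):
    """Map each property IRI to the number of distinct hosts using it."""
    hosts_by_property = defaultdict(set)
    for line in lines:
        match = QUAD_RE.match(line)
        if match is None:
            continue  # skip lines the simplified pattern cannot read
        _subject, predicate, _obj, page_url = match.groups()
        try:
            host = urlsplit(page_url).hostname
        except ValueError:
            continue  # malformed page URL
        if host:
            hosts_by_property[predicate[1:-1]].add(host)
    return {prop: len(hosts) for prop, hosts in hosts_by_property.items()}
```

Running the same function on the corresponding subsets of several yearly corpora and comparing the resulting counts would give a simple view of how the use of a property changes over time.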

    Get the Code

    The source code can be checked out from our Subversion repository. The November 2017 extraction was performed with version 1.0.5 of the extractor. For more information about the framework and a detailed description of how to run your own extraction, visit the framework page.

    Get Support

    Please send questions and feedback to the Web Data Commons mailing list or post them in our Web Data Commons Google Group.