The Web Data Commons Schema.org data set series

Millions of websites have started to annotate structured data within their HTML pages using the schema.org vocabulary. Popular types of entities that are annotated with schema.org terms are products, local businesses, events, job postings, reviews, and questions and answers. The Web Data Commons project has been extracting schema.org data from the Common Crawl every year since 2013 and offers the extracted data for public download in the form of the schema.org data set series.

1. About the Schema.org Data Set Series
2. Releases
3. Feedback
4. References

1. About the Schema.org Data Set Series

The Web Data Commons project has been extracting schema.org data from the Common Crawl every year since 2013 and offers the extracted data for public download in the form of the schema.org data set series. From a Web Science perspective, this data set series lays the foundation for analyzing the adoption process of schema.org annotations on the Web over the past decade. From a machine learning perspective, the annotations provide a large pool of training data for tasks such as product matching, product or job categorization, information extraction, or question answering.

Because webmasters primarily use the JSON-LD and Microdata formats to annotate structured data with schema.org, the extracted JSON-LD and Microdata corpora of the Web Data Commons data set are merged for a selection of schema.org classes to create class-specific subsets. The subsets contain all entities of a particular class together with entities of other classes that are found on the same pages. For example, a page containing data about a product might also contain reviews and offers for this product; a page containing data about an event might also contain data about the location of the event and the persons involved in the event. The data is represented in N-Quads format, meaning that the fourth element of each quad contains the URL of the webpage from which the data was extracted.

The diagrams below on the left-hand side show the growth over the last decade in the number of websites (pay-level domains) that annotate specific entity types using JSON-LD and Microdata, respectively. The diagrams on the right-hand side show the growth in the number of entities of these classes that are contained in the data sets.
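The quad layout described above can be illustrated with a small Python sketch. It assumes plain, well-formed N-Quads lines; the download files use a variation of the format, so real files may need a more tolerant parser. The sample lines, the `example.com` URLs, and the helper names `parse_quad` and `types_by_page` are hypothetical and serve only as illustration.

```python
import re
from collections import defaultdict

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

# Simplified pattern for one quad: subject, predicate, object, graph, then " ."
# (assumption: no comments, no exotic escapes in IRIs)
QUAD_RE = re.compile(
    r'^\s*(?P<s><[^>]*>|_:\S+)\s+'
    r'(?P<p><[^>]*>)\s+'
    r'(?P<o><[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)\s+'
    r'(?P<g><[^>]*>)\s*\.\s*$'
)

def parse_quad(line):
    """Return (subject, predicate, object, page_url) or None if the line does not match."""
    m = QUAD_RE.match(line)
    if m is None:
        return None
    # The fourth element is the URL of the page the data was extracted from.
    return m.group("s"), m.group("p"), m.group("o"), m.group("g")[1:-1]

# Hypothetical sample: a product page that also carries an offer.
sample_lines = [
    r'_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Product> <http://example.com/item/1> .',
    r'_:b0 <http://schema.org/name> "Blue \"Classic\" Mug"@en <http://example.com/item/1> .',
    r'_:b1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Offer> <http://example.com/item/1> .',
]

# Collect the schema.org classes found on each page.
types_by_page = defaultdict(set)
for line in sample_lines:
    quad = parse_quad(line)
    if quad is None:
        continue
    subject, predicate, obj, page = quad
    if predicate == RDF_TYPE:
        types_by_page[page].add(obj[1:-1])
```

Grouping by the fourth element is what makes it possible to see which entities of other classes appear on the same page as, say, a product.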

2. Releases

The following table gives an overview of the releases of the schema.org data set series.
For detailed statistics as well as for downloading a release, please click on the release date.

Release	PLDs	Quads	Size
2024-12	15,417,811	136,796,215,819	2,168 GB
2023-12	13,328,045	138,607,560,867	2,457 GB
2022-12	12,838,680	105,579,841,305	1,831 GB
2021-12	12,897,837	94,445,547,365	1,691 GB
2020-12	12,414,851	90,416,246,904	1,339 GB
2019-12	9,470,865	25,366,269,937	542 GB
2018-12	7,226,784	20,135,070,334	469 GB
2017-12	5,144,983	14,424,702,204	431 GB
2016-10	3,846,805	26,719,043,004	707 GB
2015-11	1,370,192	12,910,049,866	267 GB
2014-12	700,000	11,067,728,953	262 GB
2013-11	400,000	10,395,924,438	227 GB

We publish the corpora for research purposes only. The December 2020 release and the December 2023 release of the WDC schema.org data set series are also available as table corpora, which have been created by grouping the data into separate tables for each class/host combination, i.e. all records of a specific class extracted from a specific website are put into a single table. The resulting corpora consist of 4.2 million relational tables (2020) and 5 million relational tables (2023), which are available for download in a JSON format that can be read by the pandas library.
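The class/host grouping behind the table corpora can be sketched as follows. This is an illustration under stated assumptions, not the project's actual pipeline: the record layout, the field names `page` and `class`, the example hosts, and the helper `group_into_tables` are all hypothetical, and the exact JSON layout of the published tables is not shown here.

```python
import json
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical flattened records: one dict per extracted entity.
records = [
    {"page": "http://shop.example.com/p/1", "class": "Product", "name": "Mug"},
    {"page": "http://shop.example.com/p/2", "class": "Product", "name": "Plate"},
    {"page": "http://shop.example.com/p/1", "class": "Offer", "price": "4.99"},
    {"page": "http://events.example.org/e/7", "class": "Event", "name": "Concert"},
]

def group_into_tables(rows):
    """Put all records of one class from one host into a single table (list of rows)."""
    tables = defaultdict(list)
    for row in rows:
        host = urlparse(row["page"]).netloc
        key = (row["class"], host)
        # The class is encoded in the table key, so drop it from the row itself.
        tables[key].append({k: v for k, v in row.items() if k != "class"})
    return dict(tables)

tables = group_into_tables(records)

def to_json_lines(rows):
    """Serialize one table as JSON Lines (one JSON object per line)."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in rows)

product_table = to_json_lines(tables[("Product", "shop.example.com")])
```

A JSON Lines string like `product_table` is one layout that pandas can load with `pandas.read_json(..., lines=True)`; whether the published files use exactly this layout is an assumption.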

We provide the extracted data for download using a variation of the N-Quads format. For users who prefer other formats, we provide code for converting the download files into CSV and JSON formats, which are supported by a wide range of spreadsheet applications, relational databases, and data mining frameworks such as the Python data analysis library pandas. Please find further details on how to convert the download files to other formats on the main page.
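To give a rough idea of what such a conversion involves, the following sketch pivots already-parsed quads into one CSV row per entity. It is not the conversion code provided by the project; the sample quads, the helper names `quads_to_rows` and `rows_to_csv`, and the choice to keep only the first value per property are assumptions made for illustration.

```python
import csv
import io

# Hypothetical parsed quads: (subject, predicate, object, page URL).
quads = [
    ("_:b0", "http://schema.org/name", "Mug", "http://example.com/item/1"),
    ("_:b0", "http://schema.org/sku", "A-1", "http://example.com/item/1"),
    ("_:b1", "http://schema.org/name", "Plate", "http://example.com/item/2"),
]

def quads_to_rows(quads):
    """Build one row per (page, subject) with one column per property."""
    rows = {}
    for subject, predicate, obj, page in quads:
        row = rows.setdefault((page, subject), {"page": page, "node": subject})
        column = predicate.rsplit("/", 1)[-1]  # e.g. "name" from http://schema.org/name
        row.setdefault(column, obj)  # simplification: keep only the first value
    return list(rows.values())

def rows_to_csv(rows):
    """Write the rows to a CSV string; missing properties become empty cells."""
    properties = sorted({key for row in rows for key in row} - {"page", "node"})
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["page", "node"] + properties, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = rows_to_csv(quads_to_rows(quads))
```

Multi-valued properties and nested entities need more care than this sketch takes, which is one reason to prefer the project's own conversion code for real work.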

3. Feedback

Please post questions and feedback in our Web Data Commons Google Group.

More information about Web Data Commons is found here.

4. References

Alexander Brinkmann, Anna Primpeli, Christian Bizer: The Web Data Commons Schema.org Data Set Series. In Companion Proceedings of the ACM Web Conference 2023 (WWW ’23 Companion), Austin, Texas, USA, April 2023.
Robert Meusel, Christian Bizer, Heiko Paulheim: A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time. 5th International Conference on Web Intelligence, Mining and Semantics (WIMS2015), Limassol, Cyprus, July 2015.
