Extracting Schema.org Data from the Common Crawl
Christian Bizer
Alexander Brinkmann
Anna Primpeli

Millions of websites have started to annotate structured data within their HTML pages using the schema.org vocabulary. Popular types of entities that are annotated with schema.org terms are products, local businesses, events, job postings, reviews, and questions and answers. The Web Data Commons project has been extracting schema.org data from the Common Crawl every year since 2013 and offers the extracted data for public download in the form of the schema.org data set series.

1. About the Schema.org Data Set Series

The Web Data Commons project has been extracting schema.org data from the Common Crawl every year since 2013 and offers the extracted data for public download in the form of the schema.org data set series. From a Web Science perspective, this data set series lays the foundation for analyzing the adoption of schema.org annotations on the Web over the past decade. From a machine learning perspective, the annotations provide a large pool of training data for tasks such as product matching, product or job categorization, information extraction, and question answering.

Webmasters primarily use the JSON-LD and Microdata formats to annotate structured data with schema.org. To create class-specific subsets, the extracted JSON-LD and Microdata corpora of the Web Data Commons data sets are merged for a selection of schema.org classes. Each subset contains all entities of a particular class together with entities of other classes that are found on the same pages. For example, a page containing data about a product might also contain reviews and offers for this product; a page containing data about an event might also contain data about the location of the event and the persons involved in the event. The data is represented in N-Quads format, meaning that the fourth element of each quad contains the URL of the webpage from which the data was extracted.

The diagrams below on the left-hand side show the growth over the last decade in the number of websites (pay-level domains) that annotate specific entity types using JSON-LD and Microdata, respectively. The diagrams on the right-hand side show the growth in the number of entities of these classes that are contained in the data sets.
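To illustrate the quad structure described above, the following minimal Python sketch splits a single N-Quads line into subject, predicate, object, and the URL of the source page. The regular expression covers only a simplified subset of the N-Quads grammar, and the function name `parse_quad` and the sample line are assumptions made for illustration; this is not the extraction code used by the project.

```python
import re

# Simplified pattern for one RDF term: IRI, blank node, or literal with an
# optional language tag or datatype. This is not a complete N-Quads grammar.
TERM = r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)'
QUAD_RE = re.compile(rf'^\s*{TERM}\s+{TERM}\s+{TERM}\s+(<[^>]*>)\s*\.\s*$')


def parse_quad(line):
    """Return (subject, predicate, object, page_url), or None if the line does not match."""
    match = QUAD_RE.match(line)
    if match is None:
        return None
    subject, predicate, obj, graph = match.groups()
    # The fourth element is the IRI of the page the statement was extracted from.
    return subject, predicate, obj, graph.strip("<>")


# Hypothetical sample line, not taken from an actual release.
sample = '_:b0 <http://schema.org/name> "Example Product"@en <https://example.com/p/1> .'
quad = parse_quad(sample)
print(quad[3])
```

For production use, a full RDF parser is likely a safer choice than a regular expression, in particular because the download files use a variation of the N-Quads format.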

2. Releases

The following table gives an overview of the releases of the schema.org data set series.
For detailed statistics as well as for downloading a release, please click on the release date.

Release PLDs Quads Size
2023-12 13,328,045 138,607,560,867 2,457 GB
2022-12 12,838,680 105,579,841,305 1,831 GB
2021-12 12,897,837 94,445,547,365 1,691 GB
2020-12 12,414,851 90,416,246,904 1,339 GB
2019-12 9,470,865 25,366,269,937 542 GB
2018-12 7,226,784 20,135,070,334 469 GB
2017-12 5,144,983 14,424,702,204 431 GB
2016-10 3,846,805 26,719,043,004 707 GB
2015-11 1,370,192 12,910,049,866 267 GB
2014-12 700,000 11,067,728,953 262 GB
2013-11 400,000 10,395,924,438 227 GB

We publish the corpora for research purposes only. The December 2020 and December 2023 releases of the WDC schema.org data set series are also available as table corpora, which have been created by grouping the data into separate tables for each class/host combination, i.e. all records of a specific class extracted from a specific website are put into a single table. The resulting corpora consist of 4.2 million relational tables (2020) and 5 million relational tables (2023), which are available for download in a JSON format that can be read by the pandas library.
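The grouping step behind the table corpora can be sketched as follows. The record layout (the keys `class`, `page_url`, and `name`) and the function name `group_into_tables` are assumptions made for this illustration; the actual pipeline and file layout may differ.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_into_tables(records):
    """Group entity records into one table per (class, host) combination."""
    tables = defaultdict(list)
    for record in records:
        # The host is derived from the URL of the page the record came from.
        host = urlsplit(record["page_url"]).hostname
        tables[(record["class"], host)].append(record)
    return dict(tables)


# Hypothetical records, not taken from an actual release.
records = [
    {"class": "Product", "page_url": "https://shop.example.com/p/1", "name": "Widget"},
    {"class": "Product", "page_url": "https://shop.example.com/p/2", "name": "Gadget"},
    {"class": "Event", "page_url": "https://shop.example.com/launch", "name": "Launch"},
    {"class": "Product", "page_url": "https://other.example.org/x", "name": "Thing"},
]
tables = group_into_tables(records)
for (cls, host), rows in tables.items():
    print(cls, host, len(rows))
```

Each resulting list corresponds to one relational table: all records of one class from one host.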

We provide the extracted data for download using a variation of the N-Quads format. For users who prefer other formats, we provide code for converting the download files into CSV and JSON formats, which are supported by a wide range of spreadsheet applications, relational databases, and data mining frameworks such as the Python data analysis library pandas. Please find further details on how to convert the download files to other formats on the main page.
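As a rough illustration of what such a conversion involves, the sketch below pivots already-parsed quads into one CSV row per entity. It is not the conversion code provided by the project. The function name `quads_to_csv`, the tuple layout, the column naming, and the choice to join repeated values with " | " are all assumptions made for this example.

```python
import csv
import io
from collections import defaultdict


def quads_to_csv(quads, out):
    """Write one CSV row per entity from (subject, predicate, object, page_url) tuples."""
    rows = defaultdict(dict)
    columns = []
    for subject, predicate, obj, page_url in quads:
        # Assumption: subject labels are treated as local to the page they came from.
        key = (page_url, subject)
        # Use the last path segment of the predicate IRI as the column name.
        column = predicate.rstrip("/").rsplit("/", 1)[-1]
        if column not in columns:
            columns.append(column)
        # Join repeated properties so that no value is silently dropped.
        previous = rows[key].get(column)
        rows[key][column] = obj if previous is None else f"{previous} | {obj}"
    writer = csv.DictWriter(out, fieldnames=["page_url", "subject"] + columns)
    writer.writeheader()
    for (page_url, subject), values in rows.items():
        # Columns missing for an entity are left empty by DictWriter.
        writer.writerow({"page_url": page_url, "subject": subject, **values})


# Hypothetical quads, not taken from an actual release.
quads = [
    ("_:b0", "http://schema.org/name", "Widget", "https://example.com/p/1"),
    ("_:b0", "http://schema.org/sku", "W-1", "https://example.com/p/1"),
    ("_:b1", "http://schema.org/name", "Gadget", "https://example.com/p/2"),
]
buffer = io.StringIO()
quads_to_csv(quads, buffer)
print(buffer.getvalue())
```

The resulting CSV text can be loaded into a spreadsheet application, a relational database, or pandas.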

3. Feedback

Please post questions and feedback in our Web Data Commons Google Group.

More information about Web Data Commons is found here.

4. References