Web Data Commons

The Web Data Commons project extracts structured data from the Common Crawl, the largest web corpus available to the public, and provides the extracted data for public download in order to support researchers and companies in exploiting the wealth of information that is available on the Web.

News

2025-01-10: We have released the WDC RDFa, Microdata, Microformat, and Embedded JSON-LD data sets extracted from the October 2024 Common Crawl corpus and created multiple schema.org class-specific subsets.
2024-02-01: We have released the WDC Schema.org Table Corpus 2023 which contains ~5M tables and is based on the October 2023 WDC schema.org extraction.
2024-01-08: We have released the WDC RDFa, Microdata, Microformat, and Embedded JSON-LD data sets extracted from the October 2023 Common Crawl corpus and created multiple schema.org class-specific subsets.
2023-06-22: We have released WDC Block, a benchmark for comparing the performance of blocking methods. WDC Block features a maximal Cartesian product of 200 billion pairs of product offers which were extracted from 3,259 e-shops.
2023-01-25: We have released the WDC RDFa, Microdata, Microformat, and Embedded JSON-LD data sets extracted from the October 2022 Common Crawl corpus and created multiple schema.org class-specific subsets.
2022-12-22: We have released the WDC Products benchmark for fine-grained evaluation of the performance of entity matching methods along three dimensions.
2022-09-22: We have released the WDC Schema.org Table Annotation Benchmark for evaluating the performance of methods for annotating columns of Web tables with terms from the Schema.org vocabulary.
2022-01-04: We have released the WDC RDFa, Microdata, Microformat, and Embedded JSON-LD data sets extracted from the October 2021 Common Crawl corpus and created multiple schema.org class-specific subsets.
2021-09-10: We have released the WDC Product Data Corpus V.2020, extracted from the December 2020 WDC schema.org Product and Offer subsets.
2021-03-29: We have released the WDC Schema.org Table Corpus, which was created by grouping the December 2020 schema.org class-specific subsets into relational tables.
2021-03-22: The paper Improving Hierarchical Product Classification using Domain-specific Language Modelling has been accepted at the Knowledge Management in e-Commerce workshop held in conjunction with The Web Conference 2021.
2021-01-21: We have released the WDC RDFa, Microdata, Microformat, and Embedded JSON-LD data sets extracted from the September 2020 Common Crawl corpus.
2020-08-24: The paper Intermediate Training of BERT for Product Matching using Version 2.0 of the WDC Product Data Corpus and Gold Standard for Large-Scale Product Matching has been accepted at the DI2KG workshop held in conjunction with VLDB2020.
2020-07-01: We will present the paper Using schema.org Annotations for Training and Maintaining Product Matchers using Version 2.0 of the WDC Product Data Corpus and Gold Standard for Large-Scale Product Matching at the WIMS2020 conference.
2020-03-19: The CfP for the Semantic Web Challenge@ISWC2020 "Mining the Web of HTML-embedded Product Data" has been announced. The WDC Product Data Corpus and Gold Standard V2.0 will be used as training and evaluation resources for the Product Matching task.
2020-01-13: We have released the WDC RDFa, Microdata, Microformat, and Embedded JSON-LD data sets extracted from the November 2019 Common Crawl corpus.
2019-10-23: Version 2.0 of the WDC Product Data Corpus and Gold Standard for Large-Scale Product Matching released.
2019-07-19: We have released the Web Tables for Long-Tail Entity Extraction (T4LTE) dataset, the first gold standard for the task of long-tail entity extraction from web tables.
2019-07-19: We have released the Time-Dependent Ground Truth (TDGT), a dataset covering time-dependent data from various domains.
2019-05-15: Journal Article about Using the Semantic Web as a Source of Training Data has been published by the Datenbank-Spektrum Journal.
2019-02-27: Paper about The WDC training dataset and gold standard for large-scale product matching accepted at ECNLP Workshop at WWW2019 conference in San Francisco.
2019-01-17: We have released a new version of the RDFa, Microdata, Microformat, and Embedded JSON-LD data corpus extracted from the November 2018 Common Crawl corpus.
2018-12-20: We have released the WDC Training Dataset and Gold Standard for Large-Scale Product Matching.
2018-01-08: We have released a new version of the RDFa, Microdata, Microformat, and Embedded JSON-LD data corpus extracted from the November 2017 Common Crawl corpus.
2017-06-26: We have released the Web Data Integration Framework (WInte.r), which provides parsers and methods for the integration of Web Tables.
2017-01-17: We have released a new version of the RDFa, Microdata, Microformat, and Embedded JSON-LD data corpus extracted from the October 2016 Common Crawl corpus.
2016-09-01: We have released a Gold Standard for Product Matching and Product Feature Extraction. The gold standards are accompanied by an 11.2 million product data corpus crawled in the first quarter of 2016.
2016-04-25: We have released a new version of the RDFa, Microdata, Microformat, and Embedded JSON-LD data corpus extracted from the November 2015 Common Crawl corpus. This corpus for the first time also includes JSON-LD data.
2016-04-13: We have released a web-scale "IsA" database containing over 400 million hypernymy relations extracted from the text of HTML pages.
2015-12-15: Paper about Profiling the Potential of Web Tables for Augmenting Cross-domain Knowledge Bases has been accepted at the WWW'16 conference in Montréal, Canada.
2015-12-15: Anthelion, a focused crawler for structured data, has been released as a Yahoo open source project. The code and a more comprehensive description are available at the Yahoo GitHub repository (Yahoo tumblr posting).
2015-11-19: WDC Web Table Corpus 2015 released consisting of 233 million Web tables extracted from the July 2015 Common Crawl.
2015-08-13: Journal Article about The Graph Structure in the Web - Analyzed on Different Aggregation Levels has been published by the Journal of Web Science.
2015-04-02: RDFa, Microdata, and Microformat data sets extracted from the December 2014 Common Crawl corpus available for download.
2015-04-01: T2D Gold Standard for comparing matching systems on the task of finding correspondences between Web tables and large-scale knowledge bases released.
2015-03-30: Paper about Heuristics for Fixing Common Errors in Deployed schema.org Microdata accepted at ESWC2015 conference in Portoroz, Slovenia.
2014-12-04: We have created several Class-Specific Subsets of the Schema.org Data contained in the Winter 2013 Microdata Corpus (e.g. schema.org/Product and schema.org/Offer) in order to make it easier to work with the data.
2014-08-27: We have released an easy to customize version of the WDC Extraction Framework including a tutorial, which explains the usage and customization in detail. See also our guest post at the Common Crawl Blog.
2014-08-13: Hyperlink Graph Dataset covering 1.7 billion web pages extracted from the April 2014 Common Crawl corpus available for download.
2014-07-06: Paper about the WebDataCommons Microdata, RDFa and Microformat Dataset Series accepted at ISWC'14 conference in Riva del Garda - Trentino, Italy: The WebDataCommons Microdata, RDFa and Microformat Dataset Series
2014-04-14: Paper about WDC Pay-Level Domain Graph accepted at WebSci'14 conference in Bloomington, USA: Graph Structure in the Web - Aggregated by Pay-Level Domain
2014-04-01: RDFa, Microdata, and Microformat data sets extracted from the Winter 2013 Common Crawl corpus available for download.
2014-03-05: Initial release of the WDC Web Tables data set consisting of 147 million relational Web tables.
2014-02-12: First open ranking of the World Wide Web is now available. The ranking is based on the WDC Hyperlink Graph.
2014-02-04: Paper about WDC Hyperlink Graph accepted at WWW2014 conference (Web Science Track) in Seoul: Graph Structure in the Web - Revisited
2014-01-20: Paper about the integration of product data from the WDC Microdata data set accepted at the DEOS2014 workshop at the WWW2014 conference in Seoul: Integrating Product Data from Websites offering Microdata Markup
2013-11-12: Web Data Commons releases large Hyperlink Graph covering 3.5 billion web pages and 128 billion hyperlinks between these pages.
2013-09-02: Paper about the WDC RDFa, Microdata, and Microformat data set accepted at the ISWC2013 conference in Sydney: Deployment of RDFa, Microdata, and Microformats on the Web -- A Quantitative Analysis.
2013-07-12: New analysis available about the types of products that are offered by e-shops using Microdata markup.
2013-07-05: Yahoo! Research releases Glimmer search engine which enables you to search Web Data Commons data. Details.
2012-12-10: RDFa, Microdata, and Microformat data sets extracted from the August 2012 Common Crawl corpus available for download.
2012-06-29: We have created a new analysis on vocabulary usage in our Microdata and RDFa data set.
2012-06-20: Presentation of the Web Data Commons project and our data extraction framework at the AWS Summit 2012 Berlin - Slides.
2012-04-16: Paper on Web Data Commons presented at the LDOW 2012 Workshop (References)
2012-03-22: RDFa, Microdata, and Microformat data sets extracted from the February 2012 Common Crawl corpus available for download.
2012-03-13: RDFa, Microdata, and Microformat data sets extracted from the 2009/2010 Common Crawl corpus available for download.

Available Data Sets

RDFa, Microdata, and Microformat

More and more websites have started to embed structured data describing products, people, organizations, places, and events into their HTML pages using markup standards such as RDFa, Microdata and Microformats. The Web Data Commons project extracts this data from several billion web pages. So far the project provides six different data set releases extracted from the Common Crawl 2016, 2015, 2014, 2013, 2012 and 2010. The project provides the extracted data for download and publishes statistics about the deployment of the different formats.
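To illustrate what "embedded structured data" looks like in practice, the following is a minimal sketch that pulls JSON-LD blocks (one of the embedded formats covered by the more recent releases listed in the news) out of a single HTML page. It is not the WDC extraction framework: the class name JsonLdExtractor and the function extract_jsonld are hypothetical, and only the Python standard library is used.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        # The type attribute may be missing or valueless, hence the `or ""`.
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # malformed markup is skipped in this sketch


def extract_jsonld(html):
    """Return a list of parsed JSON-LD objects found in an HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    return parser.items
```

Calling extract_jsonld on the HTML of a product page would return the parsed objects, for example a dictionary with "@type" set to "Product". RDFa, Microdata, and Microformats are spread across element attributes and need a dedicated parser, which this sketch does not attempt.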

Web Tables

The Web contains vast amounts of HTML tables. Most of these tables are used for layout purposes, but a fraction of the tables is also quasi-relational, meaning that they contain structured data describing a set of entities, and are thus useful in application contexts such as data search, table augmentation, knowledge base construction, and for various NLP tasks. The WDC Web Tables data set consists of the 147 million relational Web tables that are contained in the overall set of 11 billion HTML tables found in the Common Crawl.

Hyperlink Graph

We offer a large hyperlink graph that we extracted from the 2012 version of the Common Crawl. The WDC Hyperlink Graph covers 3.5 billion web pages and 128 billion hyperlinks between these pages. The graph can help researchers to improve search algorithms, develop spam detection methods and evaluate graph analysis algorithms. To the best of our knowledge, the graph is the largest hyperlink graph that is available to the public.

WebIsA Database

We offer a large IsA database that we extracted from the 2015 version of the Common Crawl. The WDC IsA Database contains more than 400 million hypernymy relations we extracted from the text of HTML pages included in the crawl. This collection of relations represents a rich source of knowledge and can be used to improve approaches in various application domains. We offer the tuple dataset for public download and an application programming interface to help other researchers programmatically query the database. In addition a demo web application of the database is available.
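As a rough illustration of what a hypernymy ("IsA") relation extracted from text can look like, the sketch below applies a single lexical pattern ("X such as A, B and C") to a sentence. This is an assumption made for illustration only: the description above does not state which patterns or methods were used to build the WDC IsA Database, and the function name extract_isa is hypothetical.

```python
import re

# One illustrative pattern: "<hypernym> such as <hyponym>, <hyponym> and <hyponym>".
# Only single-word terms are handled in this sketch.
SUCH_AS = re.compile(r"\b(\w+) such as ((?:\w+, )*\w+(?:,? (?:and|or) \w+)?)")


def extract_isa(text):
    """Return (hyponym, hypernym) tuples found via the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(text):
        hypernym = match.group(1)
        # Split the enumeration on ", ", " and ", " or " (with optional comma).
        hyponyms = re.split(r",? (?:and|or) |, ", match.group(2))
        pairs.extend((hyponym, hypernym) for hyponym in hyponyms if hyponym)
    return pairs
```

For the sentence "Fruits such as apples, bananas and cherries are healthy." this yields the tuples (apples, Fruits), (bananas, Fruits), and (cherries, Fruits). A web-scale extraction would need many more patterns, multi-word terms, and filtering of noisy matches.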

Product Data Corpora

We offer two product data corpora containing offers from multiple e-shops. The first corpus consists of 5.6 million product offers from the categories mobile phones, headphones and televisions and was crawled from 32 popular shopping websites. The corpus is accompanied by a manually verified gold standard for the evaluation and comparison of product feature extraction and product matching methods. The second corpus consists of more than 26 million product offers originating from 79 thousand websites. The offers are grouped into 16 million clusters of offers referring to the same product using product identifiers, such as GTINs or MPNs.
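The idea of grouping offers into clusters via shared product identifiers can be sketched with a small union-find structure. This is a simplified illustration under stated assumptions, not the procedure used to build the corpus: the function name cluster_offers, the offer IDs, and the "gtin:" / "mpn:" prefixes are all hypothetical, and identifier cleaning or validation is left out.

```python
from collections import defaultdict


def cluster_offers(offers):
    """Group offers that share at least one identifier, directly or transitively.

    `offers` maps an offer ID to a set of identifier strings,
    e.g. {"gtin:0001", "mpn:AB-1"} (hypothetical format).
    Returns a list of sets of offer IDs.
    """
    parent = {offer_id: offer_id for offer_id in offers}

    def find(x):
        # Follow parent links to the cluster root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Link every offer to the first offer seen with the same identifier.
    first_seen = {}
    for offer_id, identifiers in offers.items():
        for identifier in identifiers:
            if identifier in first_seen:
                union(first_seen[identifier], offer_id)
            else:
                first_seen[identifier] = offer_id

    clusters = defaultdict(set)
    for offer_id in offers:
        clusters[find(offer_id)].add(offer_id)
    return list(clusters.values())
```

With this approach, two offers end up in the same cluster even if they share no identifier directly, as long as a chain of other offers connects them (for example, one offer carrying both a GTIN and an MPN). Whether such transitive merging is desirable depends on how noisy the identifiers are.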

Available Software

Extraction Framework

The effective processing of large web corpora presents challenges in terms of resources, time and costs. In order to extract the data sets presented above, the Web Data Commons project has developed a framework which provides an easy-to-use basis for the distributed processing of large web crawls using Amazon EC2 cloud services. The framework is published under the terms of the Apache license and can easily be customized to perform other data extraction tasks.

License

The Web Data Commons extraction framework can be used under the terms of the Apache Software License.

Feedback

Please send questions and feedback to the Web Data Commons mailing list or post them in our Web Data Commons Google Group.

About the Web Data Commons Project

The Web Data Commons project was started by researchers from Freie Universität Berlin and the Karlsruhe Institute of Technology (KIT) in 2012. The goal of the project is to facilitate research and support companies in exploiting the wealth of information on the Web by extracting structured data from web crawls and providing this data for public download. Today the WDC Project is mainly maintained by the Data and Web Science Research Group at the University of Mannheim. The project is coordinated by Christian Bizer, who has moved from Berlin to Mannheim.

Credits

Web Data Commons is supported by the EU FP7 projects PlanetData and LOD2, by an Amazon Web Services in Education Grant Award, by the German Research Foundation (DFG) and by the ViCe research project of the Ministry of Economy, Research and Arts of Baden-Württemberg.