The Web Data Commons project extracts structured data from the Common Crawl, the largest web corpus available to the public, and provides the extracted data for public download in order to support researchers and companies in exploiting the wealth of information that is available on the Web.
- 2014-08-27: We have released an easy-to-customize version of the WDC Extraction Framework, including a tutorial that explains its usage and customization in detail. See also our guest post at the Common Crawl Blog.
- 2014-08-13: Hyperlink Graph Dataset covering 1.7 billion web pages extracted from the April 2014 Common Crawl corpus available for download.
- 2014-07-06: Paper about the WebDataCommons Microdata, RDFa and Microformat Dataset Series accepted at the ISWC'14 conference in Riva del Garda - Trentino, Italy: The WebDataCommons Microdata, RDFa and Microformat Dataset Series
- 2014-04-14: Paper about WDC Pay-Level Domain Graph accepted at WebSci'14 conference in Bloomington, USA: Graph Structure in the Web - Aggregated by Pay-Level Domain
- 2014-04-01: RDFa, Microdata, and Microformat data sets extracted from the Winter 2013 Common Crawl corpus available for download.
- 2014-03-05: Initial release of the WDC Web Tables data set consisting of 147 million relational Web tables.
- 2014-02-12: First open ranking of the World Wide Web is now available. The ranking is based on the WDC Hyperlink Graph.
- 2014-02-04: Paper about WDC Hyperlink Graph accepted at WWW2014 conference (Web Science Track) in Seoul: Graph Structure in the Web - Revisited
- 2014-01-20: Paper about the integration of product data from the WDC Microdata data set accepted at the DEOS2014 workshop at the WWW2014 conference in Seoul: Integrating Product Data from Websites offering Microdata Markup
- 2013-11-12: Web Data Commons releases large Hyperlink Graph covering 3.5 billion web pages and 128 billion hyperlinks between these pages.
- 2013-09-02: Paper about the WDC RDFa, Microdata, and Microformat data set accepted at the ISWC2013 conference in Sydney: Deployment of RDFa, Microdata, and Microformats on the Web -- A Quantitative Analysis.
- 2013-07-12: New analysis available about the types of products that are offered by e-shops using Microdata markup.
- 2013-07-05: Yahoo! Research releases Glimmer search engine which enables you to search Web Data Commons data. Details.
- 2012-12-10: RDFa, Microdata, and Microformat data sets extracted from the August 2012 Common Crawl corpus available for download.
- 2012-06-29: We have created a new analysis on vocabulary usage in our Microdata and RDFa data set.
- 2012-06-20: Presentation of the Web Data Commons project and our data extraction framework at the AWS Summit 2012 Berlin - Slides.
- 2012-04-16: Paper on Web Data Commons presented at the LDOW 2012 Workshop (References)
- 2012-03-22: RDFa, Microdata, and Microformat data sets extracted from the February 2012 Common Crawl corpus available for download.
- 2012-03-13: RDFa, Microdata, and Microformat data sets extracted from the 2009/2010 Common Crawl corpus available for download.
Available Data Sets
More and more websites have started to embed structured data describing products, people, organizations, places, and events into their HTML pages using markup standards such as RDFa, Microdata and Microformats. The Web Data Commons project extracts this data from several billion web pages. So far the project provides three different data sets, extracted from the 2013, 2012 and 2010 Common Crawl corpora. The project provides the extracted data for download and publishes statistics about the deployment of the different formats.
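To illustrate what embedded structured data looks like, the following minimal sketch pulls Microdata annotations out of a small HTML snippet using only the Python standard library. It is not the WDC extraction framework. The class name `MicrodataSketch`, the function `extract_microdata` and the sample page are hypothetical, and the sketch assumes that each property element contains only text or properly closed child elements (no void elements such as `<br>` or `<meta>`, and nested items are flattened to their text).

```python
from html.parser import HTMLParser


class MicrodataSketch(HTMLParser):
    """Hypothetical helper: collects itemtype URLs and (itemprop, text) pairs."""

    def __init__(self):
        super().__init__()
        self.item_types = []   # values of all itemtype attributes seen
        self.properties = []   # (itemprop name, text content) pairs
        self._prop = None      # itemprop currently being read, if any
        self._depth = 0        # open elements inside the current property
        self._text = []        # text chunks of the current property

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemtype" in attrs:
            self.item_types.append(attrs["itemtype"])
        if self._prop is not None:
            # Child element inside a property: only track nesting depth.
            self._depth += 1
        elif "itemprop" in attrs:
            self._prop, self._depth, self._text = attrs["itemprop"], 1, []

    def handle_data(self, data):
        if self._prop is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._prop is None:
            return
        self._depth -= 1
        if self._depth == 0:
            # The property element itself has been closed.
            self.properties.append((self._prop, "".join(self._text).strip()))
            self._prop = None


def extract_microdata(html):
    """Return (item_types, properties) found in the given HTML string."""
    parser = MicrodataSketch()
    parser.feed(html)
    parser.close()
    return parser.item_types, parser.properties


# Hypothetical sample page fragment with schema.org product markup.
SAMPLE_HTML = """
<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Espresso Machine</span>
  <span itemprop="price">199.00</span>
</div>
"""

if __name__ == "__main__":
    types, props = extract_microdata(SAMPLE_HTML)
    print(types)
    print(props)
```

A real extractor also has to handle RDFa and Microformats, attribute-valued properties such as `content` and `href`, and malformed HTML, all of which this sketch ignores.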
The Web contains vast amounts of HTML tables. Most of these tables are used for layout purposes, but a fraction of the tables is also quasi-relational, meaning that they contain structured data describing a set of entities, and are thus useful in application contexts such as data search, table augmentation, knowledge base construction, and for various NLP tasks. The WDC Web Tables data set consists of the 147 million relational Web tables that are contained in the overall set of 11 billion HTML tables found in the Common Crawl.
We offer a large hyperlink graph that we extracted from the 2012 version of the Common Crawl. The WDC Hyperlink Graph covers 3.5 billion web pages and 128 billion hyperlinks between these pages. The graph can help researchers to improve search algorithms, develop spam detection methods and evaluate graph analysis algorithms. To the best of our knowledge, the graph is the largest hyperlink graph that is available to the public.
The effective processing of large web corpora presents challenges in terms of resources, time and costs. In order to extract the data sets presented above, the Web Data Commons project has developed a framework which provides an easy-to-use basis for the distributed processing of large web crawls using Amazon EC2 cloud services. The framework is published under the terms of the Apache license and can easily be customized to perform other data extraction tasks as well.
The Web Data Commons extraction framework can be used under the terms of the Apache Software License.
About Web Data Commons Project
The Web Data Commons project was started by researchers from Freie Universität Berlin and the Karlsruhe Institute of Technology (KIT) in 2012. The goal of the project is to facilitate research and support companies in exploiting the wealth of information on the Web by extracting structured data from web crawls and providing this data for public download. Today the WDC Project is mainly maintained by the Data and Web Science Research Group at the University of Mannheim. The project is coordinated by Christian Bizer, who has moved from Berlin to Mannheim.