Cleaning schema.org Microdata
Robert Meusel
Dominique Ritze
Heiko Paulheim

More and more websites embed structured data describing products, people, organizations, places, and events into their HTML pages using markup standards such as RDFa, Microdata, and Microformats.
The Web Data Commons project extracts this data from several billion web pages. The project provides the extracted data for download and publishes statistics about the deployment of the different formats.
Like other datasets retrieved from the Web, RDF datasets extracted from such markup contain duplicates, and, due to the open nature of the Web, compliance with the given schemas is not guaranteed.
For further details about the statistical distribution of both issue types, as well as some indications of the reasons for their occurrence, see the papers listed under Further References below.

Contents

1. Data Cleansing
2. Source Code
3. Data
4. Further References

1. Data Cleansing

To overcome the two major issues identified in RDF data extracted from HTML markup, namely duplicates and schema violations, we propose a simple but effective pipeline to clean the data and improve its quality.
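
The following is a conceptual sketch of these two cleaning operations on GZIP-compressed N-Quads files, not the project's actual implementation; the file names are hypothetical, and the actual Data Cleaner applies more elaborate heuristics (for example, duplicates caused by differing blank node identifiers are not caught by purely syntactic deduplication).

    # Conceptual sketch only; file names are hypothetical.

    # Drop exact duplicate quads (purely syntactic deduplication):
    zcat md-part-*.gz | sort -u | gzip > deduplicated.nq.gz

    # Keep only quads that reference the schema.org namespace,
    # a crude form of schema-compliance filtering:
    zcat deduplicated.nq.gz | grep -F '<http://schema.org/' | gzip > schemaorg-only.nq.gz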

2. Source Code

The code to perform the described steps can be found in the SVN repository of the WebDataCommons project: Data Cleaner.

The project can be compiled using Maven and executed via its command-line interface.

It is important to note that the current code is designed to be executed on a single machine and does not scale out across multiple machines.
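
As a rough outline, checking out, building, and running the tool might look as follows. The repository URL, jar name, and options below are placeholders, not the actual ones; please consult the project's documentation.

    # Check out the Data Cleaner (repository URL is a placeholder):
    svn checkout <svn-url-of-the-data-cleaner> datacleaner
    cd datacleaner

    # Build an executable jar with Maven:
    mvn clean package

    # Run via the command-line interface (jar name and options are
    # placeholders):
    java -jar target/datacleaner.jar <options>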

3. Data

We have already applied the described methodology to the WDC 2014 Microdata Corpus, retaining only schema.org-related quads. This data was also used to perform the experiments described in Robert Meusel, Dominique Ritze, Heiko Paulheim: Towards More Accurate Statistical Profiling of Deployed schema.org Microdata (submitted to JDIQ). We therefore stick to the nomenclature (S1 to S5) for the different intermediate corpora used in this publication, which is also visualized in the following diagram:

[Figure: The cleaning pipeline with the intermediate corpora S1 to S5.]

The data of the different steps within the process is provided for download as N-Quads. The data generated in each step is split into different files, each compressed using GZIP and no larger than 100 MB.
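
For illustration, a single quad in these files could look like the following hypothetical example: a (blank node) subject, a predicate, an object, and, as the fourth element, the URL of the page the statement was extracted from.

    _:node0 <http://schema.org/name> "Example Product" <http://example.com/page.html> .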

The files of one step can be downloaded using wget, e.g. with the command wget -i http://webdatacommons.org/structureddata/2014-12/files/md-schemaorg-cleaned-s1.list to download all files for S1. The file lists containing the quads for a specific step can be found in the table below, together with more detailed statistics about the sizes.
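
For example, to download all files of one step into an empty directory and obtain a rough quad count (each line of an uncompressed file is one quad):

    # Download all part files listed for S1:
    wget -i http://webdatacommons.org/structureddata/2014-12/files/md-schemaorg-cleaned-s1.list

    # Count the quads across all downloaded parts (assuming the
    # directory contains only the downloaded GZIP files):
    zcat *.gz | wc -l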

Step   Total File Size   File List
S1     193 GB            md-schemaorg-cleaned-s1.list
S2     112 GB            md-schemaorg-cleaned-s2.list
S3     116 GB            md-schemaorg-cleaned-s3.list
S4     115 GB            md-schemaorg-cleaned-s4.list
S5     110 GB            md-schemaorg-cleaned-s5.list

4. Further References