Cleaning schema.org Microdata
Robert Meusel
Dominique Ritze
Heiko Paulheim

More and more websites embed structured data describing products, people, organizations, places, and events into their HTML pages using markup standards such as RDFa, Microdata, and Microformats.
The Web Data Commons project extracts this data from several billion web pages. The project provides the extracted data for download and publishes statistics about the deployment of the different formats.
Like other datasets retrieved from the Web, RDF datasets extracted from such markup contain duplicates, and, due to the open nature of the Web, compliance with the given schemas is not guaranteed.
For further details about the statistical distribution of both issue types, as well as some indications of the reasons for their occurrence, see the papers listed under Further References below.

Contents

1. Data Cleansing
2. Source Code
3. Data
4. Further References

1. Data Cleansing

To overcome the two major issues identified in RDF data extracted from HTML markup, namely duplicates and schema violations, we propose a simple but effective pipeline to clean the data and improve its quality.
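
The following is a conceptual sketch of these two cleaning operations on GZIP-compressed N-Quads files, not the project's actual implementation; the file names are hypothetical, and the actual Data Cleaner applies more elaborate heuristics (for example, duplicates caused by differing blank node identifiers are not caught by purely syntactic deduplication).

    # Conceptual sketch only; file names are hypothetical.

    # Drop exact duplicate quads (purely syntactic deduplication):
    zcat md-part-*.gz | sort -u | gzip > deduplicated.nq.gz

    # Keep only quads that reference the schema.org namespace,
    # a crude form of schema-compliance filtering:
    zcat deduplicated.nq.gz | grep -F '<http://schema.org/' | gzip > schemaorg-only.nq.gz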

2. Source Code

The code to perform the described steps can be found in the SVN repository of the WebDataCommons project: Data Cleaner.

The project can be compiled using Maven and executed via its command-line interface.

It is important to note that the current code is designed to be executed on a single machine and does not scale out across multiple machines.
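
As a rough outline, checking out, building, and running the tool might look as follows. The repository URL, jar name, and options below are placeholders, not the actual ones; please consult the project's documentation.

    # Check out the Data Cleaner (repository URL is a placeholder):
    svn checkout <svn-url-of-the-data-cleaner> datacleaner
    cd datacleaner

    # Build an executable jar with Maven:
    mvn clean package

    # Run via the command-line interface (jar name and options are
    # placeholders):
    java -jar target/datacleaner.jar <options>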

3. Data

We have already applied the described methodology to the WDC 2014 Microdata Corpus, retaining only schema.org-related quads. This data was also used to perform the experiments described in Robert Meusel, Dominique Ritze, Heiko Paulheim: Towards More Accurate Statistical Profiling of Deployed schema.org Microdata (submitted to JDIQ). We therefore stick to the nomenclature (S1 to S5) for the different intermediate corpora used in this publication, which is also visualized in the following diagram:

[Figure: The cleaning pipeline with the intermediate corpora S1 to S5.]

The data of the different steps within the process is provided for download as N-Quads. The data generated in each step is split into different files, each compressed using GZIP and no larger than 100 MB.
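
For illustration, a single quad in these files could look like the following hypothetical example: a (blank node) subject, a predicate, an object, and, as the fourth element, the URL of the page the statement was extracted from.

    _:node0 <http://schema.org/name> "Example Product" <http://example.com/page.html> .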

The files of one step can be downloaded using wget, e.g. with the command wget -i http://webdatacommons.org/structureddata/2014-12/files/md-schemaorg-cleaned-s1.list to download all files for S1. The file lists containing the quads for a specific step can be found in the table below, together with more detailed statistics about the sizes.
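
For example, to download all files of one step into an empty directory and obtain a rough quad count (each line of an uncompressed file is one quad):

    # Download all part files listed for S1:
    wget -i http://webdatacommons.org/structureddata/2014-12/files/md-schemaorg-cleaned-s1.list

    # Count the quads across all downloaded parts (assuming the
    # directory contains only the downloaded GZIP files):
    zcat *.gz | wc -l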

Step   Total File Size   File List
S1     193 GB            md-schemaorg-cleaned-s1.list
S2     112 GB            md-schemaorg-cleaned-s2.list
S3     116 GB            md-schemaorg-cleaned-s3.list
S4     115 GB            md-schemaorg-cleaned-s4.list
S5     110 GB            md-schemaorg-cleaned-s5.list

4. Further References