Web Data Commons - 2009/2010 Corpus - Download Instructions

This file contains the detailed report on the 2009/2010 extraction of the Web Data Commons project.

All of our data and the code we used are available for download.

Download the Extracted RDF Data

The extracted structured data is provided for download in the N-Quads RDF encoding and divided according to the format the data was encoded in. Files are compressed using GZIP and split after reaching a size of 100MB. Overall, 410 files with a total size of 40GB were produced.
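Because the files are GZIP-compressed N-Quads, they can be read as a stream without unpacking them first. The following is a minimal sketch, not a full N-Quads parser: it assumes that each statement sits on one line, ends with " ." and carries a graph label as its fourth term. The function name iter_quads is hypothetical; lines that do not fit this shape are skipped.

```python
import gzip
from typing import Iterator, Tuple


def iter_quads(path: str) -> Iterator[Tuple[str, str, str, str]]:
    """Yield (subject, predicate, object, graph) from a gzipped N-Quads file.

    Assumption: one statement per line, terminated by '.', with a graph
    label as the last term. This is a simplification of the N-Quads grammar.
    """
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # blank line or comment
            if not line.endswith("."):
                continue  # incomplete statement
            body = line[:-1].rstrip()
            try:
                # The first two terms never contain whitespace.
                subject, predicate, rest = body.split(None, 2)
                # The graph label is the last whitespace-separated term,
                # so an object literal containing spaces stays intact.
                obj, graph = rest.rsplit(None, 1)
            except ValueError:
                continue  # fewer than four terms
            yield subject, predicate, obj, graph
```

Reading line by line keeps memory use flat, which matters when working through several hundred files of up to 100MB each.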

List of download URLs for RDF from the 2009/2010 corpus (Example Content)

The extracted RDF data can be downloaded using wget with the command

    wget -i http://webdatacommons.org/downloads/2010-09/nquads/files.list
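Where wget is not available, the same download can be approximated with the Python standard library. This is a sketch under the assumption that files.list contains one URL per line, which is the input format wget -i reads. The helper names parse_url_list, local_name and download_all are hypothetical.

```python
import os
import urllib.request
from urllib.parse import urlparse

LIST_URL = "http://webdatacommons.org/downloads/2010-09/nquads/files.list"


def parse_url_list(text):
    """Return the non-empty lines of a URL list, stripped of whitespace."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def local_name(url):
    """Derive a local file name from the last path segment of the URL."""
    return os.path.basename(urlparse(url).path)


def download_all(list_url=LIST_URL, target_dir="."):
    """Fetch the URL list, then download every file it names."""
    with urllib.request.urlopen(list_url) as response:
        urls = parse_url_list(response.read().decode("utf-8"))
    os.makedirs(target_dir, exist_ok=True)
    for url in urls:
        destination = os.path.join(target_dir, local_name(url))
        if os.path.exists(destination):
            continue  # finished in an earlier run
        # Download to a temporary name first so that an interrupted
        # transfer is never mistaken for a complete file.
        partial = destination + ".part"
        urllib.request.urlretrieve(url, partial)
        os.replace(partial, destination)
```

Calling download_all(target_dir="nquads") would place the files in a directory named nquads. Since the complete data set is about 40GB, it is worth checking the available disk space beforehand.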

Download the Generated CSV Tables

The extracted microformat data are also available for download as CSV tables. The SPARQL queries used for generating the CSV tables are available as well.

hCalendar.csv (~461MB), query: HCalendar.rq
hCard.csv (~14GB), query: hCard.rq
Geo.csv (~208MB), query: Geo.rq
hListing.csv (~1GB), query: hListing.rq
hResume.csv (~227MB), query: hResume.rq
hReview.csv (1.4GB), query: hReview.rq
hRecipe.csv (~2MB), query: hRecipe.rq
Species.csv (~1MB), query: species.rq
XFN.csv (~390MB), query: XFN.rq

Download the Extraction Statistics

The detailed extraction statistics give a general overview of the URLs that use structured data and make it possible to link back to the Common Crawl .arc files. They record the amount of structured data found for each URL in the crawl data. Use a parser that is able to skip invalid lines, since such lines may be present in the tab-separated files. The table contains the following columns (not in this order):
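A tolerant reader for the tab-separated statistics could look like the sketch below. It treats a line as invalid when its number of fields differs from an expected count. Both this criterion and the expected_columns parameter are assumptions, and the correct count has to be taken from the actual file. The function name iter_valid_rows is hypothetical.

```python
def iter_valid_rows(lines, expected_columns):
    """Yield the fields of every well-formed tab-separated line.

    Assumption: a line is valid exactly when it has `expected_columns`
    tab-separated fields. All other lines are skipped silently.
    """
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != expected_columns:
            continue  # malformed line, skip it
        yield fields
```

The function accepts any iterable of text lines, for example an open file handle, so that a large statistics file is processed one line at a time instead of being loaded into memory.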

Source Data Columns

Result Data Columns

Sample Extraction Statistic File (csv)
Extraction Statistic File (csv)