Web Data Commons - 2009/2010 Corpus - Download Instructions

This page provides download instructions for the data that the Web Data Commons project extracted from the 2009/2010 corpus.

All of our data and the code we used are available for download.

Download the Extracted RDF Data

The extracted structured data is provided for download in the N-Quads RDF encoding and divided according to the format the data was encoded in. Files are compressed using GZIP and split after reaching a size of 100MB. Overall, 410 files with a total size of 40GB were produced.

List of download URLs for RDF from the 2009/2010 corpus (Example Content)

The extracted RDF data can be downloaded using wget with the command wget -i http://webdatacommons.org/downloads/2010-09/nquads/files.list
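
Once a file has been downloaded, it can be streamed without unpacking it to disk. The following is a minimal Python sketch, not part of the project's own code. It assumes that every statement carries a graph label (the fourth N-Quads term) and that this label is the URL of the page the data came from. The function name count_quads_per_page is hypothetical. It uses a simple token split rather than a full N-Quads parser, so it is only suitable for rough counts.

```python
import gzip
from collections import Counter


def count_quads_per_page(path):
    """Count quads per graph label in one gzipped N-Quads file.

    Assumption: each statement ends with a graph IRI followed by " .".
    A statement without a graph label whose object is an IRI would be
    miscounted, so this is a rough sketch rather than a real parser.
    """
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            line = line.strip()
            # Skip blank lines, comments, and lines without the closing dot.
            if not line or line.startswith("#") or not line.endswith("."):
                continue
            body = line[:-1].rstrip()
            parts = body.rsplit(None, 1)
            if len(parts) != 2:
                continue
            graph = parts[1]
            # Only an IRI in angle brackets is accepted as a graph label.
            if graph.startswith("<") and graph.endswith(">"):
                counts[graph[1:-1]] += 1
    return counts
```

Calling count_quads_per_page on one of the downloaded .gz files returns a Counter that maps each graph label to the number of quads attributed to it.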

Download the Generated CSV Tables

The extracted microformat data are also available for download as CSV tables. The SPARQL queries used for generating the CSV tables are available as well.

CSV Table	SPARQL Query
hCalendar.csv (~461MB)	HCalendar.rq
hCard.csv (~14GB)	hCard.rq
Geo.csv (~208MB)	Geo.rq
hListing.csv (~1GB)	hListing.rq
hResume.csv (~227MB)	hResume.rq
hReview.csv (1.4GB)	hReview.rq
hRecipe.csv (~2MB)	hRecipe.rq
Species.csv (~1MB)	species.rq
XFN.csv (~390MB)	XFN.rq

Download the Extraction Statistics

The detailed extraction statistics give a general overview of the URLs that use structured data and link each URL back to the Common Crawl .arc files. They record the amount of structured data found for each URL in the crawl data. Use a parser that can skip invalid lines, since the tab-separated files may contain some. The table contains the following columns (not in this order):

Source Data Columns

uri - The URL of the crawled page (e.g. http://www.example.com)
hostIp - The IP address of the computer the page was crawled from (e.g. 192.0.43.10)
mimeType - The MIME type of the page as communicated by the web server (e.g. text/html)
timestamp - Time and date when the page was crawled as UTC UNIX timestamp (e.g. 1331642430)
recordLength - Size of the HTML content in Bytes (e.g. 5831)
arcFileName - Name of the Common Crawl archive file containing the page (e.g. common-crawl/crawl-002/2010/01/06/24/1262851198118_24.arc.gz)
arcFilePos - Byte offset of the page inside the archive file (e.g. 81995525)
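
The arcFileName and arcFilePos columns together can be used to look up the original page in a locally stored archive file. The Python sketch below rests on two assumptions that are not stated in the column descriptions: that arcFilePos is an offset into the compressed .arc.gz file, and that each record is stored as its own gzip member, so decompression can start at that offset. The function name read_arc_record is hypothetical.

```python
import zlib


def read_arc_record(arc_path, offset, chunk_size=64 * 1024):
    """Decompress the single gzip member that starts at `offset`.

    Assumption: `offset` points at the start of a gzip member inside
    the compressed archive file.
    """
    # wbits=31 tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(wbits=31)
    pieces = []
    with open(arc_path, "rb") as handle:
        handle.seek(offset)
        # Stop as soon as the end of this gzip member has been reached.
        while not decompressor.eof:
            chunk = handle.read(chunk_size)
            if not chunk:
                break  # file ended before the member was complete
            pieces.append(decompressor.decompress(chunk))
    return b"".join(pieces)
```

The returned bytes would contain the record as stored in the archive, which may include a record header in front of the page content.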

Result Data Columns

detectedMimeType - MIME type as detected by the extractor (e.g. text/html)
html-* - Number of triples found on the page for each extractor identifier (e.g. 42)
totalTriples - Number of all triples found on this HTML page (e.g. 42)
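
As a rough illustration of a parser that skips invalid lines, the following Python sketch reads a tab-separated statistics file row by row. It assumes that the first line of the file is a header naming the columns listed above. This is an assumption, since the column order is not given here. The function name read_statistics is hypothetical.

```python
def read_statistics(path):
    """Yield one dict per well-formed row of a tab-separated statistics file.

    Assumption: the first line is a header row with the column names.
    """
    with open(path, encoding="utf-8", errors="replace") as handle:
        header = handle.readline().rstrip("\r\n").split("\t")
        for line in handle:
            fields = line.rstrip("\r\n").split("\t")
            # Skip rows whose field count does not match the header.
            if len(fields) != len(header):
                continue
            row = dict(zip(header, fields))
            # Skip rows where totalTriples is missing or not an integer.
            try:
                row["totalTriples"] = int(row["totalTriples"])
            except (KeyError, ValueError):
                continue
            yield row
```

Because the function is a generator, a large statistics file can be processed one row at a time, for example to keep only the URLs whose totalTriples value is greater than zero.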

Sample Extraction Statistic File (csv)
Extraction Statistic File (csv)