Download the Extracted RDF Data
The extracted RDFa, Microdata and Microformat data is provided for download as N-Quads. Files are compressed using GZIP and each file is around 100 MB in size. Overall, 1,416 files with a total size of 101 GB are provided.
The extracted RDF data can be downloaded using wget with the command
wget -i http://webdatacommons.org/structureddata/2012-08/stats/files.list
In order to make it easier to find data from a specific website or a specific top-level-domain, the N-Quads within the files are ordered by top-level-domain and pay-level-domain, and index files are provided. The indexes consist of tab-separated values with the following structure:
tld pld quad-file-name first-line last-line. The first column contains the top-level-domain, the second column the pay-level-domain, and the third column the name of the quad-file that contains the data extracted from that pay-level-domain. The fourth and fifth columns specify the first and last line in the file that contain data belonging to the pay-level-domain.
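Given that index layout, locating the data for one pay-level-domain amounts to scanning the index for its row and then copying the indicated line range out of the quad file. A minimal Python sketch (function names are illustrative, not part of the dataset):

```python
def find_pld_ranges(index_lines, pld):
    """Scan index rows (tld, pld, quad-file, first-line, last-line)
    and return (quad_file, first_line, last_line) tuples for one PLD."""
    ranges = []
    for line in index_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 5 and cols[1] == pld:
            ranges.append((cols[2], int(cols[3]), int(cols[4])))
    return ranges

def extract_lines(fileobj, first, last):
    """Yield lines first..last (1-based, inclusive) from a text stream."""
    for n, line in enumerate(fileobj, start=1):
        if n > last:
            break
        if n >= first:
            yield line
```

Since both the index and quad files are GZIP-compressed, open them with `gzip.open(path, "rt", encoding="utf-8")` rather than a plain `open`.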
| Format | Number of Files | Number of Quads | Indexfile |
| --- | --- | --- | --- |
| html-rdfa | 256 | 1,079,175,202 | html-rdfa.nq.index.gz (~7 MB) |
| html-microdata | 401 | 1,488,063,426 | html-microdata.nq.index.gz (~2 MB) |
| html-mf-geo | 6 | 32,722,603 | html-mf-geo.nq.index.gz (~0.6 MB) |
| html-mf-hcalendar | 75 | 142,975,309 | html-mf-hcalendar.nq.index.gz (~0.5 MB) |
| html-mf-hcard | 501 | 3,547,824,107 | html-mf-hcard.nq.index.gz (~20 MB) |
| html-mf-hrecipe | 10 | 50,898,293 | html-mf-hrecipe.nq.index.gz (~37 KB) |
| html-mf-hlisting | 15 | 97,711,757 | html-mf-hlisting.nq.index.gz (~60 KB) |
| html-mf-hresume | 1 | 678,097 | html-mf-hresume.nq.index.gz (~18 KB) |
| html-mf-hreview | 65 | 207,589,518 | html-mf-hreview.nq.index.gz (~300 KB) |
| html-mf-species | 1 | 127,568 | html-mf-species.nq.index.gz (~1 KB) |
| html-mf-xfn | 75 | 703,188,115 | html-mf-xfn.nq.index.gz (~6 MB) |
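Each line of the downloaded files is one N-Quad: subject, predicate, object, and as fourth element the URL of the page the triple was extracted from. A naive split (assuming the context IRI is always the last whitespace-separated token; literals with unusual content call for a real N-Quads parser) can look like this:

```python
def split_quad(line):
    """Naively split one N-Quads line into (subject, predicate, object, context).
    Assumption: the context IRI is the final token before the trailing ' .'.
    Use a proper N-Quads parser for production work."""
    line = line.rstrip("\n")
    if line.endswith(" ."):
        line = line[:-2]
    subj, pred, rest = line.split(" ", 2)
    obj, _, ctx = rest.rpartition(" ")
    return subj, pred, obj, ctx
```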
Download the Website-Class-Property Matrixes
In order to make it easy for third parties to investigate the usage of different vocabularies and to generate seed-lists for focused crawling endeavors, we have generated a Website-Class-Property matrix for each format. The matrixes show which vocabulary term (class/property) is used by which website. The matrixes are provided as ARFF files. Within the files, each website is represented by a single line containing the pay-level-domain name as well as binary values indicating whether the website uses a specific vocabulary term or not. If a website uses a class or property, the value is 1, and 0 otherwise. SampleMatrix.arff shows the structure of the website-class-property matrixes. The matrixes cover all classes that are used by at least 5 different websites as well as all properties that are used by at least 10 different websites. For processing reasons we replaced each property/class identifier by its rank in the PLD list (e.g. schema.org/Product is replaced by type-4, as it is the fourth most used Microdata class based on PLDs). The ranking can be found in the Co-Occurrence Matrix of the corresponding format.
RDFa Matrix (12 MB - arff)
Microdata Matrix (4.4 MB - arff)
Microformats Matrix (29 MB - arff)
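A small ARFF reader is enough to answer questions like "which pay-level-domains use a given term". The sketch below assumes the plain comma-separated @data layout shown in SampleMatrix.arff, with the pay-level-domain as the first attribute; the attribute names in the test data are made-up examples of the rank encoding described above:

```python
def parse_arff(lines):
    """Minimal ARFF reader: returns (attribute_names, data_rows).
    Handles only the plain (non-sparse) comma-separated @data section."""
    attrs, rows, in_data = [], [], False
    for line in lines:
        line = line.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        low = line.lower()
        if low.startswith("@attribute"):
            attrs.append(line.split()[1])
        elif low.startswith("@data"):
            in_data = True
        elif in_data:
            rows.append(line.split(","))
    return attrs, rows

def plds_using(attrs, rows, term):
    """Pay-level-domains (assumed to be the first column) with value 1 for `term`."""
    i = attrs.index(term)
    return [row[0] for row in rows if row[i] == "1"]
```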
Download the Class-Property-Co-occurrence Matrixes
For each class, we have generated a vector indicating how many websites use specific properties together with this class. These vectors allow you to get an impression of the richness of the published data. The vectors are provided as Microsoft Excel files. The files cover all classes that are used by at least 5 websites and all properties that are used by at least 10 websites.
RDFa Co-Occurrence Matrix (0.5 MB - xlsx)
Microdata Co-Occurrence Matrix (1.7 MB - xlsx)
Microformats Co-Occurrence Matrix (32 KB - xlsx)
Download the Raw Extraction Statistics
In addition to the aggregated statistics described above, we also provide the raw extraction statistics, which indicate how many triples were found in each HTML page of the Common Crawl. The raw statistics also enable you to locate the Common Crawl .arc file that contains a specific HTML page. Be warned that the raw statistics files are rather large; we advise using a parser that is able to skip invalid lines, since such lines may be present in the files. The files contain the following tab-separated columns (not in this order):
Source Data Columns
uri - The URL of the crawled page
hostIp - The IP address of the server the page was crawled from
mimeType - The MIME type of the page as communicated by the web server
timestamp - Time and date when the page was crawled, as a UTC UNIX timestamp
recordLength - Size of the HTML content in bytes
arcFileName - Name of the Common Crawl archive file containing the page
arcFilePos - Byte offset of the page inside the archive file
Result Data Columns
detectedMimeType - MIME type as detected by the extractor
html-* - Number of triples found on the page, one column per extractor identifier
totalTriples - Total number of triples found on this HTML page
referencedData - The format whose regular expression matched the HTML page first; corresponds to the extractor's pre-processing step
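Following the advice above about invalid lines, a reader for the raw statistics can simply discard any row that does not have the expected number of tab-separated columns. A minimal sketch (the expected column count and any header handling depend on the actual files and are assumptions here):

```python
def read_stats(lines, expected_columns):
    """Yield rows from a raw-statistics TSV stream, silently skipping
    lines that do not split into the expected number of columns."""
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) == expected_columns:
            yield cols
```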
Get the Code
The source code can be checked out from our Subversion repository. The extraction of August 2012 was done with version 0.0.1 of the extractor. After checking out the source code, create your own configuration by copying src/main/resources/ccrdf.properties and filling in your AWS authentication information and bucket names. Compilation is performed using Maven, so changing into the source root directory and typing mvn install should be sufficient to create a build. In order to run the extractor on more than 10 EC2 nodes, you will have to request an EC2 instance limit increase for your AWS account. More information about running the extractor is provided in the file
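The exact property keys are defined in the checked-out ccrdf.properties template; the key names below are hypothetical placeholders that only illustrate the kind of values to fill in:

```properties
# Hypothetical key names -- consult your copy of
# src/main/resources/ccrdf.properties for the real ones.
aws.accessKey = YOUR_AWS_ACCESS_KEY
aws.secretKey = YOUR_AWS_SECRET_KEY
aws.inputBucket = your-input-bucket
aws.outputBucket = your-output-bucket
```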