Extracting the Hyperlink Graph from the Common Web Crawl
Robert Meusel
Oliver Lehmberg
Christian Bizer
Sebastiano Vigna


This page provides detailed download instructions for obtaining the hyperlink graph extracted from the Common Crawl 2014 web corpus. The graph covers 1.7 billion web pages and 64 billion hyperlinks between these pages.
We also provide basic statistics about the hyperlink graph. Please visit the overview page for more information about the provided file formats.

1. Index/Arc Format

The following table contains the links for downloading the index of the graph for each aggregation level.

Data Set      Index File
Page Graph    see below (20 GB)
Host Graph    download (309 MB)
PLD Graph     download (161 MB)
If you experience any problems downloading the files, please find additional information here.

Downloading the page graph: Because of their size, the index and arc files of the page graph are split into smaller files of around 400 MB each. The index files can be downloaded using wget -i http://webdatacommons.org/hyperlinkgraph/2014-04/data/index.list.txt and the arc files using wget -i http://webdatacommons.org/hyperlinkgraph/2014-04/data/arc.list.txt.
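Once downloaded, the index and arc files can be joined to turn numeric arcs back into readable names. The following is a minimal sketch only: it assumes that each index line has the form "name<TAB>id" and each arc line the form "sourceId<TAB>targetId", which this page does not state (see the overview page for the authoritative format description). The class name, method names and sample values are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ArcFormatSketch {

    /** Parses index lines of the ASSUMED form "name<TAB>id" into an id -> name map. */
    static Map<Long, String> parseIndex(List<String> lines) {
        Map<Long, String> names = new HashMap<>();
        for (String line : lines) {
            if (line.isEmpty()) {
                continue;
            }
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                throw new IllegalArgumentException("not an index line: " + line);
            }
            long id = Long.parseLong(line.substring(tab + 1).trim());
            names.put(id, line.substring(0, tab));
        }
        return names;
    }

    /** Parses arc lines of the ASSUMED form "sourceId<TAB>targetId". */
    static List<long[]> parseArcs(List<String> lines) {
        List<long[]> arcs = new ArrayList<>();
        for (String line : lines) {
            if (line.isEmpty()) {
                continue;
            }
            String[] parts = line.split("\t");
            arcs.add(new long[] {
                Long.parseLong(parts[0].trim()),
                Long.parseLong(parts[1].trim())
            });
        }
        return arcs;
    }

    public static void main(String[] args) {
        // Made-up sample data, not taken from the real files.
        List<String> index = List.of("example.com\t0", "example.org\t1");
        List<String> arcs = List.of("0\t1", "1\t0");

        Map<Long, String> names = parseIndex(index);
        for (long[] arc : parseArcs(arcs)) {
            System.out.println(names.get(arc[0]) + " -> " + names.get(arc[1]));
        }
    }
}
```

An in-memory map like this is only realistic for the host and PLD graphs. For the page graph, the files would have to be streamed or the WebGraph files below used instead.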

2. WebGraph Files

This time we mainly provide the graph in the format expected by the WebGraph Framework developed by Sebastiano Vigna. Each graph is represented by three files: .graph, .offsets and .properties. All three are necessary to load the graph into the library.

Data Set      .graph                      .offsets                     .properties
Page Graph    webgraph.graph (20 GB)      webgraph.offsets (2.1 GB)    webgraph.properties (< 2 KB)
Host Graph    hostgraph.graph (326 MB)    hostgraph.offsets (25 MB)    hostgraph.properties (< 2 KB)
PLD Graph     pldgraph.graph (134 MB)     pldgraph.offsets (13 MB)     pldgraph.properties (< 2 KB)
Please have a look at the README file for the md5sums of the files.
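Given the size of the files, it may be worth checking a download against its md5sum before loading it. The sketch below uses only the Java standard library. The class name and the command-line arguments are hypothetical, and the expected checksum has to be copied from the README by hand.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class Md5Check {

    /** Streams the file through MD5 so that multi-gigabyte files are never held in memory. */
    static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("MD5");
        byte[] buffer = new byte[1 << 16];
        try (InputStream in = Files.newInputStream(file)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to a downloaded file
        // args[1]: expected checksum, copied from the README
        String actual = md5Hex(Path.of(args[0]));
        System.out.println(actual.equalsIgnoreCase(args[1]) ? "OK" : "MISMATCH: " + actual);
    }
}
```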
Using the WebGraph Framework, which can be downloaded from Maven Central, these files can be loaded using the following line of code: BVGraph graph = BVGraph.loadMapped(baseName, new ProgressLogger()).
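Before loading a multi-gigabyte graph, the small .properties file can be inspected on its own. The sketch below reads it with java.util.Properties and does not need the WebGraph library. It assumes that the file is a standard Java properties file containing the keys nodes and arcs; that is how WebGraph usually writes it, but this page does not confirm it. The class and method names are hypothetical and the sample values are made up.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Properties;

class GraphPropertiesSketch {

    /** Reads one numeric entry (for example "nodes" or "arcs") from a .properties source. */
    static long readCount(Reader source, String key) throws IOException {
        Properties props = new Properties();
        props.load(source);
        String value = props.getProperty(key);
        if (value == null) {
            throw new IllegalArgumentException("missing key: " + key);
        }
        return Long.parseLong(value.trim());
    }

    public static void main(String[] args) throws IOException {
        // Made-up content; a real run would pass a reader over e.g. hostgraph.properties.
        String sample = "nodes=4\narcs=5\n";
        System.out.println("nodes: " + readCount(new StringReader(sample), "nodes"));
        System.out.println("arcs:  " + readCount(new StringReader(sample), "arcs"));
    }
}
```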

3. Additional Material

3.1. Symmetric and Transposed Graphs

In addition to the original representation of the page graph, we have calculated a symmetric version and a transposed version of the graph. In the symmetric version, every arc (directed link) of the original graph is replaced with an edge (undirected link). In the transposed version, every arc from page 1 to page 2 is replaced by an arc from page 2 to page 1. Both graphs are provided in the WebGraph format:

Data Set                 .graph                     .offsets                     .properties
Symmetric Page Graph     webgraph.graph (26 GB)     webgraph.offsets (2.2 GB)    webgraph.properties (< 2 KB)
Transposed Page Graph    webgraph.graph (8.5 GB)    webgraph.offsets (1.2 GB)    webgraph.properties (< 2 KB)
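The two derived graphs can be illustrated on a tiny arc list. This is a sketch of the definitions only, not the procedure used to build the files above. It represents an undirected edge as a pair of opposite arcs, which is an assumption of this example, and it assumes non-negative node ids. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class TransposeSymmetrizeSketch {

    /** Transposed version: every arc (u, v) becomes (v, u). */
    static List<int[]> transpose(List<int[]> arcs) {
        List<int[]> result = new ArrayList<>();
        for (int[] arc : arcs) {
            result.add(new int[] { arc[1], arc[0] });
        }
        return result;
    }

    /** Symmetric version: every arc is kept and its reverse is added, without duplicates. */
    static List<int[]> symmetrize(List<int[]> arcs) {
        Set<Long> seen = new TreeSet<>();
        for (int[] arc : arcs) {
            seen.add(pack(arc[0], arc[1]));
            seen.add(pack(arc[1], arc[0]));
        }
        List<int[]> result = new ArrayList<>();
        for (long key : seen) {
            result.add(new int[] { (int) (key >>> 32), (int) key });
        }
        return result;
    }

    /** Packs two non-negative node ids into one long so that arcs can be de-duplicated. */
    private static long pack(int source, int target) {
        return ((long) source << 32) | (target & 0xFFFFFFFFL);
    }

    public static void main(String[] args) {
        // Made-up graph with three pages.
        List<int[]> arcs = List.of(new int[] { 0, 1 }, new int[] { 1, 2 }, new int[] { 1, 0 });
        for (int[] arc : transpose(arcs)) {
            System.out.println("transposed: " + arc[0] + " -> " + arc[1]);
        }
        for (int[] arc : symmetrize(arcs)) {
            System.out.println("symmetric:  " + arc[0] + " -> " + arc[1]);
        }
    }
}
```

In this example the arcs 0 -> 1 and 1 -> 0 collapse into a single undirected edge, so the symmetric version holds four arcs (two edges) rather than six.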

4. License

The extracted data is provided according to the same terms of use, disclaimer of warranties and limitation of liabilities that apply to the Common Crawl corpus.

The Web Data Commons extraction framework can be used under the terms of the Apache Software License.

5. Feedback

Please send questions and feedback to the Web Data Commons mailing list or post them in our Web Data Commons Google Group.

More information about Web Data Commons is found here.