Extracting the Hyperlink Graphs from the Common Crawl
Robert Meusel
Oliver Lehmberg
Christian Bizer
Sebastiano Vigna


This page provides two large hyperlink graphs for public download. The graphs have been extracted from the 2012 and 2014 versions of the Common Crawl web corpora. The 2012 graph covers 3.5 billion web pages and 128 billion hyperlinks between these pages. To the best of our knowledge, this is the largest hyperlink graph that is available to the public outside companies such as Google, Yahoo, and Microsoft. The 2014 graph covers 1.7 billion web pages connected by 64 billion hyperlinks. Below we provide instructions on how to download the graphs as well as basic statistics about their topology.

We hope that the graphs will be useful for researchers who develop search and ranking algorithms, spam detection methods, and graph analysis algorithms and libraries, as well as for web scientists studying the link structure of the Web.


1. Levels of Aggregation

We provide the graphs on three different levels of aggregation:

  1. Page-Level Graph - This version of the graph contains all details, with each node representing a single web page (like http://dws.informatik.uni-mannheim.de/en/projects/current-projects/#c13686) and each arc a hyperlink between two pages.
  2. Host Graph - This graph aggregates the page graph by subdomain/host. Each node in the graph represents a specific subdomain/host (like research.dws.uni-mannheim.de) and an arc exists if at least one hyperlink was found between pages that belong to a pair of subdomains/hosts. Note that subdomains/hosts can be of arbitrary depth.
  3. Pay-Level-Domain Graph - Each node represents a pay-level-domain (like uni-mannheim.de). An arc exists if at least one hyperlink was found between pages contained in a pair of pay-level-domains. The sketch below illustrates how a single URL maps to these aggregation levels.
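
As an illustration of these aggregation levels, the following minimal Java sketch derives the host and the pay-level-domain from a page URL. This is not the code that was used to build the graphs; it simply assumes that Guava's InternetDomainName class (which is based on the Public Suffix List) is an acceptable stand-in for pay-level-domain detection.

    import com.google.common.net.InternetDomainName;
    import java.net.URI;

    public class AggregationLevels {
        public static void main(String[] args) throws Exception {
            // A page-level node (any crawled URL works the same way).
            String page = "http://dws.informatik.uni-mannheim.de/en/projects/current-projects/";

            // Host-level node: the full subdomain/host of the URL.
            String host = new URI(page).getHost();   // dws.informatik.uni-mannheim.de

            // Pay-level-domain node: the registrable domain ("public suffix + 1").
            String pld = InternetDomainName.from(host).topPrivateDomain().toString();   // uni-mannheim.de

            System.out.println(page + " -> " + host + " -> " + pld);
        }
    }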

2. Available Web Graphs

So far, we have extracted hyperlink graphs from two releases of the Common Crawl web corpora. The first graph was extracted from a Web corpus gathered in the first half of 2012 and released in August 2012. The second graph was extracted from a crawl gathered in the first quarter of 2014 and released in April 2014.

2.1. Extracted Hyperlink Graph from August 2012 Common Crawl Corpus

The hyperlink graph was extracted from the Web corpus released by the Common Crawl Foundation in August 2012. The Web corpus was gathered using a web crawler that employed a breadth-first-search selection strategy and discovered new links while crawling. The crawl was seeded with a large number of URLs from former crawls performed by the Common Crawl Foundation.
The table below gives an overview of the size of the different aggregation levels of the extracted graph which are provided for download:

Granularity        #Nodes           #Arcs
Page               3,563 million    128,736 million
Host               101 million      2,043 million
Pay-Level-Domain   43 million       623 million

Within the graph over 94% of all pages are connected and the largest strongly connected component consists of over 50% of all pages. Comparing these and other properties to what is known about the structure of the Web graph from earlier research indicates that the graph is a good sample of the overall Web graph. Please visit the following links for more detailed statistics about the graph as well as a detailed description of how to download and make use of the graph:

2.2. Extracted Hyperlink Graph from Spring 2014 Common Crawl Corpus

The hyperlink graph was extracted from the Web corpus released by the Common Crawl Foundation in April 2014. The Web corpus was gathered using a modified Apache Nutch crawler to download pages from a large but fixed seed list. The crawler was restricted to URLs contained in this list and did not extract additional URLs from links in the crawled pages. The seed list contained around 6 billion URLs and was provided by the search engine company blekko. The Common Crawl Foundation and blekko started a cooperation in 2013 in order to increase the quality and popularity of the crawled pages and to reduce the number of crawled spam pages and crawler traps. The table below gives an overview of the size of the different aggregation levels of the extracted graph, which are provided for download:

Granularity        #Nodes           #Arcs
Page               1,727 million    64,422 million
Host               22 million       123 million
Pay-Level-Domain   13 million       56 million

Within the extracted graph, over 91% of all crawled pages are connected, and the largest strongly connected component consists of 19% of all nodes. This means that, in comparison to the 2012 graph, the percentage of connected nodes is almost equal, but the largest strongly connected component is much smaller. This difference results from the different crawling strategies that were used to gather the two web corpora. For analyzing the connectivity of Web pages or the overall structure of the Web graph, we therefore recommend using the 2012 graph rather than the 2014 graph, as a BFS-based selection strategy that discovers new URLs while crawling is more likely to produce a realistic sample of the Web graph.
Please visit the following links for more detailed statistics about the 2014 graph as well as a detailed description of how to download and make use of the graph:

3. Data Formats and Download

We provide the graphs for free download in several formats. All graphs are provided in an index/arc data format. In addition, we provide the page graphs in the format used by the WebGraph library and the PLD graphs in the format used by Pajek.

3.1. Index/Arc Format

The Index/Arc format represents each graph using two files. Within the index file, each line represents one node; the first column states the node name and the second column the node index. Within the arc file, each line represents a directed edge between two nodes, where the first column is the origin node and the second the target node. The files are sorted by index and use tabs as delimiters. The following example files contain a graph with 106 nodes and 141 arcs.
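
To make the layout concrete, here is a small hypothetical sample in the same style (not an excerpt from the actual downloads), using three pay-level-domain nodes; the arc file refers to nodes by their index, and the label lines are annotations that do not appear in the files themselves:

    index file (node name TAB node index):
        example.org      0
        uni-mannheim.de  1
        commoncrawl.org  2

    arc file (origin index TAB target index):
        0   1
        1   2
        2   0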

3.2. WebGraph Framework Format

We also provide the page graph in the format expected by the WebGraph Framework developed by Sebastiano Vigna. The graph is represented using three files: .graph, .offsets, and .properties. All three are necessary to load the network into the library.
Using the WebGraph Framework, which can be downloaded from Maven Central, these files can be loaded using the following line of code: BVGraph graph = BVGraph.loadMapped(baseName, new ProgressLogger());
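
Expanding that line into a minimal self-contained sketch, the following program memory-maps the graph and prints the out-links of one node. The base name webgraph/page is an assumption; substitute the common prefix of the downloaded .graph, .offsets, and .properties files.

    import it.unimi.dsi.logging.ProgressLogger;
    import it.unimi.dsi.webgraph.BVGraph;
    import it.unimi.dsi.webgraph.LazyIntIterator;

    public class LoadGraph {
        public static void main(String[] args) throws Exception {
            // Common prefix of the .graph/.offsets/.properties files (assumed path).
            String baseName = "webgraph/page";

            // Memory-map the graph instead of loading it fully into RAM.
            BVGraph graph = BVGraph.loadMapped(baseName, new ProgressLogger());
            System.out.println("nodes: " + graph.numNodes());

            // Iterate over the successors (out-links) of node 0.
            LazyIntIterator successors = graph.successors(0);
            for (int target; (target = successors.nextInt()) != -1; ) {
                System.out.println("0 -> " + target);
            }
        }
    }

Note that these classes use int node identifiers; for graphs with more than 2^31 nodes, such as the 2012 page graph, the big variant of the framework (package it.unimi.dsi.big.webgraph) offers the same functionality with long identifiers.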

3.3. Pajek NET Format

We also offer the PLD aggregation of the page graph in the Pajek NET format, which is understood by various graph analysis tools such as Pajek or Gephi. The format combines the index and the arc list into a single file (example file, 106 nodes, 141 arcs).
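
For orientation, a minimal hypothetical NET file (not an excerpt from the actual download) looks roughly as follows: a *Vertices section lists each node index and label, and an *Arcs section lists the directed edges by node index:

    *Vertices 3
    1 "example.org"
    2 "uni-mannheim.de"
    3 "commoncrawl.org"
    *Arcs
    1 2
    2 3
    3 1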

4. Extraction Process and Source Code

We extracted all HTML pages (mime type: text/html) and every hyperlink pointing to another crawled HTML page (link types: a and link) from the corpora. For each redirect, we include an additional node in the graph which links to the redirect target.

Since the Common Crawl corpus is provided via the AWS Simple Storage Service (S3), it made sense to perform the extraction in the Amazon cloud (EC2). The main criterion here was the cost of achieving the task. Instead of using the ubiquitous Hadoop framework, we found that coordinating the extraction via the Simple Queue Service (SQS) was more efficient. SQS provides a message queue implementation, which we use to coordinate the extraction nodes. The Common Crawl data set is readily partitioned into compressed files of around 100MB, each including several thousand web pages. Besides these content files, metadata files are also provided. For each page, the metadata files include the URL, redirects, mime type, hyperlinks, and the type of each link in a JSON format. As these files contain all information needed to extract the hyperlink graph of the crawled web pages, we used an adapted version of the framework that we already used to extract RDFa, Microformats, and Microdata from the crawled pages to parse the URLs, redirects, links, and link types from the metadata files. We used 100 machines on Amazon EC2 to process the metadata files. In a second step, we created an index file for each aggregation level (PLD, Domain, 1st Subdomain) and indexed the graphs based on these mappings using Apache Pig running on a 40-node Amazon Elastic MapReduce (EMR) cluster.
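
The exact worker implementation is part of the source code described below; as a rough sketch of the coordination pattern, each EC2 worker polls the SQS queue for the key of a metadata file, processes the file, and deletes the message on success. The queue name and the processing logic are placeholders; the calls are from the AWS SDK for Java.

    import com.amazonaws.services.sqs.AmazonSQS;
    import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
    import com.amazonaws.services.sqs.model.Message;

    public class ExtractionWorker {
        public static void main(String[] args) {
            AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
            // Placeholder queue holding one message per metadata file to be processed.
            String queueUrl = sqs.getQueueUrl("metadata-files").getQueueUrl();

            while (true) {
                for (Message message : sqs.receiveMessage(queueUrl).getMessages()) {
                    String fileKey = message.getBody();   // S3 key of a ~100MB metadata file
                    processMetadataFile(fileKey);         // parse URLs, redirects, links (placeholder)
                    // Deleting the message ensures no other worker processes the same file again.
                    sqs.deleteMessage(queueUrl, message.getReceiptHandle());
                }
            }
        }

        private static void processMetadataFile(String fileKey) {
            // Download the file from S3 and emit (source, target) arcs; omitted in this sketch.
        }
    }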

The source code for extracting the WDC Hyperlink Graph from the Common Crawl corpus can be checked out from our Subversion repository. To use the code, you will need to create your own configuration and fill in your AWS authentication information and bucket names. Compilation is performed using Maven, so changing into the source root directory and typing mvn install should be sufficient to create a build. In order to run the extractor on more than 10 EC2 nodes, you will have to request an EC2 instance limit increase for your AWS account. Besides the raw extraction framework, the project also includes various algorithms to format and manipulate the entire graph, such as shrinking it to a specific aggregation level or indexing the graph in order to compress it.

5. Related Data Sets

The Laboratory for Web Algorithms provides various hyperlink graphs for public download in the format understood by the WebGraph Framework. In comparison to these graphs, the WDC Hyperlink Graph is more recent and larger.
The Stanford Large Network Dataset Collection also contains several smaller hyperlink graphs (all below 1 million nodes).

Besides the Common Crawl corpus that was used to extract the WDC Hyperlink Graph, there are several other public web corpora that could be used to extract hyperlink graphs:

6. License

The Web Data Commons extraction framework can be used under the terms of the Apache Software License.

7. Feedback

Please send questions and feedback to the Web Data Commons mailing list or post them in our Web Data Commons Google Group.

More information about Web Data Commons can be found here.

8. Credits

Lots of thanks to

The creation of the WDC Hyperlink Graph was supported by the EU research project PlanetData and by an Amazon Web Services in Education Grant award. We thank our sponsors for supporting Web Data Commons.


9. References