This page provides a large hyperlink graph for public download. The graph has been extracted from the Common Crawl 2012 web corpus and covers 3.5 billion web pages and 128 billion hyperlinks between these pages. To the best of our knowledge, the graph is the largest hyperlink graph that is available to the public outside companies such as Google, Yahoo, and Microsoft. Below we provide instructions on how to download the graph as well as basic statistics about its topology.
We hope that the graph will be useful for researchers who
- develop search algorithms that rank results based on the hyperlinks between pages.
- develop spam detection methods that identify networks of web pages published in order to trick search engines.
- develop graph analysis algorithms and can use the hyperlink graph for testing the scalability and performance of their tools.
- work in Web Science and want to analyze the linking patterns within specific topical domains in order to identify the social mechanisms that govern these domains.
1. Levels of Aggregation
We provide the hyperlink graph on four different levels of aggregation:
- Page-Level Graph - This version of the graph contains all details, with each node representing a single web page (like http://dws.informatik.uni-mannheim.de/en/projects/current-projects/#c13686) and each arc a hyperlink between two pages.
- Subdomain-Level/Host Graph - This graph aggregates the page graph by subdomain/host. Each node in the graph represents a specific subdomain/host (like research.dws.uni-mannheim.de), and an arc exists if at least one hyperlink was found between pages that belong to a pair of subdomains/hosts. Note that subdomains/hosts can be of arbitrary depth.
- First-Level-Subdomain Graph - Each node represents a first-level subdomain (like dws.uni-mannheim.de), with all subdomains below it aggregated into this node.
- Pay-Level-Domain Graph - Each node represents a pay-level domain (like uni-mannheim.de). An arc exists if at least one hyperlink was found between pages contained in a pair of pay-level domains.
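The aggregation from page level to host level can be illustrated with a minimal sketch. It is not the code used to build the published graphs: the function names are hypothetical, and the choice to drop arcs from a host to itself by default is an assumption, since this page does not say whether such arcs are kept. Pay-level-domain aggregation would additionally need a public suffix list and is not covered here.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lower-cased host of a URL, or '' if none can be parsed."""
    return urlsplit(url).hostname or ""


def aggregate_to_hosts(page_arcs, keep_self_loops=False):
    """Collapse page-level arcs (source URL, target URL) into host-level arcs.

    One host-level arc is produced as soon as at least one hyperlink
    exists between pages of the two hosts; duplicates collapse in the set.
    """
    host_arcs = set()
    for src_url, dst_url in page_arcs:
        src, dst = host_of(src_url), host_of(dst_url)
        if not src or not dst:
            continue  # skip URLs without a parsable host
        if src == dst and not keep_self_loops:
            continue  # assumption: intra-host links are dropped
        host_arcs.add((src, dst))
    return host_arcs
```

Because the result is a set, many hyperlinks between the same two hosts yield a single arc, which matches the "at least one hyperlink" rule described above.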
The table below gives an overview of the size of the different graphs:
|Graph||#Nodes||#Arcs|
|Page Graph||3,563 million||128,736 million|
|Subdomain/Host Graph||101 million||2,043 million|
|1st Level Subdomain Graph||95 million||1,937 million|
|PLD Graph||43 million||623 million|
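As a quick reading aid, the average number of arcs per node follows directly from the table. The short sketch below only divides the rounded figures given above; the dictionary and function names are chosen for illustration.

```python
# (nodes, arcs) in millions, copied from the table above
GRAPH_SIZES = {
    "Page Graph": (3563, 128736),
    "Subdomain/Host Graph": (101, 2043),
    "1st Level Subdomain Graph": (95, 1937),
    "PLD Graph": (43, 623),
}


def average_out_degree(sizes):
    """Return arcs divided by nodes for every graph in `sizes`."""
    return {name: arcs / nodes for name, (nodes, arcs) in sizes.items()}


for name, avg in average_out_degree(GRAPH_SIZES).items():
    print(f"{name}: about {avg:.1f} arcs per node")
```

With these rounded figures, the page graph has roughly 36 arcs per node, while the aggregated graphs have between roughly 14 and 20.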
2. Data Formats and Download
We provide the graphs for free download in several formats. All graphs are provided in an index/arc data format. In addition, we provide the page graph in the format used by the WebGraph library and the PLD graph in the format used by Pajek. The page graphs are hosted on Amazon S3. The aggregated graphs are provided for download via a server in Mannheim, Germany.
2.1 Index/Arc Format
The Index/Arc format represents each graph using two files. Within the index file each line represents one node. The first column states the node name, the second column states the node index. Within the arc file each line represents a directed edge between two nodes, where the first column is the origin node and the second the target node. The files are sorted by index and use tabs as a delimiter. The following example files contain a graph with 106 nodes and 141 arcs.
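The following minimal sketch shows one way to read the two files. It assumes that the files are UTF-8 encoded and that both columns of the arc file hold integer node indices as assigned in the index file; neither detail is spelled out above, and the function names are hypothetical.

```python
from collections import Counter


def read_index(path):
    """Return a dict mapping node index -> node name from an index file."""
    names = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            name, idx = line.rstrip("\n").split("\t")
            names[int(idx)] = name
    return names


def read_arcs(path):
    """Yield (source index, target index) pairs from an arc file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            src, dst = line.split("\t")
            yield int(src), int(dst)  # assumption: columns are node indices


def out_degrees(arcs):
    """Count outgoing arcs per source node."""
    return Counter(src for src, _ in arcs)
```

Reading the arcs as a generator keeps memory use low, which matters for the larger arc files. The index dictionary, in contrast, is held in memory as a whole and may be too large for the page-level index on a typical machine.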
The following table contains the links for downloading the graphs.
|Data Set||Index File||Arc File|
|Page Graph||see below (45 GB)||see below (331 GB)|
|Subdomain/Host Graph||download (832 MB)||download (9.2 GB)|
|1st Subdomain Graph||download (757 MB)||download (8.7 GB)|
|PLD Graph||download (297 MB)||download (2.8 GB)|
Downloading the page graph: Due to their size, the arc and index files of the page graph are split into smaller files of around 500 MB each. These files can be downloaded using
wget -i http://webdatacommons.org/hyperlinkgraph/data/index.list.txt
for the index files and
wget -i http://webdatacommons.org/hyperlinkgraph/data/arc.list.txt
for the arc files.
2.2 WebGraph Framework Format
We also provide the page graph in the format expected by the WebGraph Framework developed by Sebastiano Vigna. The graph is represented using three files: .graph, .offsets, and .properties. All three are necessary to load the network into the library, for example with the following call, where baseName is the common path prefix of the three files:
BVGraph graph = BVGraph.loadMapped(baseName, new ProgressLogger());
2.3 Pajek NET Format
We also offer the PLD aggregation of the page graph in the Pajek NET format, which is understood by various graph analysis tools such as Pajek or Gephi. The format combines the index and the arc list into a single file (example file, 106 nodes, 141 arcs). The Pajek version of the PLD graph (2.3 GB) can be imported directly into Pajek after unzipping. To process the graph in acceptable time, we recommend running Pajek with at least 32 GB of RAM.
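To illustrate how an index and an arc list combine into one NET file, here is a minimal sketch of a writer for the basic Pajek layout (a `*Vertices` section followed by an `*Arcs` section). It assumes 0-based node indices on the input side and shifts them by one, because Pajek numbers vertices from 1. Whether the published file uses exactly this layout is not confirmed here, and the function name is hypothetical.

```python
def write_pajek(names, arcs, out):
    """Write a directed graph in Pajek NET format to the text stream `out`.

    names: list of node labels; the list position is the 0-based node index
    arcs:  iterable of (source index, target index) pairs, 0-based
    """
    out.write(f"*Vertices {len(names)}\n")
    for number, name in enumerate(names, start=1):  # Pajek numbering starts at 1
        out.write(f'{number} "{name}"\n')
    out.write("*Arcs\n")
    for src, dst in arcs:
        out.write(f"{src + 1} {dst + 1}\n")
```

A call such as `write_pajek(["a.com", "b.com"], [(0, 1)], sys.stdout)` would emit two vertex lines and a single arc line `1 2`.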
2.4 Ranking Files
Besides the pure graph files, we also calculated for each host in the host graph the harmonic centrality, the indegree, Katz's index, and PageRank. An interactive version of these rankings is available at wwwranking.webdatacommons.org. The underlying ranking data can also be downloaded using the files listed below. Each line within the files consists of the host and the value for the corresponding measure.
- Harmonic Centrality: http://data.dws.informatik.uni-mannheim.de/hyperlinkgraph/ranking/hostgraph-h.tsv.gz
- Indegree: http://data.dws.informatik.uni-mannheim.de/hyperlinkgraph/ranking/hostgraph-indegree.tsv.gz
- Katz: http://data.dws.informatik.uni-mannheim.de/hyperlinkgraph/ranking/hostgraph-katz.tsv.gz
- PageRank: http://data.dws.informatik.uni-mannheim.de/hyperlinkgraph/ranking/hostgraph-pr.tsv.gz
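A downloaded ranking file can be inspected without unpacking it first. The sketch below assumes that host and value are separated by a tab (as the .tsv extension suggests), that the host comes first, and that every value parses as a floating-point number; the function name is hypothetical.

```python
import gzip
import heapq


def top_hosts(path, k=10):
    """Return the k (host, value) pairs with the largest values.

    Streams a gzipped, tab-separated ranking file line by line, so the
    whole file never has to fit into memory.
    """
    def rows():
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue  # tolerate blank lines
                host, value = line.rstrip("\n").split("\t")
                yield host, float(value)

    return heapq.nlargest(k, rows(), key=lambda row: row[1])
```

For example, `top_hosts("hostgraph-pr.tsv.gz", k=20)` would list the 20 hosts with the highest value in a locally stored copy of the PageRank file.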
3. Extraction Process and Source Code
The WDC Hyperlink Graph was extracted from the latest version of the Common Crawl, which was gathered in the first half of 2012. From this corpus, we extracted all HTML pages (mime-type: text/html) and every hyperlink pointing to another crawled HTML page (link type: a and link). For each redirect, we include an additional node in the graph which links to the redirect target.
Since the Common Crawl corpus is provided via the AWS Simple Storage Service (S3), it made sense to perform the extraction in the Amazon cloud (EC2). The main criterion for the setup was the cost of completing the extraction. Instead of using the ubiquitous Hadoop framework, we found that using the Simple Queue Service (SQS) for our extraction process increased efficiency. SQS provides a message queue implementation, which we use to coordinate the extraction nodes. The Common Crawl data set is already partitioned into compressed files of around 100 MB each, every one of which contains several thousand web pages. Besides these content files, metadata files are also provided. For each page, these files contain the URL, redirects, mime-type, hyperlinks, and type of link in a JSON format.
As the metadata files contain all information needed to extract the hyperlink graph for the crawled web pages, we used an adapted version of the framework that we had already used to extract RDFa, Microformat and Microdata from the crawled pages to parse the URL, redirects, links and link types from the metadata files. We used 100 machines on Amazon EC2 to process the metadata files. In a second step, we created an index file for each aggregation level (PLD, Domain, 1st Subdomain) and indexed the graphs based on these mappings using Apache Pig running on a 40-node Amazon Elastic MapReduce (EMR) cluster.
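The queue-based coordination described above can be illustrated with a local stand-in. The sketch below does not use SQS or the WDC framework: it relies on Python's in-process `queue.Queue` and threads in place of the message queue and the EC2 machines, and it omits failure handling. All names are hypothetical.

```python
import queue
import threading


def run_workers(file_names, process, num_workers=4):
    """Process every file name exactly once using a shared task queue.

    Each worker repeatedly takes the next file name from the queue and
    stops when the queue is empty, so faster workers simply take more
    files and no central scheduler is needed.
    """
    tasks = queue.Queue()
    for name in file_names:
        tasks.put(name)

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                name = tasks.get_nowait()
            except queue.Empty:
                return  # queue drained: this worker is done
            outcome = process(name)
            with lock:
                results.append((name, outcome))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

In this pattern the list of metadata files is the only shared state, which is what makes a simple message queue sufficient for coordinating independent extraction nodes.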
The source code to extract the WDC Hyperlink Graph from the Common Crawl corpus can be checked out from our Subversion repository. To use the code, you will need to create your own configuration and fill in your AWS authentication information and bucket names. Compilation is performed using Maven, so changing into the source root directory and typing mvn install should be sufficient to create a build. In order to run the extractor on more than 10 EC2 nodes, you will have to request an EC2 instance limit increase for your AWS account. Besides the raw extraction framework, the project also includes various algorithms to format and manipulate the entire graph, such as shrinking it to a specific aggregation level or indexing the graph to compress it.
4. Topology of the Hyperlink Graph
We provide basic statistics about the topology of the graphs in a separate document.
5. Other Public Hyperlink Graphs and Web Crawls
The Laboratory for Web Algorithms provides various hyperlink graphs for public download in the format understood by the WebGraph Framework. In comparison to these graphs, the WDC Hyperlink Graph is more recent and larger.
The Stanford Large Network Dataset Collection also contains several smaller hyperlink graphs (all below 1 million nodes).
Besides the Common Crawl, which was used to extract the WDC Hyperlink Graph, there are several other public web corpora that could be used to extract hyperlink graphs:
- The ClueWeb12 corpus was crawled in a similar time period as the Common Crawl. The corpus consists of 740 million English webpages. In comparison, the Common Crawl is 4 times larger and covers non-English top-level domains as well.
- The Stanford WebBase project provides a Web crawl containing 118 million pages and around 1 billion links. The corpus was collected in 2001.
- The Yahoo! Webscope Project has published an older version of the AltaVista crawl, created in 2002. The corpus includes 1.4 billion webpages that are connected by 6.6 billion hyperlinks.
The Web Data Commons extraction framework can be used under the terms of the Apache Software License.
Lots of thanks to
- the Common Crawl project for providing their great web crawl and thus enabling the creation of the WDC Hyperlink Graph.
- Sebastiano Vigna for providing and supporting us with his amazing Java WebGraph library.
- Andrej Mrvar for his fast and detailed answers about the usage of specific functions in Pajek.
- Stephan Seufert for giving us some initial ideas about how to compress and format our graph.
The creation of the WDC Hyperlink Graph was supported by the EU research project PlanetData and by an Amazon Web Services in Education Grant award. We thank our sponsors a lot for supporting Web Data Commons.