Focused Crawling for Structured Data
Robert Meusel
Peter Mika
Roi Blanco

The Web is rapidly transforming from a pure document collection to the largest connected public data space. Semantic annotations of web pages make it notably easier to extract and reuse data, and are increasingly used by both search engines and social media sites to provide better search experiences through rich snippets, faceted search, task completion, etc. On this page, we present the results of our study of the novel problem of crawling structured data embedded inside HTML pages. We describe Anthelion, the first focused crawler addressing this task. We propose new methods of focused crawling specifically designed for collecting data-rich pages with greater efficiency. In particular, we propose a novel combination of online learning and bandit-based explore/exploit approaches to predict data-rich web pages based on the context of the page as well as using feedback from the extraction of metadata from previously seen pages. We show that these techniques significantly outperform state-of-the-art approaches for focused crawling, measured as the ratio of relevant to non-relevant pages collected within a given budget.

Contents

1. Methodology
1.1. Online Classification
1.2. Bandit-Based Selection Approach
2. System Architecture
3. Code and Data
4. References

1. Methodology

In the following, we present the two general approaches to machine learning (online classification and bandit-based selection) that we adapt to the domain of focused crawling, in particular to the task of collecting structured data from web pages.

1.1. Online Classification

Crawling pages that embed markup data can be cast as a focused crawling task, whose general aim is to devise an algorithm that gathers web pages relevant for a given objective function as quickly as possible. Standard focused crawling approaches target pages that include information about a given topic, such as sports, politics, or events.
Focused crawlers make use of topic-specific seeds and operate by training a classifier that predicts whether a newly discovered web page (before downloading and parsing its content) is relevant for the given target or not. Thus, it is necessary to assemble a training set, find suitable topic seeds, and learn a classifier before crawling commences.
Online learning approaches, on the other hand, adapt the underlying classification model on the fly with new labeled examples. For a crawler this is suitable provided that a label can be acquired automatically as soon as the content of a crawled page has been parsed. The approach is appealing because not only is it unnecessary to create a training set in advance, but the classifier also adapts itself over time. On the Web, where the distribution of individual features is hard to predict, the distribution observed while discovering ever larger numbers of pages may differ strongly from that of an initial training set; this adaptability helps to ensure suitable classification results.
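To make the feedback loop concrete, the following minimal sketch shows an online learner in the spirit described above: a simple perceptron over sparse string features that is updated as soon as the true label of a parsed page is known. The learner and all class/method names are illustrative assumptions, not the classifier actually used in the released implementation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/**
 * Minimal sketch of an online classifier for candidate pages: a perceptron over
 * sparse string features. Illustrative only; not the actual Anthelion learner.
 */
public class OnlinePageClassifier {

    private final Map<String, Double> weights = new HashMap<>();
    private double bias = 0.0;
    private final double learningRate = 0.1;

    /** Confidence that the page is relevant, i.e., contains structured data. */
    public double score(Set<String> features) {
        double sum = bias;
        for (String f : features) {
            sum += weights.getOrDefault(f, 0.0);
        }
        return sum;
    }

    /**
     * Feedback step: once the page has been downloaded and parsed, we know whether
     * it actually contained structured data and can update the model immediately.
     */
    public void update(Set<String> features, boolean relevant) {
        boolean predicted = score(features) > 0.0;
        if (predicted == relevant) {
            return; // prediction was correct, no update needed
        }
        double direction = relevant ? 1.0 : -1.0;
        bias += learningRate * direction;
        for (String f : features) {
            weights.merge(f, learningRate * direction, Double::sum);
        }
    }
}
```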
In order to predict the relevance of an unseen, newly discovered page, it is necessary to extract features which the classifier can take into account to make its prediction. We considered three major sources of features which are (partly) available for a web page before downloading and parsing it: the URL of the page itself, the parents of the page (already crawled pages linking to it), and its siblings (pages linked from the same parents).

We note that these sources of features may become gradually available during the crawling process: we will always know the URL of a candidate page, but we might not have discovered every parent of a page, and information about siblings might not be available at all.
There are several ways to extract features from the URL of a page. In general, the URL is split into tokens at every non-alphanumeric character (punctuation, slashes, and so on), and these tokens can be used directly as features of the page. In order to reduce the sparseness of the generated feature vectors, and potentially improve the accuracy of the classifier, several pre-processing steps can be applied to the extracted tokens before finally transforming them into features. Standard transformations include, for example, removing tokens consisting of too few or too many characters. Another technique maps different spellings of a given token to a normalized version, or replaces tokens made up only of digits with one static token representing any number.
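A minimal sketch of this tokenization and normalization follows; the length thresholds and the placeholder token for numbers are illustrative assumptions, not the exact values used in our experiments.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Sketch of URL tokenization into features; thresholds are illustrative. */
public class UrlFeatureExtractor {

    private static final int MIN_TOKEN_LENGTH = 2;   // assumption: drop very short tokens
    private static final int MAX_TOKEN_LENGTH = 20;  // assumption: drop very long tokens

    public static List<String> tokenize(String url) {
        List<String> features = new ArrayList<>();
        // Split at every non-alphanumeric character (dots, slashes, punctuation, ...).
        for (String token : url.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()
                    || token.length() < MIN_TOKEN_LENGTH
                    || token.length() > MAX_TOKEN_LENGTH) {
                continue;
            }
            // Replace purely numeric tokens by one static placeholder token.
            if (token.matches("[0-9]+")) {
                features.add("[NUMBER]");
            } else {
                features.add(token);
            }
        }
        return features;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("http://www.example.com/events/2014/program.html"));
        // -> [http, www, example, com, events, [NUMBER], program, html]
    }
}
```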
We also extract features from the parents and siblings of a page. These features are based on labels previously assigned to parent/sibling pages. For example, we introduce as features the number of parents/siblings labeled with the target class, and a binary feature representing the existence of at least one parent or sibling labeled with the target class.
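A sketch of these two link-context features, assuming the labels of the already-crawled parents or siblings are available as a list of booleans (names are illustrative):

```java
import java.util.List;

/**
 * Sketch of the link-context features described above. The input is the list of
 * labels ("belongs to the target class" or not) of the already-labeled parents or
 * siblings of a candidate page; names are illustrative.
 */
public class LinkContextFeatures {

    /** Feature 1: number of parents/siblings labeled with the target class. */
    public static int targetLabelCount(List<Boolean> neighborLabels) {
        int count = 0;
        for (Boolean label : neighborLabels) {
            if (Boolean.TRUE.equals(label)) {
                count++;
            }
        }
        return count;
    }

    /** Feature 2: true if at least one parent/sibling carries the target label. */
    public static boolean hasTargetLabel(List<Boolean> neighborLabels) {
        return targetLabelCount(neighborLabels) > 0;
    }
}
```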

1.2. Bandit-Based Selection Approach

A bandit-based selection approach estimates the relevance of a group of items for a given target and performs its selection based on the expected gain (or relevance) of the groups. The bandit operates as follows: at each round t we have a set of actions A (or arms) and choose one of them, a(t), from A; we then observe a reward r(a,t), and the goal is to find a policy for selecting actions such that the cumulative reward over time is maximized. The main idea is that the algorithm can improve its arm-selection strategy with every new observation. It is important to remark that the algorithm receives no feedback for unchosen arms.
An ideal bandit would like to maximize the expected reward. If we just want to maximize the immediate reward (exploitation), we choose the action whose estimated reward is largest. In an exploration/exploitation setting, however, we want to randomly select an action a according to its probability of being Bayes-optimal. (For more details, see Section 3.2 of the paper.)
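Selecting an action according to its probability of being Bayes-optimal is the idea behind Thompson sampling. The sketch below illustrates this explore/exploit principle for binary rewards with a Beta posterior per arm; it is a generic illustration of the principle, not necessarily the exact policy implemented in Anthelion.

```java
import java.util.Arrays;
import java.util.Random;

/**
 * Generic Thompson-sampling sketch: each arm is chosen with the probability of it
 * being the best one, given binary rewards and a Beta posterior per arm.
 */
public class ThompsonSamplingBandit {

    private final int[] successes;
    private final int[] failures;
    private final Random random = new Random();

    public ThompsonSamplingBandit(int numberOfArms) {
        this.successes = new int[numberOfArms];
        this.failures = new int[numberOfArms];
    }

    /** Choose the arm whose sampled posterior value is highest. */
    public int selectArm() {
        int best = 0;
        double bestSample = -1.0;
        for (int arm = 0; arm < successes.length; arm++) {
            double sample = sampleBeta(successes[arm] + 1, failures[arm] + 1);
            if (sample > bestSample) {
                bestSample = sample;
                best = arm;
            }
        }
        return best;
    }

    /** Reward feedback is only received for the arm that was actually chosen. */
    public void update(int arm, boolean reward) {
        if (reward) {
            successes[arm]++;
        } else {
            failures[arm]++;
        }
    }

    /** Beta(a, b) sample for integer a, b: the a-th smallest of a+b-1 uniform draws. */
    private double sampleBeta(int a, int b) {
        double[] u = new double[a + b - 1];
        for (int i = 0; i < u.length; i++) {
            u[i] = random.nextDouble();
        }
        Arrays.sort(u);
        return u[a - 1];
    }
}
```

Arms that have frequently yielded relevant pages are sampled more often, while rarely tried arms still retain a chance of being explored.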
In the case of our crawler, we use the bandit-based approach to make a first selection of the host to be crawled. This is motivated by the observation that the decision to use structured data markup is, in most cases, made at the host level. Informally, we represent each host as an arm whose value summarizes all discovered pages belonging to this host. Several functions can be used to calculate the score of a host, and thereby its estimated relevance for the target; they are described next. Selecting an arm (action) in this context means selecting the host which, at a given point in time t, has the highest expected value of containing pages relevant for our target. Once we have selected the host, we follow by selecting a page from that host using the online classifier described above.
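The two-stage selection can be sketched as follows: a host is chosen first (here with a lambda-controlled chance of picking a random host instead of the best-scored one, which is an assumption about how the host score and the exploration trade-off interact), and then the page of that host with the highest classifier confidence is crawled. Class and method names are illustrative.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Sketch of host-then-page selection; names and the lambda rule are assumptions. */
public class TwoStageSelection {

    private static final Random RANDOM = new Random();

    /** A candidate URL with the confidence assigned to it by the online classifier. */
    public record Candidate(String url, double confidence) {}

    /** With probability lambda pick a random host, otherwise the best-scored one. */
    public static String selectHost(Map<String, Double> hostScores, double lambda) {
        List<String> hosts = List.copyOf(hostScores.keySet());
        if (RANDOM.nextDouble() < lambda) {
            return hosts.get(RANDOM.nextInt(hosts.size())); // exploration
        }
        return hosts.stream()
                .max(Comparator.comparingDouble(hostScores::get))
                .orElseThrow();                              // exploitation
    }

    /** Within the chosen host, crawl the page the classifier is most confident about. */
    public static Candidate selectPage(List<Candidate> hostBin) {
        return hostBin.stream()
                .max(Comparator.comparingDouble(Candidate::confidence))
                .orElseThrow();
    }
}
```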
We have defined several scoring functions to estimate the expected gain of each host; these are also implemented within the tool.

2. System Architecture

We implemented our system as a module that replaces the selection strategy of an existing crawler. The architecture is described in detail in Section 4.1 of the paper presented at CIKM 2014.
The process flow of the system is as follows:

The crawling process starts with a number of initial seed pages (0). The pages are pulled into the Sieve; before each page is added to the corresponding host bin, it is classified using the online classifier, which in the online setting starts off with an empty model, as no training data (pages) are available yet. Whenever needed, the bandit selects one host based on the host's score and the chosen lambda (1), and the selected host is inserted into the Queue. For each host, the URL with the highest confidence for the target class is selected and pushed into the Crawler (2). The reordered pages are now ready to be handled by the other components of the crawler. After downloading (3) and parsing (4) a page, the newly found links are classified and added to the Sieve (5). In addition, the label of the crawled page is returned as feedback to the classifier, which updates its classification model (6).
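Read as code, this loop can be sketched as follows; the interfaces are placeholders for the crawler components and are assumptions for illustration, not the classes of the released implementation.

```java
import java.util.List;
import java.util.Set;

/** Sketch of the process flow as a single loop; all names are illustrative. */
public class CrawlLoopSketch {

    /** Host-binned frontier: stores classified candidates and performs host selection. */
    public interface Sieve {
        boolean hasCandidates();
        String selectHost(double lambda);            // bandit step (1)
        String popBestCandidate(String host);        // highest-confidence URL of the host (2)
        void add(String url, double confidence);     // newly discovered, classified link (5)
    }

    /** Downloading and parsing components of the underlying crawler. */
    public interface Fetcher {
        ParsedPage downloadAndParse(String url);     // steps (3) and (4)
    }

    /** Result of parsing a crawled page. */
    public interface ParsedPage {
        List<String> outlinks();
        Set<String> features();                      // features of the crawled page itself
        boolean containsStructuredData();            // the automatically obtained label
    }

    /** Online classifier as sketched in Section 1.1. */
    public interface Classifier {
        double score(Set<String> features);
        void update(Set<String> features, boolean relevant);
    }

    /** Feature extraction for not-yet-crawled URLs, as sketched in Section 1.1. */
    public interface FeatureExtractor {
        Set<String> extract(String url);
    }

    public static void crawl(Sieve sieve, Fetcher fetcher, Classifier classifier,
                             FeatureExtractor features, double lambda) {
        while (sieve.hasCandidates()) {
            String host = sieve.selectHost(lambda);              // (1) bandit picks a host
            String url = sieve.popBestCandidate(host);           // (2) best page of that host
            ParsedPage page = fetcher.downloadAndParse(url);     // (3) download, (4) parse
            for (String outlink : page.outlinks()) {             // (5) classify new links
                sieve.add(outlink, classifier.score(features.extract(outlink)));
            }
            classifier.update(page.features(),                   // (6) feedback to the model
                    page.containsStructuredData());
        }
    }
}
```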

3. Code and Data

For the experiments presented in our work Focused Crawling for Structured Data, we used the 2012 corpus of the Common Crawl Foundation. Within this corpus we crawled over 5 million pages, starting from seeds taken from the Open Directory Project. The data files can be found here. The code used to perform the experiments, as well as a ready-to-run Apache Nutch plugin, is provided within the open-source project released on GitHub under the Apache License 2.0.

4. References

Robert Meusel, Peter Mika, Roi Blanco: Focused Crawling for Structured Data. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management (CIKM 2014), Shanghai, China, 2014.