Generating a Product Data Catalog out of the Web
Petar Petrovski
Anna Primpeli
Robert Meusel
Christian Bizer

In order to support the evaluation and comparison of product feature extraction and product matching methods, we have manually created two public gold standards for these tasks. The gold standards comprise several hundred products from the categories mobile phones, TVs, and headphones.



1. About the WDC Gold Standard for Product Matching and Product Feature Extraction

Finding out which e-shops offer a specific product is a central challenge for building integrated product catalogs and comparison shopping portals. Determining whether two offers refer to the same product involves extracting a set of features (product attributes) from the web pages containing the offers and comparing these features using a matching function. The existing gold standards [5, 6] for product matching have two shortcomings: (i) they only contain offers from a small number of e-shops and thus do not properly cover the heterogeneity that is found on the Web; (ii) they only provide a small number of generic product attributes and therefore cannot be used to evaluate whether detailed product attributes have been correctly extracted from textual product descriptions.

To overcome these shortcomings, we introduce the WDC gold standard for product matching and product feature extraction, which packages the following artifacts:

In addition, we provide a Product Data Corpus for public download containing over 13 million product-related web pages retrieved from the same 32 websites. This corpus might be useful as background knowledge for the semi-supervised training of feature extraction and matching methods.

2. Product Selection

We used the 32 most frequently visited shopping web sites based on the ranking provided by Alexa. For each of the three chosen product categories, we first collected the ten most popular products from the different web sites. We further complemented this list with similar products (based on their name). As an example, we found the product Apple iPhone 6 64GB to be one of the most popular among all shopping web sites; we therefore also included the products Apple iPhone 6 128GB and Apple iPhone 5 in our product catalog. Especially for the product matching task, this methodology introduces a certain level of complexity, as the product names differ by only one or two characters. In total, we selected 50 different products for each product category.

3. Gold Standard for Feature Extraction

We labeled 4 distinct structural units from the HTML pages: (1) Microdata title, (2) Microdata description, (3) HTML tables, and (4) HTML lists. The labeled set comprises 500 product entities with 338 distinct labeled properties in total. It was created by three different annotators. In order to ensure the validity of the annotation process, the inter-annotator agreement was calculated: out of 1,500 labeled properties, 75 disagreements were found, so the inter-annotator agreement is estimated to be 95%. The table below shows the distribution of labeled entities per category. The product entities were labeled as JSON objects, for which a documentation of property meaning and usage is available here. Additionally, the files containing the labeled entities are provided per category and in total for further use.
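The agreement figure can be reproduced with a simple calculation (a minimal sketch; the counts are those stated above):

```python
# Inter-annotator agreement as the share of labeled properties
# on which the annotators did not disagree (counts from the text).
total_labeled = 1500
disagreements = 75

agreement = 1 - disagreements / total_labeled
print(f"Inter-annotator agreement: {agreement:.0%}")  # 95%
```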

Category        #labeled Products   File with labeled Products
Mobile Phones   212                 phones_labeled_entities.json (603kb)

General statistics per PLD, Web Page and Totals for each category

Statistics per PLD are available by clicking here.

The charts below show the PLD usage for the top 12 most used properties in the gold standard for each product category. More detailed statistics on all properties before mapping can be found in the following file (UnMappedProps_Statistics.xlsx). The file includes the feature names of tables and lists and their frequency per product category as used by the different commercial websites, as well as the distribution of PLDs over the feature names per product category.

Statistics about product attributes within tables and lists per web page are available by clicking here.

The charts below show the usage of attributes per product per product category within lists and tables.

Statistics on the total number of properties labeled are available by clicking here.

Among the total number of defined properties, we observed some that were used rather frequently and others that were used by only a few websites. The charts below present the most frequent properties per product category as found in titles, descriptions, tables and lists. Please note that the presented results do not include the tagline property, which is used in cases where a word in a title or description did not reveal any specification and thus could not be assigned any property value.

4. Gold Standard for Product Matching

Product Catalog

Feature Enrichment: In order to complement the products with features, for each of the 150 products in our product catalog we obtained product-specific features from the manufacturer's web site or from shopping service web sites like Google Shopping. We did not extract features from individual vendor web sites, such as Amazon, as we want to focus on product-specific features rather than on features arising from bundles offered by specific vendors.

Figure 1 shows two example pages which we used to manually extract the features for our product catalog. Figure 1 (left) depicts a page from Google Shopping and Figure 1 (right) a page from the manufacturer's web site.

Figure 1: (left) Features annotated by Google Shopping and (right) features annotated by the manufacturer

The table below shows the scraped attributes per product category.

Defined Properties per Product Category

Category        #Properties   List of Properties                    Downloads
Headphones      36            Show Properties for Headphones        HeadphoneCatalog.json (22kb)
Mobile Phones   32            Show Properties for Mobile Phones     PhoneCatalog.json (38kb)
Televisions     76            Show Properties for Televisions       TVCatalog.json (38kb)


We manually generated 1,500 positive correspondences, 500 for each product category (the data are available here). For each product of the product catalog, at least one positive correspondence is included. Additionally, to make the matching task more realistic, the annotators also annotated products closely related to the ones in the product catalog, such as phone cases, TV wall mounts, headphone cables, ear-buds, etc. Furthermore, we created additional negative correspondences by exploiting transitive closure: as all products in the product catalog are distinct, whenever a positive correspondence exists between a product description on a web page and a product in the catalog, we can generate a negative correspondence between that web page and every other product in the catalog. Using the two approaches, we ended up with 73,500 negative correspondences. The full list of correspondences can be downloaded here.
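The transitive-closure construction can be sketched as follows (a minimal illustration; the page and product ids are hypothetical):

```python
# If a web page offer is a positive match for one catalog product,
# it is a negative match for every other (distinct) catalog product.
catalog = ["iphone6_64gb", "iphone6_128gb", "iphone5"]

# Hypothetical positive correspondences: page id -> matched catalog product.
positive_pairs = {
    "page_001": "iphone6_64gb",
    "page_002": "iphone5",
}

negative_pairs = [
    (page, product)
    for page, matched in positive_pairs.items()
    for product in catalog
    if product != matched
]
print(negative_pairs)
```

With 150 distinct catalog products, each positive correspondence thus yields 149 negatives, which explains the large ratio of negative to positive correspondences.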

Figure 2 depicts the number of positive correspondences which are contained for each product from the three different product categories.

Figure 2: Distribution of correspondences per product category

Download Page Subset

We also provide the pages included in the gold standard for download. Each page is named by the id of the product given in the data.

5. Product Data Corpus

The product crawl follows two main criteria: the diversity of its origin and its quantity. This means that the chosen product specifications derive from different websites which use different schemata to express product specifications for the same product category. Furthermore, the data should include a comprehensive amount of the products and offers provided by the selected websites.

For crawling data while satisfying these criteria, we used the crawling and scraping framework Scrapy. The crawler was restricted to retrieving data from 32 different PLDs, which were chosen based on their containment of marked-up annotations as well as their traffic rankings as reported by Alexa. The following table shows the distribution of pages included in the crawl per PLD.
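The actual crawl was performed with Scrapy; the scoping rule that keeps a crawler on the selected PLDs can be sketched with the standard library alone (the PLD subset below is illustrative, not the full list of 32):

```python
from urllib.parse import urlparse

# Hypothetical subset of the 32 selected PLDs.
ALLOWED_PLDS = {"target.com", "ebay.com", "apple.com"}

def in_scope(url: str) -> bool:
    """Return True if the URL's host belongs to one of the selected
    PLDs (simplified check: the domain itself or any subdomain)."""
    host = urlparse(url).netloc.lower()
    return any(host == pld or host.endswith("." + pld) for pld in ALLOWED_PLDS)

print(in_scope("http://www.target.com/p/iphone-6"))  # True
print(in_scope("http://example.org/iphone-6"))       # False
```

In Scrapy itself, the same restriction is commonly expressed via a spider's `allowed_domains` setting.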

Number of Pages per PLD in the Crawled Corpus

target.com2,007,121 shop.com1,754,368
microcenter.com459,921 aliexpress.com431,005
ebay.com413,144 macmall.com401,944
apple.com391,539 bestbuy.com389,146
techspot.com386,273 techforless.com361,234
overstock.com347,846 searsoutlet.com341,924
pcrush.com292,904 tesco.com222,802
frontierpc.com187,184 abt.com115,539
ipkart.com63,112 conns.com61,158
costco.com54,274 dhgate.com50,099
shop.lenovo.com41,465 bjs.com40,930
newegg.com37,393 microsoftstore.com24,163
samsclub.com22,788 tomtop.com13,306
alibaba.com7,136 boostmobile.com2,487
sears.com659 membershipwireless.com457

In order to restrict the obtained results, product categories and a list of 10 products per category were specified. The table below shows the distribution of the retrieved pages after applying the above-mentioned filters.

Number of Pages per Product Category

Mobile Phones   1,246,213

Markup Data of the Crawled Corpus

It has been shown that markup data like Microdata and RDFa are becoming the standard for embedding various semantics of data items. In order to exploit the Microdata markup in this corpus, the data derived from the crawl were profiled on the basis of markup data. The following table gives an overview of the top domains, classes and properties as observed in the markup entities of the crawled corpus.

Statistics on the Markup Data of the Crawled Corpus

Distinct PLDs with markup data 29
Total Triples 26,184,401
Total Entities 5,636,972
Top Domains by Extracted Triples Show top values
Top Classes by Entities Show top values
Top Properties Show top values
Detailed Statistics as Excel file   CorpusStatistics.xlsx (40kb)

Product Corpus and extracted Microdata downloads

Download access to the product corpus, its general statistics and the extracted Microdata as N-Quads from the general crawl can be found at the following link. Similarly, we provide the product corpus from the keyword search crawl here.

6. Baselines

Baselines for Product Feature Extraction


The approach makes use of the properties which are contained for the different products in the product catalog described in Section 2. From these property names, we generate a dictionary, which we then apply to all web pages in the gold standard. This means that whenever the name of a feature from the catalog occurs on a web page, we extract it as a feature of the product.
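The dictionary lookup can be sketched as follows (a simplified illustration; the attribute names below are hypothetical, while the real dictionary is built from the catalog property names):

```python
# Hypothetical dictionary built from catalog property names.
dictionary = {"display size", "battery capacity", "camera"}

def extract_features(page_text: str) -> set:
    """Return every dictionary entry that occurs verbatim
    (case-insensitively) in the page text."""
    text = page_text.lower()
    return {name for name in dictionary if name in text}

page = "Specs: Display Size 4.7 in, Camera 8 MP, weight 129 g"
print(sorted(extract_features(page)))  # ['camera', 'display size']
```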


We applied the dictionary method described above to each part of our gold standard: (1) title and description from the markup data, (2) specification tables, and (3) specification lists; all shown in the table below. The results show that the dictionary method does not perform well on any part of the gold standard and that improvement is needed. The reason for the poor performance lies in the disparity of values coming from different vendors. For instance, display sizes in our catalog are given in inches, whereas some of the vendors use the metric system for that measure.

Markup Data
Table Specification
List Specification

Baselines for Product Matching

We experiment on 3 datasets:

Common text preprocessing techniques are applied in order to reduce the number of tokens of every textual input, resulting in filtered, content-richer unstructured and structured product specifications which eventually contribute better to the identity resolution task.

The preprocessing starts with the removal of HTML tags, which do not contribute to the content of the specification. As a second step, tokenization by non-alphanumeric characters is performed. Thirdly, the textual characters are converted to lowercase and stemmed using a Porter stemming filter. Finally, stopwords are removed. It should be mentioned that all steps are applied equally to the unstructured and structured product specifications, apart from the first step, which is only performed for the unstructured specifications contained in the HTML pages.
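The four steps can be sketched as a small pipeline (a dependency-free sketch: the stopword list is a tiny illustrative subset, and the Porter stemmer, in practice taken from a library such as NLTK, is stood in for by an identity function):

```python
import re

# Hypothetical minimal stopword list; the real pipeline uses a full one.
STOPWORDS = {"the", "and", "with", "a", "of"}

def stem(token: str) -> str:
    return token  # placeholder for a Porter stemmer (e.g. nltk.stem.PorterStemmer)

def preprocess(html: str) -> list:
    text = re.sub(r"<[^>]+>", " ", html)              # 1. strip HTML tags
    tokens = re.split(r"[^0-9a-zA-Z]+", text)         # 2. tokenize on non-alphanumerics
    tokens = [stem(t.lower()) for t in tokens if t]   # 3. lowercase + stem
    return [t for t in tokens if t not in STOPWORDS]  # 4. remove stopwords

print(preprocess("<b>Apple iPhone 6</b> with 64GB of storage"))
# ['apple', 'iphone', '6', '64gb', 'storage']
```

Step 1 would be skipped for the structured specifications, matching the description above.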


We experiment with 3 different variable parameters. We take all n-grams (n ∈ [1, 2, 3]) and test on each of them separately. Additionally, we relax the string matching constraints by calculating the Levenshtein distance between potential matches. Lastly, we apply maximum and minimum frequency pruning.
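The first two parameters can be illustrated with short standard-library sketches of n-gram generation and edit distance:

```python
def ngrams(tokens, n):
    """All consecutive n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(ngrams(["apple", "iphone", "6"], 2))  # [('apple', 'iphone'), ('iphone', '6')]
print(levenshtein("iphone", "ipone"))       # 1
```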

Similarity Methods

There are three main similarity methods involved in the experiments: exact string matching, Jaccard similarity and cosine similarity. On top of the implemented similarity measures, the Levenshtein distance can be calculated.
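Jaccard and cosine similarity over preprocessed token lists can be sketched as follows (a minimal illustration; the example tokens are hypothetical):

```python
import math
from collections import Counter

def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B| over token sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cosine(a: list, b: list) -> float:
    """Cosine similarity over raw term-frequency vectors."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

x = ["apple", "iphone", "6", "64gb"]
y = ["apple", "iphone", "5"]
print(round(jaccard(set(x), set(y)), 2))  # 0.4
print(round(cosine(x, y), 2))             # 0.58
```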

Feature Extraction

We implement 2 feature extraction methods:

In order to evaluate the implemented similarity methods, we measure precision, recall and F1. For this we first need to define a threshold. The threshold selection is done by brute-force search for the best F1 score over the interval [0, 1] using a step size of 0.001.
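The brute-force threshold search can be sketched as follows (the scores and gold labels below are hypothetical; pairs scoring at or above the threshold are predicted as matches):

```python
def best_threshold(scores, labels, step=0.001):
    """Sweep [0, 1] in the given step size and return the threshold
    with the highest F1, together with that F1."""
    best_t, best_f1 = 0.0, -1.0
    t = 0.0
    while t <= 1.0:
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l)
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and not l)
        fn = sum(1 for s, l in zip(scores, labels) if s < t and l)
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        if f > best_f1:
            best_t, best_f1 = t, f
        t = round(t + step, 3)  # rounding to 3 decimals matches step=0.001
    return best_t, best_f1

# Hypothetical similarity scores with gold match labels.
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [True, True, True, False, False]
t, f = best_threshold(scores, labels)
print(t, f)
```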


Below we show the best results for both the BOW and Dictionary approaches. All results are available for download in the respective tables.

BOW results


Dictionary results


7. Extraction of Attribute-Value Pairs from Product Specifications

Automatically extracting specifications from HTML pages is not a trivial task. Qiu et al. [3] have provided an analysis of product specifications and concluded that technical specifications can be contained in different HTML structures, but are primarily found in tables and lists. Similar to Qiu et al. [3], we employ a two-step approach: first, we detect specification tables and lists in HTML pages; afterwards, we extract attribute-value pairs from the detected product specifications. We start by training a model for the detection of specification tables and lists. Next, we apply that model to the web pages. Subsequently, we use a sample of the detected specifications to learn a model for attribute and value column detection. Finally, with the learned model we extract attribute-value pairs from the specifications.

Specification Detection

As stated above, product specifications are mostly found in tables and lists. Considering that HTML tables and lists have a different base structure, we train separate classifiers for detecting specification tables and lists. The models are trained by learning a binary classifier that classifies tables/lists into specification and non-specification. The classifiers use the following features, which have also been used previously by Qiu et al. [3]: average text length per row, number of rows, overall frequency of the word "specification", number of links, number of images, and standard deviation of text size. In addition to these features, we introduce the following new features in order to improve the detection accuracy: average DOM node depth of the items relative to the root, average number of columns, standard deviation of columns, maximum number of columns, number of non-table/list tags, average ratio between numerical and alphabetical characters in a cell, and maximum number of rows. In order to illustrate the correlation of these features with the target attribute (specification vs. non-specification), the figure below shows summary statistics about the evaluation dataset that we use. As is evident from the figure, there is a clear difference between specifications and non-specifications when considering these features. For instance, when considering "standard deviation of columns", specification tables hardly deviate from a given layout, while non-specification tables deviate from the layout much more. Another interesting feature is the "number of non-table/list tags", where non-specification structures contain many more tags like "<p>" and "<span>" than the specification ones. With that said, it is intuitive that the binary classifier can learn a better model if both feature sets are used.
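A few of the structural features listed above can be sketched for a table given as rows of cell strings (a simplified illustration; HTML parsing and the DOM-based features are omitted):

```python
import statistics

def table_features(rows):
    """Compute a handful of the detection features for a table
    represented as a list of rows, each a list of cell strings."""
    col_counts = [len(r) for r in rows]
    text_lens = [sum(len(c) for c in r) for r in rows]
    return {
        "num_rows": len(rows),
        "avg_text_len_per_row": statistics.mean(text_lens),
        "avg_columns": statistics.mean(col_counts),
        "std_columns": statistics.pstdev(col_counts),
        "max_columns": max(col_counts),
        "has_specification_word": any(
            "specification" in c.lower() for r in rows for c in r),
    }

spec_table = [["Display", "4.7 in"], ["Battery", "1810 mAh"], ["Weight", "129 g"]]
print(table_features(spec_table))
```

Note how the regular two-column layout of the example yields a column standard deviation of zero, the behavior described above for specification tables.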

Figure 2: Specification vs. Non-specification table/list features

Specification Extraction

Qiu et al. [3] apply straightforward heuristics to extract attribute-value pairs from product specifications. Specifically, for tables they use a heuristic that, for each table row, extracts the first cell as the attribute name and the remaining cells as concatenated values. The main limitation of this heuristic is that it cannot handle tables that have more than two attribute name columns. Namely, instead of yielding a correct result as in Figure 3, this heuristic will treat the values in the third column as attribute values, which is obviously incorrect. To solve this issue, we train a model which can detect whether a given column is an attribute or a value column. Much like the detection model, we train a binary classifier with the following features: average text length per cell, average text length per column, number of non-table tags per column, standard deviation of text size per column, and ratio between alphabetical and numerical characters in a cell. Taking into account that lists follow a different structure (no columns), we convert list items to columns by separating the items by a delimiter and organising the separation result into columns. We consider the common delimiter characters ":" and ";". We do not use delimiters like ",", "-", "/" and "\" since they might be part of product identification numbers like MPNs. After the conversion is done, we are able to train the same model as described above. However, like in [3], this approach falls short when the lists do not contain any delimiter. Since the percentage of specifications without delimiters is less than 8%, we do not pursue a solution for this case. After the columns have been classified, we continue with pairing attribute and value columns, after which each attribute and value item, row-wise, is considered a pair. The pairing of columns is done left to right: starting from the leftmost columns, we pair the first attribute and value columns and continue to the right. In the case of consecutive attribute or value columns, we concatenate them. Figure 3 shows an example of a table with tagged attribute and value columns, where the first and second columns constitute the first attribute-value pairing, while the third and fourth columns constitute the second attribute-value pairing.
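The left-to-right pairing with concatenation of consecutive same-type columns can be sketched as follows (a minimal illustration with hypothetical cell values; the column labels are assumed to come from the classifier):

```python
def pair_columns(col_types, rows):
    """Pair classified columns left to right into attribute-value pairs.
    col_types holds "attr" or "value" per column; runs of consecutive
    same-type columns are concatenated before pairing."""
    # 1. merge runs of consecutive same-type columns
    groups = []  # list of (type, list of column indices)
    for i, t in enumerate(col_types):
        if groups and groups[-1][0] == t:
            groups[-1][1].append(i)
        else:
            groups.append((t, [i]))
    # 2. walk left to right, pairing each attr group with the next value group
    pairs = []
    for (t1, c1), (t2, c2) in zip(groups, groups[1:]):
        if t1 == "attr" and t2 == "value":
            for row in rows:
                attr = " ".join(row[i] for i in c1)
                value = " ".join(row[i] for i in c2)
                pairs.append((attr, value))
    return pairs

rows = [["Display", "4.7 in", "Battery", "1810 mAh"]]
print(pair_columns(["attr", "value", "attr", "value"], rows))
# [('Display', '4.7 in'), ('Battery', '1810 mAh')]
```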

Figure 3: Example of extracted Product Specification Table

Details on the methodology and the results are presented by Petrovski and Bizer, available here.

Additionally, we provide the code, models and training data at the following repository: Attribute-Value Pairs Extraction. Moreover, we provide extracted product attributes from specification tables and lists with aligned attribute names to the ones in the product catalog (available for download here).

8. References

  1. Petar Petrovski, Anna Primpeli, Robert Meusel, Christian Bizer: Product Data Corpus and Gold Standard for Product Feature Extraction and Matching. 17th International Conference on Electronic Commerce and Web Technologies, Porto, Portugal, September 2016. (To Appear)
  2. Robert Meusel, Anna Primpeli, Christian Meilicke, Heiko Paulheim, Christian Bizer: Exploiting Microdata Annotations to Consistently Categorize Product Offers at Web Scale. 16th International Conference on Electronic Commerce and Web Technologies.
  3. Disheng Qiu, Luciano Barbosa, Xin Luna Dong, Yanyan Shen, Divesh Srivastava: DEXTER: Large-Scale Discovery and Extraction of Product Specifications on the Web. Proceedings of the VLDB Endowment, Vol. 8, No. 13, 2015.
  4. Hanna Köpcke, Andreas Thor, Stefan Thomas, Erhard Rahm: Tailoring Entity Resolution for Matching Product Offers. Proceedings of the 15th International Conference on Extending Database Technology (EDBT), 2012.
  5. Hanna Köpcke, Andreas Thor, Erhard Rahm: Evaluation of Entity Resolution Approaches on Real-World Match Problems. Proceedings of the VLDB Endowment, Vol. 3, No. 1-2, 2010.
  6. Petar Ristoski, Peter Mika: Enriching Product Ads with Metadata from HTML Annotations. 13th Extended Semantic Web Conference.