Generating a Product Data Catalog out of the Web
Petar Petrovski
Anna Primpeli
Robert Meusel
Christian Bizer

In order to support the evaluation and comparison of product feature extraction and product matching methods, we have manually created two public gold standards for these tasks. The gold standards comprise several hundred products from the categories mobile phones, TVs, and headphones.



1. About the WDC Gold Standard for Product Matching and Product Feature Extraction

Finding out which e-shops offer a specific product is a central challenge for building integrated product catalogs and comparison shopping portals. Determining whether two offers refer to the same product involves extracting a set of features (product attributes) from the web pages containing the offers and comparing these features using a matching function. The existing gold standards [5, 6] for product matching have two shortcomings: (i) they only contain offers from a small number of e-shops and thus do not properly cover the heterogeneity that is found on the Web; (ii) they only provide a small number of generic product attributes and therefore cannot be used to evaluate whether detailed product attributes have been correctly extracted from textual product descriptions.

To overcome these shortcomings, we introduce the WDC gold standard for product matching and product feature extraction, which packages the following artifacts:

In addition, we provide a Product Data Corpus for public download containing over 13 million product-related web pages retrieved from the same 32 websites. This corpus might be useful as background knowledge for the semi-supervised training of feature extraction and matching methods.

2. Product Selection

We used the 32 most frequently visited shopping web sites based on the ranking provided by Alexa. For each of the three chosen product categories, we first collected the ten most popular products from the different web sites. We further complemented this list with similar products (based on their name). As an example, we found the product Apple iPhone 6 64GB to be one of the most popular among all shopping web sites; we therefore also included the products Apple iPhone 6 128GB and Apple iPhone 5 in our product catalog. Especially for the product matching task, this methodology introduces a certain level of complexity, as the product names differ by only one or two characters. In total, we selected 50 different products for each product category.

3. Gold Standard for Feature Extraction

We labeled 4 distinct structural units from the HTML pages: (1) Microdata title, (2) Microdata description, (3) HTML tables, and (4) HTML lists. The labeled set comprises 500 product entities with 338 distinct labeled properties in total. It was created by three different annotators. In order to ensure the validity of the annotation process, the inter-annotator agreement was calculated: out of 1,500 labeled properties, 75 disagreements were found, so the inter-annotator agreement is estimated to be 95%. The table below shows the distribution of labeled entities per category. The product entities were labeled as JSON objects, for which a documentation of property meaning and usage is available here. Additionally, the files containing the labeled entities are provided per category and in total for further use.
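The agreement figure can be reproduced with a simple calculation (a minimal sketch; the counts are those stated above):

```python
# Inter-annotator agreement as the share of labeled properties
# on which the annotators did not disagree (counts from the text).
total_labeled = 1500
disagreements = 75

agreement = 1 - disagreements / total_labeled
print(f"Inter-annotator agreement: {agreement:.0%}")  # 95%
```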

Category        #labeled Products   File with labeled Products
Mobile Phones   212                 phones_labeled_entities.json (603kb)

General statistics per PLD, Web Page and Totals for each category

Statistics per PLD are available by clicking here.

The charts below show the PLD usage for the top 12 most used properties in the gold standard for each product category. More detailed statistics on all properties before mapping can be found in the following file (UnMappedProps_Statistics.xlsx). The file includes the feature names of tables and lists and their frequency per product category as used by the different commercial websites, as well as the distribution of PLDs over the feature names per product category.

Statistics about product attributes within tables and lists per web page are available by clicking here.

The charts below show the usage of attributes per product per product category within lists and tables.

Statistics on the total number of properties labeled are available by clicking here.

Among the total number of defined properties, we observed some that were used rather frequently and others that were used by only a few websites. The charts below present the most frequent properties per product category as found in titles, descriptions, tables and lists. Please note that the presented results do not include the tagline property, which is used in cases where a word in a title or description did not reveal any specification and thus could not be assigned any property value.

4. Gold Standard for Product Matching

Product Catalog

Feature Enrichment: In order to complement the products with features, for each of the 150 products in our product catalog we obtained product-specific features from the manufacturer's web site or from shopping service web sites like Google Shopping. We did not extract features from individual vendor web sites, such as Amazon, as we want to focus on product-specific features rather than on features arising from bundles offered by specific vendors.

Figure 1 shows two example pages which we used to manually extract the features for our product catalog. Figure 1 (left) depicts a page from Google Shopping and Figure 1 (right) a page from the manufacturer's web site.

Figure 1: (left) Features annotated by Google Shopping and (right) features annotated by the manufacturer

The table below shows the scraped attributes per product category.

Defined Properties per Product Category

Category        #Properties   List of Properties                    Downloads
Headphones      36            Show Properties for Headphones        HeadphoneCatalog.json (22kb)
Mobile Phones   32            Show Properties for Mobile Phones     PhoneCatalog.json (38kb)
Televisions     76            Show Properties for Televisions       TVCatalog.json (38kb)


We manually generated 1,500 positive correspondences, 500 for each product category (the data are available here). For each product of the product catalog, at least one positive correspondence is included. Additionally, to make the matching task more realistic, the annotators also annotated products closely related to the ones in the product catalog, such as phone cases, TV wall mounts, headphone cables, ear-buds, etc. Furthermore, we created additional negative correspondences by exploiting transitive closure: as all products in the product catalog are distinct, whenever a positive correspondence exists between a product description on a web page and a product in the catalog, we can generate a negative correspondence between that web page and every other product in the catalog. Using the two approaches, we ended up with 73,500 negative correspondences. The full list of correspondences can be downloaded here.
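The transitive-closure construction can be sketched as follows (a minimal illustration; the page and product ids are hypothetical):

```python
# If a web page offer is a positive match for one catalog product,
# it is a negative match for every other (distinct) catalog product.
catalog = ["iphone6_64gb", "iphone6_128gb", "iphone5"]

# Hypothetical positive correspondences: page id -> matched catalog product.
positive_pairs = {
    "page_001": "iphone6_64gb",
    "page_002": "iphone5",
}

negative_pairs = [
    (page, product)
    for page, matched in positive_pairs.items()
    for product in catalog
    if product != matched
]
print(negative_pairs)
```

With 150 distinct catalog products, each positive correspondence thus yields 149 negatives, which explains the large ratio of negative to positive correspondences.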

Figure 2 depicts the number of positive correspondences which are contained for each product from the three different product categories.

Figure 2: Distribution of correspondences per product category

Download Page Subset

We also provide the pages included in the gold standard for download. Each page is named by the id of the product given in the data.

5. Product Data Corpus

The product crawl follows two main criteria: the diversity of its origin and its quantity. This means that the chosen product specifications derive from different websites which use different schemata to express product specifications for the same product category. Furthermore, the data should include a comprehensive amount of the products and offers provided by the selected websites.

For crawling data while satisfying these criteria, we used the crawling and scraping framework Scrapy. The crawler was restricted to retrieving data from 32 different PLDs, which were chosen based on their containment of marked-up annotations as well as their traffic rankings as reported by Alexa. The following table shows the distribution of pages included in the crawl per PLD.
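The actual crawl was performed with Scrapy; the scoping rule that keeps a crawler on the selected PLDs can be sketched with the standard library alone (the PLD subset below is illustrative, not the full list of 32):

```python
from urllib.parse import urlparse

# Hypothetical subset of the 32 selected PLDs.
ALLOWED_PLDS = {"target.com", "ebay.com", "apple.com"}

def in_scope(url: str) -> bool:
    """Return True if the URL's host belongs to one of the selected
    PLDs (simplified check: the domain itself or any subdomain)."""
    host = urlparse(url).netloc.lower()
    return any(host == pld or host.endswith("." + pld) for pld in ALLOWED_PLDS)

print(in_scope("http://www.target.com/p/iphone-6"))  # True
print(in_scope("http://example.org/iphone-6"))       # False
```

In Scrapy itself, the same restriction is commonly expressed via a spider's `allowed_domains` setting.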

Number of Pages per PLD in the Crawled Corpus

target.com2,007,121 shop.com1,754,368
microcenter.com459,921 aliexpress.com431,005
ebay.com413,144 macmall.com401,944
apple.com391,539 bestbuy.com389,146
techspot.com386,273 techforless.com361,234
overstock.com347,846 searsoutlet.com341,924
pcrush.com292,904 tesco.com222,802
frontierpc.com187,184 abt.com115,539
ipkart.com63,112 conns.com61,158
costco.com54,274 dhgate.com50,099
shop.lenovo.com41,465 bjs.com40,930
newegg.com37,393 microsoftstore.com24,163
samsclub.com22,788 tomtop.com13,306
alibaba.com7,136 boostmobile.com2,487
sears.com659 membershipwireless.com457

In order to restrict the obtained results, product categories and a list of 10 products per category were specified. The table below shows the distribution of the retrieved pages after applying the above-mentioned filters.

Number of Pages per Product Category

Mobile Phones   1,246,213

Markup Data of the Crawled Corpus

It has been shown that markup data like Microdata and RDFa are becoming the standard for embedding various semantics of data items. In order to exploit the Microdata markup in this corpus, the data derived from the crawl were profiled on the basis of markup data. The following table gives an overview of the top domains, classes and properties as observed in the markup entities of the crawled corpus.

Statistics on the Markup Data of the Crawled Corpus

Distinct PLDs with markup data 29
Total Triples 26,184,401
Total Entities 5,636,972
Top Domains by Extracted Triples Show top values
Top Classes by Entities Show top values
Top Properties Show top values
Detailed Statistics as Excel file   CorpusStatistics.xlsx (40kb)

Product Corpus and extracted Microdata downloads

Download access to the product corpus, its general statistics and the extracted Microdata as N-Quads from the general crawl can be found at the following link. Similarly, we provide the product corpus from the keyword search crawl here.

6. Baselines

Baselines for Product Feature Extraction


The approach makes use of the properties which are contained for the different products in the product catalog described in Section 2. From these property names, we generate a dictionary, which we then apply to all web pages in the gold standard. This means that whenever the name of a feature from the catalog occurs on a web page, we extract it as a feature of the product.
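The dictionary lookup can be sketched as follows (a simplified illustration; the attribute names below are hypothetical, while the real dictionary is built from the catalog property names):

```python
# Hypothetical dictionary built from catalog property names.
dictionary = {"display size", "battery capacity", "camera"}

def extract_features(page_text: str) -> set:
    """Return every dictionary entry that occurs verbatim
    (case-insensitively) in the page text."""
    text = page_text.lower()
    return {name for name in dictionary if name in text}

page = "Specs: Display Size 4.7 in, Camera 8 MP, weight 129 g"
print(sorted(extract_features(page)))  # ['camera', 'display size']
```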


We applied the dictionary method described above to each part of our gold standard: (1) title and description from the markup data, (2) specification tables, and (3) specification lists; all shown in the table below. The results show that the dictionary method does not perform well on any part of the gold standard and that improvement is needed. The reason for the poor performance lies in the disparity of values coming from different vendors. For instance, display sizes in our catalog are given in inches, whereas some of the vendors use the metric system for that measure.

Markup Data
Table Specification
List Specification

Baselines for Product Matching

We experiment on 3 datasets:

Common text preprocessing techniques are applied in order to reduce the number of tokens of every textual input, resulting in filtered, content-richer unstructured and structured product specifications which eventually contribute better to the identity resolution task.

The preprocessing starts with the removal of HTML tags, which do not contribute to the content of the specification. As a second step, tokenization by non-alphanumeric characters is performed. Thirdly, the textual characters are converted to lowercase and stemmed using a Porter stemming filter. Finally, stopwords are removed. It should be mentioned that all steps are applied equally to the unstructured and structured product specifications, apart from the first step, which is only performed for the unstructured specifications contained in the HTML pages.
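The four steps can be sketched as a small pipeline (a dependency-free sketch: the stopword list is a tiny illustrative subset, and the Porter stemmer, in practice taken from a library such as NLTK, is stood in for by an identity function):

```python
import re

# Hypothetical minimal stopword list; the real pipeline uses a full one.
STOPWORDS = {"the", "and", "with", "a", "of"}

def stem(token: str) -> str:
    return token  # placeholder for a Porter stemmer (e.g. nltk.stem.PorterStemmer)

def preprocess(html: str) -> list:
    text = re.sub(r"<[^>]+>", " ", html)              # 1. strip HTML tags
    tokens = re.split(r"[^0-9a-zA-Z]+", text)         # 2. tokenize on non-alphanumerics
    tokens = [stem(t.lower()) for t in tokens if t]   # 3. lowercase + stem
    return [t for t in tokens if t not in STOPWORDS]  # 4. remove stopwords

print(preprocess("<b>Apple iPhone 6</b> with 64GB of storage"))
# ['apple', 'iphone', '6', '64gb', 'storage']
```

Step 1 would be skipped for the structured specifications, matching the description above.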


We experiment with 3 different variable parameters. We take all n-grams (n ∈ [1, 2, 3]) and test on each of them separately. Additionally, we relax the string matching constraints by calculating the Levenshtein distance between potential matches. Lastly, we apply maximum and minimum frequency pruning.
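The first two parameters can be illustrated with short standard-library sketches of n-gram generation and edit distance:

```python
def ngrams(tokens, n):
    """All consecutive n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(ngrams(["apple", "iphone", "6"], 2))  # [('apple', 'iphone'), ('iphone', '6')]
print(levenshtein("iphone", "ipone"))       # 1
```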

Similarity Methods

There are three main similarity methods involved in the experiments: exact string matching, Jaccard similarity and cosine similarity. On top of the implemented similarity measures, the Levenshtein distance can be calculated.
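Jaccard and cosine similarity over preprocessed token lists can be sketched as follows (a minimal illustration; the example tokens are hypothetical):

```python
import math
from collections import Counter

def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B| over token sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cosine(a: list, b: list) -> float:
    """Cosine similarity over raw term-frequency vectors."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

x = ["apple", "iphone", "6", "64gb"]
y = ["apple", "iphone", "5"]
print(round(jaccard(set(x), set(y)), 2))  # 0.4
print(round(cosine(x, y), 2))             # 0.58
```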

Feature Extraction

We implement 2 feature extraction methods:

In order to evaluate the implemented similarity methods, we measure precision, recall and F1. For this we first need to define a threshold. The threshold selection is done by brute-force search for the best F1 score over the interval [0, 1] using a step size of 0.001.
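The brute-force threshold search can be sketched as follows (the scores and gold labels below are hypothetical; pairs scoring at or above the threshold are predicted as matches):

```python
def best_threshold(scores, labels, step=0.001):
    """Sweep [0, 1] in the given step size and return the threshold
    with the highest F1, together with that F1."""
    best_t, best_f1 = 0.0, -1.0
    t = 0.0
    while t <= 1.0:
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l)
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and not l)
        fn = sum(1 for s, l in zip(scores, labels) if s < t and l)
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        if f > best_f1:
            best_t, best_f1 = t, f
        t = round(t + step, 3)  # rounding to 3 decimals matches step=0.001
    return best_t, best_f1

# Hypothetical similarity scores with gold match labels.
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [True, True, True, False, False]
t, f = best_threshold(scores, labels)
print(t, f)
```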


Below we show the best results for both the BOW and Dictionary approaches. All results are available for download in the respective tables.

BOW results


Dictionary results


7. Extraction of Attribute-Value Pairs from Product Specifications

Automatically extracting specifications from HTML pages is not a trivial task. Qiu et al. [3] have provided an analysis of product specifications and concluded that technical specifications can be contained in different HTML structures, but are primarily found in tables and lists. Similar to Qiu et al. [3], we employ a two-step approach: first, we detect specification tables and lists in HTML pages; afterwards, we extract attribute-value pairs from the detected product specifications. We start by training a model for the detection of specification tables and lists. Next, we apply that model to the web pages. Subsequently, we use a sample of the detected specifications to learn a model for attribute and value column detection. Finally, with the learned model we extract attribute-value pairs from the specifications.

Specification Detection

As stated above, product specifications are mostly found in tables and lists. Considering that HTML tables and lists have a different base structure, we train separate classifiers for detecting specification tables and lists. The models are trained by learning a binary classifier that classifies tables/lists into specification and non-specification. The classifiers use the following features, which have also been used previously by Qiu et al. [3]: average text length per row, number of rows, overall frequency of the word "specification", number of links, number of images, and standard deviation of text size. In addition to these features, we introduce the following new features in order to improve the detection accuracy: average DOM node depth of the items relative to the root, average number of columns, standard deviation of columns, maximum number of columns, number of non-table/list tags, average ratio between numerical and alphabetical characters in a cell, and maximum number of rows. In order to illustrate the correlation of these features with the target attribute (specification vs. non-specification), the figure below shows summary statistics about the evaluation dataset that we use. As is evident from the figure, there is a clear difference between specifications and non-specifications when considering these features. For instance, when considering "standard deviation of columns", specification tables hardly deviate from a given layout, while non-specification tables deviate from the layout much more. Another interesting feature is the "number of non-table/list tags", where non-specification structures contain many more tags like "<p>" and "<span>" than the specification ones. With that said, it is intuitive that the binary classifier can learn a better model if both feature sets are used.
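A few of the structural features listed above can be sketched for a table given as rows of cell strings (a simplified illustration; HTML parsing and the DOM-based features are omitted):

```python
import statistics

def table_features(rows):
    """Compute a handful of the detection features for a table
    represented as a list of rows, each a list of cell strings."""
    col_counts = [len(r) for r in rows]
    text_lens = [sum(len(c) for c in r) for r in rows]
    return {
        "num_rows": len(rows),
        "avg_text_len_per_row": statistics.mean(text_lens),
        "avg_columns": statistics.mean(col_counts),
        "std_columns": statistics.pstdev(col_counts),
        "max_columns": max(col_counts),
        "has_specification_word": any(
            "specification" in c.lower() for r in rows for c in r),
    }

spec_table = [["Display", "4.7 in"], ["Battery", "1810 mAh"], ["Weight", "129 g"]]
print(table_features(spec_table))
```

Note how the regular two-column layout of the example yields a column standard deviation of zero, the behavior described above for specification tables.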

Figure 2: Specification vs. Non-specification table/list features

Specification Extraction

Qiu et al. [3] apply straightforward heuristics to extract attribute-value pairs from product specifications. Specifically, for tables they use a heuristic that, for each table row, extracts the first cell as the attribute name and the remaining cells as concatenated values. The main limitation of this heuristic is that it cannot handle tables that have more than two attribute name columns. Namely, instead of yielding a correct result as in Figure 3, this heuristic will treat the values in the third column as attribute values, which is obviously incorrect. To solve this issue, we train a model which can detect whether a given column is an attribute or a value column. Much like the detection model, we train a binary classifier with the following features: average text length per cell, average text length per column, number of non-table tags per column, standard deviation of text size per column, and ratio between alphabetical and numerical characters in a cell. Taking into account that lists follow a different structure (no columns), we convert list items to columns by separating the items by a delimiter and organising the separation result into columns. We consider the common delimiter characters ":" and ";". We do not use delimiters like ",", "-", "/" and "\" since they might be part of product identification numbers like MPNs. After the conversion is done, we are able to train the same model as described above. However, like in [3], this approach falls short when the lists do not contain any delimiter. Since the percentage of specifications without delimiters is less than 8%, we do not pursue a solution for this case. After the columns have been classified, we continue with pairing attribute and value columns, after which each attribute and value item, row-wise, is considered a pair. The pairing of columns is done left to right: starting from the leftmost columns, we pair the first attribute and value columns and continue to the right. In the case of consecutive attribute or value columns, we concatenate them. Figure 3 shows an example of a table with tagged attribute and value columns, where the first and second columns constitute the first attribute-value pairing, while the third and fourth columns constitute the second attribute-value pairing.
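The left-to-right pairing with concatenation of consecutive same-type columns can be sketched as follows (a minimal illustration with hypothetical cell values; the column labels are assumed to come from the classifier):

```python
def pair_columns(col_types, rows):
    """Pair classified columns left to right into attribute-value pairs.
    col_types holds "attr" or "value" per column; runs of consecutive
    same-type columns are concatenated before pairing."""
    # 1. merge runs of consecutive same-type columns
    groups = []  # list of (type, list of column indices)
    for i, t in enumerate(col_types):
        if groups and groups[-1][0] == t:
            groups[-1][1].append(i)
        else:
            groups.append((t, [i]))
    # 2. walk left to right, pairing each attr group with the next value group
    pairs = []
    for (t1, c1), (t2, c2) in zip(groups, groups[1:]):
        if t1 == "attr" and t2 == "value":
            for row in rows:
                attr = " ".join(row[i] for i in c1)
                value = " ".join(row[i] for i in c2)
                pairs.append((attr, value))
    return pairs

rows = [["Display", "4.7 in", "Battery", "1810 mAh"]]
print(pair_columns(["attr", "value", "attr", "value"], rows))
# [('Display', '4.7 in'), ('Battery', '1810 mAh')]
```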

Figure 3: Example of extracted Product Specification Table

Details on the methodology and the results are presented by Petrovski and Bizer, available here.

Additionally, we provide the code, models and training data at the following repository: Attribute-Value Pairs Extraction. Moreover, we provide extracted product attributes from specification tables and lists with aligned attribute names to the ones in the product catalog (available for download here).

8. References

  1. Petar Petrovski, Anna Primpeli, Robert Meusel, Christian Bizer: Product Data Corpus and Gold Standard for Product Feature Extraction and Matching. 17th International Conference on Electronic Commerce and Web Technologies, Porto, Portugal, September 2016. (To Appear)
  2. Robert Meusel, Anna Primpeli, Christian Meilicke, Heiko Paulheim, Christian Bizer: Exploiting Microdata Annotations to Consistently Categorize Product Offers at Web Scale. 16th International Conference on Electronic Commerce and Web Technologies.
  3. Disheng Qiu, Luciano Barbosa, Xin Luna Dong, Yanyan Shen, Divesh Srivastava: DEXTER: Large-Scale Discovery and Extraction of Product Specifications on the Web. Proceedings of the VLDB Endowment, Vol. 8, No. 13, 2015.
  4. Hanna Köpcke, Andreas Thor, Stefan Thomas, Erhard Rahm: Tailoring Entity Resolution for Matching Product Offers. Proceedings of the 15th International Conference on Extending Database Technology (EDBT), 2012.
  5. Hanna Köpcke, Andreas Thor, Erhard Rahm: Evaluation of Entity Resolution Approaches on Real-World Match Problems. Proceedings of the VLDB Endowment, Vol. 3, No. 1-2, 2010.
  6. Petar Ristoski, Peter Mika: Enriching Product Ads with Metadata from HTML Annotations. 13th Extended Semantic Web Conference.