This page provides the WDC Categorization Gold Standard for public download. The gold standard consists of more than 24,000 manually labelled product offers from different e-shops. The offers are assigned to a flat categorization schema consisting of 25 product categories. We also present the results of a baseline categorization experiment in which we train and evaluate an ensemble of one-vs-rest classifiers. We apply the learned classifier to the WDC Training Dataset for Large-scale Product Matching in order to obtain a consistent categorization of all 26 million product offers contained in the dataset, which we also provide for download.
- 1. Motivation
- 2. Gold Standard Creation
- 3. Gold Standard Profiling
- 4. Baseline Experiments
- 5. Corpus Categorization
- 6. Download
- 7. Feedback
1. Motivation
Categorizing product offers from different e-shops into a single categorization schema is a task faced by many aggregators in e-commerce. The WDC Categorization Gold Standard allows the comparison of learning-based categorization methods on this task. The gold standard is based on the WDC Training Set for Large-scale Product Matching, which contains offers from 79 thousand websites. The offers are marked up on the websites using the schema.org product vocabulary. We created the WDC Categorization Gold Standard by manually categorizing a subset of the offers in the training set into a flat, non-overlapping schema of 25 top-level product categories. We labelled at least 500 offers for each category.
In the following, we describe the methodology that was used to create the categorization gold standard, as well as statistics about the dataset and baseline experiments. Finally, we present statistics about the results of applying the learned classifier to the whole WDC product data corpus.
2. Gold Standard Creation
To create a gold standard that contains a sufficient amount of offers per category to train and evaluate a classification model that performs effectively on all categories, we manually labelled offers from the English Training Corpus.
First, we defined a set of categories with the goal of creating a taxonomy that is comparable to other relevant e-commerce category taxonomies. In order to create such a category set, the Amazon, Google and UNSPSC taxonomies were compared and the overlapping first-level categories were identified. Additionally, some second-level categories were used in the electronics and clothing domains to create a categorization that is not too broad, resulting in a set of 25 representative categories.
To find a roughly equal number of different offers for each category, the results of the transfer learning categorization, which served as a first categorization experiment on the initial gold standard, were manually verified (for details about the transfer learning approach please refer to the WDC Product Matching website). More specifically, for each category, the offers assigned to that category by the transfer learning approach, as well as the offers in the same cluster, were reviewed and, if they were correctly classified, all offers in the cluster were annotated with the class label. The schema.org properties name, title, description, brand and manufacturer were used to identify the correct category. Offers that did not fit into any of the defined categories were labelled 'Others', and offers for which the category was unclear due to too little information or non-English attributes were labelled 'not found'.
For some categories, the transfer learning approach labelled only a few offers. In order to obtain more offers belonging to these categories, a keyword search was applied using words specific to the respective domain. Only clusters containing fewer than 80 offers were selected in order to reduce noise. Furthermore, clusters containing more than one offer were preferred, and it was ensured that each offer contained at least a title or a name. Finally, the categories in the gold standard that had been created for the first experiments were mapped to the defined set of categories, and the resulting data was added to the labelled data. This initial gold standard contains 985 offers, categorized into 24 categories with a very imbalanced category distribution. Details about the distribution of the offers over the categories in the initial gold standard can be found on the WDC Product Matching website.
3. Gold Standard Profiling
The final categorization gold standard contains at least 50 clusters for each category and 2115 clusters in total. The clusters consist of 24,689 product offers. Table 1 shows the number of offers and clusters per category, as well as the average cluster size. Further, the coverage of the schema.org properties per category and in total (Figure 1) is given. The title property refers to the concatenated name and title for each offer. The distribution of the cluster sizes (number of offers per cluster) is depicted in Figure 2.
|Category|# offers|# clusters|Average cluster size|title|description|brand|manufacturer|
|---|---|---|---|---|---|---|---|
4. Baseline Experiments
In order to set a baseline for the comparison of different categorization algorithms, we split the gold standard into training and test datasets and train a one-vs-rest ensemble of logistic regression classifiers. This ensemble reaches an F1 score of 85% on the test set. In the following, we provide details about the experiment.
First, a set of features was created from each offer. The schema.org properties name, title, description, brand and manufacturer were used. The title and name attributes were concatenated into a single title attribute. For each offer that lacked one of the schema.org properties, the respective property of the parent schema.org entity was used instead: if an offer did not have its own manufacturer, for example, the parent entity's manufacturer was assigned to it. Specification table keys and values, if available, were added as a further attribute. In order to extract useful information from the specification table values, only the values belonging to the keys Model, Type, Category, Sub-Category and Manufacturer were used. Additionally, the content of the HTML page of each offer was extracted by removing all HTML tags and code. The specification tables and HTML pages for the offers in the corpus can be downloaded from the WDC Product Matching website.
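The feature-assembly step described above can be sketched as follows. This is a minimal illustration, assuming offers and their parent entities are available as plain dicts; the function and key names are our own and not part of the original pipeline.

```python
# Specification-table keys whose values are kept as features.
SPEC_KEYS = {"Model", "Type", "Category", "Sub-Category", "Manufacturer"}

def build_features(offer, parent):
    """Collect one attribute dict per offer, falling back to the
    parent entity's schema.org property when the offer lacks one."""
    def prop(name):
        return offer.get(name) or parent.get(name) or ""

    return {
        # name and title are concatenated into a single title attribute
        "title": " ".join(v for v in (prop("name"), prop("title")) if v),
        "description": prop("description"),
        "brand": prop("brand"),
        "manufacturer": prop("manufacturer"),
        # keep only specification-table values under the selected keys
        "spec": " ".join(v for k, v in offer.get("spec_table", {}).items()
                         if k in SPEC_KEYS),
    }
```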
The terms of each attribute were lowercased and all punctuation characters and single letters or numbers were removed. Stop words were removed from the descriptions and html content using the stop words list from the Python Natural Language Toolkit NLTK.
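A sketch of this cleaning step is given below. The original pipeline used NLTK's English stop-word list (`nltk.corpus.stopwords.words('english')`); the tiny inlined set here is only a stand-in so the snippet has no corpus download dependency.

```python
import re

# Stand-in for NLTK's English stop-word list (an assumption for brevity).
STOP_WORDS = {"the", "a", "an", "and", "of", "for", "with", "is"}

def clean_tokens(text, remove_stopwords=False):
    """Lowercase, strip punctuation, and drop single-character tokens;
    stop-word removal is applied to descriptions and HTML content only."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)              # remove punctuation
    tokens = [t for t in text.split() if len(t) > 1]  # drop single letters/digits
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens
```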
The training set was built by grouping the offers in the manually labelled gold standard by their cluster IDs and concatenating the values of each attribute within a cluster. The resulting dataset was split into a training and a test set by assigning all clusters containing only one offer to the training set and splitting the remaining clusters by stratified sampling into 80% training and 20% test data. As the training set was highly imbalanced with respect to the class distribution, the clusters of each category were up- and down-sampled to the median of 72 clusters per category.
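The cluster-level split and rebalancing can be sketched as below, assuming one DataFrame row per cluster; the column names and the fixed random seeds are illustrative assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

def split_and_balance(clusters: pd.DataFrame, median_size: int = 72):
    """Split clusters 80/20 (stratified by category), keeping single-offer
    clusters in training, then resample each category to the median count."""
    singles = clusters[clusters["n_offers"] == 1]
    rest = clusters[clusters["n_offers"] > 1]
    train, test = train_test_split(
        rest, test_size=0.2, stratify=rest["category"], random_state=42)
    train = pd.concat([train, singles])
    # up-sample (with replacement) or down-sample each category to the median
    balanced = pd.concat([
        resample(group, n_samples=median_size,
                 replace=len(group) < median_size, random_state=42)
        for _, group in train.groupby("category")
    ])
    return balanced, test
```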
In addition to the training set derived from the WDC Categorization Gold Standard, we also use a subset of the UCSD Amazon Product Dataset for training. This subset was created by randomly sampling 1,000 offers per category that contain a title and description. We also provide this subset for download at the end of the page.
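The per-category sampling of the Amazon data could look as follows; the DataFrame column names are assumptions, not the dataset's actual schema.

```python
import pandas as pd

def sample_amazon(df: pd.DataFrame, per_category: int = 1000) -> pd.DataFrame:
    """Sample up to `per_category` offers per category, keeping only
    offers that have both a title and a description."""
    usable = df.dropna(subset=["title", "description"])
    parts = [g.sample(min(len(g), per_category), random_state=42)
             for _, g in usable.groupby("category")]
    return pd.concat(parts)
```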
The classification experiments were done using scikit-learn. We created feature vectors by computing tf-idf vectors for each attribute separately in the training set and the corpus. The parameters for the vector creation were optimized using the training set and grid search with 5-fold cross-validation. For the title attribute, bigrams were used to create the tf-idf vectors; for the remaining attributes, unigrams were used. The number of features was restricted to 10,000 for the manufacturer and specification table attributes.
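The per-attribute vectorization can be sketched with scikit-learn as follows. The grid-searched settings reported above (bigrams for the title, unigrams elsewhere, 10,000 features for manufacturer and specification tables) are hard-coded here; the attribute names and helper are illustrative.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# One vectorizer per attribute, with the settings reported in the text.
vectorizers = {
    "title": TfidfVectorizer(ngram_range=(1, 2)),
    "description": TfidfVectorizer(ngram_range=(1, 1)),
    "brand": TfidfVectorizer(ngram_range=(1, 1)),
    "manufacturer": TfidfVectorizer(ngram_range=(1, 1), max_features=10000),
    "spec": TfidfVectorizer(ngram_range=(1, 1), max_features=10000),
}

def vectorize(records, fit=False):
    """records: dict mapping attribute name -> list of document strings.
    Returns one sparse matrix with the per-attribute blocks stacked."""
    blocks = []
    for attr, vec in vectorizers.items():
        docs = records[attr]
        blocks.append(vec.fit_transform(docs) if fit else vec.transform(docs))
    return hstack(blocks).tocsr()
```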
For classification, a logistic regression classifier was optimized with grid search and 5-fold cross-validation on the training set. The resulting logistic regression model uses stochastic average gradient descent with a one-vs-rest approach. Thus, the multi-class classification problem was reduced to 25 binary classification problems. The model was applied to the offers in the corpus grouped in their ID clusters, so that all offers in a cluster were assigned to the same category.
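A minimal sketch of this setup in scikit-learn is shown below. The reported configuration is stochastic average gradient (`solver="sag"`) wrapped in a one-vs-rest ensemble; the parameter grid and scoring choice are illustrative assumptions, not the grid actually searched.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier

# One-vs-rest logistic regression with SAG, tuned by 5-fold grid search.
clf = GridSearchCV(
    OneVsRestClassifier(LogisticRegression(solver="sag", max_iter=1000)),
    param_grid={"estimator__C": [0.1, 1.0, 10.0]},  # assumed grid
    cv=5,
    scoring="f1_micro",
)
```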
Results of the Experiments
The classifier achieves a micro-averaged F1 score of 85% on the test set when trained using the training set derived from the WDC Categorization Gold Standard as well as the Amazon training data. The classifier achieves a micro-averaged F1 score of 82% when trained without the Amazon data. Table 2 shows the category-specific performance of the classifier trained on the WDC and Amazon data.
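The reported scores are micro-averaged F1, which scikit-learn computes as follows; note that for single-label multi-class predictions, micro-averaged F1 coincides with overall accuracy.

```python
from sklearn.metrics import f1_score

def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over all classes, as used for the reported scores."""
    return f1_score(y_true, y_pred, average="micro")
```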
In addition to the test set that we derived from the WDC Categorization Gold Standard, we also evaluate the classifier learned from the training set and the Amazon data on the initial WDC Gold Standard (consisting of 985 offers) as a test set. The classifier achieves 84% micro-averaged F1 on the initial gold standard.
5. Corpus Categorization
In order to obtain a consistent categorization for all 26 million product offers contained in the WDC product data corpus, we apply the classifier that we learned using the WDC and Amazon data to all offers in the corpus. The categorization resulted in the following distribution of categories over the individual offers (Figure 3) and over the clusters (Figure 4) in the corpus:
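Since the classifier works on cluster level, the cluster predictions have to be propagated to the individual offers; a sketch of this step, with illustrative names not taken from the original pipeline:

```python
def label_offers(cluster_labels, offer_to_cluster):
    """Assign each offer the category predicted for its cluster,
    given dicts cluster_id -> label and offer_id -> cluster_id."""
    return {offer: cluster_labels[cluster]
            for offer, cluster in offer_to_cluster.items()}
```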
In order to verify the performance of the classifier on the whole corpus, we manually checked a sample of 10 offers per category, excluding non-English offers as well as offers without any information. The results per category are shown in Figure 5. The percentages and colors indicate the number of offers per category in the corpus, with dark blue representing a high number of offers and light blue a low number.
6. Download
Below, we offer the WDC Categorization Gold Standard for public download. The gold standard is a subset of the English training corpus and contains the same properties for each offer as described here. We further provide the subsets of the gold standard that were used for training and testing, aggregated into clusters, as well as the Amazon sample that was used as additional training data. The training and test sets contain the schema.org as well as the parent schema.org properties name, title, description, brand and manufacturer. In addition, specification table keys and values as well as HTML content are included. Each attribute is concatenated within a cluster. The Amazon training offers contain a title, brand and description along with a category label and an identifier (asin). Finally, we offer the Categorized Training Dataset for Large-scale Product Matching for download. It contains cluster IDs and category labels in CSV format.
|Dataset|Sample|Size|Download|
|---|---|---|---|
|Categorization Gold Standard|categories_gold_standard_offers_sample.json|26MB|categories_gold_standard_offers.json.gzip|
|Categorization Training Set|categories_clusters_training_sample.json|303.4MB|categories_clusters_training.json.gzip|
|Categorization Test Set|categories_clusters_testing_sample.json|68.3MB|categories_clusters_testing.json.gzip|
|Amazon Training Data|amazon_training_sample.json|25.4MB|amazon_training.json.gzip|
|Categorized Matching Corpus (English)|categorized_clusters_english_sample.csv|247MB|categories_offers_en_clusters.csv.gzip|
- Primpeli, A., Peeters, R., & Bizer, C.: The WDC training dataset and gold standard for large-scale product matching. In: Companion Proceedings of the 2019 World Wide Web Conference, pp. 381-386. ACM (2019).