This page offers a product categorization gold standard for public download. The gold standard categorizes 9,414 schema.org product descriptions into the product taxonomy of the GS1 Global Product Catalogue. The schema.org product descriptions originate from the WebDataCommons 2014 Microdata Corpus. Each product is manually assigned to a category on the third level of the GS1 product taxonomy.
1. Motivation
A significant number of e-commerce websites have started to annotate product data within their HTML pages using the schema.org vocabulary so that their products are presented well within the search results of search engines such as Google, Bing, Yahoo, and Yandex. The most commonly used classes for representing product data are schema.org/Product and schema.org/Offer. Although schema.org defines a large number of properties for describing products and offers, only a small subset of these properties is used frequently on the Web. We are particularly interested in the usage of the property category, which is defined to contain a textual description of "A category for the item. Greater signs or slashes can be used to informally indicate a category hierarchy." Analyzing the WebDataCommons 2014 product data, we observe that only 2,000 out of the 104,000 websites that annotate product data also provide categorization information for their products. The product categorization taxonomies used by these 2,000 websites are usually website-specific, and the categorization information is thus very heterogeneous. In order to calculate statistics about which e-shop is selling products from which product category, it is thus necessary to classify the schema.org product data into a single shared product taxonomy such as the GS1 taxonomy. The WDC Product Categorization Gold Standard has been created to evaluate classifiers for solving this task.
2. Description of the Gold Standard
To create the gold standard, we randomly chose 9,414 product descriptions from the WDC 2014 schema.org Product Data Corpus. The product descriptions originate from 818 different e-commerce websites. Every product is described by the following attributes:
- GeneratedID: An automatically generated identifier, unique for each product description.
- EntityNodeID: The original entity node of the root entity (schema.org/Product). This ID is unique in combination with the URL.
- URL: The URL of the page in which the product description was embedded.
- HOST: The host part of the URL.
- s:name: The name property as defined by schema.org.
- s:description: The description property as defined by schema.org.
- s:brand: The brand property as defined by schema.org.
- Properties: All the properties describing the current product entity.
- s:category: The category property as defined by schema.org, or
- s:breadcrumb: The breadcrumb property as defined by schema.org.
Below, two example product entities are presented in the same way as they appear in the data set. Here you can download a sample of the product data set.
Attribute | Example Entity 1 | Example Entity 2 |
---|---|---|
GeneratedID | -2118249796 | 1187072929 |
EntityNodeID | node5d4c67d3df98cb2aee588545d24306e | node26f6acb8c1c3a928c126c26a99c6782 |
URL | http://www.tekshop247.com/37tm4zbb-componentrgb-p-155916.html | http://www.tekshop247.com/phoneeasy-331ph-corded-phone-p-176289.html |
HOST | www.tekshop247.com | www.tekshop247.com |
s:name | Panasonic TY-37TM4ZBB RCA Component/RGB | Doro PhoneEasy 331ph Corded Phone |
s:description | RCA Component/RGB s-vid Board For 37 Plasma | Doro PhoneEasy® 331ph Designed to make using a phone as simple as can be. Big buttons on a clear and spacious keypad make dialling easier than ever while photo memories connect you with special people at the simple press of a button. Visual ring indicator and HAC (Hearing Aid Compatibility).- Very easy to operate - 3 one-touch photo buttons - Easily adjustable volume FeaturesTypeCorded ColourWhite |
s:brand | Panasonic | Doro |
Properties | http://schema.org/Product/offers\|http://schema.org/Product/mpn\|http://schema.org/Product/name\|http://schema.org/Product/description\|http://schema.org/Product/image\|http://schema.org/Product/brand | http://schema.org/Product/offers\|http://schema.org/Product/mpn\|http://schema.org/Product/name\|http://schema.org/Product/description\|http://schema.org/Product/image\|http://schema.org/Product/brand |
s:category | Electronics > Video > Video Accessories > Television Accessories | Electronics > Communications > Telephony > Corded Phones |
s:breadcrumb | Browse Tekshop247 » Monitors & Tvs » Plasma Accessories » TY-37TM4ZBB | Browse Tekshop247 » Home Appliances » Telephones » Corded Phone » 4618 |
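The hierarchy encoded in s:category and s:breadcrumb values can be recovered by splitting on the separator characters. The following is a minimal sketch, assuming only the two separators visible in the examples above (">" and "»"); the function name `split_category_path` is hypothetical and not part of the data set. Slashes are deliberately not treated as separators here, since labels such as "Household/Office Furniture/Furnishings" contain them.

```python
import re

# Separators seen in the two example entities: ">" (s:category) and "»" (s:breadcrumb).
# This is an assumption; other websites may use different separators.
_SEPARATORS = re.compile(r"\s*[>»]\s*")


def split_category_path(value):
    """Split a raw s:category or s:breadcrumb string into its hierarchy levels."""
    if not value:
        return []
    return [part.strip() for part in _SEPARATORS.split(value) if part.strip()]


levels = split_category_path("Electronics > Communications > Telephony > Corded Phones")
print(levels)
```

Because the taxonomies are website-specific, the resulting levels are only comparable within one website, which is why a shared taxonomy is needed.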
Density Statistics
The table below contains statistics about the density of the different attributes, that is, for how many of the 9,414 products each attribute is filled, together with the number of distinct values per attribute.
Total product entries | 9,414 | ||
---|---|---|---|
Distinct PLDs | 818 | ||
Entities with s:category | 7,976 (84%) | Distinct s:category values | 3,653 |
Entities with s:breadcrumb | 1,445 (15%) | Distinct s:breadcrumb values | 1,019 |
Entities with s:name | 9,394 (99%) | Distinct s:name values | 9,025 |
Entities with s:description | 8,738 (92%) | Distinct s:description values | 7,992 |
Entities with s:brand | 4,030 (42%) | Distinct s:brand values | 1,234 |
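Density figures of this kind can be computed as the number of non-empty values, their share of all entries, and the number of distinct values. For instance, 7,976 of 9,414 entries is about 84.7%, which the table reports as 84%. The sketch below assumes the records are available as dictionaries keyed by attribute name; the `density` helper and the third sample record are hypothetical and only serve as illustration.

```python
def density(records, attribute):
    """Return (filled count, filled share, distinct value count) for one attribute."""
    values = (record.get(attribute) for record in records)
    filled = [value.strip() for value in values if value and value.strip()]
    share = len(filled) / len(records) if records else 0.0
    return len(filled), share, len(set(filled))


# Two brands taken from the example entities, plus one made-up record without a brand.
records = [
    {"s:name": "Panasonic TY-37TM4ZBB RCA Component/RGB", "s:brand": "Panasonic"},
    {"s:name": "Doro PhoneEasy 331ph Corded Phone", "s:brand": "Doro"},
    {"s:name": "Illustrative product without brand", "s:brand": ""},
]

filled, share, distinct = density(records, "s:brand")
print(f"{filled} filled ({share:.0%}), {distinct} distinct")
```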
3. The GS1 Global Product Catalogue
As can be observed from the table in the previous section, the existing categorization of products is not only rare but also highly heterogeneous. A product catalogue that provides a unified categorization is therefore needed. For this purpose, we initially used the product classification proposed by the GS1 Global Product Catalogue (GPC). The GS1/GPC catalogue version of 01-12-2014 defines 6 levels of categorization. The levels start with the industry segment, which is then divided into families and classes of products. The example presented below derives from the GS1 website and depicts the mapping of a milk product in the first four levels.
GS1 GPC at a glance
Categorization Levels | 6 | |
---|---|---|
Level 1: Segment - Distinct categories | 38 | |
Level 2: Family - Distinct categories | 113 | |
Level 3: Class - Distinct categories | 783 | |
Level 4: Brick - Distinct categories | 3,766 | |
Level 5: Core Attribute Type - Distinct categories | 1,858 | |
Level 6: Core Attribute Value - Distinct categories | 10,851 |

4. Creation of the Gold Standard based on GS1/GPC
For the creation of our gold standard, we used the three broadest classification levels defined by GS1/GPC, namely segment, family and class. In order to assign a category label to a product entity, we first looked at the title of the product. If the title was not sufficient for assigning a category label, we turned to the description and to the website on which the product information was published, which in most cases contained a picture of the described product.
Top 10 GS1/GPC Segments in the Gold Standard
Below we present statistics about the distribution of the 9,414 schema.org product descriptions in the gold standard over the Top 10 GS1/GPC segments.
Segment | Entities# | Entities% |
---|---|---|
Clothing | 4,010 | 42.6% |
Personal Accessories | 492 | 5.2% |
Household/Office Furniture/Furnishings | 466 | 5.0% |
Automotive | 430 | 4.6% |
Computing | 346 | 3.7% |
Audio Visual/Photography | 330 | 3.5% |
Healthcare | 300 | 3.2% |
Sports Equipment | 241 | 2.6% |
Pet Care/Food | 241 | 2.6% |
Food/Beverage/Tobacco | 226 | 2.4% |
Others | 2,332 | 24.8% |
In addition to the complete gold standard, we also provide a filtered version, from which we removed all products from non-".com" URLs as well as products from URLs that clearly indicate a language other than English, e.g. by including /de/ in the directory path of the URL. We also removed the products that we could not assign to any GS1/GPC category.
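The URL-based part of this filter can be sketched as follows. This is only an approximation under stated assumptions: the exact rules used to build the filtered version are not documented here, and the set `NON_ENGLISH_SEGMENTS` as well as the function name `keep_for_english_subset` are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical list of path segments treated as markers of a non-English page.
NON_ENGLISH_SEGMENTS = {"de", "fr", "es", "it", "nl"}


def keep_for_english_subset(url):
    """Keep a URL if its host ends in .com and no path segment signals another language."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if not host.endswith(".com"):
        return False
    segments = {segment.lower() for segment in parsed.path.split("/") if segment}
    return not (segments & NON_ENGLISH_SEGMENTS)


print(keep_for_english_subset("http://www.tekshop247.com/phoneeasy-331ph-corded-phone-p-176289.html"))
```

Matching whole path segments rather than substrings avoids discarding URLs whose product slug merely contains the letters "de".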
5. Mapping between the GS1/GPC Taxonomy and the Google Product Taxonomy
During the labelling process, we observed some limitations of the GS1/GPC taxonomy for products such as house amenities, travelling services and job offers. As a result of these limitations, 187 product entities could not be classified. For this reason, we additionally made use of the Google Product Taxonomy, which is more focused on the Web. This taxonomy includes three classification levels. As a first step, unique identifiers were assigned to the Google Product Taxonomy categories of all three levels. Then a mapping between the first two levels and the GS1 categories of the first three levels was attempted. These mappings could only be partially implemented, as there were no exact equivalences between the two catalogues, while in some cases there were 1:n relations between the categories. Here is an example of a partial mapping between the two standards for the Baby Care product category:

The charts below depict the percentage of the GS1/GPC categories of the first three levels that could be mapped to at least one Google Product Taxonomy category.
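A coverage percentage of this kind can be derived from a 1:n mapping stored as a dictionary from each GS1/GPC category to a list of Google Product Taxonomy categories. The sketch below is illustrative only: the identifiers are made-up placeholders, not real category IDs from either taxonomy, and `mapping_coverage` is a hypothetical helper.

```python
def mapping_coverage(gs1_categories, mapping):
    """Share of GS1/GPC categories mapped to at least one Google category."""
    if not gs1_categories:
        return 0.0
    mapped = sum(1 for category in gs1_categories if mapping.get(category))
    return mapped / len(gs1_categories)


# Placeholder identifiers; "gs1-b" shows a 1:n relation, "gs1-c" and "gs1-d" are unmapped.
gs1_categories = ["gs1-a", "gs1-b", "gs1-c", "gs1-d"]
mapping = {
    "gs1-a": ["google-1"],
    "gs1-b": ["google-2", "google-3"],
    "gs1-c": [],
}

print(f"{mapping_coverage(gs1_categories, mapping):.0%} of categories mapped")
```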
6. Download
You can download both the created gold standard and the file that contains the mappings between GS1/GPC and the Google Product Taxonomy as csv files. In order to increase readability, we concatenated the ID and the label of the respective category (e.g. "67010000_Clothing"). The Google Product Taxonomy with the generated IDs can be found here.
File | Goldstandard (Version 1), English Goldstandard (Version 1) | GS1/GPC <-> Google Product Taxonomy Mapping (Version 1) |
---|---|---|
Format | csv (";" as separator) | csv (";" as separator) |
Size | 8,856 KB | 109 KB |
#Rows | 9,415 | 786 |
Columns | | |
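Reading the files requires setting ";" as the separator and, if IDs and labels are needed separately, splitting the concatenated category cells at the first underscore. The following is a minimal sketch using an in-memory sample; the column names `GeneratedID`, `Name` and `Level1` and the sample row are assumptions made for illustration, since the actual column layout is not listed here.

```python
import csv
import io


def split_category_cell(cell):
    """Split a cell such as '67010000_Clothing' into (ID, label) at the first underscore."""
    category_id, _, label = cell.partition("_")
    return category_id, label


def read_rows(handle):
    """Read a ';'-separated csv file with a header row into a list of dictionaries."""
    return list(csv.DictReader(handle, delimiter=";"))


# Illustrative sample; replace with open(path, newline="", encoding="utf-8") for a real file.
sample = io.StringIO(
    "GeneratedID;Name;Level1\n"
    "example-id;Example product;67010000_Clothing\n"
)

rows = read_rows(sample)
print(split_category_cell(rows[0]["Level1"]))
```

Splitting only at the first underscore keeps labels intact even if they contain further underscores.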
7. License
The extracted data is provided according to the same terms of use, disclaimer of warranties and limitation of liabilities that apply to the Common Crawl corpus.
The data derived from the GS1 Global Product Catalogue as well as from the Google Product Taxonomy (e.g. the Google Product Taxonomy with IDs) is provided according to the terms of use, disclaimer of warranties and limitations of liability defined by the original distributors of the data: GS1 and Google.
8. References
- Robert Meusel, Anna Primpeli, Christian Meilicke, Heiko Paulheim, and Christian Bizer: Exploiting Microdata Annotations to Consistently Categorize Product Offers at Web Scale. 16th International Conference on Electronic Commerce and Web Technologies (EC-Web2015/T2), Valencia, Spain, September 2015.
- Robert Meusel, Petar Petrovski and Christian Bizer: The WebDataCommons Microdata, RDFa and Microformat Dataset Series. 13th International Semantic Web Conference (ISWC2014), pp. 277-292, Riva del Garda, Italy, October 2014.
- Petar Petrovski, Volha Bryl, Christian Bizer: Integrating Product Data from Websites offering Microdata Markup. 23rd International Conference on World Wide Web (WWW2014 Companion), pp. 1299-1304, Seoul, Korea, April 2014.
9. Product Datasets
- WebDataCommons schema.org Product Data Corpus containing 287 million product descriptions originating from 104,000 websites. The data set has been extracted from the CommonCrawl web corpus.
- Amazon Product and Review Dataset containing data about 9.4 million products (including categorization information) as well as 147 million product reviews crawled from Amazon.