Categorizing Schema.org Product Descriptions into the GS1 Global Product Catalogue
Anna Primpeli
Robert Meusel
Christian Bizer

This page offers a product categorization gold standard for public download. The gold standard categorizes 9,414 schema.org product descriptions into the product taxonomy of the GS1 Global Product Catalogue. The schema.org product descriptions originate from the WebDataCommons 2014 Microdata Corpus. Each product is manually assigned to a category on the third level of the GS1 product taxonomy.


1. Motivation

A significant number of e-commerce websites have started to annotate product data within their HTML pages using the schema.org vocabulary so that it is presented attractively in the results of search engines such as Google, Bing, Yahoo, and Yandex. The most commonly used classes for representing product data are schema.org/Product and schema.org/Offer. Although schema.org defines a large number of properties for describing products and offers, only a small subset of these properties is used frequently on the Web. We are particularly interested in the usage of the property category, which is defined to contain a textual description of "A category for the item. Greater signs or slashes can be used to informally indicate a category hierarchy." Analyzing the WebDataCommons 2014 product data, we observe that only 2,000 out of the 104,000 websites that annotate product data also provide categorization information for their products. The product categorization taxonomies used by these 2,000 websites are usually website-specific, and the categorization information is thus very heterogeneous. In order to calculate statistics about which e-shop sells products from which product category, it is therefore necessary to classify the schema.org product data into a single shared product taxonomy such as the GS1 taxonomy. The WDC Product Categorization Gold Standard has been created to evaluate classifiers for solving this task.

2. Description of the Gold Standard

To create the gold standard, we randomly chose 9,414 product descriptions from the WDC 2014 schema.org Product Data Corpus. The product descriptions originate from 818 different e-commerce websites. Every product is described by the following attributes: GeneratedID, EntityNodeID, URL, HOST, s:name, s:description, s:brand, Properties, s:category, and s:breadcrumb.

Below, two example product entities are presented in the same form in which they appear in the data set. Here you can download a sample of the product data set.

| Attribute | Example Entity 1 | Example Entity 2 |
|---|---|---|
| GeneratedID | -2118249796 | 1187072929 |
| EntityNodeID | node5d4c67d3df98cb2aee588545d24306e | node26f6acb8c1c3a928c126c26a99c6782 |
| URL | http://www.tekshop247.com/37tm4zbb-componentrgb-p-155916.html | http://www.tekshop247.com/phoneeasy-331ph-corded-phone-p-176289.html |
| HOST | www.tekshop247.com | www.tekshop247.com |
| s:name | Panasonic TY-37TM4ZBB RCA Component/RGB | Doro PhoneEasy 331ph Corded Phone |
| s:description | RCA Component/RGB s-vid Board For 37 Plasma | Doro PhoneEasy® 331ph Designed to make using a phone as simple as can be. Big buttons on a clear and spacious keypad make dialling easier than ever while photo memories connect you with special people at the simple press of a button. Visual ring indicator and HAC (Hearing Aid Compatibility).- Very easy to operate - 3 one-touch photo buttons - Easily adjustable volume FeaturesTypeCorded ColourWhite |
| s:brand | Panasonic | Doro |
| Properties | http://schema.org/Product/offers\|http://schema.org/Product/mpn\|http://schema.org/Product/name\|http://schema.org/Product/description\|http://schema.org/Product/image\|http://schema.org/Product/brand | http://schema.org/Product/offers\|http://schema.org/Product/mpn\|http://schema.org/Product/name\|http://schema.org/Product/description\|http://schema.org/Product/image\|http://schema.org/Product/brand |
| s:category | Electronics > Video > Video Accessories > Television Accessories | Electronics > Communications > Telephony > Corded Phones |
| s:breadcrumb | Browse Tekshop247 » Monitors & Tvs » Plasma Accessories » TY-37TM4ZBB | Browse Tekshop247 » Home Appliances » Telephones » Corded Phone » 4618 |
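The s:category and s:breadcrumb values of the two example entities encode a category hierarchy with separators such as ">" and "»". The following is a minimal sketch of how such a value could be split into its levels. The function name split_category_path and the default separator set are assumptions made for illustration; other websites may use different separators (for example slashes), which can be passed in explicitly.

```python
import re


def split_category_path(value, separators=(">", "»")):
    """Split a raw s:category or s:breadcrumb string into its hierarchy levels."""
    # Build one regular expression that matches any of the given separators.
    pattern = "|".join(re.escape(sep) for sep in separators)
    parts = re.split(pattern, value)
    # Trim whitespace around each level and drop empty fragments.
    return [part.strip() for part in parts if part.strip()]


category = "Electronics > Communications > Telephony > Corded Phones"
breadcrumb = "Browse Tekshop247 » Home Appliances » Telephones » Corded Phone » 4618"

print(split_category_path(category))
print(split_category_path(breadcrumb))
```

Slashes are deliberately not part of the default separators in this sketch, because they also occur inside category labels such as "Audio Visual/Photography".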

 

Density Statistics

The table below contains statistics about the density of the different attributes, i.e. for how many of the 9,414 products each attribute is filled, and how many distinct values each attribute has.


Total product entries: 9,414
Distinct PLDs (pay-level domains): 818

| Attribute | Entities with attribute | Distinct values |
|---|---|---|
| s:category | 7,976 (84%) | 3,653 |
| s:breadcrumb | 1,445 (15%) | 1,019 |
| s:name | 9,394 (99%) | 9,025 |
| s:description | 8,738 (92%) | 7,992 |
| s:brand | 4,030 (42%) | 1,234 |

3. The GS1 Global Product Catalogue

As the table in the previous section shows, the existing categorization of the products is not only rare but also highly heterogeneous. A product catalogue that provides a unified categorization is therefore needed. For this purpose, we initially used the product classification proposed by the GS1 Global Product Catalogue (GPC). The version of 01-12-2014 of the GS1/GPC catalogue defines six levels of categorization. The levels start with the industry segment, which is then divided into families and classes of products. The example presented below is taken from the GS1 website and depicts the mapping of a milk product in the first four levels.


GS1 GPC at a glance


Categorization levels: 6

| Level | Name | Distinct categories |
|---|---|---|
| 1 | Segment | 38 |
| 2 | Family | 113 |
| 3 | Class | 783 |
| 4 | Brick | 3,766 |
| 5 | Core Attribute Type | 1,858 |
| 6 | Core Attribute Value | 10,851 |
GPC/GS1 categorization example

4. Creation of the Gold Standard based on GS1/GPC

For the creation of our gold standard, we used the three broadest classification levels defined by GS1/GPC: segment, family, and class. In order to assign a category label to a product entity, we first looked at the title of the product. If the title was not sufficient for assigning a category label, we turned to the description and to the web page on which the product information was published, which in most cases contained a picture of the described product.

Top 10 GS1/GPC Segments in the Gold Standard

Below we present statistics about the distribution of the 9,414 schema.org product descriptions in the gold standard over the Top 10 GS1/GPC segments.

| Segment | # Entities | % Entities |
|---|---|---|
| Clothing | 4,010 | 42.6% |
| Personal Accessories | 492 | 5.2% |
| Household/Office Furniture/Furnishings | 466 | 5.0% |
| Automotive | 430 | 4.6% |
| Computing | 346 | 3.7% |
| Audio Visual/Photography | 330 | 3.5% |
| Healthcare | 300 | 3.2% |
| Sports Equipment | 241 | 2.6% |
| Pet Care/Food | 241 | 2.6% |
| Food/Beverage/Tobacco | 226 | 2.4% |
| Others | 2,332 | 24.8% |

In addition to the complete gold standard, we also provide a filtered version from which we removed all products whose URLs are not on a ".com" domain, as well as products whose URLs clearly indicate a language other than English, e.g. by including /de/ in the directory path of the URL. We also removed the products that we could not assign to any GS1/GPC category.
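The URL-based part of this filter can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used to build the filtered file: the function name looks_english and the set of language path segments are hypothetical, and the source only names /de/ as an example. The removal of products without a GS1/GPC category is not covered here.

```python
from urllib.parse import urlparse

# Hypothetical list of path segments that hint at a non-English page.
# Only "de" is mentioned in the text; the others are illustrative.
NON_ENGLISH_PATH_SEGMENTS = frozenset({"de", "fr", "es", "it", "nl"})


def looks_english(url, blocked_segments=NON_ENGLISH_PATH_SEGMENTS):
    """Return True if the URL is on a .com host and has no blocked path segment."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if not host.endswith(".com"):
        return False
    # Compare whole path segments so that "/de/" matches but "/desk/" does not.
    segments = {segment.lower() for segment in parsed.path.split("/") if segment}
    return not (segments & blocked_segments)


urls = [
    "http://www.tekshop247.com/phoneeasy-331ph-corded-phone-p-176289.html",
    "http://shop.example.com/de/produkt-1.html",  # hypothetical URL
    "http://shop.example.de/item.html",           # hypothetical URL
]
kept = [url for url in urls if looks_english(url)]
print(kept)
```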

5. Mapping between the GS1/GPC Taxonomy and the Google Product Taxonomy

During the labelling process, we observed some limitations of the GS1/GPC taxonomy for products such as house amenities, travelling services, and job offers. As a result of these limitations, 187 product entities could not be classified. For this reason, we additionally used the Google Product Taxonomy, which is more focused on the Web, for product categorization. This taxonomy includes three classification levels. As a first step, unique identifiers were assigned to the Google Product Taxonomy categories of all three levels; then a mapping between its first two levels and the first three levels of the GS1 categories was attempted. These mappings could only be partially implemented, as there were no exact equivalences between the two catalogues, and in some cases there were 1:n relations between the categories. Here is an example of a partial mapping between the two standards for the Baby Care product category:

Mapping example

The charts below depict the percentage of the GS1/GPC categories of the first three levels that could be mapped to at least one Google Product Taxonomy category.
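A coverage percentage of this kind could be computed from mapping rows as sketched below. The sketch assumes that an unmapped GS1/GPC category has an empty Google cell and that a 1:n relation is stored as several rows for the same GS1/GPC category; neither is confirmed by the description of the mapping file. The function name mapping_coverage and all toy values except "67010000_Clothing" are made up for illustration.

```python
def mapping_coverage(rows, gs1_column, google_column):
    """Share of distinct GS1/GPC categories with at least one Google mapping."""
    has_mapping = {}
    for row in rows:
        gs1 = (row.get(gs1_column) or "").strip()
        if not gs1:
            continue
        google = (row.get(google_column) or "").strip()
        # A category counts as mapped as soon as one of its rows has a target.
        has_mapping[gs1] = has_mapping.get(gs1, False) or bool(google)
    if not has_mapping:
        return 0.0
    return sum(has_mapping.values()) / len(has_mapping)


GS1_COL = "Segment Code_Segment Description"
GOOGLE_COL = "GS1 Segment-Google TAX"

# Toy rows; the Google values and the second segment are hypothetical.
rows = [
    {GS1_COL: "67010000_Clothing", GOOGLE_COL: "1_Example Google Category"},
    {GS1_COL: "67010000_Clothing", GOOGLE_COL: "2_Another Example Category"},
    {GS1_COL: "99999999_Example Unmapped Segment", GOOGLE_COL: ""},
]

coverage = mapping_coverage(rows, GS1_COL, GOOGLE_COL)
print(f"{coverage:.1%} of segments mapped")
```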

6. Download

You can download both the gold standard and the file that contains the mappings between GS1/GPC and the Google Product Taxonomy as CSV files. In order to increase readability, we concatenated the ID and the label of the respective category (e.g. "67010000_Clothing"). The Google Product Taxonomy with the generated IDs can be found here.

| | Gold standard | Mapping |
|---|---|---|
| File | Goldstandard (Version 1); English Goldstandard (Version 1) | GS1/GPC <-> Google Product Taxonomy Mapping (Version 1) |
| Format | csv (";" as separator) | csv (";" as separator) |
| Size | 8,856 KB | 109 KB |
| #Rows | 9,415 | 786 |
| Columns | GeneratedID, EntityNodeID, URL, HOST, s:name, s:description, s:brand, Properties, s:category, s:breadcrumb, GS1_Level1_Category, GS1_Level2_Category, GS1_Level3_Category | Segment Code_Segment Description, GS1 Segment-Google TAX, Family CodeFamily Description, GS1 Family-Google TAX, Class CodeClass Description, GS1 Class-Google TAX |
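The files can be read with any CSV library that supports a custom separator. The following is a minimal sketch using Python's standard csv module. It assumes that the first row is a header, that the first underscore in a category value separates the ID from the label, and that default quoting applies; the helper names and the added GS1_Level1_ID and GS1_Level1_Label keys are illustrative, and the in-memory sample stands in for the real file.

```python
import csv
import io


def split_id_label(value):
    """Split a value such as '67010000_Clothing' into ('67010000', 'Clothing')."""
    category_id, _, label = value.partition("_")
    return category_id, label


def read_gold_standard(handle):
    """Yield one dict per product, adding a separate ID and label for level 1."""
    reader = csv.DictReader(handle, delimiter=";")
    for row in reader:
        raw = row.get("GS1_Level1_Category") or ""
        row["GS1_Level1_ID"], row["GS1_Level1_Label"] = split_id_label(raw)
        yield row


# Tiny in-memory sample with a subset of the columns; the product is made up.
sample = io.StringIO(
    "GeneratedID;s:name;GS1_Level1_Category\n"
    "1;Example shirt;67010000_Clothing\n"
)
records = list(read_gold_standard(sample))
print(records[0]["GS1_Level1_ID"], records[0]["GS1_Level1_Label"])
```

For a downloaded file, the same function can be given a handle opened with open(path, newline="", encoding=...), where both the path and the encoding depend on the local copy.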

7. License

The extracted data is provided according to the same terms of use, disclaimer of warranties, and limitation of liabilities that apply to the Common Crawl corpus.

The data derived from the GS1 Global Product Catalogue and from the Google Product Taxonomy (e.g. the Google Product Taxonomy with IDs) is provided according to the terms of use, disclaimer of warranties, and limitations of liability defined by the original distributors of the data: GS1 and Google.

8. References

9. Product Datasets