Gold Standard Design, Statistics, and Download
Dominique Ritze
Oliver Lehmberg
Christian Bizer



This page describes the T2Dv2 Gold Standard for evaluating matching systems on the task of matching Web tables to the DBpedia knowledge base. T2Dv2 is the second version of the T2D entity-level gold standard. A principal difference from the first version is that T2Dv2 contains not only positive but also negative examples.

Many HTML tables on the Web are used for layout purposes, but a small fraction of all tables contains structured data [Cafarella2008][Crestan2011]. As this data has a wide coverage, it could potentially be very valuable for filling missing values and extending cross-domain knowledge bases such as DBpedia, YAGO or the Google Knowledge Graph. As a prerequisite for being able to use table data for knowledge base extension, the Web tables need to be matched to the knowledge base in question, meaning that correspondences between the rows of the tables and the entities described in the knowledge base as well as between the columns of the tables and the schema of the knowledge base need to be found.

Different systems have been developed to solve this matching task [Venetis2010][Limaye 2010][Ellis2014][Zhang2014]. Until now, it has been difficult to compare the performance of these systems, as they were evaluated using (partly) non-public Web table data as well as different knowledge bases. The T2D Gold Standard tries to fill this gap by providing a large set of human-generated correspondences between a public Web table corpus and the DBpedia knowledge base. The T2Dv2 Gold Standard has been used to generate the results described in [Ritze2017].

The T2Dv2 gold standard consists of manually annotated row-to-instance, attribute-to-property and table-to-class correspondences between 779 Web tables and the DBpedia Knowledgebase Version 2014. The tables originate from the English-language subset of the Web Data Commons Web Tables Corpus. Out of the 779 tables in the gold standard, 237 tables share at least one instance with DBpedia. The tables cover different topics including places, works, and people. Altogether, the gold standard contains 237 class, 25,119 instance and 618 property correspondences.

The main difference from the previous version of this gold standard is that it also includes tables that are non-relational or do not have any overlap with DBpedia (negative examples). Since most tables found on the Web are actually non-relational, Web table matching systems first need to filter out all non-relational tables. Second, only some of the relational tables actually overlap with a knowledge base like DBpedia. Thus, matching systems need to be able to distinguish between overlapping and non-overlapping tables. Besides adding these tables, the existing correspondences have been manually checked and updated.

All versions of the T2D gold standard are provided under the terms of the Apache license for public download below.


1. Table Characteristics

The table below provides basic statistics about the size of the tables that are contained in the gold standard. The number of rows varies between 3 and 5000, while the number of columns is between 2 and 30. For small and narrow tables, it is difficult to decide whether they contain relational data that can be matched to DBpedia, since often only little evidence is available. Conversely, incorrectly mapping a large and broad table has a large influence on the measured performance of the matching system.

          min    max     median    average
row       3      5000    17        84
column    2      30      4         5

In order to cover the challenges that a Web table matching system needs to face, the gold standard contains three types of tables: non-relational tables (layout, matrix, entity, other), relational tables that do not share any instance with DBpedia, and relational tables for which at least one instance correspondence can be found. In contrast to other gold standards, e.g. [Limaye 2010], matching the tables of the T2Dv2 gold standard requires a matching system to distinguish between these types of tables and to filter out the ones that cannot be mapped to the DBpedia knowledge base. An overview of the different table types and how they are classified is given on our extraction page.

In total, 779 tables are contained in the gold standard. Figure 1 shows the distribution of table types. About 70% of the tables are relational tables; the second most common type is entity tables. In our studies and experiments we focus on relational tables since they are potentially the most useful for knowledge base completion.

Fig. 1 - Distribution of the table types.

Of the 546 relational tables, 237 (about 43%) share at least one instance correspondence with DBpedia. Because of this overlap, these tables can be mapped to DBpedia. All non-overlapping tables cover topics that are not represented in DBpedia, for example tables about products.
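The shares of the three table groups follow directly from the counts given on this page. The following is a minimal sketch of that arithmetic; the constant and function names are illustrative and not part of the gold standard.

```python
# Counts as stated on this page.
TOTAL_TABLES = 779
RELATIONAL_TABLES = 546
OVERLAPPING_TABLES = 237  # relational tables sharing at least one instance with DBpedia


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Sizes of the two groups of negative examples.
non_relational = TOTAL_TABLES - RELATIONAL_TABLES
relational_non_overlapping = RELATIONAL_TABLES - OVERLAPPING_TABLES

print("non-relational tables:", non_relational)
print("relational, non-overlapping tables:", relational_non_overlapping)
print("relational share of all tables (%):", share(RELATIONAL_TABLES, TOTAL_TABLES))
print("overlapping share of relational tables (%):", share(OVERLAPPING_TABLES, RELATIONAL_TABLES))
print("overlapping share of all tables (%):", share(OVERLAPPING_TABLES, TOTAL_TABLES))
```

In other words, fewer than a third of all tables in the gold standard are positive examples; the remaining tables have to be rejected by a matching system.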

2. Correspondence Characteristics

The T2Dv2 gold standard covers three different types of correspondences: row-to-instance, attribute-to-property and table-to-class correspondences. Altogether, it contains 25,119 row-to-instance, 618 attribute-to-property and 237 table-to-class correspondences. For details about how the tables and their correspondences have been chosen, see the page of the previous version.

2.1. Table-To-Class Correspondences

For the 237 relational overlapping tables, a table-to-class correspondence is available. We grouped these classes based on their super classes (called categories) and show the distribution of tables being mapped to each category in Figure 2. Altogether, the tables are mapped to 33 different DBpedia classes.

Fig. 2 - Distribution of Tables per Category

2.2. Attribute-To-Property Correspondences

In total, 618 correspondences between attributes and properties can be found in the tables. Of them, 237 refer to an entity label attribute and are assigned to the property rdfs:label. The remaining correspondences point to DBpedia properties which are contained in the DBpedia ontology namespace. Figure 3 shows the top 10 DBpedia properties (without rdfs:label) which have been assigned to the attributes. Besides properties referring to objects or strings, we can also find properties with dates and numeric values, e.g. releaseDate or populationTotal. Including properties with different data types shows the strengths and weaknesses of the similarity functions employed by the matching systems/algorithms.

Fig. 3 - Top 10 DBpedia properties
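A ranking like the one in Figure 3 can be derived by counting property URIs while skipping rdfs:label. The sketch below assumes the property URIs have already been collected into a list; the function name and the sample data are hypothetical, and only the two namespace URIs are standard.

```python
from collections import Counter

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
DBO_NAMESPACE = "http://dbpedia.org/ontology/"


def top_properties(property_uris, k=10):
    """Return the k most frequent DBpedia ontology properties, ignoring rdfs:label.

    The result is a list of (local property name, count) pairs.
    """
    counts = Counter(
        uri[len(DBO_NAMESPACE):]
        for uri in property_uris
        if uri != RDFS_LABEL and uri.startswith(DBO_NAMESPACE)
    )
    return counts.most_common(k)


# Hypothetical sample; the real input would be the property URIs of all
# 618 attribute correspondences (237 of which are rdfs:label, leaving 381).
sample = [
    RDFS_LABEL,
    DBO_NAMESPACE + "releaseDate",
    RDFS_LABEL,
    DBO_NAMESPACE + "populationTotal",
    DBO_NAMESPACE + "releaseDate",
]
print(top_properties(sample, k=2))
```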

The table below provides the average number of attribute-property correspondences per category.

Category                   avg. # of attribute-property correspondences
Work                       3.2
Organization               2.2
Architectural Structure    2.3
Person                     1.5
Species                    2.4
Natural Place              2.1
Populated Place            4.1

The number of attributes that can be mapped varies by category. While tables of the category "Populated Place" can on average be mapped to about 4 properties, tables of the category "Person" can on average only be mapped to 1.5 properties. Besides other characteristics, these differences significantly influence the difficulty of matching a certain table. Having many attribute-to-property correspondences can help to find row-to-instance or table-to-class correspondences, since more property values can be exploited.

2.3. Row-To-Instance Correspondences

T2Dv2 covers 25,119 row-to-instance correspondences coming from different topics. The table below provides the average number of row-instance correspondences per category.

Category                   avg. # of row-instance correspondences
Work                       131
Organization               65
Architectural Structure    67
Person                     89
Species                    250
Natural Place              83
Populated Place            130

While tables of the category "Species" can be mapped to many instances, tables about "Architectural Structure" tend to contain fewer correspondences. Similar to the number of property mappings per table, the number of rows that can be matched has an influence on the matching difficulty: the more correspondences to instances exist, the easier it becomes to find attribute-property as well as table-class correspondences.

3. Data Format

The Web tables are provided in the JSON format that is also used for the WDC Web Table Corpus 2015. Besides the table itself, metadata such as the surrounding text and timestamps is included. All tables are uniquely identified by their name, which is the name of the table file without extension.
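As a minimal sketch, a table file could be loaded together with its identifier as follows. The function name is hypothetical and UTF-8 encoding is an assumption; the sketch deliberately does not assume any particular keys inside the JSON document, since these are defined by the WDC Web Table Corpus format.

```python
import json
from pathlib import Path


def load_table(path):
    """Load one Web table file and return (table_name, parsed_json).

    The table name is the file name without its extension, which is how
    tables are identified in the gold standard.
    """
    path = Path(path)
    table_name = path.stem  # file name without the final extension
    # Assumption: the files are UTF-8 encoded.
    with path.open(encoding="utf-8") as handle:
        return table_name, json.load(handle)
```

The returned name can then be used to look up the correspondence files that belong to the same table.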

Correspondences are provided as CSV files. Fields are separated by commas (,) and all values are enclosed in double quotes ("). There are three different file types for the gold standard: the class correspondence file, the attribute correspondence files, and the entity correspondence files.

The class file contains the class correspondences and the header information for each table. It has the following structure:

table name | DBpedia class name | DBpedia class URI | Header row indices (comma-separated list)

The attribute files contain the attribute correspondences and the key information for the tables. For each table, one attribute file with the same name exists. These files have the following structure:

DBpedia property URI | Column header (value from the first row) | Is key column (boolean) | Column index

The entity files contain the entity correspondences for the tables. For each table, one entity file with the same name exists. These files have the following structure:

DBpedia resource URI | Key value | Row index
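The three file structures can be read with Python's standard csv module. The sketch below follows the field order described above; the class and function names are hypothetical, and it assumes that the files have no header line and that the boolean is written as "true"/"false" in any capitalisation.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class ClassCorrespondence:
    table_name: str
    class_name: str
    class_uri: str
    header_rows: List[int]


@dataclass
class AttributeCorrespondence:
    property_uri: str
    column_header: str
    is_key: bool
    column_index: int


@dataclass
class EntityCorrespondence:
    resource_uri: str
    key_value: str
    row_index: int


def read_class_file(lines: Iterable[str]) -> Iterator[ClassCorrespondence]:
    """Parse the class file: table name, class name, class URI, header row indices."""
    for table, name, uri, header_rows in csv.reader(lines):
        # The header row indices are a comma-separated list inside one quoted field.
        indices = [int(i) for i in header_rows.split(",") if i.strip()]
        yield ClassCorrespondence(table, name, uri, indices)


def read_attribute_file(lines: Iterable[str]) -> Iterator[AttributeCorrespondence]:
    """Parse an attribute file: property URI, column header, is-key flag, column index."""
    for uri, header, is_key, index in csv.reader(lines):
        # Assumption: the boolean is spelled "true"/"false", case-insensitively.
        yield AttributeCorrespondence(uri, header, is_key.strip().lower() == "true", int(index))


def read_entity_file(lines: Iterable[str]) -> Iterator[EntityCorrespondence]:
    """Parse an entity file: resource URI, key value, row index."""
    for uri, key_value, row_index in csv.reader(lines):
        yield EntityCorrespondence(uri, key_value, int(row_index))


# Hypothetical sample lines that follow the documented structure.
class_sample = '"example_table","Country","http://dbpedia.org/ontology/Country","0"\n'
attribute_sample = '"http://dbpedia.org/ontology/populationTotal","population","False","2"\n'
entity_sample = '"http://dbpedia.org/resource/Germany","germany","5"\n'

print(list(read_class_file(io.StringIO(class_sample))))
print(list(read_attribute_file(io.StringIO(attribute_sample))))
print(list(read_entity_file(io.StringIO(entity_sample))))
```

When reading from disk, the files should be opened with `open(path, newline="", encoding="utf-8")`, as recommended for the csv module; the encoding is again an assumption. A line with an unexpected number of fields raises a ValueError in this sketch rather than being skipped silently.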

4. Download

To download the corpus of Web tables as well as the correspondences use the following link:

T2Dv2

5. License

The correspondences of the T2D Gold Standard are provided under the terms of the Apache license. The Web tables are provided according to the same terms of use, disclaimer of warranties and limitation of liabilities that apply to the Common Crawl corpus.

6. Acknowledgements

We would like to thank Oktie Hassanzadeh, Mariano Rodriguez, Kavitha Srinivas and Michael J. Ward for their feedback on our gold standard.

7. Feedback

Please send questions and feedback directly to the authors (listed above) or post them in the Web Data Commons Google Group.

8. References

  1. [Cafarella2008] Michael J. Cafarella, Eugene Wu, Alon Halevy, Yang Zhang, Daisy Zhe Wang: WebTables: exploring the power of tables on the web. VLDB 2008.
  2. [Crestan2011] Eric Crestan and Patrick Pantel: Web-scale table census and classification. WSDM 2011.
  3. [Cafarella2009] Michael J. Cafarella, Alon Halevy, and Nodira Khoussainova: Data integration for the relational web. Proc. VLDB Endow. 2009.
  4. [Venetis2010] Petros Venetis, Alon Halevy, Jayant Madhavan, Marius Pasca, Warren Shen, Fei Wu, Gengxin Miao, and Chung Wu: Table Search Using Recovered Semantics. 2010.
  5. [Limaye 2010] Girija Limaye, Sunita Sarawagi, and Soumen Chakrabarti: Annotating and searching web tables using entities, types and relationships. Proc. VLDB Endow. 3, 1-2, 2010.
  6. [Zhang2013] Zhang, Xiaolu, et al.: Mapping entity-attribute web tables to web-scale knowledge bases. In: Database Systems for Advanced Applications. Springer, 2013.
  7. [Wang2012] Jingjing Wang, Haixun Wang, Zhongyuan Wang, and Kenny Q. Zhu: Understanding tables on the web. In Proceedings of the 31st international conference on Conceptual Modeling (ER'12), 2012.
  8. [Ellis2014] Jason Ellis, Achille Fokoue, Oktie Hassanzadeh, Anastasios Kementsietsidis, Kavitha Srinivas, Michael J. Ward: Exploring Big Data with Helix: Finding Needles in a Big Haystack. In ACM SIGMOD Record, Volume 43 Issue 4, 2014.
  9. [Zhang2014] Ziqi Zhang: Towards efficient and effective semantic table interpretation. In Proceedings of the 13th International Semantic Web Conference (ISWC 2014), 2014.
  10. [Ritze2017] Dominique Ritze, Christian Bizer: Matching Web Tables To DBpedia - A Feature Utility Study. In Proceedings of the 20th International Conference on Extending Database Technology (EDBT 2017), 2017.