Dataset Design, Statistics, and Download
Yaser Oulabi
Christian Bizer

This page describes the T4LTE dataset, a gold standard for the task of long-tail entity extraction from web tables.

Knowledge bases like DBpedia, Wikidata, or YAGO all rely on data that has been extracted from Wikipedia and as a result cover mostly head instances that fulfill the Wikipedia notability criteria [Oulabi2019]. Their coverage of less well-known instances from the long tail is rather low [Dong2014]. As the usefulness of a knowledge base increases with its completeness, adding long-tail instances to a knowledge base is an important task. Web tables [Cafarella2008], which are relational HTML tables extracted from the Web, contain large amounts of structured information covering a wide range of topics, and describe very specific long-tail instances. Web tables are thus a promising source of information for the task of augmenting cross-domain knowledge bases.

This dataset provides annotations for a selection of web tables for the task of augmenting the DBpedia knowledge base with new long-tail entities from those web tables. It includes annotations for the number of unique entities that can be derived from these web tables, and which of these entities are new, given the instances that are already covered in DBpedia. Additionally, there are annotations for values and facts that can be generated from web table data, allowing the evaluation of how well descriptions of new entities were created.

This dataset was used to develop and evaluate a method for augmenting a knowledge base with long-tail entities from web tables [Oulabi2019]. Using this dataset for training, we were able to add 187 thousand new Song entities with 394 thousand facts and 14 thousand new GridironFootballPlayer entities with 44 thousand new facts to DBpedia. In terms of the number of instances, this is an increase of 356 % and 67 % for Song and GridironFootballPlayer, respectively [Oulabi2019].


1. Dataset Purpose

The purpose of this dataset is to act as a gold standard for evaluating the extraction of long-tail entities from web tables. It fulfills three tasks:

2. Knowledge Base and Class Selection

We employ DBpedia [Lehmann2015] as the target knowledge base to be extended. It is extracted from Wikipedia, and especially from Wikipedia infoboxes. As a result, the covered instances are limited to those identified as notable by the Wikipedia community [Oulabi2019]. We use the 2014 release of DBpedia, as this release has been used in related work [Ritze2015, Ritze2016, Oulabi2016, Oulabi2017], and its release date is also closer to the time at which the web table corpus from which we created this dataset was extracted.

From DBpedia we selected three classes for which we built the dataset. The selection was based on four criteria:

Based on this approach we chose the following three classes: (1) GridironFootballPlayer (GF-Player), (2) Song and (3) Settlement, where the class Song also includes all instances of the class Single.

Given those three classes, we profile the existing entities within the knowledge base. The first table provides the number of instances and facts per class, while the second profiles the properties and their densities. The first table shows that DBpedia already covers tens to hundreds of thousands of instances for the profiled classes. This could indicate that most of the well-known instances are already covered, so that we are especially interested in finding instances from the long tail.

Class Instances Facts
GF-Player 20,751 137,319
Song 52,533 315,414
Settlement 468,986 1,444,316

The following table reveals that the density differs significantly from property to property. We only consider head properties that have a density of at least 30 %.

[Table: properties and their densities per class. The property labels were lost from this copy; the surviving density values, in document order, are: 97.43 %, 92.92 %, 86.32 %, 64.33 %, 55.08 %, 54.17 %, 48.47 %, 48.32 %, 38.30 %, 38.22 %, 38.19 %, 89.54 %, 85.85 %, 81.95 %, 80.02 %, 77.41 %, 64.61 %, 60.34 %, 92.51 %, 88.80 %, 62.44 %, 32.96 %, 31.26 %.]

Only the properties of the class Song have consistently high densities, larger than 60 %. The football player class has many properties, but half of them have a density below 50 %. The class Settlement suffers from both a small number of properties and low densities for some of them.

3. Web Table Corpus

We extract this dataset from the English-language relational tables set of the Web Data Commons 2012 Web Table Corpus. The set consists of 91.8 million tables. The table below gives an overview of the general characteristics of tables in the corpus. The majority of tables are rather short, with an average of 10.4 rows and a median of 2, whereas the average and median number of columns are 3.5 and 3. As a result, a table on average describes 10 instances with 30 values, which is likely a sufficient size and potentially useful for finding new instances and their descriptions. In [Ritze2016] we profiled the potential of the same corpus for the task of slot filling, i.e. finding missing values for existing DBpedia instances.

[Table: general characteristics of tables in the corpus. The values were lost from this copy; the key figures are summarized in the paragraph above.]
For every table we assume that there is one attribute that contains the labels of the instances described by the rows. The remaining columns contain values, which potentially can be used to generate descriptions according to the knowledge base schema.

For the three evaluated classes, the following table shows the result of matching the web table corpus to existing instances and properties in DBpedia using the T2K Matching Framework [Ritze2015, Ritze2016]. The first column shows the number of matched tables that have at least one matched attribute column. Rows of those tables were matched directly to existing instances of DBpedia. The second and third columns show how many values were matched to existing instances and how many values remained unmatched. While more values were matched than remained unmatched, the number of unmatched values is still large, especially for the Song class.

[Table: matching results per class (matched tables, matched values, unmatched values). The values were lost from this copy.]
4. Dataset Creation

In this section we outline the creation of the dataset: how we selected tables from the corpus, how the labeling process worked, and which annotations are included in the dataset.

4.1 Web Table Selection

For the gold standard we had to select a certain number of tables per class to annotate. We first matched tables to classes in the knowledge base using the T2K framework [Ritze2015] and then selected the tables for each class separately. To do so, we first divided the instances in the knowledge base into quartiles of popularity, using the indegree count based on a dataset of Wikipedia page links [1, 2]. We then selected three instances per quartile, i.e. 12 per class. Next, for each of these 12 instances, we searched the table corpus for the label without a knowledge base match that co-occurs most often with the label of the selected instance, yielding 12 additional "new" labels. For both the labels of the 12 knowledge base instances and the 12 "new" labels, we extracted up to 15 tables per label, ensuring that few tables come from the same pay-level domain (PLD) and that the tables vary in their types of attributes.

4.2 Labeling Process

Using the table selection method described above, we ended up with 280, 217, and 620 tables for the classes GridironFootballPlayer, Song, and Settlement, respectively. We did not label all tables, and especially not all rows of these tables; instead, we looked for potentially new entities and entities with potentially conflicting names (homonyms). For those we then created clusters by identifying the rows within the tables that describe the same real-world entity. From these row clusters, entities can be created and added to the knowledge base. For each cluster we then identified whether the entity already exists in DBpedia or is a new entity that can be added to DBpedia. For existing entities, we also added a correspondence to the URI of the entity in DBpedia.

For all web tables from which we created row clusters, we matched the table columns to the properties of the knowledge base. These property correspondences allow us to identify how many candidate values exist for a certain combination of entity and property. Finally, we also annotated facts for all clusters, i.e. the correct values for given properties. We only annotated facts for properties for which a candidate value exists among the data in the table: only if, for a row cluster, at least one row in a table has a column matched to a certain property did we annotate the correct fact. We also annotated whether the value of the correct fact was present among the values in the web tables.

When labeling rows, we aimed to label interesting row clusters first. As a result, most tables have only a small number of rows labeled. This does not apply to columns: whenever we label one row in a table, we always label all of its columns.

Finally, for the class Song, we include additional row clusters for existing entities for learning purposes only. These clusters are not fully labeled, as they are missing the fact annotations.

4.3 Annotation Types

The dataset contains various annotation types, which are all described in the table below.

Annotation Type Description Format
Table-To-Class Annotation All tables included in the dataset are matched to one of the three classes we chose to evaluate. The tables are placed in separate folders per class.
Row-To-Instance Annotation For a selection of table rows, we annotate which instance they belong to, existing or new. If the instance described by the row already exists in DBpedia, the annotation corresponds to the entity URI of that instance in DBpedia. Otherwise we generate a new random URI, but keep the prefix of DBpedia entity URIs. All rows matched to the same instance form a row cluster. CSV file format (see 7.3)
New Instance Annotation We provide the list of entity URIs that we created to describe new instances that do not yet exist in DBpedia. LST file format (see 7.6)
Attribute-To-Property Annotation Given the columns of a web table, we annotate which of these columns describe information that corresponds to a property in the schema of the knowledge base. CSV file format (see 7.2)
Fact Annotations Given row clusters and attribute-to-property correspondences, we can determine for each entity, existing or new, described in the dataset, for which triples we have candidate values in the web tables. For these triples, we annotate the correct facts to allow for evaluation. We additionally annotate whether the correct fact is present among the candidate values within the web table data. CSV file format (see 7.5)

5. Dataset Statistics

The following table provides an overview of the number of annotations in the dataset. The first three columns show the number of table, attribute, and row annotations. On average, we have 1.85 attribute annotations per table, not counting the label attribute. The two following columns show the number of annotated clusters, followed by the number of values within those clusters that match a knowledge base property. Overall we annotated 266 clusters, of which 103 are new; each cluster has on average 3.63 rows and 7.85 matched values. The last two columns show the number of unique facts that can be derived for those clusters and the number of facts for which a correct value is present. Per cluster we can derive on average 3.17 facts, and for 92 % of these the correct value is present.

Class Tables Attributes Rows Existing Clusters New Clusters Matched Values Facts Correct Value Present
GF-Player 192 572 358 80 17 1,177 460 436
Song 152 248 195 34 63 428 231 212
Settlement 188 162 413 51 23 487 152 124
Sum 532 982 966 165 103 2,092 843 772

The number of row clusters describing existing entities is low for Song, especially when compared to the number of clusters describing new entities. This is not relevant for evaluation purposes, but for learning, more training examples for existing entities might be required. We therefore additionally include 15 existing entities for learning purposes only. Unlike the other existing entities, we did not annotate any facts for these entities. For these entities we also include 17 additional tables for the class Song, so that overall 169 tables for the class Song are included in the dataset.

The following three tables show the distribution of properties among the annotated matched values and annotated facts in the dataset, per class. We notice that for all three classes there are clear head properties, for which many more values were matched than for the remaining properties. We also find that for some properties we have barely any matched values; the number of such properties is especially high for the class Settlement.

GridironFootballPlayer (property labels were lost from this copy)
Matched Facts Correct Value Present
50 37 33
5 5 3
246 82 81
48 20 20
10 10 10
20 10 10
134 61 58
4 4 4
72 40 37
141 63 56
269 78 74
178 50 50

Song (property labels were lost from this copy)
Matched Facts Correct Value Present
98 54 50
1 1 1
9 7 6
167 64 64
1 1 1
16 10 8
53 38 33
78 52 45
5 4 4

Settlement (property labels were lost from this copy)
Matched Facts Correct Value Present
3 3 0
2 2 2
156 22 22
4 2 0
158 50 50
3 3 0
8 7 0
30 21 10
100 22 22
5 4 4
9 8 6
9 8 8

6. Cross-Validation Splits

We use the gold standard for learning and testing. For this, we split the data into three folds and performed cross-validation in our research [Oulabi2019]. To allow for comparable results, we provide the exact folds used in our work.

We split by cluster, so that the rows of one cluster are always fully contained in one fold. We ensured that both new clusters and homonym groups are split evenly across folds. A homonym group is a group of clusters with highly similar labels. All clusters of a homonym group are always placed in the same fold.
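The group-aware splitting described above can be sketched as follows. This is an illustrative assumption, not the exact procedure used for the published folds: cluster ids are invented, and the greedy largest-first balancing is just one simple way to keep fold sizes even while never splitting a homonym group.

```python
def split_into_folds(groups, n_folds=3):
    """Assign groups of clusters to folds without splitting any group.

    Each element of `groups` is a list of cluster ids; a homonym group
    is one such list. Groups are placed largest-first into the
    currently smallest fold, which keeps fold sizes balanced.
    """
    folds = [[] for _ in range(n_folds)]
    for group in sorted(groups, key=len, reverse=True):
        # Greedily extend the currently smallest fold with the whole group.
        min(folds, key=len).extend(group)
    return folds

# Hypothetical cluster ids; singleton lists are clusters that do not
# belong to any homonym group.
folds = split_into_folds([["a", "b"], ["c"], ["d"], ["e"]])
```

Because the published folds are provided as fold0.lst to fold2.lst, this sketch is only useful if you need to re-split other data in a comparable way.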

Class Fold Clusters New Homonym Groups Clusters in Homonym Groups Rows
GridironFootballPlayer All 97 17 10 21 358
GridironFootballPlayer 0 31 5 3 6 126
GridironFootballPlayer 1 33 5 4 8 118
GridironFootballPlayer 2 33 7 3 7 114
Song All 97 63 20 65 195
Song 0 32 18 6 21 51
Song 1 34 24 7 27 86
Song 2 31 21 7 17 58
Settlement All 74 23 14 31 413
Settlement 0 26 7 5 11 150
Settlement 1 24 9 4 9 106
Settlement 2 24 7 5 11 157

7. Structure and File Formats

The dataset contains three different file formats: CSV, LST, and JSON.

All files are encoded using UTF-8. All CSV files have no headers, use commas as separators, and use double quotation marks as quotation characters. In LST files, each line corresponds to one entry of the list. No quotation or separation characters are used in LST files.
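Under these conventions, reading both formats is straightforward; a minimal sketch (the file contents shown are made-up examples, not actual dataset lines):

```python
import csv
import io

def read_csv_rows(text):
    """Parse CSV content under the dataset's conventions:
    no header, comma separator, double-quote quoting."""
    return list(csv.reader(io.StringIO(text), delimiter=",", quotechar='"'))

def read_lst(text):
    """Parse LST content: one entry per line, no quoting."""
    return [line for line in text.splitlines() if line]

# Illustrative contents only:
rows = read_csv_rows('"0","http://dbpedia.org/ontology/genre"\n')
entries = read_lst("table1\ntable2\n")
```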

7.1 Dataset Directory Structure

The gold standard is split into three separate folders, one per knowledge base class. These folders have the following structure:

CLASS_NAME (GridironFootballPlayer, Song, Settlement)

├─ attributeMapping
│ ├─ table1.csv
│ ├─ table2.csv
│ ├─ table3.csv
│ ├─ ...
│ └─ ...

├─ rowMapping
│ ├─ table1.csv
│ ├─ table2.csv
│ ├─ table3.csv
│ ├─ ...
│ └─ ...

├─ tables
│ ├─ table1.json
│ ├─ table2.json
│ ├─ table3.json
│ ├─ ...
│ └─ ...

├─ facts.csv
├─ fold0.lst
├─ fold1.lst
├─ fold2.lst
├─ forLearning.lst (Song only)
├─ newInstances.lst
├─ referencedEntities.csv (Song only)
└─ tableList.lst

7.2 Attribute Mapping CSV Format

The attribute mapping consists of files that describe correspondences between columns of tables included in the dataset and properties of the knowledge base. For each table we include one CSV file, where the name of the table corresponds to the name of the file without the ".csv" extension.

Each row of this file contains two values: the first is the web table column number, the second the mapped DBpedia property. The first column of a table has the number 0.
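A mapping file can be loaded into a column-to-property dictionary; a sketch under the stated CSV conventions (the example line and property URI are invented for illustration):

```python
import csv

def load_attribute_mapping(lines):
    """Build a column-number -> DBpedia-property map from the lines of
    one attributeMapping CSV file (column numbers are 0-based)."""
    return {int(col): prop for col, prop in csv.reader(lines)}

# Hypothetical file content; real files map the columns of one web table.
mapping = load_attribute_mapping(['"2","http://dbpedia.org/ontology/genre"'])
```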

Example: Song/attributeMapping/1346981172250_1346981783559_593.arc5217493555668914181#52592088_6_1764789873557608114.csv


7.3 Row Mappings CSV Format

The row mapping consists of files that describe which table rows correspond to which entity URI. For each table we include one CSV file, where the name of the table corresponds to the name of the file without the ".csv" extension.

Each row of this file contains two values: the first is the web table row number, the second the full URI of the entity. The first row of the table, which is very often the header row, has the number 0.

Example: GridironFootballPlayer/rowMapping/1346876860779_1346961276986_5601.arc3719795019286941883#45451800_0_7520172109909715831.csv

"50"," Donald-248fa1b2-6061-4e39-b394-4b0717de75b4"

7.4 Table JSON Format

The JSON files within the tables folder fully describe the individual tables included in the dataset, including rows that were not annotated as part of the dataset. The JSON format is described further below using the example. These tables can also be found in the web table corpus linked above.

Two properties are important. First, the relation property describes the actual content of the table: it is an array of arrays, where the outer array contains the columns of the table and each inner array contains all rows of that column. The second important property is keyColumnIndex, which specifies which column is the key column of the table and is therefore linked to the label property of the knowledge base.
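Extracting the label column from these two properties can be sketched as follows (the demo table is a simplified stand-in for the real JSON files, which carry additional properties):

```python
import json

def label_column(table_json):
    """Return the label (key) column of a web table.

    `relation` is column-major: the outer array holds columns, each
    inner array holds the rows of one column. `keyColumnIndex` selects
    the label column.
    """
    table = json.loads(table_json)
    return table["relation"][table["keyColumnIndex"]]

# Simplified stand-in table with two columns of two rows each.
demo = json.dumps({
    "relation": [["player", "ben grubbs"], ["nfl team", "baltimore ravens"]],
    "keyColumnIndex": 0,
})
```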

Example: GridironFootballPlayer/tables/1346823846150_1346837956103_5045.arc6474871151262925852#91994528_4_1071800122125102457.json

         "ben grubbs",
         "kenny irons",
         "will herring",
         "david irons",
         "courtney taylor"
         "nfl team",
         "baltimore ravens",
         "cincinnati bengals",
         "seattle seahawks",
         "atlanta falcons",
         "seattle seahawks"

7.5 Facts CSV Format

Each line in the facts file describes one individual annotated fact. Per line, there are four values. The first contains the URI of the entity, while the second contains the URI of the property. The third contains the annotated fact, while the last is a boolean flag indicating whether the correct value of a fact is present among the values found in the web table data; the values "true" and "false" correspond to present and not present, respectively.

While for most facts there is only one correct value, for some there can be multiple correct values. Multiple values are separated by a simple | and need to be split accordingly when using the dataset.
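A line of facts.csv can thus be parsed as follows; a sketch in which the example line (entity id, property, values) is invented for illustration:

```python
import csv

def parse_fact(line):
    """Split one facts.csv line into its four values and post-process:
    multiple correct values are separated by '|', and the last value is
    a "true"/"false" flag for value presence in the table data."""
    entity, prop, fact, flag = next(csv.reader([line]))
    return entity, prop, fact.split("|"), flag == "true"

# Hypothetical line following the described format:
record = parse_fact('"Donald-248f","http://dbpedia.org/ontology/team","Ravens|Seahawks","true"')
```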

Parsing the first two and the last value is simple; for the actual fact annotation, parsing depends on the fact's data type. The table below provides parsing instructions:

Data-Type Description Format Example
Date A format describing a date, with either year or day granularity yyyy OR
Reference DBpedia URI (needs to be prefixed with the DBpedia resource namespace) No parsing required Nina_Simone
String Literal string No parsing required FH3312
Integer Integer numbers No parsing required 21
Decimal Mixed decimal number. Some values may not be mixed and are simple integers I.F
Signed Decimal Mixed decimal number with a sign ±I.F +1.0
Runtime A format describing runtime in minutes and seconds m:ss 4:01

This table provides a mapping between the properties and the data types above. Additionally, we provide notes per property where applicable.

Class Property Data-Type Note
GridironFootballPlayer birthDate Date  
GridironFootballPlayer birthPlace Reference  
GridironFootballPlayer college Reference  
GridironFootballPlayer draftPick Integer  
GridironFootballPlayer draftRound Integer  
GridironFootballPlayer draftYear Date  
GridironFootballPlayer height Decimal We record height in centimeters, while DBpedia records height in meters, so that a conversion is necessary. Also, all tables exclusively record height in feet and inches.
GridironFootballPlayer highschool Reference  
GridironFootballPlayer number Integer  
GridironFootballPlayer Person/weight Decimal We record weight in kg, and so does DBpedia. All tables exclusively record weight in pounds.
GridironFootballPlayer position Reference  
GridironFootballPlayer team Reference  
Song album Reference  
Song bSide Reference  
Song genre Reference  
Song musicalArtist Reference  
Song producer Reference  
Song recordLabel Reference  
Song releaseDate Date  
Song runtime Runtime DBpedia records runtime in seconds as a simple numeric property, while we record it as time in minutes and seconds. As a result, a conversion is necessary.
Song writer Reference  
Settlement area Decimal  
Settlement continent Reference  
Settlement country Reference  
Settlement elevation Decimal  
Settlement isPartOf Reference  
Settlement populationDensity Decimal  
Settlement populationMetro Decimal  
Settlement populationTotal Decimal  
Settlement postalCode String  
Settlement utcOffset Signed Decimal  
Settlement wgs84_pos#long Signed Decimal  
Settlement wgs84_pos#lat Signed Decimal  
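The conversions implied by the notes above (height, weight, runtime) can be sketched as follows. The constants are standard unit conversions; the exact rounding used when creating the gold standard is not specified here, so treat these as approximations:

```python
def feet_inches_to_cm(feet, inches):
    """Height: tables use feet and inches, the annotations centimeters."""
    return feet * 30.48 + inches * 2.54  # 1 ft = 30.48 cm, 1 in = 2.54 cm

def pounds_to_kg(pounds):
    """Weight: tables use pounds, the annotations (and DBpedia) kilograms."""
    return pounds * 0.45359237  # international avoirdupois pound

def runtime_to_seconds(runtime):
    """Runtime: annotated as minutes:seconds, DBpedia uses plain seconds."""
    minutes, seconds = runtime.split(":")
    return int(minutes) * 60 + int(seconds)
```

Note that height additionally needs a division by 100 when comparing against DBpedia, which records meters.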

Below you will find some examples of the facts CSV file for all three classes.

Example: GridironFootballPlayer/facts.csv


Example: Song/facts.csv


Example: Settlement/facts.csv


7.6 Table, New Instance, Fold, and Learning Lists

We use the list format for several purposes: the list of tables per class (tableList.lst), the list of new instance URIs (newInstances.lst), the cross-validation folds (fold0.lst to fold2.lst), and the Song-only list of clusters included for learning purposes (forLearning.lst).

These list files have the extension .lst. Each line of the file is another entry in the list. There are no quoting characters.

Example: Settlement/fold0.lst

_Vaucluse
_Lincolnshire
_Me%C4%91imurje_County

Example: Song/tableList.lst


Example: GridironFootballPlayer/newInstances.lst

Donald-248fa1b2-6061-4e39-b394-4b0717de75b4

7.7 Referenced Entities

For the class Song, there exist facts that reference entities which do not exist in DBpedia, i.e. they are long-tail entities themselves. We provide these additional entities in a separate file. The file is especially useful as it provides both the labels and the classes of the referenced entities.

For this file we again use the CSV file format, with three values. The first is the URI of the referenced entity, the second its label, and the third, its class alignment.

Example: Song/referencedEntities.csv

"","Shelley Laine","MusicalArtist"
"","Skipping Stones","Album"
"","Best Of 1991-2001","Album"
"","Terry Steele","Writer"
"","David L. Elliott","Writer"
"","Anthology: 1965-1972","Album"
" Rules-34473575-5c29-4048-8435-f96717404db7","Old School New Rules","Album"

8. Download

You can download the dataset here:

9. Feedback

Please send questions and feedback directly to the authors (listed above) or post them in the Web Data Commons Google Group.

10. References

  1. [Cafarella2008] Cafarella, Michael J. and Halevy, Alon Y. and Zhang, Yang and Wang, Daisy Zhe and Wu, Eugene (2008), "Uncovering the Relational Web", In WebDB '08.
  2. [Dong2014] Dong, Xin and Gabrilovich, Evgeniy and Heitz, Geremy and Horn, Wilko and Lao, Ni and Murphy, Kevin and Strohmann, Thomas and Sun, Shaohua and Zhang, Wei (2014), "Knowledge Vault: A Web-scale Approach to Probabilistic Knowledge Fusion", In KDD '14.
  3. [Lehmann2015] Lehmann, Jens and Isele, Robert and Jakob, Max and Jentzsch, Anja and Kontokostas, Dimitris and Mendes, Pablo N and Hellmann, Sebastian and Morsey, Mohamed and Van Kleef, Patrick and Auer, Sören and others (2015), "DBpedia - A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia", Semantic Web. Vol. 6(2), pp. 167-195. IOS Press.
  4. [Oulabi2016] Oulabi, Yaser and Meusel, Robert and Bizer, Christian (2016), "Fusing Time-dependent Web Table Data", In WebDB '16.
  5. [Oulabi2017] Oulabi, Yaser and Bizer, Christian (2017), "Estimating missing temporal meta-information using Knowledge-Based-Trust", In KDWeb '17.
  6. [Oulabi2019] Oulabi, Yaser and Bizer, Christian (2019), "Extending Cross-Domain Knowledge Bases with Long Tail Entities using Web Table Data", In EDBT '19.
  7. [Ritze2015] Ritze, Dominique and Lehmberg, Oliver and Bizer, Christian (2015), "Matching HTML Tables to DBpedia", In WIMS '15.
  8. [Ritze2016] Ritze, Dominique and Lehmberg, Oliver and Oulabi, Yaser and Bizer, Christian (2016), "Profiling the Potential of Web Tables for Augmenting Cross-domain Knowledge Bases", In WWW '16.

Released: 15.07.19