T4LTE - Web Tables for Long-Tail Entity Extraction

This page describes the T4LTE dataset, a gold standard for the task of long-tail entity extraction from web tables.

Knowledge bases such as DBpedia, Wikidata or Yago all rely on data that has been extracted from Wikipedia and, as a result, cover mostly head instances that fulfill the Wikipedia notability criteria [Oulabi2019]. Their coverage of less well-known instances from the long tail is rather low [Dong2014]. As the usefulness of a knowledge base increases with its completeness, adding long-tail instances to a knowledge base is an important task. Web tables [Cafarella2008], which are relational HTML tables extracted from the Web, contain large amounts of structured information, cover a wide range of topics, and describe very specific long-tail instances. Web tables are thus a promising source of information for the task of augmenting cross-domain knowledge bases.

This dataset provides annotations for a selection of web tables for the task of augmenting the DBpedia knowledge base with new long-tail entities from those web tables. It includes annotations for the unique number of entities that can be derived from these web tables, and which of these entities are new, given the instances that are already covered in DBpedia. Additionally, there are annotations for values and facts that can be generated from web table data, allowing the evaluation of how well descriptions of new entities were created.

This dataset was used to develop and evaluate a method for augmenting a knowledge base with long-tail entities from web tables [Oulabi2019]. Using this dataset for training, we were able to add 187 thousand new Song entities with 394 thousand facts, and 14 thousand new GridironFootballPlayer entities with 44 thousand new facts, to DBpedia. In terms of the number of instances, this was an increase of 356 % and 67 % for Song and GridironFootballPlayer respectively [Oulabi2019].

1. Dataset Purpose

The purpose of this dataset is to act as a gold standard for evaluating the extraction of long-tail entities from web tables. It fulfills three tasks:

Allow measuring the performance of long-tail entity extraction, including recall. By focusing on recall we can ensure that methods retrieve a large number of potential new entities from web tables.
Allow the automatic evaluation of implemented methods.
Allow us to train long-tail entity extraction methods.

2. Knowledge Base and Class Selection

We employ DBpedia [Lehmann2015] as the target knowledge base to be extended. It is extracted from Wikipedia and especially Wikipedia infoboxes. As a result, the covered instances are limited to those identified as notable by the Wikipedia community [Oulabi2019]. We use the 2014 release of DBpedia, as this release has been used in related work [Ritze2015, Ritze2016, Oulabi2016, Oulabi2017], and its release date is also closer to the extraction of the web table corpus from which we created this dataset.

From DBpedia we selected three classes for which we built the dataset. This selection was based on four criteria:

Versatility: the three chosen classes must be from three different first-level classes. The first level classes in DBpedia are Species, Work, Agent and Place.
Specificity: we preferred classes further down in the DBpedia class hierarchy.
Potential: using a baseline set expansion approach we measured how many new instances and facts can potentially be added to the knowledge base. Classes with higher numbers were preferred.
Name conflict likelihood: we utilize the labels of instances in the knowledge base to measure the potential for homonyms given a certain class. Classes with a higher relative occurrence of homonyms were preferred, as those represent classes for which this task is more difficult.

Based on this approach we chose the following three classes: (1) GridironFootballPlayer (GF-Player), (2) Song and (3) Settlement, where the class Song also includes all instances of the class Single.

Given those three classes we will profile the existing entities within the knowledge base. The first table provides the number of instances and facts per class, while the second profiles the properties and their densities. The first table shows that DBpedia already covers tens of thousands of instances for the profiled classes. This could indicate that most of the well-known instances are already covered, so that we are especially interested in finding instances from the long tail.

Class	Instances	Facts
GF-Player	20,751	137,319
Song	52,533	315,414
Settlement	468,986	1,444,316

The following table reveals that the density differs significantly from property to property. We only consider head properties that have a density of at least 30 %.

Only the properties of the class Song have consistently high densities of more than 60 %. The football player class has many properties, but about half of them have a density below 50 %. The class Settlement suffers from both a small number of properties and low densities for some of them.

Class	Property	Facts	Density
GF-Player	birthDate	20,218	97.43 %
GF-Player	college	19,281	92.92 %
GF-Player	birthPlace	17,912	86.32 %
GF-Player	team	13,349	64.33 %
GF-Player	number	11,430	55.08 %
GF-Player	position	11,240	54.17 %
GF-Player	height	10,059	48.47 %
GF-Player	weight	10,027	48.32 %
GF-Player	draftYear	7,947	38.30 %
GF-Player	draftRound	7,932	38.22 %
GF-Player	draftPick	7,924	38.19 %

Song	genre	47,040	89.54 %
Song	musicalArtist	45,097	85.85 %
Song	recordLabel	43,053	81.95 %
Song	runtime	42,035	80.02 %
Song	album	40,666	77.41 %
Song	writer	33,942	64.61 %
Song	releaseDate	31,696	60.34 %

Settlement	country	433,838	92.51 %
Settlement	isPartOf	416,454	88.80 %
Settlement	populationTotal	292,831	62.44 %
Settlement	postalCode	154,575	32.96 %
Settlement	elevation	146,618	31.26 %
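
The density values above are consistent with dividing the number of facts of a property by the number of instances of its class, as given in the first table. The following minimal sketch reproduces two of the values under that assumption; the function name `density` is illustrative and not part of the dataset.

```python
def density(facts: int, instances: int) -> float:
    """Share of the class instances that have a fact for the property, in percent."""
    return 100.0 * facts / instances

# Instance counts are taken from the first table, fact counts from the second.
print(f"GF-Player birthDate: {density(20_218, 20_751):.2f} %")
print(f"Song genre: {density(47_040, 52_533):.2f} %")
```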

3. Web Table Corpus

We extract this dataset from the English-language relational tables set of the Web Data Commons 2012 Web Table Corpus. The set consists of 91.8 million tables. The table below gives an overview of the general characteristics of tables in the corpus. We can see that the majority of tables are rather short, with an average of 10.4 rows and a median of 2, whereas the average and median number of columns are 3.5 and 3. As a result, a table on average describes 10 instances with 30 values, which is likely a sufficient size and potentially useful for finding new instances and their descriptions. In [Ritze2016] we profiled the potential of the same corpus for the task of slot filling, i.e. finding missing values for existing DBpedia instances.

	Average	Median	Min	Max
Rows	10.37	2	1	35,640
Columns	3.48	3	2	713

For every table we assume that there is one attribute that contains the labels of the instances described by the rows. The remaining columns contain values, which potentially can be used to generate descriptions according to the knowledge base schema.

For the three evaluated classes, the following table shows the result of matching the web table corpus to existing instances and properties in DBpedia, using the T2K Matching Framework [Ritze2015, Ritze2016]. The first column shows the number of matched tables that have at least one matched attribute column. Rows of those tables were matched directly to existing instances of DBpedia. From the second and third columns we see how many values were matched to existing instances and how many values remained unmatched. While more values were matched, the number of unmatched values is still large, especially for the songs class.

Class	Tables	V_Matched	V_Unmatched
GF-Player	10,432	206,847	35,968
Song	58,594	1,315,381	443,194
Settlement	11,757	82,816	13,735

4. Dataset Creation

In this section we will outline the creation of the dataset. This includes how we selected tables from the corpus, what the labeling process is, and what annotations are included in the dataset.

4.1 Web Tables Selection

For the gold standard we had to select a certain number of tables per class to annotate. We first matched tables to classes in the knowledge base using the T2K framework [Ritze2015]. We then selected the tables for each class separately. To do this, we first divided the instances in the knowledge base into quartiles of popularity, using the indegree count based on a dataset of Wikipedia page links [1, 2]. We then selected three instances per quartile, 12 per class overall. For each of these 12 instances, we also selected one label from the table corpus that has no match in the knowledge base, choosing the unmatched label that co-occurs most often with the label of the selected instance. For both the labels of the 12 knowledge base instances and the 12 additional "new" labels, we extracted up to 15 tables per label, ensuring that few tables are chosen from the same pay-level domain (PLD) and that the tables vary in their types of attributes.
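
The quartile-based instance selection can be sketched as follows. This is a minimal illustration and not the original implementation: the function name, the random draw within each quartile, and the toy indegree values are assumptions. The text above only states that instances are divided into popularity quartiles by indegree and that three are chosen per quartile.

```python
import random

def select_by_popularity_quartile(indegree, per_quartile=3, seed=0):
    """Split instances into four popularity quartiles by indegree and draw
    `per_quartile` instances from each (the random draw is an assumption)."""
    rng = random.Random(seed)
    # Rank by indegree; ties are broken by name to keep the order stable.
    ranked = sorted(indegree, key=lambda uri: (indegree[uri], uri))
    n = len(ranked)
    selected = []
    for q in range(4):
        # Integer bounds so that the four slices cover the ranking exactly once.
        quartile = ranked[q * n // 4:(q + 1) * n // 4]
        selected.extend(rng.sample(quartile, min(per_quartile, len(quartile))))
    return selected

# Toy data: 20 hypothetical instances with indegree 0..19.
toy = {f"instance_{i:02d}": i for i in range(20)}
picked = select_by_popularity_quartile(toy)
print(len(picked))  # three per quartile, 12 overall
```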

4.2 Labeling Process

Using the table selection method described above, we ended up with 280, 217 and 620 tables for the classes GridironFootballPlayer, Song and Settlement respectively. We did not label all tables, and in particular not all rows of these tables; instead, we looked for potentially new entities and entities with potentially conflicting names (homonyms). For those we then created clusters by identifying the rows within the tables that describe the same real-world entity. From these row clusters, entities can be created and added to the knowledge base. For each cluster we then identified whether the entity already exists in DBpedia or is a new entity that can be added to DBpedia. For existing entities, we also added a correspondence to the URI of the entity in DBpedia.

For all web tables from which we created row clusters, we matched the table columns to the properties of the knowledge base. These property correspondences allow us to identify how many candidate values exist for a certain combination of entity and property. Finally, we also annotated facts for all clusters, i.e. the correct values for certain properties. We only annotated facts for properties for which a candidate value exists among the data in the tables. That is, we annotated the correct fact for a row cluster and a property only if at least one row of the cluster lies in a table with a column matched to that property. We also annotated whether the value of the correct fact was present among the values in the web tables.

When labeling rows, we aimed to label interesting row clusters first. As a result, most tables have only a small number of labeled rows. This does not apply to columns: whenever we label one row in a table, we always label all of its columns.

Finally, for the class Song, we include additional row clusters for existing entities for learning purposes only. These clusters are not fully labeled, as they lack the fact annotations.

4.3 Annotation Types

The dataset contains various annotation types, which are all described in the table below.

Annotation Type	Description	Format
Table-To-Class Annotation	All tables included in the dataset are matched to one of the three classes we chose to evaluate.	The tables are placed in separate folders per class.
Row-To-Instance Annotation	For a selection of table rows, we annotate which instance, existing or new, they belong to. If the instance described by the row already exists in DBpedia, we use the entity URI of that instance in DBpedia. Otherwise we generate a new random URI, but keep the prefix of DBpedia entity URIs. All rows matched to the same instance form a row cluster.	CSV file format (see 7.3)
New Instance Annotation	We provide the list of entity URIs that we created to describe new instances that do not yet exist in DBpedia.	LST file format (see 7.6)
Attribute-To-Property Annotation	Given the columns of a web table, we annotate which of these columns describe information that corresponds to a property in the schema of the knowledge base.	CSV file format (see 7.2)
Fact Annotations	Given row clusters and attribute-to-property correspondences, we can determine for each entity, existing or new, described in the dataset, for which triples we have candidate values in the web tables. For these triples, we annotate the correct facts to allow for evaluation. We additionally annotate whether the correct fact is present among the candidate values within the web table data.	CSV file format (see 7.5)

5. Dataset Statistics

The following table provides an overview of the number of annotations in the dataset. In the first three columns we see the number of table, attribute and row annotations. On average, we have 1.85 attribute annotations per table, not counting the label attribute. The two following columns show the number of annotated clusters, followed by the number of values within those clusters that match a knowledge base property. Overall, we annotated 266 clusters, of which 103 are new, and each cluster has on average 3.63 rows and 7.85 matched values. The last two columns show the number of unique facts that can be derived for those clusters, and the number of facts for which a correct value is present. Per cluster we can derive on average 3.17 facts, and for 92 % of these facts the correct value is present.

Class	Tables	Attributes	Rows	Existing Clusters	New Clusters	Matched Values	Facts	Correct Value Present
GF-Player	192	572	358	80	17	1,177	460	436
Song	152	248	195	34	63	428	231	212
Settlement	188	162	413	51	23	487	152	124
Sum	532	982	966	165	103	2,092	843	772

The number of row clusters describing existing entities is low for Song, especially when compared to the number of clusters describing new entities. This is not relevant for evaluation purposes, but for learning, more training examples for existing entities might be required. We therefore additionally include 15 existing entities for learning purposes only. Unlike the other existing entities, we did not annotate any facts for those entities. For these entities we also include 17 additional tables for the class Song, so that the dataset includes 169 tables overall for the class Song.

The following three tables show the distribution of properties among the annotated matched values and facts in the dataset, per class. We notice that for all three classes there are clear head properties, for which many more values were matched than for the remaining properties. We also find that for some properties we have barely any matched values. The number of such properties is especially high for the class Settlement.

GridironFootballPlayer	Matched Values	Facts	Correct Value Present
http://dbpedia.org/ontology/birthDate	50	37	33
http://dbpedia.org/ontology/birthPlace	5	5	3
http://dbpedia.org/ontology/college	246	82	81
http://dbpedia.org/ontology/draftPick	48	20	20
http://dbpedia.org/ontology/draftRound	10	10	10
http://dbpedia.org/ontology/draftYear	20	10	10
http://dbpedia.org/ontology/height	134	61	58
http://dbpedia.org/ontology/highschool	4	4	4
http://dbpedia.org/ontology/number	72	40	37
http://dbpedia.org/ontology/Person/weight	141	63	56
http://dbpedia.org/ontology/position	269	78	74
http://dbpedia.org/ontology/team	178	50	50

Song	Matched Values	Facts	Correct Value Present
http://dbpedia.org/ontology/album	98	54	50
http://dbpedia.org/ontology/bSide	1	1	1
http://dbpedia.org/ontology/genre	9	7	6
http://dbpedia.org/ontology/musicalArtist	167	64	64
http://dbpedia.org/ontology/producer	1	1	1
http://dbpedia.org/ontology/recordLabel	16	10	8
http://dbpedia.org/ontology/releaseDate	53	38	33
http://dbpedia.org/ontology/runtime	78	52	45
http://dbpedia.org/ontology/writer	5	4	4

Settlement	Matched Values	Facts	Correct Value Present
http://dbpedia.org/ontology/area	3	3	0
http://dbpedia.org/ontology/continent	2	2	2
http://dbpedia.org/ontology/country	156	22	22
http://dbpedia.org/ontology/elevation	4	2	0
http://dbpedia.org/ontology/isPartOf	158	50	50
http://dbpedia.org/ontology/populationDensity	3	3	0
http://dbpedia.org/ontology/populationMetro	8	7	0
http://dbpedia.org/ontology/populationTotal	30	21	10
http://dbpedia.org/ontology/postalCode	100	22	22
http://dbpedia.org/ontology/utcOffset	5	4	4
http://www.w3.org/2003/01/geo/wgs84_pos#long	9	8	6
http://www.w3.org/2003/01/geo/wgs84_pos#lat	9	8	8

6. Cross-Validation Splits

We use the gold standard for learning and testing. For this, we split the data into three folds and performed cross-validation in our research [Oulabi2019]. To allow for comparable results, we provide the exact folds used in our work.

We split by cluster, so that the rows of one cluster are always fully included in one fold. We ensured that both new clusters and homonym groups are split evenly. A homonym group is a group of clusters with highly similar labels. All clusters of a homonym group were always placed in the same fold.

Class	Fold	Clusters	New	Homonym Groups	Clusters in Homonym Groups	Rows
GridironFootballPlayer	All	97	17	10	21	358
GridironFootballPlayer	0	31	5	3	6	126
GridironFootballPlayer	1	33	5	4	8	118
GridironFootballPlayer	2	33	7	3	7	114

Song	All	97	63	20	65	195
Song	0	32	18	6	21	51
Song	1	34	24	7	27	86
Song	2	31	21	7	17	58

Settlement	All	74	23	14	31	413
Settlement	0	26	7	5	11	150
Settlement	1	24	9	4	9	106
Settlement	2	24	7	5	11	157

7. Structure and File Formats

The dataset contains broadly three different file formats:

CSV: used for row mappings, attribute mappings, facts and referenced entities
LST: used to list e.g. tables and entities
JSON: used to describe the whole table

All files are encoded using UTF-8. All CSV files have no headers, use commas as separators and double quotation marks as quotation characters. In LST files each line corresponds to an entry in the list. No quotation or separation characters are used in LST files.

7.1 Dataset Directory Structure

The gold standard is split into three separate folders, one per knowledge base class. These folders have the following structure:

CLASS_NAME (GridironFootballPlayer, Song, Settlement)
│
├─ attributeMapping
│  ├─ table1.csv
│  ├─ table2.csv
│  ├─ table3.csv
│  ├─ ...
│  └─ ...
│
├─ rowMapping
│  ├─ table1.csv
│  ├─ table2.csv
│  ├─ table3.csv
│  ├─ ...
│  └─ ...
│
├─ tables
│  ├─ table1.json
│  ├─ table2.json
│  ├─ table3.json
│  ├─ ...
│  └─ ...
│
├─ facts.csv
├─ fold0.lst
├─ fold1.lst
├─ fold2.lst
├─ forLearning.lst (Song only)
├─ newInstances.lst
├─ referencedEntities.csv (Song only)
└─ tableList.lst

7.2 Attribute Mapping CSV Format

The attribute mapping consists of files that describe correspondences between columns of tables included in the dataset and properties of the knowledge base. For each table we include one CSV file, where the name of the table corresponds to the name of the file without the ".csv" extension.

Each row of this file describes two values. The first value contains the web table column number, while the second contains the mapped DBpedia property. The first column of a table has the number 0.
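
A file in this format can be read with a standard CSV parser. The following is a minimal sketch and not part of the dataset tooling; the function name `read_attribute_mapping` and the inline sample lines are illustrative.

```python
import csv
import io

def read_attribute_mapping(lines):
    """Return {column index: DBpedia property URI} from header-less CSV lines."""
    reader = csv.reader(lines, delimiter=",", quotechar='"')
    return {int(row[0]): row[1] for row in reader if row}

# Inline sample in the attribute mapping format.
sample = io.StringIO(
    '"0","http://dbpedia.org/ontology/musicalArtist"\n'
    '"3","http://dbpedia.org/ontology/releaseDate"\n'
)
mapping = read_attribute_mapping(sample)
print(mapping[3])
```

For a file on disk, the same function can be called with a file object opened via `open(path, encoding="utf-8", newline="")`.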

Example: Song/attributeMapping/1346981172250_1346981783559_593.arc5217493555668914181#52592088_6_1764789873557608114.csv

"0","http://dbpedia.org/ontology/musicalArtist"
"3","http://dbpedia.org/ontology/releaseDate"
"4","http://dbpedia.org/ontology/recordLabel"
"6","http://dbpedia.org/ontology/genre"

7.3 Row Mappings CSV Format

The row mapping consists of files that describe which table rows correspond to which entity URI. For each table we include one CSV file, where the name of the table corresponds to the name of the file without the ".csv" extension.

Each row of this file describes two values. The first value contains the web table row number, while the second contains the full URI of the entity. The first row of the table, which is very often the header row, has the number 0.
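
Since all rows mapped to the same entity URI form a row cluster, the clusters can be rebuilt by collecting the row mappings of all tables. The sketch below illustrates this under stated assumptions: the function name, the table names `tableA` and `tableB`, and the `tableB` row are hypothetical.

```python
import csv
import io
from collections import defaultdict

def add_row_mapping(clusters, table_name, lines):
    """Add (table name, row number) pairs to the cluster of their entity URI."""
    for row in csv.reader(lines):
        if row:
            clusters[row[1]].append((table_name, int(row[0])))

clusters = defaultdict(list)
add_row_mapping(clusters, "tableA", io.StringIO(
    '"28","http://dbpedia.org/resource/Andrew_Sweat"\n'
    '"23","http://dbpedia.org/resource/Jerrell_Harris"\n'))
# Hypothetical second table that mentions the same entity.
add_row_mapping(clusters, "tableB", io.StringIO(
    '"5","http://dbpedia.org/resource/Andrew_Sweat"\n'))
print(clusters["http://dbpedia.org/resource/Andrew_Sweat"])
```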

Example: GridironFootballPlayer/rowMapping/1346876860779_1346961276986_5601.arc3719795019286941883#45451800_0_7520172109909715831.csv

"28","http://dbpedia.org/resource/Andrew_Sweat"
"32","http://dbpedia.org/resource/Ameet_Pall"
"46","http://dbpedia.org/resource/Jerrell_Wedge-00869aeb-d468-46fc-8a33-e11e6b771730"
"50","http://dbpedia.org/resource/Chris Donald-248fa1b2-6061-4e39-b394-4b0717de75b4"
"35","http://dbpedia.org/resource/shelly_lyons-c026bb63-4fa2-11e8-9b01-1d14cf16e545"
"24","http://dbpedia.org/resource/Brandon_Marshall_(linebacker)"
"23","http://dbpedia.org/resource/Jerrell_Harris"

7.4 Table JSON Format

The JSON files within the tables folder fully describe the individual tables included in the dataset, including rows that were not annotated as part of the dataset. The JSON format is illustrated by the example below. These tables can also be found in the web table corpus linked above.

Two properties are important. First, the relation property describes the actual content of the table. It is an array of arrays, where the outer array contains the columns of the table, and each inner array contains all rows of that column. Second, the keyColumnIndex property identifies the key column of the table, which is linked to the label property of the knowledge base.

Example: GridironFootballPlayer/tables/1346823846150_1346837956103_5045.arc6474871151262925852#91994528_4_1071800122125102457.json

{
   "hasKeyColumn":true,
   "headerRowIndex":0,
   "keyColumnIndex":0,
   "pld":"draftboardinsider.com",
   "url":"http://www.draftboardinsider.com/ncaateams/sec/auburn.shtml",
   "relation":[
      [
         "player",
         "ben grubbs",
         "kenny irons",
         "will herring",
         "david irons",
         "courtney taylor"
      ],
      [
         "pos",
         "og",
         "rb",
         "olb",
         "cb",
         "wr"
      ],
      [
         "pick",
         "29",
         "49",
         "161",
         "194",
         "197"
      ],
      [
         "nfl team",
         "baltimore ravens",
         "cincinnati bengals",
         "seattle seahawks",
         "atlanta falcons",
         "seattle seahawks"
      ]
   ]
}
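
Because the relation property is stored column by column, it usually has to be transposed before rows can be addressed by the row numbers used in the row mappings. The following minimal sketch shows one way to do this; the function name `table_rows` is illustrative, the inline document is a shortened version of the example above, and the code assumes that all columns have the same length.

```python
import json

def table_rows(table):
    """Turn the column-major `relation` array into a list of rows."""
    return [list(row) for row in zip(*table["relation"])]

doc = json.loads("""{
  "hasKeyColumn": true, "headerRowIndex": 0, "keyColumnIndex": 0,
  "relation": [["player", "ben grubbs", "kenny irons"],
               ["pos", "og", "rb"]]
}""")
rows = table_rows(doc)
header = rows[doc["headerRowIndex"]]
# Entity labels are the key column values of all non-header rows.
labels = [row[doc["keyColumnIndex"]] for i, row in enumerate(rows)
          if i != doc["headerRowIndex"]]
print(header)
print(labels)
```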

7.5 Facts CSV Format

Each line in the facts file describes one individual annotated fact. Per line, there are four values. The first contains the URI of the entity, while the second contains the URI of the property. The third contains the annotated fact, while the last is a boolean flag, on whether the correct value of a fact is present among the values found in the web table data, where the values "true" and "false" correspond to present and not present respectively.

While for most facts, there is only one correct value present, for some there can be multiple correct values. Multiple values are separated by a simple |, and values need to be split accordingly when using the dataset.

Parsing the first two values and the last value is simple. For the actual fact annotation, the parsing depends on the data type of the fact. The table below provides parsing instructions:

Data-Type	Description	Format	Example
Date	A format describing date, either with a year or day granularity	yyyy OR yyyy-mm-dd	2000 2012-04-20
Reference	DBpedia URI (needs to be prefixed with http://dbpedia.org/resource/)	No parsing required	Nina_Simone
String	Literal string	No parsing required	FH3312
Integer	Integer numbers	No parsing required	21
Decimal	Mixed decimal number. Some numbers might not be mixed and are simple integers	I.F OR I	187.96 88.45051 4144
Signed Decimal	Mixed decimal number with a sign	±I.F	+1.0 -4.5333333
Runtime	A format describing runtime in minutes and seconds	m:ss	4:01 5:13

This table provides a mapping between the properties and the data-types above. Additionally we provide some notes per property if applicable.

Class	Property	Data-Type	Note
GridironFootballPlayer	birthDate	Date
GridironFootballPlayer	birthPlace	Reference
GridironFootballPlayer	college	Reference
GridironFootballPlayer	draftPick	Integer
GridironFootballPlayer	draftRound	Integer
GridironFootballPlayer	draftYear	Date
GridironFootballPlayer	height	Decimal	We record height in centimeters, while DBpedia records height in meters, so that a conversion is necessary. Also, all tables exclusively record height in feet and inches.
GridironFootballPlayer	highschool	Reference
GridironFootballPlayer	number	Integer
GridironFootballPlayer	Person/weight	Decimal	We record weight in kg, and so does DBpedia. All tables exclusively record weight in pounds.
GridironFootballPlayer	position	Reference
GridironFootballPlayer	team	Reference

Song	album	Reference
Song	bSide	Reference
Song	genre	Reference
Song	musicalArtist	Reference
Song	producer	Reference
Song	recordLabel	Reference
Song	releaseDate	Date
Song	runtime	Runtime	DBpedia records runtime in seconds as a simple numeric property, while we record it as time in minutes and seconds. As a result, a conversion is necessary.
Song	writer	Reference

Settlement	area	Decimal
Settlement	continent	Reference
Settlement	country	Reference
Settlement	elevation	Decimal
Settlement	isPartOf	Reference
Settlement	populationDensity	Decimal
Settlement	populationMetro	Decimal
Settlement	populationTotal	Decimal
Settlement	postalCode	String
Settlement	utcOffset	Signed Decimal
Settlement	wgs84_pos#long	Signed Decimal
Settlement	wgs84_pos#lat	Signed Decimal
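
The parsing rules can be sketched as follows. This is a minimal illustration, not a reference parser: the function names are assumptions, the data-type names follow the parsing table above, years are kept as plain integers, and runtimes are converted to seconds to match the DBpedia representation described in the notes.

```python
from datetime import date

RESOURCE_PREFIX = "http://dbpedia.org/resource/"

def parse_value(data_type, raw):
    """Parse a single, already split fact value according to its data type."""
    if data_type == "Date":
        # Year granularity stays an int; day granularity becomes a date.
        return int(raw) if len(raw) == 4 else date.fromisoformat(raw)
    if data_type == "Reference":
        return RESOURCE_PREFIX + raw
    if data_type == "Integer":
        return int(raw)
    if data_type in ("Decimal", "Signed Decimal"):
        return float(raw)  # float() accepts "4144", "+1.0" and "-4.5333333"
    if data_type == "Runtime":
        minutes, seconds = raw.split(":")
        return int(minutes) * 60 + int(seconds)  # runtime in seconds
    return raw  # String: no parsing required

def parse_fact(data_type, annotated):
    """Split a fact annotation on '|' and parse every value."""
    return [parse_value(data_type, value) for value in annotated.split("|")]

print(parse_fact("Reference", "Defensive_end|Linebacker"))
print(parse_fact("Runtime", "6:19"))
# Height is recorded in centimeters; dividing by 100 gives meters as in DBpedia.
height_m = parse_fact("Decimal", "187.96")[0] / 100
```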

Below you will find some examples of the facts CSV file for all three classes.

Example: GridironFootballPlayer/facts.csv

"http://dbpedia.org/resource/Al_Harris_(defensive_lineman)","http://dbpedia.org/ontology/position","Defensive_end|Linebacker","true"
"http://dbpedia.org/resource/Allen_Reisner","http://dbpedia.org/ontology/birthDate","1988-09-29","true"
"http://dbpedia.org/resource/Mike_Bell_(defensive_lineman)","http://dbpedia.org/ontology/college","Colorado_State_Rams","true"
"http://dbpedia.org/resource/Andre_Roberts_(American_football)","http://dbpedia.org/ontology/Person/weight","88.45051","true"
"http://dbpedia.org/resource/louis_nzegwu-90617f1a-1dbf-48c0-ae52-dfb4cb5043ab","http://dbpedia.org/ontology/team","Atlanta_Falcons","true"
"http://dbpedia.org/resource/Donald_Jones_(American_football)","http://dbpedia.org/ontology/birthDate","1987-12-17","true"
"http://dbpedia.org/resource/Mike_Williams_(wide_receiver,_born_1987)","http://dbpedia.org/ontology/height","187.96","true"
"http://dbpedia.org/resource/Anquan_Boldin","http://dbpedia.org/ontology/team","Arizona_Cardinals|Baltimore_Ravens|San_Francisco_49ers|Detroit_Lions|Buffalo_Bills","true"
"http://dbpedia.org/resource/Al_Harris_(defensive_lineman)","http://dbpedia.org/ontology/team","Chicago_Bears","true"
"http://dbpedia.org/resource/Mike_Williams_(wide_receiver,_born_1984)","http://dbpedia.org/ontology/draftPick","10","true"
...

Example: Song/facts.csv

"http://dbpedia.org/resource/rhythm_of_life-17f821d8-8424-49b9-ad35-aae0094d475c","http://dbpedia.org/ontology/musicalArtist","U96","true"
"http://dbpedia.org/resource/Men's_Needs","http://dbpedia.org/ontology/musicalArtist","The_Cribs","true"
"http://dbpedia.org/resource/The_Lemon_Song","http://dbpedia.org/ontology/runtime","6:19","true"
"http://dbpedia.org/resource/lemon_tree-c4ef5525-b118-4ed9-8206-2d64b91a0b89","http://dbpedia.org/ontology/album","The_Very_Best_of_Peter,_Paul_and_Mary-f3c362a9-764f-45b2-a3b4-dc32f71c8902","true"
"http://dbpedia.org/resource/Seek_&_Destroy","http://dbpedia.org/ontology/album","Kill_%27Em_All","true"
"http://dbpedia.org/resource/I'm_Ready_for_Love","http://dbpedia.org/ontology/musicalArtist","Martha_and_the_Vandellas","true"
"http://dbpedia.org/resource/Something_About_You-646ffccf-6fb9-4279-9b84-eb582b959388","http://dbpedia.org/ontology/album","Aliens_&_Rainbows","true"
"http://dbpedia.org/resource/Beautiful_(Mai_Kuraki_song)","http://dbpedia.org/ontology/releaseDate","2009-06-10","true"
"http://dbpedia.org/resource/Lemon_(song)","http://dbpedia.org/ontology/releaseDate","1993","true"
...

Example: Settlement/facts.csv

"http://dbpedia.org/resource/Beijing","http://www.w3.org/2003/01/geo/wgs84_pos#long","116.383333","true"
"http://dbpedia.org/resource/Rome","http://dbpedia.org/ontology/area","1285","false"
"http://dbpedia.org/resource/Bakar","http://dbpedia.org/ontology/country","Croatia","true"
"http://dbpedia.org/resource/Rome","http://dbpedia.org/ontology/utcOffset","+1.0","true"
"http://dbpedia.org/resource/arriondas-a4a0fc90-8a84-11e8-82e1-3fad23f94135","http://dbpedia.org/ontology/country","Spain","true"
"http://dbpedia.org/resource/Bwaga_Cheti","http://www.w3.org/2003/01/geo/wgs84_pos#lat","-4.5333333","true"
"http://dbpedia.org/resource/belec-a4a20efb-8a84-11e8-82e1-55488444b4bb","http://dbpedia.org/ontology/postalCode","49254","true"
"http://dbpedia.org/resource/Bakarac","http://dbpedia.org/ontology/isPartOf","Primorje-Gorski_Kotar_County","true"
...
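
Entity URIs such as Mike_Williams_(wide_receiver,_born_1987) contain commas, so a facts file is best read with a CSV parser that honors the quotation marks rather than by splitting lines on commas. The following is a minimal sketch; the function name `read_facts` is illustrative and the two sample lines are taken from the Settlement example above.

```python
import csv
import io

def read_facts(lines):
    """Yield (entity URI, property URI, raw annotation, value_present) tuples."""
    for row in csv.reader(lines):
        if not row:
            continue  # skip empty lines
        entity, prop, annotation, present = row
        yield entity, prop, annotation, present == "true"

sample = io.StringIO(
    '"http://dbpedia.org/resource/Rome","http://dbpedia.org/ontology/area","1285","false"\n'
    '"http://dbpedia.org/resource/Rome","http://dbpedia.org/ontology/utcOffset","+1.0","true"\n'
)
facts = list(read_facts(sample))
present = sum(1 for *_, flag in facts if flag)
print(f"{present} of {len(facts)} facts have their correct value in the web tables")
```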

7.6 Table, New Instance, Folds and for Learning Lists

We use the list format for several kinds of lists:

The cross-validation folds, each listing the entities of one fold.
The list of all tables per class.
The list of entity URIs of new entities.
For the class Song, the list of the 15 existing entities that are to be used for learning purposes only.

These list files have the extension .lst. Each line of the file is another entry in the list. There are no quoting characters.

Example: Settlement/fold0.lst

http://dbpedia.org/resource/Aurel,_Vaucluse
http://dbpedia.org/resource/Stamford,_Lincolnshire
http://dbpedia.org/resource/Bwaga_Cheti
http://dbpedia.org/resource/kalakho-f889b410-4fa2-11e8-9b01-4b7e9cb868f5
http://dbpedia.org/resource/Belica,_Me%C4%91imurje_County
http://dbpedia.org/resource/burgo_ranero_(el)-a4a1e6e3-8a84-11e8-82e1-f765c0073ff5
http://dbpedia.org/resource/Parys
http://dbpedia.org/resource/Bakar
http://dbpedia.org/resource/Beli_Manastir
http://dbpedia.org/resource/Chaville
http://dbpedia.org/resource/Bonyunyu
...

Example: Song/tableList.lst

1346981172231_1347009637666_1623.arc2035420803784423551#8157737_1_7371407078293434892
1346981172137_1346990327917_1687.arc8790234217643537183#29324093_0_4104648016207008655
1350433107059_1350464532277_262.arc3274150641837087721#45894422_0_4048022465851316720
1346876860798_1346941853400_2953.arc2527404313287902461#59379225_0_6616355908335718299
1346876860596_1346938127566_2223.arc3753870089959127664#66546593_0_554336699268001312
1346876860493_1346903186037_233.arc7106100585551357027#48571984_0_6773473850340800215
1350433107058_1350501694041_732.arc1602187029723264891#37777045_0_5887313017136165099
1346981172186_1346995420474_2788.arc8792188372387791527#96007354_0_5596511497072590105
1346876860840_1346953273866_1315.arc7476207801019051251#90975928_0_7687754714967118394
1346981172231_1347010609872_3515.arc2403866143077224377#1524220_0_7599368370767966283
1346876860596_1346938903566_3154.arc4273244386981436402#87484324_1_1397156714755041772
1346876860611_1346928074552_1536.arc941312314634454173#7668872_0_4460134851750954295
1346876860807_1346934722734_147.arc4322826721152635511#74438125_0_3796119154304144126
1346981172155_1347002310061_2451.arc2996742124595566891#69073711_0_6127538729831462210
1346981172239_1346995950584_63.arc1630314548530234317#65396150_0_5145880845606151839
...

Example: GridironFootballPlayer/newInstances.lst

http://dbpedia.org/resource/shelly_lyons-c026bb63-4fa2-11e8-9b01-1d14cf16e545
http://dbpedia.org/resource/mike_ball-86931519-f533-4e99-b1be-b45fb805e7e5
http://dbpedia.org/resource/michael_vandermeulen-c020ef8f-4fa2-11e8-9b01-c3c34fe696e5
http://dbpedia.org/resource/Chris Donald-248fa1b2-6061-4e39-b394-4b0717de75b4
http://dbpedia.org/resource/james_carmon-c02869e4-4fa2-11e8-9b01-81f205eb29eb
http://dbpedia.org/resource/alvin_mitchell-33d17a8e-ed7d-433e-9728-3e1c26658a6a
http://dbpedia.org/resource/ben_buchanan-c0289116-4fa2-11e8-9b01-fdd98ba7cc66
http://dbpedia.org/resource/merritt_kersey-c025d110-4fa2-11e8-9b01-65117e7d5512
http://dbpedia.org/resource/aderious_simmoms-c023ae8d-4fa2-11e8-9b01-89f1d0e15a08
...
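
The fold lists can be turned into cross-validation splits as sketched below. This is a minimal illustration: the function names are assumptions, and the assignment of the toy URIs to folds is hypothetical. Lines are kept unchanged because some entity URIs in the dataset contain spaces.

```python
def read_list(text):
    """Return the non-empty lines of an LST file's content."""
    return [line for line in text.splitlines() if line.strip()]

def cross_validation_splits(folds):
    """Yield (train, test) entity lists, using each fold once for testing."""
    for i, test in enumerate(folds):
        train = [uri for j, fold in enumerate(folds) if j != i for uri in fold]
        yield train, test

# Toy fold contents; the real ones come from fold0.lst, fold1.lst and fold2.lst.
folds = [
    read_list("http://dbpedia.org/resource/Parys\nhttp://dbpedia.org/resource/Bakar\n"),
    read_list("http://dbpedia.org/resource/Rome\n"),
    read_list("http://dbpedia.org/resource/Beijing\n"),
]
for train, test in cross_validation_splits(folds):
    print(len(train), len(test))
```

The content of a list file can be obtained with `pathlib.Path(path).read_text(encoding="utf-8")`.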

7.7 Referenced Entities

For the class Song, some reference facts refer to entities that do not exist in DBpedia, i.e. they are long-tail entities themselves. We provide these additional entities in a separate file. The file is especially useful, as it provides both the labels and the classes of those referenced entities.

For this file we again use the CSV file format, with three values. The first is the URI of the referenced entity, the second its label, and the third, its class alignment.

Example: Song/referencedEntities.csv

"http://dbpedia.org/resource/Shelley_Laine-8ee3f06c-a68a-495b-a788-5a4473e39384","Shelley Laine","MusicalArtist"
"http://dbpedia.org/resource/Skipping_Stones-b3ff0194-6a65-4835-8635-f02ba6d58e3d","Skipping Stones","Album"
"http://dbpedia.org/resource/Best_Of_1991-2001-ab96f5ab-730a-48a2-a0b9-e3275993bf07","Best Of 1991-2001","Album"
"http://dbpedia.org/resource/Terry_Steele-67a5c337-4b29-497e-9852-28ae222c7bfd","Terry Steele","Writer"
"http://dbpedia.org/resource/David_L._Elliott-b11a4b31-8e01-4c06-8847-af359577525a","David L. Elliott","Writer"
"http://dbpedia.org/resource/Anthology:_1965-1972-013145ea-382f-448e-ae7f-dfd3d959a2e0","Anthology: 1965-1972","Album"
"http://dbpedia.org/resource/Bangarang_(EP)-b3bf5bc8-454a-47b6-9b4a-8b8b1b53f728","Bangarang_(EP)","Album"
"http://dbpedia.org/resource/Old_School_New Rules-34473575-5c29-4048-8435-f96717404db7","Old School New Rules","Album"
"http://dbpedia.org/resource/Wanna-d7a7259b-da78-41cd-bcac-bcd51ff040f2","Wanna","MusicalWork"
...
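
The file can serve as a lookup table for Reference values in the Song facts. The sketch below assumes that a fact value maps to a referenced entity by adding the DBpedia resource prefix, as described for the Reference data type; the function name and the fact value are hypothetical.

```python
import csv
import io

RESOURCE_PREFIX = "http://dbpedia.org/resource/"

def read_referenced_entities(lines):
    """Return {entity URI: (label, class)} from header-less CSV lines."""
    return {row[0]: (row[1], row[2]) for row in csv.reader(lines) if row}

sample = io.StringIO(
    '"http://dbpedia.org/resource/Skipping_Stones-b3ff0194-6a65-4835-8635-f02ba6d58e3d",'
    '"Skipping Stones","Album"\n'
)
entities = read_referenced_entities(sample)

# Hypothetical fact value; Reference values are stored without the prefix.
fact_value = "Skipping_Stones-b3ff0194-6a65-4835-8635-f02ba6d58e3d"
print(entities.get(RESOURCE_PREFIX + fact_value))
```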

8. Download

You can download the dataset here: T4LTE.zip

9. Feedback

Please send questions and feedback directly to the authors (listed above) or post them in the Web Data Commons Google Group.

10. References

[Cafarella2008] Cafarella, Michael J. and Halevy, Alon Y. and Zhang, Yang and Wang, Daisy Zhe and Wu, Eugene (2008), "Uncovering the Relational Web", In WebDB '08.
[Dong2014] Dong, Xin and Gabrilovich, Evgeniy and Heitz, Geremy and Horn, Wilko and Lao, Ni and Murphy, Kevin and Strohmann, Thomas and Sun, Shaohua and Zhang, Wei (2014), "Knowledge Vault: A Web-scale Approach to Probabilistic Knowledge Fusion", In KDD '14.
[Lehmann2015] Lehmann, Jens and Isele, Robert and Jakob, Max and Jentzsch, Anja and Kontokostas, Dimitris and Mendes, Pablo N and Hellmann, Sebastian and Morsey, Mohamed and Van Kleef, Patrick and Auer, Sören and others (2015), "DBpedia - A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia", Semantic Web. Vol. 6(2), pp. 167-195. IOS Press.
[Oulabi2016] Oulabi, Yaser and Meusel, Robert and Bizer, Christian (2016), "Fusing Time-dependent Web Table Data", In WebDB '16.
[Oulabi2017] Oulabi, Yaser and Bizer, Christian (2017), "Estimating missing temporal meta-information using Knowledge-Based-Trust", In KDWeb '17.
[Oulabi2019] Oulabi, Yaser and Bizer, Christian (2019), "Extending Cross-Domain Knowledge Bases with Long Tail Entities using Web Table Data", In EDBT '19.
[Ritze2015] Ritze, Dominique and Lehmberg, Oliver and Bizer, Christian (2015), "Matching HTML Tables to DBpedia", In WIMS '15.
[Ritze2016] Ritze, Dominique and Lehmberg, Oliver and Oulabi, Yaser and Bizer, Christian (2016), "Profiling the Potential of Web Tables for Augmenting Cross-domain Knowledge Bases", In WWW '16.

Released: 15.07.19