Dataset Design, Statistics, and Download
Yaser Oulabi
Christian Bizer


This page describes the TDGT Dataset of time-dependent data, useful for evaluating tasks that make use of the time-aspect of data.

The dataset was built to evaluate fusion methods for slot filling time-dependent attributes of a knowledge base from web table data. Data fusion methods, also called Truth Discovery methods, try to select the correct value from a set of alternative input values [Bleiholder2009]. In time-dependent data, also called long data [Dong2016], the correctness of a value depends on a point in time or a time range. For any given combination of entity and property, multiple values can be correct, because each reflects a different period or point in time [Oulabi2016]. Evaluating data fusion methods for time-dependent data therefore requires knowing the correct values of a property at different points in time. The TDGT (Time-Dependent Ground-Truth) Dataset consists of high-quality data describing countries, cities, athletes, companies, politicians and officeholders. All ground-truth data is annotated with temporal meta-information and can thus be used to evaluate data fusion methods for time-dependent data.

1. Dataset Overview

This dataset provides data for a selection of time-dependent attributes in a uniform format. We built it in the context of knowledge base augmentation from web table data. Web tables, which are relational HTML tables extracted from the Web, contain large amounts of valuable temporal data [Lehmberg2016, Zhang2013]. We use this dataset to evaluate fusion methods that augment a temporal knowledge base [Oulabi2017] with new facts from web table data. Unlike snapshot-based knowledge bases such as DBpedia, which try to reflect only the most recent facts, temporal knowledge bases such as Wikidata store time-dependent data as series of timed facts. Similarly, this ground truth provides a variety of time-dependent data as series of timed facts. In previous work [Oulabi2016] we presented and evaluated a method for slot filling snapshot-based knowledge bases.

The data was consolidated from different sources and covers the following domains:
  1. Countries and Cities
  2. Athletes
  3. Companies
  4. Politicians and Officeholders

While creating this dataset, we ensured that every entity in it is matched to a comprehensive knowledge base, in our case Wikidata. In addition, we aimed for completeness of values per entity, so that one can assume that all values of a given time-dependent attribute are available for a given entity.

2. Entity Classes Statistics

The dataset contains seven entity classes. The table below gives an overview of those classes, including the number of instances, properties and values per class. It also lists the name of the JSON file that stores each class and the source of the data.

Class | JSON file name | # of Instances | # of Properties | # of Values | Source
Basketball Athlete (mostly NBA) | basketballAthlete.json | 3,781 | 1 | 10,625 | www.basketball-reference.com
City | city.json | 11,372 | 2 | 40,666 | Wikidata
Country | country.json | 197 | 7 | 44,013 | WorldBank (World Development Indicators)
NFL Athlete | nflAthlete.json | 12,756 | 2 | 96,711 | www.footballdb.com
Politician / Holder of a political office | politician.json | 22,062 | 1 | 36,816 | Wikidata
Soccer Athlete | soccerAthlete.json | 134,617 | 1 | 778,602 | Wikidata
Traded Company (NYSE and NASDAQ) | tradedCompany.json | 1,646 | 5 | 73,806 | www.stockrow.com

3. Time-dependent Properties Overview

The dataset contains 19 time-dependent attributes in total. The following table gives an overview of those attributes, including their type, datatype and number of values. It also lists the name under which each property appears in the JSON files.

Class | Property | Name in JSON file | Type | Datatype | # of Values
Basketball Athlete | Team | team | Time Range | Reference | 10,625
City | Mayor | mayor | Time Range | Reference | 1,742
City | Population | population | Point in Time | Number | 38,924
Country | Head of Government | headOfGovernment | Time Range | Reference | 1,641
Country | Head of State | headOfState | Time Range | Reference | 2,727
Country | Memberships | memberOf | Time Range | Reference | 1,540
Country | Nominal GDP | gdp | Point in Time | Number | 8,417
Country | Nominal GDP per Capita | gdpPerCapita | Point in Time | Number | 8,414
Country | Population | population | Point in Time | Number | 10,845
Country | Population Density | populationDensity | Point in Time | Number | 10,429
NFL Athlete | Sports Number | sportsNumber | Point in Time | Number | 28,379
NFL Athlete | Team | team | Point in Time | Reference | 68,332
Politician | Position Held | positionHeld | Time Range | Reference | 36,816
Soccer Athlete | Team | team | Time Range | Reference | 778,602
Traded Company | Earnings before interest and taxes | ebit | Point in Time | Number | 14,945
Traded Company | Net Income | netIncome | Point in Time | Number | 14,948
Traded Company | Total Assets | totalAssets | Point in Time | Number | 14,585
Traded Company | Total Equity | totalEquity | Point in Time | Number | 14,527
Traded Company | Total Revenue | totalRevenue | Point in Time | Number | 14,801

4. Data Format

The dataset is provided in JSON format, with every entity class stored in a separate file.

4.1. JSON File Format

Every JSON file is an array of JSON Objects, one per entity, with the following format:

[
{ENTITY_OBJECT_1},
{ENTITY_OBJECT_2},
{ENTITY_OBJECT_3},
{ENTITY_OBJECT_4},
{ENTITY_OBJECT_5},
......
{ENTITY_OBJECT_97},
{ENTITY_OBJECT_98},
{ENTITY_OBJECT_99}
]

Any {ENTITY_OBJECT_XX} is a JSON Object as described in Section 4.2 below.
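
Because each class file is a single top-level array, it can be read in one call with a standard JSON parser. The following is a minimal sketch in Python, not part of the dataset distribution. The helper name load_entities, the UTF-8 encoding and the local path "city.json" are assumptions made for illustration.

```python
import json
from pathlib import Path


def load_entities(path):
    """Read one class file and return its list of entity objects.

    Assumes the file is UTF-8 encoded (not stated on this page).
    """
    with Path(path).open(encoding="utf-8") as handle:
        entities = json.load(handle)
    # The format section describes a top-level array; fail early otherwise.
    if not isinstance(entities, list):
        raise ValueError(f"{path}: expected a top-level JSON array")
    return entities


if __name__ == "__main__":
    # Hypothetical location of a downloaded class file.
    cities = load_entities("city.json")
    print(len(cities), "entities loaded")
```

The whole file is held in memory at once, which may matter for the larger classes such as soccerAthlete.json.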

4.2. Entity Object JSON Format

Entity Objects have the following format:

{
"name": "{ENTITY_NAME}",
"wikidataId": "{WIKIDATA_ID}",
"values": [
{VALUE_OBJECT_1},
{VALUE_OBJECT_2},
{VALUE_OBJECT_3},
{VALUE_OBJECT_4},
{VALUE_OBJECT_5},
......
{VALUE_OBJECT_97},
{VALUE_OBJECT_98},
{VALUE_OBJECT_99}
]
}

{ENTITY_NAME} and {WIKIDATA_ID} are string values. {ENTITY_NAME} is a label of the entity, while {WIKIDATA_ID} is the ID of the entity in Wikidata.

Any {VALUE_OBJECT_XX} is a JSON Object as described in Section 4.3 below.
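
An entity keeps the values of all its properties in one flat "values" array, so a consumer usually has to group them by property first. The sketch below shows one way to do that. The helper name values_by_property is hypothetical, and the example entity is a shortened excerpt of the sample in Section 5.

```python
from collections import defaultdict


def values_by_property(entity):
    """Group an entity's value objects by their propertyName key."""
    grouped = defaultdict(list)
    for value in entity.get("values", []):
        grouped[value["propertyName"]].append(value)
    return dict(grouped)


# Shortened excerpt of the sample entity object from Section 5.
mannheim = {
    "name": "Mannheim",
    "wikidataId": "Q2119",
    "values": [
        {"from": "2007", "propertyName": "mayor", "type": "range",
         "value": {"wikidataId": "Q2076493", "label": "Peter Kurz",
                   "type": "reference"}},
        {"point": "2013-12-31", "propertyName": "population", "type": "point",
         "value": {"amount": "296690", "type": "number"}},
        {"point": "1961", "propertyName": "population", "type": "point",
         "value": {"amount": "313890", "type": "number"}},
    ],
}

grouped = values_by_property(mannheim)
for name, values in grouped.items():
    print(name, len(values))
```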

4.3. Value Object JSON Format

The JSON format of the value object has two variations. Both variations have a string property type, which determines the variation, and a string property propertyName ({PROPERTY_NAME}), which is the name of the property as shown in the table in Section 3 above.

4.3.1. Point-in-Time value

A point-in-time value is valid at a single point in time. Its type property has the value "point".

{
"propertyName": "{PROPERTY_NAME}",
"point": "{POINT_DATE}",
"type": "point",
"value": {VALUE_TYPE_OBJECT}
}

{POINT_DATE} is a string value and needs to be parsed. It has one of the following formats: yyyy-mm-dd or yyyy.
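
Since the date string comes in two granularities, a parser has to decide how to represent a year-only value. The sketch below is one possible approach, not the authors' method: the hypothetical helper parse_point_date returns a date together with a precision label, and mapping a bare year to January 1 of that year is an assumption made only so that both formats yield the same type.

```python
from datetime import date


def parse_point_date(text):
    """Parse 'yyyy-mm-dd' or 'yyyy' into a (date, precision) pair.

    A year-only string is mapped to January 1 of that year (an assumption);
    the precision label records that only the year is actually known.
    """
    if len(text) == 4 and text.isdigit():
        return date(int(text), 1, 1), "year"
    # date.fromisoformat raises ValueError for strings it cannot parse.
    return date.fromisoformat(text), "day"


print(parse_point_date("2013-12-31"))
print(parse_point_date("1961"))
```

Keeping the precision next to the date lets later comparisons treat "1961" as a whole year instead of a specific day.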

4.3.2. Time-Range value

A time-range value is valid for a given time range. Its type property has the value "range".

{
"propertyName": "{PROPERTY_NAME}",
"from": "{FROM_DATE}",
"to": "{TO_DATE}",
"type": "range",
"value": {VALUE_TYPE_OBJECT}
}

{FROM_DATE} and {TO_DATE} are both string values and need to be parsed. They can both be of the following formats: yyyy-mm-dd or yyyy. The to property is optional.
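
A typical use of time-range values is to ask which value held in a given year. The following sketch compares at year granularity, which works for both date formats because each starts with a four-digit year. It rests on two assumptions that this page does not state: a missing to property is read as an open-ended range, and both bounds are treated as inclusive. The helper names and the mayor list (taken from the sample in Section 5) are for illustration only.

```python
def year_of(text):
    """Return the year of a 'yyyy' or 'yyyy-mm-dd' string."""
    return int(text[:4])


def range_covers_year(value, year):
    """Check whether a time-range value object covers the given year.

    Assumptions: a missing "to" means the range is still open, and both
    bounds are inclusive.
    """
    if value.get("type") != "range":
        raise ValueError("expected a value object of type 'range'")
    start = year_of(value["from"])
    end = year_of(value["to"]) if "to" in value else None
    return start <= year and (end is None or year <= end)


def labels_in_year(values, year):
    """Return the labels of all range values that cover the year."""
    return [v["value"]["label"] for v in values if range_covers_year(v, year)]


# Mayor values from the sample entity object in Section 5.
mayors = [
    {"from": "2007", "propertyName": "mayor", "type": "range",
     "value": {"wikidataId": "Q2076493", "label": "Peter Kurz",
               "type": "reference"}},
    {"from": "1983", "to": "2007", "propertyName": "mayor", "type": "range",
     "value": {"wikidataId": "Q1512753", "label": "Gerhard Widder",
               "type": "reference"}},
    {"from": "1980", "to": "1983", "propertyName": "mayor", "type": "range",
     "value": {"wikidataId": "Q2575485", "label": "Wilhelm Varnholt",
               "type": "reference"}},
]

print(labels_in_year(mayors, 1990))  # ['Gerhard Widder']
print(labels_in_year(mayors, 2007))  # ['Peter Kurz', 'Gerhard Widder']
```

With inclusive bounds, a handover year such as 2007 matches both the outgoing and the incoming mayor, which fits the observation that several values can be correct for the same period.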

4.4. Value Type Object

The value type objects are JSON Objects that describe the actual value. They include a string property type that describes the type of the value.

4.4.1. Reference

For the reference type, the value of the type property is "reference". The reference type has two additional properties: label, which provides a name for the referenced entity, and wikidataId, which provides the ID of the referenced entity in the Wikidata knowledge base. Both properties are string values.

{
"wikidataId": "{WIKIDATA_ID}",
"label": "{LABEL}",
"type": "reference"
}
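
When a candidate value is compared against a ground-truth reference, the Wikidata ID is a more stable key than the label, since labels can differ in spelling or language. The sketch below shows such a comparison. The helper name same_reference is hypothetical, and matching on wikidataId alone is a design choice for this illustration rather than a rule given on this page.

```python
def same_reference(first, second):
    """Return True if two reference objects point to the same Wikidata entity."""
    for ref in (first, second):
        if ref.get("type") != "reference":
            raise ValueError("expected a value type object of type 'reference'")
    # Labels are ignored on purpose; only the Wikidata ID is compared.
    return first["wikidataId"] == second["wikidataId"]


truth = {"wikidataId": "Q2076493", "label": "Peter Kurz", "type": "reference"}
# Hypothetical candidate with a differently spelled label.
candidate = {"wikidataId": "Q2076493", "label": "Dr. Peter Kurz",
             "type": "reference"}
other = {"wikidataId": "Q1512753", "label": "Gerhard Widder",
         "type": "reference"}

print(same_reference(truth, candidate))  # True
print(same_reference(truth, other))      # False
```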

4.4.2. Number

For the number type, the value of the type property is "number". There is an additional property amount, which holds the actual number. The amount property is of type string and needs to be parsed.

{
"amount": "{AMOUNT}",
"type": "number"
}
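
Because amount is a string, it has to be converted before any arithmetic or comparison. The sketch below uses Python's decimal module so that integer values such as populations and fractional values are handled without floating-point rounding. The helper name parse_amount is hypothetical, and the assumption that every amount is written in plain decimal notation has not been checked against the full dataset.

```python
from decimal import Decimal, InvalidOperation


def parse_amount(value_type_object):
    """Convert the string amount of a number object into a Decimal."""
    if value_type_object.get("type") != "number":
        raise ValueError("expected a value type object of type 'number'")
    try:
        return Decimal(value_type_object["amount"])
    except InvalidOperation as exc:
        # Re-raise as ValueError so callers need to handle one error type.
        raise ValueError(
            f"cannot parse amount {value_type_object['amount']!r}"
        ) from exc


population = parse_amount({"amount": "296690", "type": "number"})
print(population)  # 296690
```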

5. Sample Entity Object

The sample entity object shown here is also available for download in Section 6 below.

{
"name":"Mannheim",
"wikidataId":"Q2119",
"values":[
{
"from":"2007",
"propertyName":"mayor",
"type":"range",
"value":{
"wikidataId":"Q2076493",
"label":"Peter Kurz",
"type":"reference"
}
},
{
"from":"1983",
"to":"2007",
"propertyName":"mayor",
"type":"range",
"value":{
"wikidataId":"Q1512753",
"label":"Gerhard Widder",
"type":"reference"
}
},
{
"from":"1980",
"to":"1983",
"propertyName":"mayor",
"type":"range",
"value":{
"wikidataId":"Q2575485",
"label":"Wilhelm Varnholt",
"type":"reference"
}
},
{
"point":"2013-12-31",
"propertyName":"population",
"type":"point",
"value":{
"amount":"296690",
"type":"number"
}
},
{
"point":"2012-12-31",
"propertyName":"population",
"type":"point",
"value":{
"amount":"294627",
"type":"number"
}
},
{
"point":"1961",
"propertyName":"population",
"type":"point",
"value":{
"amount":"313890",
"type":"number"
}
},
{
"point":"1962",
"propertyName":"population",
"type":"point",
"value":{
"amount":"318919",
"type":"number"
}
},
{
"point":"1963",
"propertyName":"population",
"type":"point",
"value":{
"amount":"321075",
"type":"number"
}
},
{
"point":"1964",
"propertyName":"population",
"type":"point",
"value":{
"amount":"323444",
"type":"number"
}
},
{
"point":"1965",
"propertyName":"population",
"type":"point",
"value":{
"amount":"328156",
"type":"number"
}
},
{
"point":"1966",
"propertyName":"population",
"type":"point",
"value":{
"amount":"329301",
"type":"number"
}
},
{
"point":"1967",
"propertyName":"population",
"type":"point",
"value":{
"amount":"323744",
"type":"number"
}
},
{
"point":"1968",
"propertyName":"population",
"type":"point",
"value":{
"amount":"326302",
"type":"number"
}
},
{
"point":"1969",
"propertyName":"population",
"type":"point",
"value":{
"amount":"330920",
"type":"number"
}
},
{
"point":"1970",
"propertyName":"population",
"type":"point",
"value":{
"amount":"332163",
"type":"number"
}
},
{
"point":"1971",
"propertyName":"population",
"type":"point",
"value":{
"amount":"330635",
"type":"number"
}
},
{
"point":"1972",
"propertyName":"population",
"type":"point",
"value":{
"amount":"328411",
"type":"number"
}
},
{
"point":"1973",
"propertyName":"population",
"type":"point",
"value":{
"amount":"325386",
"type":"number"
}
},
{
"point":"1974",
"propertyName":"population",
"type":"point",
"value":{
"amount":"320508",
"type":"number"
}
},
{
"point":"1975",
"propertyName":"population",
"type":"point",
"value":{
"amount":"314086",
"type":"number"
}
},
{
"point":"1976",
"propertyName":"population",
"type":"point",
"value":{
"amount":"309059",
"type":"number"
}
},
{
"point":"1977",
"propertyName":"population",
"type":"point",
"value":{
"amount":"305741",
"type":"number"
}
},
{
"point":"1978",
"propertyName":"population",
"type":"point",
"value":{
"amount":"302794",
"type":"number"
}
},
{
"point":"1979",
"propertyName":"population",
"type":"point",
"value":{
"amount":"303247",
"type":"number"
}
},
{
"point":"1980",
"propertyName":"population",
"type":"point",
"value":{
"amount":"304303",
"type":"number"
}
},
{
"point":"1981",
"propertyName":"population",
"type":"point",
"value":{
"amount":"304219",
"type":"number"
}
},
{
"point":"1982",
"propertyName":"population",
"type":"point",
"value":{
"amount":"302621",
"type":"number"
}
},
{
"point":"1983",
"propertyName":"population",
"type":"point",
"value":{
"amount":"298042",
"type":"number"
}
},
{
"point":"1984",
"propertyName":"population",
"type":"point",
"value":{
"amount":"295178",
"type":"number"
}
},
{
"point":"1985",
"propertyName":"population",
"type":"point",
"value":{
"amount":"294984",
"type":"number"
}
},
{
"point":"1986",
"propertyName":"population",
"type":"point",
"value":{
"amount":"294648",
"type":"number"
}
},
{
"point":"1987",
"propertyName":"population",
"type":"point",
"value":{
"amount":"295191",
"type":"number"
}
},
{
"point":"1988",
"propertyName":"population",
"type":"point",
"value":{
"amount":"300468",
"type":"number"
}
},
{
"point":"1989",
"propertyName":"population",
"type":"point",
"value":{
"amount":"305974",
"type":"number"
}
},
{
"point":"1990",
"propertyName":"population",
"type":"point",
"value":{
"amount":"310411",
"type":"number"
}
},
{
"point":"1991",
"propertyName":"population",
"type":"point",
"value":{
"amount":"314685",
"type":"number"
}
},
{
"point":"1992",
"propertyName":"population",
"type":"point",
"value":{
"amount":"318446",
"type":"number"
}
},
{
"point":"1993",
"propertyName":"population",
"type":"point",
"value":{
"amount":"318025",
"type":"number"
}
},
{
"point":"1994",
"propertyName":"population",
"type":"point",
"value":{
"amount":"316223",
"type":"number"
}
},
{
"point":"1995",
"propertyName":"population",
"type":"point",
"value":{
"amount":"311292",
"type":"number"
}
},
{
"point":"1996",
"propertyName":"population",
"type":"point",
"value":{
"amount":"312216",
"type":"number"
}
},
{
"point":"1997",
"propertyName":"population",
"type":"point",
"value":{
"amount":"310475",
"type":"number"
}
},
{
"point":"1998",
"propertyName":"population",
"type":"point",
"value":{
"amount":"308903",
"type":"number"
}
},
{
"point":"1999",
"propertyName":"population",
"type":"point",
"value":{
"amount":"307730",
"type":"number"
}
},
{
"point":"2000",
"propertyName":"population",
"type":"point",
"value":{
"amount":"306729",
"type":"number"
}
},
{
"point":"2001",
"propertyName":"population",
"type":"point",
"value":{
"amount":"308385",
"type":"number"
}
},
{
"point":"2002",
"propertyName":"population",
"type":"point",
"value":{
"amount":"308759",
"type":"number"
}
},
{
"point":"2003",
"propertyName":"population",
"type":"point",
"value":{
"amount":"308353",
"type":"number"
}
},
{
"point":"2004",
"propertyName":"population",
"type":"point",
"value":{
"amount":"307499",
"type":"number"
}
},
{
"point":"2005",
"propertyName":"population",
"type":"point",
"value":{
"amount":"307900",
"type":"number"
}
},
{
"point":"2006",
"propertyName":"population",
"type":"point",
"value":{
"amount":"307914",
"type":"number"
}
},
{
"point":"2007",
"propertyName":"population",
"type":"point",
"value":{
"amount":"309795",
"type":"number"
}
},
{
"point":"2008",
"propertyName":"population",
"type":"point",
"value":{
"amount":"311342",
"type":"number"
}
},
{
"point":"2009",
"propertyName":"population",
"type":"point",
"value":{
"amount":"311969",
"type":"number"
}
},
{
"point":"2010",
"propertyName":"population",
"type":"point",
"value":{
"amount":"313174",
"type":"number"
}
},
{
"point":"2011",
"propertyName":"population",
"type":"point",
"value":{
"amount":"291458",
"type":"number"
}
},
{
"point":"2014",
"propertyName":"population",
"type":"point",
"value":{
"amount":"299844",
"type":"number"
}
}
]
}
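
Note that the values in the sample are not in chronological order and mix both date formats. The sketch below builds a sorted population series from such an entity. It sorts on the raw date strings, which is chronological only under the assumption that all years have four digits. The helper name point_series is hypothetical, and the entity is a four-value excerpt of the sample above.

```python
from decimal import Decimal


def point_series(entity, property_name):
    """Return (date string, amount) pairs of a numeric point-in-time
    property, sorted by date string."""
    points = [
        (value["point"], Decimal(value["value"]["amount"]))
        for value in entity["values"]
        if value["type"] == "point" and value["propertyName"] == property_name
    ]
    # 'yyyy' and 'yyyy-mm-dd' strings sort chronologically as plain text
    # as long as every year has four digits (assumption).
    return sorted(points, key=lambda pair: pair[0])


# Four population values taken from the sample entity object above.
sample = {
    "name": "Mannheim",
    "wikidataId": "Q2119",
    "values": [
        {"point": "2013-12-31", "propertyName": "population", "type": "point",
         "value": {"amount": "296690", "type": "number"}},
        {"point": "2012-12-31", "propertyName": "population", "type": "point",
         "value": {"amount": "294627", "type": "number"}},
        {"point": "1961", "propertyName": "population", "type": "point",
         "value": {"amount": "313890", "type": "number"}},
        {"point": "2014", "propertyName": "population", "type": "point",
         "value": {"amount": "299844", "type": "number"}},
    ],
}

series = point_series(sample, "population")
for point, amount in series:
    print(point, amount)
```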

6. Download

You can download the dataset here:

  1. Download full dataset
  2. Download sample entity object

7. Feedback

Please send questions and feedback directly to the authors (listed above) or post them in the Web Data Commons Google Group.

8. References

  1. [Bleiholder2009] Jens Bleiholder and Felix Naumann. 2009. Data Fusion. ACM Computing Surveys 41 (January 2009), 1:1-1:41.
  2. [Dong2016] Xin Luna Dong, Anastasios Kementsietsidis, and Wang-Chiew Tan. 2016. A Time Machine for Information: Looking Back to Look Forward. SIGMOD Rec. 45, 2 (September 2016).
  3. [Oulabi2016] Yaser Oulabi, Robert Meusel, and Christian Bizer. 2016. Fusing time-dependent web table data. In Proceedings of the 19th International Workshop on Web and Databases (WebDB '16). ACM, New York, NY, USA, Article 3, 7 pages.
  4. [Oulabi2017] Yaser Oulabi and Christian Bizer. 2017. Estimating Missing Temporal Meta-Information using Knowledge-Based-Trust. In Proceedings of the 3rd International Workshop on Knowledge Discovery on the WEB (KDWeb '16). CEUR Workshop Proceedings, RWTH: Aachen.
  5. [Lehmberg2016] Oliver Lehmberg, Dominique Ritze, Robert Meusel, and Christian Bizer. 2016. A Large Public Corpus of Web Tables containing Time and Context Metadata. In Proceedings of the 25th International Conference Companion on World Wide Web (WWW '16 Companion). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 75-76.
  6. [Zhang2013] Meihui Zhang and Kaushik Chakrabarti. 2013. InfoGather+: semantic matching and annotation of numeric and time-varying attributes in web tables. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (SIGMOD '13). ACM, New York, NY, USA, 145-156.

Released: 15.07.19