WDC Web Table Corpus 2012 - Statistics about Relational Subset

Petar Ristoski
Oliver Lehmberg
Heiko Paulheim
Robert Meusel
Christian Bizer
Alexander Diete
Nicolas Heist
Sascha Krstanovic
Thorsten Andre Knöller


This page provides basic statistics describing the relational subset of the WDC Web Tables Corpus 2012. The subset consists of 147 million relational tables. In relational tables, a set of entities is described with one or more attributes. In addition to this subset, we offer statistics about a subset consisting of only English-language relational tables. All tables are publicly available for download.

1. TLDs Distribution

Figure 1 shows the distribution of extracted Web tables per top-level domain.

Fig. 1 - Number of tables per TLD

The complete distribution of tables per top-level domain can be found here. Each line of the file contains two tab-separated fields, TLD and #tables. E.g. the first entry of the file, com 75229798, means that 75229798 tables were extracted from the "com" domain.
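The following minimal sketch shows one way to read such a two-field distribution file. The function name parse_distribution and the in-memory sample are illustrative assumptions; only the "com" entry is taken from this page, and the "example" entry is a made-up placeholder, not real data.

```python
from typing import Dict, Iterable


def parse_distribution(lines: Iterable[str]) -> Dict[str, int]:
    """Parse 'key<TAB>count' lines into a dict, skipping blank lines."""
    counts: Dict[str, int] = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue
        # Split on the last tab so that keys containing tabs stay intact.
        key, count = line.rsplit("\t", 1)
        counts[key] = int(count)
    return counts


# "com" is the entry quoted on this page; "example" is a hypothetical placeholder.
sample_lines = ["com\t75229798\n", "example\t1000\n"]
tld_counts = parse_distribution(sample_lines)
com_share = tld_counts["com"] / sum(tld_counts.values())
```

The same function can be applied to the other distribution files described on this page, since they share the two-field layout. To read a downloaded file, pass an open file object instead of the sample list.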

2. Number of Columns and Rows Distribution

The table below provides basic statistics on the size of the tables in the complete corpus. The row count excludes the header row and thus refers only to the data rows of a table.

         min.  max.    average  median
columns  2     2,368   3.49     3
rows     1     70,068  12.41    6

2.1 Number of Columns Distribution

Figure 2 shows the distribution of the number of columns per table.

Fig. 2 - Distribution of Number of Columns per Table


The complete distribution of the number of columns per table can be found here. Each line of the file contains two tab-separated fields, #columns and #tables. E.g. the first entry of the file, 2 70147349, means that there are 70147349 tables that have exactly two columns.

2.2 Number of Rows Distribution

Figure 3 shows the distribution of the number of data rows per table.
Data rows are all rows of the table that are positioned under the header row and contain at least one non-empty cell.

Fig. 3 - Distribution of Number of Rows per Table

The complete distribution of the number of rows per table can be found here. Each line of the file contains two tab-separated fields, #rows and #tables. E.g. the first entry of the file, 1 426104, means that there are 426104 tables that have exactly one data row.
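Summary figures such as the average and median table size can be recomputed from a distribution of this kind. The sketch below assumes the distribution has already been loaded into a dict that maps a size (e.g. #rows) to the number of tables of that size. The function name and the toy numbers are illustrative and are not taken from the corpus.

```python
from typing import Dict, Tuple


def mean_and_median(distribution: Dict[int, int]) -> Tuple[float, float]:
    """Return (mean, median) of sizes, weighting each size by its table count."""
    total = sum(distribution.values())
    if total == 0:
        raise ValueError("distribution is empty")
    mean = sum(size * count for size, count in distribution.items()) / total

    def size_at(rank: int) -> int:
        # Return the size of the table at the given 1-based rank
        # when all tables are sorted by size.
        cumulative = 0
        for size in sorted(distribution):
            cumulative += distribution[size]
            if cumulative >= rank:
                return size
        raise ValueError("rank exceeds number of tables")

    # For an even total, average the two middle tables.
    median = (size_at((total + 1) // 2) + size_at((total + 2) // 2)) / 2
    return mean, median


# Toy distribution: two tables with 1 row, one with 2 rows, one with 10 rows.
toy = {1: 2, 2: 1, 10: 1}
toy_mean, toy_median = mean_and_median(toy)
```

Working on the (size, count) pairs avoids expanding the distribution into one entry per table, which matters for a corpus of this size.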

3. Headers Distribution

In order to get a first impression of the topics of the tables, we applied a simple heuristic for identifying the column headers of each Web table. The heuristic assumes that the column headers are in the first row of the Web table whose number of non-empty cells is at least 80% of the number of cells in the row with the highest number of non-empty cells. The heuristic will fail on vertical tables [Crestan2011], on tables that require more sophisticated header unfolding [Chen2013], as well as on tables that do not have headers (20% of all tables according to [Pimplikar2012]). We also did not take column name synonyms like 'population' and 'number of inhabitants' into account. Thus, the numbers presented below should be understood as lower bounds.
With the current approach, we were able to identify a total of 509,351,189 column headers, of which 28,072,596 are distinct.
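The header heuristic described above can be sketched as follows. This is an illustration under stated assumptions, not the code used for the extraction: a table is assumed to be a list of rows of cell strings, the reference value for the 80% threshold is read as the number of non-empty cells in the fullest row, and the function name find_header_row is hypothetical.

```python
from typing import List, Optional


def find_header_row(table: List[List[str]], threshold: float = 0.8) -> Optional[int]:
    """Return the index of the first sufficiently filled row, or None."""
    # Count the non-empty cells of every row.
    non_empty = [sum(1 for cell in row if cell.strip()) for row in table]
    if not non_empty or max(non_empty) == 0:
        return None
    # Assumption: the threshold is relative to the fullest row's non-empty cells.
    required = threshold * max(non_empty)
    for index, count in enumerate(non_empty):
        if count >= required:
            return index
    return None


# Toy table: a sparse title row followed by the header row and one data row.
toy_table = [
    ["Cities", "", ""],
    ["name", "country", "population"],
    ["berlin", "germany", "3500000"],
]
header_index = find_header_row(toy_table)
```

In the toy table the first row has only one non-empty cell out of a maximum of three, so it falls below the threshold and the second row is chosen. As noted above, such a rule cannot handle vertical tables or tables without headers.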

Figure 4 shows the number of tables in the corpus that contain some popular column headers.

Fig. 4 - Popular Column Headers

The complete distribution of headers can be found here. Each line of the file contains two tab-separated fields, header and #tables. E.g. the first entry of the file, name 4653155, means that there are 4653155 tables that contain a column with the header name.

To get a better understanding of which topics are covered in the corpus, we performed a rough matching to the cross-domain knowledge base DBpedia, which is a structured data version of a subset of Wikipedia. We scanned the tables for properties used in DBpedia which are also used as table headers in our dataset. The complete list can be found here. Each line of the file contains two tab-separated fields, DBpediaProperty and #tables. E.g. the entry title 2121028 means that there are 2121028 tables that contain a column with the header title.

4. Label Distribution

Most applications working with Web tables assume that the tables are entity-attribute tables and that they contain a string column that provides the name of the described entity (label column). To get an initial insight into the entity coverage of the corpus, we determined the label column of the tables using a simple heuristic and counted value occurrences in the label column of all Web tables. Our heuristic assumes the left-most column that is not a number or a date and has almost unique values to be the label column. [Venetis2010] report an accuracy of 83% using a similar simple heuristic.
Before counting, all values are normalized and stop words are removed. E.g. the music album name The Dark Side of the Moon will be normalized to dark side moon. While counting the value occurrences, we do not take surface form synonyms into account (like 'New York' and 'New York City'). Thus, the reported numbers should be understood as lower bounds. In the corpus of Web tables, we were able to identify a total of 1,742,015,870 label column values, of which 253,001,795 are distinct.
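A rough sketch of the label column heuristic and the normalization step is given below. Several details are assumptions made for illustration only: the stop-word list, the regular expressions for numbers and dates, the 0.9 uniqueness threshold standing in for "almost unique", and the function names. A table is assumed to be given as a list of columns, each a list of cell strings without the header.

```python
import re
from typing import List, Optional

# Assumed stop-word list; the list actually used is not given on this page.
STOP_WORDS = {"a", "an", "and", "in", "it", "of", "on", "the"}


def normalize_label(value: str) -> str:
    """Lowercase, keep alphanumeric tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", value.lower())
    return " ".join(token for token in tokens if token not in STOP_WORDS)


def looks_numeric_or_date(value: str) -> bool:
    """Very rough check for numbers and simple date patterns (assumed formats)."""
    text = value.strip()
    if re.fullmatch(r"[-+]?\d[\d,.]*", text):
        return True
    return bool(re.fullmatch(r"\d{1,4}[-/.]\d{1,2}[-/.]\d{1,4}", text))


def find_label_column(columns: List[List[str]], min_unique: float = 0.9) -> Optional[int]:
    """Return the left-most non-numeric, non-date column with almost unique values."""
    for index, column in enumerate(columns):
        values = [value for value in column if value.strip()]
        if not values:
            continue
        numeric_like = sum(1 for value in values if looks_numeric_or_date(value))
        if numeric_like > len(values) / 2:
            continue  # mostly numbers or dates, so not a label column
        distinct = {normalize_label(value) for value in values}
        if len(distinct) / len(values) >= min_unique:
            return index
    return None


toy_columns = [
    ["1", "2", "3"],
    ["The Dark Side of the Moon", "Abbey Road", "Rubber Soul"],
    ["1973", "1969", "1965"],
]
label_index = find_label_column(toy_columns)
normalized = normalize_label("The Dark Side of the Moon")
```

In the toy table the first column is skipped because it is numeric, and the second column is selected because its values are strings and all distinct.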

Table 1 shows the value coverage for different topics.

Countries | Cities | Rivers | Movies | Camera Models | Music Albums | Footballers
Name #Tables | Name #Tables | Name #Tables | Name #Tables | Name #Tables | Name #Tables | Name #Tables
usa 135688 | new york 59398 | mississippi 87367 | avatar 11080 | nikon d 200 1390 | thriller 4268 | robin van persie 7439
germany 91170 | luxembourg 47722 | lena 8717 | inception 8121 | canon eos 20 d 480 | aftermath 2466 | david beckham 3041
japan 76512 | berlin 46850 | don 6504 | taxi 6292 | canon eos 40 d 355 | twist shout 2017 | cristiano ronaldo 2927
united states 73169 | london 37541 | mackenzie 3346 | titanic 4270 | nikon d 5000 351 | true blue 1737 | lionel messi 1748
italy 71129 | amsterdam 31548 | yangtze 2241 | fantastic four 2113 | canon eos 30 d 346 | like prayer 1616 | ronaldo 1716
austria 56622 | madrid 30486 | oka 1708 | moulin rouge 1616 | nikon d 80 339 | like virgin 1414 | gareth bale 1708
netherlands 56533 | andorra 21075 | loire 1096 | black knight 1298 | canon eos 50 d 304 | yellow submarine 1405 | fernando torres 1641
mexico 55267 | dublin 19790 | tigris 946 | deception 1286 | nikon d 90 274 | dark side moon 1201 | frank lampard 1461
belgium 53175 | athens 12228 | volga 904 | minority report 1201 | canon eos 10 d 248 | abbey road 971 | thierry henry 1332
ireland 48543 | budapest 9702 | sava 873 | ice age 1201 | nikon d 60 233 | something new 919 | ronaldinho 1195
denmark 48389 | helsinki 7761 | volta 710 | unfaithful 1179 | nikon d 100 191 | please please me 886 | roberto carlos 817
finland 45156 | bern 5839 | vardar 595 | glitter 943 | canon eos d 30 172 | shine light 833 | xabi alonso 735
greece 42314 | new york city 5611 | kama 582 | joy ride 674 | sony cybershot dsc w120 104 | some girls 801 | oliver kahn 710
russia 41729 | brussels 5305 | tisa 552 | from hell 520 | canon eos d 60 93 | sticky fingers 740 | sergio ramos 647
hungary 38536 | copenhagen 4949 | ural 437 | just married 459 | sony cybershot dsc s3000 67 | one day your life 711 | paolo maldini 638
malta 37009 | bratislava 4938 | indus 420 | shallow hal 265 | sony cybershot dsc w520 64 | exciter 543 | zinedine zidane 517
bulgaria 36523 | belgrade 4460 | elbe 382 | highn crimes 247 | sony cybershot dsc w510 62 | let bleed 492 | fabio cannavaro 348
croatia 29022 | lisbon 4194 | danube 365 | monkeybone 228 | olympus e 500 53 | rubber soul 464 | rivaldo 331
egypt 27725 | kiev 2406 | rhine 352 | like mike 175 | sony cybershot dsc w570 45 | blood dance floor 382 | roberto baggio 251
cyprus 25828 | bucharest 2180 | seine 225 | joe somebody 160 | olympus e 30 38 | black celebration 338 | marco van basten 243

Table 1 - Value Coverage

5. Column Data Types Distribution

We used a rough type-guessing algorithm to detect the data type of each table column. First, the data type of each column value was detected, using five pre-defined data types: string, numeric, date, boolean and list. Afterwards, the most frequent data type in the column was chosen as the final data type of the column.
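The two-step procedure (guess a type per value, then take the majority per column) can be sketched as follows. The per-value rules here are assumptions chosen for illustration, including the accepted boolean words, the single ISO-style date pattern, and the treatment of semicolon- or pipe-separated values as lists. The rules actually used are not specified on this page.

```python
import re
from collections import Counter
from typing import List


def guess_value_type(value: str) -> str:
    """Guess one of: string, numeric, date, boolean, list (assumed rules)."""
    text = value.strip()
    if text.lower() in {"true", "false", "yes", "no"}:
        return "boolean"
    if re.fullmatch(r"[-+]?\d+([.,]\d+)*", text):
        return "numeric"
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", text):
        return "date"
    # Assumption: a list is two or more non-empty items separated by ';' or '|'.
    items = [item for item in re.split(r"[;|]", text) if item.strip()]
    if len(items) >= 2:
        return "list"
    return "string"


def guess_column_type(values: List[str]) -> str:
    """Return the most frequent value type in the column."""
    types = [guess_value_type(value) for value in values if value.strip()]
    if not types:
        return "string"  # assumed fallback for empty columns
    # On ties, Counter.most_common keeps the type that was seen first.
    return Counter(types).most_common(1)[0][0]


toy_column = ["12", "7", "n/a", "3"]
column_type = guess_column_type(toy_column)
```

The majority vote makes the result tolerant of a few irregular cells, such as the "n/a" placeholder in the toy column, which does not prevent the column from being classified as numeric.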

Figure 5 shows the distribution of column data types.

Fig. 5 - Column Data Types Distribution

6. Feedback

Please send questions and feedback to the Web Data Commons mailing list or post them in our Web Data Commons Google Group.

More information about Web Data Commons can be found here.

7. Credits

The extraction of the Web Table Corpus was supported by the German Research Foundation (DFG) under grant number PA 2373/1-1 (Mine@LOD), an Amazon Web Services in Education Grant award and by the EU FP7 research project PlanetData.
