Dominique Ritze
Oliver Lehmberg
Robert Meusel
Christian Bizer
Sanikumar Zope

On this page, we provide the results of some initial profiling of entity tables from the WDC Web Table Corpus 2015. An entity table usually describes exactly one entity with several attributes; the name of the entity itself is not contained in the table but can be inferred from the context. Of all 233 million extracted tables, 139,687,207 are entity tables. In addition, we offer statistics about a subset consisting of only relational tables and a subset containing English tables. All tables are publicly available for download.

1. TLD Distribution

Figure 1 shows the distribution of extracted Web tables per top-level domain.

Fig. 1 - Number of tables per TLD

The complete distribution of tables per top-level domain can be found here. The file contains a list of key value pairs, with the TLD as key and the number of tables as value. For example, the entry Key : .com Value : 62249515 means that 62,249,515 tables were extracted from the "com" domain.
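A minimal sketch of how such a distribution file could be read, assuming every line follows the "Key : <key> Value : <count>" layout of the example entry above. The function name parse_distribution and the regular expression are illustrative and not part of the corpus tooling.

```python
import re

# Assumed line layout, taken from the example entry "Key : .com Value : 62249515".
LINE_PATTERN = re.compile(r"^Key\s*:\s*(?P<key>.*?)\s*Value\s*:\s*(?P<value>\d+)\s*$")


def parse_distribution(lines):
    """Parse 'Key : <key> Value : <count>' lines into a dict of counts."""
    counts = {}
    for line in lines:
        match = LINE_PATTERN.match(line.strip())
        if match is None:
            continue  # skip blank lines or lines in an unexpected layout
        counts[match.group("key")] = int(match.group("value"))
    return counts


sample = ["Key : .com Value : 62249515"]
counts = parse_distribution(sample)
print(f"{counts['.com']:,}")  # prints 62,249,515
```

The same parser should apply to the other distribution files described on this page, as long as they share this layout.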

2. PLD Distribution

Figure 2 shows the distribution of extracted Web tables per pay-level domain.

Fig. 2 - Number of tables per PLD

Altogether, 616,946 different PLDs are represented among the roughly 140 million entity tables. The complete distribution of tables per pay-level domain can be found here. Again, the file contains key value pairs where the key represents the PLD and the value the number of tables from that PLD.

3. Table Sizes and Distribution

Table 1 shows the overall number of extracted entity tables, divided into horizontal and vertical tables. In a horizontal table, the entities are represented in rows and the attributes in columns. If instead the entities are arranged in columns and the attributes in rows, we speak of a vertical table. A vertical table can be converted into a horizontal table by simply flipping (transposing) it.
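Flipping a vertical table amounts to a transpose. The sketch below assumes a table is held as a list of equally long rows; the function name flip_table and the sample cells are made up for illustration.

```python
def flip_table(rows):
    """Transpose a table given as a list of rows.

    zip() stops at the shortest row, so ragged rows would be truncated;
    this sketch assumes all rows have the same length.
    """
    return [list(column) for column in zip(*rows)]


# Hypothetical vertical table: attributes in rows, the entity in a column.
vertical = [
    ["Country", "Germany"],
    ["State", "Baden-Württemberg"],
]
horizontal = flip_table(vertical)
print(horizontal)
```

After flipping, the first row holds the attribute names and each following row describes one entity.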

            #tables
horizontal  76,699,222
vertical    62,987,985
sum         139,687,207
Table 1: Number of horizontal and vertical entity tables

Table 2 provides basic statistics on table size. The row counts exclude the header row (if present) and thus refer to data rows only. Data rows are all rows of the table that are positioned below the header row and contain at least one non-empty cell.
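The definition of a data row can be expressed in a few lines. This is a sketch under the assumption that a table is a list of rows of strings and that a cell containing only whitespace counts as empty; count_data_rows and its has_header flag are illustrative names.

```python
def count_data_rows(rows, has_header=True):
    """Count rows below the header that contain at least one non-empty cell."""
    body = rows[1:] if has_header else rows
    # Assumption: whitespace-only cells are treated as empty.
    return sum(1 for row in body if any(cell.strip() for cell in row))


table = [
    ["Attribute", "Value"],
    ["Color", "red"],
    ["", "  "],        # entirely empty row, not counted
    ["Weight", ""],    # one non-empty cell, counted
]
print(count_data_rows(table))  # prints 2
```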

min. max. average median
Columns Horizontal Tables (attributes) 2 2562.402
Rows Horizontal Tables (entities) 3469949.083
Columns Vertical Tables (entities) 3 242727.533
Rows Vertical Tables (attributes) 22412.062
Table 2: Statistics about the columns and rows

3.1 Horizontal Tables

Figure 3 shows the distribution of number of columns (attributes) per table (Horizontal Table).

Fig. 3 - Distribution of Number of Columns (attributes) per Table (Horizontal Table)


The complete distribution of number of columns per horizontal table can be found here. The key of the key value pair represents the number of columns and the value the number of horizontal tables having exactly this number of columns, e.g. the line Key : 33 Value : 1323 means that there are 1,323 tables with exactly 33 columns.

Figure 4 shows the distribution of number of data rows (entities) per table (Horizontal Table).

Fig. 4 - Distribution of Number of Rows (entities) per Table (Horizontal Table)

The complete distribution of number of rows per horizontal table can be found here. The key of the key value pair represents the number of data rows and the value the number of horizontal tables having exactly this number of data rows, e.g. the line Key : 8 Value : 2932320 means that there are 2,932,320 tables with exactly 8 data rows.

3.2 Vertical Tables

Figure 5 shows the distribution of number of columns (entities) per table (Vertical Table).

Fig. 5 - Distribution of Number of Columns (entities) per Table (Vertical Table)


The complete distribution of number of columns per vertical table can be found here. The key of the key value pair represents the number of columns and the value the number of vertical tables having exactly this number of columns, e.g. the line Key : 13 Value : 42917 means that there are 42,917 tables with exactly 13 columns. Since we know that these tables are vertical, this number corresponds to the number of rows after flipping the table.

Figure 6 shows the distribution of number of data rows (attributes) per table (Vertical Table).

Fig. 6 - Distribution of Number of Rows (attributes) per Table (Vertical Table)

The complete distribution of number of rows per vertical table can be found here. The key of the key value pair represents the number of data rows and the value the number of vertical tables having exactly this number of data rows, e.g. the line Key : 8 Value : 38554 means that there are 38,554 tables with exactly 8 rows. Since we know that these tables are vertical, this number corresponds to the number of columns after flipping the table.

4. Header Distribution

To get a first impression of the topics of the tables, we applied a simple heuristic for identifying the column headers of each Web table. Our heuristic is based on the Cell Content Pattern, which is defined as a tuple containing a representation of the composition of characters in a cell [Tang2006]. After extracting the content pattern of each cell, the patterns in the current row are compared with those in the following rows. If a row shows different patterns than the rows that follow it, we consider this row to be the header containing the column names. So far, we only consider two cases: either the first row is the header row or no header row exists. A more sophisticated header unfolding [Chen2013] would be necessary to find, for example, headers that span several rows.
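The following sketch illustrates one way such a pattern comparison could work. It is not the implementation used for the corpus: the character classes, the lookahead of five rows, and the rule that at least half of the columns must differ are all assumptions chosen for illustration, as are the function names.

```python
def cell_pattern(cell):
    """Collapse a cell into a tuple of character classes.

    Example: 'Rank 12' becomes ('alpha', 'space', 'digit').
    The choice of classes is an assumption, not taken from [Tang2006].
    """
    classes = []
    for ch in cell.strip():
        if ch.isalpha():
            cls = "alpha"
        elif ch.isdigit():
            cls = "digit"
        elif ch.isspace():
            cls = "space"
        else:
            cls = "punct"
        if not classes or classes[-1] != cls:
            classes.append(cls)  # keep only changes of character class
    return tuple(classes)


def first_row_is_header(rows, lookahead=5, min_share=0.5):
    """Guess whether the first row is a header row.

    A column counts as differing if the pattern of its first cell does not
    occur in the next `lookahead` rows. Both thresholds are assumptions.
    """
    if len(rows) < 2 or not rows[0]:
        return False
    header, body = rows[0], rows[1:1 + lookahead]
    differing = 0
    for col, cell in enumerate(header):
        body_patterns = {cell_pattern(r[col]) for r in body if col < len(r)}
        if cell_pattern(cell) not in body_patterns:
            differing += 1
    return differing / len(header) >= min_share


with_header = [["Year", "Price"], ["2014", "12.50"], ["2015", "13.00"]]
without_header = [["2013", "11.00"], ["2014", "12.50"]]
print(first_row_is_header(with_header), first_row_is_header(without_header))
```

Like the heuristic described above, this sketch only distinguishes between "the first row is the header" and "no header exists".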

In contrast to our previous extraction, we can now deal with headers of vertical tables [Crestan2011] and we know whether a header is present or not (about 20% of all tables do not have a header according to [Pimplikar2012]). We did not take column name synonyms such as 'population' and 'number of inhabitants' into account. The only normalization we apply is to remove a trailing 's' to obtain the singular form of nouns. Thus, the number of different headers can be seen as an upper bound.
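The trailing-'s' normalization can be sketched as follows. The function name normalize_header and the sample headers are hypothetical; the sketch applies only the single rule mentioned above and performs no case folding or synonym handling.

```python
from collections import Counter


def normalize_header(header):
    """Remove a single trailing 's' to approximate the singular form."""
    return header[:-1] if header.endswith("s") else header


# Hypothetical headers, used only to show how variants collapse.
headers = ["Titles", "Title", "Price", "Prices", "Address"]
counts = Counter(normalize_header(h) for h in headers)
print(counts)
```

The last sample header shows the limit of this rule: 'Address' becomes 'Addres', because the rule cannot tell a plural ending from a word that simply ends in 's'.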


With the current approach, we were able to identify a total of 100,420,803 column headers, of which 17,812,609 are distinct. Figure 7 shows popular (useful) column headers together with their number of occurrences. The most frequently used header is the empty string (20,060,263 times), which we exclude from the figure since it does not provide any information about the content of the tables.

Fig. 7 - Popular Column Headers

The complete distribution of headers can be found here. The key of the key value pair represents the header and the value the number of columns having exactly this header, e.g. the line Key : Title Value : 5934288 means that there are 5,934,288 columns with title as header.

5. Column Data Types Distribution

We used a rough type guessing algorithm to detect the data type of each table column. First, the data type of each column value was detected, using five predefined data types: string, numeric, date, boolean and list. Afterwards, the most frequent data type in the column was chosen as the final data type of the column.
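A minimal sketch of such a two-step, majority-vote type guesser is shown below. Only the five type names and the majority vote come from the description above; the per-value detection rules (accepted date formats, boolean literals, list separators) are assumptions made for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed date layouts; the formats actually recognized are not documented here.
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y")


def guess_value_type(value):
    """Guess the type of one cell value using simple, assumed rules."""
    text = value.strip()
    if text.lower() in {"true", "false", "yes", "no"}:
        return "boolean"
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return "date"
        except ValueError:
            pass  # not a date in this format, try the next one
    if re.fullmatch(r"[+-]?\d+([.,]\d+)*", text):
        return "numeric"
    if "," in text or ";" in text:
        return "list"
    return "string"


def guess_column_type(values):
    """Return the most frequent value type in the column."""
    counts = Counter(guess_value_type(v) for v in values)
    return counts.most_common(1)[0][0] if counts else "string"


print(guess_column_type(["12.50", "13", "n/a"]))  # prints numeric
```

Because the column type is decided by majority, a few malformed or missing values (such as "n/a" in the example) do not change the result.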

Figure 8 shows the distribution of column data types.

Fig. 8 - Column Data Types Distribution

6. Context Information

Table 3 provides basic statistics about the context data that we additionally extracted. For each table, we extract the 200 words before and after the table. In previous experiments, we found that without additional context or temporal information it is difficult to process the tables further, e.g. to match them to a knowledge base [Zhang2013]. For almost half of the tables, we can extract a timestamp which is located after the entity table. In many cases, this timestamp is the imprint of the webpage. The last modified date comes from the HTTP header of the HTML page.
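The 200-word context window can be sketched as below, assuming the page text has already been split into the part before and the part after the table and that words are separated by whitespace. The function name context_window and this splitting are assumptions, not a description of the extraction framework.

```python
def context_window(text_before, text_after, n_words=200):
    """Return the last n_words before and the first n_words after a table.

    Assumes n_words is positive and that words are whitespace-separated.
    """
    before = text_before.split()[-n_words:]
    after = text_after.split()[:n_words]
    return " ".join(before), " ".join(after)


before, after = context_window("a b c", "d e f", n_words=2)
print(before, "|", after)  # prints b c | d e
```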

                               #tables
Timestamp Before Entity Table  15,044,690
Timestamp After Entity Table   43,998,029
Last Modified Date             19,654,944
Table 3: Extracted Context Information