Dominique Ritze
Oliver Lehmberg
Robert Meusel
Christian Bizer
Sanikumar Zope

This page provides basic statistics about the subset of relational Web tables in the WDC Web Table Corpus 2015 that are in English. The subset consists of 50,820,165 tables out of the 90 million relational Web tables in the corpus. In addition, we offer statistics about a subset consisting of only relational tables and a subset containing entity tables. All tables are publicly available for download.

1. Identifying English-language Web Tables

In contrast to previous extractions, where English-language tables were filtered by their top-level domain ("com", "org", "net", "eu", "uk"), we now use language detection to identify English-language Web tables. The language detection uses language profiles that were learned from Wikipedia abstracts using a naive Bayesian filter. According to the authors, the algorithm achieves a precision of over 99% and can be used for 53 languages.

As input for the language detection, we take the page title, the table header, and the text before and after the table. To test the language detection, we used the relational Web tables extracted from one WARC file (1,461 Web tables) and manually evaluated whether each table was correctly classified as English or non-English. The algorithm chooses the correct class in 99.1% of the cases, whereas the heuristic using top-level domains does so in only 86.1% of the cases.
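
For comparison, the earlier top-level-domain heuristic can be sketched in a few lines. This is a minimal illustration, not the code used for the corpus: the function names `tld_of` and `looks_english_by_tld` are hypothetical, and the TLD is simply taken as the last dot-separated label of the host name.

```python
from urllib.parse import urlparse

# TLDs used by the earlier heuristic, as listed in the text above.
ENGLISH_TLDS = {"com", "org", "net", "eu", "uk"}


def tld_of(url: str) -> str:
    """Return the last dot-separated label of the URL's host name (hypothetical helper)."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1].lower()


def looks_english_by_tld(url: str) -> bool:
    """Baseline heuristic: treat a page as English if its TLD is in the fixed list."""
    return tld_of(url) in ENGLISH_TLDS
```

Such a rule cannot tell apart an English and a non-English page on the same "com" domain, which is in line with its lower accuracy in the evaluation above.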

2. TLDs Distribution

Figure 1 shows the distribution of extracted English-language Web tables per top-level domain.

Fig. 1 - Number of tables per TLD

The complete distribution of tables per top-level domain can be found here. The file contains a list of key value pairs, with the TLD as key and the number of tables as value. For example, the entry Key : .com Value : 35469047 means that 35,469,047 tables were extracted from the "com" domain. Compared to the previous corpus, it is noticeable that the "gov" domain is not even among the top 20 TLDs.
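
The distribution files can be read with a small parser. The sketch below assumes that every line looks exactly like the example entry quoted above (`Key : <key> Value : <count>`); the actual file layout may differ, and `parse_distribution_line` is a hypothetical helper name.

```python
import re
from typing import Tuple

# Assumed line layout, taken from the example entry in the text.
LINE_RE = re.compile(r"^Key\s*:\s*(?P<key>.*?)\s+Value\s*:\s*(?P<value>\d+)\s*$")


def parse_distribution_line(line: str) -> Tuple[str, int]:
    """Split one 'Key : ... Value : ...' line into (key, count)."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized line: {line!r}")
    return match.group("key"), int(match.group("value"))


key, count = parse_distribution_line("Key : .com Value : 35469047")
```

The key is kept as a string because the same layout is used for TLDs, PLDs, headers, and numeric row or column counts.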

3. PLD Distribution

Figure 2 shows the distribution of extracted Web tables per pay-level domain.

Fig. 2 - Number of tables per PLD

Altogether, 323,160 different PLDs are represented in all the 51 million English-language tables. The complete distribution of tables per pay-level domain can be found here. Again, the file contains key value pairs where the key represents the PLD and the value the number of tables per PLD.

4. Table size and Distribution

Table 1 shows the overall number of English-language relational tables, divided into horizontal and vertical tables. In a horizontal table, the entities are represented in rows and the attributes in columns. If instead the entities are in columns and the attributes in rows, we speak of a vertical table. A vertical table can be transformed into a horizontal table by simply flipping (transposing) it.

             #tables
horizontal   47,669,450
vertical     3,150,715
sum          50,820,165
Table 1: Number of horizontal and vertical relational tables
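
Flipping a vertical table into a horizontal one amounts to a transpose. A minimal sketch, assuming the table is held as a rectangular list of rows (the function name `flip` and the sample data are illustrative only):

```python
def flip(table):
    """Transpose a rectangular table given as a list of rows."""
    return [list(column) for column in zip(*table)]


# Vertical table: each row is an attribute, each column an entity.
vertical = [
    ["name", "Berlin", "Paris"],
    ["country", "Germany", "France"],
]

# Horizontal table: each row is an entity, each column an attribute.
horizontal = flip(vertical)
```

Note that `zip` silently truncates ragged rows, so tables with missing cells would need padding first.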

Table 2 provides basic statistics for the tables' size. The row numbers exclude the header row (if present) and thus refer to data rows. Data rows are all rows of the table that are positioned under the header row and contain at least one non-empty cell.

                                         min.   max.    average   median
Columns Horizontal Tables (attributes)   2      18106   5.22      4
Rows Horizontal Tables (entities)        3      17033   16.06     7
Columns Vertical Tables (entities)       3      16142   8.47      4
Rows Vertical Tables (attributes)        2      486     4.47      6
Table 2: Statistics about the columns and rows
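
The definition of data rows used for these statistics can be sketched as follows. This is an illustration under the assumption that cells are plain strings and that whitespace-only cells count as empty; `count_data_rows` is a hypothetical name.

```python
def count_data_rows(rows, has_header=True):
    """Count rows below the header that contain at least one non-empty cell."""
    body = rows[1:] if has_header else rows
    return sum(1 for row in body if any(cell.strip() for cell in row))


example = [
    ["name", "city"],   # header row, not counted
    ["Ann", "Rome"],    # data row
    ["", ""],           # completely empty, not counted
    ["Bob", " "],       # one non-empty cell, counted
]
```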

4.1. Horizontal Tables

Figure 3 shows the distribution of number of columns (attributes) per table (Horizontal Table).

Fig. 3 - Distribution of Number of Columns (attributes) per Table (Horizontal Table)

The complete distribution of number of columns per horizontal table can be found here. The key of the key value pair represents the number of columns and the value the number of horizontal tables having exactly this number of columns. For example, the line Key : 33 Value : 1323 means that there are 1,323 tables with exactly 33 columns.

Figure 4 shows the distribution of number of data rows (entities) per table. Data rows are all rows of the table that are positioned under the header row and contain at least one non-empty cell.

Fig. 4 - Distribution of Number of Rows (entities) per Table (Horizontal Table)

The complete distribution of number of rows per horizontal table can be found here. The key of the key value pair represents the number of data rows and the value the number of horizontal tables having exactly this number of data rows, e.g. the line Key : 8 Value : 2932320 means that there are 2,932,320 tables with exactly 8 data rows.

4.2. Vertical Tables

Figure 5 shows the distribution of number of columns (entities) per table (Vertical Table).

Fig. 5 - Distribution of Number of Columns (entities) per Table (Vertical Table)


The complete distribution of number of columns per vertical table can be found here. The key of the key value pair represents the number of columns and the value the number of vertical tables having exactly this number of columns, e.g. the line Key : 13 Value : 42917 means that there are 42,917 tables with exactly 13 columns. Since we know that these tables are vertical, this number corresponds to the number of rows after flipping the table.

Figure 6 shows the distribution of number of data rows (attributes) per table (Vertical Table).

Fig. 6 - Distribution of Number of Rows (attributes) per Table (Vertical Table)

The complete distribution of number of rows per vertical table can be found here. The key of the key value pair represents the number of data rows and the value the number of vertical tables having exactly this number of data rows, e.g. the line Key : 8 Value : 38554 means that there are 38,554 tables with exactly 8 rows. Since we know that these tables are vertical, this number corresponds to the number of columns after flipping the table.

5. Headers Distribution

To get a first impression of the topics of the tables, we applied a simple heuristic for identifying the column headers of each Web table. Our heuristic is based on the Cell Content Pattern, which is defined as a tuple containing a representation of the composition of characters in a cell [Tang2006]. After extracting the content pattern of each cell, the patterns in the current row are compared with those in the following rows. If a row shows different patterns than the rows following it, we consider this row the header containing the column names. So far, we only consider two cases: the first row is the header row, or no header row exists. A more sophisticated header unfolding [Chen2013] would be necessary to find, for example, headers that span several rows.
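
The idea behind this heuristic can be illustrated with a simplified sketch. It is not the implementation used for the corpus: the pattern below only records which character classes occur in a cell, and the decision rule (the first row must differ from the most common pattern below it in at least `min_fraction` of the columns) is an assumption made for this example.

```python
from collections import Counter


def cell_pattern(cell: str) -> tuple:
    """Simplified content pattern: (has letters, has digits, has other symbols)."""
    return (
        any(ch.isalpha() for ch in cell),
        any(ch.isdigit() for ch in cell),
        any(not ch.isalnum() and not ch.isspace() for ch in cell),
    )


def first_row_is_header(rows, min_fraction=0.5) -> bool:
    """Guess whether rows[0] is a header by comparing it with the rows below."""
    if len(rows) < 2 or not rows[0]:
        return False
    header, body = rows[0], rows[1:]
    differing = 0
    for col, cell in enumerate(header):
        # Most common pattern among the cells below the candidate header cell.
        below = Counter(cell_pattern(row[col]) for row in body if col < len(row))
        if below and cell_pattern(cell) != below.most_common(1)[0][0]:
            differing += 1
    return differing / len(header) >= min_fraction


with_header = [
    ["name", "population", "founded"],
    ["Berlin", "3645000", "1237"],
    ["Paris", "2161000", "250"],
]
without_header = [
    ["Berlin", "3645000"],
    ["Paris", "2161000"],
]
```

A rule of this kind cannot detect headers in tables whose columns contain only text, since header and data cells then share the same pattern.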

In contrast to our previous extraction, we can now deal with headers of vertical tables [Crestan2011], and we know whether a header is present or not (about 20% of all tables do not have a header according to [Pimplikar2012]). We did not take column name synonyms such as 'population' and 'number of inhabitants' into account. The only normalization we apply is to remove a trailing 's' to obtain the singular forms of nouns. Thus, the number of different headers can be seen as an upper bound.
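
The effect of this normalization on the header counts can be sketched as follows. The function name `normalize_header` and the sample headers are illustrative; only the trailing-'s' rule itself comes from the description above.

```python
from collections import Counter


def normalize_header(header: str) -> str:
    """Remove a single trailing 's', the only normalization described above."""
    return header[:-1] if header.endswith("s") else header


sample_headers = ["date", "dates", "population", "number of inhabitants"]
distinct = Counter(normalize_header(h) for h in sample_headers)
```

Here "date" and "dates" are merged, while the synonyms "population" and "number of inhabitants" remain separate entries, which is why the count of different headers is an upper bound. The rule is also crude: a word such as "address" is shortened to "addres".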


With the current approach, we were able to identify a total of 49,496,022 column headers, of which 2,823,260 are distinct. Figure 7 shows popular (useful) column headers together with their number of occurrences. The most frequent header is the empty string (17,077,810 occurrences), which we exclude from the figure since it does not provide any information about the content of the tables.

Fig. 7 - Popular Column Headers

The complete distribution of headers can be found here. The file contains a list of entries with two tab-separated fields, the header and the number of tables. For example, the entry Key : date Value : 4226269 means that 4,226,269 tables contain a column with the header "date".

6. Column Data Types Distribution

Currently computed

We used a rough type guessing algorithm to detect the data type of each table column. First, the data type of each column value was detected, using five predefined data types: string, numeric, date, boolean, and list. Afterwards, the most frequent data type in the column was chosen as the final data type of the column.
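
The two-step procedure (guess a type per value, then take the majority per column) can be sketched as below. The per-value rules are assumptions made for this illustration, including the accepted boolean words, the date formats, and the list separators; the rules actually used for the corpus are not specified here.

```python
from collections import Counter
from datetime import datetime

# Assumed formats and tokens, chosen only for this example.
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y")
BOOLEAN_WORDS = {"true", "false", "yes", "no"}
LIST_SEPARATORS = (";", "|")


def guess_value_type(value: str) -> str:
    """Assign one of the five data types to a single cell value."""
    text = value.strip()
    if text.lower() in BOOLEAN_WORDS:
        return "boolean"
    try:
        float(text.replace(",", ""))  # tolerate thousands separators
        return "numeric"
    except ValueError:
        pass
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return "date"
        except ValueError:
            continue
    if any(sep in text for sep in LIST_SEPARATORS):
        return "list"
    return "string"


def guess_column_type(values) -> str:
    """Pick the most frequent value type as the column type."""
    counts = Counter(guess_value_type(v) for v in values)
    if not counts:
        return "string"
    return counts.most_common(1)[0][0]
```

With a majority vote, a numeric column containing a few placeholder strings such as "n/a" is still typed as numeric.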

Figure 8 shows the distribution of column data types.

Fig. 8 - Column Data Types Distribution

7. Context Information

Table 3 provides basic statistics about context-related data. It shows for how many of the English-language relational tables context-related data exists (the surrounding words are extracted for all tables). The last modified date comes from the HTTP header of the HTML page.

                                    Number of tables
Timestamp Before Relational Table   7,559,004
Timestamp After Relational Table    23,966,060
Last Modified Date of HTML Page     10,253,797
Table 3: Extracted Context Information