This page describes the data format used to represent table data and contains the download instructions for the WDC Web Table Corpora 2012. General information about the WDC Web Tables Corpus 2012 can be found on the overview page.
1. Data Formats and Download
The main corpus of Web tables is divided into 854,083 gzip files, which are packed into 885 tar archives. Each tar archive in the complete corpus contains 1,000 gzip files, and each gzip file contains the Web tables extracted from a couple of thousand Web pages. For each Web page that contains at least one content Web table, we provide the corresponding HTML file, the set of extracted Web tables in CSV format, and a JSON file that contains metadata about the extraction of the Web tables. Each JSON file contains the URL of the Web page, a reference to the corresponding HTML file in the gzip file, and information about each of the extracted Web tables. All files that refer to the same Web page share the same file name prefix; e.g., a JSON file with the name 71657325_XXXXXXX.json would refer to the HTML file 71657325_YYYYYY and a list of CSV files: 71657325_0_ZZZZZZZ.csv, 71657325_1_ZZZZZZZ.csv, etc.
For each of the extracted Web tables, the JSON file contains the position of the table inside the HTML file, and basic statistics for the data in the Web tables. The complete JSON Schema can be found here.
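The shared file name prefix makes it possible to regroup the files that belong to one Web page. The following is a minimal sketch, not part of the official tooling. It assumes that the prefix is everything before the first underscore, that JSON and CSV files can be recognized by their extensions, that any other file is the HTML file, and that the second underscore-separated field of a CSV name is the table index. The function name `group_by_page` is hypothetical.

```python
from collections import defaultdict


def group_by_page(file_names):
    """Group corpus file names by their shared page prefix.

    Assumptions (inferred from the naming examples, not from a specification):
    - the prefix is the part before the first underscore,
    - *.json is the metadata file, *.csv are extracted tables,
    - any other name is treated as the HTML file.
    """
    pages = defaultdict(lambda: {"json": None, "html": None, "csv": []})
    for name in file_names:
        prefix = name.split("_", 1)[0]
        if name.endswith(".json"):
            pages[prefix]["json"] = name
        elif name.endswith(".csv"):
            pages[prefix]["csv"].append(name)
        else:
            pages[prefix]["html"] = name
    # Order the CSV files by the assumed table index (second field of the name).
    for entry in pages.values():
        entry["csv"].sort(key=lambda n: int(n.split("_")[1]))
    return dict(pages)


if __name__ == "__main__":
    names = [
        "71657325_1_ZZZZZZZ.csv",
        "71657325_XXXXXXX.json",
        "71657325_YYYYYY",
        "71657325_0_ZZZZZZZ.csv",
    ]
    print(group_by_page(names))
```

File names that do not follow the pattern shown in the examples would need additional handling, for instance a stricter check before the table index is parsed.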
To download the Web table corpora, use the following links:
Data Set | Size | #Files |
---|---|---|
Relational Corpus 2012 | 1,020 GB | 885 (.tar) |
English-Language Relational Web Tables | 697 GB | 180 (.tar) |
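After downloading one of the tar archives, a first sanity check is to list what it contains. The sketch below uses Python's standard `tarfile` module to return the names of the regular files packed in an archive. The function name `list_archive_members` and the archive path in the usage comment are hypothetical, and the internal layout of the individual gzip files is not covered here.

```python
import tarfile


def list_archive_members(tar_path):
    """Return the names of all regular files stored in a tar archive."""
    # Mode "r" lets tarfile detect the compression (if any) automatically.
    with tarfile.open(tar_path, "r") as archive:
        return [member.name for member in archive.getmembers() if member.isfile()]


# Usage (hypothetical local path to a downloaded archive):
#   names = list_archive_members("downloads/0.tar")
#   print(len(names), "files in archive")
```

Comparing the number of returned names with the expected number of gzip files per archive gives a quick indication of whether the download is complete.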
2. Feedback
Please send questions and feedback to the Web Data Commons mailing list or post them in our Web Data Commons Google Group.
More information about Web Data Commons can be found here.
3. Credits
The extraction of the Web Table Corpus was supported by the German Research Foundation (DFG) under grant number PA 2373/1-1 (Mine@LOD), by an Amazon Web Services in Education Grant award, and by the EU FP7 research project PlanetData.