Download Instructions for the WDC RDFa, Microdata, Embedded JSON-LD, and Microformats Data Sets (December 2020)

This document explains how to download the December 2020 version of the Web Data Commons RDFa, Microdata, Embedded JSON-LD, and Microformats data sets.

Download the Extracted Data

The extracted RDFa, Microdata, Microformats, and Embedded JSON-LD data is provided for download as N-Quads. Files are compressed with gzip. Each file is around 100 MB, except the RDFa files, which are around 30 MB each. In total, 21,346 files with a combined size of 1.9 TB are provided.
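
Because the files are gzip-compressed N-Quads, they can be streamed line by line without unpacking them to disk first. The following Python sketch illustrates this under two assumptions that are not stated above: every statement carries a graph label as its fourth term, and that label is the URL of the page the data was extracted from. The function names are hypothetical, and the regular expression is a shortcut rather than a full N-Quads parser.

```python
import gzip
import re
from collections import Counter
from urllib.parse import urlsplit

# Matches the last IRI term directly before the closing " ." of a statement.
# Assumption: every statement has a graph label; for a plain triple with an
# IRI object, this pattern would return the object instead.
_GRAPH_RE = re.compile(r"<([^<>\s]*)>\s*\.\s*$")


def iter_graph_iris(source):
    """Yield the graph IRI of each quad in a gzip-compressed N-Quads file.

    `source` may be a file name or a binary file object.
    """
    with gzip.open(source, mode="rt", encoding="utf-8", errors="replace") as lines:
        for line in lines:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            match = _GRAPH_RE.search(line)
            if match:
                yield match.group(1)


def count_quads_per_host(source):
    """Count quads per host name of the graph IRI."""
    return Counter(urlsplit(iri).netloc for iri in iter_graph_iris(source))
```

Calling count_quads_per_host on a downloaded file returns a Counter keyed by host name. Memory use stays low because only one line is held at a time.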

List of download URLs for RDF from the December 2020 corpus (Example Content)

The extracted RDF data can be downloaded with wget: the command wget -i http://webdatacommons.org/structureddata/2020-12/files/file.list fetches every file named in the list. The file lists for the individual formats are given in the table below, together with the number of files and their approximate total size.
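
Where wget is not available, the same download can be approximated in Python. This is a minimal sketch, not part of the Web Data Commons tooling: it assumes the file list is plain text with one download URL per line, and the function names, the output directory name, and the limit parameter are hypothetical.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlsplit

FILE_LIST_URL = "http://webdatacommons.org/structureddata/2020-12/files/file.list"


def parse_file_list(text):
    """Return the non-empty lines of a file list (assumed: one URL per line)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def target_path(url, out_dir):
    """Map a download URL to a local path named after the last URL segment."""
    return Path(out_dir) / Path(urlsplit(url).path).name


def download_all(list_url=FILE_LIST_URL, out_dir="wdc-2020-12", limit=None):
    """Download the files named in the list, skipping files already present."""
    with urllib.request.urlopen(list_url) as resp:
        urls = parse_file_list(resp.read().decode("utf-8"))
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for url in urls[:limit]:  # limit=None means "all files"
        dest = target_path(url, out_dir)
        if dest.exists():
            continue
        # Write to a temporary name first so an interrupted transfer
        # is never mistaken for a complete file on the next run.
        tmp = dest.with_name(dest.name + ".part")
        with urllib.request.urlopen(url) as resp, open(tmp, "wb") as out:
            shutil.copyfileobj(resp, out)
        tmp.replace(dest)
```

Given the total size of the corpus, a call such as download_all(limit=2) is a sensible first test before starting the full download.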

To download the class-specific schema.org data sets, please see the WDC class-specific subsets page.

Format                 Number of Files   Approx. Total File Size   File List
html-rdfa              5,167             166 GB                    html-rdfa.list
html-microdata         8,480             862 GB                    html-microdata.list
html-embedded-jsonld   5,273             536 GB                    html-embedded-jsonld.list
html-mf-geo            4                 390 MB                    html-mf-geo.list
html-mf-hcalendar      13                1.2 GB                    html-mf-hcalendar.list
html-mf-hcard          2,316             229 GB                    html-mf-hcard.list
html-mf-adr            19                1.83 GB                   html-mf-adr.list
html-mf-hrecipe        4                 301 MB                    html-mf-hrecipe.list
html-mf-hlisting       4                 334 MB                    html-mf-hlisting.list
html-mf-hresume        1                 1.52 MB                   html-mf-hresume.list
html-mf-hreview        15                1.38 GB                   html-mf-hreview.list
html-mf-species        1                 7.47 MB                   html-mf-species.list
html-mf-xfn            46                4.42 GB                   html-mf-xfn.list
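
If only some formats are needed and the complete file list is already at hand, its URLs can be grouped by format before downloading. The sketch below assumes that each download URL contains the format name from the first column of the table as a substring. The text above does not confirm this, so check it against the actual list first. The function names are hypothetical.

```python
from collections import defaultdict

# Format names as listed in the first column of the table.
FORMATS = [
    "html-rdfa", "html-microdata", "html-embedded-jsonld",
    "html-mf-geo", "html-mf-hcalendar", "html-mf-hcard", "html-mf-adr",
    "html-mf-hrecipe", "html-mf-hlisting", "html-mf-hresume",
    "html-mf-hreview", "html-mf-species", "html-mf-xfn",
]


def format_of(url, formats=FORMATS):
    """Return the longest format name contained in the URL, or None."""
    hits = [name for name in formats if name in url]
    return max(hits, key=len) if hits else None


def group_by_format(urls, formats=FORMATS):
    """Group URLs by format; URLs matching no format are stored under None."""
    groups = defaultdict(list)
    for url in urls:
        groups[format_of(url, formats)].append(url)
    return dict(groups)
```

URLs that match no format are kept under the key None rather than dropped, so a wrong naming assumption shows up immediately.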

Get the Code

The source code can be checked out from our Subversion repository. The December 2020 extraction was done with version 1.0.5 of the extractor. For more information about the framework and a detailed description of how to run your own extraction, visit the framework page.
The quads were analyzed using version 0.0.1 of the StructuredDataProfiler, which is also available in the Subversion repository.

Get Support

Please send questions and feedback to the Web Data Commons mailing list or post them in our Web Data Commons Google Group.