WDC - RDFa, Microdata, and Microformat Data Sets

More and more websites have started to embed structured data describing products, people, organizations, places, and events into their HTML pages using markup standards such as Microdata, JSON-LD, RDFa, and Microformats. The Web Data Commons project extracts this data from several billion web pages. So far the project provides 12 different data set releases extracted from the Common Crawls 2010 to 2023. The project provides the extracted data for download and publishes statistics about the deployment of the different formats.

News

2025-01-10: We have released the WDC RDFa, Microdata, Microformat, and Embedded JSON-LD data sets extracted from the October 2024 Common Crawl corpus and created multiple schema.org class-specific subsets.
2024-02-01: We have released the WDC Schema.org Table Corpus 2023 which contains ~5M tables and is based on the October 2023 WDC schema.org extraction.
2024-01-08: We have released the WDC RDFa, Microdata, Microformat, and Embedded JSON-LD data sets extracted from the October 2023 Common Crawl corpus and created multiple schema.org class-specific subsets.
2023-06-22: We have released WDC Block a benchmark for comparing the performance of blocking methods. WDC Block features a maximal Cartesian product of 200 billion pairs of product offers which were extracted form 3,259 e-shops.
2023-04-30: We present the Web Data Commons Schema.org Data Set Series at the poster session of the ACM Web Conference 2023 (WWW'23). You can find the poster here.
The poster session takes place at the Zlotnik Foyer on Tuesday, May 2 from 5:00 to 6:00pm. Our Poster Number is 70.
2023-01-25: We have released the WDC RDFa, Microdata, Microformat, and Embedded JSON-LD data sets extracted from the October 2022 Common Crawl corpus and created multiple schema.org class-specific subsets.
2022-09-22: We have released the WDC Schema.org Table Annotation Benchmark for evaluating the performance of methods for annotating columns of Web tables with terms from the Schema.org vocabulary.
2022-01-04: We have released the WDC RDFa, Microdata, Microformat, and Embedded JSON-LD data sets extracted from the October 2021 Common Crawl corpus and created multiple schema.org class-specific subsets.
2021-03-29: We have released the WDC Schema.org Table Corpus, which was created by grouping the December 2020 schema.org class-specific subsets into relational tables.
2021-01-21: We have released the WDC RDFa, Microdata, Microformat, and Embedded JSON-LD data sets extracted from the September 2020 Common Crawl corpus and created multiple schema.org class-specific subsets.
2020-01-13: We have released the WDC RDFa, Microdata, Microformat, and Embedded JSON-LD data sets extracted from the November 2019 Common Crawl corpus and created multiple schema.org class-specific subsets.
2019-10-23: Version 2.0 of the WDC Product Data Corpus and Gold Standard for Large-Scale Product Matching released.
2019-05-15: Journal Article about Using the Semantic Web as a Source of Training Data has been published by the Datenbank-Spektrum Journal.
2019-02-27: Paper about The WDC training dataset and gold standard for large-scale product matching accepted at ECNLP Workshop at WWW2019 conference in San Francisco.
2019-01-17: We have released the WDC RDFa, Microdata, Microformat, and Embedded JSON-LD data sets extracted from the November 2018 Common Crawl corpus and created multiple schema.org class-specific subsets.
2018-12-20: We have released the WDC Training Dataset and Gold Standard for Large-Scale Product Matching.
2018-01-08: RDFa, Microdata, Microformat, and Embedded JSON-LD data sets extracted from the November 2017 Common Crawl corpus available for download.
2018-01-08: We have created several Class-Specific Subsets and statistics of the Schema.org Data contained in the 2017 Microdata Corpus (e.g. schema.org/Product and schema.org/Offer) in order to make it easier to work with the data.
2017-01-17: RDFa, Microdata, Microformat, and Embedded JSON-LD data sets extracted from the October 2016 Common Crawl corpus available for download.
2017-01-17: We have created several Class-Specific Subsets and statistics of the Schema.org Data contained in the 2016 Microdata Corpus (e.g. schema.org/Product and schema.org/Offer) in order to make it easier to work with the data.
2016-04-25: RDFa, Microdata, Microformat, and Embedded JSON-LD data sets extracted from the November 2015 Common Crawl corpus available for download.
2015-06-20: Paper about A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time accepted at WIMS2015 conference in Limassol, Cyprus.
2015-04-02: RDFa, Microdata, and Microformat data sets extracted from the December 2014 Common Crawl corpus available for download.
2015-03-30: Paper about Heuristics for Fixing Common Errors in Deployed schema.org Microdata accepted at ESWC2015 conference in Portoroz, Slovenia.
2014-12-04: We have created several Class-Specific Subsets of the Schema.org Data contained in the Winter 2013 Microdata Corpus (e.g. schema.org/Product and schema.org/Offer) in order to make it easier to work with the data.
2014-08-27: We have released an easy to customize version of the WDC Extraction Framework including a tutorial, which explains the usage and customization in detail. See also our guest post at the Common Crawl Blog.
2014-07-06: Paper about WebDataCommons Microdata, Rdfa and Microformats Dataset Series accepted at ISWC'14 conference in Riva del Garda - Trentino, Italy: The WebDataCommons Microdata, RDFa and Microformat Dataset Series
2014-04-01: RDFa, Microdata, and Microformat data sets extracted from the Winter 2013 Common Crawl corpus available for download.
2014-01-20: Paper about the integration of product data from the WDC Microdata data set accpeted at the DEOS 2014 workshop at the WWW 2014 conference in Seoul: Integrating Product Data from Websites offering Microdata Markup

1. About Web Data Commons

More and more websites embed structured data describing for instance products, people, organizations, places, events, resumes, and cooking recipes into their HTML pages using markup formats such as Microdata, JSON-LD, RDFa, and Microformats. The Web Data Commons project extracts all Microdata, JSON-LD, RDFa, and Microformats data from the Common Crawl web corpus, the largest and most up-to-data web corpus that is currently available to the public, and provide the extracted data for download in the form of RDF-quads. In addition, we calculate and publish statistics about the deployment of the different formats as well as the vocabularies that are used together with each format.

Up till now, we have extracted all Microdata, JSON-LD, RDFa, and Microformats data from the following releases of the Common Crawl web corpora:

For the future, we plan to rerun our extraction on a regular basis as new Common Crawl corpora are becoming available. We publish the corpora for research purposes only.

Below, you find information about the extracted data formats and detailed statistics about the extraction results. In addition, we have analyzed trends in the deployment of the most widely spread formats as well as in the deployment of selected RDFa and Microdata classes. This analysis can be found here.

2. Extracted Data Formats

The table below provides an overview of the different structured data formats that we extract from the Common Crawl. The table contains references to the specifications of the formats as well as short descriptions of the formats. Web Data Commons packages the extracted data for each format separately for download. The table also defines the format identifiers that are used in the following.

Format	Description	Identifier
Embedded JSON-LD	JSON-LD is a lightweight Linked Data format. It is easy for humans to read and write.	`html-embeddedjsonld`
RDFa	RDFa is a specification for attributes to express structured data in any markup language, e.g HTML. The underlying abstract representation is RDF, which lets publishers build their own vocabulary, extend others, and evolve their vocabulary with maximal interoperability over time.	`html-rdfa`
HTML Microdata	Microdata allows nested groups of name-value pairs to be added to HTML documents, in parallel with the existing content.	`html-microdata`
hCalendar Microformat	hCalendar is a calendaring and events format, using a 1:1 representation of standard iCalendar (RFC2445) VEVENT properties and values in HTML.	`html-mf-hcard`
hCard Microformat	hCard is a format for representing people, companies, organizations, and places, using a 1:1 representation of vCard (RFC2426) properties and values in HTML.	`html-mf-hcard`
Geo Microformat	Geo a 1:1 representation of the "geo" property from the vCard standard, reusing the geo property and sub-properties as-is from the hCard microformat. It can be used to markup latitude/longitude coordinates in HTML.	`html-mf-geo`
hListing Microformat	hListing is a proposal for a listings (UK English: small-ads; classifieds) format suitable for embedding in HTML.	`html-mf-hlisting`
hResume Microformat	The hResume format is based on a set of fields common to numerous resumes published today on the web embedded in HTML.	`html-mf-hresume`
hReview Microformat	hReview is a format suitable for embedding reviews (of products, services, businesses, events, etc.) in HTML.	`html-mf-hreview`
hRecipe Microformat	hRecipe is a format suitable for embedding information about recipes for cooking in HTML.	`html-mf-recipe`
Species Microformat	The Species proposal enables marking up taxonomic names for species in HTML.	`html-mf-species`
XFN Microformat	XFN (XHTML Friends Network) is a simple format to represent human relationships using hyperlinks.	`html-mf-xfn`

3. Extraction Results

3.1. Trends

We give an overview of the structured data adoption through the years by analyzing trends in the deployment of the most widely spread formats as well as in the deployment of selected RDFa and Microdata classes. For this analysis we use the data sets released between 2012 and 2024.
It is important to mention that the corresponding CommonCrawl web corpora have different sizes (2 billion to 3 billion HTML pages), cover different amounts of websites (12 million PLDs to 40 million PLDs, selection by importance of PLD) and also only partly overlap in the covered HTML pages. Thus, the following trends must be interpreted with caution.

Adoption by Format

The diagram below shows the total number of pay-level domains (PLD) making use of one of the four most widely spread markup formats (Microdata, Microformat hCard, Embedded JSON-LD and RDFa) within the twelve web corpora. In total, we see a continuous increase in the deployment of structured data. Microdata and JSON-LD dominate over the other markup formats, while the JSON-LD has an even faster growing rate in comparison to Microdata, which is mostly observed when looking at the number of PLDs with JSON-LD annotations in 2021. Indeed, this trend can be verified by looking at the format specific stats of 2021, 2022, 2023 and 2024 where we can see that some domains switch their main annotation format from Microdata to JSON-LD. Microformat hCard is used by a steady number of websites while the growth of RDFa markup format has stopped. The second diagram shows how the amount of triples that we extracted from the crawls has developed between 2012 and 2024. The growing trends are similar to ones the appearing in the PLD diagram, however we observe a high peak on the amount of triples for the year 2016, 2020, 2021, 2022 and 2023. We can interpret this peak by looking at the crawl size and depth of the corresponding crawl versions. The October 2016, September 2020, October 2021, October 2022 and October 2023 crawls were the largest in size in comparison to the crawls of November 2017, 2018 and 2019. Additionally, they are much deeper, i.e. more pages of the same PLD are crawled. The October 2024 crawl is smaller than the October 2023 crawl explaining the decline in triples.

Adoption of Selected Schema.org Classes

Below, we analyze the development of the adoption of selected schema.org classes embedded using the Microdata and JSON-LD syntax. The two diagrams below show again the deployment of those classes by number of deploying PLDs and number of entities extracted from the crawls. We can see a continuous increase in the number of PLDs adopting the schema.org classes. The adoption of the schema:Product and schema:PostalAddress class is considerably growing mostly over the last years. We notice a drop from 2018 to 2019 in the number of PLDs using the schema:LocalBusiness class with the JSON-LD format. However, related classes such as schema:Organization and schema:Hotel noted significant growth from 2018 to 2019: 1.3M PLDs to 2.2M PLDs for the schema:Organization class and 2.7 PLDs to 9.8K PLDs for the schema:Hotel class. Therefore, the LocalBusiness class might have been replaced by other related classes from certain PLDs which could easily explain the drop. To verify this, further analysis is required. In 2018 schema:FAQPage was released and was used by only 19 PLDs. By 2024 it has been adopted by 308K PLDs. Finally, we observe that between the years 2019 and 2021 the usage of the microdata format peaks and then drops in 2022 and 2023. However, this is not the case for the JSON-LD format which has an exceptional growth in 2020 and continues to grow in 2021, 2022, 2023 and 2024 in terms of number of PLDs using the specific schema.org classes.

Adoption of Selected RDFa Classes

In the following we report trends in the adoption of selected RDFa classes. The first diagram shows the number of PLDs using each class. The second diagram shows the total number of entities of each class contained in the WDC RDFa data sets. We see that the number of websites deploying the Facebook Open Graph Protocol classes og:website, og:article, foaf:Document, gd:breadcrumb and og:product declines.

3.2. Extraction Results from the October 2024 Common Crawl Corpus

The October 2024 Common Crawl Corpus is available on Amazon S3 in the bucket commoncrawl under the key prefix /crawl-data/CC-MAIN-2024-42/ .

Extraction Statistics

Crawl Date	October 2024
Total Data	77.33 Terabyte	(compressed)
Parsed HTML URLs	2,391,039,772
URLs with Triples	1,245,622,627
Domains in Crawl	37,447,141
Domains with Triples	16,525,070
Typed Entities	15,647,463,083
Triples	73,993,669,093
Size of Extracted Data	1.4 Terabyte	(compressed)

Extraction Costs

The costs for parsing the 77.33 Terabytes of compressed input data of the October 2024 Common Crawl corpus, extracting the schema.org data and storing the extracted data on S3 totaled 619 USD in AWS fees. We used up to 250 spot instances of type c7.2xlarge for the extraction which altogether required 3,169 machine hours.

3.3. Extraction Results from the October 2023 Common Crawl Corpus

The October 2023 Common Crawl Corpus is available on Amazon S3 in the bucket commoncrawl under the key prefix /crawl-data/CC-MAIN-2023-40/ .

Extraction Statistics

Crawl Date	October 2023
Total Data	98.38 Terabyte	(compressed)
Parsed HTML URLs	3,353,090,410
URLs with Triples	1,696,953,312
Domains in Crawl	34,144,094
Domains with Triples	14,646,081
Typed Entities	20,890,660,670
Triples	97,689,391,384
Size of Extracted Data	1.8 Terabyte	(compressed)

Extraction Costs

The costs for parsing the 98.38 Terabytes of compressed input data of the October 2023 Common Crawl corpus, extracting the schema.org data and storing the extracted data on S3 totaled 1,075 USD in AWS fees. We used up to 250 spot instances of type c7.2xlarge for the extraction which altogether required 4,602 machine hours.

3.4. Extraction Results from the October 2022 Common Crawl Corpus

The October 2022 Common Crawl Corpus is available on Amazon S3 in the bucket commoncrawl under the key prefix /crawl-data/CC-MAIN-2022-40/ .

Extraction Statistics

Crawl Date	October 2022
Total Data	82.71 Terabyte	(compressed)
Parsed HTML URLs	3,048,746,652
URLs with Triples	1,518,609,988
Domains in Crawl	33,820,102
Domains with Triples	14,235,035
Typed Entities	19,072,628,514
Triples	86,462,816,435
Size of Extracted Data	1.6 Terabyte	(compressed)

Extraction Costs

The costs for parsing the 82.71 Terabytes of compressed input data of the October 2022 Common Crawl corpus, extracting the schema.org data and storing the extracted data on S3 totaled 1,102 USD in AWS fees. We used 100 spot instances of type c6.2xlarge for the extraction which altogether required 5,290 machine hours.

3.5. Extraction Results from the October 2021 Common Crawl Corpus

The October 2021 Common Crawl Corpus is available on Amazon S3 in the bucket commoncrawl under the key prefix /crawl-data/CC-MAIN-2021-43/ .

Extraction Statistics

Crawl Date	October 2021
Total Data	85.11 Terabyte	(compressed)
Parsed HTML URLs	3,195,003,256
URLs with Triples	1,516,194,663
Domains in Crawl	35,377,372
Domains with Triples	14,564,790
Typed Entities	18,483,343,653
Triples	82,142,918,869
Size of Extracted Data	1.6 Terabyte	(compressed)

Extraction Costs

The costs for parsing the 85.11 Terabytes of compressed input data of the October 2021 Common Crawl corpus, extracting the schema.org data and storing the extracted data on S3 totaled 1,363 USD in AWS fees. We used 100 spot instances of type c5.2xlarge for the extraction which altogether required 6,400 machine hours.

3.6. Extraction Results from the September 2020 Common Crawl Corpus

The September 2020 Common Crawl Corpus is available on Amazon S3 in the bucket commoncrawl under the key prefix /crawl-data/CC-MAIN-2020-40/ .

Extraction Statistics

Crawl Date	September 2020
Total Data	81.8 Terabyte	(compressed)
Parsed HTML URLs	3,410,268,379
URLs with Triples	1,701,573,394
Domains in Crawl	34,596,585
Domains with Triples	15,316,527
Typed Entities	21,636,494,250
Triples	86,381,005,124
Size of Extracted Data	1,9 Terabyte	(compressed)

3.7. Extraction Results from the November 2019 Common Crawl Corpus

The November 2019 Common Crawl Corpus is available on Amazon S3 in the bucket commoncrawl under the key prefix /crawl-data/CC-MAIN-2019-47/ .

Extraction Statistics

Crawl Date	November 2019
Total Data	53.9 Terabyte	(compressed)
Parsed HTML URLs	2,454,900,132
URLs with Triples	934,814,452
Domains in Crawl	32,040,026
Domains with Triples	11,917,576
Typed Entities	14,450,406,289
Triples	44,245,690,165
Size of Extracted Data	1.01TB	(compressed)

3.8. Extraction Results from the November 2018 Common Crawl Corpus

The November 2018 Common Crawl Corpus is available on Amazon S3 in the bucket commoncrawl under the key prefix /crawl-data/CC-MAIN-2018-47/ .

Extraction Statistics

Crawl Date	November 2018
Total Data	54 Terabyte	(compressed)
Parsed HTML URLs	2,544,381,895
URLs with Triples	944,883,841
Domains in Crawl	32,884,530
Domains with Triples	9,650,571
Typed Entities	7,096,578,647
Triples	31,563,782,097
Size of Extracted Data	739GB	(compressed)

3.9. Extraction Results from the November 2017 Common Crawl Corpus

The November 2017 Common Crawl Corpus is available on Amazon S3 in the bucket commoncrawl under the key prefix /crawl-data/CC-MAIN-2017-47/ .

Extraction Statistics

Crawl Date	November 2017
Total Data	66 Terabyte	(compressed)
Parsed HTML URLs	3,155,601,774
URLs with Triples	1,228,129,002
Domains in Crawl	26,271,491
Domains with Triples	7,422,886
Typed Entities	9,430,164,323
Triples	38,721,044,133

3.10. Extraction Results from the October 2016 Common Crawl Corpus

The October 2016 Common Crawl Corpus is available on Amazon S3 in the bucket commoncrawl under the key prefix /crawl-data/CC-MAIN-2016-44/ .

Extraction Statistics

Crawl Date	October 2016
Total Data	56 Terabyte	(compressed)
Parsed HTML URLs	3,181,199,447
URLs with Triples	1,242,727,852
Domains in Crawl	34,076,469
Domains with Triples	5,638,796
Typed Entities	9,590,731,005
Triples	44,242,655,138

3.11. Extraction Results from the November 2015 Common Crawl Corpus

The November 2015 Common Crawl Corpus is available on Amazon S3 in the bucket commoncrawl under the key prefix /crawl-data/CC-MAIN-2015-48/ .

Extraction Statistics

Crawl Date	November 2015
Total Data	151 Terabyte	(compressed)
Parsed HTML URLs	1,770,525,212
URLs with Triples	541,514,775
Domains in Crawl	14,409,425
Domains with Triples	2,724,591
Typed Entities	6,107,584,968
Triples	24,377,132,352

Format Breakdown

As the charts show a large fraction of websites make already use of embedded JSON-LD. In most cases (>90%) the websites use the syntax to enable Google to create a search box within the search results, as annonced by Google in September 2014. A interesting discussion about the topic can be found in the Google+ Posting by Aaron Bradley.

3.12. Extraction Results from the December 2014 Common Crawl Corpus

The December 2014 Common Crawl Corpus is available on Amazon S3 in the bucket commoncrawl under the key prefix /crawl-data/CC-MAIN-2014-52/ .

Extraction Statistics

Crawl Date	Winter 2014
Total Data	64 Terabyte	(compressed)
Parsed HTML URLs	2,014,175,679
URLs with Triples	620,151,400
Domains in Crawl	15,668,667
Domains with Triples	2,722,425
Typed Entities	5,516,068,263
Triples	20,484,755,485

Format Breakdown

3.13. Extraction Results from the November 2013 Common Crawl Corpus

The November 2013 Common Crawl Corpus is available on Amazon S3 in the bucket commoncrawl under the key prefix /crawl-data/CC-MAIN-2013-48/ .

Extraction Statistics

Crawl Date	Winter 2013
Total Data	44 Terabyte	(compressed)
Parsed HTML URLs	2,224,829,946
URLs with Triples	585,792,337
Domains in Crawl	12,831,509
Domains with Triples	1,779,935
Typed Entities	4,264,562,758
Triples	17,241,313,916

Format Breakdown

3.14. Extraction Results from the August 2012 Common Crawl Corpus

The August 2012 Common Crawl Corpus is available on Amazon S3 in the bucket commoncrawl under the key prefix /parse-output/segment/ .

Extraction Statistics

Crawl Date	January-June 2012
Total Data	40.1 Terabyte	(compressed)
Parsed HTML URLs	3,005,629,093
URLs with Triples	369,254,196
Domains in Crawl	40,600,000
Domains with Triples	2,286,277
Typed Entities	1,811,471,956
Triples	7,350,953,995

Format Breakdown

Extraction Costs

The costs for parsing the 40.1 Terabytes of compressed input data of the August 2012 Common Crawl corpus, extracting the RDF data and storing the extracted data on S3 totaled 398 USD in Amazon EC2 fees. We used 100 spot instances of type c1.xlarge for the extraction which altogether required 5,636 machine hours.

3.14b. Extraction Results from the February 2012 Common Crawl Corpus

Common Crawl did publish a pre-release version of its 2012 corpus in February. The pages contained in the pre-release are a subset of the pages contained in the August 2012 Common Crawl corpus. We also extracted the structured data from this pre-release. The resulting statistics are found here, but are superseded by the August 2012 statistics.

3.15. Extraction Results from the 2009/2010 Common Crawl Corpus

The 2009/2010 Common Crawl Corpus is available on Amazon S3 in the bucket commoncrawl under the key prefix /crawl-002/ .

Extraction Statistics

Crawl Dates	Sept 2009 (4 TB) Jan 2010 (6.9 TB) Feb 2010 (4.3 TB) Apr 2010 (4.4 TB) Aug 2010 (3.6 TB) Sept 2010 (6 TB)
Total Data	28.9 Terabyte	(compressed)
Total URLs	2,804,054,789
Parsed HTML URLs	2,565,741,671
Domains with Triples	19,113,929
URLs with Triples	147,871,837
Typed Entities	1,546,905,880
Triples	5,193,276,058

Format Breakdown

Extraction Costs

The costs for parsing the 28.9 Terabytes of compressed input data of the 2009/2010 Common Crawl corpus, extracting the RDF data and storing the extracted data on S3 totaled 576 EUR (excluding VAT) in Amazon EC2 fees. We used 100 spot instances of type c1.xlarge for the extraction which altogether required 3,537 machine hours.

4. Example Data

For each data format, we provide a small subset of the extracted data below for testing purposes. The data is encoded as N-Quads, with the forth element used to represent the provenance of each triple (the URL of the page the triple was extracted from). Be advised to use a parser which is able to skip invalid lines, since they could present in the data files.

5. Note about the N-Quads Download Files

It is important to note that the N-Quads download files do not conform completely with the N-Quads specification concerning blank node identifiers. The specification requires labels of distinct blank nodes to be unique with respect to the complete N-Quads document. In our N-Quads files, the blank node labels are unique only with respect to the HTML page from which the data was extracted. This means that different blank nodes in a download file may have the same label. For distinguishing between these nodes, the blank node label needs to be considered together with the URL of the page from which the data was extracted (the fourth element of the quad). This issue is due to 100 machines working in parallel on the extraction of the data from the web corpus without communicating with each other. We may fix this issue in upcoming WDC releases by renaming the blank nodes.

6. Conversion to Other Formats

We provide the extracted data for download using a variation of the N-Quads format. For users who prefer other formats, we provide code for converting the download files into the CSV and JSON formats, which are supported by a wider range of spreadsheet applications, relational databases are data mining tools.

The conversion tool takes the following parameters:

Parameter Name Description

out Folder where the outputfile(s) are written to

in Folder containing WDC download files in N-Quads format

threads Number of threads used for the conversion

convert Indicates the output format. Supported formats: JSON, CSV

density Indicates the minimum density of properties in order be included in the output file. Range: 0 - 1. A density of 0.2 indicates that the properties that have more than 20% non-null values will be included in the output .

multiplePropValues Indicates if the converted result should contain all property values for a certain property of a subject or if one value per property is enough. Range: [true/false]

Parameter Name	Description
out	Folder where the outputfile(s) are written to
in	Folder containing WDC download files in N-Quads format
threads	Number of threads used for the conversion
convert	Indicates the output format. Supported formats: JSON, CSV
density	Indicates the minimum density of properties in order be included in the output file. Range: 0 - 1. A density of 0.2 indicates that the properties that have more than 20% non-null values will be included in the output .
multiplePropValues	Indicates if the converted result should contain all property values for a certain property of a subject or if one value per property is enough. Range: [true/false]

Below you can find an example command which transforms the files found in the input directory to JSON files using 5 threads and density as well as property value filtering.

java -cp  StatsCreator-0.0.2-SNAPSHOT-jar-with-dependencies.jar org.webdatacommons.structureddata.stats.WDCQuadConverter  -out "output_convert"
            
-in "input_convert" -threads 5 -tp "http://www.w3.org/1999/02/22-rdf-syntax-ns#type,http://ogp.me/ns#type,http://opengraphprotocol.org/schema/type"
            
-convert "JSON" -density 0.15 -multiplePropValues true

File Structure

CSV file format

Each file starts with three fixed headers [graph, subject, type] followed by the property set of headers. Every line after the header represents one entity. You can find a sample CSV file with the structure of the conversion output here.

JSON file format

Each file contains a list of JSON objects with three fixed properties [graph, subject, type] followed by the property set describing the concrete entity. Every JSON object in the file represents one entity. You can find a sample JSON file with the structure of the conversion output here.

Conversion Process

In the following, we document the conversion process that is performed by the tool: The first step of the conversion process is to sort the input .nq file by subject and URL. For this purpose a temporary file containing the sorted entities is created and deleted by the end of the conversion. The size of the temporary file is equal to the size of the input .nq file. Next, the retrieved entities are written in the converted file. In the case of the CSV file format, all the distinct predicates are stored during parsing for the header row to be filled. In the case of the JSON file format the entities are transformed from Java objects to JSON ones with the help of the Gson library. The provided tool supports parallel execution on the directory level, meaning that multiple files can be converted simultaneously. In addition, the conversion tool provides density and property value filtering. In the case of density filtering the user can set a density threshold in order to filter uncommon properties. Please note that in the average case the maximum property density is calculated to be 35%, so a relatively high threshold could lead to empty property value results. In the case of property value filtering, the user can choose if the converted file should keep track of all the multiple values of a certain property belonging to a certain subject or if one value is enough for his/ her purposes.

7. Extraction Process

Since the Common Crawl data sets are stored in the AWS Simple Storage Service (S3), it made sense to perform the extraction in the Amazon cloud (EC2). The main criteria here is the cost to achieve a certain task. Instead of using the ubiquitous Hadoop framework, we found using the Simple Queue Service (SQS) for our extraction process increased efficiency. SQS provides a message queue implementation, which we use to co-ordinate the extraction nodes. The Common Crawl dataset is readily partitioned into compressed files of around 100MB each. We add the identifiers of each of these files as messages to the queue. A number of EC2 nodes monitor this queue, and take file identifiers from it. The corresponding file is then downloaded from S3. Using the ARC file parser from the Common Crawl codebase, the file is split into individual web pages. On each page, we run our RDF extractor based on the Anything To Triples (Any23) library. The resulting RDF triples are then written back to S3 together with the extraction statistics, which are later collected. The advantage of this queue is that messages have to be explicitly marked as processed, which is done after the entire file has been extracted. Should any error occur, the message is requeued after some time and processed again.

Any23 parses web pages for structured data by building a DOM tree and then evaluates XPath expressions to find structured data. While profiling, we found this tree generation to account for much of the parsing cost, and we have thus searched for a way to reduce the number of times this tree is built. Our solution is to run (Java) regular expressions against each webpages prior to extraction, which detect the presence of a microformat in a HTML page, and then only run the Any23 extractor when the regular expression find potentional matches. The formats html-mf-hcard, html-mf-hcalendar, html-mf-hlisting, html-mf-hresume, html-mf-hreview and html-mf-recipe define unique enough class names, so that the presence of the class name in the HTML document is ample indication of the Microformat being present. For the remaining formats, the following table shows the used regular expressions.

Format	Regular Expression
html-rdfa	`(property\|typeof\|about\|resource)\\s*=`
html-microdata	`(itemscope\|itemprop\\s*=)`
html-mf-xfn	`<a[^>]rel\\s=\\s(\"\|')[^\"'](contact\|acquaintance\|friend\|met\|co-worker\|colleague\|co-resident\|neighbor\|child\|parent\|sibling\|spouse\|kin\|muse\|crush\|date\|sweetheart\|me)`
html-mf-geo	`class\\s=\\s(\"\|')[^\"']*geo`
html-mf-species	`class\\s=\\s(\"\|')[^\"']*species`

8. Source Code

The source code can be checked out from our Github repository. Afterwards, create your own configuration by copying src/main/resources/ccrdf.properties.dist to src/main/resources/ccrdf.properties, then fill in your AWS authentication information and bucket names. Compilation is performed using Maven, thus changing into the source root directory and typing mvn install should be sufficient to create a build. In order to run the extractor on more than 10 EC2 nodes, you will have to request an EC2 instance limit increase for your AWS account. More information about running the extractor is provided in the file readme.txt .

9. License

The Web Data Commons extraction framework can be used under the terms of the Apache Software License.

10. Feedback

Please send questions and feedback to the Web Data Commons mailing list or post them in our Web Data Commons Google Group.

More information about Web Data Commons is found here.

11. Credits

Web Data Commons has started as a joint effort of the Freie Univeristät Berlin and the Institute AIFB at the Karlsruhe Institute of Technology in early 2012. Now it is mainly maintained by the Research Group Data and Web Science at the University of Mannheim.

We thank our former contributors for the help and support:

Hannes Mühleisen, now working at CWI in Amsterdam for the initial version of the extraction code and the first two releases.
Andreas Harth and Steffen Stadtmüller, working at Institute AIFB at KIT for their support during the first releases in 2012 and the hosting parts of the data.
Michael Schuhmacher, Johanna Völker working at the DWS Group and Kai Eckert, now working at the Hochschule der Medien in Stuttgart for their support creating the expressive statistics of the release in 2012.
Petar Petrovski for extracting the data and creating the statistics for the 2013 release.

Also lots of thanks to

the Common Crawl project for providing their great web crawl and thus enabling Web Data Commons.
the Any23 project for providing their great library of structured data parsers.

Web Data Commons was supported by the PlanetData and LOD2 research projects.

The extraction and analysis of the October 2016 and the November 2017 corpora was supported by the ViCE research project and the Ministry of Economy, Research and Arts of Baden - Württemberg .

12. References

Alexander Brinkmann, Anna Primpeli, Christian Bizer: The Web Data Commons Schema.org Data Set Series. In Companion Proceedings of the ACM Web Conference 2023 (WWW ’23 Companion), Austin, Texas, USA, April 2023.
Robert Meusel, Petar Petrovski, Christian Bizer: The WebDataCommons Microdata, RDFa and Microformat Dataset Series. In Proceedings of the 13th International Semantic Web Conference: Replication, Benchmark, Data and Software Track (ISWC2014).
Robert Meusel, Christian Bizer, Heiko Paulheim: A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time. 5th International Conference on Web Intelligence, Mining and Semantics (WIMS2015), Limassol, Cyprus, July 2015.
Robert Meusel and Heiko Paulheim: Heuristics for Fixing Common Errors in Deployed schema.org Microdata, To appear in: Proceedings of the 12th Extended Semantic Web Conference (ESWC 2015), Portoroz, Slovenia, May 2015.
Christian Bizer, Kai Eckert, Robert Meusel, Hannes Mühleisen, Michael Schuhmacher, and Johanna Völker: Deployment of RDFa, Microdata, and Microformats on the Web - A Quantitative Analysis. In Proceedings of the 12th International Semantic Web Conference, Part II: In-Use Track, pp.17-32 (ISWC2013).
Hannes Mühleisen, Christian Bizer: Web Data Commons - Extracting Structured Data from Two Large Web Corpora. In Proceedings of the WWW2012 Workshop on Linked Data on the Web (LDOW2012).
Peter Mika, Tim Potter: Metadata Statistics for a Large Web Corpus. In Proceedings of the WWW2012 Workshop on Linked Data on the Web (LDOW2012).
Peter Mika: Microformats and RDFa deployment across the Web. Blog Post.
Class Statistics from the Sindice data search engine.

Web Data Commons - Microdata, RDFa, JSON-LD, and Microformat Data Sets

News

Contents

1. About Web Data Commons

2. Extracted Data Formats

3. Extraction Results

3.1. Trends

Adoption by Format

Adoption of Selected Schema.org Classes

Adoption of Selected RDFa Classes

3.2. Extraction Results from the October 2024 Common Crawl Corpus

Extraction Statistics

Extraction Costs

3.3. Extraction Results from the October 2023 Common Crawl Corpus

Extraction Statistics

Extraction Costs

3.4. Extraction Results from the October 2022 Common Crawl Corpus

Extraction Statistics

Extraction Costs

3.5. Extraction Results from the October 2021 Common Crawl Corpus

Extraction Statistics

Extraction Costs

3.6. Extraction Results from the September 2020 Common Crawl Corpus

Extraction Statistics

3.7. Extraction Results from the November 2019 Common Crawl Corpus

Extraction Statistics

3.8. Extraction Results from the November 2018 Common Crawl Corpus

Extraction Statistics

3.9. Extraction Results from the November 2017 Common Crawl Corpus

Extraction Statistics

3.10. Extraction Results from the October 2016 Common Crawl Corpus

Extraction Statistics

3.11. Extraction Results from the November 2015 Common Crawl Corpus

Extraction Statistics

Format Breakdown

3.12. Extraction Results from the December 2014 Common Crawl Corpus

Extraction Statistics

Format Breakdown

3.13. Extraction Results from the November 2013 Common Crawl Corpus

Extraction Statistics

Format Breakdown

3.14. Extraction Results from the August 2012 Common Crawl Corpus

Extraction Statistics

Format Breakdown

Extraction Costs

3.14b. Extraction Results from the February 2012 Common Crawl Corpus

3.15. Extraction Results from the 2009/2010 Common Crawl Corpus

Extraction Statistics

Format Breakdown

Extraction Costs

4. Example Data

5. Note about the N-Quads Download Files

6. Conversion to Other Formats

File Structure

CSV file format

JSON file format

Conversion Process

7. Extraction Process

8. Source Code

9. License

10. Feedback

11. Credits

12. References