Fixing Common Errors in Deployed Schema.org Microdata

This document gives an overview of the most common errors in deployed schema.org Microdata, based on the Web Data Commons Microdata corpus, which was extracted from the November 2013 release of the Common Crawl.

RDF Description of Schema.org

To identify violations of the schema suggested by Schema.org, we extracted the list of all class and property combinations, together with the expected value types, from the official Schema.org RDF files. We separated the properties into datatype properties, which expect a http://schema.org/DataType as range, and object properties, which expect any kind of http://schema.org/Thing as range. We also identified a set of properties that are both datatype and object properties.
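
The separation into datatype properties, object properties, and properties that are both can be sketched as follows. This is a minimal illustration, not the extraction code used for the corpus: the function name classify_properties and the sample ranges are assumptions, and the sample is hand-written rather than read from the Schema.org RDF files.

```python
# Local names of http://schema.org/DataType and its subtypes.
DATATYPES = {"Boolean", "Date", "DateTime", "Number", "Float",
             "Integer", "Text", "URL", "Time"}


def classify_properties(ranges):
    """Map each property to 'datatype', 'object', or 'both'.

    `ranges` maps a property name to the non-empty set of type names
    it expects as value.
    """
    result = {}
    for prop, range_types in ranges.items():
        has_datatype = any(t in DATATYPES for t in range_types)
        has_object = any(t not in DATATYPES for t in range_types)
        if has_datatype and has_object:
            result[prop] = "both"
        elif has_datatype:
            result[prop] = "datatype"
        else:
            result[prop] = "object"
    return result


# Hypothetical, hand-written sample; the real input would be derived
# from the official Schema.org RDF files.
SAMPLE_RANGES = {
    "name": {"Text"},
    "author": {"Person", "Organization"},
    "image": {"URL", "ImageObject"},
}

classification = classify_properties(SAMPLE_RANGES)
```

Any range type that is not a known datatype is treated as a subtype of http://schema.org/Thing here, which is a simplification.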
The list of files which were used can be found below:

Common Mistakes

Our analysis of errors in the deployment of schema.org on the Web follows the analysis of common errors by Hogan et al., "Weaving the Pedantic Web", in: Linked Data on the Web (2010). In the following, we briefly outline the most common groups of errors. A more comprehensive analysis can be found in our paper, submitted to ESWC 2015.

Wrong Namespaces

Within the corpus we found over 360 different namespaces, including the two most common: data-vocabulary.org and schema.org. As our analysis is only concerned with schema.org, we kept only those namespaces that contain the substring 'schema.org'. This results in 149 remaining namespaces. The files of the resulting corpus can be obtained from the MDSchemaOrgQuads.list file.
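
The substring filter described above can be sketched in a few lines. The function name split_namespaces and the namespace list are assumptions made for illustration; the list is not taken from the corpus.

```python
CANONICAL = "http://schema.org/"


def split_namespaces(namespaces):
    """Return (related, deviating) namespace lists.

    `related` keeps every namespace containing the substring 'schema.org';
    `deviating` is the subset of those that differ from the canonical form.
    """
    related = [ns for ns in namespaces if "schema.org" in ns]
    deviating = [ns for ns in related if ns != CANONICAL]
    return related, deviating


# Hypothetical sample namespaces.
NAMESPACES = [
    "http://schema.org/",
    "http://data-vocabulary.org/",
    "http://www.schema.org/",
    "https://schema.org/",
    "http://schema.org",
    "http://purl.org/dc/terms/",
]

related, deviating = split_namespaces(NAMESPACES)
```

The comparison is case-sensitive, matching a plain substring test; whether the original filter ignored case is not stated.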

The most common errors were:

Undefined Types and Properties

Regarding types and properties, we found that 6.16% of the pay-level domains use undefined types and 8.02% use undefined properties. For types, the major causes are missing slashes and wrong capitalization in the type names. For properties, we divide the problem into two cases: (1) properties which are not defined at all and (2) properties which are not defined for the particular type they are used with. In the first case, the main causes are wrong capitalization and spelling errors, as well as newly made-up properties such as http://schema.org/postId. In the second case, the borrowed properties, we found that webmasters seem to forget to model the intermediate type, e.g. for http://schema.org/Product/ they directly modelled a http://schema.org/price instead of adding a http://schema.org/Offer object first.
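
To make the type errors concrete, the following sketch diagnoses a type URI against a set of known type names. It is an illustration under stated assumptions, not the procedure used for the numbers above: KNOWN_TYPES is a small hand-picked subset, and the function name diagnose_type and its status labels are hypothetical.

```python
PREFIX = "http://schema.org/"

# Illustrative subset; the full set would come from the Schema.org RDF files.
KNOWN_TYPES = {"Product", "Offer", "Person", "PostalAddress"}


def diagnose_type(uri, known_types=KNOWN_TYPES):
    """Return (status, suggested_uri) for a type URI.

    status is one of 'ok', 'malformed-uri', 'miscapitalized',
    'undefined', or 'foreign'.
    """
    marker = "schema.org"
    idx = uri.lower().rfind(marker)
    if idx == -1:
        return ("foreign", None)
    # Everything after 'schema.org', without leading or trailing slashes.
    local = uri[idx + len(marker):].strip("/")
    if local in known_types:
        canonical = PREFIX + local
        status = "ok" if uri == canonical else "malformed-uri"
        return (status, canonical)
    by_lower = {t.lower(): t for t in known_types}
    match = by_lower.get(local.lower())
    if match is not None:
        return ("miscapitalized", PREFIX + match)
    return ("undefined", None)


results = [diagnose_type(u) for u in (
    "http://schema.org/Product",
    "http://schema.orgProduct",
    "http://schema.org/product",
    "http://schema.org/Prodcut",
)]
```

Spelling errors such as the last example are only reported as undefined; suggesting a correction for them would need an additional similarity measure, which this sketch leaves out.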

Domain/Range Violations

In the area of domain/range violations for object and datatype properties we found several common mistakes.
First, over half of the pay-level domains that use object properties (92 449 PLDs) provide a datatype value at least once. For example, for the property http://schema.org/author a simple string, mostly the name of the author, is marked up instead of creating a new entity of type http://schema.org/Person with a name property. The opposite case, marking up datatype properties with an object, is rare: only 0.8% of all pay-level domains using datatype properties mark up an object as value for those properties at least once.
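
The "at least once per pay-level domain" counting can be sketched as below. The data model is an assumption made for this example: a literal value is a Python str, a nested item is a dict, and statements are (pld, property, value) triples. The property sets are small illustrative subsets, and the function names are hypothetical.

```python
# Illustrative subsets only; the full sets would come from the Schema.org RDF files.
OBJECT_PROPERTIES = {"author", "offers"}
DATATYPE_PROPERTIES = {"name", "price"}


def range_violation(prop, value):
    """Return a label for a range violation, or None if there is none."""
    is_literal = isinstance(value, str)
    if prop in OBJECT_PROPERTIES and is_literal:
        return "literal-for-object-property"
    if prop in DATATYPE_PROPERTIES and not is_literal:
        return "object-for-datatype-property"
    return None


def violating_share(statements, properties):
    """Fraction of PLDs using `properties` that violate the range at least once."""
    using, violating = set(), set()
    for pld, prop, value in statements:
        if prop not in properties:
            continue
        using.add(pld)
        if range_violation(prop, value) is not None:
            violating.add(pld)
    return len(violating) / len(using) if using else 0.0


# Hypothetical sample statements.
SAMPLE = [
    ("a.example", "author", "Jane Doe"),
    ("a.example", "author", {"type": "Person", "name": "Jane Doe"}),
    ("b.example", "author", {"type": "Person", "name": "John Roe"}),
    ("b.example", "price", "9.99"),
]

object_share = violating_share(SAMPLE, OBJECT_PROPERTIES)
datatype_share = violating_share(SAMPLE, DATATYPE_PROPERTIES)
```

In the sample, a.example counts as violating even though it also uses author correctly, because a single literal value is enough.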

Heuristics

In order to fix most of the common errors mentioned above, we have identified some simple heuristics to reduce the number of erroneous pay-level domains:

Fixed Version of the Schema.org Microdata Corpus

Using the heuristics mentioned above, we have created a fixed version of the original Microdata Schema.org corpus. The files of this new fixed corpus are listed in the fixedMDSchemaOrgQuad.list file.
The corpus can be downloaded using wget with the following command:

    wget -i http://webdatacommons.org/structureddata/2013-11/files/fixedMDSchemaOrgQuad.list

The quads within these files are sorted by URL and subject.
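
Because the quads are sorted by URL and subject, all statements about one entity are adjacent and can be collected in a single pass without loading a whole file into memory. The sketch below assumes the quads have already been parsed into (subject, predicate, object, url) tuples; the parsing step, the function name entities, and the sample data are assumptions for illustration.

```python
from itertools import groupby


def entities(quads):
    """Yield ((url, subject), [(predicate, object), ...]) per entity.

    Relies on the input being sorted by URL and subject, so that
    groupby sees each entity as one contiguous run.
    """
    for key, group in groupby(quads, key=lambda q: (q[3], q[0])):
        yield key, [(p, o) for _, p, o, _ in group]


# Hypothetical, already-parsed sample quads.
SAMPLE_QUADS = [
    ("_:n1", "http://schema.org/Product/name", "Lamp", "http://a.example/p1"),
    ("_:n1", "http://schema.org/Product/offers", "_:n2", "http://a.example/p1"),
    ("_:n2", "http://schema.org/Offer/price", "9.99", "http://a.example/p1"),
]

grouped = list(entities(SAMPLE_QUADS))
```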

Feedback

Please send questions and feedback to the Web Data Commons mailing list or post them in our Web Data Commons Google Group.

More information about Web Data Commons is found here.