This document gives an overview of the most common errors in deployed schema.org Microdata, based on the Web Data Commons Microdata corpus, which was extracted from the November 2013 release of the Common Crawl.
To identify violations of the schema suggested by Schema.org, we extracted the list of all class and property combinations, together with the expected value types, from the official Schema.org RDF files. We separated the properties into datatype properties, which expect an http://schema.org/DataType as range, and object properties, which expect any kind of http://schema.org/Thing as range. We also identified a set of properties that are both datatype and object properties.
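The separation into datatype, object, and mixed properties can be sketched as follows. This is a minimal illustration, not the extraction code used for the analysis: the set of datatype names and the `EXPECTED_RANGES` excerpt are hypothetical stand-ins for what would be read from the Schema.org RDF files.

```python
# Assumed set of http://schema.org/DataType subtypes (illustrative, not exhaustive).
DATATYPES = {"Text", "URL", "Number", "Integer", "Float",
             "Boolean", "Date", "DateTime", "Time"}


def classify_property(ranges):
    """Return 'datatype', 'object', or 'both' for a set of expected range types."""
    if not ranges:
        raise ValueError("a property needs at least one expected range type")
    has_datatype = any(r in DATATYPES for r in ranges)
    has_object = any(r not in DATATYPES for r in ranges)
    if has_datatype and has_object:
        return "both"
    return "datatype" if has_datatype else "object"


# Hypothetical excerpt of property -> expected range types.
EXPECTED_RANGES = {
    "name": {"Text"},
    "author": {"Person", "Organization"},
    "image": {"URL", "ImageObject"},
}

kinds = {prop: classify_property(r) for prop, r in EXPECTED_RANGES.items()}
print(kinds)
```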
The list of files which were used can be found below:
Our analysis of errors in the deployment of schema.org on the Web follows the analysis of common errors in "Weaving the pedantic web" (in: Linked Data on the Web) by Hogan et al. (2010). In the following, we briefly outline the most common groups of errors. A more comprehensive analysis can be found in our paper, submitted to ESWC 2015.
Within the corpus we found over 360 different namespaces, including the two most common ones: data-vocabulary.org and schema.org. Because our analysis is only concerned with schema.org, we kept only those namespaces that contain at least the substring 'schema.org'. This results in 149 remaining namespaces. The files of the resulting corpus can be obtained from the MDSchemaOrgQuads.list file.
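The substring filter described above can be sketched in a few lines. The namespace strings below are made-up examples, and the case-sensitive match is an assumption; the text only states that the substring 'schema.org' must be present.

```python
def keep_schema_org(namespaces):
    """Keep namespaces containing the substring 'schema.org' (case-sensitive, an assumption)."""
    return [ns for ns in namespaces if "schema.org" in ns]


# Hypothetical namespaces as they might appear in the corpus.
observed = [
    "http://schema.org/",
    "http://www.schema.org/",
    "https://schema.org/",
    "http://data-vocabulary.org/",
    "http://ogp.me/ns#",
]

kept = keep_schema_org(observed)
print(kept)
```

Note that such a filter deliberately keeps misspelled or otherwise deviating variants of the schema.org namespace, which is what makes the later error analysis possible.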
Regarding types and properties, we found that 6.16% of the pay-level domains use undefined types and 8.02% use undefined properties.
An inspection of the type errors shows that the major causes are missing slashes and wrong capitalization in the type names.
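Both kinds of type error can be detected, and often repaired, by comparing the local name case-insensitively against the list of defined types. The following is a minimal sketch under stated assumptions: `DEFINED_TYPES` is a tiny hypothetical excerpt of the vocabulary, and `repair_type` is an illustrative helper, not part of any published tool.

```python
PREFIX = "http://schema.org"

# Hypothetical excerpt of the defined schema.org types.
DEFINED_TYPES = {"Product", "Offer", "Person", "LocalBusiness"}
_BY_LOWER = {t.lower(): t for t in DEFINED_TYPES}


def repair_type(uri):
    """Return the canonical type URI, or None if no defined type matches."""
    if not uri.startswith(PREFIX):
        return None
    # Dropping the prefix and any surrounding slashes handles missing
    # slashes ("http://schema.orgProduct") and trailing ones alike.
    local = uri[len(PREFIX):].strip("/")
    canonical = _BY_LOWER.get(local.lower())
    return f"{PREFIX}/{canonical}" if canonical else None


print(repair_type("http://schema.orgProduct"))
print(repair_type("http://schema.org/localbusiness"))
print(repair_type("http://schema.org/Foo"))
```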
For the properties, we divide the problem into two cases: (1) properties that are not defined at all and (2) properties that are not defined for the particular type they are used with.
In the first case, the main causes are wrong capitalization and spelling errors, as well as newly invented properties such as http://schema.org/postId.
In the second case, the borrowed properties, we found that webmasters seem to forget to model the intermediate type. For example, for an http://schema.org/Product they directly model an http://schema.org/price instead of adding an http://schema.org/Offer object first.
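The two property cases can be told apart with a lookup against the property domains, taking inherited properties into account. The sketch below is illustrative only: `PROPERTY_DOMAINS` and `SUPERTYPES` are small hypothetical excerpts, and the hierarchy is simplified to a single parent per type.

```python
# Hypothetical excerpt: property -> types it is defined for.
PROPERTY_DOMAINS = {
    "name": {"Thing"},
    "offers": {"Product"},
    "price": {"Offer"},
}

# Simplified single-parent type hierarchy (an assumption of this sketch).
SUPERTYPES = {"Thing": None, "Product": "Thing",
              "Intangible": "Thing", "Offer": "Intangible"}


def ancestors(type_name):
    """Return the type itself followed by all of its supertypes."""
    chain = []
    while type_name is not None:
        chain.append(type_name)
        type_name = SUPERTYPES.get(type_name)
    return chain


def check_property(type_name, prop):
    """Classify the use of `prop` on an item of type `type_name`."""
    domains = PROPERTY_DOMAINS.get(prop)
    if domains is None:
        return "undefined property"        # case (1)
    if domains.intersection(ancestors(type_name)):
        return "ok"
    return "not defined for type"          # case (2), a borrowed property


print(check_property("Product", "price"))
print(check_property("Product", "postId"))
```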
In the area of domain/range violations for object and datatype properties we found several common mistakes.
First, over half of the pay-level domains (PLDs) making use of object properties (92 449 PLDs) provide a datatype value for such a property at least once. For example, for the property http://schema.org/author a simple string, mostly representing the name of the author, is marked up instead of a new entity of type http://schema.org/Person with a name property. The reverse case, marking up datatype properties with objects, is rare: only 0.8% of all pay-level domains using datatype properties mark up an object as the value of such a property at least once.
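The "at least once" share per pay-level domain can be computed as sketched below. The input format, a stream of `(pld, property, value_is_literal)` tuples, and the `OBJECT_PROPERTIES` excerpt are assumptions made for illustration; they are not the data model used for the reported numbers.

```python
# Hypothetical excerpt of the object properties.
OBJECT_PROPERTIES = {"author", "offers"}


def violating_pld_share(statements):
    """Share of PLDs using object properties that give a literal value at least once.

    `statements` is an iterable of (pld, property, value_is_literal) tuples.
    """
    users, violators = set(), set()
    for pld, prop, value_is_literal in statements:
        if prop in OBJECT_PROPERTIES:
            users.add(pld)
            if value_is_literal:
                violators.add(pld)
    return len(violators) / len(users) if users else 0.0


sample = [
    ("a.com", "author", True),   # plain string instead of a Person
    ("a.com", "author", False),
    ("b.com", "author", False),
    ("c.com", "name", True),     # datatype property, not counted here
]
share = violating_pld_share(sample)
print(share)
```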
In order to fix most of the common errors mentioned above, we have identified some simple heuristics to reduce the number of erroneous pay-level domains:
http://schema.org/name or http://schema.org/url property. Because in some cases the object properties allow different types, we recommend using the least abstract common supertype of all possible types.
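Choosing the least abstract (that is, most specific) common supertype can be sketched as a walk up the type hierarchy. The hierarchy excerpt below is hypothetical and simplified to one parent per type, and the function name is chosen for illustration only.

```python
# Simplified single-parent hierarchy (an assumption of this sketch).
SUPERTYPES = {
    "Thing": None,
    "Person": "Thing",
    "Organization": "Thing",
    "LocalBusiness": "Organization",
    "Corporation": "Organization",
}


def path_to_root(type_name):
    """Return the type followed by its supertypes, most specific first."""
    path = []
    while type_name is not None:
        path.append(type_name)
        type_name = SUPERTYPES[type_name]
    return path


def least_abstract_common_supertype(types):
    """Return the most specific type that is a supertype of all given types."""
    types = list(types)
    if not types:
        raise ValueError("at least one type is required")
    common = set(path_to_root(types[0]))
    for t in types[1:]:
        common &= set(path_to_root(t))
    # Walking upward from the first type, the first shared ancestor
    # is the most specific one.
    for candidate in path_to_root(types[0]):
        if candidate in common:
            return candidate
    return None


print(least_abstract_common_supertype(["Person", "Organization"]))
print(least_abstract_common_supertype(["LocalBusiness", "Corporation"]))
```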
Using the heuristics mentioned above, we have created a fixed version of the original Microdata Schema.org corpus. The files of this new fixed corpus are listed in the fixedMDSchemaOrgQuad.list file.
The corpus can be downloaded using wget with the command wget -i http://webdatacommons.org/structureddata/2013-11/files/fixedMDSchemaOrgQuad.list
The quads within these files are sorted by URL and subject.
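Because of this sort order, all quads describing one entity on one page are adjacent, so they can be grouped in a single streaming pass. The sketch below assumes that each line is an N-Quads statement whose fourth element is the page URL; the sample lines are invented, and the whitespace-based splitting is a simplification rather than a full N-Quads parser.

```python
from itertools import groupby


def url_and_subject(line):
    """Return (page URL, subject) of one N-Quads line (simplified parsing)."""
    body = line.strip()
    if body.endswith("."):
        body = body[:-1].rstrip()
    subject = body.split(None, 1)[0]   # first token
    url = body.rsplit(None, 1)[1]      # last token before the final dot
    return url, subject


def entities(lines):
    """Yield ((url, subject), quads) for consecutive runs of equal keys."""
    for key, group in groupby(lines, key=url_and_subject):
        yield key, list(group)


# Invented sample lines, already sorted by URL and subject.
sample = [
    '_:n1 <http://schema.org/name> "Some Product" <http://example.com/a> .',
    '_:n1 <http://schema.org/url> <http://example.com/a> <http://example.com/a> .',
    '_:n2 <http://schema.org/name> "Other" <http://example.com/a> .',
]

grouped = [(key, len(quads)) for key, quads in entities(sample)]
print(grouped)
```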
Please send questions and feedback to the Web Data Commons mailing list or post them in our Web Data Commons Google Group.
More information about Web Data Commons is found here.