Extending and Combining Microdata Vocabularies

Richard Cyganiak ● richard at cyganiak dot de ● Last update: 25 October 2011

Microdata is a syntax for marking up structured data inside HTML documents. It is superior to previous efforts such as microformats and RDFa, but has a significant drawback: it does not support the use of multiple vocabularies in conjunction. This document is a proposal for addressing this limitation. It is based on two patterns that relate itemtype URLs, property shortnames, and absolute property URLs in microdata vocabularies, called “vocabulary expansion” and “itemtype expansion”. The proposal does not change the syntax of microdata, but requires an update to microdata parsers, as well as to RDF parsers. It also suggests a whitelist of vocabularies that support the patterns.

@@@ Work in progress!!! This is an incomplete draft.

Table of Contents

  1 Introduction
  2 Use Cases
    2.1 Use a property from another vocabulary to augment schema.org
    2.2 Use RDF and OWL vocabularies as microdata vocabularies
    2.3 Use several vocabularies side-by-side on a single itemscope
    2.4 Use microdata with RDF tools
  3 Simple property URL expansion is not the answer
  4 Rules for vocabulary maintainers
  5 On vocabulary whitelists
  6 Rules for microdata parsers
  7 Rules for RDF parsers
  8 Rules for HTML markup authors
    8.1 Augment a vocabulary with a foreign property
    8.2 Use two vocabularies side by side
    8.3 Use RDF/OWL vocabularies as microdata vocabularies
  9 Acknowledgements
  10 Changelog

1 Introduction

Microdata is a syntax for marking up structured data inside HTML documents. It is part of the WHATWG's work on evolving HTML. It is superior to previous efforts such as microformats and RDFa thanks to dedicated support in the HTML language combined with a clean and simple design focused on author usability and practical use cases. It has a significant drawback though: it does not support the use of multiple vocabularies in conjunction. This is a roadblock for microdata adoption in situations where the use of other vocabularies besides the dominant one (the Google-backed schema.org) is desirable, and prevents early adopters of RDFa from upgrading to the superior syntax.

This is a proposal for addressing this limitation. It is based on two patterns that relate itemtype URLs, property shortnames, and absolute property URLs in microdata vocabularies, called “vocabulary expansion” and “itemtype expansion”. The proposal does not change the syntax of microdata, but requires an update to microdata parsers, as well as to RDF parsers. It also suggests a whitelist of vocabularies that support the patterns.

2 Use Cases

I will consider four use cases.

2.1 Use a property from another vocabulary to augment schema.org

An HTML author marks up their web page with terms from schema.org. They find that schema.org lacks a property that would be useful. However, they know that some other vocabulary already provides exactly the right term, and they would like to use that term on an item typed with a schema.org itemtype.

Microdata requires that a property of a typed item must either be a property name allowed for that type (in this case, a schema.org property), or it must be an absolute URL. It follows that the property from the other vocabulary can only be used if it is a full URL.

Example:

<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Philip Jägenstedt</span>
  (<span itemprop="http://microformats.org/profile/hcard#nickname">foolip</span>)
  works at <span itemprop="memberOf">Opera</span>
</div>

Here, the nickname property from the hCard microformat is used in a schema.org-typed item. This would not be allowed in plain microdata, because the hCard specification only specifies short names for properties (“nickname”); it doesn't specify absolute URLs for properties (“http://microformats.org/profile/hcard#nickname”).

2.2 Use RDF and OWL vocabularies as microdata vocabularies

Over the past decade, metadata experts and data modellers have already defined hundreds of vocabularies for use on and off the web.

Making all this work available for microdata markup authors would be a huge win.

However, because of their heritage in RDF and OWL, these vocabularies only define full absolute URLs for their properties. (RDF and OWL require full URLs for both properties and types.) This makes their use in microdata markup awkward, compared to dedicated microdata vocabularies with short property names.

Example:

<div itemscope itemtype="http://xmlns.com/foaf/0.1/Person">
  <span itemprop="name">Philip Jägenstedt</span>
  (<span itemprop="nick">foolip</span>)
</div>

Here we use FOAF, a popular RDF vocabulary for describing online profiles of people that predates microdata, to mark up a person.

In current microdata, the use of short names such as name and nick would not be allowed as the relevant specifications only define absolute URLs such as http://xmlns.com/foaf/0.1/nick.

2.3 Use several vocabularies side-by-side on a single itemscope

Among all vocabularies, schema.org enjoys a special place: thanks to Rich Snippets, schema.org markup will make your web pages look better on the mother of all traffic sources, Google. This creates very strong incentives for using schema.org terms wherever possible, and every other vocabulary had better play nice with schema.org or it won't stand much of a chance. But since current microdata only supports a single itemtype per itemscope, it is very hard and sometimes impossible to use other vocabularies on a page that is already marked up with schema.org terms.

Solving this problem is hard. There have been requests for allowing multiple itemtypes per item, but that only works if all the types share the same properties; otherwise, one cannot associate the properties with the right itemtype.

The solution is to extend microdata to allow multiple items on a single itemscope.

Example:

<div itemscope itemtype="http://schema.org/Person">
  <meta itemprop="http://n.whatwg.org/alt-itemtype" content="http://xmlns.com/foaf/0.1/Person">
  <span itemprop="name http://xmlns.com/foaf/0.1/name">Philip Jägenstedt</span>
  (<span itemprop="http://xmlns.com/foaf/0.1/nick">foolip</span>)
  works at <span itemprop="memberOf">Opera</span>
</div>

This requires some explanation. An extended microdata parser would create two items from this snippet. First, the obvious one:

item
  itemtype: http://schema.org/Person
  name: Philip Jägenstedt
  http://xmlns.com/foaf/0.1/name: Philip Jägenstedt
  http://xmlns.com/foaf/0.1/nick: foolip
  memberOf: Opera

In addition, it would create a second alternate item, because the itemscope contains an alt-itemtype property:

item
  itemtype: http://xmlns.com/foaf/0.1/Person
  name: Philip Jägenstedt
  nick: foolip

The itemtype URL of this alternate item is the value of alt-itemtype. The properties are selected by scanning the main item for absolute property URLs that share a namespace with the itemtype URL. A short property name is re-constructed from the full property URL. This requires some knowledge of the vocabulary in the parser: The vocabulary needs to specify that nick in the FOAF namespace is indeed a property that can be used with the Person type.

Two independent items were created, with different itemtypes. A microdata client that was created specifically for the schema.org vocabulary (such as the Google Rich Snippets bot) would only consider the first item and ignore the second. A microdata client that was specifically created to work with the FOAF vocabulary, on the other hand, would only use the second item and ignore the first.
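
The selection step described above can be sketched in Python. This is a minimal illustration, not a parser: the item model (a dict with an "itemtype" and a list of name/value pairs), the function name alternate_item, and the KNOWN_PROPERTIES table standing in for the parser's knowledge of the vocabulary are all assumptions made for this sketch. It also assumes that the alt-itemtype property is visible among the main item's properties.

```python
ALT_ITEMTYPE = "http://n.whatwg.org/alt-itemtype"
FOAF = "http://xmlns.com/foaf/0.1/"

# Hypothetical vocabulary knowledge: which short names are allowed per itemtype.
KNOWN_PROPERTIES = {FOAF + "Person": {"name", "nick"}}


def alternate_item(main_item, namespace=FOAF):
    """Build the alternate item for main_item, or return None.

    An item is modelled as
    {"itemtype": str, "properties": [(name, value), ...]}.
    """
    props = main_item["properties"]
    # The alternate itemtype is the value of the alt-itemtype property.
    alt_type = next((v for n, v in props if n == ALT_ITEMTYPE), None)
    if alt_type is None or not alt_type.startswith(namespace):
        return None
    allowed = KNOWN_PROPERTIES.get(alt_type, set())
    alt_props = []
    for name, value in props:
        # Keep absolute property URLs that share the itemtype's namespace,
        # and re-construct the short name from the full URL.
        if name.startswith(namespace):
            shortname = name[len(namespace):]
            if shortname in allowed:
                alt_props.append((shortname, value))
    return {"itemtype": alt_type, "properties": alt_props}


main = {
    "itemtype": "http://schema.org/Person",
    "properties": [
        (ALT_ITEMTYPE, FOAF + "Person"),
        ("name", "Philip Jägenstedt"),
        (FOAF + "name", "Philip Jägenstedt"),
        (FOAF + "nick", "foolip"),
        ("memberOf", "Opera"),
    ],
}
alt = alternate_item(main)
print(alt["itemtype"])
print(alt["properties"])
```

Run against the snippet from this use case, the sketch yields an item of type http://xmlns.com/foaf/0.1/Person with the properties name and nick, while the schema.org-only properties stay on the main item.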

2.4 Use microdata with RDF tools

The microdata spec, when originally published, contained a section on parsing microdata into RDF. This made microdata a viable alternative to the flawed RDFa syntax for users of RDF, and offered a migration path for organisations that had already cast their lot with RDFa.

This section was removed due to difficulties with making it work in a way that is both generic and produces output that is useful to RDF consumers. The problem is that RDF requires absolute URLs for properties, while current microdata works best if vocabularies define only short names for properties. RDF consumers need a way to produce idiomatic absolute URLs from these short names.

Example:

<div itemscope itemtype="http://schema.org/Person">
<span itemprop="name">Philip Jägenstedt</span>
(foolip)
</div>

This should produce idiomatic RDF:

_:1 a <http://schema.org/Person> .
_:1 <http://schema.org/name> "Philip Jägenstedt" .

3 Simple property URL expansion is not the answer

One proposal suggests itself, but is not a good solution. Microdata vocabulary designers could simply define two forms of each property, a short form such as name, and a full URL such as http://schema.org/name. The vocabulary could state that both are equivalent. This speaks to several of the use cases above, but has two major problems:

  1. It puts undue burden on microdata consumers. Consider a consumer that works with schema.org types. Instead of just checking for the presence of a name property, it now has to check for both name and http://schema.org/name everywhere. In practice, some consumers are likely to cut corners and check for only one case, leading to poor interoperability.
  2. It puts undue burden on vocabulary creators. It essentially asks every vocabulary creator to define each term twice in order to compensate for a weakness in the microdata specification. Surely there must be a way to shoulder at least some of that burden in the microdata specification.

Nevertheless, the basic idea – having absolute URLs that correspond 1:1 to the short property names – is sound. I refine this idea below by adding two twists:

  1. Use absolute URLs only where absolutely necessary: when using a property on an item of a different itemtype; when marking up an alternate item; or in RDF parsed from microdata. This ensures that microdata consumers consistently see the short form.
  2. Vocabulary maintainers only have to indicate adherence to a certain pattern that relates itemtype URLs, absolute property URLs, and short names. This declaration, made by the vocabulary maintainer, indicates that the vocabulary is safe to use in all the use cases presented above.

The following sections describe the rules that different parties – vocabulary maintainers, microdata parser implementers, RDF parser implementers, and HTML markup authors – have to follow to make all of this work. There is also a section on vocabulary whitelists.

4 Rules for vocabulary maintainers

Designers of microdata vocabularies may wish to follow one of the two patterns described in this section to make their vocabulary expansion safe.

Definition: A global property is a microdata property that is an absolute URL. A shortname property is a microdata property that is not an absolute URL and that is defined for use on a particular itemtype.

Definition: A vocabulary is a collection of itemtypes that share a common URL prefix. That prefix is the vocabulary URL. A vocabulary URL is not allowed to be a prefix of another vocabulary URL.

Note: The definition above differs from the language in the HTML spec and is just for the purpose of this document. In HTML, a vocabulary is a specification, and doesn't have a URL. In our view, if one specification defines ten itemtypes, then these could be treated as one vocabulary or as ten distinct vocabularies; it is entirely up to the vocabulary creator.

Definition: A vocabulary is safe for vocabulary expansion if all its itemtypes are safe for vocabulary expansion. An itemtype is safe for vocabulary expansion if all its properties are safe for vocabulary expansion.

Definition: A vocabulary is safe for itemtype expansion if all its itemtypes are safe for itemtype expansion. An itemtype is safe for itemtype expansion if all its properties are safe for itemtype expansion.

Definition: If an itemtype is expansion safe, then each of its shortname properties has an equivalent global property, which is the absolute URL derived as follows:

Condition: A shortname property is only safe for expansion if applying the respective kind of expansion will not result in a clash. A clash is:

Vocabulary designers that follow one of the patterns SHOULD add a statement to the specification that

  1. states the vocabulary URL, and
  2. states that the vocabulary is vocabulary expansion safe or itemtype expansion safe.

If a vocabulary could be safe for both kinds of expansion, then vocabulary creators SHOULD prefer vocabulary expansion. It yields shorter URLs, and yields the same URL when one property is applicable to multiple types.
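
The two kinds of expansion can be sketched in Python as follows. This is an illustration rather than a normative definition: the URL shapes are an assumption inferred from the examples used in this document (http://schema.org/name for vocabulary expansion, http://microformats.org/profile/hcard#nickname for itemtype expansion), and the function names are hypothetical.

```python
def vocabulary_expand(vocabulary_url: str, shortname: str) -> str:
    """Vocabulary expansion (assumed shape): vocabulary URL + shortname."""
    return vocabulary_url + shortname


def itemtype_expand(itemtype_url: str, shortname: str) -> str:
    """Itemtype expansion (assumed shape): itemtype URL + '#' + shortname."""
    return itemtype_url + "#" + shortname


print(vocabulary_expand("http://schema.org/", "name"))
print(itemtype_expand("http://microformats.org/profile/hcard", "nickname"))
```

Under these assumptions, vocabulary expansion maps a property shared by several itemtypes to a single URL, whereas itemtype expansion produces a separate URL for each itemtype that uses the property.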

Example

Schema.org is vocabulary expansion safe.

Let props be the collection of all properties defined for any itemtype whose URL starts with http://schema.org/.

Observation 1: There are no two properties in props that have the same name and different semantics.

Observation 2: For all properties in props, the URL http://schema.org/{propertyname} is either undefined by schema.org, or is an alternate name for the property.

Therefore, the vocabulary http://schema.org/ is vocabulary expansion safe.

Example

@@@ Also consider http://microformats.org/profile/hcard and/or http://n.whatwg.org/work as itemtype expansion safe vocabularies

Example

@@@ Unsafe example: A license property on itemtype http://example.org/work is not itemtype expansion safe if the anchor http://example.org/work#license happens to be a section in the spec that states the license of the spec. That's because we have a URL clash: the URL can't identify both a section in the spec and a microdata property.

Example

@@@ Unsafe example: http://example.org/Person with property title (Dr or Sir), and http://example.org/Document with property title (the document title). This vocabulary is not vocabulary expansion safe because http://example.org/title would be a property clash: it can't identify the two different and incompatible title properties at the same time.
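
A check for the property clash in the last example could look roughly like the Python sketch below. It assumes a hypothetical representation in which each itemtype maps its shortnames to a label that stands in for the property's semantics; real vocabularies do not come with such labels, so deciding whether two properties mean the same thing remains a human judgement. The sketch does not cover the other kind of clash, where the expanded URL already identifies something else, such as a section of the specification.

```python
from collections import defaultdict


def vocabulary_expansion_clashes(itemtypes):
    """Return shortnames that carry different meanings on different itemtypes.

    itemtypes maps an itemtype URL to {shortname: meaning}, where meaning
    is a hypothetical label standing in for the property's semantics.
    """
    meanings = defaultdict(set)
    for properties in itemtypes.values():
        for shortname, meaning in properties.items():
            meanings[shortname].add(meaning)
    # A shortname clashes if vocabulary expansion would give one URL
    # to two or more distinct meanings.
    return sorted(name for name, seen in meanings.items() if len(seen) > 1)


example_org = {
    "http://example.org/Person": {"title": "honorific", "name": "name"},
    "http://example.org/Document": {"title": "document title", "name": "name"},
}
print(vocabulary_expansion_clashes(example_org))  # ['title']
```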

5 On vocabulary whitelists

It would be useful to have a central whitelist that keeps track of vocabularies that follow the patterns. This simplifies the implementation of parsers, because they can use the whitelist to decide how to handle each vocabulary.

The data structure of a whitelist is simply two sets of URLs: one for vocabulary expansion safe vocabularies, and one for itemtype expansion safe vocabularies.
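
As a rough Python sketch, the whitelist and the lookup a parser would perform against it might look like this. The set contents are hypothetical placeholders, not a proposed list, and the function name lookup is made up for the illustration. The prefix test relies on the rule that a vocabulary URL is not allowed to be a prefix of another vocabulary URL, so at most one entry can match.

```python
from typing import Optional, Tuple

# Hypothetical whitelist contents; which vocabularies belong here is up to
# whoever maintains the list.
VOCABULARY_EXPANSION_SAFE = {"http://schema.org/", "http://xmlns.com/foaf/0.1/"}
ITEMTYPE_EXPANSION_SAFE = {"http://microformats.org/profile/hcard"}


def lookup(itemtype_url: str) -> Optional[Tuple[str, str]]:
    """Return (kind of expansion, vocabulary URL), or None if not listed."""
    for kind, urls in (("vocabulary", VOCABULARY_EXPANSION_SAFE),
                       ("itemtype", ITEMTYPE_EXPANSION_SAFE)):
        for vocabulary_url in urls:
            # An itemtype belongs to a vocabulary if its URL equals the
            # vocabulary URL or has it as a prefix.
            if itemtype_url.startswith(vocabulary_url):
                return kind, vocabulary_url
    return None


print(lookup("http://schema.org/Person"))
print(lookup("http://example.org/Thing"))
```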

It's tempting to call this a “registry”, but that would be inaccurate because it isn't responsible for assigning identifiers or managing a namespace. It just documents which vocabularies follow a pattern.

A strict whitelist would only contain vocabularies where the vocabulary's specification explicitly says that it's safe. Less strict whitelists, however, could rely on a user community that examines vocabularies and adds them to the list if they are found safe. Initially, this is more realistic than expecting a large number of vocabulary maintainers to update their vocabularies (even if the update would just be one added sentence).

However, one cannot with certainty assume that a vocabulary will always be expansion safe just because it is now…

6 Rules for microdata parsers

@@@ If an itemtype is known to be safe and used in an alt-itemtype property of an item, then conforming parsers MUST build a new item of that itemtype, and copy all safe properties in the same namespace over in short form, and copy any other full URL properties as well. The new item has the same itemid.

@@@ Add an example that shows it's still ok if used with schema.org extensions, like http://schema.org/Person/President or http://schema.org/name/nick

7 Rules for RDF parsers

If an itemtype is expansion safe (in other words, its itemtype URL is a vocabulary URL known to be expansion safe, or has as its prefix a vocabulary URL that is known to be expansion safe), then an RDF parser MUST generate triples that use the full URL version of any shortname properties on that item.

For items of an itemtype that is not known to be safe, continue to use the ugly microdata-style property URLs just as defined in the last version of the microdata-to-RDF conversion algorithm.
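
The first rule can be sketched in Python for the vocabulary expansion case. This is a simplified illustration under several assumptions: the whitelist content and the function name to_ntriples are hypothetical, the expanded URL is assumed to be the vocabulary URL followed by the shortname, all values are treated as plain literals, and the fallback for itemtypes that are not known to be safe is left out.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical whitelist of vocabulary expansion safe vocabulary URLs.
VOCABULARY_EXPANSION_SAFE = {"http://schema.org/"}


def to_ntriples(subject, itemtype, properties):
    """Return N-Triples lines for one item of a safe itemtype."""
    vocabulary_url = next(
        (u for u in VOCABULARY_EXPANSION_SAFE if itemtype.startswith(u)), None)
    if vocabulary_url is None:
        # The fallback for other itemtypes is outside this sketch.
        raise ValueError("itemtype is not known to be expansion safe")
    lines = [f"{subject} <{RDF_TYPE}> <{itemtype}> ."]
    for name, value in properties:
        # Global properties are already absolute URLs; shortnames are expanded.
        predicate = name if ":" in name else vocabulary_url + name
        # Minimal literal escaping (backslash and double quote only).
        literal = value.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'{subject} <{predicate}> "{literal}" .')
    return lines


for line in to_ntriples("_:1", "http://schema.org/Person",
                        [("name", "Philip Jägenstedt")]):
    print(line)
```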

This rule enables use case 2.4 (Use microdata with RDF tools).

8 Rules for HTML markup authors

This section describes three patterns that HTML markup authors may wish to use when marking up microdata items.

8.1 Augment a vocabulary with a foreign property

If a vocabulary is safe, authors MAY use the global URL form of its properties on any itemtype from any vocabulary (assuming the use of the property makes semantic sense on that itemtype).

This pattern allows higher-fidelity microdata markup when the main itemtype lacks a useful property.

An example is given in use case 2.1 above.

8.2 Use two vocabularies side by side

If a vocabulary is safe, it can be used side by side with another vocabulary on the same itemscope.

Choose the itemtype from one vocabulary as the main itemtype, and the itemtype from the other vocabulary (which must be safe) as the alternate itemtype. Use shortnames for the properties of the main itemtype. Use full URLs for the properties of the alternate itemtype.

You can have multiple alternate itemtypes.

This pattern is appropriate when different data consumers are likely to support different vocabularies.

An example is given in use case 2.3 above.

8.3 Use RDF/OWL vocabularies as microdata vocabularies

Any RDF or OWL vocabulary that has all its terms (classes and properties) under a single namespace URI is vocabulary expansion safe by definition. It can be used with microdata. This is the case for almost all RDF and OWL vocabularies. It does not matter if the vocabulary uses “hash URIs” or “slash URIs”.

Use the short form of its properties.

Note that microdata doesn't support typed literals, so it cannot be easily used with ontologies that require them.

An example is given in use case 2.2 above.

9 Acknowledgements

My thinking on this owes much to discussions with Jeni Tennison, Philip Jägenstedt, Henri Sivonen and Lin Clark. Lin also bribed me into writing this up.

10 Changelog