URIs have a namespace part and a local part, right?

This is a technical post on the way URIs break down into “namespace parts” and “local parts” in RDF. It was prompted by this comment in a recent discussion:

In a URI, the namespace part ends with the last slash or hash, right?

So, the namespace of <http://example.com/foo/bar-123> ends after /foo/, and bar-123 is the local part, right?

The answer is neither yes nor no. The question is based on a wrong assumption.

Let me explain.

Namespaces in RDF are a strange beast

On the one hand, one could say that they are purely a syntactic convenience for shortening repetitive URIs, and carry no meaning of their own.

On the other hand, one could say that they are an integral part of the “user interface” to RDF data. Users, for example those writing SPARQL queries, much prefer to interact with the data in its prefix-abbreviated form and not through the full URIs.

Different parts of the RDF stack embody these different views. For example, the N-Triples syntax doesn’t support namespace abbreviation at all. But the RDF/XML syntax doesn’t work without namespace abbreviation (and furthermore makes it impossible to abbreviate certain valid URIs).

So where does the namespace part end???

It’s sloppy to say that URIs have “namespace parts” and “local parts”. Rather, it would be more accurate to say:

Given a certain prefix mapping, a URI can be broken up into a namespace part and a local part, possibly in different ways.

Consider this prefix mapping, written in Turtle:

@prefix a: <http://example.com/>.
@prefix b: <http://example.com/foo/>.
@prefix c: <http://example.com/foo/bar->.

Now, given this prefix mapping, the URI <http://example.com/foo/bar-123> can be broken up into namespace and local parts in three different ways, yielding three different local names:

a:foo/bar-123
b:bar-123
c:123
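
To make the enumeration concrete, here’s a minimal Python sketch (my own illustration, not from the original discussion) that finds every abbreviation a given prefix mapping permits, using nothing more than string prefix matching; it ignores the escaping and character restrictions covered in the observations below.

prefixes = {
    "a": "http://example.com/",
    "b": "http://example.com/foo/",
    "c": "http://example.com/foo/bar-",
}

def abbreviations(uri, prefixes):
    """Yield every prefix:local-part form that the mapping allows."""
    for prefix, namespace in prefixes.items():
        if uri.startswith(namespace):
            yield f"{prefix}:{uri[len(namespace):]}"

print(list(abbreviations("http://example.com/foo/bar-123", prefixes)))
# ['a:foo/bar-123', 'b:bar-123', 'c:123']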

Now a couple of observations:

  1. There’s nothing inherently special about hashes or slashes in URI patterns, as shown in the third prefix.
    It’s a common convention to define prefix mappings that go up to the last hash or slash (again due to the influence of old RDF/XML where you couldn’t have hashes or slashes in local names), and many tools that automatically create prefix mappings will do this, so you will see b:bar-123 more often than the other forms. But that’s merely a convention, and the other forms may well be more convenient for users sometimes.
  2. The form a:foo/bar-123 actually needs to be escaped as a:foo\/bar-123 if written in Turtle or SPARQL, because unescaped slashes are not allowed in the local part of a prefixed name.
  3. This escape mechanism was only introduced in the W3C Recommendation version of Turtle and in SPARQL 1.1, so may not work in older parsers, and community awareness of this form is regrettably low.
  4. The form c:123 will work in Turtle and SPARQL but not in RDF/XML, because XML requires that element names start with a letter or underscore. So, in RDF/XML, only b:bar-123 works.
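
Here is a rough sketch, in the same illustrative Python style, of the two restrictions just mentioned; the real Turtle and XML grammars are more involved, so treat it as an approximation only.

import re

def usable_in_rdfxml(local):
    # XML element names must start with a letter or underscore, and
    # (per the RDF/XML point above) may not contain slashes or hashes.
    return bool(re.match(r"[A-Za-z_]", local)) and not re.search(r"[/#]", local)

def turtle_escaped(local):
    # Slashes must be escaped in Turtle/SPARQL prefixed-name local parts.
    return local.replace("/", "\\/")

for local in ["foo/bar-123", "bar-123", "123"]:
    print(local, "| Turtle:", turtle_escaped(local), "| RDF/XML ok:", usable_in_rdfxml(local))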

To summarise, URIs don’t simply have a namespace part and local part. Rather, someone defines a prefix mapping, and under that prefix mapping, there may be zero, one or more ways of abbreviating any given URI.

Posted in General, Semantic Web | Comments Off on URIs have a namespace part and a local part, right?

Multiple itemtypes in Microdata

There has been a lot of discussion recently around HTML5’s microdata proposal, and how it relates to W3C’s earlier RDFa standard that is currently being updated for HTML5. Microdata solves many of the use cases of RDFa in a much simpler way. But there are other use cases it cannot solve. This is because microdata assumes a world with very few vocabularies, or even just a single one; mixing vocabularies on a single item is rather difficult. Jeni Tennison has an excellent statement of the problem, along with a proposed solution.

In this post I put forward another proposal for addressing at least part of the problem.

The problem: microdata is limited to a single itemtype per element.

Why is this a problem? Because it makes mixing vocabularies really hard. If I decide to mark up an address with schema.org’s PostalAddress, then I can’t easily add markup for microdata’s built-in vCard vocabulary. I’ll have to repeat content in order to use both vocabularies. This design benefits the Google-backed schema.org; more focused special-purpose vocabularies, or open alternatives to schema.org with a transparent development process, will be at a disadvantage.

An example. So let’s assume I have this address and want to mark it up with microdata:

<div>
    <span>26 Dun Aengus</span>,
    <span>Galway</span>,
    <span>Ireland</span>.
</div>

Then here’s how I would do it with schema.org terms:

<div itemscope itemtype="http://schema.org/PostalAddress">
    <span itemprop="streetAddress">26 Dun Aengus</span>,
    <span itemprop="addressLocality">Galway</span>,
    <span itemprop="addressCountry">Ireland</span>.
</div>

And here with vCard terms:

<div itemscope itemtype="http://microformats.org/profile/hcard">
    <span itemprop="street-address">26 Dun Aengus</span>,
    <span itemprop="locality">Galway</span>,
    <span itemprop="country-name">Ireland</span>.
</div>

It is clear why combining both versions into a single one is difficult. Microdata uses short property names like itemprop="street-address". If an element had multiple itemtypes, then it would be impossible to tell which itemtype the street-address property belongs to. Assuming that it belongs to both types would be dangerous; there could be cases where a property exists in both vocabularies but with different meaning. The restriction to a single type prevents such ambiguity.

Multiple itemtypes without ambiguity: Here’s the proposal. I’ll start by creating an item that has all the properties from both versions—I’m omitting the itemtypes for now to avoid ambiguity:

<div itemscope>
    <span itemprop="streetAddress street-address">26 Dun Aengus</span>,
    <span itemprop="addressLocality locality">Galway</span>,
    <span itemprop="addressCountry country-name">Ireland</span>.
</div>

Without itemtype, this generates an untyped item with six properties:

  • itemtype: none
  • property: streetAddress = 26 Dun Aengus
  • property: street-address = 26 Dun Aengus
  • property: addressLocality = Galway
  • property: locality = Galway
  • property: addressCountry = Ireland
  • property: country-name = Ireland

The altitem property. Microdata would get a new built-in property, called altitem. Let’s add an additional element with this property into the untyped item:

<meta itemprop="altitem"
      content="http://schema.org/PostalAddress streetAddress
               addressLocality addressCountry">

What’s going on here? The idea is that altitem takes a whitespace-separated list. When added to an item, it creates a new “alternate item” whose itemtype is the first element of the list. Then it looks at the rest of the list, which should be property short-names. It copies any of these named properties from the original item to the new item. So, we’d end up with a second item besides the type-less original item. This second item has:

  • itemtype: http://schema.org/PostalAddress
  • property: streetAddress = 26 Dun Aengus
  • property: addressLocality = Galway
  • property: addressCountry = Ireland

Which is exactly the same as the original schema.org item from above. Creating the vCard item is just a matter of adding another altitem property:

<meta itemprop="altitem"
      content="http://microformats.org/profile/hcard street-address
                locality country-name">

This gives us:

  • itemtype: http://microformats.org/profile/hcard
  • property: street-address = 26 Dun Aengus
  • property: locality = Galway
  • property: country-name = Ireland

So now we’d have three items in total: the original untyped item, and the two typed alternate items.
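
For clarity, here is a minimal Python sketch of how a processor might expand altitem declarations; this is just my reading of the proposal, with made-up data structures, not a specification.

# Hypothetical processing model: an item is a dict with a type and an
# ordered list of (property, value) pairs, as in the address example above.
def expand_altitems(item):
    """Create one alternate item per altitem property on the original item."""
    alternates = []
    for name, value in item["properties"]:
        if name != "altitem":
            continue
        tokens = value.split()                 # whitespace-separated list
        itemtype, wanted = tokens[0], set(tokens[1:])
        alternates.append({
            "type": itemtype,
            # copy only the named properties from the original item
            "properties": [(n, v) for n, v in item["properties"] if n in wanted],
        })
    return alternates

address = {
    "type": None,
    "properties": [
        ("streetAddress", "26 Dun Aengus"), ("street-address", "26 Dun Aengus"),
        ("addressLocality", "Galway"), ("locality", "Galway"),
        ("addressCountry", "Ireland"), ("country-name", "Ireland"),
        ("altitem", "http://schema.org/PostalAddress streetAddress addressLocality addressCountry"),
        ("altitem", "http://microformats.org/profile/hcard street-address locality country-name"),
    ],
}

for alt in expand_altitems(address):
    print(alt["type"], alt["properties"])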

What’s nice about this:

  • It doesn’t require any new syntax, just a new property.
  • Multiple types generate multiple items, which are visible in the microdata API just like normal items.
  • It plays well with itemref, so the altitem declaration doesn’t have to be repeated if I have several postal addresses on the page.
  • It plays well with a copy-and-paste style of web development. “If you want to use myVocab together with another vocab, just paste this snippet into your item and add the appropriate itemprops…”

Issues. Quite a few details would still have to be worked out:

  • What happens to properties with full URL names? I guess they should always be copied to all items.
  • What happens to itemid? I guess all items should receive the same itemid from the original item.
  • In microdata, itemtype is inherited by nested sub-items. I’m not sure how this should work if altitem is present.
  • Properties within a microdata item are ordered; there’s a question whether the order in altitem or in the original item should take precedence when alternate items are generated.
  • Would it be worth having a dedicated microdata attribute for this?
  • Would microdata clients actually implement this? There is a risk that too many implementers would take shortcuts and just implement the basic case and ignore altitem.

Summary: This post shows how multiple itemtypes could be supported in microdata without introducing new syntax, without making the common case of a single vocabulary more complex for authors, and without fundamentally changing the data model.

Posted in General | 4 Comments

The RDF 1.1 Literal Quiz

Let’s pretend we live in January 2013, and RDF 1.1 has just been published, including the RDF Working Group’s attempt to clean up string literals. The issue with string literals is that RDF currently offers three different ways of doing something as simple as writing down a string:

  1. "foo",
  2. "foo"^^xsd:string,
  3. and the rather weird "foo@"^^rdf:PlainLiteral.

The working group is trying to fix this. Now here’s a quiz with some RDF trivia questions. What are the answers that you’d like to see? (“Don’t care” is a fine answer too.)

Q1. Does this RDF graph (written in Turtle) have one triple?

<a> <b> 1 .
<a> <b> "1"^^xsd:integer .

Q2. Does this RDF graph (written in Turtle) have one triple?

<a> <c> "foo" .
<a> <c> "foo"^^xsd:string .

Q3. Is this a valid Turtle file?

<a> <b> "foo@"^^rdf:PlainLiteral .

Q4. Is a parser allowed to unify "foo" and "foo"^^xsd:string into a single form while parsing?

Q5. Is this a valid N-Triples file?

<http://example.com/a> <http://example.com/b> "foo" .

Q6. Is this a valid N-Triples file?

<http://example.com/a> <http://example.com/b> "foo@"^^rdf:PlainLiteral .

Q7. Is this a valid N-Triples file?

<http://example.com/a> <http://example.com/b> "foo"@en .

Q8. Is this a valid N-Triples file?

<http://example.com/a> <http://example.com/b> "foo"^^xsd:string .

Q9. Is this true in SPARQL?

datatype("foo") = xsd:string

Q10. Is this true in SPARQL?

datatype("foo") = error

Q11. Is this true in SPARQL?

datatype("foo") = rdf:PlainLiteral

Q12. Is this true in SPARQL?

datatype("foo"@en) = xsd:string

Q13. Is this true in SPARQL?

datatype("foo"@en) = error

Q14. Is this true in SPARQL?

datatype("foo"@en) = rdf:PlainLiteral

Q15. Is this true in SPARQL?

datatype("foo"@en) = rdflang:en

Q16. Does the literal in this RDF/XML fragment have a language tag?

<rdf:Description rdf:about="a" xml:lang="en">
  <b>foo</b>
</rdf:Description>

Q17. Does the literal in this RDF/XML fragment have a language tag?

<rdf:Description rdf:about="a" xml:lang="en">
  <b rdf:datatype="&xsd;string">foo</b>
</rdf:Description>

For each of the following pairs of statements, if the statement on the left is true, then is the statement on the right true as well in a system that supports datatype inference (specifically, {xsd:string}-Entailment)?

Q18. { <a> <b> "foo" . } => { <a> <b> "foo"^^xsd:string . }

Q19. { <a> <b> "foo"^^xsd:string . } => { <a> <b> "foo" . }

Q20. { <a> <b> "foo" . } => { <a> <b> "foo"@en . }

Q21. { <a> <b> "foo"@en . } => { <a> <b> "foo" . }

Q22. { <a> <b> "foo"@en . } => { <a> <b> "foo"@en-GB . }

Q23. { <a> <b> "foo"@en-GB . } => { <a> <b> "foo"@en . }

Q24. { <a> <b> "foo"@fr . } => { <a> <b> "foo"@en . }

Leave your answer in the comments! We’ll check in January 2013 to see who came closest. The winner gets, uhhm, a copy of the RDF Concepts and Abstract Syntax spec signed by the editors …

Posted in Semantic Web | 1 Comment

Creating an RDF vocabulary: Lessons learned

With tools like Neologism and OpenVocab, creating an RDF vocabulary is easy. But if your goal is re-use within a wider community, you will face many questions that are not so easy to answer:

  • How much work is it going to be and what timeframe is realistic?
  • How broad and how deeply should you cover the domain? Where to stop?
  • Work alone or seek collaborators?
  • Should you start by setting up a mailing list, or by producing a first draft?
  • How much documentation do you need to produce?
  • Whose feature requests and modeling ideas should you heed, and whose should you ignore?
  • How to keep pushing towards the uncertain goal of “adoption” in the face of limited time?

A few days ago, the VoID vocabulary became a W3C SWIG Note. VoID started in 2008 as a loose collaboration between Jun Zhao, Keith Alexander, Michael Hausenblas and me. We published a first non-W3C version in 2009. The W3C publication is a nice milestone for us, and I thought this a good opportunity to share some of the lessons I have learned along the way.

I will focus on process and collaboration in this post, and say little about modeling practices or publishing tools or RDFS/OWL geekery.

Lesson #1: Work in a team. Three or four people, each with their own use cases or data, might be ideal. It ensures that a variety of use cases are covered; fluctuations in available time don’t stall the project; it softens the imprint of any one contributor’s personal style on the modeling and design; it increases the network available for reaching out to potential users. Having a team of a few motivated people is perhaps the most important factor for success.

Lesson #2: Take your time. For all of us, VoID was a low-priority “background task”. We all get paid for doing other things. Inevitably, progress was often slow, with months where literally nothing happened. I probably averaged less than an hour of VoID work per week (with occasional major bursts of activity).

And that might be the best way. Progress in vocabulary design is not measured by how quickly one produces a polished spec. Progress is learning about the needs of the potential user community. Going slow means more opportunity for feedback at every stage, and reduces the risk of creating something that nobody needs.

We also moved the vocabulary to a different host twice in the process. This worked out ok because we could retain the original namespace URI throughout the moves, but it definitely shows the advantage of going with something like purl.org from the start.

Lesson #3: Use a public issue tracker. This is crucial, even if you work alone. It adds structure to the work process and helps to ensure that no balls get dropped. Some issues will remain unresolved for long periods of time, and you need a place for collecting the random comments, discussions, related links, proposed text for changes and so on.

I think it’s important to use a tracker that is easy to work with, ideally one that the contributors are already familiar with. We used the one from Google Code. It’s simple and just works.

Setting up a Google Code project for developing the vocabulary worked very well for us. Besides the tracker, we also used the SVN repository for the spec, and the simple wiki for random bits of information, like lists of deployments, and examples that didn’t fit into the spec.

Don’t try to use a wiki or Google Doc or other funky collaboration device in place of an issue tracker. I’ve seen that done elsewhere and it doesn’t work.

Lesson #4: Perfection can wait till the next version. This sounds banal, but is so important. At some point quite a while ago, we were all quite fed up and just wanted to get something out of the door. So we decided not to tackle a lot of difficult open issues. We told ourselves that we would just do them in a second version. This turned out to be immensely liberating.

After version 1, we took a long break, and then started to work on version 2. Now we knew that deferring to the next version is always an option (which we used liberally). Not really clear if that use case is worth the effort? Defer. Not enough evidence or experience to inform the design? Defer. Two pig-headed contributors (that is, Keith and me) can’t agree on a design? Defer.

Lesson #5: Regular Skype calls. This one might be controversial, because no one likes wasting time in weekly conference calls. But I think it worked well for us. We didn’t quite do weekly calls, but scheduled them ad hoc, averaging perhaps one every two weeks. Often, the only progress between calls was that one of us felt a bit of shame and quickly did one or two of their actions in the thirty minutes before the call. This adds up over the months and makes sure that there is slow but steady progress.

We took turns chairing and scribing. The chair would take us through the agenda (typically “review open actions; review issues list; discuss particularly thorny issue XYZ; AOB; schedule next call”) and interrupt any discussion that started to go circular. The scribe would note whenever someone took an action to do something, and afterwards email a list of those and the date for the next call. A good call duration is somewhere between 60 and 90 minutes.

Lesson #6: Have a working draft of the spec from day one. Even if it’s just a few scribbles. Call them your working draft and take it from there. Then get into the habit of focussing any discussion on the question: What change should be made to the text? Arguing about words that should go into the text is much more productive than the alternative, which is arguing who is right or wrong. Ideally, whenever people start to disagree, they should draft up competing change proposals to be discussed in the next call.

Besides the spec text in SVN, we used Neologism to create and publish the actual RDFS vocabulary specification.

Lesson #7: Public mailing list is optional. Don’t you hate signing up to yet another mailing list? Me too. We started with a private mailing list, and found that its only real use was for notifications from the issue tracker. Discussion happened on Skype or in the tracker. We put external comments into the tracker too and discussed them there. This worked well.

This is about the creation phase of the vocabulary. It might be a different story once you get a bit of a user community going. We now have a public discussion list.

Lesson #8: Start over a beer and a large piece of paper. If you can. With everyone physically in the same room. That’s how we did it anyway, at a conference, and it was quite helpful for figuring out a core part of the vocabulary that seemed uncontroversial. Most of that time was spent arguing about—I’m sure this will come as no surprise to you—a name for the project.

Posted in General | 4 Comments

Blank nodes considered harmful

Well, they are not always harmful. But most of the time. I’ll get to that in a minute.

On the semantic-web@w3.org list, W3C’s Sandro Hawke has a lucid and concise summary of the problems with blank nodes in RDF. It’s worth quoting in full:

I agree that *software* should not change blank nodes to nodes with a
URI label. But, when practical, *people* probably should, as they are
authoring.

In general, blank nodes are a convenience for the content provider and a
burden on the content consumer. Higher quality data feeds use fewer
blank nodes, or none. Instead, they have a clear concept of identity
and service for every entity in their data.

If someone in the middle tries to convert (Skolemize) blank nodes, it’s
a large burden on them. Specifically, they should provide web service
for those new URIs, and if they get updated data from their sources,
they’re going to have a very hard [perhaps impossible] time
understanding what really changed.

Does this mean blank nodes are evil? Not always. Sometimes they are tolerable, sometimes they are a necessary last resort, and sometimes they are good enough. But they are never good.

  • They are fine for transient data that’s not meant to be stored.
  • They can be the only viable option if a changeable upstream data source doesn’t provide identifiers that persist across requests/updates.
  • They can be tolerable for unimportant auxiliary resources that don’t correspond to a meaningful entity in the domain of interest (e.g., some n-ary relations) and are not worth the hassle of maintaining a stable URI.

In all other cases, blank nodes should be avoided. Sandro is right: publishing RDF with blank nodes puts a burden on the consumer. Especially if the data might change in the future.
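
To make the consumer’s burden concrete, here is a small Python sketch using rdflib (my choice of library, not anything from the mailing list thread): the same blank-node data, published twice, looks entirely different to a naive diff because blank node labels carry no identity across documents.

from rdflib import Graph

# Two byte-identical publications of the same data, using a blank node.
SNAPSHOT = """
@prefix ex: <http://example.com/> .
ex:alice ex:address [ ex:city "Galway" ] .
"""

g1 = Graph().parse(data=SNAPSHOT, format="turtle")
g2 = Graph().parse(data=SNAPSHOT, format="turtle")

# A naive triple-set diff reports the blank-node triples as both added and
# removed, even though nothing changed at all.
print("added:  ", set(g2) - set(g1))
print("removed:", set(g1) - set(g2))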

The higher the percentage of blank nodes in a dataset, the less useful it is.

Posted in General | Comments Off on Blank nodes considered harmful

Top 100 most popular RDF namespace prefixes

I run prefix.cc, a website for RDF developers where anyone can register and look up the expansion URIs for namespace prefixes such as foaf, dc, qb or void. The site tracks which prefixes get looked up most often. This allows some insight into the popularity of RDF vocabularies and datasets.

This post is a snapshot of the top 100 most requested prefixes as of today.

Caveats:

  1. The counts reflect what knowledgeable RDF hackers are interested in. This may or may not reflect the interests of more casual users, or what’s deployed on the web. The og prefix for Facebook’s Open Graph protocol, for example, is outside the list, at #273.
  2. “Users” of the site include automated apps and web crawlers. This distorts the numbers. For example, the prefix.cc homepage links to prefix.cc/foaf, driving crawlers and first-time visitors that way and inflating foaf’s numbers.
  3. Here I deliberately do not include the full URI expansions for those prefixes. Prefix.cc allows multiple competing expansions for a prefix. Users can then vote to determine what’s shown first. Voting can be subject to gaming, ballot stuffing, and so on. There are strong disagreements over the “best” expansion for some prefixes, starting right at #2 with dc, which is one of the most controversial prefixes on the site. (If you need expansions, then you can get a fresh set from the API.)
  4. Prefix.cc doesn’t allow registration of single-letter prefixes, and imposes some other syntactic restrictions. Some vocabularies suggest single-letter prefixes, most notably Google’s rdf.data-vocabulary.org, which is commonly abbreviated “v”. (Someone has registered dv for it, but that rarely gets looked up.)

That being said: The data is below, and a CSV version is available too.

Rank Prefix Lookups
1 foaf 45506
2 dc 17621
3 rdf 17585
4 rdfs 14865
5 owl 11898
6 geonames 9349
7 geo 4757
8 skos 4501
9 dbp 3396
10 swrc 2439
11 sioc 2336
12 xsd 2310
13 dbo 2089
14 dc11 2006
15 doap 1856
16 dbpprop 1697
17 content 1621
18 wot 1598
19 rss 1474
20 gen 1403
21 dbpedia 1377
22 d2rq 1370
23 nie 1352
24 xhtml 1336
25 test2 1305
26 gr 1301
27 dcterms 1255
28 org 1157
29 vcard 1154
30 akt 1150
31 dct 1118
32 ex 1104
33 fb 995
34 owlim 993
35 cfp 978
36 xf 960
37 sism 956
38 earl 948
39 bio 941
40 reco 936
41 xfn 926
42 media 925
43 air 921
44 dcmit 920
45 void 917
46 fn 915
47 afn 910
48 cc 906
49 cld 900
50 vann 898
51 days 895
52 ical 893
53 http 893
54 mu 888
55 sd 874
56 osag 874
57 botany 859
58 cal 858
59 musim 850
60 factbook 848
61 cs 845
62 log 838
63 rev 837
64 swande 836
65 bibo 834
66 dcq 834
67 cv 832
68 ome 830
69 biblio 830
70 dir 828
71 giving 827
72 memo 827
73 ok 826
74 rel 821
75 event 818
76 ir 818
77 aiiso 816
78 ad 813
79 dbr 813
80 co 812
81 af 809
82 cmp 806
83 bill 805
84 rif 804
85 xs 804
86 math 803
87 rdfg 803
88 daia 801
89 swc 800
90 tag 800
91 swanq 799
92 xhv 796
93 book 795
94 jdbc 793
95 myspace 792
96 tzont 792
97 sr 790
98 ctag 789
99 dcn 787
100 lomvoc 786
Posted in General | 2 Comments

Maintenance

This weblog has become quiet. These days, most of my word count goes into mailing lists, Twitter, and way too much personal email. Over here on the blog, cobwebs are gathering and some signs of bitrot have become evident.

So I’ve done some maintenance. I upgraded the software to WordPress 3.0.1, and changed to a new theme. I also decided to retire the blog’s name, dowhatimean.net, and instead move it to richard.cyganiak.de/blog.

This new location was previously home to a German-language blog I wrote back in 2004 and 2005. I imported the old posts into the site, so don’t be surprised if you encounter some German in the depths of the archive.

No URLs were broken in the making of this post. I hope.

Posted in General | Comments Off on Maintenance

prefix.cc, MkII

prefix.cc is a website I made last February to ease a very common task in the life of RDF developers and SPARQL users: looking up namespace URIs. A short summary of what the site can do for you is available here.

The site was developed during a few weekends, and I haven’t touched the code since I first deployed it. Today I’m publishing the first serious update to the site. This post describes what’s new.

Reverse lookup. One of the most requested features is reverse lookup. You can now enter a URI of an RDF term into the query box on the start page, and the site will respond with the best prefix for contracting that URI into a QName. This functionality is also available as an API.

Negative votes. The site has received a moderate amount of spam, mostly from pranksters who think it would be funny to propose their own homepage as a better expansion for the foaf prefix. I’ve mostly cleaned this up manually, but I think it would be better to equip the user community with tools to handle this.

The site has always had a voting mechanism, which I intended as a tiebreaker in cases where people have submitted different URIs for the same prefix, for example in the case of the dc prefix. Starting today, you can submit both positive and negative votes. If a URI receives a certain number of negative votes, it will no longer be shown.

New export formats. One of my favourite features is the ability to directly get output in various machine-readable syntaxes by composing an appropriate URI, such as http://prefix.cc/foaf.file.n3, which produces a declaration of the FOAF prefix in N3 format. I find this handy for copy-pasting into a text editor, but also for automating things.

A few formats have been added: vann produces an RDF/XML version of the namespace mapping in the VANN vocabulary (example). xmlns produces raw XML prefix declarations (example). go redirects to the namespace URI, so you can type http://prefix.cc/foaf.go into your browser bar as a shortcut for opening the FOAF specification. I’ve also added a table of all supported formats.
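
As an illustration of the automation angle, here is a minimal Python sketch that fetches one of these formats; it assumes nothing beyond the URL pattern shown above (prefix.cc/foaf.file.n3) and the standard library.

import urllib.request

def prefix_declaration(prefix):
    """Fetch the N3 namespace declaration for a prefix from prefix.cc."""
    url = f"http://prefix.cc/{prefix}.file.n3"
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")

print(prefix_declaration("foaf"))
# something like: @prefix foaf: <http://xmlns.com/foaf/0.1/> .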

A side effect of the introduction of VANN support is that there is now a single VANN representation of all mappings known to the site.

Tweaks and fixes. Regular users will note a number of further small changes and bugfixes throughout the site. One notable fix is to the way namespace lookups are calculated for the list of popular prefixes. Ironically, most of the lookups actually came from web crawlers that followed the links in the list itself, making the list self-perpetuating. Also, the list featured the non-existent robots prefix, because many crawlers look for http://prefix.cc/robots.txt. These issues should now be fixed.

Internal changes. The site is developed in PHP, and started out as a quick weekend hack, so the initial code was a horrible mess that was hardly maintainable. I spent quite some time cleaning this up and refactoring the code into a much nicer structure that should be able to grow along with some of the additional features I’ve planned for the future. The codebase now totals some 1600 lines of PHP, CSS and Javascript.

Hidden goodies: RDFa markup and feed of latest additions. Finally, I want to highlight some features that have existed all along, but are easily missed: First, many pages contain RDFa markup, so if you want to re-use any prefix.cc data in your own site or application, you most likely can. Second, there is an RSS feed of the latest additions to the prefix database, and it is a neat way of learning about new vocabularies and ontologies that show up around the Web.

Bugs, comments, suggestions? Any feedback is appreciated. I did a lot of refactoring without a test harness, so it’s quite likely that a few new bugs have crept in. If you notice anything, please let me know. Also, if there is anything that you would like to see in prefix.cc Mk III, please share!

Posted in General, Semantic Web | Comments Off on prefix.cc, MkII

What’s in a name? And the Linked Data Police

So I wrote a rather angry private email to Erik Wilde a few days ago, complaining about his use of the term “linked data” for a site that doesn’t follow the linked data practices. Erik decided to publish my email on his blog, along with a long defense of his use of the term, in a post called “The Linked Data™ Police”. Since it’s in public now, we can just as well see if we can get a useful discussion out of this.

First, I realize that Erik probably responded more to the tone of my email than to the content. It was an angry rant, and the tone missed the mark, so his response is fair enough, and I have apologised to him. I also have to say that I’m speaking only for myself and nobody else—before others become scared of the “Linked Data™ Police”, I can assure them that to the best of my knowledge, it has a staff of one, and my peers in the community are in general a friendly and civil lot.

The site in question. So, what is this about? Erik and team have built recovery.berkeley.edu, a site that publishes structured data about Recovery Act spending. The site was built with a grant from the Sunlight Foundation. The technologies of choice are Atom and various other XML formats. As far as I can see, it’s excellent in its adherence to REST principles, including good URI design and “hypermedia as the engine of application state”. These two together are labelled as “linked data” in the site’s technical documentation.

This is a discussion about names, and not about substance. At the very core is the following question: Should we understand “linked data” to mean “the idea of somehow connecting pieces of data with links”, or should we take it to mean “RDF published according to the rules outlined by Tim Berners-Lee in the design note that coined the term”?

Obviously, I’m of the latter opinion. In this post, I want to do two things: First, I want to respond to some specific points from Erik’s post. I will do this by paraphrasing each point, and then responding to it. Second, I want to explain why I care about the matter and why I think that “linked data” should continue to be associated with Tim’s rules, and why advocates of different sets of rules should use different terms.

Erik: “The attitude is scary: Instead of figuring out the most effective way of adding more semantics to the web, it starts with a set of technologies and claims that whatever you want to do, you have to use those.” Linked data didn’t start with a set of technologies; a lot of deliberation by a lot of people went into the choice. Also, I have no quarrel with Erik’s choice of technologies, and I didn’t even suggest that he should or shouldn’t use any technology. There can be good reasons against using RDF, and it’s a good thing that innovation continues in other areas of web data technology.

But if Erik and colleagues don’t buy into the set of technology choices commonly called “linked data”, then why would they insist on using that name? What’s wrong with the established technical terms, REST and Resource-oriented Architecture? Is it just because those are already way past the peak of the hype cycle?

Erik: “Using generic problem names to refer to specific technologies only confuses people.” Erik mentions Linked Data, XML Schema, Semantic Web and Web Services as specific technologies that he considers to be badly labelled. But the peculiar thing about those four is not their names; they are controversial for other reasons. The IT world is full of specific technologies that use generic names: World Wide Web. Structured Query Language. Extensible Markup Language. Hypertext Transfer Protocol. Portable Document Format. Resource-Oriented Architecture. Scalable Vector Graphics. Open Document Format. Erik may not like it, but it’s a common practice.

Erik: “Choosing such names is usually an attempt to make competition harder.” Usually? I doubt that. Technologies are named in their very infancy, when their future and success is far from certain, and when competition is usually not an issue. The naming is usually an attempt to communicate as clearly as possible what the proposed technology is supposed to achieve, which is not a bad thing at all. Some fail at achieving the goal, but everyone designs (and names) assuming that it can be eventually achieved.

Erik: “RDF is just a stylesheet away.” Erik points out that it would be trivial to create a GRDDL transform that translates from the service’s output to RDF. Personally I wouldn’t call it trivial, and being just one transformation away from being compatible is not the same as being compatible. If there were GRDDL transforms in place, I would have no reason at all to complain, although just a few linked data clients support GRDDL at this time.

Back to the roots. So where did the term “linked data” come from? To the best of my knowledge, Tim Berners-Lee coined it in his 2006 Design Note that is titled “Linked Data”. The document introduced the four rules that are now known as the “Linked Data Principles.” Erik’s service is following all of them except the one that demands RDF or SPARQL.

It’s worth pointing out that the four rules did not mention RDF when Tim originally published them, but it is clear from the rest of the document that the use of RDF was implied. The document was aimed at the semantic web community. His later change was a clarification, not a change of intention.

I don’t know why Tim wrote this piece back in 2006, but my interpretation was that he wanted more people to publish data that can be browsed with his Tabulator RDF browser, and most RDF out there at that time couldn’t be browsed because of problems with one of the four rules. So I read it as a call for better interoperability among RDF publishers.

Broadening the term? There have been a number of calls for broadening the meaning of the term, most eloquently from Paul Miller, so Erik is certainly not alone in his view. Their intention is to get linked data into the mainstream more quickly, which is a goal that I share.

The problem is that broadening a term makes it less meaningful. There is a danger that the term gets extended to the point where it’s as meaningless as other buzzwords such as Web 3.0 or the venerable Semantic Web. If you can use other formats instead of RDF, then why not also use SOAP instead of HTTP? Why not do away with the URIs? Why not YQL instead of SPARQL? Where does “linked data” stop? Everything is somehow “data” and somehow “linked.”

Interoperability requires choices to be made. In my eyes, the great thing about the term “linked data” is that it has a reasonably precise technical definition, rooted in Tim’s Design Note and the early work of the Linking Open Data project. That work has turned the Semantic Web’s compelling but vague promises of a side-by-side “web for humans” and “web for machines” into concrete guidelines that people can actually implement, and the result is an ecosystem of interoperable tools, clients and datasets that continues to grow around these guidelines.

These guidelines will continue to evolve with the emergence of new technologies (e.g., RDFa) and increasing experience and maturity (e.g., importance of licensing and provenance handling).

But at the core, it has to be about a set of concrete technology choices and deployment practices that foster an interoperable ecosystem of data sources and clients. “Linked data” is the best name we have for that particular set of technology choices and practices. There is nothing magic about the name “linked data”; to the best of my knowledge it didn’t exist at all in the web community before 2006. The term has gained popularity because it has associated rules that tell you how to do it, not because of the words themselves. Without the rules, the term would be meaningless fluff. Everything is somehow “linked” and somehow “data.”

If you think that a different set of rules would work better (which is entirely possible), then it would be prudent to write them down, coin a new term for them, and start the legwork of advertising them, just as Tim did since 2006.

Posted in General, Semantic Web | 7 Comments

Linked data at the New York Times: Exciting, but buggy

Update: Evan Sandhaus reports that all the issues mentioned below will be fixed. Great!

Yesterday at the International Semantic Web Conference, Evan Sandhaus of the New York Times unveiled data.nytimes.com, a site that publishes linked data for some parts of the Times’ index. To me, this was one of the most exciting announcements at the conference, and it caused quite a tweetstorm during and after Evan’s talk.

A bit of background: Every article published in the newspaper or on the website is tagged, classified and categorized in many ways by skilled editors. This metadata allows the creation of topic pages that automatically collect relevant articles for notable people, organisations, and events. Examples include Michelle Obama, Swine Flu (H1N1 Virus) and Wrestling.

What’s in the data? The dataset published yesterday contains information on each of the concepts that have a topic page. For now, it is limited to topic pages about people. The concepts are modelled in SKOS. The information attached to each concept consists mostly of links: to DBpedia, to Freebase, into the Times API (which is not available as RDF at this point), and of course to the corresponding topic page. This means that if you have a DBpedia URI for an especially notable entity, a high-quality New York Times topic page with the latest news about the topic is only two RDF links away. A notable feature of the links is that every single one has been manually reviewed, making this perhaps the highest-quality linkset in the LOD cloud.

How to get the data? This being linked data, every concept has a dereferenceable URI. Examples:

The site’s URI scheme follows one of the Cool URIs recipes: this identifier is resolvable, and by using content negotiation, web browsers are redirected to

http://data.nytimes.com/N13941567618952269073.html

which has a nicely formatted summary of the data available about Michelle Obama. Data browsers and other RDF-enabled clients, on the other hand, are redirected to

http://data.nytimes.com/N13941567618952269073.rdf

which has all the data goodness in RDF/XML.

There is also a dump: people.rdf. You can browse the data starting from the data.nytimes.com page. Everything is available under a CC-BY license.

Bugs and problems

This being a new dataset and the Times’ first foray into linked data, it turns out that the Beta label on the site is quite warranted. I will highlight four issues.

Data and metadata are mixed together. Let’s look at the data about Michelle Obama, available at the N13941567618952269073.rdf URI above. I’m reformatting the data into Turtle for legibility.

<http://data.nytimes.com/N13941567618952269073>
    a skos:Concept;
    skos:prefLabel "Obama, Michelle";
    skos:definition "Michelle Obama is the first …";
    skos:inScheme nyt:nytd_per;
    nyt:topicPage <http://topics.nytimes.com/top/reference/timestopics/people/o/michelle_obama/index.html>;
    owl:sameAs <http://rdf.freebase.com/rdf/en.michelle_obama>;
    owl:sameAs <http://data.nytimes.com/obama_michelle_per>;
    owl:sameAs <http://dbpedia.org/resource/Michelle_Obama>;

This makes perfect sense, it’s data about a person, modelled as a SKOS concept. But then it goes on:

<http://data.nytimes.com/N13941567618952269073>
    dc:creator "The New York Times Company";
    time:start "2007-05-18"^^xsd:date;
    time:end "2009-10-08"^^xsd:date;
    dcterms:rightsHolder "The New York Times Company"^^xsd:string;
    cc:license "http://creativecommons.org/licenses/by/3.0/us/";
    .

This is not data about Michelle Obama the person, it’s metadata about the data published by the NYT. It’s certainly not true that Michelle Obama was created by the New York Times, or that she “started” in 2007 (whatever that’s supposed to mean), and don’t even get me started on asserting rights or a license over a person.

Note that the NYT team actually went through the effort of setting up separate URIs for Michelle the person (http://data.nytimes.com/N13941567618952269073), and for the HTML and RDF documents describing the concepts (http://data.nytimes.com/N13941567618952269073.html and http://data.nytimes.com/N13941567618952269073.rdf). The reason why linked data experts advocate this practice of having separate URIs is exactly because it enables separation of data and metadata: It lets you state some facts about the concepts, and other things about the documents that describe the concepts. This is what should be done in the data above: The metadata should not be asserted about the URI identifying Michelle, but about the URI identifying the document published by the NYT: N13941567618952269073.rdf. So we would get:

<http://data.nytimes.com/N13941567618952269073>
    a skos:Concept;
    skos:prefLabel "Obama, Michelle";
    skos:definition "Michelle Obama is the first …";
    skos:inScheme nyt:nytd_per;
    nyt:topicPage <http://topics.nytimes.com/top/reference/timestopics/people/o/michelle_obama/index.html>;
    owl:sameAs <http://rdf.freebase.com/rdf/en.michelle_obama>;
    owl:sameAs <http://data.nytimes.com/obama_michelle_per>;
    owl:sameAs <http://dbpedia.org/resource/Michelle_Obama>;
    .

<http://data.nytimes.com/N13941567618952269073.rdf>
    dc:creator "The New York Times Company";
    time:start "2007-05-18"^^xsd:date;
    time:end "2009-10-08"^^xsd:date;
    dcterms:rightsHolder "The New York Times Company"^^xsd:string;
    cc:license "http://creativecommons.org/licenses/by/3.0/us/";
    .

Eric Hellman has a post about this issue, calling it “a potential legal disaster” because a license is attached to a resource that’s said to be the same as a resource on a different site (DBpedia and Freebase). He’s a bit alarmist, but this example highlights why the separation of data and metadata, of concept URIs and document URIs, is critically important in a general-purpose data model.

Distinguishing URIs and literals. Here are some selected snippets from the RDF/XML output:

    <nyt:topicPage>http://topics.nytimes.com/top/reference/timestopics/people/o/michelle_obama/index.html</nyt:topicPage>
    <cc:License>http://creativecommons.org/licenses/by/3.0/us/</cc:License>
    <cc:Attribution>http://data.nytimes.com/N13941567618952269073</cc:Attribution>

The values of all three properties are URIs. In the RDF data model, URIs are of such central importance that they are treated differently from any other kind of value (strings, integers, dates). But not so in the code example above. There, the three URIs are encoded as simple strings. This should be:

    <nyt:topicPage rdf:resource="http://topics.nytimes.com/top/reference/timestopics/people/o/michelle_obama/index.html" />
    <cc:License rdf:resource="http://creativecommons.org/licenses/by/3.0/us/" />
    <cc:Attribution rdf:resource="http://data.nytimes.com/N13941567618952269073" />

Why does this matter? It’s the RDF equivalent of making links “clickable” in HTML by putting them into an <a href=”…”> tag: if URIs are encoded as literals, RDF clients will not recognize them as URIs and will not know that they can be treated as links that can be followed.
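
To see the difference from a client’s perspective, here is a small Python sketch using rdflib; the nyt: namespace URI in it is a placeholder assumption, not taken from the actual Times data.

# Parse one property encoded as a string and one encoded with rdf:resource,
# and see what an RDF client actually gets in each case.
from rdflib import Graph, URIRef

RDFXML = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:nyt="http://example.org/nyt-elements/">
  <rdf:Description rdf:about="http://data.nytimes.com/N13941567618952269073">
    <nyt:topicPage>http://topics.nytimes.com/top/reference/timestopics/people/o/michelle_obama/index.html</nyt:topicPage>
    <nyt:topicPage rdf:resource="http://topics.nytimes.com/top/reference/timestopics/people/o/michelle_obama/index.html"/>
  </rdf:Description>
</rdf:RDF>
"""

g = Graph()
g.parse(data=RDFXML, format="xml")
for _, _, obj in g:
    # Only URIRef objects can be followed as links; literals are opaque strings.
    print("followable link" if isinstance(obj, URIRef) else "opaque literal", "->", obj)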

Content negotiation for hybrid clients. As usual for linked data emitting sites, there is content negotiation on the concept URIs: They redirect either to RDF or HTML, based on the Accept header sent by the client when resolving the URI via the HTTP protocol. Also as usual for first-time linked data producers, the content negotiation is a bit broken.

Here is what happens when I ask for HTML (using cURL, which is a handy tool for debugging the HTTP behaviour of linked data sites):

$ curl -I -H "Accept: text/html" http://data.nytimes.com/N13941567618952269073

Response:

HTTP/1.1 303 See Other
Server: Apache/2.2.3 (Red Hat)
Location: http://data.nytimes.com/N13941567618952269073.html

Next I will ask for RDF:

$ curl -I -H "Accept: application/rdf+xml" http://data.nytimes.com/N13941567618952269073

Response:

HTTP/1.1 303 See Other
Server: Apache/2.2.3 (Red Hat)
Location: http://data.nytimes.com/N13941567618952269073.rdf

So far, so good. But many clients are “hybrid”, they can consume both RDF and HTML. This includes many tools that can consume RDFa (RDF embedded in HTML pages). So it’s not uncommon to find tools that combine multiple media types in the accept header. The Times server should also redirect those tools to the RDF, because any RDF-consuming client can probably handle the raw RDF data better than the (not overly useful) HTML pages. But let’s see what happens:

$ curl -I -H "Accept: text/html,application/rdf+xml" http://data.nytimes.com/N13941567618952269073

Response:

HTTP/1.1 303 See Other
Server: Apache/2.2.3 (Red Hat)
Location: http://data.nytimes.com/N13941567618952269073.rdf.html

The server redirects to a file that doesn’t exist, ending in .rdf.html. This is pretty funny to me as a programmer, because the bug gives me a glimpse into the Times codebase, where obviously a programmer didn’t consider that the two alternatives—sending HTML or sending RDF—are exclusive.

Update: Someone at the Times seems to be working on the server as I’m writing this; the latest behaviour is even worse; it redirects to .rdf.html even if I request only RDF, and uses 301 redirects instead of 303.

Using the Creative Commons schema. The NYT data uses the Creative Commons schema to license the data under CC-BY. Here’s the relevant RDF, in Turtle (I fixed the subject URI and turned literals into URIs where appropriate):

<http://data.nytimes.com/N13941567618952269073.rdf>
    cc:License <http://creativecommons.org/licenses/by/3.0/us/>;
    cc:Attribution <http://data.nytimes.com/N13941567618952269073>;
    cc:attributionName "The New York Times Company";
    .

This uses three properties: cc:License, cc:Attribution and cc:attributionName. But according to the schema, cc:License and cc:Attribution are classes, not properties. This should be:

<http://data.nytimes.com/N13941567618952269073.rdf>
    cc:license <http://creativecommons.org/licenses/by/3.0/us/>;
    cc:attributionURL <http://data.nytimes.com/N13941567618952269073>;
    cc:attributionName "The New York Times Company";
    .

Summary. The Times’ foray into linked data is an exciting new development, but it also shows how hard it is to get linked data right. This is a weakness of the linked data approach.

Can we do anything about this? Better tutorials and education can probably help. Another activity that is trying to address the issue is the Pedantic Web Group, a loose collection of people like me who obsess about the technical details of publishing data on the web and work with data publishers to get issues like the above fixed. We might even give you a hand with reviewing your stuff before you go live with it.

Posted in General, Semantic Web | 10 Comments