Fixing ambiguous concept URIs

This is in response to a discussion in the comments to my recent post on geonames.org. I criticized the use of the same URI for concepts and documents in the Geonames RDF output; this post describes in detail how to fix that kind of issue.

The problem: Geonames uses URIs to identify places. For example, this is the URI for Berlin:

http://ws.geonames.org/rdf?geonameId=2950159

This URI identifies both a concept (the city of Berlin) and a document (which provides some information about that city). This is a problem because, as TimBL points out, the same URI now identifies both something that is located in Germany and something that mentions the class Feature. This ambiguity can cause a lot of trouble down the road.

How to fix it: Use different URIs for the concept and the document.

Let’s say we keep the URI above for the document and pick a new URI for the concept. When I create an RDF link from my profile to my home town, for example, I would use the new URI.

But we have to set things up in a certain way: When we retrieve the concept URI, we want to get to the contents of the document! There are two ways to do that, and we have to pick one:

The hash approach: By adding a fragment identifier to the document URI, we get a new concept URI. For example, this could be the concept URI for Berlin:

http://ws.geonames.org/rdf?geonameId=2950159#place

Just as with HTML, a hash in the URI means that the part before the hash is the document to be retrieved, and the part after the hash identifies something within the document. Consequently, one gets the document when trying to retrieve the hash URI.
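
To make that concrete, here is a small Python sketch of what a client does with a hash URI before it touches the network. The URI is the example from above; the variable names are mine, and urldefrag is from Python's standard urllib.parse module. It only illustrates the splitting step, not a full RDF client.

```python
from urllib.parse import urldefrag

# Example concept URI from this post (the "#place" fragment is just a suggestion).
concept_uri = "http://ws.geonames.org/rdf?geonameId=2950159#place"

# The fragment is stripped on the client side; only the part before
# the hash is ever sent to the server in the HTTP request.
document_uri, fragment = urldefrag(concept_uri)

print(document_uri)  # the document that actually gets fetched
print(fragment)      # the local name of the concept inside that document
```

So the server never sees the fragment, which is why the hash approach needs no server configuration at all.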

The 303 approach: Here we pick a completely new URI for the concept. For example:

http://ws.geonames.org/places/2950159

(I’ve tried to pick a clean URI. Your URIs are your site’s prime real estate, so it’s better not to clutter them.)

Whenever an HTTP URI is accessed, the web server responds with a status code, e.g. 200 for “OK, here’s the page”, 404 for “Sorry, not found”, or 302 for “The document has temporarily moved to this other URI” (where the other URI is provided in a Location: HTTP header; this is called a redirection).

Now we have to set up the server to respond with a 303 (“See Other”) status code, and put the document URI in the Location: HTTP header. The client can fetch the Location: URI to retrieve the document.
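
As a rough illustration, this is how the 303 setup might look using nothing but Python's standard http.server module. It is a sketch under assumptions, not how Geonames does or should implement it: the /places/ path layout and the document URI pattern are simply the examples from this post, and the names document_uri_for and ConceptRedirectHandler are made up for the example.

```python
import re
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Optional

# Assumed URI layout, taken from the examples in this post.
CONCEPT_PATH = re.compile(r"/places/(\d+)")
DOCUMENT_URI = "http://ws.geonames.org/rdf?geonameId={id}"


def document_uri_for(path: str) -> Optional[str]:
    """Map a concept path such as /places/2950159 to its document URI."""
    match = CONCEPT_PATH.fullmatch(path)
    return DOCUMENT_URI.format(id=match.group(1)) if match else None


class ConceptRedirectHandler(BaseHTTPRequestHandler):
    """Answers requests for concept URIs with a 303 redirect."""

    def do_GET(self):
        target = document_uri_for(self.path)
        if target is None:
            self.send_error(404, "Not found")
            return
        # 303 See Other: a city cannot be sent over HTTP, so point
        # the client at a document that describes it instead.
        self.send_response(303)
        self.send_header("Location", target)
        self.send_header("Content-Length", "0")
        self.end_headers()
```

To try it locally, something like HTTPServer(("localhost", 8000), ConceptRedirectHandler).serve_forever() should do (the port is arbitrary). In a real deployment the same rule would more likely be a one-line redirect in the web server configuration.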

Which one to pick? To be honest I don’t really know. The 303 approach has the disadvantage of requiring an additional HTTP request to fetch the redirected document, and it may be harder to set up. The hash approach has the disadvantage that it feels a bit hackish when there is just a single concept described in the document. Do whatever works for you.

But is it a problem at all? Some researchers still quarrel about this whole issue. Some think it’s no problem at all; others think that something completely different should be done. I’m a toolsmith, not a philosopher, and therefore try to avoid these debates. The W3C has said that we should do the things above, and I’m happy to comply and move on to more important issues.

Background reading: This piece is getting way too long already and I’m tired; so I’ll leave it up to you to provide interesting linkage in the comments.


8 Responses to Fixing ambiguous concept URIs

  1. leobard says:

    uri crisis again. And I think your blog just ate the 1000 char comment I entered here.

  2. leobard says:

    your bloody blog just ate and discarded a long analysis of the problem, bad wordpress.

    Summary: the 303 approach is new, but the problem is so boring and dumb that every month another nitwit picks it up to discuss it endlessly.

    Face it: you cannot identify concepts, because concepts are only perceived by humans and do per se not exist. Philosophy helps here: you only perceive, and you use language to communicate facts about perceptions. So there is no uri for the document or for the concept. Think of words. You are reading my words at the moment, but is there an identifier for identifier besides identifier? If URIs are like words, then it depends how you use them in social behavior. If you decide that “http://ws.geonames.org/rdf?geonameId=2950159” identifies Berlin and enough people copy this, then it’s the identifier for Berlin for these people, like the “alt-names” in the document at this uri.

    What you can do is use topic maps resource-identifiers or what we used in the PIMO language, occurrenceRefs, see here:
    http://ontologies.opendfki.de/repos/ontologies/pim/pimo-language.rdfs

    see also:
    http://www.dfki.uni-kl.de/~sauermann/2006/01-pimo-report/pimOntologyLanguageReport.html#ch-IdentificationIssues

  3. Your article claims to be fixing ambiguity, yet it introduces more ambiguity by introducing concepts.

    After introducing a distinction between berlin-the-concept and a-document-about-berlin, you say “something that is located in Germany and something that mentions the class Feature”.

    Do you really mean that concept of Berlin is located in Germany?

    In actual fact there are 3 or more entities:

    – Berlin the city (a physical entity)
    – The geonames document record concerning (presumably) the city of Berlin
    – The concept of Berlin

    We can probably dispense with the last entity unless we are cognitive scientists or cultural historians.

    Once you correct this then your analysis of the problem is a good one

  4. Leo, some time ago I adopted the point of view that URIs are either dereferenceable or crap. URI crisis solved, and I can get on with the job of building software.

  5. Chris, this time your analysis is wrong. In actual fact, there are two entities:

    – Berlin the city (the thing itself; the concept)

    – A document describing Berlin

    Only cognitive scientists or cultural historians would bother to divide the first one into two distinct entities. I’m not one of those, and therefore don’t bother to make the distinction, and believe that the way I’ve presented it is perfectly clear for the intended target audience.

    If you have a good term for “those URIs that do not identify documents but other things, like cities or people or software projects“, a term that is understandable to non-experts, then please share. The best I could come up with was “concept URI”. And if you offer “entity URI” or “non-information resource URI“, then I shall personally lock you up in an ivory tower.

  6. Good post. Thank you, I think it explains it well. There are indeed various reasons for choosing the # method or the 303 method.

    When writing RDF (in N3 normally) by hand, the # method is very easy, and works well. You naturally write in files of manageable size which are easily delivered over the net. The thing after the # is the local identifier in the file, and you can also add some identifiers for other related things which are naturally described in the same file.

    When a large set of identifiers is generated automatically, then the 303 method may make sense, but the # method will work too, even if every uri ends “#it”, which looks odd. It is faster from the client’s point of view.

    For example, the tabulator, which, like many things, looks up the ontology for each predicate or class term it comes across, has to look up every single Dublin Core term (which has no #) like dc:title separately just to find that each redirects to the same ontology file.

  7. Joshua Shinavier says:

    Tim, it’s nice to see someone defending the lowly hash URI. Everything I’ve read suggests that hash namespaces are the quick and dirty way to toss small, static RDF documents onto the web, while serious semantic web services use slash namespaces. Personally, I much prefer the # when it comes to actually dereferencing and caching URIs, for the reason you mentioned. What’s more, many actual slash URIs out there do NOT redirect to the URI of the appropriate document; you’re meant to guess their namespace and dereference that instead, so if you make the wrong choice, you get nothing. Content negotiation using hash URIs (with each URI in its own hash namespace, using “it” or some other gimmick for the fragment identifier), seems to me to combine the best of both worlds, but maybe I shouldn’t talk until I’ve tried it.

  8. Pingback: Fixing ambiguous concept URIs