Content negotiation with hash URIs (long)

Warning, WebArch nerdery ahead. The short version: I’ve managed to convince myself that there is no problem with content negotiation on hash URIs. If you want to do it, follow the Best Practices Recipes for Publishing RDF Vocabularies; the method outlined there is correct and makes sense. You can skip the rest of this post.

Recently I asked on semantic-web@w3.org how to properly implement content negotiation for hash URIs. The discussion quickly turned into a flamewar between opponents and proponents of content negotiation in general, but it also generated some insightful responses that helped me think through the issue.

Here I try to summarize both why the issue is tricky, and how it can be done correctly, with links to the appropriate sections of the relevant specifications.

So let’s assume http://example.org/foo#Bob is an HTTP URI that identifies Bob, a real person. From here on, I will just write foo#Bob instead of the full URI because it’s shorter.

Now let’s assume that all we have is the URI foo#Bob. We don’t know anything about it. The whole point of the “Web” in “Semantic Web” is that we can take the URI and look it up on the Web to get some clue about what that URI could identify.

Now let’s assume that the naming authority who has assigned the URI (the folks who run the server at example.org) wants to help clients by serving information about Bob in both HTML and RDF. The HTML information could be a web page that tells us Bob’s contact data. The RDF information could be some statements about foo#Bob, for example that it is a foaf:Person and that its foaf:name is “Bob”. The appropriate format should be served, depending on what the client wants.
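To make the RDF side concrete, here is a minimal sketch of what such a document might contain and how a client could read the two statements out of it. The document text and the helper name names_of_people are assumptions made up for illustration, and the code uses a plain XML parser rather than a real RDF parser, so it only handles this one simple shape of RDF/XML.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF_NS = "http://xmlns.com/foaf/0.1/"

# Hypothetical RDF/XML representation served for http://example.org/foo
DOCUMENT = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person rdf:about="#Bob">
    <foaf:name>Bob</foaf:name>
  </foaf:Person>
</rdf:RDF>"""


def names_of_people(rdf_xml, base_uri):
    """Map the absolute URI of each foaf:Person to its foaf:name."""
    root = ET.fromstring(rdf_xml)
    result = {}
    for person in root.findall(f"{{{FOAF_NS}}}Person"):
        about = person.get(f"{{{RDF_NS}}}about")
        name = person.findtext(f"{{{FOAF_NS}}}name")
        # Relative references such as "#Bob" resolve against the document URI.
        result[urljoin(base_uri, about)] = name
    return result


people = names_of_people(DOCUMENT, "http://example.org/foo")
print(people)
```

The relative reference #Bob inside the document resolves to the full hash URI, which is how the statements end up being about foo#Bob and not about the document foo.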

The question is, how exactly does the HTTP interaction have to play out so that the client a) gets exactly the right clues about what the URI might identify, and b) ends up with information in the right format.

Scenario 1: The client wants RDF

1. The client starts out with the URI foo#Bob and doesn’t know anything about what it means.
2. According to RFC 3986, the client has to separate #Bob from foo and just request foo.
3. It’s an HTTP URI, so the client sends an HTTP GET request to foo.
4. The client wants RDF, so it sends along an Accept: application/rdf+xml header.
5. The server answers with an RDF document (Content-Type: application/rdf+xml) and a 200 OK status code.
6. Again, RFC 3986 tells us that the fragment #Bob within foo has to be interpreted with respect to the content type of the returned representation, that is, application/rdf+xml.
7. RFC 3870 tells us to look at RDF Concepts and Abstract Syntax, which says that foo#Bob means whatever an RDF representation of foo says about it.
8. From the retrieved RDF document, the client can learn that foo#Bob identifies a foaf:Person named “Bob”.
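Steps 2 to 4 can be sketched with Python’s standard library. This is only an illustration under my own assumptions: the function name build_lookup_request is made up, and the request is constructed but never sent, since example.org does not actually serve this data.

```python
from urllib.parse import urldefrag
from urllib.request import Request


def build_lookup_request(uri, accept):
    """Split off the fragment and prepare a GET for the remaining URI."""
    # The fragment is never sent to the server; the client keeps it
    # and interprets it after a representation comes back.
    document_uri, fragment = urldefrag(uri)
    request = Request(document_uri, headers={"Accept": accept}, method="GET")
    return request, fragment


request, fragment = build_lookup_request(
    "http://example.org/foo#Bob", "application/rdf+xml"
)
print(request.get_method(), request.full_url)
print("Accept:", request.get_header("Accept"))
print("fragment kept by the client:", fragment)
```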

So this worked out really well. The client started out just with a URI, and after following a long paper trail of specifications it ends up with a bunch of RDF statements about the URI.

Scenario 2 (broken): The client wants HTML

1. Again, the client starts out with the URI foo#Bob and doesn’t know anything about what it means.
2. Again, it sends an HTTP GET request to foo.
3. The client wants HTML, so it sends along an Accept: text/html header.
4. The server answers with an HTML document and a 200 OK status code.
5. RFC 2854 tells us that #Bob within foo “designates the correspondingly named element; any element may be named with the “id” attribute, and A, APPLET, FRAME, IFRAME, IMG and MAP elements may be named with a “name” attribute.”

Oops! This HTTP interaction indicates to the client that foo#Bob is an HTML element inside an HTML document. That’s not the conclusion intended by the naming authority, and a clear contradiction to the other case where we served RDF. Therefore, it’s better to give another response to this request.
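The clash between the first two scenarios can be spelled out as a small lookup on the returned content type. The function name interpret_fragment and the wording of its results are my own; the sketch only paraphrases the fragment rules cited above for the two media types and treats everything else as undetermined.

```python
def interpret_fragment(content_type, fragment):
    """Describe what a fragment means for a 200 response of this content type."""
    # Ignore parameters such as "; charset=utf-8".
    media_type = content_type.split(";")[0].strip().lower()
    if media_type == "application/rdf+xml":
        return f"whatever the RDF document says about #{fragment}"
    if media_type == "text/html":
        return f"the HTML element named '{fragment}'"
    return "undetermined"


rdf_reading = interpret_fragment("application/rdf+xml", "Bob")
html_reading = interpret_fragment("text/html; charset=utf-8", "Bob")
print(rdf_reading)
print(html_reading)
```

The same URI yields two different readings depending on which representation the server returned with 200 OK, and that is exactly the contradiction.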

Enter httpRange-14.

Scenario 3: The client wants HTML, the server 303-redirects

1. Again, the client sends a GET to foo.
2. The server answers the request for foo with a 303 See Other status code, with the URI of another resource, bar.html, in the Location: response header.
3. According to httpRange-14, “the resource identified by that URI could be any resource.”
4. According to the HTTP protocol, the 303 status code tells the client to get an answer from another URI, but makes clear that “the new URI is not a substitute reference for the originally requested resource.”
5. The client sends a GET request to bar.html, which the server answers with an HTML document and HTTP 200 OK.
6. But since bar.html is explicitly a different resource from foo, the client can’t infer anything about the nature of foo#Bob from the content type of bar.html.
7. Thus, the only clue about the nature of foo#Bob is whatever the text in bar.html says.

This works fine. The naming authority can describe Bob in bar.html, and the client cannot infer any contradictory clues from the HTTP interaction. That’s the best we can do when the returned description is not in a machine-readable format.

So, the 303 redirect from foo to bar.html eliminated the contradiction because it leaves us without any information about the content type of foo.
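From the client’s point of view, the reasoning in this scenario could be sketched as follows. The canned RESPONSES table stands in for the server, and the function name lookup and the shape of its return value are assumptions for illustration only.

```python
# Hypothetical canned responses standing in for the server at example.org.
RESPONSES = {
    "http://example.org/foo": (303, {"Location": "http://example.org/bar.html"}),
    "http://example.org/bar.html": (200, {"Content-Type": "text/html"}),
}


def lookup(uri, responses):
    """Follow 303 redirects and report what the client may conclude.

    Returns (final_uri, content_type, type_applies_to_original).
    """
    current = uri
    redirected = False
    for _ in range(5):  # guard against redirect loops
        status, headers = responses[current]
        if status == 303:
            # The new URI is not a substitute reference for the original one.
            current = headers["Location"]
            redirected = True
            continue
        if status == 200:
            return current, headers["Content-Type"], not redirected
        raise ValueError(f"unexpected status {status} for {current}")
    raise ValueError("too many redirects")


final_uri, content_type, applies = lookup("http://example.org/foo", RESPONSES)
print(final_uri, content_type, applies)
```

Because the 200 response belongs to bar.html and not to foo, the last value is False: the text/html content type must not be used to interpret #Bob within foo.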

Recipe 3 in the Best Practices Recipes for Publishing RDF Vocabularies recommends exactly this when serving HTML, and I agree fully with this recommendation.

However, the recipe also recommends doing the 303 redirect when the client asks for RDF. The server would 303-redirect from foo to bar.rdf. The 303 again means that we don’t know anything about the content type of foo, and can’t use the answer from foo to interpret foo#Bob. But we also have the response from bar.rdf, and if the RDF document contains statements about foo#Bob, then we again have learned what we wanted to know.

In summary, I think that the 303 redirect in the RDF case is unnecessary, but doesn’t hurt either, and is probably not such a bad idea because of the symmetry between the HTML and RDF responses.
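On the server side, the two variants could look roughly like this. It is a sketch under several assumptions: the function name negotiate and the redirect_rdf flag are made up, the Accept parsing is deliberately naive (it ignores q-values and wildcards such as */*), and falling back to 406 is just one possible choice.

```python
def negotiate(accept_header, redirect_rdf=True):
    """Decide how to answer a GET for foo; returns (status, headers)."""
    # Naive parsing: drop parameters such as q-values, keep the listed order.
    accepted = [
        part.split(";")[0].strip().lower() for part in accept_header.split(",")
    ]
    for media_type in accepted:
        if media_type == "text/html":
            # Always redirect for HTML, so that #Bob is never read as an
            # element inside an HTML representation of foo.
            return 303, {"Location": "http://example.org/bar.html"}
        if media_type == "application/rdf+xml":
            if redirect_rdf:
                # Symmetric variant, as recommended by the recipe.
                return 303, {"Location": "http://example.org/bar.rdf"}
            # Variant without a redirect: serve RDF directly, as in scenario 1.
            return 200, {"Content-Type": "application/rdf+xml"}
    return 406, {}


print(negotiate("text/html"))
print(negotiate("application/rdf+xml"))
print(negotiate("application/rdf+xml", redirect_rdf=False))
```

Either setting of redirect_rdf leaves the client with RDF statements about foo#Bob; the flag only decides whether they arrive from foo directly or from bar.rdf.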

Scenario 4: Making statements about parts of HTML documents

Can we use RDF to make statements about named parts of documents, such as report.html#section1? I think yes.

1. Again, the client knows just the URI and has no idea what it identifies.
2. It does HTTP GET to report.html.
3. The server responds with an HTML document and 200 OK.
4. According to RFC 2854, #section1 is the named element within the HTML document.

Caveat 1: What happens if the client asks for application/rdf+xml and the server answers with 406 Not Acceptable? Then the client didn’t get any clue out of the interaction, but there is no contradiction either. The server could have chosen to serve an alternate RDF representation of the report, which contains a statement that report.html#section1 is a doc:Section. It’s up to the naming authority to provide good clues and make it easy to find out what its URIs identify.

Caveat 2: RDF Concepts and Abstract Syntax has this section:

eg:someurl#frag means the thing that is indicated, according to the rules of the application/rdf+xml MIME content-type as a “fragment” or “view” of the RDF document at eg:someurl. If the document does not exist, or cannot be retrieved, or is available only in formats other than application/rdf+xml, then exactly what that view may be is somewhat undetermined, but that does not prevent use of RDF to say things about it.

True, if report.html is only available as HTML but not RDF, then RDF itself tells us nothing at all about what #section1 is. But the traditional Web interpretation of report.html#section1 tells us that it is a part of an HTML document, and since a URI can identify only one thing, this interpretation carries over into RDF. I think that’s the only sensible view.

Wow, this has become quite a long and rambling post. In summary, I think I’ve managed to piece together a semi-coherent picture of how all this works, and it has stopped my angst about using hash URIs.

If you have actually read until here, and disagree with any of the reasoning above, or find anything unclear, then please add a comment.


9 Responses to Content negotiation with hash URIs (long)

  1. In Scenario 3 you say: “However, the recipe also recommends to do the 303 redirect when the client asks for RDF.”

    No, actually the recipe does not say that, or at least I think the authors don’t want that. I had a chat with TimBL about this issue at ISWC, and he said that you can always use hash URIs, and that he likes them better because they spare you the second GET and the redirect. If I remember correctly, hash URIs are always concept identifiers and never document identifiers. If the returned document is RDF, you don’t have to do the HTML interpretation of RFC 2854, but instead do the RDF interpretation.

    So, I would conclude that for hosted RDF/XML files, the hash URIs are a ‘poor man’s 303 redirect’ and you need not redirect.

    Redirection was only needed to differentiate between document and concept URIs; hash URIs can be seen as concept URIs by definition. Note also that foo#bob is a concept URI and foo is a document URI, because it returns HTTP 200 OK.

  2. Leo, look at the Vocabulary Recipes document, in recipe 3 there is 303 all over the place for both HTML and RDF. I don’t think the authors did that by accident.

    The 303 for hash URIs is a different story from the 303 with “concept URIs”. For concept URIs, we do it because it prevents the client from concluding that the URI identifies a document (we want it to identify a person). For hash URIs, we do the 303 *when the client asks for HTML* because we want to prevent it from concluding that the URI is an *HTML* document, because that would make the fragment a part of an HTML document.

    If TimBL says that hash URIs always represent concepts, and not part of documents, then he contradicts RFC 3986 and introduces an ambiguity that is just as bad as the original “URI crisis” which httpRange-14 was designed to solve.

    RDF Concepts says that a hash URI, when encountered inside an RDF document, is interpreted as an RDF hash URI, not an HTML hash URI. But what a URI identifies is beyond RDF. A URI should be context-free and should identify the same thing whether it’s in an RDF file or a database or an email message.

    Can you point me to the step of recipe 1 or 2 where I made a wrong inference? I just followed the letter of the specs.

    Just do the 303 for HTML documents and everything is fine.

  3. John Black says:

    Above you state, “A URI should be context-free and should identify the same thing whether it’s in an RDF file or a database or an email message.” But you should see my post at http://kashori.com/2006/09/power-of-ambiguous-uri.html. In that post I cast doubt on whether even the W3C adheres to that restriction. In short, the URI that identifies the badge claiming that the page on which it resides is valid XHTML identifies a different claim each time it appears on a different page. The claim is that this page is valid. So it is entirely context dependent.

  4. Hi Richard,

    I think hash URIs for RDF/XML documents don’t need a 303, because foo.rdf#bar will always be a concept, right?

    An RDF/XML file is not required to host an HTML representation of the foo.rdf#bar URI, so if Accept: text/html is sent, the server does not have to do anything special… but that’s only my understanding, and I cannot remember whether that’s what TimBL said.

  5. John, I posted a reply to your post here, I think it’s in your moderation queue. In short, I disagree with your basic premise. The badge URI doesn’t “identify a claim”. It identifies an image.

    Leo, I agree, when there’s only an RDF representation, then there’s no need for a 303. That’s the first scenario above. The need for a 303 arises only if we indeed want content negotiation and HTML on the foo URI.

  6. John Black says:

    Your comment is now posted, thanks. I plan to write a new post to deal with your objections. But here is a first draft of some of what I disagree with.

    Your first objection to my argument is that the URL http://www.w3.org/Icons/valid-xhtml10 does not make a claim. And you show how someone may just mention it, in your example, by commenting on the quality of the design of the image that it returns. But any identifier, word, phrase, sentence, etc. can be mentioned, and in that context, it doesn’t have the effect that it would in an ordinary context. For example, I could say, “The URL to this post of yours has 70 characters in it.” And in that context, it is a character string and doesn’t identify your post. Does that mean in ordinary use it does not identify your post? Of course not. Nor does your mention of the poor design of the badge that is returned when you access the xhtml-valid URL change its referent in ordinary use. Any URL can be mentioned. That doesn’t render it useless.

    Your third objection is that the URI identifies “*just an image*”, not a claim. Why then is it named http://www.w3.org/Icons/valid-xhtml10? The URI minted by the W3C is named “valid-xhtml10” for a reason. If it was just an image, they might as well have named it “image-123”. They named it “valid-xhtml10” because they intended for it to be used to make a specific claim, namely that the page on which it appears uses xhtml that validates. Secondly, a machine can use (interpret, understand) the URI, “http://www.w3.org/Icons/valid-xhtml10”. It can be programmed conditionally based on finding that URI embedded in a page. For example, it may try to parse it as XML rather than use some lower level screen scraping technique. The image is not the meaning of the URI, neither to a human nor to a machine. That URL’s proper interpretation is spelled out quite clearly by the creators of the URI in their help document I quoted, “To show readers that one has taken some care to create an interoperable Web page, a “W3C valid” badge may be displayed (here, the “valid XHTML 1.0” badge) on any page that validates.” The image is for easy recognition by a human being; it is the URI that matters.

    I think your second objection is the most interesting. Recall that it is your statement above, “A URI should be context-free and should identify the same thing whether it’s in an RDF file or a database or an email message.” that I am arguing with to begin with. Now in your comment you say, “Second, even if it was a claim, then the meaning is not in the URI alone, but arises from the *act of embedding* (in RDF speak, from making a statement ‘:page123 :embeds :w3c_badge’). Of course, the meaning of any statement depends on the subject. This doesn’t make the property-value pair ambiguous.” In other words, first you say “A URI should be context-free…” and then you say, “…the meaning is not in the URI alone…”, and these two statements seem contradictory. It is exactly my point that the meaning of the URI depends on the subject with which it is used, and so it is not context-free. You cannot interpret the referent of the URI without knowing the context of its use. You cannot determine which author claims which page is valid unless you know the context in which that URI was embedded. Without that context, it doesn’t refer to or identify anything.

  7. Hi guys,

    I think Richard is right when he says, with respect to recipe 3 in the best practice recipes document, “… I think that the 303 redirect in the RDF case is unnecessary, but doesn’t hurt either, and is probably not such a bad idea because of the symmetry between the HTML and RDF responses.”

    In fact, perhaps the main reason why we chose to include a 303 redirect in the RDF case also is so that anyone retrieving the RDF data can maintain provenance information about the source of the data, and differentiate between different snapshots or versions of the data. I.e. if every time you (as the namespace owner) change the data you also change the redirect location, then you provide the basic framework for Semantic Web applications to talk about which version of an ontology they are committing to.

  8. Good point Alistair. I guess we could generalize this: It’s good practice to implement content negotiation by redirecting to the appropriate version. If any confusion about resource identity could arise from differences in content type, then the redirect should be a 303.

  9. John, apologies for not “freeing” your latest comment from the queue earlier. I don’t understand why it got stuck in moderation in the first place.

    I’m afraid we will just have to agree to disagree. To me, your claims are simply outlandish. You try to discuss URIs detached from their purpose of building interoperable information systems and their definition in the relevant internet and web standards. I see absolutely no value in doing that.