Warning, WebArch nerdery ahead. The short version is, I’ve managed to convince myself that there is no problem with content negotiation on hash URIs. And if you want to do it, you should follow the Best Practices Recipes for Publishing RDF Vocabularies; the method outlined there is correct and makes sense. You can skip the rest of this post.
Recently I’ve asked on firstname.lastname@example.org how to properly implement content negotiation for hash URIs. The discussion quickly turned into a flamewar between opponents and proponents of content negotiation in general, but also generated some insightful responses which helped me think through the issue.
Here I try to summarize both why the issue is tricky, and how it can be done correctly, with links to the appropriate sections of the relevant specifications.
So let’s assume http://example.org/foo#Bob is an HTTP URI that identifies Bob, a real person. From here on, I will just write foo#Bob instead of the full URI because it’s shorter.
Now let’s assume that all we have is the URI foo#Bob. We don’t know anything about it. The whole point of the “Web” in “Semantic Web” is that we can take the URI and look it up on the Web to get some clue about what that URI could identify.
Now let’s assume that the naming authority who has assigned the URI (the folks who run the server at example.org) wants to help clients by serving information about Bob in both HTML and RDF. The HTML information could be a web page that tells us Bob’s contact data. The RDF information could be some statements about foo#Bob, for example that it is a foaf:Person and that its foaf:name is “Bob”. The appropriate format should be served, depending on what the client wants.
The question is, how exactly does the HTTP interaction have to play out so that the client a) gets exactly the right clues about what the URI might identify, and b) ends up with information in the right format.
Scenario 1: The client wants RDF
1. The client starts out with the URI foo#Bob and doesn’t know anything about what it means.
2. According to RFC 3986, the client has to separate the fragment #Bob from foo and just request foo.
3. It’s an HTTP URI, so the client sends an HTTP GET request to foo.
4. The client wants RDF, so it sends along an Accept: application/rdf+xml header.
5. The server answers with an RDF document (Content-Type: application/rdf+xml) and a 200 OK status code.
6. Again, RFC 3986 tells us that the fragment #Bob of foo has to be interpreted with respect to the content type of the returned representation, that is, application/rdf+xml.
7. RFC 3870 tells us to look at RDF Concepts and Abstract Syntax, which says that foo#Bob means whatever an RDF representation of foo says about it.
8. From the retrieved RDF document, the client can learn that foo#Bob identifies a foaf:Person named “Bob”.
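Steps 2 to 4 can be sketched in a few lines of Python using only the standard library. This is a rough illustration, not a full client: the helper name build_lookup_request is made up for this post, and the request is only prepared, never sent.

```python
from urllib.parse import urldefrag
from urllib.request import Request


def build_lookup_request(uri: str, accept: str) -> tuple[Request, str]:
    """Split off the fragment and prepare (but do not send) a GET request.

    Hypothetical helper for illustration only.
    """
    # RFC 3986: the fragment is not part of the request. The client keeps
    # it and interprets it once a representation has come back.
    base, fragment = urldefrag(uri)
    request = Request(base, headers={"Accept": accept}, method="GET")
    return request, fragment


request, fragment = build_lookup_request(
    "http://example.org/foo#Bob", "application/rdf+xml"
)
print(request.full_url)              # the URI that actually goes on the wire
print(request.get_header("Accept"))  # the format the client asks for
print(fragment)                      # kept client-side for later interpretation
```

The point of returning the fragment separately is that its meaning is decided later, by the content type of whatever the server sends back.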
So this worked out really well. The client started out just with a URI, and after following a long paper trail of specifications it ends up with a bunch of RDF statements about the URI.
Scenario 2 (broken): The client wants HTML
1. Again, the client starts out with the URI foo#Bob and doesn’t know anything about what it means.
2. Again, it sends an HTTP GET request to foo.
3. The client wants HTML, so it sends along an Accept: text/html header.
4. The server answers with an HTML document and a 200 OK status code.
5. RFC 2854 tells us that the fragment #Bob of foo “designates the correspondingly named element; any element may be named with the “id” attribute, and A, APPLET, FRAME, IFRAME, IMG and MAP elements may be named with a “name” attribute.”
Oops! This HTTP interaction indicates to the client that foo#Bob is an HTML element inside an HTML document. That’s not the conclusion intended by the naming authority, and a clear contradiction to the other case where we served RDF. Therefore, it’s better to give another response to this request.
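To make the problem concrete, here is a small sketch of how a client that follows the RFC 2854 rule quoted above would resolve the fragment against an HTML response. The page content and the names FragmentFinder and designated_element are invented for illustration; real browsers do considerably more than this.

```python
from html.parser import HTMLParser

# Elements that, per the RFC 2854 text quoted above, may be named
# with a "name" attribute in addition to "id".
NAMEABLE = {"a", "applet", "frame", "iframe", "img", "map"}


class FragmentFinder(HTMLParser):
    """Records the first element that a text/html fragment designates."""

    def __init__(self, fragment):
        super().__init__()
        self.fragment = fragment
        self.found = None  # tag name of the designated element, if any

    def handle_starttag(self, tag, attrs):
        if self.found is not None:
            return
        attributes = dict(attrs)
        if attributes.get("id") == self.fragment:
            self.found = tag
        elif tag in NAMEABLE and attributes.get("name") == self.fragment:
            self.found = tag


def designated_element(html, fragment):
    """Return the tag name the fragment points at, or None."""
    finder = FragmentFinder(fragment)
    finder.feed(html)
    return finder.found


# Hypothetical HTML representation of foo, served with 200 OK.
page = '<html><body><div id="Bob"><h1>Bob</h1><p>Contact data</p></div></body></html>'
print(designated_element(page, "Bob"))
```

If the page happens to contain an element with id="Bob", the client concludes that foo#Bob is that element; if not, the fragment designates nothing. Neither outcome is "Bob, a real person".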
Scenario 3: The client wants HTML, the server 303-redirects
1. Again, the client sends a GET to foo.
2. The server answers the request for foo with a 303 See Other status code, with the URI of another resource, bar.html, in the Location: response header.
3. According to httpRange-14, “the resource identified by that URI could be any resource.”
4. According to the HTTP protocol, the 303 status code tells the client to get an answer from another URI, but makes clear that “the new URI is not a substitute reference for the originally requested resource.”
5. The client sends a GET request to bar.html, which the server answers with an HTML document and HTTP 200 OK.
6. But since bar.html is explicitly a different resource from foo, the client can’t infer anything about the nature of foo#Bob from the content type of bar.html.
7. Thus, the only clue about the nature of foo#Bob is whatever the text in bar.html says about it.
This works fine. The naming authority can describe Bob in bar.html, and the client cannot infer any contradictory clues from the HTTP interaction. That’s the best we can do when the returned description is not in a machine-readable format.
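The client-side reasoning of the three scenarios so far can be condensed into one small function. This is only my reading of the specifications expressed as code, under the assumption that the client looks at nothing but the status code and content type of the response to foo itself; the function name fragment_semantics is made up.

```python
def fragment_semantics(status, content_type=None):
    """Say which rules govern a fragment of the ORIGINAL URI, or None.

    `status` and `content_type` describe the response to the request for
    the fragment-less URI (foo), not the response from any redirect target.
    """
    if status == 200 and content_type:
        # Drop parameters such as "; charset=utf-8".
        media_type = content_type.split(";")[0].strip().lower()
        if media_type == "application/rdf+xml":
            return "whatever the RDF document says about it (RFC 3870)"
        if media_type == "text/html":
            return "a named element in the HTML document (RFC 2854)"
        return None  # media types this sketch does not know about
    # 303 See Other (or anything else): the Location target is a different
    # resource, so its content type tells us nothing about the fragment.
    return None


print(fragment_semantics(200, "application/rdf+xml"))  # Scenario 1
print(fragment_semantics(200, "text/html"))            # Scenario 2
print(fragment_semantics(303))                         # Scenario 3
```

In the 303 case the function returns None: the protocol gives no verdict, and the client is left with whatever the description it was redirected to says.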
So, the 303 redirect from foo to bar.html eliminated the contradiction because it leaves us without any information about the content type of foo.
Recipe 3 in the Best Practices Recipes for Publishing RDF Vocabularies recommends exactly this when serving HTML, and I agree fully with this recommendation.
However, the recipe also recommends doing the 303 redirect when the client asks for RDF. The server would 303-redirect from foo to bar.rdf. The 303 again means that we don’t know anything about the content type of foo, and can’t use the answer from foo to interpret foo#Bob. But we also have the response from bar.rdf, and if the RDF document contains statements about foo#Bob, then we again have learned what we wanted to know.
In summary, I think that the 303 redirect in the RDF case is unnecessary, but doesn’t hurt either, and is probably not such a bad idea because of the symmetry between the HTML and RDF responses.
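On the server side, the decision for a GET on foo might look roughly like the following. It is a sketch under several assumptions: the Accept parsing is deliberately naive, treating */* as a request for RDF is my own choice, and the names parse_accept, respond and redirect_rdf are invented. The redirect_rdf flag switches between answering RDF requests directly with 200 and the 303 variant that the recipe recommends.

```python
def parse_accept(header):
    """Return media types from an Accept header, most preferred first."""
    weighted = []
    for position, item in enumerate(header.split(",")):
        parts = [p.strip() for p in item.split(";")]
        media_type, q = parts[0].lower(), 1.0
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0  # unparsable weight: treat as not acceptable
        if media_type and q > 0:
            # Sort by descending q, then by order of appearance.
            weighted.append((-q, position, media_type))
    return [media_type for _, _, media_type in sorted(weighted)]


def respond(accept, redirect_rdf=False):
    """Decide the response to GET foo as (status, headers)."""
    for media_type in parse_accept(accept):
        if media_type == "text/html":
            # Never answer 200 with HTML on foo itself (Scenario 2).
            return 303, {"Location": "http://example.org/bar.html"}
        if media_type in ("application/rdf+xml", "*/*"):
            if redirect_rdf:
                return 303, {"Location": "http://example.org/bar.rdf"}
            return 200, {"Content-Type": "application/rdf+xml"}
    return 406, {}


print(respond("text/html"))
print(respond("application/rdf+xml"))
print(respond("application/rdf+xml", redirect_rdf=True))
```

Either setting of redirect_rdf leaves the client with RDF statements about foo#Bob; the only thing that must not happen is a 200 response with an HTML body for foo.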
Scenario 4: Making statements about parts of HTML documents
Can we use RDF to make statements about named parts of documents, such as report.html#section1? I think yes.
1. Again, the client knows just the URI and has no idea what it identifies.
2. It does HTTP GET to report.html.
3. The server responds with an HTML document and 200 OK.
4. According to RFC 2854, #section1 is the named element within the HTML document.
Caveat 1: What happens if the client asks for application/rdf+xml and the server answers with 406 Not Acceptable? Then the client didn’t get any clue out of the interaction, but there is no contradiction either. The server could have chosen to serve an alternate RDF representation of the report, which contains a statement that report.html#section1 is a doc:Section. It’s up to the naming authority to provide good clues and make it easy to find out what its URIs identify.
Caveat 2: RDF Concepts and Abstract Syntax has this section:
eg:someurl#frag means the thing that is indicated, according to the rules of the application/rdf+xml MIME content-type as a “fragment” or “view” of the RDF document at eg:someurl. If the document does not exist, or cannot be retrieved, or is available only in formats other than application/rdf+xml, then exactly what that view may be is somewhat undetermined, but that does not prevent use of RDF to say things about it.
If report.html is only available as HTML but not RDF, then RDF itself tells us nothing at all about what #section1 is. But the traditional Web interpretation of report.html#section1 tells us that it is a part of an HTML document, and since a URI can identify only one thing, this interpretation carries over into RDF. I think that’s the only sensible view.
Wow, this has become quite a long and rambling post. In summary, I think I’ve managed to piece together a semi-coherent picture of how all this works, and it has stopped my angst about using hash URIs.
If you have actually read until here, and disagree with any of the reasoning above, or find anything unclear, then please add a comment.