John Doe’s “Test ASIN” at Amazon

My colleague Tobias Gauß dug this up while exploring the Amazon APIs: Test ASIN, by John Doe.

Apparently it’s going to be published by Test, Inc. on 07/07/07, and you can preorder NOW for only $19.99.

Posted in General, Semantic Web | Comments Off on John Doe’s “Test ASIN” at Amazon

RESTful SQL?

I’m trying to understand Mark Baker’s criticism of yesterday’s SPARQL Update proposal. To me, his criticism seems to boil down to “it’s not RESTful”, which is true, but not necessarily a problem. Why insist on applying REST to everything that goes over HTTP? So here are some questions for REST proponents.

To set the tone, an SQL query:

UPDATE foo SET foo = foo + 1 WHERE bar > 500

Now, how would you RESTify SQL? Big bonus points if queries like the above are possible with your approach.
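
For concreteness, here is one naive RESTful rendering of that statement, sketched in Ruby. Everything in it (the row-per-resource design, the ?bar=gt500 query interface, the plain-text integer representation) is a hypothetical of mine; the point is how much bookkeeping replaces a single bulk statement:

require 'net/http'

# Hypothetical design: each row of table foo is a resource, and
# GET /foo?bar=gt500 returns the paths of the matching rows, one per
# line. Row representations are plain-text integers.
http = Net::HTTP.new('example.org')

row_paths = http.get('/foo?bar=gt500').body.split("\n")

row_paths.each do |path|
  value = http.get(path).body.to_i
  # One GET plus one PUT per matching row, instead of one UPDATE.
  http.send_request('PUT', path, (value + 1).to_s,
                    'Content-Type' => 'text/plain')
end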

If your answer is, “I wouldn’t”, then would it be acceptable, under any circumstances, to receive SQL queries over port 80 in order to exploit existing server infrastructure?

Posted in General, Semantic Web | 18 Comments

SPARQL Update Language at ESW

SPARQL Update Language at the ESW Wiki, started by Max Völkel and me. So far the wiki page is just a few notes about the trivial part (syntax), but it might be a good place for collecting relevant links, ideas, and requirements.

Hopefully the DAWG will be able to tackle this after finishing the query part of SPARQL.

Posted in General, Semantic Web | Comments Off on SPARQL Update Language at ESW

Content negotiation with hash URIs (long)

Warning, WebArch nerdery ahead. The short version: I’ve managed to convince myself that there is no problem with content negotiation on hash URIs. If you want to do it, follow the Best Practices Recipes for Publishing RDF Vocabularies; the method outlined there is correct and makes sense. You can skip the rest of this post.

Recently I’ve asked on semantic-web@w3.org how to properly implement content negotiation for hash URIs. The discussion quickly turned into a flamewar between opponents and proponents of content negotiation in general, but also generated some insightful responses which helped me think through the issue.

Here I try to summarize both why the issue is tricky, and how it can be done correctly, with links to the appropriate sections of the relevant specifications.

So let’s assume http://example.org/foo#Bob is an HTTP URI that identifies Bob, a real person. From here on, I will just write foo#Bob instead of the full URI because it’s shorter.

Now let’s assume that all we have is the URI foo#Bob. We don’t know anything about it. The whole point of the “Web” in “Semantic Web” is that we can take the URI and look it up on the Web to get some clue about what that URI could identify.

Now let’s assume that the naming authority that has assigned the URI (the folks who run the server at example.org) wants to help clients by serving information about Bob in both HTML and RDF. The HTML information could be a web page that gives us Bob’s contact details. The RDF information could be some statements about foo#Bob, for example that it is a foaf:Person and that its foaf:name is “Bob”. The appropriate format should be served, depending on what the client wants.

The question is, how exactly does the HTTP interaction have to play out so that the client a) gets exactly the right clues about what the URI might identify, and b) ends up with information in the right format.

Scenario 1: The client wants RDF

1. The client starts out with the URI foo#Bob and doesn’t know anything about what it means.
2. According to RFC 3986, the client has to separate #Bob from foo and just request foo.
3. It’s an HTTP URI, so the client sends an HTTP GET request to foo.
4. The client wants RDF, so it sends along an Accept: application/rdf+xml header.
5. The server answers with an RDF document (Content-Type: application/rdf+xml) and a 200 OK status code.
6. Again, RFC 3986 tells us that the fragment #Bob within foo has to be interpreted with respect to the content type of the returned representation, that is, application/rdf+xml.
7. RFC 3870 tells us to look at RDF Concepts and Abstract Syntax, which says that foo#Bob means whatever an RDF representation of foo says about it.
8. From the retrieved RDF document, the client can learn that foo#Bob identifies a foaf:Person named “Bob”.

So this worked out really well. The client started out just with a URI, and after following a long paper trail of specifications it ends up with a bunch of RDF statements about the URI.
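
The client side of this interaction fits in a few lines; here is a minimal Ruby sketch, using the example’s hypothetical host and path:

require 'net/http'

# Steps 2-4: strip the fragment, GET the document part, ask for RDF.
http = Net::HTTP.new('example.org')
response = http.get('/foo', 'Accept' => 'application/rdf+xml')

# Steps 5-8: a 200 OK with this content type licenses the RDF reading
# of the fragment; the body then tells us who foo#Bob is.
puts response.code             # "200"
puts response['Content-Type']  # "application/rdf+xml"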

Scenario 2 (broken): The client wants HTML

1. Again, the client starts out with the URI foo#Bob and doesn’t know anything about what it means.
2. Again, it sends an HTTP GET request to foo.
3. The client wants HTML, so it sends along an Accept: text/html header.
4. The server answers with an HTML document and a 200 OK status code.
5. RFC 2854 tells us that #Bob within foo “designates the correspondingly named element; any element may be named with the “id” attribute, and A, APPLET, FRAME, IFRAME, IMG and MAP elements may be named with a “name” attribute.”

Oops! This HTTP interaction indicates to the client that foo#Bob is an HTML element inside an HTML document. That’s not the conclusion intended by the naming authority, and it clearly contradicts the RDF case above. Therefore, it’s better to give a different response to this request.

Enter httpRange-14.

Scenario 3: The client wants HTML, the server 303-redirects

1. Again, the client sends a GET to foo.
2. The server answers the request for foo with a 303 See Other status code, with the URI of another resource, bar.html, in the Location: response header.
3. According to httpRange-14, “the resource identified by that URI could be any resource.”
4. According to the HTTP protocol, the 303 status code tells the client to get an answer from another URI, but makes clear that “the new URI is not a substitute reference for the originally requested resource.”
5. The client sends a GET request to bar.html, which the server answers with an HTML document and HTTP 200 OK.
6. But since bar.html is explicitly a different resource from foo, the client can’t infer anything about the nature of foo#Bob from the content type of bar.html.
7. Thus, the only clue about the nature of foo#Bob is whatever the text in bar.html says.

This works fine. The naming authority can describe Bob in bar.html, and the client cannot infer any contradictory clues from the HTTP interaction. That’s the best we can do when the returned description is not in a machine-readable format.

So, the 303 redirect from foo to bar.html eliminated the contradiction because it leaves us without any information about the content type of foo.
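
Client code has to follow the 303 by hand; a minimal sketch (Ruby’s Net::HTTP doesn’t follow redirects on its own):

require 'net/http'
require 'uri'

http = Net::HTTP.new('example.org')
response = http.get('/foo', 'Accept' => 'text/html')

if response.code == '303'
  # The Location header names a different resource. Its content type
  # tells us nothing about foo, and therefore nothing about foo#Bob.
  location = URI.parse(response['Location'])
  response = http.get(location.path, 'Accept' => 'text/html')
end
puts response['Content-Type']  # "text/html" for bar.html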

Recipe 3 in the Best Practices Recipes for Publishing RDF Vocabularies recommends exactly this when serving HTML, and I agree fully with this recommendation.

However, the recipe also recommends doing the 303 redirect when the client asks for RDF. The server would 303-redirect from foo to bar.rdf. The 303 again means that we don’t know anything about the content type of foo and can’t use the answer from foo to interpret foo#Bob. But we also have the response from bar.rdf, and if that RDF document contains statements about foo#Bob, then we have again learned what we wanted to know.
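
To make the double redirect concrete, here is a toy server-side sketch using Ruby’s WEBrick. The port and file names are made up, and a real deployment would more likely use the Apache rewrite rules from the recipes document:

require 'webrick'

server = WEBrick::HTTPServer.new(:Port => 8000)

# Requests for foo never get a 200. They are 303-redirected to a
# format-specific document chosen by the Accept header.
server.mount_proc('/foo') do |request, response|
  response.status = 303
  if request['Accept'].to_s.include?('application/rdf+xml')
    response['Location'] = '/bar.rdf'
  else
    response['Location'] = '/bar.html'
  end
end

trap('INT') { server.shutdown }
server.start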

In summary, I think that the 303 redirect in the RDF case is unnecessary, but doesn’t hurt either, and is probably not such a bad idea because of the symmetry between the HTML and RDF responses.

Scenario 4: Making statements about parts of HTML documents

Can we use RDF to make statements about named parts of documents, such as report.html#section1? I think yes.

1. Again, the client knows just the URI and has no idea what it identifies.
2. It does HTTP GET to report.html.
3. The server responds with an HTML document and 200 OK.
4. According to RFC 2854, #section1 is the named element within the HTML document.

Caveat 1: What happens if the client asks for application/rdf+xml and the server answers with 406 Not Acceptable? Then the client didn’t get any clue out of the interaction, but there is no contradiction either. The server could have chosen to serve an alternate RDF representation of the report, which contains a statement that report.html#section1 is a doc:Section. It’s up to the naming authority to provide good clues and make it easy to find out what its URIs identify.
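
In client code, this is just one more branch; a sketch:

require 'net/http'

# A 406 gives the client no clue about report.html#section1, but
# also no contradiction. It can fall back to the HTML reading.
http = Net::HTTP.new('example.org')
response = http.get('/report.html', 'Accept' => 'application/rdf+xml')
puts 'no RDF description offered' if response.code == '406'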

Caveat 2: RDF Concepts and Abstract Syntax has this section:

eg:someurl#frag means the thing that is indicated, according to the rules of the application/rdf+xml MIME content-type as a “fragment” or “view” of the RDF document at eg:someurl. If the document does not exist, or cannot be retrieved, or is available only in formats other than application/rdf+xml, then exactly what that view may be is somewhat undetermined, but that does not prevent use of RDF to say things about it.

True, if report.html is only available as HTML but not RDF, then RDF itself tells us nothing at all about what #section1 is. But the traditional Web interpretation of report.html#section1 tells us that it is a part of an HTML document, and since a URI can identify only one thing, this interpretation carries over into RDF. I think that’s the only sensible view.

Wow, this has become quite a long and rambling post. In summary, I think I’ve managed to piece together a semi-coherent picture of how all this works, and it has stopped my angst about using hash URIs.

If you have actually read until here, and disagree with any of the reasoning above, or find anything unclear, then please add a comment.

Posted in General, Semantic Web | 9 Comments

Backing up your del.icio.us bookmarks

Triggered by recent events, I’ve written a little script that backs up my del.icio.us bookmarks to my local disk. It’s a Ruby script and runs only on Unix systems such as Mac OS X.

When run, it will do this:

  • Call the del.icio.us API to check if your bookmarks have been updated,
  • if there are new posts, fetch them using the API,
  • store them as a timestamped, gzipped XML file in the directory where the script runs.

My system runs this automatically every hour, using a cronjob. To set up a cronjob, open the Terminal, enter crontab -e, and enter something like this:

43 * * * * /Users/richard/delicious-backups/delicious-backup.rb

This will run the script 43 minutes after every full hour, if the computer is on.

Here’s the script, delicious-backup.rb. Replace username and password.

#!/usr/bin/ruby
user = "username"
pw = "password"
header = "User-Agent: delicious-backup.rb"
api_url = "https://api.del.icio.us/v1/posts"
script = $0

# Ask the API when the bookmarks were last changed.
update_xml = `curl -u #{user}:#{pw} -H "#{header}" #{api_url}/update`

# Extract the update timestamp; give up if the response looks wrong.
exit 1 unless update_xml =~ /"(\d\d\d\d-\d\d-\d\dT.*?)"/

# The backup file is named after the timestamp (colons removed) and
# lives in the same directory as the script.
file = File.dirname(script) + "/" + $1.gsub(/:/, '') + ".xml.gz"

# If it already exists, nothing has changed since the last backup.
exit 0 if FileTest.exist? file

# Pause briefly between API calls, as del.icio.us requests.
sleep 2

# Fetch all bookmarks and store them gzipped.
curl_cmd = "curl -u #{user}:#{pw} -H \"#{header}\" #{api_url}/all"
exit 1 unless system "#{curl_cmd} | gzip > #{file}"

I note that the design of the del.icio.us API made this extremely simple and quite pleasant. RESTful APIs are elegant.

Posted in General | 1 Comment

REST/WebArch question

Does the httpRange-14 resolution imply that HTTP PUT and DELETE are in general not allowed on resources that answer GET with a 303 redirect?

I take the answer to be yes, because asking a server to delete something that may be a real-world object seems unreasonable.

(POST seems less clear-cut. POSTing to my URI, for example, could send an email to me.)

Thoughts?

Posted in General, Semantic Web | 8 Comments

A year in emails

A year ago, I wrote a little script that runs a couple of times per day and records the number of emails in my inbox into a database. I did this because I noticed that this number is a fairly reliable indicator of how good I feel in general. When I feel great, the number of emails in my inbox goes down. When I feel bad, it goes up.
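
Such a logger only takes a handful of lines. A sketch of the idea, assuming a standard Unix mbox and an SQLite database (hypothetical details, not my actual script):

require 'sqlite3'

# In mbox format, every message starts with a "From " line.
count = File.read('/var/mail/richard').scan(/^From /).size

# Append a timestamped data point to the database.
db = SQLite3::Database.new('inbox-stats.db')
db.execute('CREATE TABLE IF NOT EXISTS counts (t TEXT, n INTEGER)')
db.execute('INSERT INTO counts VALUES (?, ?)', [Time.now.to_s, count])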

The script produces a tiny chart showing how I did over the last few days (live view). Watching the chart go down gives me a nice little psychological kick – the same kind of small satisfaction you get from ticking an item off a to-do list.

Email overload is a constant issue for me. There’s so much stuff coming in every day, with lots of noise but also lots of interesting and important messages that need to be dealt with. Sometimes it’s okay, but especially when I’m stressed out I find it almost impossible to keep up with the deluge. Making a decision on what to do about a particular mail is often hard, and I have a tendency to put off the hardest ones, especially when I’m under pressure from other matters. Sometimes it gets so bad I even stop weeding out the spam, and there are days where I dread even looking at the inbox.

In theory, I know how to solve this. It’s common sense really, and has been told and written down countless times. Keep your inbox at zero messages – it’s an inbox, not an archive or to-do list. Figure out what to do with each message right when it comes into your life. Make a decision, don’t let it sit. Is it junk? Delete. Is it something that just needs a quick two-sentence response? Respond and archive the mail. Does it announce some kind of event? Decide if I’m interested and put it on the calendar. A long and potentially interesting mailing list message? Put it in the to-read folder. Does it require a longer response or does it imply some action item for me? Put it in the to-do folder and work on it over the day. After a few minutes, the inbox should be empty again, and I should have a clear and clean picture of my new commitments.

I know all this, and have been trying literally for years to get there, whittling away at all the stuff that has collected in the inbox, sometimes getting down to 50 or even 20 mails, but always being buried again when the next deadline or stressful week comes up. It’s been only a few days since I managed to hit zero. And indeed, starting the day with an empty plate is a nice motivation to get there again at the end of the day.

So anyway, the script has run for a full year now, and I’ve made a chart that shows the numbers for the whole time. I feel good about this, because it indicates that I’m about to get one more area of my life under control.

Emails in my inbox, 11/2005 to 11/2006

(The huge spike is when this blog was hit by a massive comment spam attack that generated hundreds of moderation requests. Luckily Akismet has mostly eliminated this kind of problem.)

What’s your email story?

Posted in General | Comments Off on A year in emails

Ze Frank: How to hire a web developer

First, a good web developer does the minimum amount of work to achieve an acceptable result. To test this, start by just looking at your interviewee …

Posted in General | Comments Off on Ze Frank: How to hire a web developer

StatSVN

StatSVN has had its first public release. It’s a port of our venerable StatCVS statistics tool to Subversion. Cool! It’s being developed by Jean-Philippe Daigle, Jason Kealey, and Gunter Mussbacher.

Lines of Code chart for StatSVN

We had considered adding Subversion support to StatCVS ourselves, but Subversion doesn’t include the all-important lines-of-code numbers in its log files. Tammo and Steffen even put together a patch for Subversion, but couldn’t get it accepted by the Subversion team.

StatSVN works around the problem by connecting to the Subversion server and fetching all revisions of all files to count the lines of code. This isn’t exactly an elegant solution, and it’s slow, but it works.
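
The workaround boils down to something like this sketch (not StatSVN’s actual code, which is Java; just the shape of the idea for a single file):

# For each revision that touched the file, fetch that revision's
# content and count its lines. Do this for every file in the
# repository and it's clear why the approach is slow.
revisions = `svn log -q file.txt`.scan(/^r(\d+)/).flatten
revisions.each do |rev|
  loc = `svn cat -r #{rev} file.txt`.count("\n")
  puts "r#{rev}: #{loc} lines"
end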

There’s a nice whitepaper on StatSVN. Here’s a sample report, which looks pretty much identical to those generated by StatCVS.

Posted in General, Semantic Web | 3 Comments

No more free internet

I believe in free and open wifi. Ever since moving to my current apartment, I’ve kept my wireless network open for anyone to join.

Friends and colleagues tell me I’m nuts, and that I will go to jail because evil neighbours will download kiddie porn.

Today I’ve closed the access point. The network has become too popular. It is hopelessly overloaded and has slowed to a crawl.

Sorry, neighbours! No more free internet.

Posted in General | 2 Comments