cygri’s notes on web data

Andrew Newman: Querying the Semantic Web using a Relational Based SPARQL

Posted on November 8, 2006 by Richard Cyganiak

Andrew Newman has announced his thesis on a relational model for SPARQL. Sounds like a must-read.

(Having printed it, I’m getting a kick out of seeing my name between E. F. Codd and C. J. Date in the references.)

(Observation: During ISWC, the must-read stack grows by 50-60 pages a day.)

Posted in General, Semantic Web | Comments Off

QOTD: Email and answering machines

Posted on October 31, 2006 by Richard Cyganiak

David Allen:

Everybody would be further ahead if they made email like they made answering machines, have it blow up if you got more than 42.

The whole piece is worth listening to – a 17-minute discussion between productivity blogger Merlin Mann and Getting Things Done author David Allen about the productivity problems associated with email.

Posted in General | Comments Off

Namespaces in queries, part 3

Posted on October 26, 2006 by Richard Cyganiak

… in which I backpedal quite a bit, after helpful comments from Laurens Holst on my previous post.

Clarification 1: I don’t think that namespaces are bad. They are good and important. I just think that sometimes you want to ignore them in queries.

Clarification 2: My proposal doesn’t stop anyone from using namespaces and doesn’t change the semantics of any valid SPARQL query. It only makes some currently invalid SPARQL queries valid: those which use the default prefix without declaring it.

Now for the backpedaling: I guess that most people around here see SPARQL as “SQL for triple stores”, something you embed into application code. My perspective on SPARQL has become a bit different recently. Most of my recent SPARQLing was to interrogate SPARQL endpoints with unknown contents, or to explore the open Semantic Web using the SemWebClient library.

I use SPARQL interactively. I write quick one-off queries, see the result right away, and make corrections as needed. Bogus results are noticed and fixed instantly. I rarely re-run a query later on. Number of characters typed matters a lot in this scenario.

Maybe this explains why I advocate such a nutty idea, and my bad for not realizing this earlier.

So, time to shelve this idea until, maybe, more people look beyond their own data, play with the open Semantic Web, and feel the pain of having to type out the FOAF namespace for the n-th time.

Thanks for all the comments.

Posted in General, Semantic Web | 4 Comments

Namespaces in queries, part 2

Posted on October 26, 2006 by Richard Cyganiak

One of the nice things about blogging is that you get people to review your ideas, for free.

Yesterday I claimed that you don’t need URI prefixes in RDF queries. Instead of writing this:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX doap: <http://usefulinc.com/ns/doap#>
SELECT DISTINCT ?projectName ?personName
WHERE { 
  ?person foaf:name ?personName .
  ?person doap:project ?project .
  ?project doap:name ?projectName .
}

one could write this:

SELECT DISTINCT ?projectName ?personName
WHERE {
  ?person :name ?personName .
  ?person :project ?project .
  ?project :name ?projectName .
}

and have the undeclared default namespace match any namespace in the data. It’s shorter, it’s more readable, no more hunting around for namespace URIs to copy and paste. What’s not to like?

Well, everybody hates the idea. See the comments to the piece linked above, and Ora’s rant and comments there. So I’m probably wrong.

I found most of the cited reasons pretty lame though, ranging from dogma to tortured SQL analogies to performance concerns. Some good bits:

Evan:

Lumping [properties from different namespaces] together should only be done if the user explicitly specifies that it should be done,

Spot on: an optional, well-defined and compact way for specifying that the user wants them lumped together. That’s what I want.

and it is likely there would be far more accidental collisions than purposeful collisions.

I disagree about the prediction, but it’s hard to prove either way.

Richard Newman:

I assume the proposal is just the fevered ramblings of a man sick of writing

PREFIX foaf: <http://xmlns.com/foaf/0.1/>

every time he wanted to query on names.

My answer: environment support.

While Richard may be quite right about my motivation, I don’t see environment support happening. SPARQL is usually entered into web forms, text editors and IDEs. I’m not aware of any support in any of these environments (well, except in D2R Server (example)), and I’m not going to wait two years for a solution.

Ora and drewp pointed out that, if I don’t want to query for full URIs, I should query for rdfs:labels. That’s a good point. A query language that lets me do

SELECT ?personName ?projectName WHERE
?project a project .
?project name ?projectName .
?project developer ?person .
?person name ?personName .
?person a person .

would indeed be cool. This would be halfway between SPARQL and NLP projects like Ginseng. The downside is that one would have to design an all-new language, and apart from the namespace issue, I like SPARQL just fine.

Danny mentioned microformats, where namespaces are unnecessary because the community has to agree on a schema before it can be used. But I don’t want to change anything on the data side; URIs for properties and classes are great. That doesn’t necessarily mean we need to do the same on the query side.

Finally, Henry:

I think you just posted this to get attention ;-)

Uh, no. Though it certainly worked ;-)

A question to those still interested in the discussion: 100% of widely deployed RDF vocabularies follow the convention of namespace plus mnemonic local parts. Why is it wrong to exploit a universally accepted convention in a query language?

Posted in General, Semantic Web | 3 Comments

You don’t need URI prefixes in RDF queries

Posted on October 25, 2006 by Richard Cyganiak

Update: Everyone hates the idea; some for good reasons.

Properties and classes in RDF are identified by URIs. This is important because we want to be able to say additional things about them. But it has a cost. It makes RDF harder and uglier. Just think about the time you’ve spent copying and pasting prefix declarations and hunting for the right namespace URI for some vocabulary. Still, that’s a cost we have to pay in a system without a centralized schema.

Not so with queries. Check out this one:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX doap: <http://usefulinc.com/ns/doap#>
SELECT DISTINCT ?projectName ?personName
WHERE { 
  ?person foaf:name ?personName .
  ?person doap:project ?project .
  ?project doap:name ?projectName .
}

The prefixes in this query are utterly superfluous. They are noise. They are ugly. They are a pain. They cause errors. They kill serendipity. All they provide, if anything, is a false sense of security.

Make them optional! Here’s a better version:

SELECT DISTINCT ?projectName ?personName
WHERE {
  ?person :name ?personName .
  ?person :project ?project .
  ?project :name ?projectName .
}

The query processor should match the QNames regardless of namespace. Thus, :name would match both foaf:name and doap:name. Writing SPARQL queries could actually be fun.

So I think that, if a query doesn’t declare the default namespace, then the default namespace should be understood to match any namespace.
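To make the matching rule concrete, here is a minimal Python sketch. It is not a SPARQL engine and not how any existing query processor works. It only assumes the usual convention that a property URI is a namespace followed by a local name after the last “#” or “/”. The function names split_uri and match_local_name are made up for this illustration.

```python
# Sketch of "match the local name regardless of namespace".
# split_uri and match_local_name are hypothetical helper names.

def split_uri(uri):
    """Split a URI into (namespace, local name) at the last '#' or '/'."""
    cut = max(uri.rfind("#"), uri.rfind("/"))
    return uri[:cut + 1], uri[cut + 1:]

def match_local_name(triples, local_name):
    """Return all triples whose predicate has the given local name."""
    return [t for t in triples if split_uri(t[1])[1] == local_name]

triples = [
    ("ex:alice", "http://xmlns.com/foaf/0.1/name", "Alice"),
    ("ex:alice", "http://usefulinc.com/ns/doap#project", "ex:proj"),
    ("ex:proj", "http://usefulinc.com/ns/doap#name", "Some Project"),
]

# :name picks up both foaf:name and doap:name
for subject, predicate, obj in match_local_name(triples, "name"):
    print(subject, predicate, obj)
```

The sketch also shows the collision risk: :name returns the person’s name and the project’s name alike, and it is up to the rest of the query pattern to tell them apart.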

Posted in General, Semantic Web | 7 Comments

Open Data and the Semantic Web – an elevator pitch

Posted on October 23, 2006 by Richard Cyganiak

(Copy’n’pasted from a poster I’ve done recently – I thought it was worth re-posting here)

Public and private organizations can benefit from making data available in machine-readable formats. Such data is most valuable when it is easily accessible and can be re-used by integrating it with other data. Semantic Web technologies support this through

built-in globally unique identifiers (URIs),
powerful options for linking and integrating data published by different parties,
the ability to mix and match vocabularies for the description of a single resource.

Data published on the Semantic Web can be queried using the SPARQL query language, can be navigated with RDF browsers like Tabulator and Piggy Bank, and is accessible to RDF-consuming Web crawlers.

Posted in General, Semantic Web | Comments Off

Fixing ambiguous concept URIs

Posted on October 16, 2006 by Richard Cyganiak

This is in response to a discussion in the comments to my recent post on geonames.org. I criticized the use of the same URI for concepts and documents in the Geonames RDF output; this post describes in detail how to fix that kind of issue.

The problem: Geonames uses URIs to identify places. For example, this is the URI for Berlin:

http://ws.geonames.org/rdf?geonameId=2950159

This URI identifies both a concept (the city of Berlin) and a document (which provides some information about that city). This is a problem because, as TimBL points out, the URI now both identifies something that is located in Germany and something that mentions the class Feature. This ambiguity can cause a lot of trouble down the road.

How to fix it: Use different URIs for the concept and the document.

Let’s say we keep the URI above for the document and pick a new URI for the concept. When I create an RDF link from my profile to my home town, for example, I would use the new URI.

But we have to set things up in a certain way: When we retrieve the concept URI, we want to get to the contents of the document! There are two ways to do that, and we have to pick one:

The hash approach: By adding a fragment identifier to the document URI, we get a new concept URI. For example, this could be the concept URI for Berlin:

http://ws.geonames.org/rdf?geonameId=2950159#place

Just as with HTML, a hash in the URI means that the part before the hash is the document to be retrieved, and the part after the hash identifies something within the document. Consequently, one gets the document when trying to retrieve the hash URI.
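Seen from the client side, this looks roughly as follows. A minimal Python sketch using the standard library’s urllib.parse.urldefrag; the #place fragment is just the example name from above, not something Geonames is known to serve.

```python
from urllib.parse import urldefrag

concept_uri = "http://ws.geonames.org/rdf?geonameId=2950159#place"

# A client strips the fragment before sending the HTTP request,
# so the server only ever sees the document URI.
document_uri, fragment = urldefrag(concept_uri)

print(document_uri)  # http://ws.geonames.org/rdf?geonameId=2950159
print(fragment)      # place
```

The fragment never travels over the wire; it is up to the client to look up “place” inside the retrieved document.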

The 303 approach: Here we pick a completely new URI for the concept. For example:

http://ws.geonames.org/places/2950159

(I’ve tried to pick a clean URI. Your URIs are your site’s prime real estate; it’s better not to clutter them.)

Whenever any HTTP URI is accessed, the web server responds with a status code, e.g. 200 for “OK, Here’s the page”, 404 for “Sorry, not found”, or 302 for “The document has temporarily moved to this other URI” (where the other URI is provided in a Location: HTTP header; this is called a redirection).

Now we have to set up the server to respond with a 303 (“See Other”) status code, and put the document URI in the Location: HTTP header. The client can fetch the Location: URI to retrieve the document.
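As a rough idea of what the server side could look like, here is a sketch using Python’s built-in http.server. It is an assumption-laden toy, not how Geonames is set up: the /places/ path is the example concept URI from above, and document_uri_for, ConceptHandler and make_server are names invented for this illustration.

```python
import re
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumed URI layout: concept URIs live under /places/<id>,
# documents under the existing /rdf?geonameId=<id> URI.
DOCUMENT_URI = "http://ws.geonames.org/rdf?geonameId={id}"
CONCEPT_PATH = re.compile(r"^/places/(\d+)$")

def document_uri_for(path):
    """Map a concept path to its document URI, or None if it is no concept."""
    match = CONCEPT_PATH.match(path)
    return DOCUMENT_URI.format(id=match.group(1)) if match else None

class ConceptHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        target = document_uri_for(self.path)
        if target is None:
            self.send_error(404, "No such concept")
            return
        # 303 See Other: "I can't send you the thing itself,
        # but here is a document about it."
        self.send_response(303)
        self.send_header("Location", target)
        self.end_headers()

def make_server(port=8000):
    """Build (but do not start) a server; call .serve_forever() to run it."""
    return HTTPServer(("127.0.0.1", port), ConceptHandler)
```

Running make_server().serve_forever() and requesting /places/2950159 should yield a 303 response whose Location: header carries the document URI. In a real deployment the same effect is usually achieved with a redirect rule in the web server configuration.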

Which one to pick? To be honest I don’t really know. The 303 approach has the disadvantage of requiring an additional HTTP request to fetch the redirected document, and it may be harder to set up. The hash approach has the disadvantage that it feels a bit hackish when there is just a single concept described in the document. Do whatever works for you.

But is it a problem at all? Some researchers still quarrel about this whole issue. Some think it’s no problem at all; others think that something completely different should be done. I’m a toolsmith, not a philosopher, and therefore try to avoid these debates. W3C has said that we should do the things above, and I’m happy to comply and move on to more important issues.

Background reading: This piece is getting way too long already and I’m tired; so I’ll leave it up to you to provide interesting linkage in the comments.

Posted in General, Semantic Web | 8 Comments

Any opinions on BibSonomy?

Posted on October 15, 2006 by Richard Cyganiak

Any regular users of BibSonomy around? Del.icio.us totally works for me as a bookmark manager, and BibSonomy is similar, so now I wonder if I should use it to manage the papers I’m reading. I’m looking for opinions: what works well, what doesn’t?

Maybe I should install Semantic MediaWiki instead and manage my notes in there?

Posted in General, Semantic Web | 8 Comments

Linking your Geonames place from your FOAF file

Posted on October 14, 2006 by Richard Cyganiak

Now that we have a large database of places on the Semantic Web, I want to link to my place from my FOAF file.

To find a place, go to geonames.org, enter the name in the search box, and find the right place on the result page.

[Screenshot: geonames.org search result page for “berlin”]

Click the small multicolored place markers near the left border (not the place name). A map will appear, with a marker and box showing the place.

[Screenshot: geonames.org map showing Berlin]

The box contains a link “semantic web rdf”. Copy the link. That’s your place.

I added this triple to my personal RDF data:

<http://richard.cyganiak.de/foaf.rdf#cygri>
    foaf:based_near <http://ws.geonames.org/rdf?geonameId=2950159> .

Now, when I load my URI in Tabulator, it lets me navigate from my profile up the geographical hierarchy from the city of Berlin to the federal state of Berlin to the country of Germany to the continent of Europe.

I’m not perfectly happy with using foaf:based_near, because I’m not based near Berlin, dammit, I’m based right in the middle of it. Is there a more appropriate and reasonably well-established property?

Posted in General, Semantic Web | 4 Comments

News from the frontier: Geonames on the Semantic Web

Posted on October 14, 2006 by Richard Cyganiak

The Geonames database is available on the Semantic Web. Announcement, Details, Discussion, notes about the ontology.

In summary, this means a whole lot of places with their names and geo coordinates and links to other places are now available on the Semantic Web. For example, this is a URI for Berlin:

http://ws.geonames.org/rdf?geonameId=2950159

And now I can link to it from my FOAF file.

This is an excellent example of a Semantic Web site done right. Well, almost right …

The good stuff:

URIs for all concepts, like the Berlin URI.
All URIs are dereferenceable. Click on the one above!
The data contains links to related places and concepts.
Clear, simple vocabulary.
Vocabulary re-uses existing work where appropriate (SKOS).
Small, easily processable chunks of data.

The bad stuff:

Mixing of documents and concepts. Thou Shalt Not use the same URI for a concept and the document describing it! See TBL’s Linked Data, section “Variation: URIs without Slashes and HTTP 303”
Dead vocabulary links. The vocabulary uses URIs like http://www.geonames.org/ontology#inCountry, but these URIs resolve to an HTML page; it would be much better to serve the RDFS/OWL specification (which actually resides here), or serve the right thing depending on the Accept: HTTP header (content negotiation).
No backlinks. There are links to concepts higher up in the hierarchy, but not the other way. This limits the possibilities of browsing and crawling the linked web of places.
No rdfs:labels. Adding the value of geonames:name redundantly as an rdfs:label (or at least declaring geonames:name a subproperty of rdfs:label, which would be a good idea no matter what) would help RDF browsers like Tabulator to display the data in a better way.
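On the content negotiation point: the idea is that the server looks at the Accept: header and serves either the RDF or the HTML version of the vocabulary. A simplified Python sketch of that decision follows. It handles q-values and */* but ignores partial wildcards like text/*, and the function names parse_accept and choose_representation are invented for this example.

```python
def parse_accept(header):
    """Return (media type, q) pairs from an Accept header, best first."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # q defaults to 1 when not given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q as "not acceptable"
        prefs.append((media_type, q))
    # sorted() is stable, so equal q-values keep their header order
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)

def choose_representation(accept_header,
                          available=("application/rdf+xml", "text/html"),
                          default="text/html"):
    """Pick the media type to serve for a vocabulary URI."""
    for media_type, q in parse_accept(accept_header):
        if q <= 0:
            continue
        if media_type in available:
            return media_type
        if media_type == "*/*":
            return default
    return default

print(choose_representation("application/rdf+xml"))
print(choose_representation("text/html,application/xhtml+xml,*/*;q=0.8"))
```

With this, an RDF-aware client asking for application/rdf+xml gets the RDFS/OWL file, while an ordinary browser still gets the HTML page.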

But still, this is an extremely cool addition to the Semantic Web, and a nice showcase for linked data. Kudos to Bernard Vatant, who seems to have done most of the work and lobbying.

Posted in General, Semantic Web | 9 Comments