JUC2006 wrapup

JUC2006 has been over for a few days. This is my wrapup post. First some linkage:

There were lots of good talks. The one that got me most excited was Bastian Quilitz’ talk on federated SPARQL queries. It seems to me that SPARQL is the RDF landscape’s most promising area right now. Other SPARQL-related cool stuff: Damian’s SquirrelRDF which queries existing SQL databases with SPARQL; my own D2R Server which does pretty much the same; Ginseng from U of Zürich, a “fridge poetry to SPARQL translator”.

Another noteworthy talk was Leigh’s presentation of Slug, an RDF crawler that is very flexible and extensible and should cover anybody’s crawling needs. Chris Dollin presented Eyeball, a style checker for RDF which captures many common RDF authoring errors. RDFReactor by Max generates Java classes from an RDF schema and is very well thought-out.

There was more cool stuff; I neglected my blogging duties during Reto’s KnoBot talk and Kevin Wilkinson’s update on the state of Jena property tables (very good paper as well – doesn’t seem to be online though.)

I did a short informal talk on D2R Server, which is the latest addition to the D2RQ family of database-to-RDF mapping tools. I got a lot of excellent feedback on our work. Mailing lists are great, but you can’t beat face-to-face community meetings for gathering feedback. The database-to-RDF space is heating up right now, with Kevin’s property tables and Damian’s SquirrelRDF both available soon. Should be fun!

As always, the best part of the conference is to meet new folks, hear stories from the trenches, and put faces to names (and blogs).

Big thank yous go out to Paolo and Manuela for putting me up, and of course to Ian Dickinson and the whole Jena team for making it all happen.

Update: Proceedings are online, and even more photos.

Posted in General, Semantic Web | Comments Off on JUC2006 wrapup

SPARQL endpoint self-descriptions

In many scenarios involving SPARQL endpoints, it would be great to have machine-readable metadata about the endpoint. What is it called, what is inside, what can it do?

One place where this comes up is with generic SPARQL browsers. They should at least be able to display a human-readable title and description of the endpoint. Another place is SPARQL query federation where the federation engine needs to know a bit about the endpoint’s capabilities.
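
As a rough illustration, here is the kind of per-endpoint record a federation engine or generic browser might keep (a sketch in plain Java; the field names and the idea of a fixed set of "fast predicates" are my assumptions, not an agreed vocabulary):

```java
import java.util.*;

// Sketch of per-endpoint metadata a SPARQL client might consume.
// Field names are illustrative assumptions, not an agreed vocabulary.
public class EndpointDescription {
    public final String endpointUrl;          // where to send queries
    public final String title;                // human-readable label for browsers
    public final Set<String> fastPredicates;  // predicates answered with good performance

    public EndpointDescription(String endpointUrl, String title,
                               Collection<String> fastPredicates) {
        this.endpointUrl = endpointUrl;
        this.title = title;
        this.fastPredicates = new HashSet<>(fastPredicates);
    }

    // A federation engine could use this to decide where to route a triple pattern.
    public boolean canAnswerEfficiently(String predicate) {
        return fastPredicates.contains(predicate);
    }

    public static void main(String[] args) {
        EndpointDescription d = new EndpointDescription(
                "http://example.org/sparql", "Example endpoint",
                List.of("foaf:name", "foaf:mbox"));
        System.out.println(d.title + " handles foaf:name: "
                + d.canAnswerEfficiently("foaf:name"));
    }
}
```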

ESW Wiki: SPARQL Endpoint Description

I’ve discussed this quite a bit at JUC with Max Völkel and Bastian Quilitz; Dave Beckett also joined in. The wiki page linked above is a writeup of my notes. If you’re running a SPARQL endpoint or writing SPARQL client or server software, please have a look and add your thoughts.

Posted in General, Semantic Web | Comments Off on SPARQL endpoint self-descriptions

[juc] Damian Steer – SquirrelRDF: Querying existing SQL data with SPARQL

SquirrelRDF is quite similar to our D2RQ. There’s a lot of data out there, but much of it is not in RDF but in relational databases. SquirrelRDF allows SPARQL queries against such databases.

Mapping from DB schema to RDF is done along the lines described in 1998 by TBL. The mapping is created automatically by a small tool that introspects the database schema. Mappings are 1:1 from table to class and from column to property.
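
The automatic 1:1 scheme can be sketched in a few lines; the URI convention below (base URI plus table name, column appended after a hash) is my guess at a plausible pattern, not SquirrelRDF’s actual one:

```java
// Sketch of the 1:1 mapping: each table becomes a class URI, each
// column a property URI. The naming convention here is an assumption.
public class SchemaMapper {
    public static String classUri(String baseUri, String table) {
        return baseUri + table;
    }

    public static String propertyUri(String baseUri, String table, String column) {
        return baseUri + table + "#" + column;
    }

    public static void main(String[] args) {
        System.out.println(classUri("http://db.example.com/", "employees"));
        System.out.println(propertyUri("http://db.example.com/", "employees", "salary"));
    }
}
```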

It’s a simple tool – the hard work is done by Jena’s ARQ query engine, which breaks SPARQL queries down into much simpler triple queries and passes them to SquirrelRDF.

Should be available as a Jena contrib Real Soon Now. (Update: It’s released.)

Posted in General, Semantic Web | Comments Off on [juc] Damian Steer – SquirrelRDF: Querying existing SQL data with SPARQL

[juc] Bastian Quilitz – Federated queries with SPARQL

People have been talking about federated semantic web queries for a while. Here’s a working prototype …

Bastian is an intern at HP Labs. The idea is to answer one SPARQL query using data from multiple SPARQL endpoints. The individual endpoints have to describe their capabilities with a simple service description. The federation engine can then create a plan for splitting up the query, executing the parts on the individual stores, and recombining the results.

Service descriptions include:

  • the service endpoint URL
  • information on what kind of queries the service can answer with good performance, based on predicates, e.g. “This service can answer queries about foaf:name and foaf:mbox.”
  • a selectivity function
  • whether the endpoint provides definitive information.

Query plans are optimized based on a cost function that mostly uses selectivity as a cost measure. (E.g. foaf:gender has low selectivity, foaf:name high selectivity, and doing high selectivity parts first is better. I wonder whether triple counts would make a good factor in cost calculations.)
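
The ordering idea can be sketched like this; the selectivity numbers are invented for illustration, and a real cost function would weigh more factors:

```java
import java.util.*;

// Sketch: order triple patterns so that highly selective ones
// (few expected results) run first. Selectivities are illustrative.
public class QueryPlanner {
    public static List<String> orderBySelectivity(Map<String, Double> selectivity) {
        List<String> patterns = new ArrayList<>(selectivity.keySet());
        // descending: high selectivity (cheap, few results) first
        patterns.sort(Comparator.comparingDouble((String p) -> -selectivity.get(p)));
        return patterns;
    }

    public static void main(String[] args) {
        Map<String, Double> sel = new HashMap<>();
        sel.put("?p foaf:gender ?g", 0.1); // low selectivity: many matches
        sel.put("?p foaf:name ?n", 0.9);   // high selectivity: few matches
        System.out.println(orderBySelectivity(sel));
    }
}
```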

At the moment, the service descriptions must be provided by the party who sets up the federated server. They are also responsible for determining the selectivities. (I think that service endpoints should be able to provide a description of their own capabilities. The service knows its own data and is in a good position to describe what it can and cannot answer with good performance. The service is also able to calculate selectivities on its own.)

The code is not public at the moment. Bastian says he intends to publish it when it’s more polished. (Update: He says he will publish it soon.) (Update: Here it is.)

Posted in General, Semantic Web | Comments Off on [juc] Bastian Quilitz – Federated queries with SPARQL

[juc] François-Paul Servant – Semanlink

Semanlink is an RDF-based personal information management system. It’s a tagging system. You can tag files, bookmarks and text notes. Unlike most tagging systems, Semanlink lets you arrange tags into a concept hierarchy. It runs as a servlet.

There’s a web page for each tag, which lists not only the tagged items, but also “subtags” and “supertags.”
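
The subtag/supertag structure might look like this in code (a sketch of the idea only; Semanlink’s actual data model is RDF, not these Java classes):

```java
import java.util.*;

// Sketch of tags arranged in a concept hierarchy. A tag page can then
// show tagged items plus direct subtags and supertags.
public class Tag {
    public final String name;
    public final Set<Tag> supertags = new HashSet<>();
    public final Set<Tag> subtags = new HashSet<>();

    public Tag(String name) { this.name = name; }

    public void addSupertag(Tag parent) {
        supertags.add(parent);
        parent.subtags.add(this);
    }

    public static void main(String[] args) {
        Tag web = new Tag("web");
        Tag semweb = new Tag("semantic-web");
        semweb.addSupertag(web);
        System.out.println(web.name + " has " + web.subtags.size() + " subtag(s)");
    }
}
```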

It’s a del.icio.us on steroids. The UI is not yet quite streamlined enough for my taste, but it looks usable. It’s technologically simple – Jena memory model, file-based persistence – so it should be hackable. Quite cool.

I’d use it if it had del.icio.us import (or synchronization, preferably).

Posted in General, Semantic Web | Comments Off on [juc] François-Paul Servant – Semanlink

[juc] Chris Dollin – Eyeball

Eyeball is a command-line tool for finding typical problems in RDF files. A “lint for RDF.” Or the “jena rdf screwup detection utility,” in danbri’s words.

Some of the errors that Eyeball can detect:

  • datatype errors (integers with letters in them etc.)
  • weird-looking URIs
  • weird-looking namespace declarations
  • terms not declared in their schema file (can indicate typos)
  • cardinalities (to work around OWL’s weird open-world semantics)
  • domain/range problems (to work around RDFS’ weird semantics)

Eyeball detects many things that technically are legal, but from experience are likely unintended.

It’s easy to write new checkers (“inspectors”).
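
To illustrate the pluggable-checker idea, here is a toy inspector that flags the first kind of error from the list above; the interface is invented for this sketch and is not Eyeball’s actual API:

```java
import java.util.*;

// Toy version of a pluggable RDF checker ("inspector"). The interface
// is an invented sketch, not Eyeball's real API.
public class LintDemo {
    public interface Inspector {
        List<String> inspect(String subject, String predicate, String object);
    }

    // Flags xsd:integer literals whose lexical form is not a valid integer.
    public static final Inspector INTEGER_CHECK = (s, p, o) -> {
        List<String> problems = new ArrayList<>();
        if (o.startsWith("\"") && o.endsWith("^^xsd:integer")) {
            String lexical = o.substring(1, o.indexOf('"', 1));
            if (!lexical.matches("-?\\d+")) {
                problems.add("bad integer literal: " + lexical);
            }
        }
        return problems;
    };

    public static void main(String[] args) {
        System.out.println(INTEGER_CHECK.inspect("ex:s", "ex:age", "\"12a\"^^xsd:integer"));
    }
}
```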

An online validator is planned. Yay!!!

Posted in General, Semantic Web | 2 Comments

[juc] Steve Battle – Gloze: XML to RDF and back again

Steve wants to translate XML into RDF, and back. He tried to come up with the simplest possible mapping. And he hates inventing new languages. So he didn’t want to create a mapping language.

The solution is to look at XML Schema. If the XML to be translated conforms to a schema, then you have enough information to do roundtripping without information loss.

This is not so easy, for many reasons. In XML, ordering is relevant; in RDF, it is not. There’s a lot of detail to look at: XML Schema types (simple and complex), text content, ID and IDREF.

Interesting approach to ordering the properties of an RDF resource: The properties are attached to the resource with plain old predicates. Ordering is added by attaching an rdf:Seq that contains reifications of these statements. This is done only if the XML schema says ordering is relevant.
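
A stripped-down sketch of that ordering trick (Gloze uses an rdf:Seq of reified statements; here it is simplified to an index list, which is my simplification, not the tool’s representation):

```java
import java.util.*;

// Sketch of the ordering trick: statements are stored as plain
// property/value pairs; when the XML schema says order matters, an
// extra sequence records the original order.
public class OrderedResource {
    public final List<String[]> statements = new ArrayList<>(); // {property, value}
    public final List<Integer> order = new ArrayList<>();       // filled only if ordered

    public void add(String property, String value, boolean schemaSaysOrdered) {
        statements.add(new String[]{property, value});
        if (schemaSaysOrdered) order.add(statements.size() - 1);
    }

    public static void main(String[] args) {
        OrderedResource r = new OrderedResource();
        r.add("ex:first", "a", true);
        r.add("ex:second", "b", true);
        System.out.println(r.order);
    }
}
```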

There’s no download yet, but should be available in a couple of days. The tool is called Gloze.

Posted in General, Semantic Web | 1 Comment

[juc] Leigh Dodds – Slug Semantic Web Crawler

Slug is one of Leigh’s pet projects. It’s a crawler for the semantic web.

There are lots of slug photos in the slides.

A semantic web crawler works like a web crawler, but it fetches RDF files instead of HTML pages, and follows rdfs:seeAlso links instead of HTML links.

Slug is multi-threaded and very extensible. Crawling and fetching are separated from the further processing, so you can do almost anything with the content that has been found. Pre-defined options include caching found RDF files in a local filesystem cache or storing them in a Jena persistent store.

Some pre-defined filters for changing the crawler’s behaviour: RegexFilter (ignore URLs matching a regex, e.g. FOAF profiles from LiveJournal), DepthFilter (crawl only six steps), SingleFetchFilter (don’t recrawl resources that you’ve already seen). Adding others is easy.
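
The filter idea can be sketched as a small predicate interface; the names echo the filters above, but the signatures and behaviour are my guesses, not Slug’s actual code:

```java
import java.util.*;
import java.util.regex.Pattern;

// Sketch of pluggable crawl filters deciding whether a discovered URL
// should be fetched. Names echo Slug's filters; the API is invented.
public class CrawlFilters {
    public interface Filter {
        boolean accept(String url, int depth);
    }

    // Like RegexFilter: reject URLs matching a pattern.
    public static Filter regexReject(String pattern) {
        Pattern p = Pattern.compile(pattern);
        return (url, depth) -> !p.matcher(url).find();
    }

    // Like DepthFilter: crawl only to a limited depth.
    public static Filter maxDepth(int limit) {
        return (url, depth) -> depth <= limit;
    }

    // Like SingleFetchFilter: never fetch the same URL twice.
    public static Filter singleFetch(Set<String> seen) {
        return (url, depth) -> seen.add(url);
    }

    public static void main(String[] args) {
        Filter lj = regexReject("livejournal");
        System.out.println(lj.accept("http://www.livejournal.com/users/x/foaf", 0));
    }
}
```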

The crawler keeps metadata about its activity – which resource was fetched when and with what result, and where the crawler came from – so it records the link structure.

(I wrote my own super-simple FOAF crawler earlier this year. I wish I had known about Slug back then; it would have done a much better job, and using it would have been less work.)

Posted in General, Semantic Web | Comments Off on [juc] Leigh Dodds – Slug Semantic Web Crawler

[juc] Dave Reynolds – PortalCore

PortalCore is a web-based portal toolkit that provides faceted browsing on RDF data. It was built at HP Labs and used in various customer projects.

Unlike Longwell, which also provides faceted browsing of RDF datasets, PortalCore is highly configurable: which facets to use for navigation, how the facets work, which templates to use to display instances, and so on.
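
At its core, a facet is just a count of property values across instances. A minimal sketch (representing instances as property maps is my simplification of the underlying RDF):

```java
import java.util.*;

// Minimal sketch of one facet: for a chosen property, count how many
// instances carry each value. Configuration decides which properties
// become facets; instances-as-maps is a simplification of RDF.
public class Facets {
    public static Map<String, Integer> facetCounts(
            List<Map<String, String>> instances, String facetProperty) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, String> instance : instances) {
            String value = instance.get(facetProperty);
            if (value != null) counts.merge(value, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map<String, String>> data = List.of(
                Map.of("rdf:type", "foaf:Person"),
                Map.of("rdf:type", "foaf:Person"),
                Map.of("rdf:type", "foaf:Group"));
        System.out.println(facetCounts(data, "rdf:type"));
    }
}
```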

The UI can show both authoritative and non-authoritative data and users can still distinguish between the two.

Velocity is used as the template engine. Lucene provides full-text search.

PortalCore predates SPARQL. Dave says it should migrate to SPARQL, especially because of LIMIT. Would be nice if it used Jena assemblers as well to specify the data to be used.

Big limitation: It’s just a browser, there’s no way to annotate resources.

(Personally I think that a good faceted browser needs heavy customization.)

Posted in General, Semantic Web | Comments Off on [juc] Dave Reynolds – PortalCore

[juc] Max Völkel – RDFReactor

RDFReactor generates Java objects from RDF schemas. This makes RDF much easier to use for the 90% of Java developers who are not RDF experts.

It’s hard to see the actual domain objects among all the triples. RDFReactor is like a pair of glasses that lets you view the triples through familiar Java objects.

For example, you could create a bunch of classes (Person, Group) from the FOAF schema and use them like this:

Person p1 = (Person) model.createInstance(Person.class, "http://www.example.com/ns/2005/#person1");
p1.setName("Joe");

Person p2 = (Person) model.createInstance(Person.class, "http://www.example.com/ns/2005/#person2");
p2.setName("Jim");

p1.addKnows(p2);

And as a result, a bunch of RDF statements end up in the model.

RDFReactor uses RDF2Go, an API that abstracts from different triple stores like Jena, Sesame, YARS, NG4J.

Very useful, and seems to Just Work.

(Now if you have something like this for every programming language (there’s ActiveRDF for Ruby, for example), then you could exchange objects between programming languages with RDF as an intermediate language.)

Posted in General, Semantic Web | 2 Comments