At the Semantic Desktop Hands-on Workshop

Anja, Arne and I are at the Semantic Desktop Hands-on Workshop in Kaiserslautern. Things are starting out well – we have coffee and wireless. This is going to be a fun weekend.


Coming soon: Jena 2.4

Jena team meeting notes:

Target date for release 2.4 completion: April 30th.

Very timely, with the Jena User Conference in May.

The main new feature seems to be support for assembler specifications, which are “RDF descriptions of how to construct a model and its associated resources, such as reasoners, prefix mappings, and initial content.”

This means applications that can work on top of different kinds of RDF models (e.g. RDF browsers, graph visualizers, RDF servers) now get a standardized config file format for setting up the models. It also means someone (most likely me) will have to write an assembler for D2RQ models.
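
Out of curiosity, here's a rough sketch of how this might look from the Java side. The file name, the URI and the exact usage are my guesses from the Jena documentation, not tested against the 2.4 release:

    import com.hp.hpl.jena.assembler.Assembler;
    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.Resource;
    import com.hp.hpl.jena.util.FileManager;

    public class AssemblerDemo {
        public static void main(String[] args) {
            // Load the assembler specification: an RDF document that
            // describes the model we want (file name is made up here).
            Model spec = FileManager.get().loadModel("model-spec.n3");
            Resource root = spec.getResource("http://example.org/config#myModel");
            // Let Jena build whatever the description says: a memory or
            // database model, with reasoner, prefix mappings and initial
            // content already attached.
            Model m = Assembler.general.openModel(root);
            m.write(System.out, "N3");
        }
    }

The nice part is that everything about the model is configuration, not code: swapping a memory model for a database-backed one would mean editing the RDF description, not recompiling the application.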


We’ll be right back, after the commercial

Silly Firefox ad.

(via fscklog)

Update: Link broke, here’s a working one.


Wisdom and mockery

Scott Adams on respecting the beliefs of others:

I fantasize about becoming President one day and insisting on settling the question of which religion is “right.” I’d assemble all the experts on history and religion and science, and televise them arguing the merits and evidence of their sides, with cross-examination and – most important – mocking. There would be no stop date for this debate. It would continue until even a child could recognize which positions are the most easily mocked. Sometimes that’s as close to wisdom as we can get.

I love the guy.


A triple store for Semantic MediaWiki?

The Semantic MediaWiki folks are evaluating RDF triple stores for use in their system. Currently, the PHP-based stores of RAP and ARC are on their shortlist, as well as the C-based Redland store.

This is interesting because MediaWiki powers Wikipedia, and if the Semantic MediaWiki folks do a good job, then maybe, just maybe, Wikipedia will use their stuff. And if that ain’t the killer app for the Semantic Web, then we might just as well go back to our CSV files.

I’m not involved with the Semantic MediaWiki project, and probably don’t know about a lot of important factors in their decision, but lack of knowledge has never stopped me from adding my €0.02.

So here’s my take on the triple store question: They should roll their own.

That’s going to be more work than using an existing store, but the work would pay off. I think a custom store would give them better performance, simplify code and database maintenance, and be an easier sell to the rest of the MediaWiki community. Here’s why.

Semantic MediaWiki will not use all of RDF. They have wisely chosen a subset that gives all the expressive power needed in a Wiki, without including all the complexity of the RDF model. In more detail:

  • Named relations connect wiki pages with each other; these are like OWL’s ObjectProperties.
  • Named attributes annotate wiki pages with literal values; these are like OWL’s DatatypeProperties.
  • Relations and attributes are disjoint because they are in different namespaces.
  • There’s a fixed set of datatypes for attributes, like dates and geographical coordinates.
  • Attributes can have units of measurement, but they will be encoded in the attribute name and therefore don’t play any role in storage design.
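
Mapped to code, the whole subset fits into two small value types. A sketch (all the names here are mine, not the project’s):

    // A relation links one wiki page to another, like an OWL ObjectProperty.
    // Pages are identified by MediaWiki's numeric page IDs.
    class Relation {
        final int subjectPageId;
        final String relationName;   // e.g. "is capital of"
        final int objectPageId;

        Relation(int subjectPageId, String relationName, int objectPageId) {
            this.subjectPageId = subjectPageId;
            this.relationName = relationName;
            this.objectPageId = objectPageId;
        }
    }

    // An attribute annotates a page with a typed literal value, like an
    // OWL DatatypeProperty. The datatype is fixed per attribute, and any
    // unit is encoded in the attribute name, so neither appears here.
    class Attribute {
        final int subjectPageId;
        final String attributeName;  // e.g. "population"
        final String lexicalValue;   // parsed according to the attribute's datatype

        Attribute(int subjectPageId, String attributeName, String lexicalValue) {
            this.subjectPageId = subjectPageId;
            this.attributeName = attributeName;
            this.lexicalValue = lexicalValue;
        }
    }

No URIs, no blank nodes, no datatype tags on individual values.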

A store for this data model would be a lot simpler than a generic RDF store, and may have a number of other benefits:

No need to store URIs: In RDF, resources are uniquely identified by URIs. URIs can get pretty long; they take up space in the database and make joins slower. That’s why many RDF stores normalize the schema by moving the URIs into an extra table and giving them numeric IDs. This saves a lot of space, but creates extra joins that cost time. I’m pretty sure that the MediaWiki database schema already comes with some kind of simple numeric ID for pages. This ID could be used to identify pages in the store. That would greatly simplify the whole affair.

No need for blank nodes and custom datatypes: They are not used in Semantic MediaWiki, no need to support them in the store.

No need to distinguish between literal and URI objects: Because attributes and relations are in different namespaces, the store designer should be able to get away with storing them in different tables. That simplifies the schema, is faster and reduces required space. The cost is a more complex query engine.

No need for named graphs: Good RDF stores support some form of named graphs because they make management of RDF data so much easier. Semantic MediaWiki doesn’t need them because the subject already identifies the “source” of each triple.

Possibly no need to store literals with different datatypes in the same table: If the datatype of an attribute is fixed (e.g. it’s encoded in the attribute name or can be looked up in an extra table), then the store could have separate tables for each datatype. That would be a huge performance bonus for queries because the database’s native datatyping and indexing can be used.
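
Putting the storage points together, the schema could be as simple as this. A sketch in Java/JDBC with made-up table and column names (Semantic MediaWiki itself is PHP, so this is only to show the table layout):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class SchemaSketch {
        public static void main(String[] args) throws Exception {
            Class.forName("com.mysql.jdbc.Driver");
            // Connection details are placeholders.
            Connection c = DriverManager.getConnection(
                    "jdbc:mysql://localhost/wikidb", "wiki", "secret");
            Statement s = c.createStatement();
            // Relations: page-to-page. Both ends are MediaWiki's numeric
            // page IDs, so there is no URI table and no extra join.
            s.executeUpdate("CREATE TABLE smw_relation ("
                    + " subject_id INT NOT NULL,"
                    + " relation_name VARCHAR(255) NOT NULL,"
                    + " object_id INT NOT NULL,"
                    + " INDEX (subject_id), INDEX (object_id))");
            // Attributes: one table per datatype, so the database's
            // native types and indexes do the work.
            s.executeUpdate("CREATE TABLE smw_attribute_integer ("
                    + " subject_id INT NOT NULL,"
                    + " attribute_name VARCHAR(255) NOT NULL,"
                    + " attr_value BIGINT NOT NULL,"
                    + " INDEX (attribute_name, attr_value))");
            s.executeUpdate("CREATE TABLE smw_attribute_date ("
                    + " subject_id INT NOT NULL,"
                    + " attribute_name VARCHAR(255) NOT NULL,"
                    + " attr_value DATETIME NOT NULL,"
                    + " INDEX (attribute_name, attr_value))");
            c.close();
        }
    }

The price, as said above, is a query engine that has to know which table to look in for each attribute.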

No RDF in the core application: If the store works with wiki pages and relations and attributes, instead of URIs and RDF triples and properties, then even non-RDF developers will stand a chance of understanding what it does. RDF itself would be relegated to the edges of the system (RDF/XML export and querying). I don’t know about the politics of MediaWiki, but I imagine that this might be a factor in getting buy-in from the core developers. They may prefer a system that is well integrated with the rest of MediaWiki and uses their terminology to some nebulous free-standing RDF module that they don’t understand.

It can be built and integrated piece by piece: One could start with attributes only, then add relations later, then add RDF/XML output and sophisticated querying. Attributes alone seem to be the minimal useful subset of features. Attributes plus MediaWiki’s existing category system should already enable some interesting stuff, like getting a list of all countries ordered by size. While RDF folks may not be impressed by this, MediaWiki folks may find it sufficiently useful and non-threatening to be interested. But again, I know nothing about the project’s politics, so this is just speculation.
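
To make the countries example concrete: against the sketched schema, it’s a single join with MediaWiki’s existing categorylinks table. Again a sketch; categorylinks and its cl_from/cl_to columns are real MediaWiki, everything else is my invention:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class CountriesBySize {
        public static void main(String[] args) throws Exception {
            Class.forName("com.mysql.jdbc.Driver");
            Connection c = DriverManager.getConnection(
                    "jdbc:mysql://localhost/wikidb", "wiki", "secret");
            // All pages in the "Countries" category, ordered by their
            // (hypothetical) numeric "area" attribute.
            PreparedStatement ps = c.prepareStatement(
                    "SELECT cl.cl_from, a.attr_value"
                    + " FROM categorylinks cl"
                    + " JOIN smw_attribute_integer a ON a.subject_id = cl.cl_from"
                    + " WHERE cl.cl_to = ? AND a.attribute_name = ?"
                    + " ORDER BY a.attr_value DESC");
            ps.setString(1, "Countries");
            ps.setString(2, "area");
            ResultSet rs = ps.executeQuery();
            while (rs.next()) {
                System.out.println(rs.getInt(1) + "\t" + rs.getLong(2));
            }
            c.close();
        }
    }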

How hard is it to build this thing? It’s hard, but not that hard. Both RAP’s and ARC’s database stores have been implemented by one person in a relatively short timeframe, and both support all of RDF. And, with the approach I’ve outlined above, you can get something useful out of your work way before you’ve finished the whole thing.

Anyway, this question is at the intersection of many of my interests (wikis, Wikipedia, SPARQL, databases, PHP). I want to be able to query Wikipedia’s SPARQL endpoint by the end of 2007 or so, and this is what I consider the most realistic way to get there.
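
For the record, this is the sort of query I’d like to be running by then. The endpoint URL and the property vocabulary are pure fiction; the plumbing is Jena’s ARQ:

    import com.hp.hpl.jena.query.QueryExecution;
    import com.hp.hpl.jena.query.QueryExecutionFactory;
    import com.hp.hpl.jena.query.QuerySolution;
    import com.hp.hpl.jena.query.ResultSet;

    public class WikipediaSparqlDemo {
        public static void main(String[] args) {
            // An entirely hypothetical Wikipedia SPARQL endpoint.
            String endpoint = "http://wikipedia.example.org/sparql";
            String query =
                  "PREFIX attr: <http://wikipedia.example.org/attribute#>\n"
                + "SELECT ?country ?area\n"
                + "WHERE { ?country attr:area ?area }\n"
                + "ORDER BY DESC(?area)";
            QueryExecution qe = QueryExecutionFactory.sparqlService(endpoint, query);
            try {
                ResultSet results = qe.execSelect();
                while (results.hasNext()) {
                    QuerySolution row = results.nextSolution();
                    System.out.println(row.get("country") + "\t" + row.get("area"));
                }
            } finally {
                qe.close();
            }
        }
    }

Same question as the SQL version above, but asked from outside, by anyone, against live Wikipedia data.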


Behaviour-Driven Development

A new micro-movement in the software development community attempts to reframe test-driven development as behaviour-driven development. The point is to overcome the apparent backwardness of test-first development: if I want to test something, I need to have that something first, right? So writing tests for something that doesn’t even exist is a pretty weird concept.

Call it behaviour-driven development (BDD) and it all makes sense. You are defining your code’s behaviour before actually writing it.

There’s a framework called jBehave (dig the logo!) that’s a kind of JUnit for BDD. Instead of unit tests with names like this:

testCreation
testAddOneWidget
testFindWidgetByName

you would write behaviour assertions like this:

shouldBeEmptyOnCreation
shouldBeSizeOneAfterAddingOneWidget
shouldFindRightWidgetByName

Well, that’s just changing some words, but it seems like an important conceptual leap to me. We are not writing tests anymore. We are defining the behaviour of code in an automatically verifiable way.
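
To make that concrete, here’s a minimal sketch of a behaviour class. I’m using plain Java assertions instead of jBehave’s own verification API, which I haven’t studied yet (run with -ea to enable them):

    import java.util.ArrayList;
    import java.util.List;

    // The class being specified: a trivial container of named widgets.
    class WidgetContainer {
        private final List<String> names = new ArrayList<String>();
        public void add(String name) { names.add(name); }
        public int size() { return names.size(); }
        public String findByName(String name) {
            return names.contains(name) ? name : null;
        }
    }

    // Each method reads as a sentence about expected behaviour:
    // "WidgetContainer should be empty on creation", and so on.
    public class WidgetContainerBehaviour {
        public void shouldBeEmptyOnCreation() {
            assert new WidgetContainer().size() == 0;
        }

        public void shouldBeSizeOneAfterAddingOneWidget() {
            WidgetContainer c = new WidgetContainer();
            c.add("knob");
            assert c.size() == 1;
        }

        public void shouldFindRightWidgetByName() {
            WidgetContainer c = new WidgetContainer();
            c.add("knob");
            c.add("dial");
            assert "dial".equals(c.findByName("dial"));
        }
    }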

I’ve also heard the term example-driven development for the same concept.

Something to keep an eye on.

(via Tim Bray)


A sigh of relief from a million web developers

I guess I should pay a bit more attention to the Microsoft side of the world. It took me three weeks to notice that Internet Explorer 7 is available as a beta download. Anybody used it yet? Is it better than Firefox?


2006 Jena User Conference schedule

The schedule for JUC2006, the Jena User Conference in Bristol (May 10th and 11th), has been finalized and is online. Looks like quite an interesting mix: there are applications, there’s infrastructure work, there are tutorials, and there’s die-hard research.

(I’ll probably be there, though I still have to make up my mind; I’d have to squeeze the Bristol trip in between next week’s Semantic Desktop workshop in Kaiserslautern and WWW2006 in late May.)


SPARQL/AJAX JavaScript library

SPARQL JavaScript library

This looks very cool. I imagine it takes most of the pain out of doing AJAX over SPARQL.


Wiki law-making

Wikocracy is a wiki where users can edit the text of US laws – the Constitution, the Patriot Act, the DMCA and many others.

It’s a fascinating idea. Wikipedia has demonstrated that wikis can be a great way to hammer out consensus, even over hotly contested topics. Could the same consensus-building process be applied to lawmaking, where the stakes are much higher?

(via David Weinberger)
