A triple store for Semantic MediaWiki?

The Semantic MediaWiki folks are evaluating RDF triple stores for use in their system. Currently, the PHP-based stores of RAP and ARC are on their shortlist, as well as the C-based Redland store.

This is interesting because MediaWiki powers Wikipedia, and if the Semantic MediaWiki folks do a good job, then maybe, just maybe, Wikipedia will use their stuff. And if that ain’t the killer app for the Semantic Web, then we might just as well go back to our CSV files.

I’m not involved with the Semantic MediaWiki project, and probably don’t know about a lot of important factors in their decision, but lack of knowledge has never stopped me from adding my €0.02.

So here’s my take on the triple store question: They should roll their own.

That’s going to be more work than using an existing store, but that work will pay off. I think that a custom store would give them better performance, would simplify code and database maintenance, and would be an easier sell to the rest of the MediaWiki community. Here’s why.

Semantic MediaWiki will not use all of RDF. They have wisely chosen a subset that gives all the expressive power needed in a Wiki, without including all the complexity of the RDF model. In more detail:

  • Named relations connect wiki pages with each other; these are like OWL’s ObjectProperties.
  • Named attributes annotate wiki pages with literal values; these are like OWL’s DatatypeProperties.
  • Relations and attributes are disjoint because they are in different namespaces.
  • There’s a fixed set of datatypes for attributes, like dates and geographical coordinates.
  • Attributes can have units of measurement, but these are encoded in the attribute name and therefore play no role in storage design.

A store for this data model would be a lot simpler than a generic RDF store, and may have a number of other benefits:

No need to store URIs: In RDF, resources are uniquely identified by URIs. URIs can get pretty long; they take up space in the database and make joins slower. That’s why many RDF stores normalize the schema by putting the URIs into an extra table and giving them numeric IDs. This saves a lot of space, but creates extra joins that cost time. I’m pretty sure that the MediaWiki database schema already comes with some kind of simple numeric ID for pages. This ID could be used to identify pages in the store. That would greatly simplify the whole affair.
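
To make that concrete, here’s a rough sketch of how a relation between two pages could be stored as three integers. This is MySQL, smw_relation is a name I just made up, and I’m assuming MediaWiki’s page table with a numeric page_id primary key:

    -- Sketch only: smw_relation is an invented name, and the page table
    -- with its numeric page_id key is an assumption about MediaWiki's schema.
    CREATE TABLE smw_relation (
      subject_id  INT UNSIGNED NOT NULL,  -- page_id of the annotated page
      relation_id INT UNSIGNED NOT NULL,  -- page_id of the relation's own page
      object_id   INT UNSIGNED NOT NULL,  -- page_id of the target page
      INDEX (subject_id),
      INDEX (relation_id, object_id)
    );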

No need for blank nodes and custom datatypes: They are not used in Semantic MediaWiki, so there’s no need to support them in the store.

No need to distinguish between literal and URI objects: Because attributes and relations are in different namespaces, the store designer should be able to get away with storing them in different tables. That simplifies the schema, speeds up queries, and reduces storage space. The cost is a more complex query engine.
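
In schema terms, the split could pair the relation table sketched above with an attribute table whose value column holds a literal rather than a page ID (again, all names invented for illustration):

    -- Sketch only: attributes live in their own table, so no row needs a
    -- "literal or page?" flag, and the value column can hold raw literals.
    CREATE TABLE smw_attribute (
      subject_id   INT UNSIGNED NOT NULL,  -- page_id of the annotated page
      attribute_id INT UNSIGNED NOT NULL,  -- page_id of the attribute's own page
      value        BLOB NOT NULL,          -- serialized literal value
      INDEX (subject_id),
      INDEX (attribute_id)
    );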

No need for named graphs: Good RDF stores support some form of named graphs because they make management of RDF data so much easier. Semantic MediaWiki doesn’t need them because the subject already identifies the “source” of every triple.

Possibly no need to store literals with different datatypes in the same table: If the datatype of an attribute is fixed (e.g. it’s stored in the name or can be looked up in an extra table), then the store could have separate tables for each datatype. That would be a huge performance bonus for queries, because the database’s native datatypes and indexes can be used.
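
Continuing the sketch, the single attribute table from above would fan out into one table per datatype, each with a native column type and a usable index:

    -- Sketch only: per-datatype tables let range queries use real numeric
    -- and date indexes instead of comparisons on serialized strings.
    CREATE TABLE smw_attribute_integer (
      subject_id   INT UNSIGNED NOT NULL,
      attribute_id INT UNSIGNED NOT NULL,
      value        BIGINT NOT NULL,
      INDEX (attribute_id, value)
    );

    CREATE TABLE smw_attribute_date (
      subject_id   INT UNSIGNED NOT NULL,
      attribute_id INT UNSIGNED NOT NULL,
      value        DATETIME NOT NULL,
      INDEX (attribute_id, value)
    );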

No RDF in the core application: If the store works with wiki pages, relations and attributes, instead of URIs, RDF triples and properties, then even non-RDF developers will stand a chance of understanding what it does. RDF itself would be relegated to the edges of the system (RDF/XML export and querying). I don’t know about the politics of MediaWiki, but I imagine that this might be a factor in getting buy-in from the core developers. They may prefer a system that is well integrated with the rest of MediaWiki and uses their terminology to some nebulous free-standing RDF module that they don’t understand.

It can be built and integrated piece by piece: One could start with attributes only, then add relations later, then add RDF/XML output and sophisticated querying. Attributes alone seem to be the minimal useful subset of features. Attributes plus MediaWiki’s existing category system should already enable some interesting stuff, like getting a list of all countries ordered by size. While RDF folks may not be impressed by this, MediaWiki folks may find it sufficiently useful and non-threatening to be interested. But again, I know nothing about the project’s politics, so this is just speculation.
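
To pick up the countries-by-size example with the invented tables from above: it could become a plain SQL join against MediaWiki’s categorylinks table. I’m assuming its cl_from/cl_to columns, an integer “Area” attribute, and that @area_attribute_id holds the page_id of that attribute’s page:

    -- Hypothetical query: all pages in Category:Countries, ordered by an
    -- integer "Area" attribute stored in the sketched smw_attribute_integer.
    SELECT p.page_title, a.value AS area
    FROM page p
      JOIN categorylinks c ON c.cl_from = p.page_id
      JOIN smw_attribute_integer a ON a.subject_id = p.page_id
    WHERE c.cl_to = 'Countries'
      AND a.attribute_id = @area_attribute_id
    ORDER BY a.value DESC;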

How hard is it to build this thing? It’s hard, but not that hard. Both RAP’s and ARC’s database stores have been implemented by one person in a relatively short timeframe, and both support all of RDF. And, with the approach I’ve outlined above, you can get something useful out of your work way before you’ve finished the whole thing.

Anyway, this question is at the intersection of many of my interests (wikis, Wikipedia, SPARQL, databases, PHP). I want to be able to query Wikipedia’s SPARQL endpoint by the end of 2007 or so, and this is what I consider the most realistic way to get there.
