Creating an RDF vocabulary: Lessons learned

With tools like Neologism and OpenVocab, creating an RDF vocabulary is easy. But if your goal is re-use within a wider community, you will face many questions that are not so easy to answer:

  • How much work is it going to be and what timeframe is realistic?
  • How broadly and how deeply should you cover the domain? Where to stop?
  • Work alone or seek collaborators?
  • Should you start by setting up a mailing list, or by producing a first draft?
  • How much documentation do you need to produce?
  • Whose feature requests and modeling ideas should you heed and whose ignore?
  • How to keep pushing towards the uncertain goal of “adoption” in the face of limited time?

A few days ago, the VoID vocabulary became a W3C SWIG Note. VoID started in 2008 as a loose collaboration between Jun Zhao, Keith Alexander, Michael Hausenblas and me. We published a first non-W3C version in 2009. The W3C publication is a nice milestone for us, and I thought this a good opportunity to share some of the lessons I have learned along the way.

I will focus on process and collaboration in this post, and say little about modeling practices or publishing tools or RDFS/OWL geekery.

Lesson #1: Work in a team. Three or four people, each with their own use cases or data, might be ideal. A team ensures that a variety of use cases are covered; it keeps fluctuations in any one person’s available time from stalling the project; it tempers any one contributor’s strong personal style in the modeling and design; and it widens the network available for reaching out to potential users. Having a team of a few motivated people is perhaps the most important factor for success.

Lesson #2: Take your time. For all of us, VoID was a low-priority “background task”. We all get paid for doing other things. Inevitably, progress was often slow, with months where literally nothing happened. I probably averaged less than an hour of VoID work per week (with occasional major bursts of activity).

And that might be the best way. Progress in vocabulary design is not measured by how quickly one produces a polished spec. Progress is learning about the needs of the potential user community. Going slow means more opportunity for feedback at every stage, and reduces the risk of creating something that nobody needs.

We also moved the vocabulary to a different host twice in the process. This worked out ok because we could retain the original namespace URI throughout the moves, but it definitely shows the advantage of going with something like purl.org from the start.
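
To make the point about persistent namespaces concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the `purl.example.org` namespace, the two host URLs, and the `term_uri` and `resolve` helpers are assumptions, not VoID's actual setup or the API of any real PURL service. It only shows the idea that term URIs stay fixed while a single redirect entry follows the hosting around.

```python
from urllib.parse import urldefrag

# Hypothetical persistent namespace. Published data refers to terms under it.
PERSISTENT_NS = "http://purl.example.org/vocab/demo#"

# Hypothetical redirect table, roughly what a PURL-style service maintains:
# persistent document URL -> URL where the vocabulary file currently lives.
redirects = {
    "http://purl.example.org/vocab/demo": "http://host-a.example.org/demo.rdf",
}


def term_uri(local_name: str) -> str:
    """Build a term URI from the persistent namespace."""
    return PERSISTENT_NS + local_name


def resolve(uri: str) -> str:
    """Return the current location of the document describing `uri`."""
    # The fragment (e.g. #Dataset) is not sent to the server, so drop it.
    document_url = urldefrag(uri).url
    return redirects[document_url]


dataset = term_uri("Dataset")
before = resolve(dataset)

# Moving to a new host changes one redirect entry; the term URI is untouched.
redirects["http://purl.example.org/vocab/demo"] = "http://host-b.example.org/demo.rdf"
after = resolve(dataset)

print(dataset, before, after, sep="\n")
```

The term URI that users put into their data is identical before and after the move; only the redirect target differs.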

Lesson #3: Use a public issue tracker. This is crucial, even if you work alone. It adds structure to the work process and helps to ensure that no balls get dropped. Some issues will remain unresolved for long periods of time, and you need a place for collecting the random comments, discussions, related links, proposed text for changes and so on.

I think it’s important to use a tracker that is easy to work with, ideally one that the contributors are already familiar with. We used the one from Google Code. It’s simple and just works.

Setting up a Google Code project for developing the vocabulary worked very well for us. Besides the tracker, we also used the SVN repository for the spec, and the simple wiki for random bits of information, like lists of deployments, and examples that didn’t fit into the spec.

Don’t try to use a wiki or Google Doc or other funky collaboration device in place of an issue tracker. I’ve seen that done elsewhere and it doesn’t work.

Lesson #4: Perfection can wait till the next version. This sounds banal, but is so important. At some point quite a while ago, we were all quite fed up and just wanted to get something out the door. So we decided not to tackle a lot of difficult open issues. We told ourselves that we would just do them in a second version. This turned out to be immensely liberating.

After version 1, we took a long break, and then started to work on version 2. Now we knew that deferring to the next version is always an option (which we used liberally). Not really clear if that use case is worth the effort? Defer. Not enough evidence or experience to inform the design? Defer. Two pig-headed contributors (that is, Keith and me) can’t agree on a design? Defer.

Lesson #5: Regular Skype calls. This one might be controversial, because no one likes wasting time in weekly conference calls. But I think it worked well for us. We didn’t quite do weekly calls, but scheduled them ad hoc, averaging perhaps one every two weeks. Often, the only progress between calls was that one of us felt a bit of shame and quickly did one or two of their actions in the thirty minutes before the call. This adds up over the months and makes sure that there is slow but steady progress.

We took turns chairing and scribing. The chair would take us through the agenda (typically “review open actions; review issues list; discuss particularly thorny issue XYZ; AOB; schedule next call”) and interrupt any discussion that started to go circular. The scribe would note whenever someone took an action to do something, and afterwards email a list of those and the date for the next call. A good call duration is somewhere between 60 and 90 minutes.

Lesson #6: Have a working draft of the spec from day one. Even if it’s just a few scribbles. Call them your working draft and take it from there. Then get into the habit of focussing any discussion on the question: What change should be made to the text? Arguing about words that should go into the text is much more productive than the alternative, which is arguing about who is right or wrong. Ideally, whenever people start to disagree, they should draft up competing change proposals to be discussed in the next call.

Besides the spec text in SVN, we used Neologism to create and publish the actual RDFS vocabulary specification.

Lesson #7: Public mailing list is optional. Don’t you hate signing up to yet another mailing list? Me too. We started with a private mailing list, and found that its only real use was for notifications from the issue tracker. Discussion happened on Skype or in the tracker. We put external comments into the tracker too and discussed them there. This worked well.

This is about the creation phase of the vocabulary. It might be a different story once you get a bit of a user community going. We now have a public discussion list.

Lesson #8: Start over a beer and a large piece of paper. If you can. With everyone physically in the same room. That’s how we did it anyway, at a conference, and it was quite helpful for figuring out a core part of the vocabulary that seemed uncontroversial. Most of that time was spent arguing about (I’m sure this will come as no surprise to you) a name for the project.


4 Responses to Creating an RDF vocabulary: Lessons learned

  1. John Samuel says:

    Is there any RDF vocabulary for representing commands? I mean Linux commands or any other terminal commands?

  2. zazi says:

    Congrats, Richard, excellent post!

    In addition, I’d like to add some experiences I have had over the last few months, when I (co-)designed several ontologies and/or proposed enhancements to existing ones (see here for an overview). I may call it the “lone warrior” style ;)

    Re. lesson #1: I prefer to work in a team, too. Unfortunately, it is not always possible to assemble a few people to work on a specific ontology, e.g., because you do not have the time or capacity to do so. That is why I mainly chose the (more general) community approach, i.e., I proposed my thoughts, drafts and changes through different communication channels, e.g., mailing lists or chat channels. Sometimes I got some feedback. However, I often got little to no reaction. So I always had to remind myself to expect nothing, which made me even happier when I did get some feedback ;) Although the feedback cycles were sometimes really fast, this approach has disadvantages:
    - people often do not have the time to look closely enough into a specific subject to give in-depth feedback (especially in chat channels, which is quite understandable)
    - some people get annoyed by cross-posting, which is, on the other hand, a way to reach a broader audience (this is understandable, too, so I have mostly stopped posting such announcements on mailing lists)
    Finally, you have to expect nothing even when a team already exists that developed the ontology you are reusing or proposing changes to (team work?).

    Re. lesson #2: This generally makes a lot of sense, although, as I already mentioned in my comment on lesson #1, it is not always possible. There is always some time pressure. My experience is that many people don’t give a s*** about ontology design and have no idea how long a proper design of a new vocabulary takes. Often they view it like the rapid design of a proprietary database schema. Besides, I’m not in a position where I get paid for doing other things. I have to present solutions as fast as possible (for free, anyway).

    Re. lesson #3: Thanks a lot for opening my eyes by outlining the advantages of an issue tracker compared to a mailing list. I guess I will add this feature, too.

    Cheers

    • Good points, zazi. I guess I’d also advocate looking for some collaborators because it ensures that there is some minimum interest in the topic.

      Regarding the recommendation to take your time, this doesn’t mean spending a lot of time overall. My point is that it’s better to spend an hour per week for 40 weeks than to spend one week full-time. The total effort is about the same, but this way you will get more feedback that you can still take into account during the design process.

      Anyways, I’m speaking from the experience of creating one vocabulary only, so my observations are by no means a definitive account of the vocabulary creation process…