So I wrote a rather angry private email to Erik Wilde a few days ago, complaining about his use of the term “linked data” for a site that doesn’t follow the linked data practices. Erik decided to publish my email on his blog, along with a long defense of his use of the term, in a post called “The Linked Data™ Police”. Since it’s public now, we might as well see if we can get a useful discussion out of this.
First, I realize that Erik probably responded more to the tone of my email than to its content. It was an angry rant, and my tone missed the mark, so his response is fair enough, and I have apologised to him. I also have to say that I’m speaking only for myself and nobody else. Before others become scared of the “Linked Data™ Police”, I can assure them that to the best of my knowledge, it has a staff of one, and my peers in the community are in general a friendly and civil lot.
The site in question. So, what is this about? Erik and team have built recovery.berkeley.edu, a site that publishes structured data about Recovery Act spending. The site was built with a grant from the Sunlight Foundation. The technologies of choice are Atom and various other XML formats. As far as I can see, it’s excellent in its adherence to REST principles, including good URI design and “hypermedia as the engine of application state”. These two together are labelled as “linked data” in the site’s technical documentation.
This is a discussion about names, and not about substance. At the very core is the following question: Should we understand “linked data” to mean “the idea of somehow connecting pieces of data with links”, or should we take it to mean “RDF published according to the rules outlined by Tim Berners-Lee in the design note that coined the term”?
Obviously, I’m of the latter opinion. In this post, I want to do two things: First, I want to respond to some specific points from Erik’s post. I will do this by paraphrasing each point, and then responding to it. Second, I want to explain why I care about the matter and why I think that “linked data” should continue to be associated with Tim’s rules, and why advocates of different sets of rules should use different terms.
Erik: “The attitude is scary: Instead of figuring out the most effective way of adding more semantics to the web, it starts with a set of technologies and claims that whatever you want to do, you have to use those.” Linked data didn’t start with a set of technologies; a lot of deliberation by a lot of people went into the choice. Also, I have no quarrel with Erik’s choice of technologies, and I didn’t even suggest that he should or shouldn’t use any technology. There can be good reasons against using RDF, and it’s a good thing that innovation continues in other areas of web data technology.
But if Erik and colleagues don’t buy into the set of technology choices commonly called “linked data”, then why would they insist on using that name? What’s wrong with the established technical terms, REST and Resource-Oriented Architecture? Is it just because those are already way past the peak of the hype cycle?
Erik: “Using generic problem names to refer to specific technologies only confuses people.” Erik mentions Linked Data, XML Schema, Semantic Web and Web Services as specific technologies that he considers to be badly labelled. But the peculiar thing about those four is not their names; they are controversial for other reasons. The IT world is full of specific technologies that use generic names: World Wide Web. Structured Query Language. Extensible Markup Language. Hypertext Transfer Protocol. Portable Document Format. Resource-Oriented Architecture. Scalable Vector Graphics. Open Document Format. Erik may not like it, but it’s a common practice.
Erik: “Choosing such names is usually an attempt to make competition harder.” Usually? I doubt that. Technologies are named in their very infancy, when their future and success are far from certain, and when competition is usually not an issue. The naming is usually an attempt to communicate as clearly as possible what the proposed technology is supposed to achieve, which is not a bad thing at all. Some fail to achieve the goal, but everyone designs (and names) on the assumption that it can eventually be achieved.
Erik: “RDF is just a stylesheet away.” Erik points out that it would be trivial to create a GRDDL transform that translates from the service’s output to RDF. Personally I wouldn’t call it trivial, and being just one transformation away from being compatible is not the same as being compatible. If there were GRDDL transforms in place, I would have no reason at all to complain, although just a few linked data clients support GRDDL at this time.
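To make the “one transformation away” point concrete, here is a minimal sketch of what such a transform has to do. It is written in Python rather than as the XSLT stylesheet that GRDDL would actually use, and everything specific in it is an assumption for illustration: the sample Atom entry, the example.org URIs, and the choice of Dublin Core terms as the target vocabulary are hypothetical and not taken from the actual service.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
DCTERMS = "http://purl.org/dc/terms/"

# Hypothetical Atom entry, made up for this sketch.
SAMPLE = """<entry xmlns="http://www.w3.org/2005/Atom">
  <id>http://example.org/awards/42</id>
  <title>Example award</title>
  <link rel="related" href="http://example.org/agencies/7"/>
</entry>"""


def atom_entry_to_ntriples(xml_text):
    """Map one Atom entry to a list of N-Triples lines.

    The mapping (title -> dcterms:title, related link -> dcterms:relation)
    is an assumed one; a real transform would have to pick vocabularies
    that fit the data.
    """
    entry = ET.fromstring(xml_text)
    subject = entry.findtext(ATOM + "id")
    lines = []

    title = entry.findtext(ATOM + "title")
    if title is not None:
        # Escape backslashes and quotes for the N-Triples literal.
        escaped = title.replace("\\", "\\\\").replace('"', '\\"')
        lines.append('<%s> <%stitle> "%s" .' % (subject, DCTERMS, escaped))

    for link in entry.findall(ATOM + "link"):
        if link.get("rel") == "related":
            lines.append(
                "<%s> <%srelation> <%s> ."
                % (subject, DCTERMS, link.get("href"))
            )
    return lines


if __name__ == "__main__":
    for line in atom_entry_to_ntriples(SAMPLE):
        print(line)
```

Even in this toy form, somebody has to decide which vocabulary terms the XML elements map to, and until that mapping is written and published alongside the data, a linked data client has nothing to consume.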
Back to the roots. So where did the term “linked data” come from? To the best of my knowledge, Tim Berners-Lee coined it in his 2006 Design Note that is titled “Linked Data”. The document introduced the four rules that are now known as the “Linked Data Principles.” Erik’s service is following all of them except the one that demands RDF or SPARQL.
It’s worth pointing out that the four rules did not mention RDF when Tim originally published them, but it is clear from the rest of the document that the use of RDF was implied. The document was aimed at the semantic web community. His later edit, which added the explicit mention of RDF, was a clarification, not a change of intention.
I don’t know why Tim wrote this piece back in 2006, but my interpretation was that he wanted more people to publish data that can be browsed with his Tabulator RDF browser, and most RDF out there at that time couldn’t be browsed because of problems with one of the four rules. So I read it as a call for better interoperability among RDF publishers.
Broadening the term? There have been a number of calls for broadening the meaning of the term, most eloquently from Paul Miller, so Erik is certainly not alone in his view. Their intention is to get linked data into the mainstream more quickly, which is a goal that I share.
The problem is that broadening a term makes it less meaningful. There is a danger that the term gets extended to the point where it’s as meaningless as other buzzwords such as Web 3.0 or the venerable Semantic Web. If you can use other formats instead of RDF, then why not also use SOAP instead of HTTP? Why not do away with the URIs? Why not YQL instead of SPARQL? Where does “linked data” stop? Everything is somehow “data” and somehow “linked.”
Interoperability requires choices to be made. In my eyes, the great thing about the term “linked data” is that it has a reasonably precise technical definition, rooted in Tim’s Design Note and the early work of the Linking Open Data project. That work has turned the Semantic Web’s compelling but vague promises of a side-by-side “web for humans” and “web for machines” into concrete guidelines that people can actually implement, and the result is an ecosystem of interoperable tools, clients and datasets that continues to grow around these guidelines.
These guidelines will continue to evolve with the emergence of new technologies (e.g., RDFa) and increasing experience and maturity (e.g., importance of licensing and provenance handling).
But at the core, it has to be about a set of concrete technology choices and deployment practices that foster an interoperable ecosystem of data sources and clients. “Linked data” is the best name we have for that particular set of technology choices and practices. There is nothing magic about the name “linked data”; to the best of my knowledge, it didn’t exist at all in the web community before 2006. The term has gained popularity because it has associated rules that tell you how to do it, not because of the “words”. Without the rules, the term would be meaningless fluff. Everything is somehow “linked” and somehow “data.”
If you think that a different set of rules would work better (which is entirely possible), then it would be prudent to write them down, coin a new term for them, and start the legwork of advertising them, just as Tim has done since 2006.
I think there are two issues here:
1- the propensity of a few Linked Data folks to try to impose their view of the world, and to feel strangely threatened by anyone advocating a more flexible approach to fulfilling the vision of the semantic web. This vision is not vague, but it does not prescribe the methods to use to get there. I think that’s a good thing, because clearly Linked Data is only one possible way to get there, and on examination it might turn out to have some fundamental flaws (e.g. it appears that computing queries across geographically-distributed triples/URIs is far too slow, and other things of that sort). In any case, linked data is missing the “semantization” part: it only carries the semantics, it does not create them, and thus needs to be completed by a concept extraction mechanism.
The defensiveness of those Linked Data people is a huge problem that limits the good will of people who could otherwise be good evangelists for linked data, like Erik and me, and it does bring up terms like “Police” (Erik) and “Inquisition” (from me).
As you rightly note, Richard, a lot of that has to do with reacting to tones, which has to do with how you see the world. So I’d argue the linked data folks in question should accept that whatever they say, there will be debates about alternative approaches, until they have proven practically that linked data is the only approach (if that’s really what you think), or alter their perspective to embrace the possibility that linked data is just one method among many to get to the semantic web.
Ultimately, beyond some major implementation question marks, the issue for me is that linked data is a proxy to building more intelligence into the web, by injecting it into the data itself. There will always be competing approaches not trying to inject more semantics into the data format itself, but deriving those from a more algorithmic approach, and identifying relevant links (data or document-level) after that. I think this approach is perfectly viable, and I see linked data as a pursuit that also is useful – but it has not made its case to deserve the title of “exclusive semantic web technology”, just yet…
2- the difficulty of using terms exactly as defined by a technology community, when those terms are indeed so generic. “Linked Data” makes one think of “Linked” and “Data”. “Semantic Web” makes one think of a Web that is Semantic, i.e. one that would understand meaning. While I would agree at this point that it’s probably best to leave Linked Data (with block letters) to describe TBL’s stack, you won’t be able to suppress its use by people who think it simply means linking data at the data level. And again, alternative approaches to doing just that are useful, as I don’t think RDF fully cuts it. But my point is, you need to be more flexible and let people use it as they like. You could simply point out that Linked Data already refers to an established stack, and that they could perhaps use “linked data” without block letters, rather than crack down and alienate the newcomers. This has happened way too often, and as a market development specialist (I’m not a technical expert on linked data, but I do proudly think I am one in market adoption and other related business matters…), I will point out once again that this does a huge disservice to your cause. It simply is not effective.
I hope those thoughts help further the discussion.
I think you’re essentially correct in this matter, and you hit the nail on the head when you wrote “Without the rules, the term would be meaningless fluff. Everything is somehow ‘linked’ and somehow ‘data’”.
When this kind of thing is done for business marketing purposes, it is understandable (albeit undesirable), but when it happens in a more academic setting, it lends credence to a redefinition – and subsequent abuse – of the term that is both premature and unhelpful.
For those of us not fortunate enough to be directly employed in this area, the multitude of (often unfamiliar) technologies and best practices related to the Semantic Web are hard enough to figure out and to keep up with, without prominent researchers casually redefining the vocabulary.
I say “casually” here, even though the document in question does explicitly mention the deviation from accepted usage of the term, simply because there does not seem to be any actual reason to have done it, other than for a kind of convenience that is like calling a lettuce and tomato sandwich a BLT; after all, it’s only a slice of bacon away.
While your tone was indeed a little “snooty”, and did not elicit a constructive response directly, the ensuing publicity might actually serve the point better. Hopefully a lot more people will now think twice before using the label Linked Data too loosely, not necessarily for fear of the Linked Data Police, but just for having thought about it at all…