Update: Evan Sandhaus reports that all the issues mentioned below will be fixed. Great!
Yesterday at the International Semantic Web Conference, Evan Sandhaus of the New York Times unveiled data.nytimes.com, a site that publishes linked data for some parts of the Times’ index. To me, this was one of the most exciting announcements at the conference, and it caused quite a tweetstorm during and after Evan’s talk.
A bit of background: Every article published in the newspaper or on the website is tagged, classified and categorized in many ways by skilled editors. This metadata allows the creation of topic pages that automatically collect relevant articles for notable people, organisations, and events. Examples include Michelle Obama, Swine Flu (H1N1 Virus) and Wrestling.
What’s in the data? The dataset published yesterday contains information on each of the concepts that have a topic page. For now, it is limited to topic pages about people. The concepts are modelled in SKOS. The information attached to each concept consists mostly of links: to DBpedia, to Freebase, into the Times API (which is not available as RDF at this point), and of course to the corresponding topic page. This means that if you have a DBpedia URI for an especially notable entity, a high-quality New York Times topic page with the latest news about the topic is only two RDF links away. A notable feature of the links is that every single one has been manually reviewed, making this perhaps the highest-quality linkset in the LOD cloud.
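To make the “two RDF links away” claim concrete, here is a minimal Turtle sketch of the two hops, reusing the Michelle Obama data shown later in this post (the prefix URIs are my assumptions, not taken from the actual dataset):

@prefix owl: <http://www.w3.org/2002/07/owl#> .
# The nyt: namespace URI below is a guess, for illustration only.
@prefix nyt: <http://data.nytimes.com/elements/nytd/> .

# Hop 1: starting from the DBpedia URI, an owl:sameAs link
# (which is symmetric) leads to the Times concept.
<http://data.nytimes.com/N13941567618952269073>
    owl:sameAs <http://dbpedia.org/resource/Michelle_Obama> ;
    # Hop 2: the concept points to its topic page with the latest news.
    nyt:topicPage <http://topics.nytimes.com/top/reference/timestopics/people/o/michelle_obama/index.html> .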
How to get the data? This being linked data, every concept has a dereferenceable URI. For example, the concept for Michelle Obama is identified by:
http://data.nytimes.com/N13941567618952269073
The site’s URI scheme follows one of the Cool URIs recipes: The identifier above is resolvable, and by using content negotiation, web browsers are redirected to
http://data.nytimes.com/N13941567618952269073.html
which has a nicely formatted summary of the data available about Michelle Obama. Data browsers and other RDF-enabled clients, on the other hand, are redirected to
http://data.nytimes.com/N13941567618952269073.rdf
which has all the data goodness in RDF/XML.
There is also a dump: people.rdf. You can browse the data starting from the data.nytimes.com page. Everything is available under a CC-BY license.
Bugs and problems
This being a new dataset and the Times’ first foray into linked data, it turns out that the Beta label on the site is quite warranted. I will highlight four issues.
Data and metadata are mixed together. Let’s look at the data about Michelle Obama, available at the N13941567618952269073.rdf URI above. I’m reformatting the data into Turtle for legibility.
<http://data.nytimes.com/N13941567618952269073>
a skos:Concept;
skos:prefLabel "Obama, Michelle";
skos:definition "Michelle Obama is the first …";
skos:inScheme nyt:nytd_per;
nyt:topicPage <http://topics.nytimes.com/top/reference/timestopics/people/o/michelle_obama/index.html>;
owl:sameAs <http://rdf.freebase.com/rdf/en.michelle_obama>;
owl:sameAs <http://data.nytimes.com/obama_michelle_per>;
owl:sameAs <http://dbpedia.org/resource/Michelle_Obama>;
.
This makes perfect sense: it’s data about a person, modelled as a SKOS concept. But then it goes on:
<http://data.nytimes.com/N13941567618952269073>
dc:creator "The New York Times Company";
time:start "2007-05-18"^^xsd:date;
time:end "2009-10-08"^^xsd:date;
dcterms:rightsHolder "The New York Times Company"^^xsd:string;
cc:license "http://creativecommons.org/licenses/by/3.0/us/";
.
This is not data about Michelle Obama the person; it’s metadata about the data published by the NYT. It’s certainly not true that Michelle Obama was created by the New York Times, or that she “started” in 2007 (whatever that’s supposed to mean), and don’t even get me started on asserting rights or a license over a person.
Note that the NYT team actually went through the effort of setting up separate URIs for Michelle the person (http://data.nytimes.com/N13941567618952269073) and for the HTML and RDF documents describing the concept (http://data.nytimes.com/N13941567618952269073.html and http://data.nytimes.com/N13941567618952269073.rdf). Linked data experts advocate this practice of separate URIs precisely because it enables the separation of data and metadata: it lets you state some facts about the concepts, and other things about the documents that describe the concepts. This is what should be done in the data above: the metadata should not be asserted about the URI identifying Michelle, but about the URI identifying the document published by the NYT, N13941567618952269073.rdf. So we would get:
<http://data.nytimes.com/N13941567618952269073>
a skos:Concept;
skos:prefLabel "Obama, Michelle";
skos:definition "Michelle Obama is the first …";
skos:inScheme nyt:nytd_per;
nyt:topicPage <http://topics.nytimes.com/top/reference/timestopics/people/o/michelle_obama/index.html>;
owl:sameAs <http://rdf.freebase.com/rdf/en.michelle_obama>;
owl:sameAs <http://data.nytimes.com/obama_michelle_per>;
owl:sameAs <http://dbpedia.org/resource/Michelle_Obama>;
.
<http://data.nytimes.com/N13941567618952269073.rdf>
dc:creator "The New York Times Company";
time:start "2007-05-18"^^xsd:date;
time:end "2009-10-08"^^xsd:date;
dcterms:rightsHolder "The New York Times Company"^^xsd:string;
cc:license "http://creativecommons.org/licenses/by/3.0/us/";
.
Eric Hellman has a post about this issue, calling it “a potential legal disaster” because a license is attached to a resource that’s said to be the same as a resource on a different site (DBpedia and Freebase). He’s a bit alarmist, but this example highlights why the separation of data and metadata, of concept URIs and document URIs, is critically important in a general-purpose data model.
Distinguishing URIs and literals. Here are some selected snippets from the RDF/XML output:
<nyt:topicPage>http://topics.nytimes.com/top/reference/timestopics/people/o/michelle_obama/index.html</nyt:topicPage>
<cc:License>http://creativecommons.org/licenses/by/3.0/us/</cc:License>
<cc:Attribution>http://data.nytimes.com/N13941567618952269073</cc:Attribution>
The values of all three properties are URIs. In the RDF data model, URIs are of such central importance that they are treated differently from any other kind of value (strings, integers, dates). Not so in the code example above: there, the three URIs are encoded as simple strings. This should be:
<nyt:topicPage rdf:resource="http://topics.nytimes.com/top/reference/timestopics/people/o/michelle_obama/index.html" />
<cc:License rdf:resource="http://creativecommons.org/licenses/by/3.0/us/" />
<cc:Attribution rdf:resource="http://data.nytimes.com/N13941567618952269073" />
Why does this matter? It’s basically like making links “clickable” in HTML by putting them into an <a href="…"> tag: RDF clients will not recognize URIs that are encoded as literals, and will not know that they can treat them as links to be followed.
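The same distinction is even easier to see in Turtle, where the syntax makes it explicit: quotation marks produce a literal, angle brackets produce a URI node. A minimal contrast, again with an assumed nyt: prefix declaration:

@prefix nyt: <http://data.nytimes.com/elements/nytd/> .

# As a literal: an opaque string that link-following clients will ignore.
<http://data.nytimes.com/N13941567618952269073>
    nyt:topicPage "http://topics.nytimes.com/top/reference/timestopics/people/o/michelle_obama/index.html" .

# As a URI: a real node in the graph that clients can dereference and follow.
<http://data.nytimes.com/N13941567618952269073>
    nyt:topicPage <http://topics.nytimes.com/top/reference/timestopics/people/o/michelle_obama/index.html> .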
Content negotiation for hybrid clients. As usual for sites emitting linked data, there is content negotiation on the concept URIs: they redirect either to RDF or to HTML, based on the Accept header sent by the client when resolving the URI via HTTP. Also as usual for first-time linked data producers, the content negotiation is a bit broken.
Here is what happens when I ask for HTML (using cURL, which is a handy tool for debugging the HTTP behaviour of linked data sites):
$ curl -I -H "Accept: text/html" http://data.nytimes.com/N13941567618952269073
Response:
HTTP/1.1 303 See Other
Server: Apache/2.2.3 (Red Hat)
Location: http://data.nytimes.com/N13941567618952269073.html
Next I will ask for RDF:
$ curl -I -H "Accept: application/rdf+xml" http://data.nytimes.com/N13941567618952269073
Response:
HTTP/1.1 303 See Other
Server: Apache/2.2.3 (Red Hat)
Location: http://data.nytimes.com/N13941567618952269073.rdf
So far, so good. But many clients are “hybrid”: they can consume both RDF and HTML. This includes many tools that consume RDFa (RDF embedded in HTML pages). So it’s not uncommon to find tools that combine multiple media types in the Accept header. The Times server should redirect those tools to the RDF as well, because any RDF-consuming client can probably handle the raw RDF data better than the (not overly useful) HTML pages. But let’s see what happens:
$ curl -I -H "Accept: text/html,application/rdf+xml" http://data.nytimes.com/N13941567618952269073
Response:
HTTP/1.1 303 See Other
Server: Apache/2.2.3 (Red Hat)
Location: http://data.nytimes.com/N13941567618952269073.rdf.html
The server redirects to a file that doesn’t exist, ending in .rdf.html. As a programmer I find this pretty funny, because the bug gives a glimpse into the Times codebase: evidently someone didn’t consider that the two alternatives, sending HTML or sending RDF, are mutually exclusive.
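Since the response headers identify the server as Apache, here is a minimal mod_rewrite sketch of how the redirect logic could be made exclusive. This is my guess at a fix under an assumed URI pattern, not the Times’ actual configuration:

RewriteEngine On

# If the client accepts RDF/XML at all, redirect to the RDF document and stop.
RewriteCond %{HTTP_ACCEPT} application/rdf\+xml
RewriteRule ^/(N[0-9]+)$ http://data.nytimes.com/$1.rdf [R=303,L]

# Otherwise fall back to the HTML summary.
RewriteRule ^/(N[0-9]+)$ http://data.nytimes.com/$1.html [R=303,L]

The [L] flag makes the first matching rule final, so a client that accepts both media types gets exactly one 303 redirect to the RDF instead of the concatenated .rdf.html.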
Update: Someone at the Times seems to be working on the server as I’m writing this. The latest behaviour is even worse: it now redirects to .rdf.html even if I request only RDF, and it uses 301 redirects instead of 303.
Using the Creative Commons schema. The NYT data uses the Creative Commons schema to license the data under CC-BY. Here’s the relevant RDF, in Turtle (I fixed the subject URI and turned literals into URIs where appropriate):
<http://data.nytimes.com/N13941567618952269073.rdf>
cc:License <http://creativecommons.org/licenses/by/3.0/us/>;
cc:Attribution <http://data.nytimes.com/N13941567618952269073>;
cc:attributionName "The New York Times Company";
.
This uses three properties: cc:License, cc:Attribution and cc:attributionName. But according to the schema, cc:License and cc:Attribution are classes, not properties. This should be:
<http://data.nytimes.com/N13941567618952269073.rdf>
cc:license <http://creativecommons.org/licenses/by/3.0/us/>;
cc:attributionURL <http://data.nytimes.com/N13941567618952269073>;
cc:attributionName "The New York Times Company";
.
Summary. The Times’ foray into linked data is an exciting new development, but it also shows how hard it is to get linked data right. This is a weakness of the linked data approach.
Can we do anything about this? Better tutorials and education can probably help. Another activity that is trying to address the issue is the Pedantic Web Group, a loose collection of people like me who obsess about the technical details of publishing data on the web and work with data publishers to get issues like the above fixed. We might even give you a hand with reviewing your stuff before you go live with it.