cygri’s notes on web data

URIs have a namespace part and a local part, right?

Posted on February 8, 2016 by Richard Cyganiak

This is a technical post on the way URIs break down into “namespace parts” and “local parts” in RDF. It was prompted by this comment in a recent discussion:

In a URI, the namespace part ends with the last slash or hash, right?

So, the namespace of <http://example.com/foo/bar-123> ends after /foo/, and bar-123 is the local part, right?

The answer is neither yes nor no. The question is based on a wrong assumption.

Let me explain.

Namespaces in RDF are a strange beast

On the one hand, one could say that they are purely a syntactic convenience for shortening repetitive URIs, and carry no semantic meaning.

On the other hand, one could say that they are an integral part of the “user interface” to RDF data. Users, for example those writing SPARQL queries, much prefer to interact with the data in its prefix-abbreviated form and not through the full URIs.

Different parts of the RDF stack embody these different views. For example, the N-Triples syntax doesn’t support namespace abbreviation at all. But the RDF/XML syntax doesn’t work without namespace abbreviation (and furthermore makes it impossible to abbreviate certain valid URIs).

So where does the namespace part end?

It’s sloppy to say that URIs have “namespace parts” and “local parts”. Rather, it would be more accurate to say:

Given a certain prefix mapping, a URI can be broken up into a namespace part and a local part, possibly in different ways.

Consider this prefix mapping, written in Turtle:

@prefix a: <http://example.com/>.
@prefix b: <http://example.com/foo/>.
@prefix c: <http://example.com/foo/bar->.

Now, given this prefix mapping, the URI <http://example.com/foo/bar-123> can be broken up into namespace and local parts in three different ways, yielding three different local names:

a:foo/bar-123
b:bar-123
c:123

Now a couple of observations:

There’s nothing inherently special about hashes or slashes in URI patterns, as shown in the third prefix.
It’s a common convention to define prefix mappings that go up to the last hash or slash (again due to the influence of old RDF/XML where you couldn’t have hashes or slashes in local names), and many tools that automatically create prefix mappings will do this, so you will see b:bar-123 more often than the other forms. But that’s merely a convention, and the other forms may well be more convenient for users sometimes.
The form a:foo/bar-123 actually needs to be escaped as a:foo\/bar-123 if written in Turtle or SPARQL, because unescaped slashes are not allowed in the local part of a prefixed name.
This escape mechanism was only introduced in the W3C Recommendation version of Turtle and in SPARQL 1.1, so may not work in older parsers, and community awareness of this form is regrettably low.
The form c:123 will work in Turtle and SPARQL but not in RDF/XML, because XML requires that element names start with a letter or underscore. So, in RDF/XML, only b:bar-123 works.

To summarise, URIs don’t simply have a namespace part and local part. Rather, someone defines a prefix mapping, and under that prefix mapping, there may be zero, one or more ways of abbreviating any given URI.
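
As a rough illustration of this point, the following Python sketch enumerates every way a URI can be split under a given prefix mapping. The function name `split_candidates` is made up for this example and is not part of any RDF library; the prefix mapping is the one from the Turtle snippet above.

```python
# Prefix mapping from the Turtle example above.
PREFIXES = {
    "a": "http://example.com/",
    "b": "http://example.com/foo/",
    "c": "http://example.com/foo/bar-",
}


def split_candidates(uri, prefixes):
    """Return every (prefix, local part) pair under which the URI can be abbreviated.

    Hypothetical helper for illustration only. A namespace matches when the
    URI starts with it; the local part is whatever remains.
    """
    return [
        (prefix, uri[len(namespace):])
        for prefix, namespace in prefixes.items()
        if uri.startswith(namespace)
    ]


candidates = split_candidates("http://example.com/foo/bar-123", PREFIXES)
for prefix, local in candidates:
    print(f"{prefix}:{local}")

# A URI outside all declared namespaces has no abbreviation at all.
no_match = split_candidates("http://other.example.org/thing", PREFIXES)
```

For a URI that matches none of the namespaces the function returns an empty list, which is the "zero ways" case. The sketch deliberately ignores the Turtle escaping and RDF/XML naming rules discussed above; it only shows that the split depends on the mapping, not on the URI.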

Posted in General, Semantic Web | Comments Off

Multiple itemtypes in Microdata

Posted on August 2, 2011 by Richard Cyganiak

There has been a lot of discussion recently around HTML5’s microdata proposal, and how it relates to W3C’s earlier RDFa standard that is currently being updated for HTML5. Microdata solves many of the use cases of RDFa in a much simpler way. But there are other use cases it cannot solve. This is because microdata assumes a world where there are very few or even just a single vocabulary; mixing vocabularies on a single item is rather difficult. Jeni Tennison has an excellent statement of the problem, along with a proposed solution.

In this post I put forward another proposal for addressing at least part of the problem.

The problem: microdata is limited to a single itemtype per element.

Why is this a problem? Because it makes mixing vocabularies really hard. If I decide to mark up an address with schema.org’s PostalAddress, then I can’t easily add markup for microdata’s built-in vCard vocabulary. I’ll have to repeat content in order to use both vocabularies. This design benefits the Google-backed schema.org; more focused special-purpose vocabularies, or open alternatives to schema.org with a transparent development process, will have a hard time.

An example. So let’s assume I have this address and want to mark it up with microdata:

<div>
    <span>26 Dun Aengus</span>,
    <span>Galway</span>,
    <span>Ireland</span>.
</div>

Then here’s how I would do it with schema.org terms:

<div itemscope itemtype="http://schema.org/PostalAddress">
    <span itemprop="streetAddress">26 Dun Aengus</span>,
    <span itemprop="addressLocality">Galway</span>,
    <span itemprop="addressCountry">Ireland</span>.
</div>

And here with vCard terms:

<div itemscope itemtype="http://microformats.org/profile/hcard">
    <span itemprop="street-address">26 Dun Aengus</span>,
    <span itemprop="locality">Galway</span>,
    <span itemprop="country-name">Ireland</span>.
</div>

It is clear why combining both versions into a single one is difficult. Microdata uses short property names like itemprop="street-address". If an element had multiple itemtypes, then it would be impossible to tell which itemtype the street-address property belongs to. Assuming that it belongs to both types would be dangerous; there could be cases where a property exists in both vocabularies but with different meaning. The restriction to a single type prevents such ambiguity.

Multiple itemtypes without ambiguity: Here’s the proposal. I’ll start by creating an item that has all the properties from both versions—I’m omitting the itemtypes for now to avoid ambiguity:

<div itemscope>
    <span itemprop="streetAddress street-address">26 Dun Aengus</span>,
    <span itemprop="addressLocality locality">Galway</span>,
    <span itemprop="addressCountry country-name">Ireland</span>.
</div>

Without itemtype, this generates an untyped item with six properties:

itemtype: none
property: streetAddress = 26 Dun Aengus
property: street-address = 26 Dun Aengus
property: addressLocality = Galway
property: locality = Galway
property: addressCountry = Ireland
property: country-name = Ireland

The altitem property. Microdata would get a new built-in property, called altitem. Let’s add an additional element with this property into the untyped item:

<meta itemprop="altitem"
      content="http://schema.org/PostalAddress streetAddress
               addressLocality addressCountry">

What’s going on here? The idea is that altitem takes a whitespace-separated list. When added to an item, it creates a new “alternate item” whose itemtype is the first element of the list. Then it looks at the rest of the list, which should be property short-names. It copies any of these named properties from the original item to the new item. So, we’d end up with a second item besides the type-less original item. This second item has:

itemtype: http://schema.org/PostalAddress
property: streetAddress = 26 Dun Aengus
property: addressLocality = Galway
property: addressCountry = Ireland

Which is exactly the same as the original schema.org item from above. Creating the vCard item is just another property:

<meta itemprop="altitem"
      content="http://microformats.org/profile/hcard street-address
                locality country-name">

This gives us:

itemtype: http://microformats.org/profile/hcard
property: street-address = 26 Dun Aengus
property: locality = Galway
property: country-name = Ireland

So now we’d have three items in total: the original untyped item, and the two typed alternate items.
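
To make the proposed processing a bit more concrete, here is a minimal Python sketch of how a microdata client might expand altitem. It is only an illustration of the idea, not a spec: the function `expand_altitems` and the list-of-pairs representation of an item are invented for this example, and it assumes that the altitem property itself is dropped from the original item and that copied properties keep the order of the original item (both are open questions).

```python
def expand_altitems(properties):
    """Expand the proposed 'altitem' property into alternate items.

    properties: list of (name, value) pairs of one untyped item, in document
    order. Returns the original item followed by one item per altitem.
    Hypothetical sketch; the real rules would have to be specified.
    """
    # Assumption: altitem is a processing instruction, not data on the original item.
    base = [(name, value) for name, value in properties if name != "altitem"]
    items = [{"itemtype": None, "properties": base}]

    for name, value in properties:
        if name != "altitem":
            continue
        # First token is the itemtype, the rest are property short-names.
        itemtype, *wanted = value.split()
        # Assumption: copied properties keep the order of the original item.
        copied = [(n, v) for n, v in base if n in wanted]
        items.append({"itemtype": itemtype, "properties": copied})
    return items


address = [
    ("streetAddress", "26 Dun Aengus"),
    ("street-address", "26 Dun Aengus"),
    ("addressLocality", "Galway"),
    ("locality", "Galway"),
    ("addressCountry", "Ireland"),
    ("country-name", "Ireland"),
    ("altitem", "http://schema.org/PostalAddress streetAddress\n"
                "    addressLocality addressCountry"),
    ("altitem", "http://microformats.org/profile/hcard street-address\n"
                "    locality country-name"),
]

items = expand_altitems(address)
for item in items:
    print(item["itemtype"], [name for name, _ in item["properties"]])
```

Run on the address example, this yields the untyped item with six properties plus the two typed items with three properties each.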

What’s nice about this:

It doesn’t require any new syntax, just a new property.
Multiple types generate multiple items, which are visible in the microdata API just like normal items.
It plays well with itemref, so the altitem declaration doesn’t have to be repeated if I have several postal addresses on the page.
It plays well with a copy-and-paste style of web development. “If you want to use myVocab together with another vocab, just paste this snippet into your item and add the appropriate itemprops…”

Issues. Quite some details would still have to be worked out:

What happens to properties with full URL names? I guess they should always be copied to all items.
What happens to itemid? I guess all items should receive the same itemid from the original item.
In microdata, itemtype is inherited by nested sub-items. I’m not sure how this should work if altitem is present.
Properties within a microdata item are ordered; there’s a question whether the order in altitem or in the original item should take precedence when alternate items are generated.
Would it be worth having a dedicated microdata attribute for this?
Would microdata clients actually implement this? There is a risk that too many implementers would take shortcuts and just implement the basic case and ignore altitem.

Summary: This post shows how multiple itemtypes could be supported in microdata without introducing new syntax, without making the common case of a single vocabulary more complex for authors, and without fundamentally changing the data model.

Posted in General | 4 Comments

4 Responses to Multiple itemtypes in Microdata

Xi Bai says:

August 2, 2011 at 17:34

Thanks for this interesting proposal. Properties derived from different vocabs are grouped via altitems, neat! The thing is I think it does not solve the exceptional issue you mentioned (if full URIs are not used in itemprops):

“there could be cases where a property exists in both vocabularies but with different meaning.”

Does it?
- Richard Cyganiak says:
  
  August 2, 2011 at 20:19
  
  @Xi: In that case, the author can only use the clashing property from one vocabulary. It is not ambiguous, but the author has to choose.
  - Xi Bai says:
    
    August 3, 2011 at 12:57
    
    Hi, Richard,
    
    Point taken. Probably in that unusual but possible scenario, it’d be better if there is a chance for publishers to declare and use local alias for clashing properties in @content (e.g., content=”http://microformats.org/profile/hcard street-address: sa locality country-name”) and alias will be mapped to the full URI when an RDF/microdata parser is applied. May be however too complicated in this way.
    - Richard Cyganiak says:
      
      August 3, 2011 at 13:16
      
      There’s a trade-off between power and complexity. Semantic clashes between property names do occur, but they are rare, and I’m not sure it’s worth worrying much about.

Comments are closed.

The RDF 1.1 Literal Quiz

Posted on May 18, 2011 by Richard Cyganiak

Let’s pretend we live in January 2013, and RDF 1.1 has just been published. This includes the RDF Working Group’s attempt to clean up string literals. The issue with string literals is that RDF currently offers three different ways for doing something as simple as writing down a string:

"foo",
"foo"^^xsd:string,
and the rather weird "foo@"^^rdf:PlainLiteral.

The working group is trying to fix this. Now here’s a quiz with some RDF trivia questions. What are the answers that you’d like to see? (“Don’t care” is a fine answer too.)

Q1. Does this RDF graph (written in Turtle) have one triple?

<a> <b> 1 .
<a> <b> "1"^^xsd:integer .

Q2. Does this RDF graph (written in Turtle) have one triple?

<a> <c> "foo" .
<a> <c> "foo"^^xsd:string .

Q3. Is this a valid Turtle file?

<a> <b> "foo@"^^rdf:PlainLiteral .

Q4. Is a parser allowed to unify "foo" and "foo"^^xsd:string into a single form while parsing?

Q5. Is this a valid N-Triples file?

<http://example.com/a> <http://example.com/b> "foo" .

Q6. Is this a valid N-Triples file?

<http://example.com/a> <http://example.com/b> "foo@"^^rdf:PlainLiteral .

Q7. Is this a valid N-Triples file?

<http://example.com/a> <http://example.com/b> "foo"@en .

Q8. Is this a valid N-Triples file?

<http://example.com/a> <http://example.com/b> "foo"^^xsd:string .

Q9. Is this true in SPARQL?

datatype("foo") = xsd:string

Q10. Is this true in SPARQL?

datatype("foo") = error

Q11. Is this true in SPARQL?

datatype("foo") = rdf:PlainLiteral

Q12. Is this true in SPARQL?

datatype("foo"@en) = xsd:string

Q13. Is this true in SPARQL?

datatype("foo"@en) = error

Q14. Is this true in SPARQL?

datatype("foo"@en) = rdf:PlainLiteral

Q15. Is this true in SPARQL?

datatype("foo"@en) = rdflang:en

Q16. Does the literal in this RDF/XML fragment have a language tag?

<rdf:Description rdf:about="a" xml:lang="en">
  <b>foo</b>
</rdf:Description>

Q17. Does the literal in this RDF/XML fragment have a language tag?

<rdf:Description rdf:about="a" xml:lang="en">
  <b rdf:datatype="&xsd;string">foo</b>
</rdf:Description>

For each of the following pairs of statements, if the statement on the left is true, then is the statement on the right true as well in a system that supports datatype inference (specifically, {xsd:string}-Entailment)?

Q18. { <a> "foo" . } => { <a> "foo"^^xsd:string . }

Q19. { <a> "foo"^^xsd:string . } => { <a> "foo" . }

Q20. { <a> "foo" . } => { <a> "foo"@en . }

Q21. { <a> "foo"@en . } => { <a> "foo" . }

Q22. { <a> "foo"@en . } => { <a> "foo"@en-GB . }

Q23. { <a> "foo"@en-GB . } => { <a> "foo"@en . }

Q24. { <a> "foo"@fr . } => { <a> "foo"@en . }

Leave your answer in the comments! We’ll check in January 2013 to see who came closest. The winner gets, uhhm, a copy of the RDF Concepts and Abstract Syntax spec signed by the editors …

Posted in Semantic Web | 1 Comment

1 Response to The RDF 1.1 Literal Quiz

Ryan Kohl says:

May 19, 2011 at 15:18

1. yes
2. yes
3. no
4. yes
5. yes
6. no
7. yes
8. yes
9. yes
10. no
11. no
12. yes
13. no
14. no
15. wtf?
16. don’t care
17. don’t care
18. yes
19. yes
20. no
21. yes
22. no
23. yes, by a narrow margin
24. no

Creating an RDF vocabulary: Lessons learned

Posted on March 7, 2011 by Richard Cyganiak

With tools like Neologism and OpenVocab, creating an RDF vocabulary is easy. But if your goal is re-use within a wider community, you will face many questions that are not so easy to answer:

How much work is it going to be and what timeframe is realistic?
How broadly and how deeply should you cover the domain? Where to stop?
Work alone or seek collaborators?
Should you start by setting up a mailing list, or by producing a first draft?
How much documentation do you need to produce?
Whose feature requests and modeling ideas should you heed and whose ignore?
How to keep pushing towards the uncertain goal of “adoption” in the face of limited time?

A few days ago, the VoID vocabulary became a W3C SWIG Note. VoID started in 2008 as a loose collaboration between Jun Zhao, Keith Alexander, Michael Hausenblas and me. We published a first non-W3C version in 2009. The W3C publication is a nice milestone for us, and I thought this a good opportunity to share some of the lessons I have learned along the way.

I will focus on process and collaboration in this post, and say little about modeling practices or publishing tools or RDFS/OWL geekery.

Lesson #1: Work in a team. Three or four people, each with their own use cases or data, might be ideal. It ensures that a variety of use cases are covered; fluctuations in available time don’t stall the project; it mellows any strong personal signature in the modeling and design; it increases the network available for reaching out to potential users. Having a team of a few motivated people is perhaps the most important factor for success.

Lesson #2: Take your time. For all of us, VoID was a low-priority “background task”. We all get paid for doing other things. Inevitably, progress was often slow, with months where literally nothing happened. I probably averaged less than an hour of VoID work per week (with occasional major bursts of activity).

And that might be the best way. Progress in vocabulary design is not how quickly one produces a polished spec. Progress is learning about the needs of the potential user community. Going slow means more opportunity for feedback at every stage, and reduces the risk of creating something that nobody needs.

We also moved the vocabulary to a different host twice in the process. This worked out ok because we could retain the original namespace URI throughout the moves, but it definitely shows the advantage of going with something like purl.org from the start.

Lesson #3: Use a public issue tracker. This is crucial, even if you work alone. It adds structure to the work process and helps to ensure that no balls get dropped. Some issues will remain unresolved for long periods of time, and you need a place for collecting the random comments, discussions, related links, proposed text for changes and so on.

I think it’s important to use a tracker that is easy to work with, ideally one that the contributors are already familiar with. We used the one from Google Code. It’s simple and just works.

Setting up a Google Code project for developing the vocabulary worked very well for us. Besides the tracker, we also used the SVN repository for the spec, and the simple wiki for random bits of information, like lists of deployments, and examples that didn’t fit into the spec.

Don’t try to use a wiki or Google Doc or other funky collaboration device in place of an issue tracker. I’ve seen that done elsewhere and it doesn’t work.

Lesson #4: Perfection can wait till the next version. This sounds banal, but is so important. At some point quite a while ago, we were all quite fed up and just wanted to get something out of the door. So we decided not to tackle a lot of difficult open issues. We told ourselves that we would just do them in a second version. This turned out to be immensely liberating.

After version 1, we took a long break, and then started to work on version 2. Now we knew that deferring to the next version is always an option (which we used liberally). Not really clear if that use case is worth the effort? Defer. Not enough evidence or experience to inform the design? Defer. Two pig-headed contributors (that is, Keith and me) can’t agree on a design? Defer.

Lesson #5: Regular Skype calls. This one might be controversial, because no one likes wasting time in weekly conference calls. But I think it worked well for us. We didn’t quite do weekly calls, but scheduled them ad hoc, averaging perhaps one every two weeks. Often, the only progress between calls was that one of us felt a bit of shame and quickly did one or two of their actions in the thirty minutes before the call. This adds up over the months and makes sure that there is slow but steady progress.

We took turns chairing and scribing. The chair would take us through the agenda (typically “review open actions; review issues list; discuss particularly thorny issue XYZ; AOB; schedule next call”) and interrupt any discussion that started to go circular. The scribe would note whenever someone took an action to do something, and afterwards email a list of those and the date for the next call. A good call duration is somewhere between 60 and 90 minutes.

Lesson #6: Have a working draft of the spec from day one. Even if it’s just a few scribbles. Call them your working draft and take it from there. Then get into the habit of focussing any discussion on the question: What change should be made to the text? Arguing about words that should go into the text is much more productive than the alternative, which is arguing who is right or wrong. Ideally, whenever people start to disagree, they should draft up competing change proposals to be discussed in the next call.

Besides the spec text in SVN, we used Neologism to create and publish the actual RDFS vocabulary specification.

Lesson #7: Public mailing list is optional. Don’t you hate signing up to yet another mailing list? Me too. We started with a private mailing list, and found that its only real use was for notifications from the issue tracker. Discussion happened on Skype or in the tracker. We put external comments into the tracker too and discussed them there. This worked well.

This is about the creation phase of the vocabulary. It might be a different story once you get a bit of a user community going. We now have a public discussion list.

Lesson #8: Start over a beer and a large piece of paper. If you can. With everyone physically in the same room. That’s how we did it anyways, at a conference, and it was quite helpful for figuring out a core part of the vocabulary that seemed uncontroversial. Most of that time was spent arguing about (I’m sure this will come as no surprise to you) a name for the project.

Posted in General | 4 Comments

4 Responses to Creating an RDF vocabulary: Lessons learned

John Samuel says:

March 8, 2011 at 11:06

Is there any RDF for representing Commands? I mean Linux Commands or any terminal command?
- Richard Cyganiak says:
  
  March 8, 2011 at 13:54
  
  John, my blog is not a general Semantic Web Q&A site, so don’t expect an answer here!
zazi says:

March 8, 2011 at 11:27

Congrats, Richard, excellent post!

In addition, I like to add some experiences that I made during the last months when I (co-)designed several ontologies and/or proposed enhancements of existing ones (see here for an overview). I may call it the “lone warrior” style ;)

Re. lession #1: I prefer to work in a team, too. Unfortunately, it is not always possible to team up a couple of people to work on a specific ontology, because, e.g., you do not have the time or capacities to do this. That is why, I mainly chose the (more general) community approach, i.e., I proposed my thoughts, drafts and changes by using different communications channels, e.g., mailing lists or chat channels. Sometimes I got some feedback. However, I often experienced little to no reactions. So, I always have had to keep in mind to expect nothing, which made me even happier when I got some feedback ;) The disadvantages of this approach are, although, sometimes the feedback cycles were really fast,
– people often do not really have time to intensively look into a specific subject to provide advanced feedback (especially in chat channels; albeit, this is quite comprehensible)
– some people get annoyed of cross posting, which is, on the other side, an option to reach a broader audience (this is comprehensible, too; so, I mainly stopped posting such announcements on mailing list)
Finally, you even have to expect nothing, if a team already exists that developed an ontology that one reused or where one proposed changes (team work?).

Re. lession #2: Makes generally much sense, although, as I already mentioned in my comment to lession #1, this is not always possible. There is always some time pressure. My experience is, that many people give a s*** about ontology design and have no idea about how long a proper design of a new vocabulary will take. Often they view it like rapid proprietary database schemata design. Besides, I’m not in such a position were I get paided for doing other things. I have to present solutions as fast as possible (for free, anyway).

Re. lession #3: Thanks a lot for opening my eyes for outlining the advantages of an issue tracker in comparison to a mailing list. I guess, I will ad this feature to.

Cheers
- Richard Cyganiak says:
  
  March 8, 2011 at 13:52
  
  Good points zazi. I guess I’d also advocate looking for some collaborators because it ensures that there is some minimum interest in the topic.
  
  Regarding the recommendation to take your time, this doesn’t mean spending a lot of time overall. My point is that it’s better to spend an hour per week for 40 weeks, than spending one week full-time. This way you will get more feedback that you can still take into account during the design process.
  
  Anyways, I’m speaking from the experience of creating one vocabulary only, so my observations are by no means a definitive account of the vocabulary creation process…

Blank nodes considered harmful

Posted on March 2, 2011 by Richard Cyganiak

Well, they are not always harmful. But most of the time. I’ll get to that in a minute.

On the semantic-web@w3.org list, W3C’s Sandro Hawke has a lucid and concise summary of the problems with blank nodes in RDF. It’s worth quoting in full:

I agree that *software* should not change blank nodes to nodes with a
URI label. But, when practical, *people* probably should, as they are
authoring.

In general, blank nodes are a convenience for the content provider and a
burden on the content consumer. Higher quality data feeds use fewer
blank nodes, or none. Instead, they have a clear concept of identity
and service for every entity in their data.

If someone in the middle tries to convert (Skolemize) blank nodes, it’s
a large burden on them. Specifically, they should provide web service
for those new URIs, and if they get updated data from their sources,
they’re going to have a very hard [perhaps impossible] time
understanding what really changed.

Does this mean blank nodes are evil? Not always. Sometimes they are tolerable, sometimes they are a necessary last resort, and sometimes they are good enough. But they are never good.

They are fine for transient data that’s not meant to be stored.
They can be the only viable option if a changeable upstream data source doesn’t provide identifiers that persist across requests/updates.
They can be tolerable for unimportant auxiliary resources that don’t correspond to a meaningful entity in the domain of interest (e.g., some n-ary relations) and are not worth the hassle of maintaining a stable URI.

In all other cases, blank nodes should be avoided. Sandro is right: publishing RDF with blank nodes puts a burden on the consumer. Especially if the data might change in the future.

The higher the percentage of blank nodes in a dataset, the less useful it is.
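
The burden on the consumer can be illustrated with a small Python sketch. The data below is entirely made up: two snapshots of the same hypothetical feed, with triples modelled as plain tuples and labels starting with `_:` standing for blank nodes. The only real change between the snapshots is that Bob gets a city, but a naive comparison cannot tell that the blank node triples are unchanged, because blank node labels are not stable across requests.

```python
# First fetch of a hypothetical feed. Alice is a blank node, Bob has a URI.
old = {
    ("_:b0", "name", "Alice"),
    ("_:b0", "city", "Galway"),
    ("<http://example.com/bob>", "name", "Bob"),
}

# Second fetch. The blank node label differs (labels are only local to one
# document), and one real change was made: Bob now has a city.
new = {
    ("_:b7", "name", "Alice"),
    ("_:b7", "city", "Galway"),
    ("<http://example.com/bob>", "name", "Bob"),
    ("<http://example.com/bob>", "city", "Dublin"),
}

# Naive diff: plain set difference on the triples.
added = new - old
removed = old - new

# Restricting the diff to triples without blank nodes gives a clean answer.
real_added = {t for t in added if not t[0].startswith("_:")}

print("added:", sorted(added))
print("removed:", sorted(removed))
print("added (URI subjects only):", sorted(real_added))
```

The naive diff reports three additions and two removals, although only one triple really changed. Working out that `_:b0` and `_:b7` describe the same thing requires graph matching, which is exactly the work that stable URIs would have made unnecessary.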

Posted in General | Comments Off

Top 100 most popular RDF namespace prefixes

Posted on February 15, 2011 by Richard Cyganiak

I run prefix.cc, a website for RDF developers where anyone can register and look up the expansion URIs for namespace prefixes such as foaf, dc, qb or void. The site tracks which prefixes get looked up most often. This allows some insight into the popularity of RDF vocabularies and datasets.

This post is a snapshot of the top 100 most requested prefixes as of today.

Caveats:

The counts reflect what knowledgeable RDF hackers are interested in. This may or may not reflect the interests of more casual users, or what’s deployed on the web. The og prefix for Facebook’s Open Graph protocol, for example, is outside of the list, at #273.
“Users” of the site include automated apps and web crawlers. This distorts numbers. For example, the prefix.cc homepage links to prefix.cc/foaf, driving crawlers and first-time visitors that way, inflating foaf numbers.
Here I deliberately do not include the full URI expansions for those prefixes. Prefix.cc allows multiple competing expansions for a prefix. Users can then vote to determine what’s shown first. It can be subject to gaming, ballot stuffing, and so on. There are strong disagreements over the “best” expansion for some prefixes, starting right at #2 with dc, which is one of the most controversial prefixes on the site. (If you need expansions, then you can get a fresh set from the API.)
Prefix.cc doesn’t allow registration of single-letter namespaces, along with some other syntactic restrictions. Some vocabularies suggest single-letter prefixes, most notably Google’s rdf.data-vocabulary.org, which is commonly abbreviated “v”. (Someone has registered dv for it, but that rarely gets looked up.)

That being said: The data is below, and a CSV version is available too.

Rank	Prefix	Lookups
1	foaf	45506
2	dc	17621
3	rdf	17585
4	rdfs	14865
5	owl	11898
6	geonames	9349
7	geo	4757
8	skos	4501
9	dbp	3396
10	swrc	2439
11	sioc	2336
12	xsd	2310
13	dbo	2089
14	dc11	2006
15	doap	1856
16	dbpprop	1697
17	content	1621
18	wot	1598
19	rss	1474
20	gen	1403
21	dbpedia	1377
22	d2rq	1370
23	nie	1352
24	xhtml	1336
25	test2	1305
26	gr	1301
27	dcterms	1255
28	org	1157
29	vcard	1154
30	akt	1150
31	dct	1118
32	ex	1104
33	fb	995
34	owlim	993
35	cfp	978
36	xf	960
37	sism	956
38	earl	948
39	bio	941
40	reco	936
41	xfn	926
42	media	925
43	air	921
44	dcmit	920
45	void	917
46	fn	915
47	afn	910
48	cc	906
49	cld	900
50	vann	898
51	days	895
52	ical	893
53	http	893
54	mu	888
55	sd	874
56	osag	874
57	botany	859
58	cal	858
59	musim	850
60	factbook	848
61	cs	845
62	log	838
63	rev	837
64	swande	836
65	bibo	834
66	dcq	834
67	cv	832
68	ome	830
69	biblio	830
70	dir	828
71	giving	827
72	memo	827
73	ok	826
74	rel	821
75	event	818
76	ir	818
77	aiiso	816
78	ad	813
79	dbr	813
80	co	812
81	af	809
82	cmp	806
83	bill	805
84	rif	804
85	xs	804
86	math	803
87	rdfg	803
88	daia	801
89	swc	800
90	tag	800
91	swanq	799
92	xhv	796
93	book	795
94	jdbc	793
95	myspace	792
96	tzont	792
97	sr	790
98	ctag	789
99	dcn	787
100	lomvoc	786

Posted in General | 2 Comments

2 Responses to Top 100 most popular RDF namespace prefixes

zazi says:

February 16, 2011 at 15:39

Some including namespace a very doubtable regarding their popularity. You may should mark “counts reflect what knowledgeable RDF hackers are interested in” in bold style. However, I think even then it is still very doubtable. Anyway, thanks a lot for your efforts. You already named the sources for this biased view (automated apps and web crawlers – are not “RDF hackers” ;) ).
Richard Cyganiak says:

February 16, 2011 at 16:18

@zazi: Data with known biases is better than no data.

Maintenance

Posted on October 28, 2010 by Richard Cyganiak

This weblog has become quiet. These days, most of my word count goes into mailing lists, Twitter, and way too much personal email. Over here on the blog, cobwebs are gathering and some signs of bitrot have become evident.

So I’ve done some maintenance. I upgraded the software to WordPress 3.0.1, and changed to a new theme. I also decided to retire the blog’s name, dowhatimean.net, and instead move it to richard.cyganiak.de/blog.

This new location was previously home to a German-language blog I wrote back in 2004 and 2005. I imported the old posts into the site, so don’t be surprised if you encounter some German in the depths of the archive.

No URLs were broken in the making of this post. I hope.

Posted in General | Comments Off

prefix.cc, MkII

Posted on January 11, 2010 by Richard Cyganiak

prefix.cc is a website I made last February to ease a very common task in the life of RDF developers and SPARQL users: looking up namespace URIs. A short summary of what the site can do for you is available here.

The site was developed during a few weekends, and I haven’t touched the code since I first deployed it. Today I’m publishing the first serious update to the site. This post describes what’s new.

Reverse lookup. One of the most requested features is reverse lookup. You can now enter a URI of an RDF term into the query box on the start page, and the site will respond with the best prefix for contracting that URI into a QName. This functionality is also available as an API.
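
The idea of a reverse lookup can be sketched in a few lines of Python. This is not the site’s actual implementation: the function `reverse_lookup`, the small `KNOWN` table and the choice of "longest matching namespace wins" as the definition of "best" are all assumptions made for this example.

```python
# A tiny, hand-written prefix table standing in for the site's database.
KNOWN = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}


def reverse_lookup(uri, prefixes):
    """Contract a URI into prefix:local form, or return None if nothing matches.

    Assumption for this sketch: among all namespaces that the URI starts
    with, the longest one is considered the best match.
    """
    matches = [(p, ns) for p, ns in prefixes.items() if uri.startswith(ns)]
    if not matches:
        return None
    prefix, namespace = max(matches, key=lambda match: len(match[1]))
    return f"{prefix}:{uri[len(namespace):]}"


print(reverse_lookup("http://xmlns.com/foaf/0.1/name", KNOWN))
print(reverse_lookup("http://www.w3.org/2000/01/rdf-schema#label", KNOWN))
print(reverse_lookup("http://example.com/unknown", KNOWN))
```

A real service would also have to decide between competing prefixes registered for the same namespace, which this sketch does not attempt.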

Negative votes. The site has received a moderate amount of spam, mostly from pranksters who think it would be funny to propose their own homepage as a better expansion for the foaf prefix. I’ve mostly cleaned this up manually, but I think it would be better to equip the user community with tools to handle this.

The site has always had a voting mechanism, which I intended as a tiebreaker in cases where people have submitted different URIs for the same prefix, for example in the case of the dc prefix. Starting today, you can submit both positive and negative votes. If a URI receives a certain number of negative votes, it will no longer be shown.

New export formats. One of my favourite features is the ability to directly get output in various machine-readable syntaxes by composing an appropriate URI, such as http://prefix.cc/foaf.file.n3, which produces a declaration of the FOAF prefix in N3 format. I find this handy for copy-pasting into a text editor, but also for automating things.

A few formats have been added: vann produces an RDF/XML version of the namespace mapping in the VANN vocabulary (example). xmlns produces raw XML prefix declarations (example). go redirects to the namespace URI, so you can type http://prefix.cc/foaf.go into your browser bar as a shortcut for opening the FOAF specification. I’ve also added a table of all supported formats.
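To make the idea of format-specific output concrete, here is a small Python sketch that renders one prefix mapping locally in two of the syntaxes mentioned (N3 and raw XML namespace declarations) and builds the `.go` shortcut URL. The function names are hypothetical, and the strings are not guaranteed to match the site's output byte for byte; only the `http://prefix.cc/foaf.go` pattern is taken from the text above.

```python
from xml.sax.saxutils import quoteattr


def as_n3(prefix, namespace):
    """Render a prefix declaration in N3/Turtle syntax."""
    return "@prefix %s: <%s> ." % (prefix, namespace)


def as_xmlns(prefix, namespace):
    """Render a prefix declaration as an XML namespace attribute."""
    # quoteattr adds the surrounding quotes and escapes characters such as '&'.
    return "xmlns:%s=%s" % (prefix, quoteattr(namespace))


def go_url(prefix):
    """Build the shortcut URL that redirects to the namespace URI."""
    return "http://prefix.cc/%s.go" % prefix


FOAF = "http://xmlns.com/foaf/0.1/"
print(as_n3("foaf", FOAF))
print(as_xmlns("foaf", FOAF))
print(go_url("foaf"))
```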

A side effect of the introduction of VANN support is that there is now a single VANN representation of all mappings known to the site.

Tweaks and fixes. Regular users will note a number of further small changes and bugfixes throughout the site. One notable fix is to the way namespace lookups are calculated for the list of popular prefixes. Ironically, most of the lookups actually are from web crawlers that followed the links in the list itself, making the list self-perpetuating. Also, the list featured the non-existing robots prefix, because many crawlers are looking for http://prefix.cc/robots.txt. These issues should now be fixed.

Internal changes. The site is developed in PHP, and started out as a quick weekend hack, so the initial code was a horrible mess that was hardly maintainable. I spent quite some time cleaning this up and refactoring the code into a much nicer structure that should be able to grow along with some of the additional features I’ve planned for the future. The codebase now totals some 1600 lines of PHP, CSS and JavaScript.

Hidden goodies: RDFa markup and feed of latest additions. Finally, I want to highlight some features that have existed all along, but are easily missed: First, many pages contain RDFa markup, so if you want to re-use any prefix.cc data in your own site or application, you most likely can. Second, there is an RSS feed of the latest additions to the prefix database, and it is a neat way of learning about new vocabularies and ontologies that show up around the Web.

Bugs, comments, suggestions? Any feedback is appreciated. I did a lot of refactoring without a test harness, so it’s quite likely that a few new bugs have crept in. If you notice anything, please let me know. Also, if there is anything that you would like to see in prefix.cc Mk III, please share!

Posted in General, Semantic Web | Comments Off

What’s in a name? And the Linked Data Police

Posted on November 21, 2009 by Richard Cyganiak

So I wrote a rather angry private email to Erik Wilde a few days ago, complaining about his use of the term “linked data” for a site that doesn’t follow the linked data practices. Erik decided to publish my email on his blog, along with a long defense of his use of the term, in a post called “The Linked Data™ Police”. Since it’s in public now, we can just as well see if we can get a useful discussion out of this.

First, I realize that Erik probably responded more to the tone of my email than to the content. It was an angry rant, and my tone misses the mark, so his response is fair enough, and I have apologised to him. I also have to say that I’m speaking only for myself and nobody else—before others become scared of the “Linked Data™ Police”, I can assure them that to the best of my knowledge, it has a staff of one, and my peers in the community are in general a friendly and civil lot.

The site in question. So, what is this about? Erik and team have built recovery.berkeley.edu, a site that publishes structured data about Recovery Act spending. The site was built with a grant from the Sunlight Foundation. The technologies of choice are Atom and various other XML formats. As far as I can see, it’s excellent in its adherence to REST principles, including good URI design and “hypermedia as the engine of application state”. These two together are labelled as “linked data” in the site’s technical documentation.

This is a discussion about names, and not about substance. At the very core is the following question: Should we understand “linked data” to mean “the idea of somehow connecting pieces of data with links”, or should we take it to mean “RDF published according to the rules outlined by Tim Berners-Lee in the design note that coined the term”?

Obviously, I’m of the latter opinion. In this post, I want to do two things: First, I want to respond to some specific points from Erik’s post. I will do this by paraphrasing each point, and then responding to it. Second, I want to explain why I care about the matter and why I think that “linked data” should continue to be associated with Tim’s rules, and why advocates of different sets of rules should use different terms.

Erik: “The attitude is scary: Instead of figuring out the most effective way of adding more semantics to the web, it starts with a set of technologies and claims that whatever you want to do, you have to use those.” Linked data didn’t start with a set of technologies; a lot of deliberation by a lot of people went into the choice. Also, I have no quarrel with Erik’s choice of technologies, and I didn’t even suggest that he should or shouldn’t use any technology. There can be good reasons against using RDF, and it’s a good thing that innovation continues in other areas of web data technology.

But if Erik and colleagues don’t buy into the set of technology choices commonly called “linked data”, then why would they insist on using that name? What’s wrong with the established technical terms, REST and Resource-oriented Architecture? Is it just because those are already way past the peak of the hype cycle?

Erik: “Using generic problem names to refer to specific technologies only confuses people.” Erik mentions Linked Data, XML Schema, Semantic Web and Web Services as specific technologies that he considers to be badly labelled. But the peculiar thing about those four is not their names; they are controversial for other reasons. The IT world is full of specific technologies that use generic names: World Wide Web. Structured Query Language. Extensible Markup Language. Hypertext Transfer Protocol. Portable Document Format. Resource-Oriented Architecture. Scalable Vector Graphics. Open Document Format. Erik may not like it, but it’s a common practice.

Erik: “Choosing such names is usually an attempt to make competition harder.” Usually? I doubt that. Technologies are named in their very infancy, when their future and success are far from certain, and when competition is usually not an issue. The naming is usually an attempt to communicate as clearly as possible what the proposed technology is supposed to achieve, which is not a bad thing at all. Some fail at achieving the goal, but everyone designs (and names) assuming that it can be eventually achieved.

Erik: “RDF is just a stylesheet away.” Erik points out that it would be trivial to create a GRDDL transform that translates from the service’s output to RDF. Personally I wouldn’t call it trivial, and being just one transformation away from being compatible is not the same as being compatible. If there were GRDDL transforms in place, I would have no reason at all to complain, although just a few linked data clients support GRDDL at this time.

Back to the roots. So where did the term “linked data” come from? To the best of my knowledge, Tim Berners-Lee coined it in his 2006 Design Note that is titled “Linked Data”. The document introduced the four rules that are now known as the “Linked Data Principles.” Erik’s service is following all of them except the one that demands RDF or SPARQL.

It’s worth pointing out that the four rules did not mention RDF when Tim originally published them, but it is clear from the rest of the document that the use of RDF was implied. The document was aimed at the semantic web community. His later change was a clarification, not a change of intention.

I don’t know why Tim wrote this piece back in 2006, but my interpretation was that he wanted more people to publish data that can be browsed with his Tabulator RDF browser, and most RDF out there at that time couldn’t be browsed because of problems with one of the four rules. So I read it as a call for better interoperability among RDF publishers.

Broadening the term? There have been a number of calls for broadening the meaning of the term, most eloquently from Paul Miller, so Erik is certainly not alone in his view. Their intention is to get linked data quicker into the mainstream, which is a goal that I share.

The problem is that broadening a term makes it less meaningful. There is a danger that the term gets extended to the point where it’s as meaningless as other buzzwords such as Web 3.0 or the venerable Semantic Web. If you can use other formats instead of RDF, then why not also use SOAP instead of HTTP? Why not do away with the URIs? Why not YQL instead of SPARQL? Where does “linked data” stop? Everything is somehow “data” and somehow “linked.”

Interoperability requires choices to be made. In my eyes, the great thing about the term “linked data” is that it has a reasonably precise technical definition, rooted in Tim’s Design Note and the early work of the Linking Open Data project. That work has turned the Semantic Web’s compelling but vague promises of a side-by-side “web for humans” and “web for machines” into concrete guidelines that people can actually implement, and the result is an ecosystem of interoperable tools, clients and datasets that continues to grow around these guidelines.

These guidelines will continue to evolve with the emergence of new technologies (e.g., RDFa) and increasing experience and maturity (e.g., importance of licensing and provenance handling).

But at the core, it has to be about a set of concrete technology choices and deployment practices that foster an interoperable ecosystem of data sources and clients. “Linked data” is the best name we have for that particular set of technology choices and practices. There is nothing magic about the name “linked data”; to the best of my knowledge, it didn’t exist at all in the web community before 2006. The term has gained popularity because it has associated rules that tell you how to do it, not because of the “words”. Without the rules, the term would be meaningless fluff. Everything is somehow “linked” and somehow “data.”

If you think that a different set of rules would work better (which is entirely possible), then it would be prudent to write them down, coin a new term for them, and start the legwork of advertising them, just as Tim has done since 2006.

Posted in General, Semantic Web | 7 Comments

7 Responses to What’s in a name? And the Linked Data Police

Greg Boutin says:

November 24, 2009 at 13:17

I think there are two issues here:

1- the propensity of a few Linked Data folks of trying to impose their view of the world, and strangely feel threatened by anyone advocating for a more flexible approach to fulfilling the vision of the semantic web. This vision is not vague but it does not prescribe the methods to use to get there – I think that’s a good thing because clearly Linked Data is only one possible way to get there, and after some examination it is possible it might have some fundamental flaws (e.g. it appears that computing queries across geographically-distributed triples/URIs is far too slow, and other things of that sort). In any case, linked data is missing the “semantization” part, it only carries the semantics – does not create them, and thus needs to be completed by concept extraction mechanism.
The defensiveness of those Linked Data people is a huge problem that limits the good will of people who could otherwise be good evangelists for linked data, like Erik and me, and it does bring up terms like “Police” (Erik) and “Inquisition” (from me).

As you rightly note, Richard, a lot of that has to do with reacting to tones, which has to do with how you see the world. So I’d argue the linked data folks in question should accept that whatever they say, there will be debates about alternative approaches, until they have proven practically that linked data is the only approach (if that’s really what you think), or alter their perspective to embrace the possibility that linked data is just one method among many to get to the semantic web.
Ultimately, beyond some major implementation question marks, the issue for me is that linked data is a proxy to building more intelligence into the web, by injecting it into the data itself. There will always be competing approaches not trying to inject more semantics into the data format itself, but deriving those from a more algorithmic approach, and identifying relevant links (data or document-level) after that. I think this approach is perfectly viable, and I see linked data as a pursuit that also is useful – but it has not made its case to deserve the title of “exclusive semantic web technology”, just yet…

2- the difficulty of using terms exactly as defined by a technology community, when those terms are so generic indeed. “Linked Data” makes one think of “Linked” and “Data”. “Semantic Web” makes one think of a Web that is Semantic, i.e. would understand the meaning. While I would agree at this point that it’s probably best to leave Linked Data (with block letters) to describe TBL’s stack, you won’t be able to suppress the use by people who think it’s simply linking data at the data level. And again, alternative approaches to do just that are useful, as I don’t think RDF cuts it fully. But my point is, you need to be more flexible, let people use it as they like, and could simply point out that Linked Data already refers to an established stack, and that they could perhaps use “linked data” without block letters rather than crack down and alienate the newcomers. This has happened way too often, and as a market development specialist (I’m not a technical expert on linked data, but I do proudly think I am one in market adoption and other related business matters…), I will point out once again that this does a huge disservice to your cause. It simply is not effective.

I hope those thoughts help further the discussion.
Pingback: Growth Times » Discussing the Semantic Portals of Eqentia, DBPedia and the Utility of Linked Data
Pingback: Open Calais – a web de dados estruturados « Tecnologia Educacional
Pingback: Conexão TE » Blog Archive » Open Calais – a web de dados estruturados
Pingback: Czy przyszła Sieć musi być pedantyczna? « Szkoła Web 3.0
Pingback: Seven Pillars of the Open Semantic Enterprise | Digital Asset Management
Will Daniels says:

January 17, 2010 at 08:16

I think you’re essentially correct in this matter, and you hit the nail on the head when you wrote “Without the rules, the term would be meaningless fluff. Everything is somehow ‘linked’ and somehow ‘data'”.

When this kind of thing is done for business marketing purposes, it is understandable (albeit undesirable), but when it happens in a more academic setting, it lends credence to a redefinition – and subsequent abuse – of the term that is both premature and unhelpful.

For those of us not fortunate enough to be directly employed in this area, the multitude of (often unfamiliar) technologies and best practices related to the Semantic Web are hard enough to figure out and to keep up with, without prominent researchers casually redefining the vocabulary.

I say “casually” here, even though the document in question does mention explicitly the deviation from accepted usage of the term, simply because there does not seem to be any actual reason to have done it, other than for a kind of convenience that is like calling a lettuce and tomato sandwich a BLT; after all it’s only a slice of bacon away.

While your tone was indeed a little “snooty”, and did not elicit a constructive response directly, the ensuing publicity might actually serve the point better. Hopefully a lot more people will now think twice before using the label Linked Data too loosely, not necessarily for fear of the Linked Data Police, but just for having thought about it at all…

Comments are closed.

Linked data at the New York Times: Exciting, but buggy

Posted on October 30, 2009 by Richard Cyganiak

Update: Evan Sandhaus reports that all the issues mentioned below will be fixed. Great!

Yesterday at the International Semantic Web Conference, Evan Sandhaus of the New York Times unveiled data.nytimes.com, a site that publishes linked data for some parts of the Times’ index. To me, this was one of the most exciting announcements at the conference, and it caused quite a tweetstorm during and after Evan’s talk.

A bit of background: Every article published in the newspaper or on the website is tagged, classified and categorized in many ways by skilled editors. This metadata allows the creation of topic pages that automatically collect relevant articles for notable people, organisations, and events. Examples include Michelle Obama, Swine Flu (H1N1 Virus) and Wrestling.

What’s in the data? The dataset published yesterday contains information on each of the concepts that have a topic page. For now, it is limited to topic pages about people. The concepts are modelled in SKOS. The information attached to each concept consists mostly of links: to DBpedia, to Freebase, into the Times API (which is not available as RDF at this point), and of course to the corresponding topic page. This means that if you have a DBpedia URI for an especially notable entity, a high-quality New York Times topic page with the latest news about the topic is only two RDF links away. A notable feature of the links is that every single one has been manually reviewed, making this perhaps the highest-quality linkset in the LOD cloud.

How to get the data? This being linked data, every concept has a dereferenceable URI. Examples:

http://data.nytimes.com/N13941567618952269073 (Michelle Obama)
http://data.nytimes.com/54678418238199039913 (Philip K. Dick)
http://data.nytimes.com/68581771646200356283 (Amelia Earhart)

The site’s URI scheme follows one of the Cool URIs recipes: The identifiers above are resolvable, and by using content negotiation, web browsers are redirected to

http://data.nytimes.com/N13941567618952269073.html

which has a nicely formatted summary of the data available about Michelle Obama. Data browsers and other RDF-enabled clients, on the other hand, are redirected to

http://data.nytimes.com/N13941567618952269073.rdf

which has all the data goodness in RDF/XML.

There is also a dump: people.rdf. You can browse the data starting from the data.nytimes.com page. Everything is available under a CC-BY license.

Bugs and problems

This being a new dataset and the Times’ first foray into linked data, it turns out that the Beta label on the site is quite warranted. I will highlight four issues.

Data and metadata are mixed together. Let’s look at the data about Michelle Obama, available at the N13941567618952269073.rdf URI above. I’m reformatting the data into Turtle for legibility.

<http://data.nytimes.com/N13941567618952269073>
    a skos:Concept;
    skos:prefLabel "Obama, Michelle";
    skos:definition "Michelle Obama is the first …";
    skos:inScheme nyt:nytd_per;
    nyt:topicPage <http://topics.nytimes.com/top/reference/timestopics/people/o/michelle_obama/index.html>;
    owl:sameAs <http://rdf.freebase.com/rdf/en.michelle_obama>;
    owl:sameAs <http://data.nytimes.com/obama_michelle_per>;
    owl:sameAs <http://dbpedia.org/resource/Michelle_Obama>;

This makes perfect sense, it’s data about a person, modelled as a SKOS concept. But then it goes on:

<http://data.nytimes.com/N13941567618952269073>
    dc:creator "The New York Times Company";
    time:start "2007-05-18"^^xsd:date;
    time:end "2009-10-08"^^xsd:date;
    dcterms:rightsHolder "The New York Times Company"^^xsd:string;
    cc:license "http://creativecommons.org/licenses/by/3.0/us/";
    .

This is not data about Michelle Obama the person, it’s metadata about the data published by the NYT. It’s certainly not true that Michelle Obama was created by the New York Times, or that she “started” in 2007 (whatever that’s supposed to mean), and don’t even get me started on asserting rights or a license over a person.

Note that the NYT team actually went through the effort of setting up separate URIs for Michelle the person (http://data.nytimes.com/N13941567618952269073), and for the HTML and RDF documents describing the concepts (http://data.nytimes.com/N13941567618952269073.html and http://data.nytimes.com/N13941567618952269073.rdf). The reason why linked data experts advocate this practice of having separate URIs is exactly because it enables separation of data and metadata: It lets you state some facts about the concepts, and other things about the documents that describe the concepts. This is what should be done in the data above: The metadata should not be asserted about the URI identifying Michelle, but about the URI identifying the document published by the NYT: N13941567618952269073.rdf. So we would get:

<http://data.nytimes.com/N13941567618952269073>
    a skos:Concept;
    skos:prefLabel "Obama, Michelle";
    skos:definition "Michelle Obama is the first …";
    skos:inScheme nyt:nytd_per;
    nyt:topicPage <http://topics.nytimes.com/top/reference/timestopics/people/o/michelle_obama/index.html>;
    owl:sameAs <http://rdf.freebase.com/rdf/en.michelle_obama>;
    owl:sameAs <http://data.nytimes.com/obama_michelle_per>;
    owl:sameAs <http://dbpedia.org/resource/Michelle_Obama>.

<http://data.nytimes.com/N13941567618952269073.rdf>
    dc:creator "The New York Times Company";
    time:start "2007-05-18"^^xsd:date;
    time:end "2009-10-08"^^xsd:date;
    dcterms:rightsHolder "The New York Times Company"^^xsd:string;
    cc:license "http://creativecommons.org/licenses/by/3.0/us/";
    .

Eric Hellman has a post about this issue, calling it “a potential legal disaster” because a license is attached to a resource that’s said to be the same as a resource on a different site (DBpedia and Freebase). He’s a bit alarmist, but this example highlights why the separation of data and metadata, of concept URIs and document URIs, is critically important in a general-purpose data model.

Distinguishing URIs and literals. Here are some selected snippets from the RDF/XML output:

    <nyt:topicPage>http://topics.nytimes.com/top/reference/timestopics/people/o/michelle_obama/index.html</nyt:topicPage>
    <cc:License>http://creativecommons.org/licenses/by/3.0/us/</cc:License>
    <cc:Attribution>http://data.nytimes.com/N13941567618952269073</cc:Attribution>

The values of all three properties are URIs. In the RDF data model, URIs are of such central importance that they are treated differently from any other kind of value (strings, integers, dates). But not so in the code example above. There, the three URIs are encoded as simple strings. This should be:

    <nyt:topicPage rdf:resource="http://topics.nytimes.com/top/reference/timestopics/people/o/michelle_obama/index.html" />
    <cc:License rdf:resource="http://creativecommons.org/licenses/by/3.0/us/" />
    <cc:Attribution rdf:resource="http://data.nytimes.com/N13941567618952269073" />

Why does this matter? It’s basically like making links “clickable” in HTML by putting them into an <a href="…"> tag: RDF clients will not recognize URIs if they are encoded as literals, and will not know that they can treat them as links that can be followed.
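A check for this kind of problem is easy to automate. The following Python sketch scans RDF/XML for property elements whose text content looks like an HTTP URI but which carry no rdf:resource attribute. It is only a heuristic under stated assumptions: it expects a flat rdf:RDF / rdf:Description / property structure, the `SAMPLE` document is made up for the example, and the namespace URI used for the nyt: prefix is an assumption.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical sample modelled on the snippets above; the nyt: namespace is assumed.
SAMPLE = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:cc="http://creativecommons.org/ns#"
         xmlns:nyt="http://data.nytimes.com/elements/">
  <rdf:Description rdf:about="http://data.nytimes.com/N13941567618952269073">
    <nyt:topicPage>http://topics.nytimes.com/top/reference/timestopics/people/o/michelle_obama/index.html</nyt:topicPage>
    <cc:attributionName>The New York Times Company</cc:attributionName>
    <cc:license rdf:resource="http://creativecommons.org/licenses/by/3.0/us/" />
  </rdf:Description>
</rdf:RDF>
"""


def looks_like_uri(text):
    """True if `text` parses as an absolute http(s) URI."""
    parts = urlparse(text.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def uri_literals(rdfxml):
    """Yield (property tag, value) for literal values that look like URIs."""
    root = ET.fromstring(rdfxml)
    for description in root:
        for prop in description:
            has_resource = ("{%s}resource" % RDF_NS) in prop.attrib
            if not has_resource and prop.text and looks_like_uri(prop.text):
                yield prop.tag, prop.text.strip()


for tag, value in uri_literals(SAMPLE):
    print("URI encoded as literal:", tag, value)
```

On the sample, only the nyt:topicPage element is reported; the plain-text name and the property that already uses rdf:resource pass.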

Content negotiation for hybrid clients. As usual for linked data emitting sites, there is content negotiation on the concept URIs: They redirect either to RDF or HTML, based on the Accept header sent by the client when resolving the URI via the HTTP protocol. Also as usual for first-time linked data producers, the content negotiation is a bit broken.

Here is what happens when I ask for HTML (using cURL, which is a handy tool for debugging the HTTP behaviour of linked data sites):

$ curl -I -H "Accept: text/html" http://data.nytimes.com/N13941567618952269073

Response:

HTTP/1.1 303 See Other
Server: Apache/2.2.3 (Red Hat)
Location: http://data.nytimes.com/N13941567618952269073.html

Next I will ask for RDF:

$ curl -I -H "Accept: application/rdf+xml" http://data.nytimes.com/N13941567618952269073

Response:

HTTP/1.1 303 See Other
Server: Apache/2.2.3 (Red Hat)
Location: http://data.nytimes.com/N13941567618952269073.rdf

So far, so good. But many clients are “hybrid”: they can consume both RDF and HTML. This includes many tools that can consume RDFa (RDF embedded in HTML pages). So it’s not uncommon to find tools that combine multiple media types in the Accept header. The Times server should also redirect those tools to the RDF, because any RDF-consuming client can probably handle the raw RDF data better than the (not overly useful) HTML pages. But let’s see what happens:

$ curl -I -H "Accept: text/html,application/rdf+xml" http://data.nytimes.com/N13941567618952269073

Response:

HTTP/1.1 303 See Other
Server: Apache/2.2.3 (Red Hat)
Location: http://data.nytimes.com/N13941567618952269073.rdf.html

The server redirects to a file that doesn’t exist, ending in .rdf.html. This is pretty funny to me as a programmer, because the bug gives me a glimpse into the Times codebase, where obviously a programmer didn’t consider that the two alternatives—sending HTML or sending RDF—are exclusive.

Update: Someone at the Times seems to be working on the server as I’m writing this; the latest behaviour is even worse; it redirects to .rdf.html even if I request only RDF, and uses 301 redirects instead of 303.
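The behaviour asked for above can be sketched in a few lines of Python. This is not the Times' code, and the function names are hypothetical; it only illustrates one way to choose exactly one redirect target from an Accept header, sending hybrid clients to the RDF unless they explicitly rank HTML higher.

```python
RDF_TYPES = {"application/rdf+xml"}
HTML_TYPES = {"text/html", "application/xhtml+xml"}


def parse_accept(header):
    """Return {media type: q} for an Accept header; a missing q defaults to 1.0."""
    prefs = {}
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        if not media_type:
            continue
        q = 1.0
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unparsable q-value as "not acceptable"
        prefs[media_type] = q
    return prefs


def redirect_suffix(accept_header):
    """Pick exactly one of '.rdf' or '.html' for the 303 redirect target."""
    prefs = parse_accept(accept_header)
    rdf_q = max((prefs.get(t, 0.0) for t in RDF_TYPES), default=0.0)
    html_q = max((prefs.get(t, 0.0) for t in HTML_TYPES), default=0.0)
    # Hybrid clients get RDF unless they rank HTML strictly higher.
    if rdf_q > 0 and rdf_q >= html_q:
        return ".rdf"
    return ".html"


print(redirect_suffix("text/html"))                      # .html
print(redirect_suffix("application/rdf+xml"))            # .rdf
print(redirect_suffix("text/html,application/rdf+xml"))  # .rdf
```

Because the function returns a single suffix rather than appending one per matched type, a combined target like .rdf.html cannot be produced.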

Using the Creative Commons schema. The NYT data uses the Creative Commons schema to license the data under CC-BY. Here’s the relevant RDF, in Turtle (I fixed the subject URI and turned literals into URIs where appropriate):

<http://data.nytimes.com/N13941567618952269073.rdf>
    cc:License <http://creativecommons.org/licenses/by/3.0/us/>;
    cc:Attribution <http://data.nytimes.com/N13941567618952269073>;
    cc:attributionName "The New York Times Company";
    .

This uses three properties: cc:License, cc:Attribution and cc:attributionName. But according to the schema, cc:License and cc:Attribution are classes, not properties. This should be:

<http://data.nytimes.com/N13941567618952269073.rdf>
    cc:license <http://creativecommons.org/licenses/by/3.0/us/>;
    cc:attributionURL <http://data.nytimes.com/N13941567618952269073>;
    cc:attributionName "The New York Times Company";
    .
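Mix-ups like this can often be caught with a simple lint. The Python sketch below flags predicates in the Creative Commons namespace whose local name starts with an upper-case letter, relying on the common convention that class names are capitalised and property names are not. It is a heuristic, not a schema check; the function name and the tuple representation of triples are assumptions made for the example.

```python
CC_NS = "http://creativecommons.org/ns#"
DOC = "http://data.nytimes.com/N13941567618952269073.rdf"

# The triples from the first snippet above, as (subject, predicate, object).
PUBLISHED = [
    (DOC, CC_NS + "License", "http://creativecommons.org/licenses/by/3.0/us/"),
    (DOC, CC_NS + "Attribution", "http://data.nytimes.com/N13941567618952269073"),
    (DOC, CC_NS + "attributionName", "The New York Times Company"),
]


def capitalised_cc_predicates(triples):
    """Return cc: predicates whose local name starts with an upper-case letter."""
    flagged = set()
    for _subject, predicate, _obj in triples:
        if predicate.startswith(CC_NS):
            local = predicate[len(CC_NS):]
            if local[:1].isupper():
                flagged.add("cc:" + local)
    return sorted(flagged)


print(capitalised_cc_predicates(PUBLISHED))  # ['cc:Attribution', 'cc:License']
```

A stricter check would load the schema itself and compare each predicate against the declared properties, which also catches misspelled lower-case names.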

Summary. The Times’ foray into linked data is an exciting new development, but it also shows how hard it is to get linked data right. This is a weakness of the linked data approach.

Can we do anything about this? Better tutorials and education can probably help. Another activity that is trying to address the issue is the Pedantic Web Group, a loose collection of people like me who obsess about the technical details of publishing data on the web and work with data publishers to get issues like the above fixed. We might even give you a hand with reviewing your stuff before you go live with it.

Posted in General, Semantic Web | 10 Comments

10 Responses to Linked data at the New York Times: Exciting, but buggy

Evan Sandhaus says:

October 31, 2009 at 00:03

I very much appreciate all this feedback. We’re already planning an update of the data to address the rights and other concerns raised by you and other members of the community. I hope to have this update pushed out by sometime next week.

And thank you for your patience. It’s only been 4 months, since we announced that we committed to this path and we’re still learning all the particulars.

All the best,

Evan Sandhaus

Semantic Technologist
New York Times R+D
Michael Schneider says:

October 31, 2009 at 01:09

That’s a good article!

One point:

I agree that the values of the properties nyt:topicPage and cc:licence should be URIs instead of literals. But, apart from the convenience to receive clickable links by this change, my main reason is that using URIs instead of literals here makes it clearer that not the URIs but the resources denoted by these URIs are the things the triples talk about, namely the actual topic page and the licence file, respectively.

On the other hand, I would say that the value of the cc:attributeName property (or cc:Attribute) should really be a literal rather than a URI, since in this case it is the URI itself the property refers to, AFAIU. (But I think that a clever data browser will still be able to recognize such a URI literal as a URI and can then render it in a clickable way as well.)

Cheers,
Michael
Richard Cyganiak says:

October 31, 2009 at 01:35

Michael, I think you meant cc:attributionURL when you said cc:attributionName? The range of this property is defined as rdfs:Resource in the schema definition, so it also has to be a URI. And by the way, you should have checked that before commenting here—do your homework before nitpicking ;-)

(IMO, putting a URI into a literal in RDF is always a horrible idea, because it will confuse people to no end, and breeds interoperability nightmares.)
Tom Heath says:

October 31, 2009 at 18:06

Hi Richard,

A typically thorough treatment of the subject :) One thing you didn’t mention are links to resources that can provide some background on the licensing topic. As you know, Leigh Dodds (Talis), Jordan Hatcher (Open Data Commons), Kaitlin Thaney (Science Commons) and I ran a tutorial at ISWC2009 that addressed many of these issues. The slides from the tutorial should provide a useful introduction to people new to the subject:

http://iswc2009.semanticweb.org/wiki/index.php/ISWC_2009_Tutorials/Legal_and_Social_Frameworks_for_Sharing_Data_on_the_Web

The following paper from LDOW2008 also provides a more narrative context:

http://events.linkeddata.org/ldow2008/papers/08-miller-styles-open-data-commons.pdf

It is highly likely that we will work in the coming months to provide a more polished and comprehensive guide to licensing and waivers for Linked Data publishing.

Cheers, Tom.
Pingback: Linked Open Data | Healthcare Semantic Architectures
Richard Cyganiak says:

October 31, 2009 at 19:11

Thanks for the pointers Tom. I couldn’t attend the tutorial, but would have loved to. It seems you’ve provided the most comprehensive view of the licensing issue so far.

(The link got mangled, I fixed it.)
Bill Roberts says:

November 2, 2009 at 10:01

Does the President of the United States know that he is married to a Concept?!

Thanks for the article. As you say, there seem to be a few teething troubles, but this should become a fantastic data resource once these are fixed and kudos to the NYT for making its data available in this way.

Is better tool support the way forward for this? Certainly more education should help, but there will always be pitfalls in representing information accurately. Many of the bugs you have pointed out should be identifiable automatically.

Anyway, I’m looking forward to being able to make use of the NYT data.

Cheers, Bill
Pingback: When Linked Data Rules Fail at Frederick Giasson’s Weblog
Pingback: When Linked Data Rules Fail » AI3:::Adaptive Information
Pingback: Du bist nicht deine Website » Kontroversen

Comments are closed.
