Blank nodes considered harmful

Well, they are not always harmful, but they are most of the time. I’ll get to that in a minute.

On the semantic-web@w3.org list, W3C’s Sandro Hawke has a lucid and concise summary of the problems with blank nodes in RDF. It’s worth quoting in full:

I agree that *software* should not change blank nodes to nodes with a
URI label. But, when practical, *people* probably should, as they are
authoring.

In general, blank nodes are a convenience for the content provider and a
burden on the content consumer. Higher quality data feeds use fewer
blank nodes, or none. Instead, they have a clear concept of identity
and service for every entity in their data.

If someone in the middle tries to convert (Skolemize) blank nodes, it’s
a large burden on them. Specifically, they should provide web service
for those new URIs, and if they get updated data from their sources,
they’re going to have a very hard [perhaps impossible] time
understanding what really changed.

Does this mean blank nodes are evil? Not always. Sometimes they are tolerable, sometimes they are a necessary last resort, and sometimes they are good enough. But they are never good.

  • They are fine for transient data that’s not meant to be stored.
  • They can be the only viable option if a changeable upstream data source doesn’t provide identifiers that persist across requests/updates.
  • They can be tolerable for unimportant auxiliary resources that don’t correspond to a meaningful entity in the domain of interest (e.g., some n-ary relations) and are not worth the hassle of maintaining a stable URI.

In all other cases, blank nodes should be avoided. Sandro is right: publishing RDF with blank nodes puts a burden on the consumer, especially if the data might change in the future.
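
To make the "what really changed" problem concrete, here is a minimal sketch in plain Python. It does not use a real RDF library: triples are just tuples of strings, labels starting with "_:" stand in for blank nodes, and the example.org URIs and ex: property names are made up for illustration. It compares two fetches of the same unchanged data, once with a blank node and once with a stable URI.

```python
# Triples are plain (subject, predicate, object) tuples of strings.
# Labels starting with "_:" stand in for blank nodes. All URIs and
# property names here are hypothetical.

def naive_diff(old, new):
    """Set difference on triples, comparing every label literally."""
    old, new = set(old), set(new)
    return old - new, new - old

# Two fetches of the same, unchanged data. The serializer happened to
# label the blank node differently each time, which it is free to do:
# a blank node label means nothing outside the document it appears in.
fetch1_blank = [
    ("http://example.org/alice", "ex:address", "_:b0"),
    ("_:b0", "ex:city", "Galway"),
]
fetch2_blank = [
    ("http://example.org/alice", "ex:address", "_:genid42"),
    ("_:genid42", "ex:city", "Galway"),
]

# The same data, but the publisher gave the address a stable URI.
fetch1_uri = [
    ("http://example.org/alice", "ex:address", "http://example.org/alice/address"),
    ("http://example.org/alice/address", "ex:city", "Galway"),
]
fetch2_uri = list(fetch1_uri)

removed_blank, added_blank = naive_diff(fetch1_blank, fetch2_blank)
removed_uri, added_uri = naive_diff(fetch1_uri, fetch2_uri)

print(len(removed_blank), len(added_blank))  # 2 2
print(len(removed_uri), len(added_uri))      # 0 0
```

With blank nodes, the literal comparison reports every triple as both removed and added even though nothing changed. To get the right answer, the consumer would have to match blank nodes by the structure around them, which is considerably more work than a set difference. With URIs, the diff is empty, as it should be.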

The higher the percentage of blank nodes in a dataset, the less useful it is.
