I hadn’t come across this recent addition to HP Labs’ list of technical
reports before: An
Introduction to the Semantic Web, Considerations for building
multilingual Semantic Web sites and applications by Jeremy Carroll.
If you read my blog, then you probably want to skip the “Introduction to
the SW” part. The rest of the report is a highly focused look at the
issues involved in building semantic web applications in any language
that is not English: Unicode, language tags on literals and embedded
XML, URIs vs. rdfs:labels, and IRIs.
The paper also mentions a feature introduced to RDQL in Jena 2.2:
the langeq operator. It is used to filter literals based on
their language tags, which is useful if your RDF data contains literals in multiple languages.
langeq can deal with subtags: asking for
labels in German (tag de) will also give you labels in German
as used in Switzerland and written according to the spelling reform
of 1996 (tag de-CH-1996).
Example usage:
SELECT ?resource, ?label WHERE (?resource, rdfs:label, ?label) AND (?label langeq 'it')
The current SPARQL working draft doesn’t have a
facility like this, but thanks to SPARQL’s powerful expression system,
you can emulate the same thing:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?resource ?label
WHERE {
?resource rdfs:label ?label
FILTER REGEX(LANG(?label), '^it(-|$)', 'i')
}
What’s going on here? LANG(?label) gives the label’s language
tag. It is matched against the regular expression ^it(-|$),
which matches the string it and any string that starts with
it-. The 'i' modifier to the REGEX function
makes the match case-insensitive, as required by RFC 3066 and its
replacement-in-progress, draft-phillips-langtags.
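If you want to convince yourself that the pattern behaves as described, here is a minimal sketch in Python using the standard re module. The sample labels and the helper name matches_language are made up for illustration, and Python's regex dialect is not identical to the one SPARQL uses, but for this simple pattern the two should agree.

```python
import re

# Same pattern as in the FILTER: "it" on its own, or "it-" followed by subtags.
# re.IGNORECASE plays the role of the 'i' flag passed to REGEX.
ITALIAN = re.compile(r"^it(-|$)", re.IGNORECASE)


def matches_language(tag: str) -> bool:
    """Return True if the language tag is 'it' or starts with 'it-'."""
    return ITALIAN.search(tag) is not None


# Hypothetical (label, language tag) pairs standing in for rdfs:label literals.
labels = [
    ("ciao", "it"),
    ("ciao", "it-CH"),
    ("CIAO", "IT"),
    ("hello", "en"),
    ("hallo", "de-CH-1996"),
    ("???", "ita"),      # starts with "it" but is not "it" or "it-..."
    ("untagged", ""),    # LANG() yields an empty string for untagged literals
]

italian_labels = [(text, tag) for text, tag in labels if matches_language(tag)]
print(italian_labels)
```

Only the first three pairs survive the filter. The (-|$) part is what keeps a tag like ita from slipping through, which a bare ^it would let happen.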
Jeremy also points out yet another ugly wart of RDF(/XML): Language
tagging is inconsistent for XML literals. To tag a plain literal, you
put an xml:lang attribute on its property element. If you do
the same for an XML literal, the language tag will be ignored. Instead,
you have to put the attribute onto some element within the XML literal.
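To make the difference concrete, here is a small sketch that embeds an RDF/XML fragment in a Python string and inspects where the xml:lang attributes sit. It uses only the standard library's ElementTree, so it looks at the markup and does not run an actual RDF parser. The ex: namespace, the ex:note property and the resource URI are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "ex": "http://example.org/ns#",  # hypothetical vocabulary
}
# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

doc = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:ex="http://example.org/ns#">
  <rdf:Description rdf:about="http://example.org/thing">
    <rdfs:label xml:lang="it">Ciao</rdfs:label>
    <ex:note rdf:parseType="Literal"><span xml:lang="it">Ciao</span></ex:note>
  </rdf:Description>
</rdf:RDF>
"""

root = ET.fromstring(doc)
label = root.find(".//rdfs:label", NS)  # plain literal
note = root.find(".//ex:note", NS)      # XML literal

# Plain literal: the language tag lives on the property element itself.
print(label.get(XML_LANG))
# XML literal: nothing on the property element ...
print(note.get(XML_LANG))
# ... the tag sits on an element inside the literal instead.
print(note[0].get(XML_LANG))
```

The second lookup comes back empty because the attribute was deliberately left off the ex:note property element; as described above, putting it there would not tag the XML literal anyway.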
IRIs (Internationalized Resource Identifiers) are yet another
interesting addition to the semantic web acronym soup. The next time
someone you know tries to understand the differences between URLs, URIs
and URNs, just mention that “they soon will all be replaced by IRIs
anyway.” This is a great way to keep sane people away from our line of
work.
Back to Jeremy’s report. I enjoyed this quote:
If you are asked to help with production of a
multilingual Semantic Web application you will be asking tool developers
for new features, you will be pushing at the boundaries, and finding
problems in the specifications – budget accordingly
…
Very true. But the same applies to unilingual semantic web applications.