[bxmlt2005] Harald Schöning

Harald Schöning of Software AG is chief architect of the Tamino XML database and talks about the history of this product. (Slides, in German)

They started development in 1997, before XML became a W3C recommendation. This was before XPath, XQuery, XML Schema etc. existed. Tamino was positioned as a “web database” to clearly differentiate it from relational databases.

Initially they had no idea in which areas XML would take hold, or what exactly an XML database would be useful for. They even included an SQL engine with the product because they were so uncertain about its market positioning. And the market today looks quite different from their initial expectations.

They have supported XQuery since 2002. XQuery is still not a recommendation and has changed massively since then. So they have been chasing a moving target, which proved quite difficult because customers don’t like their query language changing drastically with every new version of the database.

API bindings to many programming languages are very important.

Tamino itself talks HTTP to the outside world. Initially they didn’t include a web server in Tamino but plugged into Apache, IIS or Netscape’s web server. But customers kept believing (wrongly) that a separate web server would cost performance, so they also included a direct HTTP interface.
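
Roughly, a client just sends an HTTP request and gets an XML document back. A hedged Java sketch of such a client (server name, database path and query parameter are illustrative, not necessarily Tamino’s exact URL syntax):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

public class HttpQueryDemo {
    public static void main(String[] args) throws Exception {
        // Illustrative only: host, database name and parameter are made up.
        String query = URLEncoder.encode("/patient[name='Smith']", "UTF-8");
        URL url = new URL("http://dbserver/tamino/mydb?query=" + query);
        BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), "UTF-8"));
        for (String line; (line = in.readLine()) != null; ) {
            System.out.println(line); // the XML result document
        }
        in.close();
    }
}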

50% of the installed base is on Windows, about 15% on Solaris, 10% on Linux, the rest on other Unixes and OS/390.

Tamino never supported DTDs. They think DTDs suck.

They also think that entities in XML suck. How should a database deal with them? If you resolve them at load time, people will complain. If you don’t, entities have to be resolved at every query, which is too slow. So Tamino resolves them at load time. Of course this means you can’t query for entities. Entity use seems to be in decline anyway.
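
The load-time choice is easy to demonstrate with a standard XML parser; a minimal Java sketch (my illustration, not Tamino’s loader, and the file name is made up):

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class EntityDemo {
    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        // Expand entity references at load time, as Tamino does: afterwards
        // the document contains only replacement text, so queries are fast,
        // but the entities themselves are gone and can no longer be queried.
        f.setExpandEntityReferences(true);
        Document doc = f.newDocumentBuilder().parse(new File("manual.xml"));
        // With setExpandEntityReferences(false), EntityReference nodes would
        // be kept instead -- and every query would have to resolve them.
        System.out.println(doc.getDocumentElement().getTagName());
    }
}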

Tamino had its own schema language because XML Schema didn’t exist yet. They are not very happy with XML Schema.

Tamino supports Unicode. There are lots of interesting issues with special characters (umlauts, accents) and text retrieval.
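
One example of the kind of issue he means (my illustration, using standard Java collation, not Tamino code): whether “Müller” matches “Muller” depends on the collation strength.

import java.text.Collator;
import java.util.Locale;

public class UmlautDemo {
    public static void main(String[] args) {
        Collator c = Collator.getInstance(Locale.GERMAN);
        c.setStrength(Collator.PRIMARY);   // ignore accents and case
        System.out.println(c.compare("Müller", "Muller")); // 0: treated as equal
        c.setStrength(Collator.TERTIARY);  // the default: accents matter
        System.out.println(c.compare("Müller", "Muller")); // != 0: distinct
    }
}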

Customers keep asking for a JDBC interface because so many tools (reporting etc.) can only talk to JDBC. This is a major roadblock for XML databases.
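
The reason is easy to see: reporting tools are coded against the generic JDBC API and know nothing about XML. A minimal sketch of what such a tool does internally (the connection URL and table are hypothetical):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ReportDemo {
    public static void main(String[] args) throws Exception {
        // Any JDBC-capable tool boils down to code like this; a database
        // without a JDBC driver is simply invisible to it.
        try (Connection con = DriverManager.getConnection("jdbc:somedb://host/db");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT name, revenue FROM customers")) {
            while (rs.next()) {
                System.out.println(rs.getString("name") + ": " + rs.getDouble("revenue"));
            }
        }
    }
}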

They have about 600 customers for Tamino. Most use it as part of some sort of content management solution. Another use is for data that is not structured enough for a relational DB.

An interesting application: Tamino can be put on a CD together with the database. The US Navy uses this to make their manuals searchable.

The largest database that is running Tamino has about one terabyte of data, is located in Romania and runs on a Linux version not even supported by Software AG.

The main customer requests are: performance, performance and performance. And big XML files. Like one gigabyte.

Most OODBMS are dead by now; XML DBs seem to have found their niche. [I wonder how RDF DBs will fare?]

Question from the audience: Do they think about RDF and the semantic web? Except for some academic users, customers don’t ask for it, so it’s a low priority. The company uses it in other products though (EII area).

Q: How to deal with changing schemas? Not much of a problem, customers don’t want to do incompatible schema changes.

Q: Why no effort to establish in universities? Would be important, but is not really part of the company strategy. Email Harald for an academic license.

Q: Implementation language? C and C++.

[Excellent talk.]

Posted in General | Comments Off on [bxmlt2005] Harald Schöning

Forget feature requests

The fine people at 37signals propose to just forget feature requests. Don’t bother to track them, just read them and throw them away. If something is really important, users will keep telling you over and over again, so you will remember anyway.

I don’t fully agree. A tracker for feature requests can focus discussion and “collaborative thinking”. But in general, the idea is sound. You don’t want to implement a new feature because of one reader. You want to implement the features that many people need. And these will pop up over and over again (like support for branches in StatCVS).

Posted in General | Comments Off on Forget feature requests

[bxmlt2005] Chris Hübsch: XQuestXML

Chris Hübsch, TU Chemnitz: XQuestXML — an XML grammar for describing questionnaires

Chris presents an XML-based system for online questionnaires. This is not very exciting, to be honest, but I like the talk anyway. Like Chris’ talk from last year (page in German), it’s a great lesson in how computer systems should be built: evaluate your options at every step and pick the best tool for the job. Too many people (including myself) use the “if all you have is a hammer …” approach.

Make the questionnaire system web-based or not? Web-based, because deployment is cheaper and you can do world-wide questionnaires.

What client-side web technologies should be used? Only plain HTML, no Flash etc., because it’s universally supported and sufficient to solve the problem.

How to accept answers from the web form? A PHP script, because that’s the stuff PHP was invented for.

Where to store answers? CSV, XML, database? Database because it’s easy to add new records, and it’s easy to do stuff like counting and averaging.

How to specify the questions? TML, LMML, LConML, DocBook, XFDL, XForms? All have some problems, a custom format seems to be the best choice. That’s XQuestXML.

A bit of scripting and XSLT creates HTML forms from the questionnaire definition file, sets up the database including creating the answers table, and sets up the receive-answer and export-all-answers PHP scripts.
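
The generation step could look roughly like this in Java (a sketch with made-up file names; the actual system uses its own scripts):

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class GenerateForm {
    public static void main(String[] args) throws Exception {
        // File names are hypothetical. The idea: one XQuestXML definition,
        // several XSLT stylesheets, one per generated artifact (HTML form,
        // CREATE TABLE statement, PHP receiver script, ...).
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource("questionnaire-to-html.xsl"));
        t.transform(new StreamSource("survey.xquestxml"),
                    new StreamResult("survey-form.html"));
    }
}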

There’s a PDF export function to create paper questionnaires but apparently no one uses this.

There are a few extra functions like access restricted questionnaires, recording answer times etc.

The system has been used at the university for studies and for teacher evaluation. The scripts are available for download (site in German).

Posted in General | 1 Comment

[bxmlt2005] Cristian Pérez de Laborda: Querying Relational Databases with RDQL

Yay! This is very close to my D2RQ and sparql2sql work and one of the talks I’ve been looking forward to.

Cristian is from the Heinrich Heine Universität Düsseldorf. He wants to make the data in relational databases available as RDF. This is the same idea as D2RQ. But they do the mapping completely automatically. This turns every database into a source of RDF data.

Relational.OWL is an ontology for describing the structure of databases: tables, columns, primary keys etc. So it’s a meta-schema for describing relational schemas. Using this, it’s easy to write down the contents of a DB as RDF.
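
The automatic part is plausible because JDBC already exposes all the needed metadata. A Java sketch of the idea (mine, not their code; the vocabulary terms are only illustrative, not the actual Relational.OWL URIs):

import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;
import java.sql.ResultSet;

public class SchemaToRdf {
    public static void main(String[] args) throws Exception {
        // Walk the JDBC catalog and emit one RDF statement per table and
        // column. Connection URL and vocabulary are made up for this sketch.
        Connection con = DriverManager.getConnection("jdbc:somedb://host/db");
        DatabaseMetaData meta = con.getMetaData();
        ResultSet tables = meta.getTables(null, null, "%", new String[]{"TABLE"});
        while (tables.next()) {
            String table = tables.getString("TABLE_NAME");
            System.out.println("dbs:" + table + " rdf:type dbs:Table .");
            ResultSet cols = meta.getColumns(null, null, table, "%");
            while (cols.next()) {
                String col = cols.getString("COLUMN_NAME");
                System.out.println("dbs:" + table + "." + col
                        + " rdf:type dbs:Column .");
            }
            cols.close();
        }
        con.close();
    }
}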

So how to get RDF data out of a database? Use an RDF query language! Cristian picked RDQL.

An RDQL query might look like this:

SELECT ?person, ?name
WHERE (?person, rdf:type, dbs:PERSON)
      (?person, dbs:PERSON.NAME, ?name)

This picks a person and its NAME column from the PERSON table. Class and property names are derived directly from the table and column names of the DB schema.

So is RDQL expressive enough to replace SQL? He aims a bit lower and tries to show that RDQL is as expressive as relational algebra. This is a bit weird because an RDQL result is not RDF, but a table-like structure. This means you can’t chain RDQL expressions, while you can always chain relational algebra expressions to form arbitrarily complex expressions.

He demoes a little tool that lets you enter an RDQL query like the one above, and it translates this to SQL and returns the results.
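
For the example query above, the 1:1 mapping makes the translation mechanical: the rdf:type pattern selects the table, and each dbs:TABLE.COLUMN pattern selects a column. A toy Java reconstruction of that idea (my sketch, not the actual tool):

import java.util.List;

public class RdqlToSql {
    // Toy translator for the restricted query shape above: one subject
    // variable bound to a table via rdf:type, plus any number of
    // (?s, dbs:TABLE.COLUMN, ?v) patterns selecting columns.
    static String translate(String table, List<String> columns) {
        StringBuilder sql = new StringBuilder("SELECT ");
        for (int i = 0; i < columns.size(); i++) {
            if (i > 0) sql.append(", ");
            sql.append("t.").append(columns.get(i));
        }
        return sql.append(" FROM ").append(table).append(" t").toString();
    }

    public static void main(String[] args) {
        // (?person, rdf:type, dbs:PERSON) selects the table;
        // (?person, dbs:PERSON.NAME, ?name) selects a column.
        System.out.println(translate("PERSON", List.of("NAME")));
        // -> SELECT t.NAME FROM PERSON t
    }
}

Joins would only appear when patterns connect different subject variables.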

Unlike RDQL, SPARQL can return RDF graphs and could be better than RDQL for the job, but this remains as future work.

[The difference between this approach and D2RQ is that they have a simple, but automatic, 1:1 mapping from tables/columns to classes and properties, while with D2RQ, a custom mapping into any RDF vocabulary is possible, but has to be defined manually by an expert.]

Posted in General, Semantic Web | Comments Off on [bxmlt2005] Cristian Pérez de Laborda: Querying Relational Databases with RDQL

[bxmlt2005] Impressions from Magnus

Here (in German)

He sits next to me and it’s his fault if I run out of battery on my laptop.

Posted in General | 1 Comment

[bxmlt2005] Stefan Audersch

Stefan Audersch: Semantic Web technologies for visual exploration and fusion of multivariate data

[I missed the first few minutes]

It’s about data integration in a network of small companies and research institutes. They want to share medical and biological data.

Their system integrates data from two organizations. The data lives in a MySQL database. They describe the database schema with RDF, using Sesame. They expose a web service interface to the client.

On the client, a user can build a “workflow” consisting of data sources, “Semantic Joins” on primary keys, and various display/rendering/visualization modules. The RDF descriptions of the databases are used in the workflow builder, e.g. for checking the workflow for errors.

They have performance problems and hope to improve them by using parallelization.

Very interesting talk; I don’t do it justice with this poor summary. I’m looking forward to reading the paper.

[Update: Some corrections after a chat with Stefan]

Posted in General, Semantic Web | Comments Off on [bxmlt2005] Stefan Audersch

[bxmlt2005] Elena Paslaru: Towards a Cost Estimation Model for Ontology Engineering

Elena Paslaru is from Freie Universität Berlin, that’s my university.

(Slides, PDF)

She asks the audience to see an ontology as the result of an engineering process, just like a piece of software.

There are many decisions to be made when a complex ontology is needed: Build? Buy? Extract from natural language documents? All at once or incremental? Elena’s group develops a methodology that helps with these decisions by estimating how much the different approaches will cost.

They have elaborate formulas for estimating the costs. E.g. if the project team has less than two months of experience with the knowledge representation formalism, they will need twice as much time as a team of experts with 6+ years of experience. The factors are based on a review of the literature and on case studies.
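
This sounds like a COCOMO-style parametric model; my hedged reconstruction of the general shape (the symbols are mine, not hers):

\[ \mathrm{Effort} = A \cdot \mathrm{Size}^{B} \cdot \prod_{i} \mathrm{CD}_{i} \]

where Effort is measured in person-months, Size measures the ontology (e.g. number of entities), and each CD_i is a cost driver; the experience factor above would enter as a CD_i of about 2 for a novice team and 1 for experts.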

Problem: There’s not enough historical data to actually validate the model.

From the audience: A huge part of the costs is in maintenance. (The model includes maintenance costs.)

[I don’t quite buy into this. Does anybody actually have the problem that they need an ontology but don’t know how expensive it will be? I’d have appreciated some concrete examples demonstrating the need. Well, I’m just one of these lower-case semantic web people.]

Posted in General, Semantic Web | Comments Off on [bxmlt2005] Elena Paslaru: Towards a Cost Estimation Model for Ontology Engineering

[bxmlt2005] Nicola Henze: Personalization on the Semantic Web

This is today’s invited talk. Nicola Henze is a professor at Uni Hannover, which is Germany’s second-best Semantic Web research location, according to Robert Tolksdorf. She talks about the “people” in TBL’s famous definition of the Semantic Web: “… an extension of the current Web … better enabling computers and people to work in cooperation.”

(Slides, PDF)

She reviews personalization approaches on the old web and on mobile devices. Research has shown that users can access services faster on personalized mobile devices, are more satisfied with the service, and ultimately use it more.

There’s a hierarchy of personalization. Services can be unpersonalised, can identify users anonymously, can be aware of the user’s context (his or her location, time, device), or can have complex models of individual users (assessing their goals, interests, requirements).

Enter the Semantic Web. It can be used to improve the existing approaches: metadata about web pages (their subjects, that a page is the homepage of X, etc.) can be used to improve navigation and to select relevant bits and pieces.

It’s bad when we can’t comprehend why a personalized system does certain things, e.g. presents me with a certain piece of information. The Semantic Web enables proofs of these decisions: the system can explain why it shows me this. [Sounds a bit like TriQL.P.]

Case study: a system for information about scientific publications. [How original!] The problem to be solved here is duplication of information. Information about publications originates on the authors’ home pages, but we want different views on this information.

They use a product called LIXTO Suite to extract the data from web pages. The data is then mixed with additional ontologies, CiteSeer data and personalization rules. RDQL queries are used to get stuff out of the data and show it in a custom interface.
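
With Jena, the usual RDQL toolkit at the time, running such a query looks roughly like this (a sketch from memory of the old com.hp.hpl.jena.rdql API; the query itself is made up):

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdql.Query;
import com.hp.hpl.jena.rdql.QueryEngine;
import com.hp.hpl.jena.rdql.QueryResults;
import com.hp.hpl.jena.rdql.ResultBinding;

public class PublicationQuery {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        // ... load the aggregated publication data into the model ...
        String q = "SELECT ?title, ?author "
                 + "WHERE (?pub, dc:title, ?title) (?pub, dc:creator, ?author) "
                 + "USING dc FOR <http://purl.org/dc/elements/1.1/>";
        Query query = new Query(q);
        query.setSource(model);
        QueryResults results = new QueryEngine(query).exec();
        while (results.hasNext()) {
            ResultBinding rb = (ResultBinding) results.next();
            System.out.println(rb.get("author") + ": " + rb.get("title"));
        }
        results.close();
    }
}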

She doesn’t give many details. What happens in the backend? There’s “Ontology” and “Reasoning” on the architecture slide; what do these components do? How does the rendering of RDF to HTML work? There’s more in the slides.

Basically it’s a portal that aggregates publication information from many sources.

She reads from a novel — I missed the title — where someone uses a computer to evaluate candidates’ claims in an election campaign. Giving a few orders to the computer — remove all obvious lies, superfluous stuff etc. — reduces their programmes to a few words. Very timely, as there are elections in Germany next weekend.

The extraction from HTML is done with regular expressions. You need a new extractor for each new datasource. If extraction fails because the HTML structure has changed, a warning is raised. This works reasonably well because publication data always looks quite similar.
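
For a flavour of the approach (my own toy example in Java, not their extractor):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PubExtractor {
    public static void main(String[] args) {
        // Toy extractor for one common publication-list markup; a real
        // deployment needs one such pattern per data source, plus the
        // warning described above when nothing matches any more.
        String html = "<li><b>Semantic Web for Dummies</b>, A. Author, 2005</li>";
        Pattern p = Pattern.compile("<li><b>(.*?)</b>,\\s*(.*?),\\s*(\\d{4})</li>");
        Matcher m = p.matcher(html);
        if (m.find()) {
            System.out.println("title:  " + m.group(1));
            System.out.println("author: " + m.group(2));
            System.out.println("year:   " + m.group(3));
        } else {
            System.err.println("warning: extractor no longer matches source");
        }
    }
}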

Posted in General, Semantic Web | Comments Off on [bxmlt2005] Nicola Henze: Personalization on the Semantic Web

[bxmlt2005] Panel: Standardization — positive impetus or market restraint?

The day sets off with a panel discussion headlined “Standardization: Positive impetus or market restraint?” The panelists are:

  • Norman Heydenreich, a director of Microsoft Germany
  • Hans Kauper, PSI and BITKOM
  • Wolf-Dieter Lukas, Ministry of Education and Research
  • Andreas Luxa, Siemens and IEEE
  • Ingo Wende, German Institute for Standardization
  • Moderator: Rainer Thiem, <xmlcity:berlin> e.V.

Microsoft is the main sponsor of the conference.

Thiem: Standardization is great, but opposed to the free market principle … Can standards dampen progress?

Lukas is a physicist, Heydenreich a philosopher.

Thiem: Should the government be involved in standardization process?

Lukas: No, the gov should not be first. But the gov is also a big IT customer, and government promotion of research can lead to standards.

Heydenreich: A standard is not an end in itself; what’s the goal? Often, parties that did not invest in a technology request it to be opened up. […] We open up the Office XML schemas to the market; you can get a license from us. Microsoft is involved in lots of standards activities with the W3C and other organizations.

Wende: One of the big standardization failures was the ISO/OSI stack. All interested parties must be involved for success. Some success stories: JPEG, MPEG.

Wende: Good standards prescribe little but ensure interoperability anyway, e.g. they prescribe interfaces, not implementations.

From the audience (Matthias Günig?): Users and professional societies must play a more active role in setting high-level standards.

Robert Tolksdorf, one of the conference organizers, in response to an audience question: Research on these subjects in Germany is centered in Karlsruhe, Hannover, Berlin and, somewhat, Munich.

Wende: Standards must be consensus based. Patents are a big issue.

Heydenreich: Patents are an issue, but some standards bodies require participants to sign away their IP rights. This removes the commercial incentive, which is bad and will not be successful.

Luxa: You must be fast and innovative.

From the audience: Semantic standards are the gold of computer science/IT.

Thiem: It’s a social issue and not just a technical/economical issue.

Lukas: Setting standards is extremely important for long-term success of companies.

Kauper: Semantic Web is important.

Heydenreich: For one euro earned by Microsoft, small and medium business partners building on Microsoft’s platform earn 40 euros.

There was lots of talk about using XML standards in government and administration. It’s difficult in Germany because the federal states have so much independence.

Posted in General, Semantic Web | Comments Off on [bxmlt2005] Panel: Standardization — positive impetus or market restraint?

Berliner XML Tage 2005

I’m at Berliner XML Tage 2005, an annual XML and Semantic Web conference in Berlin. It has a local focus: most of the attendees and many of the talks are in German. Usually there has been a fair share of industry presence. I blogged the conference last year (in German). The number of talks is somewhat down this year; there used to be three parallel tracks, but there’s only one track today. Maybe XML is no longer much of a research topic, but a mature technology?

Posted in General, Semantic Web | 1 Comment