[juc] Leigh Dodds – Slug Semantic Web Crawler

Slug is one of Leigh’s pet projects. It’s a crawler for the semantic web.

There are lots of slug photos in the slides.

A semantic web crawler works like a web crawler, but it fetches RDF files instead of HTML pages, and follows rdfs:seeAlso links instead of HTML links.
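To make that concrete, here is a minimal sketch of the idea in Java using the Apache Jena API. This is not Slug's actual code; the start URL and the 100-resource limit are made up for illustration. It fetches an RDF file, collects the objects of its rdfs:seeAlso statements, and queues them for fetching in turn:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.NodeIterator;
import org.apache.jena.rdf.model.RDFNode;
import org.apache.jena.vocabulary.RDFS;

/** Minimal breadth-first semantic web crawl: fetch RDF, follow rdfs:seeAlso. */
public class MiniCrawl {
    public static void main(String[] args) {
        Deque<String> queue = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        queue.add("http://example.org/foaf.rdf");  // hypothetical start URL

        while (!queue.isEmpty() && seen.size() < 100) {
            String url = queue.poll();
            if (!seen.add(url)) continue;          // skip already-fetched resources

            Model model = ModelFactory.createDefaultModel();
            try {
                model.read(url);                   // fetch and parse the RDF file
            } catch (Exception e) {
                continue;                          // unreachable or unparseable; move on
            }

            // Follow rdfs:seeAlso links instead of HTML <a href> links.
            NodeIterator links = model.listObjectsOfProperty(RDFS.seeAlso);
            while (links.hasNext()) {
                RDFNode node = links.next();
                if (node.isURIResource()) {
                    queue.add(node.asResource().getURI());
                }
            }
        }
    }
}
```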

Slug is multi-threaded and very extensible. Crawling and fetching are separated from the further processing, so you can do almost anything with the content that has been found. Pre-defined options include caching found RDF files in a local filesystem cache or storing them in a Jena persistent store.
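That separation suggests you can plug in your own processing stage after the fetch. Here is a hypothetical sketch of what such a pluggable consumer could look like; the interface and class names below are invented for illustration, and Slug's real extension points may differ:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/**
 * Illustrates the fetch/process split: crawler threads only fetch,
 * then hand each result to pluggable processors like these.
 * (Names are invented; not Slug's actual interfaces.)
 */
interface FetchedContentProcessor {
    void process(String url, String rdfContent);
}

/** Caches each fetched RDF file in a local directory, keyed by a hash of its URL. */
class FileCacheProcessor implements FetchedContentProcessor {
    private final Path cacheDir;

    FileCacheProcessor(Path cacheDir) {
        this.cacheDir = cacheDir;
    }

    @Override
    public void process(String url, String rdfContent) {
        try {
            Files.createDirectories(cacheDir);
            Path file = cacheDir.resolve(Integer.toHexString(url.hashCode()) + ".rdf");
            Files.writeString(file, rdfContent, StandardCharsets.UTF_8);
        } catch (IOException e) {
            // a real implementation would log the failure and carry on
        }
    }
}
```

With a design like this, the filesystem cache and the Jena persistent store are just two implementations of the same consumer contract, and writing your own is a matter of implementing one small interface.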

There are pre-defined filters for changing the crawler's behaviour: RegexFilter (ignore URLs matching a regex, e.g. FOAF profiles from LiveJournal), DepthFilter (limit the crawl depth, e.g. to six steps from the start), SingleFetchFilter (don't recrawl resources that you've already seen). Adding others is easy.
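For illustration, here is what filters along these lines might look like. The interface and class names are hypothetical, not Slug's actual API:

```java
import java.util.regex.Pattern;

/** Hypothetical filter contract; Slug's real signatures may differ. */
interface UrlFilter {
    /** Return false to keep the crawler away from this URL. */
    boolean accept(String url, int depth);
}

/** Skip URLs matching a regex, e.g. FOAF profiles from LiveJournal. */
class RegexSkipFilter implements UrlFilter {
    private final Pattern pattern;

    RegexSkipFilter(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    public boolean accept(String url, int depth) {
        return !pattern.matcher(url).find();
    }
}

/** Stop following links beyond a maximum depth, e.g. six steps from the start. */
class MaxDepthFilter implements UrlFilter {
    private final int maxDepth;

    MaxDepthFilter(int maxDepth) {
        this.maxDepth = maxDepth;
    }

    public boolean accept(String url, int depth) {
        return depth <= maxDepth;
    }
}
```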

The crawler keeps metadata about its activity – which resource was fetched when and with what result, and from which resource each URL was discovered, so it also records the link structure of the data it crawls.
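A hedged sketch of how such metadata might be recorded as RDF with Jena follows; the namespace and property names below are invented for illustration, and Slug defines its own vocabulary for this:

```java
import java.util.Calendar;

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;

/** Records one fetch per resource in an RDF model; the vocabulary here is made up. */
public class CrawlMemory {
    // Hypothetical namespace; Slug uses its own vocabulary for crawl metadata.
    private static final String NS = "http://example.org/crawler#";
    private final Model memory = ModelFactory.createDefaultModel();

    public void recordFetch(String url, String referrer, int status) {
        Property fetchedAt = memory.createProperty(NS, "fetchedAt");
        Property statusCode = memory.createProperty(NS, "statusCode");
        Property linkedFrom = memory.createProperty(NS, "linkedFrom");

        Resource res = memory.createResource(url);
        memory.add(res, fetchedAt, memory.createTypedLiteral(Calendar.getInstance()));
        memory.add(res, statusCode, memory.createTypedLiteral(status));
        if (referrer != null) {
            // linkedFrom statements accumulate into a record of the link structure
            memory.add(res, linkedFrom, memory.createResource(referrer));
        }
    }
}
```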

(I wrote my own super-simple FOAF crawler earlier this year. I wish I had known about Slug back then; it would have done a much better job, and using it would have been less work.)
