Lucene Revolution

I’ve just returned from Boston where I spent a few days at the Lucene Revolution conference. I attended the conference because we’ve been using Apache Solr for a couple of years now at McGill. Solr is a search engine sitting on top of the Apache Lucene library. Currently we’re using Solr for local search on each of our WMS sites, and more extensively on the course calendar site for searching programs, courses and faculty information. Eventually we would like Solr to meet all our search needs.

I learned a lot at the conference, including:

  • We’re on the right track with our Solr installation, but:
    • We need to upgrade to the latest stable version (happening soon).
    • We need to load balance our search servers properly, so that indexing goes to one server while searching is done from another (see the sketch after this list).
    • The load balancing means we need to set up proper replication (also happening soon).
  • Solr could be doing so much more for us, like:
    • Full global crawl and search, using something like Apache Nutch to handle the crawling and feed the results into Solr.
    • BI data processing, along with the Hadoop stack.
    • Specialized search services such as channels, classifieds, and other data collections around campus.
    • Library collection data processing.
  • Other people are doing some very interesting things with Solr:
    • The Library of Alexandria uses Solr to index the full contents of thousands of their manuscripts and books in five different languages.
    • UCLA is using Solr to process data retrieved from all TV news broadcasts for the past few years as part of their communications studies program.
    • A few people are using Solr to index Wikipedia content in various ways.
    • The Internet Archive (home of the Wayback Machine) uses Solr to index their seven petabytes of archived data.
    • Solr is being used to process social networking feeds in real time, using machine learning to add value to the data.
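
To make the indexing/search split concrete, here is a minimal SolrJ sketch of what that traffic separation might look like once replication is in place. The host names, the "calendar" core, and the sample document are all hypothetical; the point is simply that writes go to the indexing node while queries go to the replicated search node.

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

public class SplitSolrTraffic {
    public static void main(String[] args) throws Exception {
        // Hypothetical hosts: writes go to the indexing node, queries go to the
        // search node that receives its copy of the index via replication.
        SolrClient indexer =
                new HttpSolrClient.Builder("http://index-node:8983/solr/calendar").build();
        SolrClient searcher =
                new HttpSolrClient.Builder("http://search-node:8983/solr/calendar").build();

        // Add a sample course document on the indexing node only.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "COMP-202");
        doc.addField("title", "Foundations of Programming");
        indexer.add(doc);
        indexer.commit();

        // Queries never touch the indexing node; they hit the replicated search node.
        QueryResponse rsp = searcher.query(new SolrQuery("title:programming"));
        System.out.println("Found " + rsp.getResults().getNumFound() + " documents");

        indexer.close();
        searcher.close();
    }
}
```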

There was a lot of talk at the conference about Hadoop, another open source Apache project. Hadoop allows large-scale distributed processing of data, so that extremely large datasets can be processed across many servers quickly and efficiently. We don’t currently have a good use case at McGill for something like Hadoop, but the BI initiatives currently underway in IT Services could look at it for deep processing of our enterprise data, and our global search needs may become complex enough to warrant that kind of scale.
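
Hadoop’s programming model is MapReduce, and the canonical illustration is a word count: map tasks run in parallel over chunks of the input and emit a count of 1 for each word, then reduce tasks sum the counts per word, with Hadoop handling the distribution across servers. The sketch below is that standard example, lightly commented; the input and output paths are placeholders rather than anything we actually run.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: runs in parallel on splits of the input, emitting (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: all counts for the same word arrive together and are summed.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // pre-sum on each node to cut network traffic
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // placeholder input path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // placeholder output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```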

