Lucene Near Realtime Search
Jason Rutherglen & Jake Mannix, LinkedIn
6/3/2009
SOLR/Lucene User’s Group, San Francisco

What is NRT?
• Search on documents nearly as fast as they are indexed
• Delete documents in a way that is immediate and IO efficient
• Good for things like Twitter and other apps that require realtime searching (Social 2.0)

Today?
• Users expect to search their data immediately after updating it (Web/Social 2.0 apps)
• Search engines are designed to perform efficient batch indexing (not realtime)
• Batch indexing is slow and updates take a while to be searchable

NRT in Lucene
• Uses core Lucene code to make existing batch indexing nearly realtime
• Required retrofitting of some of the core implementation
• Details are hidden
• Hopefully really easy for developers to use

Lucene NRT Patches
• LUCENE-1314 – IndexReader.clone
• LUCENE-1516 – IndexWriter.getReader
• LUCENE-1313 – RAMDir in IndexWriter
• LUCENE-1483 – Fast FieldCache loading
• LUCENE-1231 – Column stride fields
• LUCENE-1526 – Incremental copy-on-write

LUCENE-1314
• IndexReader.clone is similar to IndexReader.reopen
• Unlike reopen, it performs a copy-on-write of norms and deletes
• Used by LUCENE-1516 to keep deletes in RAM (rather than flushing them to disk)
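
The copy-on-write idea on this slide can be sketched in plain Java. This is a minimal illustration only, not Lucene's implementation: `CowReader`, `cloneReader`, and the `shared` flag are hypothetical names, and a `java.util.BitSet` stands in for the deleted-documents structure. A clone shares the bitset with its source until either side records a delete, at which point that side takes a private copy.

```java
import java.util.BitSet;

/** Hypothetical stand-in for a segment reader; not Lucene's IndexReader. */
final class CowReader {
    private BitSet deleted;   // deleted-document flags
    private boolean shared;   // true while the bitset may be shared with another reader

    CowReader(int maxDoc) {
        this.deleted = new BitSet(maxDoc);
        this.shared = false;
    }

    private CowReader(BitSet sharedDeletes) {
        this.deleted = sharedDeletes;
        this.shared = true;
    }

    /** Cheap clone: both readers point at the same bitset until one of them writes. */
    CowReader cloneReader() {
        this.shared = true;
        return new CowReader(deleted);
    }

    /** Copy the bitset on the first write after a clone, then mark the document deleted. */
    void deleteDocument(int docId) {
        if (shared) {
            // Simplification: a real implementation would use reference counts
            // so the last remaining owner does not copy needlessly.
            deleted = (BitSet) deleted.clone();
            shared = false;
        }
        deleted.set(docId);
    }

    boolean isDeleted(int docId) {
        return deleted.get(docId);
    }
}
```

A delete made through the clone is invisible to the original reader, so searches already running against the original are unaffected.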

LUCENE-1516
• Adds the ability to obtain an IndexReader from an IndexWriter
• Efficient in-RAM deletes
• Call IndexWriter.getReader instead of IndexReader.reopen
• All updating, deleting, reopening, and flushing details are hidden from the user
• Will be in Lucene 2.9
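
Sample IW.getReader Code
    // Lucene 2.9-era API; classes come from org.apache.lucene.{analysis,document,index,store}
    IndexWriter writer = new IndexWriter(new RAMDirectory(), new WhitespaceAnalyzer(),
        IndexWriter.MaxFieldLength.UNLIMITED);
    Document doc = new Document();
    doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
    writer.addDocument(doc);
    IndexReader reader = writer.getReader(); // sees the document without a commit
    Document sameDoc = reader.document(0);
    // Document does not override equals(), so compare a stored field instead
    assert "1".equals(sameDoc.get("id"));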

LUCENE-1313
• Near Realtime Search
• Makes IW.getReader faster
• New segments are flushed to an internal RAMDirectory inside IndexWriter
• Could increase overall indexing performance because there’s no pause while the RAM buffer is being written to disk
• Will be in Lucene 2.9?

LUCENE-1483
• Searches use field caches at the segment level
• Means faster field cache loading and more efficient memory usage
• Good for realtime because field cache loading is less of a bottleneck and uses less RAM
• Will be in Lucene 2.9
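
Why segment-level caches help realtime search can be shown with a small plain-Java sketch. Everything here is hypothetical (`Segment`, `SegmentFieldCache`, the `loads` counter); it is not Lucene's FieldCache. The assumption is that a reopened reader mostly reuses existing segments, so a cache keyed by segment only has to load values for the segments that are new.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical immutable segment holding one int value per document. */
final class Segment {
    final String name;
    final int[] values;

    Segment(String name, int[] values) {
        this.name = name;
        this.values = values;
    }
}

/** Hypothetical field cache keyed by segment rather than by the whole index. */
final class SegmentFieldCache {
    private final Map<String, int[]> cache = new HashMap<>();
    private int loads = 0; // counts expensive loads, for illustration

    int[] get(Segment seg) {
        int[] cached = cache.get(seg.name);
        if (cached == null) {
            cached = seg.values.clone(); // stands in for the costly load from disk
            cache.put(seg.name, cached);
            loads++;
        }
        return cached;
    }

    /** Sum the field across a reader, visiting one segment at a time. */
    long sum(List<Segment> reader) {
        long total = 0;
        for (Segment seg : reader) {
            for (int v : get(seg)) {
                total += v;
            }
        }
        return total;
    }

    int loads() {
        return loads;
    }
}
```

After a reopen that adds one small segment, only that segment is loaded; with a single index-wide cache the whole array would be rebuilt.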

LUCENE-1526
• Optimize copy-on-write
• When we’re doing IndexReader.clone, we may be creating a huge new array for a small number of deletes or norms updates
• So we need to do incremental copy-on-write of things like deletes, norms, and field caches (?)
• Lucene 3.0?
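
One way to picture incremental copy-on-write is a bitset split into fixed-size pages, where a clone copies only the page it writes to. The sketch below is an assumption about how such a structure could look, not the LUCENE-1526 patch itself; `PagedDeletes`, `cloneShared`, and the 1024-bit page size are all made up for illustration.

```java
import java.util.Arrays;

/** Hypothetical paged bitset: a clone copies only the pages it modifies. */
final class PagedDeletes {
    private static final int PAGE_BITS = 1024;       // assumed page size in bits
    private static final int WORDS = PAGE_BITS / 64; // longs per page

    private final long[][] pages;  // page table; entries may be shared with another instance
    private final boolean[] owned; // owned[i]: this instance may mutate page i in place
    private int copiedPages = 0;   // pages copied so far, for illustration

    PagedDeletes(int maxDoc) {
        int n = (maxDoc + PAGE_BITS - 1) / PAGE_BITS;
        pages = new long[n][WORDS];
        owned = new boolean[n];
        Arrays.fill(owned, true);
    }

    private PagedDeletes(long[][] sharedPages) {
        pages = sharedPages.clone();       // copies page references only, not page contents
        owned = new boolean[pages.length]; // nothing is owned until it is copied
    }

    /** Clone by sharing every page; both sides must copy a page before writing to it. */
    PagedDeletes cloneShared() {
        Arrays.fill(owned, false);
        return new PagedDeletes(pages);
    }

    void delete(int docId) {
        int p = docId / PAGE_BITS;
        if (!owned[p]) {
            pages[p] = pages[p].clone(); // copy one small page, not the whole bitset
            owned[p] = true;
            copiedPages++;
        }
        int bit = docId % PAGE_BITS;
        pages[p][bit >>> 6] |= 1L << (bit & 63);
    }

    boolean isDeleted(int docId) {
        int p = docId / PAGE_BITS;
        int bit = docId % PAGE_BITS;
        return (pages[p][bit >>> 6] & (1L << (bit & 63))) != 0;
    }

    int copiedPages() {
        return copiedPages;
    }
}
```

With this layout a single delete after a clone copies 128 bytes rather than an array sized to the whole segment.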

LUCENE-1231
• Column stride fields will make field cache loading faster because data will be loaded sequentially from disk
• Today there are potentially two hard drive seeks per field cache value (TermEnum.next, TermDocs.next)
• Lucene 3.0?
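
The difference between the two loading paths can be sketched in plain Java. This is an in-memory analogy with hypothetical names (`FieldCacheLoading`, `uninvert`, `readColumn`), not Lucene code: the first method rebuilds per-document values by walking terms and then the documents under each term, while the second reads values that are already stored in document order.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical comparison of two ways to fill a per-document value array. */
final class FieldCacheLoading {

    /** Un-invert an index: iterate every term, then every document listed under it. */
    static String[] uninvert(TreeMap<String, int[]> postings, int maxDoc) {
        String[] values = new String[maxDoc];
        for (Map.Entry<String, int[]> term : postings.entrySet()) { // analogous to TermEnum.next
            for (int doc : term.getValue()) {                       // analogous to TermDocs.next
                values[doc] = term.getKey();
            }
        }
        return values;
    }

    /** Column-stride layout: values are already in document order, so one pass suffices. */
    static String[] readColumn(String[] column) {
        return column.clone(); // stands in for a single sequential read
    }
}
```

Both paths produce the same array; the column layout simply avoids the scattered term and postings lookups.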

Future of Lucene NRT
• LUCENE-1292 – Realtime parallel untokenized field index (for tags)
• Pulsing – Store smaller postings directly in the term dictionary (to avoid seeks) for faster field cache loading
• Replication
• More benchmarks

LinkedIn Open Source Projects
• Bobo – Facet library that counts using custom field caches
  http://code.google.com/p/bobo-browse/
• Zoie – Realtime search on top of Lucene
  http://code.google.com/p/zoie/
• Voldemort – Distributed key-value storage
  http://project-voldemort.com/

BoboBrowse: facet features
• MultiSelect
• Runtime-defined facets (query-based, etc.)
• Fast (custom field-cache based)
• Custom facet types:
  • Hierarchical (/a/b/c)
  • Range
  • Multivalued

Zoie: realtime features
• No modifications to core Lucene
• Multiple read/write: RAMDir + FSDir
• IndexReader on a (small) RAMDir opened per request: instantly realtime
• IndexReaderDecorator for a custom Reader
• Transparent indexing: implement StreamDataProvider, then inject it

Next Steps
• Help work on the patches? https://issues.apache.org/jira/browse/LUCENE
• LinkedIn is hiring
• Contact: firstname.lastname@example.org or email@example.com