4/29/2008

Creating Huge Indexes with Lucene

Creating a huge index can take a long time to generate the index files. There are a few approaches that can be used to achieve faster performance while creating indexes. Let's talk about a few of them.

1. Thread-based approach:
This approach can be useful when the source of data is a database (e.g. Oracle). A single query that pulls the full data set in one shot may not be able to utilize the full potential of the database, and the Lucene IndexWriter may sit idle while the long-running query executes. So if it is possible to split the data into multiple logical sets, both the database and the Lucene IndexWriter can be utilized to the fullest.

While creating an index of huge size, we can split the job into multiple threads by separating the data into logically separate sets. Each thread creates its own index in parallel, and once all threads complete we can merge them into one index using writer.addIndexes(Directory[]).
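Below is a minimal sketch of the per-thread indexing step, assuming the same Lucene 2.x-style API used in the merge snippet further down (FSDirectory.getDirectory(), the two-argument IndexWriter constructor, StandardAnalyzer). The SliceIndexer class name and the fetchDocumentsForSlice() method are hypothetical placeholders for your own Oracle query that returns one logical data set:

import java.io.File;
import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// One thread builds one partial index from one logical data set.
public class SliceIndexer implements Runnable {

    private final String indexDir; // a separate index directory per thread
    private final int sliceId;     // identifies this thread's logical data set

    public SliceIndexer(String indexDir, int sliceId) {
        this.indexDir = indexDir;
        this.sliceId = sliceId;
    }

    public void run() {
        try {
            Directory dir = FSDirectory.getDirectory(new File(indexDir));
            IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer());
            // fetchDocumentsForSlice() is a placeholder for your own Oracle
            // query that returns only the rows belonging to this data set.
            for (Document doc : fetchDocumentsForSlice(sliceId)) {
                writer.addDocument(doc);
            }
            writer.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private List<Document> fetchDocumentsForSlice(int sliceId) {
        // Hypothetical: run the slice-specific SQL and map each row to a Document.
        throw new UnsupportedOperationException("supply your own query here");
    }
}

Each thread would be started with new Thread(new SliceIndexer(...)).start(), and the main thread would join() all of them before running the merge shown below.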

This is how the code for merging the indexes looks (using addIndexesNoOptimize() followed by a final optimize()):

// Open the three partial indexes written by the worker threads.
Directory[] directories = {
    FSDirectory.getDirectory(new File(indexDir1)),
    FSDirectory.getDirectory(new File(indexDir2)),
    FSDirectory.getDirectory(new File(indexDir3)) };

// Merge them into the final index, then optimize it down to a single segment.
IndexWriter writer = new IndexWriter(indexDirectory, new StandardAnalyzer());
writer.addIndexesNoOptimize(directories);
writer.optimize();
writer.close();

When we ran 3 separate threads for 500 MB of data fetched from an Oracle database, the total time dropped to almost 70% of what a single thread fetching and indexing the same data took. You can experiment with the number of threads based on the database and host machine capacity.

Merging multiple indexes is not a costly operation in this case because you are working on logically different data sets. It took less than a minute to merge the 3 indexes (approximately 500 MB each).
