Apache <strong>lucene</strong> 10 is out! Improvements to <strong>lucene</strong>'s hardware efficiency & more

Apache lucene 10 is finally out! With more than 2,000 commits from 185 unique contributors since lucene 9.0 - which was released in December 2021, almost 3 years ago - a lot has been going on. To be fair, a majority of these changes have been delivered in 9.x minor releases. However, the most ambitious changes usually need a major version, such as the introduction of multi-dimensional points in lucene 6.0, dynamic pruning in 8.0, or vector search in 9.0. In 10.0, the area of focus for lucene has been hardware efficiency, ie. making lucene better at taking advantage of modern hardware. Let me guide you through the main release highlights:

More search parallelism

For many years now, lucene has had the ability to parallelize search execution, by creating groups of segments, searching each group in a different thread, and combining results in the end. One downside of this approach is that it couples the index geometry - how the index is organized into segments - and search parallelism. For instance, an index that has been force-merged down to a single segment can no longer take advantage of multiple execution threads to be searched. Quite disappointing when modern CPUs commonly have tens of cores!

To overcome this limitation, lucene's query evaluation logic now allows splitting the index into logical partitions, that no longer need to be aligned with segments. For instance, an index that has been force-merged down to a single segment could still be sliced into 10 logical partitions that have one tenth of the documents of this segment each.

This change will help increase search parallelism, especially on machines that have many cores and/or on indexes that have few segments on their highest tier. This change doesn't work nicely with queries that have a high cost for creating a Scorer yet - such as range queries and prefix queries, but we're hoping to lift this limitation in upcoming minor releases.

Better I/O parallelism

Until now, lucene would use synchronous I/O and perform at most one I/O operation at a time per search thread. For indexes that significantly exceed the size of the page cache, this could lead to queries being bound on I/O latency, while the host is still very far from maxing out IOPS. Frustrating!

To overcome this, lucene's Directory abstraction introduced a new IndexInput#prefetch API, to let the OS know about regions of files that it is about to read. The OS can then parallelize retrieving pages that intersect with these regions, within a single OS thread. For instance, a BooleanQuery with TermQuery clauses would now perform the I/O of terms dictionary lookups in parallel and then retrieve the first couple pages of each postings list in parallel, within a single execution thread. MMapDirectory, lucene's default Directory implementation, implements this prefetch API using madvise's MADV_WILLNEED advice on Linux and Mac OS.

We are very excited about this change, which has already proved to help on fast local NVMe disks, and will further help on storage systems that have worse latencies while retaining good parallelism such as network-attached disks (GCP persistent storage, Amazon EBS, Azure managed disks) or even object storage (GCP Cloud storage, Amazon S3, Azure blob storage).

Better CPU efficiency and storage efficiency with sparse indexing

lucene 10 introduces support for sparse indexing, sometimes called primary-key indexing or zone indexing in other data stores. The idea is simple: if your data is stored on disk in sorted order, then you can organize it into blocks, record the minimum and maximum values per block, and your queries will be able to take advantage of this information to skip blocks that don't intersect with the query, or to fully match blocks that are contained by the query. Only blocks that partially intersect with the query will need further inspection, and the challenge consists of picking the best index sort that will minimize the number of such blocks.

lucene's sparse indexes are currently implemented via 4 levels of blocks that have 4k, 32k, 256k and 2M docs each respectively.

When done right, this form of indexing is extremely space-efficient (only a few bytes per block) and CPU-efficient (can make a decision about whether thousands of documents match or not with only a few CPU instructions). The downside is that the index can only be stored in a single order on disk, so not all fields can benefit from it. Typically, the index would be sorted on the main dimensions of the data. For instance, for an e-commerce catalog containing products, these dimensions could be the category and the brand of the products.

Conclusion

Note that some hardware-efficiency-related changes have also been released in 9.x minor releases. In particular, it's worth highlighting that:

lucene now takes advantage of explicit vectorization when comparing vectors and decoding postings,
lucene's concurent search execution logic performs work stealing in order to reduce the overhead of forking tasks,
lucene's postings format has been updated to have a more sequential access pattern,
lucene now passes a MADV_RANDOM advice when opening files that have a random-access pattern.

We are pretty excited about this new lucene release and the hardware-efficiency focus. In case you are curious to learn more about these improvements, we will be writing more detailed blogs about them in the coming weeks. Stay tuned.

Elasticsearch is packed with new features to help you build the best search solutions for your use case. Dive into our sample notebooks to learn more, start a free cloud trial, or try Elastic on your local machine now.