LogsDB and TSDS performance and storage improvements in Elasticsearch 8.19.0 and 9.1.0


The Elasticsearch storage engine team is focused on improving storage efficiency, indexing throughput, and resource utilization for our users. We are excited to announce multiple enhancements to both Time Series Data Streams (TSDS) and the LogsDB index mode, available today in Elasticsearch versions 8.19.0 and 9.1.0. These improvements are a direct result of our commitment to helping DevOps, SRE, and Security teams manage massive volumes of logs and metrics data more effectively and economically.

The latest enhancements to LogsDB and TSDS directly address the challenges of cost and scale by delivering significant performance and efficiency gains automatically:

Cut disk I/O during ingestion by ~50% (measured in our TSDS benchmark) by no longer writing recovery source data, leading to higher indexing throughput and lower resource consumption.
Improve storage efficiency for documents with arrays (like a list of IP addresses), further reducing disk usage and boosting indexing speed.
Speed up segment merging of doc_values by up to 40%. Segment merging is a resource-intensive background process that runs continuously while data is ingested, and most fields have doc_values enabled alongside other data structures such as the inverted index, so faster merges lower the overall CPU footprint during indexing.
Reduce storage for the _seq_no field by ~50% by replacing its BKD tree with a more efficient skip list implementation (available in 9.1.0 only). The overall saving depends on the number of fields per document; in our internal benchmarks for time series data, we observed an additional ~10% reduction in total storage used.

In our internal LogsDB benchmarks, these enhancements resulted in a ~16% reduction in storage used and a ~19% increase in median indexing throughput when compared to the performance of LogsDB when it was released in Elasticsearch version 8.17.0.

LogsDB benchmarks: Enhancements in 8.19.0 & 9.1.0

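All in all, compared to standard mode in 8.17, LogsDB is now up to roughly 4x more storage efficient (3.83x in our 9.1 enterprise benchmark), with an indexing throughput penalty of roughly 10 to 12% in most configurations and under 5% in 9.1 enterprise.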

Compared to 8.17 standard	8.19 logsdb basic	8.19 logsdb enterprise	9.1 logsdb basic	9.1 logsdb enterprise
Median indexing throughput overhead	11.00%	10.02%	11.64%	4.94%
Storage (cost) improvement	2.68x	3.65x	2.87x	3.83x

On top of storage efficiency, we also track indexing throughput overhead, because we recognize it is a key consideration when adopting LogsDB. In these benchmarks the overhead is now roughly 10 to 12% (below 5% in 9.1 enterprise), which opens up LogsDB for all types of log management, including high-volume ingestion.
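To read the storage factors in the table as percentage savings, the following minimal sketch converts an "Nx" improvement into the share of disk space saved. The function name is illustrative and the factors are copied from the table; nothing here is part of Elasticsearch itself.

```python
# Storage improvement factors from the table (relative to 8.17 standard mode).
storage_improvement = {
    "8.19 logsdb basic": 2.68,
    "8.19 logsdb enterprise": 3.65,
    "9.1 logsdb basic": 2.87,
    "9.1 logsdb enterprise": 3.83,
}


def storage_saved_percent(factor):
    """An Nx improvement means the data occupies 1/N of the original space."""
    return (1 - 1 / factor) * 100


for name, factor in storage_improvement.items():
    print(f"{name}: {storage_saved_percent(factor):.1f}% less storage")
```

A 2x factor corresponds to a 50% saving, so each further step in the factor yields a progressively smaller gain in absolute disk space.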

The foundation: How LogsDB and TSDS optimize storage

Before diving into the new features, let's revisit the core principles that make LogsDB and TSDS so powerful for managing log and metric data. Both modes automatically trigger a suite of optimizations geared toward storage efficiency. Two of the most important ones are synthetic _source and index sorting.

Index Sorting: This feature stores documents in a specific order on disk. Sorting data (e.g., by host.name and @timestamp) groups similar data together, which makes existing compression techniques far more effective and enables specialized, order-dependent codecs (such as delta-of-delta and run-length encoding). This further reduces storage usage at the cost of slightly more CPU work during indexing.
Synthetic _source: By default, Elasticsearch stores the original JSON document sent at index time in the _source field. Synthetic _source changes this by not storing the _source and instead reconstructing it on the fly from other indexed data structures (like doc values). The trade-off is a massive reduction in storage footprint in exchange for minor differences in the reconstructed source (e.g., field order may change). This is a cornerstone of the storage savings in both LogsDB and TSDS. Note that this feature is only available to Elastic Cloud serverless customers and organizations with an Enterprise license.
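To illustrate why index sorting helps compression, here is a toy sketch in Python. It is not how Lucene encodes data; the records and helper functions are hypothetical and only show that sorting by host.name and then @timestamp produces longer runs and smaller deltas.

```python
from itertools import groupby

# Hypothetical log records: (host.name, @timestamp in epoch seconds).
docs = [
    ("web-2", 1003), ("web-1", 1000), ("web-2", 1001),
    ("web-1", 1002), ("web-1", 1001), ("web-2", 1002),
]


def run_length_encode(values):
    """Collapse consecutive equal values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def delta_encode(values):
    """Keep the first value, then store differences between neighbours."""
    return values[:1] + [b - a for a, b in zip(values, values[1:])]


# Arrival order: hosts alternate, so runs are short.
unsorted_runs = run_length_encode([host for host, _ in docs])

# Index-sorted order: host.name first, then @timestamp.
sorted_docs = sorted(docs)
sorted_runs = run_length_encode([host for host, _ in sorted_docs])
sorted_deltas = delta_encode([ts for _, ts in sorted_docs])

print(len(unsorted_runs), "runs before sorting")
print(len(sorted_runs), "runs after sorting:", sorted_runs)
print("timestamp deltas after sorting:", sorted_deltas)
```

After sorting, the host column collapses into one run per host, and the timestamp column turns into mostly small deltas, which is the kind of input order-dependent codecs compress well.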

These foundational features already provide incredible value, and the new updates build on them to deliver even greater efficiency.

Slashing disk I/O by eliminating recovery source

Even when synthetic _source is enabled, older versions of Elasticsearch would still write the original source to disk in a special field. This was done to ensure replica shards can recover by replaying data from primary shards. However, temporarily storing the original source came at the cost of significant disk I/O, even though the data is later discarded.

Starting with versions 8.19.0 and 9.1.0, we have completely removed this step. Elasticsearch no longer writes this temporary recovery source, providing a massive boost to indexing performance. This single change dropped disk I/O on writes by ~50% in our TSDS benchmark.

Slashing disk I/O in Elasticsearch 8.19.0 and 9.1.0

This dramatic reduction in I/O directly translated into a 16% increase in median indexing throughput, allowing you to ingest more data, faster.

Accelerating segment merges for lower CPU overhead

Lucene’s doc values are used as columnar storage in Elasticsearch. They power many features, such as sorting, aggregating, and filtering (when there is no inverted index or BKD tree). When index sorting is enabled, flushing data to disk and merging segments become more expensive, because data structures need to be sorted according to the index sorting configuration. This applies to inverted indices, doc values, stored fields (used to store source), and more.

Our team has optimized this process significantly. Previously, merging made multiple passes (up to four) over the documents for each field. Each pass performed a merge sort, which is a CPU-intensive operation.

Starting with 8.19.0 and 9.1.0, we have streamlined this to a single pass per field over the documents to merge. This change makes doc values segment merging up to 40% faster to complete. The impact is a lower overall CPU footprint during indexing, which is especially beneficial for high-ingestion use cases where the system is constantly merging segments.
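The idea of a single merge pass over already-sorted segments can be sketched with a k-way merge. This is a simplified illustration, not Lucene's merge code; the segment contents and the function name are hypothetical.

```python
import heapq

# Hypothetical segments: each is a list of (sort_key, field_value) pairs,
# already ordered by sort_key, as index sorting guarantees within a segment.
segment_a = [(1, "a1"), (4, "a4"), (7, "a7")]
segment_b = [(2, "b2"), (3, "b3"), (8, "b8")]
segment_c = [(5, "c5"), (6, "c6")]


def merge_field_single_pass(segments):
    """Merge one field's values across segments in a single k-way pass.

    heapq.merge consumes each input lazily and exactly once, so every
    document is visited one time while the output stays sorted.
    """
    merged_values = []
    for _sort_key, value in heapq.merge(*segments, key=lambda pair: pair[0]):
        merged_values.append(value)
    return merged_values


merged = merge_field_single_pass([segment_a, segment_b, segment_c])
print(merged)
```

In this sketch, repeating the merge several times per field would multiply the comparison work without changing the result, which is the kind of redundant effort a single-pass approach avoids.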

Smarter array handling for greater storage efficiency

Previously, synthetic _source could not reconstruct the order of values within an array, forcing it to store the entire array in a separate field called _ignored_source. This meant that for fields containing arrays (like a list of security tags or IP addresses), we were storing the data twice: once in doc_values and again in _ignored_source.

We have now improved how we handle arrays of primitive values. In versions 8.19.0 and 9.1.0, the ordering of leaf array fields is preserved in a specialized doc_values field. This eliminates the need to store this data in _ignored_source, reducing storage usage and improving indexing performance for documents rich with array fields.
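One way to picture this is a sorted, de-duplicated value column plus a small list of offsets that records the original order. The sketch below is an assumption made for illustration; the exact on-disk encoding Elasticsearch uses is not described here, and the function names are hypothetical.

```python
def encode_array(values):
    """Split an array into sorted unique values plus per-element offsets."""
    sorted_unique = sorted(set(values))  # order and duplicates are lost here
    position = {value: index for index, value in enumerate(sorted_unique)}
    offsets = [position[value] for value in values]  # original order, as small integers
    return sorted_unique, offsets


def decode_array(sorted_unique, offsets):
    """Rebuild the original array, including order and duplicates."""
    return [sorted_unique[offset] for offset in offsets]


tags = ["prod", "eu-west", "prod", "critical"]
sorted_unique, offsets = encode_array(tags)
restored = decode_array(sorted_unique, offsets)

print(sorted_unique)
print(offsets)
print(restored == tags)
```

Because the offsets are small integers, keeping them next to the column values costs far less than storing a second full copy of the array.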

Replacing BKD trees with skip lists for a final storage squeeze

Every document in Elasticsearch is assigned a sequence number, stored in the _seq_no metadata field. This field was indexed using a BKD tree to enable efficient range queries, which are essential for replication: for example, a replica shard may request all operations between _seq_no X and Y. However, BKD trees are resource-intensive to build and consume considerable disk space.

For LogsDB and TSDS in version 9.1.0 and later, we have replaced the BKD tree for the _seq_no field with Lucene’s new doc value skippers (a lightweight skip list implementation on top of doc_values). This change improved indexing performance, reduced storage for the _seq_no field by ~50%, and reduced overall storage usage by an additional ~10% in our internal benchmarks for time series data. The trade-off is somewhat slower range queries, which are executed as part of replication operations from primary to replica shards.
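The trade-off can be illustrated with a toy skipper that keeps only a minimum and maximum per block of a column. This is a simplified sketch, not Lucene's doc value skipper format; the block size and function names are hypothetical.

```python
BLOCK_SIZE = 4  # hypothetical; kept small for readability


def build_skipper(column, block_size=BLOCK_SIZE):
    """Record (start, end, min, max) for each fixed-size block of the column."""
    blocks = []
    for start in range(0, len(column), block_size):
        chunk = column[start:start + block_size]
        blocks.append((start, start + len(chunk), min(chunk), max(chunk)))
    return blocks


def range_query(column, blocks, low, high):
    """Return doc IDs with a value in [low, high], skipping blocks that cannot match."""
    hits, scanned_blocks = [], 0
    for start, end, block_min, block_max in blocks:
        if block_max < low or block_min > high:
            continue  # the whole block lies outside the requested range
        scanned_blocks += 1
        hits.extend(doc for doc in range(start, end) if low <= column[doc] <= high)
    return hits, scanned_blocks


seq_no = list(range(100, 116))  # 16 docs with increasing sequence numbers
blocks = build_skipper(seq_no)
hits, scanned = range_query(seq_no, blocks, 105, 110)

print(hits, "found after scanning", scanned, "of", len(blocks), "blocks")
```

The summary per block is tiny compared with a full tree index, but matching blocks still have to be scanned value by value, which is why range queries can be somewhat slower than with a BKD tree.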

Putting it all together: Get started today

The latest enhancements in Elasticsearch 8.19.0 and 9.1.0 deliver a powerful combination of storage savings and performance gains for your logs and time-series data. By optimizing I/O, improving merge performance, handling arrays more intelligently, and trimming metadata, we are making it easier and more affordable than ever to retain and analyze your critical operational data for longer periods.

To take advantage of these automatic benefits, upgrade to Elasticsearch version 8.19.0 or 9.1.0 today.
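If LogsDB is not already enabled for your data streams, a minimal index template along the following lines is one way to opt in after upgrading. The template name, index pattern, and priority are hypothetical placeholders; the sketch only builds and prints the request body and does not contact a cluster.

```python
import json

# Hypothetical template for a custom logs data stream.
index_template = {
    "index_patterns": ["logs-myapp-*"],  # placeholder pattern, adjust to your naming
    "data_stream": {},                   # back the pattern with a data stream
    "priority": 500,                     # placeholder; must outrank overlapping templates
    "template": {
        "settings": {
            "index.mode": "logsdb",      # enables the LogsDB index mode
        }
    },
}

print("PUT _index_template/logs-myapp")
print(json.dumps(index_template, indent=2))
```

The printed body can be sent with any HTTP client or pasted into Kibana Dev Tools; new backing indices created after a rollover would then use the LogsDB mode.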
