Optimized Scalar Quantization: Even Better Binary Quantization - Elasticsearch Labs

Optimized Scalar Quantization: Even Better Binary Quantization

Here we explain optimized scalar quantization in Elasticsearch and how we used it to improve Better Binary Quantization (BBQ).

Our Better Binary Quantization (BBQ) indices are now even better(er). We see recall improvements across the board (in extreme cases up to 20%), and the new approach opens the door to quantizing vectors to any bit size. As of Elasticsearch 8.18, BBQ indices are backed by our state-of-the-art optimized scalar quantization algorithm.

A Brief History of Scalar Quantization

Introduced in Elasticsearch 8.12, scalar quantization was initially a simple min/max quantization scheme. For each Lucene segment, we would find the global quantile values for a given confidence interval. These quantiles were then used as the minimum and maximum to quantize all the vectors. While this naive quantization is powerful, it only really works for whole-byte quantization.

In Elasticsearch 8.15, we added half-byte, or int4, quantization. To achieve this with high recall, we added an optimization step that calculates the best quantiles dynamically, so there are no more static confidence intervals. Lucene calculates the best global upper and lower quantiles for each segment, achieving an 8x reduction in memory utilization over float32 vectors.
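
The 8x figure follows from storing 4 bits per dimension instead of 32. The sketch below shows one way to pack two 4-bit codes into each byte. The packing layout and function names are assumptions for illustration and do not describe Lucene's on-disk format, which also stores a small amount of per-vector metadata.

```python
import numpy as np

def pack_int4(codes):
    """Pack pairs of 4-bit codes (values 0..15) into single bytes."""
    codes = np.asarray(codes, dtype=np.uint8)
    assert codes.size % 2 == 0 and codes.max() <= 15
    # Even positions go to the high nibble, odd positions to the low nibble.
    return (codes[0::2] << 4) | codes[1::2]

def unpack_int4(packed):
    """Recover the original 4-bit codes from packed bytes."""
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed >> 4
    out[1::2] = packed & 0x0F
    return out

dims = 768
codes = np.arange(dims, dtype=np.uint8) % 16   # hypothetical int4 codes
packed = pack_int4(codes)

float32_bytes = dims * 4        # 4 bytes per dimension
int4_bytes = packed.nbytes      # half a byte per dimension
```

For a 768-dimensional vector this gives 3072 bytes as float32 versus 384 bytes packed, which is the 8x reduction before metadata.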

Finally, in 8.18, we have added locally optimized scalar quantization. It optimizes quantiles for each individual vector, allowing for exceptional recall at any bit size, even single-bit quantization.

What is Optimized Scalar Quantization?

For an in-depth explanation of the math and intuition behind optimized scalar quantization, check out our blog post on Optimized Scalar Quantization. There are three main takeaways from this work:

  • Each vector is centered on the Apache Lucene segment's centroid. This allows us to make better use of the possible quantized vectors to represent the dataset as a whole.
  • Every vector is individually quantized with a unique set of optimized quantiles.
  • Asymmetric quantization is used, allowing for higher recall with the same memory footprint.

In short, when quantizing each vector:

  • Center the vector on the centroid
  • Run a limited number of iterations to find the optimal quantiles, stopping early if the quantiles are unchanged or the error (loss) increases
  • Pack the resulting quantized vector
  • Store the packed vector, its quantiles, the sum of its components, and an extra error correction term
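
The steps above can be sketched in Python as follows. This is a simplified illustration, not Lucene's implementation: the interval refinement here is a plain least-squares refit alternating with re-snapping to the grid, which stands in for the actual optimization and loss described in the in-depth post. The names `quantize_vector`, `snap_to_grid`, and `max_iters` are hypothetical, and the sketch assumes the vector is not constant.

```python
import numpy as np

def snap_to_grid(centered, lower, upper, levels):
    """Assign each component to the nearest grid point and report the error."""
    step = (upper - lower) / levels
    codes = np.clip(np.round((centered - lower) / step), 0, levels)
    loss = float(np.sum((centered - (lower + codes * step)) ** 2))
    return codes, loss

def quantize_vector(vector, centroid, bits=1, max_iters=5):
    levels = (1 << bits) - 1
    centered = vector - centroid                         # 1. center on the centroid
    lower, upper = float(centered.min()), float(centered.max())
    codes, loss = snap_to_grid(centered, lower, upper, levels)
    for _ in range(max_iters):                           # 2. limited refinement
        if codes.min() == codes.max():
            break                                        # cannot refit one level
        # Least-squares refit of the interval for the current codes.
        slope, intercept = np.polyfit(codes, centered, 1)
        new_lower, new_upper = intercept, intercept + slope * levels
        new_codes, new_loss = snap_to_grid(centered, new_lower, new_upper, levels)
        if new_loss >= loss:
            break          # interval unchanged or error no longer improving
        lower, upper, codes, loss = new_lower, new_upper, new_codes, new_loss
    codes = codes.astype(np.uint8)
    # 3. pack: only the 1-bit case is packed in this sketch
    packed = np.packbits(codes) if bits == 1 else codes
    return {                                             # 4. values to store
        "packed": packed,
        "lower": float(lower),
        "upper": float(upper),
        "component_sum": int(codes.sum()),
        "correction": float(centered @ centered),  # Euclidean-style term
        "loss": loss,
    }

rng = np.random.default_rng(42)
centroid = rng.normal(size=64)
vector = centroid + rng.normal(size=64)
stored = quantize_vector(vector, centroid, bits=1)
```

With `bits=1`, a 64-dimensional vector packs into 8 bytes plus a handful of floating-point and integer values. Passing `max_iters=0` skips the refinement and returns the plain min/max interval, which is useful for comparing the reconstruction error before and after.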

Storage and Retrieval

The storage and retrieval of optimized scalar quantization vectors are similar to BBQ. The main difference is the particular values we store.

One piece of nuance is the correction term. For Euclidean distance, we store the squared norm of the centered vector. For dot product, we store the dot product between the centroid and the uncentered vector.
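
To see why these particular values are stored, consider how a dot product can be estimated from quantized, centered vectors. Writing each centered component as `lower + step * code`, the dot product of two centered vectors expands into terms that need only the intervals, the sums of the codes, and one integer dot product. The centroid terms are then added back. The sketch below is an assumption-laden illustration: it uses plain per-vector min/max intervals instead of optimized ones, assumes 1-bit documents and 4-bit queries as the asymmetric setup, and all function names are hypothetical.

```python
import numpy as np

def quantize(centered, bits):
    """Per-vector min/max interval; a stand-in for the optimized interval."""
    levels = (1 << bits) - 1
    lower = centered.min()
    step = (centered.max() - lower) / levels
    codes = np.round((centered - lower) / step).astype(np.int64)
    return codes, lower, step

def centered_dot(d, q, dims):
    """Dot product of two dequantized centered vectors, from stored parts."""
    d_codes, d_lower, d_step = d
    q_codes, q_lower, q_step = q
    return (dims * d_lower * q_lower
            + d_lower * q_step * q_codes.sum()    # uses the query code sum
            + q_lower * d_step * d_codes.sum()    # uses the stored doc code sum
            + d_step * q_step * (d_codes @ q_codes))

def estimate_dot(doc, query, centroid, doc_bits=1, query_bits=4):
    d = quantize(doc - centroid, doc_bits)
    q = quantize(query - centroid, query_bits)
    # Correction term stored with the document: centroid . uncentered vector.
    doc_correction = doc @ centroid
    # x.y = xc.yc + x.c + y.c - c.c, where xc and yc are the centered vectors.
    return (centered_dot(d, q, doc.size)
            + doc_correction + query @ centroid - centroid @ centroid)

rng = np.random.default_rng(7)
centroid = rng.normal(size=64)
doc = centroid + rng.normal(size=64)
query = centroid + rng.normal(size=64)
estimate = estimate_dot(doc, query, centroid)
exact = doc @ query
```

The Euclidean case follows the same pattern: the squared distance between two vectors equals the sum of the squared norms of the centered vectors minus twice their centered dot product, which is why the squared norm of the centered vector is the stored correction there.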

Performance

Enough talk. Here are the results from four datasets.

  • Cohere's 768-dimensional multilingual embeddings. This is a well-distributed inner-product dataset.
  • Cohere's 1024-dimensional multilingual embeddings. This embedding model is well optimized for quantization.
  • E5-small-v2 quantized over the Quora dataset. This model typically does poorly with binary quantization.
  • GIST-1M dataset. This scientific dataset opens some interesting edge cases for inner-product and quantization.

Here are the results for Recall@10|50

Dataset        BBQ      BBQ with OSQ   Improvement
Cohere 768     0.933    0.938          0.5%
Cohere 1024    0.932    0.945          1.3%
E5-small-v2    0.972    0.975          0.3%
GIST-1M        0.740    0.989          24.9%

Across the board, we see that BBQ backed by our new optimized scalar quantization improves recall, and dramatically so for the GIST-1M dataset.

But what about indexing times? Surely all these per-vector optimizations must add up. The answer is no.

Here are the indexing times for the same datasets.

Dataset        BBQ         BBQ with OSQ   Difference
Cohere 768     368.62s     372.95s        +1%
Cohere 1024    307.09s     314.08s        +2%
E5-small-v2    227.37s     229.83s        < +1%
GIST-1M        1300.03s*   297.13s        -300%

* Since the quantization methodology works so poorly over GIST-1M when using inner-product, it takes an exceptionally long time to build the HNSW graph as the vector distances are not well distinguished.

Conclusion

Not only does this new, state-of-the-art quantization methodology improve recall for our BBQ indices, it also unlocks future optimizations. We can now quantize vectors to any bit size, and we want to explore how to provide 2-bit quantization, striking a balance between memory utilization and recall with no reranking.
