
Speeding up merging of HNSW graphs

Explore the work we’ve been doing to reduce the overhead of building multiple HNSW graphs, particularly reducing the cost of merging graphs.

In the past, we discussed some of the challenges of having to search multiple HNSW graphs and how we were able to mitigate them. At that time we alluded to some further improvements we had planned. This post is the culmination of that work.

You might ask, why use multiple graphs at all? This is a side effect of an architectural choice in Lucene: immutable segments. As with most architectural choices there are pros and cons. For example, we’ve recently GA’d Serverless Elasticsearch. In this context, we’ve gained very significant benefits from immutable segments, including efficient index replication and the ability to decouple index and query compute and autoscale them independently. For vector quantization, segment merges give us the opportunity to update parameters to adapt them to data characteristics. Along these lines, we think there are further advantages to be had from the opportunity to measure data characteristics and revisit indexing choices at merge time.

In this post we will discuss the work we’ve been doing to significantly reduce the overhead of building multiple HNSW graphs and in particular to reduce the cost of merging graphs.

Background

In order to maintain a manageable number of segments, Lucene periodically checks whether it should merge segments. This amounts to checking if the current segment count exceeds a target segment count, which is determined by the base segment size and the merge policy. If the count is exceeded, Lucene merges groups of segments until the constraint is no longer violated. This process has been described in detail elsewhere.
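For orientation, merging in Lucene is driven by a merge policy configured on the IndexWriter. A minimal sketch of configuring the default TieredMergePolicy might look like this; the values are illustrative, not recommendations:

```java
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;

class MergePolicyExample {
    static IndexWriterConfig withTieredMerges() {
        TieredMergePolicy policy = new TieredMergePolicy();
        policy.setSegmentsPerTier(10.0);         // merge segments in groups of roughly 10
        policy.setMaxMergedSegmentMB(5 * 1024);  // cap the size of merged segments
        return new IndexWriterConfig().setMergePolicy(policy);
    }
}
```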

Lucene elects to merge similarly sized segments because this achieves logarithmic growth in the write amplification. In the case of a vector index, write amplification is the number of times a vector will be inserted into a graph. Lucene will try to merge segments in groups of approximately 10. Consequently, vectors are inserted into a graph roughly

$$1+\frac{9}{10}\log_{10}\left(\frac{n}{n_0}\right)$$

times, where $n$ is the index vector count and $n_0$ is the expected base segment vector count. Because of the logarithmic growth, write amplification is single digits even for huge indices. However, the total time spent merging graphs is linearly proportional to the write amplification.
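As a worked example, for a hypothetical index of $n = 10^9$ vectors with a base segment size of $n_0 = 10^5$ vectors:

$$1+\frac{9}{10}\log_{10}\left(\frac{10^9}{10^5}\right) = 1 + \frac{9}{10}\cdot 4 = 4.6$$

so each vector is inserted into a graph fewer than five times on average.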

When merging HNSW graphs we already make a small optimization: retaining the graph for the largest segment and inserting vectors from the other segments into it. This is the reason for the 9/10 factor above. Below we show how we are able to do significantly better by using information from all the graphs we are merging.

Graph Merging

Previously, we retained the largest graph and inserted vectors from the others, ignoring the graphs that contained them. The key insight we make use of below is that each graph we discard contains important proximity information about the vectors it contains. We would like to use this information to accelerate inserting at least some of the vectors.

We focus on the problem of inserting a smaller graph $G_s=(V_s,E_s)$ into a larger graph $G_l=(V_l,E_l)$, since this is an atomic operation we can use to build any merge policy.

The strategy is to find a subset of vertices $J\subset V_s$ to insert into the large graph. We then use the connectivity of these vertices in the small graph to accelerate inserting the remaining vertices $V_s \setminus J$. In the following, we use $N_s(u)$ and $N_l(u)$ to denote the neighbors of a vertex $u$ in the small and large graph, respectively. Schematically the process is as follows.

MERGE-HNSW

Inputs: $G_s$ and $G_l$

1    Find $J \subset V_s$ to insert into $G_l$ using COMPUTE-JOIN-SET
2    Insert each vertex $u \in J$ into $G_l$
3    for $u \in V_s \setminus J$ do
4        $J_u \leftarrow J \cap N_s(u)$
5        $E_u \leftarrow \bigcup_{v \in J_u} N_l(v)$
6        $W \leftarrow$ FAST-SEARCH-LAYER$(J_u, E_u)$
7        $neighbors \leftarrow$ SELECT-NEIGHBORS-HEURISTIC$(u, W)$
8        $J \leftarrow J \cup \{u\}$

We compute the set $J$ using a procedure we discuss below (line 1). Then we insert every vertex in $J$ into the large graph using the standard HNSW insertion procedure (line 2). For each vertex we haven’t inserted, we find its neighbors that we have inserted and their neighbors in the large graph (lines 4 and 5). We use a FAST-SEARCH-LAYER procedure seeded with this set (line 6) to find the candidates for the SELECT-NEIGHBORS-HEURISTIC from the HNSW paper (line 7). In effect, we’re replacing SEARCH-LAYER to find the candidate set in the INSERT method (Algorithm 1 from the paper), which is otherwise unchanged. Finally, we add the vertex we just inserted to $J$ (line 8).
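To make the control flow concrete, here is a minimal single-layer Java sketch of MERGE-HNSW. The Graph interface and all of its methods are hypothetical stand-ins for the corresponding Lucene internals, and vertex ids are assumed to be ordinals shared by both graphs; this is a sketch, not Lucene's actual implementation.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical single-layer graph abstraction; Lucene's real classes differ.
// Vertex ids are assumed to be global ordinals shared by both graphs.
interface Graph {
    int size();
    Iterable<Integer> vertices();
    List<Integer> neighbors(int u);
    float[] vector(int u);
    void insert(int u, float[] vector);  // standard HNSW insertion
    List<Integer> fastSearchLayer(float[] query, List<Integer> seeds);
    List<Integer> selectNeighborsHeuristic(float[] query, List<Integer> candidates);
    void connect(int u, List<Integer> neighbors);  // adds u with the given links
}

class HnswMerger {
    /** Inserts the small graph into the large one, following MERGE-HNSW. */
    static void merge(Graph small, Graph large, Set<Integer> joinSet) {
        // Line 2: insert the join set with the standard HNSW procedure.
        for (int u : joinSet) {
            large.insert(u, small.vector(u));
        }
        Set<Integer> inserted = new HashSet<>(joinSet);
        // Lines 3-8: route each remaining vertex via its inserted neighbors.
        for (int u : small.vertices()) {
            if (inserted.contains(u)) {
                continue;
            }
            List<Integer> seeds = new ArrayList<>();
            for (int v : small.neighbors(u)) {
                if (inserted.contains(v)) {
                    seeds.add(v);                      // J_u (line 4)
                    seeds.addAll(large.neighbors(v));  // E_u (line 5)
                }
            }
            // Line 6: a cheap search seeded from J_u and E_u, instead of a
            // full SEARCH-LAYER from the graph's entry point.
            List<Integer> candidates = large.fastSearchLayer(small.vector(u), seeds);
            // Line 7: the diversity-based selection from the HNSW paper.
            List<Integer> chosen = large.selectNeighborsHeuristic(small.vector(u), candidates);
            large.connect(u, chosen);
            inserted.add(u);  // line 8: u can now seed later insertions
        }
    }
}
```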

It's clear that for this to work, every vertex in $V_s \setminus J$ must have at least one neighbor in $J$. In fact, we require that every vertex $u \in V_s \setminus J$ satisfies $|J\cap N_s(u)| \geq k_u$ for some $k_u < M$, the maximum layer connectivity. We observe that real HNSW graphs exhibit quite a spread of vertex degrees. The figure below shows a typical cumulative density function of vertex degree for the bottom layer of a Lucene HNSW graph.

We explored using a fixed value for $k_u$ as well as making it a function of the vertex degree. The second choice leads to larger speedups with minimal impact on graph quality, so we went with the following:

$$k_u=\max\left(2,\frac{1}{4}|N_s(u)|\right)$$

Note that $|N_s(u)|$ is, by definition, the degree of the vertex $u$ in the small graph. Having a lower limit of two means that every vertex whose degree is less than two is always inserted directly.
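In code this rule is a one-liner. A sketch with a hypothetical helper name, where integer division stands in for the 1/4 factor:

```java
// k_u: the number of already-inserted neighbors required for vertex u,
// where degree is |N_s(u)|, u's degree in the small graph.
static int requiredCoverage(int degree) {
    return Math.max(2, degree / 4);
}
```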

A simple counting argument suggests that if we choose $J$ carefully we need only insert around $\frac{1}{5}|V_s|$ vertices into $G_l$ directly. Specifically, we color an edge of the graph if exactly one of its end vertices is inserted into $J$. Then we know that for every vertex in $V_s \setminus J$ to have at least $k_u$ neighbors in $J$ we need to color at least $\sum_{u\in V_s\setminus J} k_u$ edges. Furthermore, we expect that

$$\sum_{u\in V_s\setminus J} k_u \approx \left(|V_s|-|J|\right)\frac{1}{4}\,\mathbb{E}_U\left[|N_s(U)|\right]$$

Here, $\mathbb{E}_U\left[|N_s(U)|\right]$ is the average vertex degree in the small graph. For each vertex $u\in J$ we color at most $|N_s(u)|$ edges. Therefore, the total number of edges we expect to color is at most $|J|\,\mathbb{E}_U\left[|N_s(U)|\right]$. We hope that by choosing $J$ carefully we will color close to this number of edges, so in order to cover all the vertices $|J|$ needs to satisfy

$$|J|\,\mathbb{E}_U\left[|N_s(U)|\right] = \left(|V_s|-|J|\right)\frac{1}{4}\,\mathbb{E}_U\left[|N_s(U)|\right]$$

This implies that $|J|=\frac{1}{4}\left(|V_s|-|J|\right)=\frac{1}{4}\cdot\frac{4}{5}|V_s|=\frac{1}{5}|V_s|$.

Provided SEARCH-LAYER dominates the runtime, this suggests we could achieve up to a $5\times$ speedup in merge time. Given the logarithmic growth of the write amplification, this means even for very large indices we would typically only double the build time compared to building one graph.

The risk in this strategy is that we damage graph quality. We initially tried a no-op FAST-SEARCH-LAYER. We found this degrades graph quality to the extent that recall as a function of latency was impacted, particularly when merging down to a single segment. We then explored various alternatives using a limited search of the graph. In the end, the most effective choice was the simplest: use SEARCH-LAYER but with a low ef_construction. With this parameterisation we were able to achieve excellent quality graphs and still decrease the merge time by a little over 30% on average.
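In terms of the earlier sketch, this amounts to implementing fastSearchLayer as the ordinary greedy layer search, seeded from the precomputed entry points but with a beam far below the usual ef_construction. The searchLayer helper and the beam value below are assumptions for illustration:

```java
// One way the fastSearchLayer method from the earlier sketch might be
// realized: the standard greedy layer search, seeded from the precomputed
// entry points, with a deliberately small beam.
public List<Integer> fastSearchLayer(float[] query, List<Integer> seeds) {
    final int lowEf = 16;  // assumed value; typical ef_construction is 100+
    return searchLayer(query, seeds, lowEf);
}
```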

Computing the Join Set

Finding a good join set can be formulated as a graph covering problem. A greedy heuristic is a simple and effective way to approximate optimal graph covers. The approach we take picks vertices one at a time to add to $J$ in decreasing gain order. The gain is defined as follows:

$$Gain(v)=\max\left(k_v-c(v),\,0\right)+\sum_{u\in N_s(v)\setminus J} 1\left\{c(u)<k_u\right\}$$

Here, $c(v)$ denotes the count of the neighbors of a vertex $v$ that are in $J$, and $1\{\cdot\}$ is the indicator function. The gain includes the change in the count of the vertex we add to $J$, i.e. $\max(k_v-c(v),0)$, since we get closer to our goal by adding a less covered vertex. The gain calculation is illustrated in the figure below for the central orange vertex.

We maintain the following state for each vertex $v$:

  1. Whether it is stale,
  2. Its gain $Gain(v)$,
  3. The count of adjacent vertices in $J$, denoted $c(v)$,
  4. A random number in the range $[0,1]$ that is used for tie breaking.
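As a concrete illustration, one way to represent the queue entries in Java is an immutable snapshot ordered by gain, with the stale flag and coverage count kept in side arrays, as in the loop sketch after the pseudocode. This is a sketch; the names are ours, not Lucene's.

```java
// An immutable per-vertex queue entry: a gain snapshot plus the random
// tie-breaker. Staleness and coverage are tracked in side arrays.
record VertexState(int vertex, int gain, double tieBreaker)
        implements Comparable<VertexState> {
    @Override
    public int compareTo(VertexState other) {
        int byGain = Integer.compare(other.gain, gain);  // higher gain first
        return byGain != 0 ? byGain : Double.compare(other.tieBreaker, tieBreaker);
    }
}
```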

The pseudo code for computing the join set is as follows.

COMPUTE-JOIN-SET

Inputs: $G_s$

1    $C \leftarrow \emptyset$, $J \leftarrow \emptyset$
2    $Gain_{exit} \leftarrow 0$
3    for $u \in G_s$ do
4        $C \leftarrow C \cup \{(\text{false},\, k_u + \deg(u),\, 0,\, \text{rand in } [0,1])\}$
5        $Gain_{exit} \leftarrow Gain_{exit} + k_u$
6    $Gain_{tot} \leftarrow 0$
7    while $Gain_{tot} < Gain_{exit}$ do
8        $v^* \leftarrow$ maximum gain vertex in $C$
9        Remove the state for $v^*$ from $C$
10       if $v^*$ is not stale then
11           $J \leftarrow J \cup \{v^*\}$
12           $Gain_{tot} \leftarrow Gain_{tot} + Gain(v^*)$
13           for $u \in N_s(v^*)$ do
14               mark $u$ as stale if $c(v^*) < k_{v^*}$
15               mark neighbors of $u$ stale if $c(u) = k_u - 1$
16               $c(u) \leftarrow c(u) + 1$
17       else
18           $Gain(v^*) \leftarrow \max\left(k_{v^*} - c(v^*),\, 0\right) + \sum_{u\in N_s(v^*)\setminus J} 1\left\{c(u) < k_u\right\}$
19           if $Gain(v^*) > 0$ then
20               $C \leftarrow C \cup \{(\text{false},\, Gain(v^*),\, c(v^*),\, \text{copy rand})\}$
21   return $J$

We first initialize the state in lines 1-5.

In each iteration of the main loop we first extract the maximum gain vertex (line 8), breaking ties at random. Before making any change, we need to check whether the vertex’s gain is stale. In particular, each time we add a vertex to $J$ we affect the gains of other vertices:

  1. Since all its neighbors have an additional neighbor in $J$, their gains can change (line 14),
  2. If any of its neighbors are now fully covered, all their neighbors’ gains can change (lines 15-16).

We recompute gains in a lazy fashion: we only recompute the gain of a vertex when we are about to insert it into $J$ (lines 18-20). Since gains only ever decrease, we can never miss a vertex we should insert.

Note that we simply need to keep track of the total gain of the vertices we’ve added to $J$ to determine when to exit. Furthermore, whilst $Gain_{tot} < Gain_{exit}$ at least one vertex has non-zero gain, so we always make progress.
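Putting the pieces together, here is a compact Java sketch of this lazy greedy loop, reusing the hypothetical Graph interface and VertexState record from the earlier sketches. It assumes vertex ids are the ordinals 0..n-1 and is illustrative, not Lucene's implementation.

```java
import java.util.HashSet;
import java.util.PriorityQueue;
import java.util.Random;
import java.util.Set;

class JoinSetComputer {
    /** Lazy greedy cover, following COMPUTE-JOIN-SET above. */
    static Set<Integer> computeJoinSet(Graph g) {
        int n = g.size();
        Random rng = new Random();
        int[] coverage = new int[n];       // c(u): neighbors of u already in J
        boolean[] stale = new boolean[n];  // whether u's queued gain is outdated
        boolean[] inJoinSet = new boolean[n];
        double[] tieBreak = new double[n];
        PriorityQueue<VertexState> queue = new PriorityQueue<>();
        long exitGain = 0;
        for (int u = 0; u < n; u++) {      // lines 1-5: initial gain is k_u + deg(u)
            int degree = g.neighbors(u).size();
            tieBreak[u] = rng.nextDouble();
            queue.add(new VertexState(u, requiredCoverage(degree) + degree, tieBreak[u]));
            exitGain += requiredCoverage(degree);
        }
        Set<Integer> joinSet = new HashSet<>();
        long totalGain = 0;
        while (totalGain < exitGain && !queue.isEmpty()) {
            VertexState top = queue.poll();  // lines 8-9: pop the max gain vertex
            int v = top.vertex();
            if (!stale[v]) {                 // lines 10-16: accept v, update neighbors
                joinSet.add(v);
                inJoinSet[v] = true;
                totalGain += top.gain();
                boolean vCovered = coverage[v] >= requiredCoverage(g.neighbors(v).size());
                for (int u : g.neighbors(v)) {
                    if (!vCovered) {
                        stale[u] = true;     // line 14: u loses v's indicator term
                    }
                    if (coverage[u] == requiredCoverage(g.neighbors(u).size()) - 1) {
                        for (int w : g.neighbors(u)) {
                            stale[w] = true; // line 15: u just became fully covered
                        }
                    }
                    coverage[u]++;           // line 16
                }
            } else {                         // lines 17-20: lazily recompute the gain
                stale[v] = false;
                int gain = Math.max(requiredCoverage(g.neighbors(v).size()) - coverage[v], 0);
                for (int u : g.neighbors(v)) {
                    if (!inJoinSet[u] && coverage[u] < requiredCoverage(g.neighbors(u).size())) {
                        gain++;              // u still needs more covered neighbors
                    }
                }
                if (gain > 0) {
                    queue.add(new VertexState(v, gain, tieBreak[v]));
                }
            }
        }
        return joinSet;
    }

    // k_u from earlier: max(2, |N_s(u)| / 4).
    static int requiredCoverage(int degree) {
        return Math.max(2, degree / 4);
    }
}
```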

Results

We ran experiments on four datasets that together cover our three supported distance metrics (Euclidean, cosine and inner product):

  1. quora-E5-small: 522931 documents, 384 dimensions and uses cosine similarity,
  2. cohere-wikipedia-v2: 1M documents, 768 dimensions and uses cosine similarity,
  3. gist: 1M documents, 960 dimensions and uses Euclidean distance, and
  4. cohere-wikipedia-v3: 1M documents, 1024 dimensions and uses maximum inner product.

For each dataset we evaluate two quantization levels:

  1. int8 – which uses a 1-byte integer per dimension and
  2. BBQ – which uses a single bit per dimension.
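As a rough illustration of the footprint difference (ignoring per-vector corrective terms and metadata), a 768-dimensional vector requires:

$$768 \times 4\,\text{B} = 3072\,\text{B (float32)},\qquad 768 \times 1\,\text{B} = 768\,\text{B (int8)},\qquad 768 \times \tfrac{1}{8}\,\text{B} = 96\,\text{B (BBQ)}$$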

Finally, for each experiment we evaluated search quality at two retrieval depths, examining it after building the index and again after force merging to a single segment.

In summary, we achieve consistent, substantial speedups in indexing and merging while maintaining graph quality, and hence search performance, in all cases.

Experiment 1: int8 quantization

The average speedups from the baseline to the candidate (the proposed changes) are:

Index Time Speedup: 1.28×

Force Merge Speedup: 1.72×

This corresponds to the following breakdown in run times

For completeness, the exact times are:

| Dataset | Index (baseline) | Index (candidate) | Merge (baseline) | Merge (candidate) |
|---|---|---|---|---|
| quora-E5-small | 112.41s | 81.55s | 113.81s | 70.87s |
| wiki-cohere-v2 | 158.1s | 122.95s | 425.20s | 239.28s |
| gist | 141.82s | 119.26s | 536.07s | 279.05s |
| wiki-cohere-v3 | 211.86s | 168.22s | 654.97s | 414.12s |

Below we show the recall vs latency graphs that compare the candidate (dashed lines) to the baseline at two retrieval depths, recall@10 and recall@100, for indices with multiple segments (the final result of our default merge strategy after indexing all vectors) and after force merging to a single segment. A curve that is higher and further to the left is better, which means higher recall at lower latency.

As you can see, for multiple segment indices the candidate is better for the Cohere v3 dataset and slightly worse, but almost comparable, for all other datasets. After merging to a single segment recall curves are almost identical for all cases.

Experiment 2: BBQ quantization

The average speedups from the baseline to the candidate are:

Index Time Speedup: 1.33×

Force Merge Speedup: 1.34×

This corresponds to the following breakdown in run times

For completeness, the exact times are:

| Dataset | Index (baseline) | Index (candidate) | Merge (baseline) | Merge (candidate) |
|---|---|---|---|---|
| quora-E5-small | 70.71s | 58.25s | 59.38s | 40.15s |
| wiki-cohere-v2 | 203.08s | 142.27s | 107.27s | 85.68s |
| gist | 110.35s | 105.52s | 323.66s | 202.2s |
| wiki-cohere-v3 | 313.43s | 190.63s | 165.98s | 159.95s |

For multiple segment indices the candidate is better for almost all datasets, except cohere v2 where the baseline is slightly better. For the single segment indices recall curves are almost identical for all cases.

Conclusion

The algorithm discussed in this blog will be available in the upcoming Lucene 10.2, and in the Elasticsearch release that is based on it. Users will be able to take advantage of the improved merge performance and reduced index build time in these new versions. This change is a part of our continuous effort to make Lucene and Elasticsearch fast and efficient for vector and hybrid search.

Try out vector search for yourself using this self-paced hands-on learning for Search AI. You can start a free cloud trial or try Elastic on your local machine now.
