Reranking with an Elasticsearch-hosted cross-encoder from Hugging Face

In this short blog, I’ll show you how to use a model from Hugging Face to perform semantic reranking in your own Elasticsearch cluster at search time. We will download the model using Eland, load a dataset from Hugging Face, and perform sample queries using retrievers, all in a Jupyter notebook.

Overview

If you are unfamiliar with semantic reranking, it helps to first understand:

What it is
Why you would want to use it
How to create an inference API and connect it to an external service
How to use a retriever query for re-ranking

Please review the following links:

What is semantic reranking and how to use it?
- Learn about the trade-offs using semantic reranking in search and RAG pipelines
Semantic reranking in Elasticsearch with retrievers
- This blog includes a video presentation and an overview of everything you need to get started.
Elastic Docs - Semantic re-ranking
- This excellent doc guide talks about use cases, encoder model types, and re-ranking in Elasticsearch

The code in this blog and accompanying notebook will also get you started, but we aren’t going to go in-depth on the what and why.

Also, note that I’ll show code snippets below, but the best way to do this yourself is to follow the accompanying notebook.

Step zero

I will assume you have an Elasticsearch cluster or serverless project to use for this guide. If not, head on over to cloud.elastic.co and sign up for a free trial! You'll need a Cloud ID and an Elasticsearch API key.

I’ll wait...

Model selection

The first (real) step is choosing a model to use for re-ranking. A deep discussion of selecting a model and evaluating results is outside the scope of this blog. Know that, for now, Elasticsearch only supports cross-encoder models for semantic re-ranking.

While not directly covering model selection, the following blogs give a good overview of evaluating search relevance.

Evaluating Search Relevance (three-part series)
Search relevance tuning: Balancing keyword and semantic search

For this guide, we are going to use cross-encoder/ms-marco-MiniLM-L-6-v2. This model was trained on the MS MARCO passage ranking dataset and is intended for retrieval and re-ranking.

Model loading

To load an NLP model from Hugging Face into Elasticsearch, you will use the Eland Python Library.

Eland is Elastic's Python library for data frame analytics and loading supervised and NLP models into Elasticsearch. It offers a familiar Pandas-compatible API.

The code below is from the notebook section "Hugging Face Reranking Model."

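# Hugging Face Hub ID of the cross-encoder to import
model_id = "cross-encoder/ms-marco-MiniLM-L-6-v2"

# Placeholders: replace with your own Cloud ID and Elasticsearch API key
cloud_id = "my_super_cloud_id"
api_key = "my_super_secret_api_key!"

# Jupyter shell magic: download the model and upload it to the cluster
!eland_import_hub_model \
--cloud-id $cloud_id \
--es-api-key $api_key \
--hub-model-id $model_id \
--task-type text_similarity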

Eland doesn’t have a specific `rerank` task type; we use the text_similarity type to load the model.

This step will download the model locally where your code is running, split it apart, and load it into your Elasticsearch cluster.
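Once the import finishes, the model lives in the cluster under an Elasticsearch model ID rather than the Hub ID. The sketch below mirrors the naming I understand Eland to apply (replace `/` with `__`, lowercase, truncate to 64 characters). Treat that rule as an assumption and confirm the actual ID in your cluster; the helper name `elasticsearch_model_id` is my own.

```python
def elasticsearch_model_id(hub_model_id: str) -> str:
    """Derive the in-cluster model ID from a Hugging Face Hub ID.

    Assumption: this mirrors Eland's naming convention on import.
    """
    return hub_model_id.replace("/", "__").lower()[:64]


es_model_id = elasticsearch_model_id("cross-encoder/ms-marco-MiniLM-L-6-v2")
print(es_model_id)  # cross-encoder__ms-marco-minilm-l-6-v2

# With a live client `es`, you could then confirm the model exists:
# es.ml.get_trained_models(model_id=es_model_id)
```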

Cut to

In the notebook, you can follow along to set up your cluster to run the re-ranking query in the next section. The setup steps after downloading the model shown in the notebook are:

Create an Inference Endpoint with the rerank task
- This will also deploy our re-ranking model on Elasticsearch machine learning nodes
Create an index mapping
Download a dataset from Hugging Face - CShorten/ML-ArXiv-Papers
Index the data into Elasticsearch
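
If you just want a feel for those setup steps without opening the notebook, here is a minimal sketch. The inference ID and index name match the query in the next section; the helper names (`rerank_endpoint_config`, `bulk_actions`, `setup_cluster`), the mapping, and the `title`/`abstract` column names are my assumptions, and the inference client method name may differ between Python client versions, so defer to the notebook where they disagree.

```python
INFERENCE_ID = "semantic-reranking"
INDEX_NAME = "arxiv-papers-lexical"

# Assumed mapping: two plain text fields for lexical search and re-ranking
INDEX_MAPPINGS = {
    "properties": {
        "title": {"type": "text"},
        "abstract": {"type": "text"},
    }
}


def rerank_endpoint_config(es_model_id, num_allocations=1, num_threads=1):
    """Build the config for a rerank endpoint backed by an in-cluster model."""
    return {
        "service": "elasticsearch",
        "service_settings": {
            "model_id": es_model_id,
            "num_allocations": num_allocations,
            "num_threads": num_threads,
        },
    }


def bulk_actions(rows, index_name=INDEX_NAME):
    """Turn dataset rows into bulk index actions, keeping only the fields we query."""
    for row in rows:
        yield {
            "_index": index_name,
            "_source": {"title": row["title"], "abstract": row["abstract"]},
        }


def setup_cluster(es, es_model_id):
    """Run the four setup steps against a live client (not executed here)."""
    from datasets import load_dataset
    from elasticsearch import helpers

    # 1. Create the inference endpoint, which also deploys the model
    es.inference.put(
        task_type="rerank",
        inference_id=INFERENCE_ID,
        inference_config=rerank_endpoint_config(es_model_id),
    )
    # 2. Create the index with its mapping
    es.indices.create(index=INDEX_NAME, mappings=INDEX_MAPPINGS)
    # 3. Download the dataset and 4. index it
    rows = load_dataset("CShorten/ML-ArXiv-Papers", split="train")
    helpers.bulk(es, bulk_actions(rows))
```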

Re-rank time!

With everything set up, we can query using the text_similarity_reranker retriever. The text similarity reranker is a two-stage reranker. This means that the specified retrievers are run first, and then those results are passed to the second re-ranking stage.

Example from the notebook:

query = "sparse vector embedding"

# Query with Semantic Reranker
response_reranked = es.search(
    index="arxiv-papers-lexical",
    body={
        "size": 10,
        "retriever": {
            "text_similarity_reranker": {
                "retriever": {
                    "standard": {
                        "query": {
                            "match": {
                                "title": query
                            }
                        }
                    }
                },
                "field": "abstract",
                "inference_id": "semantic-reranking",
                "inference_text": query,
                "rank_window_size": 100
            }
        },
        "fields": [
            "title",
            "abstract"
        ],
        "_source": False
    }
)
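
Because the request sets `_source` to False and asks for `fields`, each hit carries its values under `fields`, and each value comes back as a list. A small sketch for pulling the titles out of a response like the one above; the helper name `titles_from_response` is my own.

```python
def titles_from_response(response):
    """Return the first `title` value of each hit, in ranked order."""
    return [hit["fields"]["title"][0] for hit in response["hits"]["hits"]]


# With the response from the query above:
# for rank, title in enumerate(titles_from_response(response_reranked)):
#     print(rank, title)
```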

The parameters for the text_similarity_reranker above are:

`retriever` - Here, we do a simple match query with a standard retriever for lexical first-stage retrieval. You can also use a knn retriever or an rrf retriever here.
`field` - The field from the first-stage results the re-ranking model will use for similarity comparisons.
`inference_id` - The ID of the inference service to use for re-ranking. Here, we are using the endpoint created during setup, which is backed by the model we loaded earlier.
`inference_text` - The string to use for the similarity ranking.
`rank_window_size` - The number of top documents from the first stage the model will consider.
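
Since the first stage is just a nested retriever, it can be handy to build the request from parts so you can swap it out later. This is only a sketch: `standard_match` and `text_similarity_reranker` are hypothetical helper names of mine, and the defaults are the values used in the query above.

```python
def standard_match(field, text):
    """First-stage lexical retriever: a plain match query on one field."""
    return {"standard": {"query": {"match": {field: text}}}}


def text_similarity_reranker(first_stage, query_text, field="abstract",
                             inference_id="semantic-reranking",
                             rank_window_size=100):
    """Wrap any first-stage retriever in the second-stage reranker."""
    return {
        "text_similarity_reranker": {
            "retriever": first_stage,
            "field": field,
            "inference_id": inference_id,
            "inference_text": query_text,
            "rank_window_size": rank_window_size,
        }
    }


query = "sparse vector embedding"
retriever = text_similarity_reranker(standard_match("title", query), query)

# With a live client:
# es.search(index="arxiv-papers-lexical", body={"size": 10, "retriever": retriever})
```

To try a different first stage, pass another retriever definition as `first_stage` and leave the rest unchanged.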

You may wonder why `rank_window_size` is set to 100, even though you might ultimately want only the top 10 results.

In a two-stage search setup, the initial lexical search provides a broad set of documents for the semantic re-ranker to evaluate. Retrieving a larger set of 100 candidates increases the chances that relevant documents are available for the semantic re-ranker to identify and reorder based on semantic content, not just lexical matches. This approach compensates for the lexical search's limitations in capturing nuanced meaning, allowing the semantic model to sift through a broader range of possibilities.

However, finding the right `rank_window_size` is a balance. A larger candidate set can improve accuracy, but every candidate must be scored by the model, which increases resource demands and latency. Some tuning is necessary to achieve an optimal trade-off between recall and resources.
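
To make the window effect concrete, here is a toy, cluster-free sketch of two-stage ranking. The documents and both sets of scores are made-up numbers for illustration only, not output from Elasticsearch or the model, and `two_stage_search` is a hypothetical helper.

```python
def two_stage_search(docs, first_stage_score, rerank_score, rank_window_size, size):
    """Keep the top `rank_window_size` docs by first-stage score,
    then reorder only those by the rerank score and return `size` of them."""
    candidates = sorted(docs, key=first_stage_score, reverse=True)[:rank_window_size]
    return sorted(candidates, key=rerank_score, reverse=True)[:size]


# Illustrative scores: doc "d" is semantically the best match but lexically weak
docs = [
    {"id": "a", "lexical": 9.0, "semantic": 0.2},
    {"id": "b", "lexical": 7.5, "semantic": 0.4},
    {"id": "c", "lexical": 6.0, "semantic": 0.1},
    {"id": "d", "lexical": 3.0, "semantic": 0.9},
]


def lexical(doc):
    return doc["lexical"]


def semantic(doc):
    return doc["semantic"]


narrow = two_stage_search(docs, lexical, semantic, rank_window_size=2, size=2)
wide = two_stage_search(docs, lexical, semantic, rank_window_size=4, size=2)

print([d["id"] for d in narrow])  # ['b', 'a']
print([d["id"] for d in wide])    # ['d', 'b']
```

With the narrow window, "d" never reaches the reranker, so it cannot surface no matter how well it would have scored. The wide window finds it, at the cost of scoring twice as many candidates.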

Comparison

While I’m not going to provide an in-depth analysis of the results in this short guide, it may be of general interest to compare the top 10 results from a standard lexical match query with the results from the re-ranked query above.

This dataset contains a subset of ArXiv papers about Machine Learning. The results listed are the titles of the papers.

The “Scored Results” are the top 10 results using a standard retriever.

The “Reranked Results” are the top 10 results after re-ranking.

	Scored Results	Reranked Results
0	Compact Speaker Embedding: lrx-vector	Scaling Up Sparse Support Vector Machines by Simultaneous Feature and Sample Reduction
1	Quantum Sparse Support Vector Machines	Spaceland Embedding of Sparse Stochastic Graphs
2	Sparse Support Vector Infinite Push	Elliptical Ordinal Embedding
3	The Sparse Vector Technique, Revisited	Minimum-Distortion Embedding
4	L-Vector: Neural Label Embedding for Domain Adaptation	Free Gap Information from the Differentially Private Sparse Vector and Noisy Max Mechanisms
5	Spaceland Embedding of Sparse Stochastic Graphs	Interpolated Discretized Embedding of Single Vectors and Vector Pairs for Classification, Metric Learning and Distance Approximation
6	Sparse Signal Recovery in the Presence of Intra-Vector and Inter-Vector Correlation	Attention Word Embedding
7	Stable Sparse Subspace Embedding for Dimensionality Reduction	Binary Speaker Embedding
8	Auto-weighted Mutli-view Sparse Reconstructive Embedding	NetSMF: Large-Scale Network Embedding as Sparse Matrix Factorization
9	Embedding Words in Non-Vector Space with Unsupervised Graph Learning	Estimating Vector Fields on Manifolds and the Embedding of Directed Graphs
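
If you want to check how much the two columns overlap, a few lines of Python will do it. This sketch uses the titles from the table above; `rank_changes` is a hypothetical helper name.

```python
def rank_changes(scored, reranked):
    """Map each title present in both lists to (scored_rank, reranked_rank)."""
    before = {title: rank for rank, title in enumerate(scored)}
    return {
        title: (before[title], rank)
        for rank, title in enumerate(reranked)
        if title in before
    }


scored_titles = [
    "Compact Speaker Embedding: lrx-vector",
    "Quantum Sparse Support Vector Machines",
    "Sparse Support Vector Infinite Push",
    "The Sparse Vector Technique, Revisited",
    "L-Vector: Neural Label Embedding for Domain Adaptation",
    "Spaceland Embedding of Sparse Stochastic Graphs",
    "Sparse Signal Recovery in the Presence of Intra-Vector and Inter-Vector Correlation",
    "Stable Sparse Subspace Embedding for Dimensionality Reduction",
    "Auto-weighted Mutli-view Sparse Reconstructive Embedding",
    "Embedding Words in Non-Vector Space with Unsupervised Graph Learning",
]

reranked_titles = [
    "Scaling Up Sparse Support Vector Machines by Simultaneous Feature and Sample Reduction",
    "Spaceland Embedding of Sparse Stochastic Graphs",
    "Elliptical Ordinal Embedding",
    "Minimum-Distortion Embedding",
    "Free Gap Information from the Differentially Private Sparse Vector and Noisy Max Mechanisms",
    "Interpolated Discretized Embedding of Single Vectors and Vector Pairs for Classification, Metric Learning and Distance Approximation",
    "Attention Word Embedding",
    "Binary Speaker Embedding",
    "NetSMF: Large-Scale Network Embedding as Sparse Matrix Factorization",
    "Estimating Vector Fields on Manifolds and the Embedding of Directed Graphs",
]

print(rank_changes(scored_titles, reranked_titles))
# {'Spaceland Embedding of Sparse Stochastic Graphs': (5, 1)}
```

Only one title appears in both top 10 lists. The other nine re-ranked results were pulled up from deeper in the 100-document window.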

Your turn

Hopefully, you see how easy it is to incorporate a re-ranking model from Hugging Face into Elasticsearch so you can start re-ranking. While this isn't the only re-ranking option, it can be helpful when you are running air-gapped, don't have access to an external re-ranking service, want to control costs, or have a model that works particularly well for your dataset.