As part of our natural language processing (NLP) blog series, we will walk through an example of using a text embedding model to generate vector representations of textual content and then run vector similarity search on the generated vectors. We will deploy a publicly available model on Elasticsearch and use it in an ingest pipeline to generate embeddings from text documents. We will then show how to use those embeddings in vector similarity search to find documents that are semantically similar to a given query.
Vector similarity search, commonly called semantic search, goes beyond traditional keyword-based search: it lets users find semantically similar documents that may share no keywords with the query, and thus provides a wider range of results. Vector similarity search operates on dense vectors and uses k-nearest neighbour search to find similar vectors. For this, textual content first needs to be converted to a numeric vector representation using a text embedding model.
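To make the idea of k-nearest neighbour search over dense vectors concrete, here is a minimal Python sketch. It is an exhaustive (brute-force) scan over toy 3-dimensional vectors that we made up for illustration; Elasticsearch's knn search uses an approximate algorithm rather than this exact scan, so treat it as an illustration of the concept only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn(query, vectors, k):
    """Return the ids of the k vectors most similar to the query (exact scan)."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical toy "embeddings"; the model used in this post emits 384 dimensions.
docs = {
    "jamaica-climate": [0.9, 0.1, 0.0],
    "jamaica-travel": [0.7, 0.3, 0.1],
    "hydrogen-boiling": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]
print(knn(query, docs, k=2))
```

The two Jamaica vectors point in nearly the same direction as the query, so they rank ahead of the unrelated one even though no keywords are compared.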
We will use a public dataset from the MS MARCO Passage Ranking Task for demonstration. It consists of real questions from the Microsoft Bing search engine and human-generated answers to them. This dataset is a perfect resource for testing vector similarity search: first, question answering is one of the most common use cases for vector search, and second, the top papers on the MS MARCO leaderboard use vector search in some form.
In our example we will work with a sample of this dataset, use a model to produce text embeddings, and then run vector search on it. We will also do a quick verification of the quality of the results produced by the vector search.
Deploying NLP Models, generating text embeddings & running vector search
1. Deploy a text embedding model
The first step is to install a text embedding model. For our model we use msmarco-MiniLM-L12-cos-v5 from Hugging Face. This is a sentence-transformer model that takes a sentence or a paragraph and maps it to a 384-dimensional dense vector. This model is optimized for semantic search and was specifically trained on the MS MARCO Passage dataset, making it suitable for our task. Besides this model, Elasticsearch supports a number of other models for text embedding. The full list can be found here.
We install the model with the Eland docker agent that we built in the NER example. Running the script below imports our model into our cluster and deploys it:
eland_import_hub_model \
--cloud-id <cloud-id> \
-u <username> -p <password> \
--hub-model-id sentence-transformers/msmarco-MiniLM-L12-cos-v5 \
--task-type text_embedding \
--start
This time, --task-type is set to text_embedding and the --start option is passed to the Eland script, so the model is deployed automatically without having to start it in the Model Management UI. To speed up inference, you can increase the number of inference threads with the inference_threads parameter.
We can test the successful deployment of the model by using this example in Kibana Console:
POST /_ml/trained_models/sentence-transformers__msmarco-minilm-l12-cos-v5/deployment/_infer
{
"docs": {
"text_field": "how is the weather in jamaica"
}
}
We should see the predicted dense vector as the result:
{
"predicted_value" : [
0.051237598061561584,
-0.04680659621953964,
0.03971194103360176
…
]
}
2. Loading initial data
As mentioned in the introduction, we use the MS MARCO Passage Ranking dataset. The dataset is quite big, consisting of over 8 million passages. For our example, we use a subset of it that was used in the testing stage of the 2019 TREC Deep Learning Track. The dataset msmarco-passagetest2019-top1000.tsv used for the re-ranking task contains 200 queries and for each query a list of relevant text passages extracted by a simple IR system. From that dataset, we’ve extracted all unique passages with their ids, and put them into a separate tsv file, totaling 182469 passages. We use this file as our dataset.
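The extraction of unique passages described above could be done along these lines. This is a sketch, not the script we actually used: it assumes each row of the top1000 file has four tab-separated columns (query id, passage id, query text, passage text), and the function name is ours.

```python
import io


def unique_passages(lines):
    """Keep the first occurrence of each passage id.

    Assumes four tab-separated columns per row:
    query id, passage id, query text, passage text.
    """
    passages = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 4:
            continue  # skip rows that do not match the assumed layout
        _qid, pid, _query, passage = parts
        passages.setdefault(int(pid), passage)
    return passages


# Tiny in-memory stand-in for the real file; passage 100 is listed for two queries.
sample = io.StringIO(
    "1\t100\tquery one\tfirst passage\n"
    "1\t101\tquery one\tsecond passage\n"
    "2\t100\tquery two\tfirst passage\n"
)
for pid, text in unique_passages(sample).items():
    print(f"{pid}\t{text}")
```

For the real file, pass an open file handle instead of the StringIO object and write the printed lines to a new tsv file.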
We use Kibana's file upload feature to upload this dataset. Kibana file upload allows us to provide custom names for fields: we call them id, with type long, for the passages’ ids, and text, with type text, for the passages’ contents. The index name is collection. After the upload, we can see an index named collection with 182469 documents.
3. Creating pipeline for text embeddings
We want to process the initial data with an Inference processor that will add an embedding for each passage. For this, we create a text embedding ingest pipeline and then reindex our initial data with this pipeline.
In the Kibana Console we create an ingest pipeline (as we did in the previous blog post), this time for text embeddings, and call it text-embeddings. The passages are in a field named text. As we did before, we’ll define a field_map to map text to the field text_field that the model expects. Similarly, the on_failure handler is set to index failures into a different index:
PUT _ingest/pipeline/text-embeddings
{
"description": "text embedding pipeline",
"processors": [
{
"inference": {
"model_id": "sentence-transformers__msmarco-minilm-l12-cos-v5",
"target_field": "text_embedding",
"field_map": {
"text": "text_field"
}
}
}
],
"on_failure": [
{
"set": {
"description": "Index document to 'failed-<index>'",
"field": "_index",
"value": "failed-{{{_index}}}"
}
},
{
"set": {
"description": "Set error message",
"field": "ingest.failure",
"value": "{{_ingest.on_failure_message}}"
}
}
]
}
4. Reindex data through text embeddings pipeline
We want to reindex documents from the collection index into the new collection-with-embeddings index by pushing them through the text-embeddings pipeline, so that documents in the collection-with-embeddings index have an additional field for the passages’ embeddings. From Elasticsearch v 8.11, it is no longer necessary to define an index mapping for the dense_vector field: long float arrays are automatically mapped as a dense_vector field with the correct number of dimensions set. But if we want more control over the index mapping, we can define it as follows:
PUT collection-with-embeddings
{
"mappings": {
"properties": {
"text_embedding.predicted_value": {
"type": "dense_vector"
},
"text": {
"type": "text"
}
}
}
}
Note: from Elasticsearch v 8.11, it is optional to provide the dims, index and similarity parameters in the mapping of dense_vector.
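If you prefer to set these parameters explicitly, the mapping body could be built as in the following sketch. The helper name is hypothetical, the value 384 comes from the model's output size mentioned earlier, and the choice of cosine similarity is our assumption based on the model being a cosine-similarity model; adjust both for other models.

```python
import json


def embeddings_mapping(dims, similarity="cosine"):
    """Build a mapping body with explicit dense_vector settings."""
    return {
        "mappings": {
            "properties": {
                "text_embedding.predicted_value": {
                    "type": "dense_vector",
                    "dims": dims,              # must match the model's output size
                    "index": True,             # make the field searchable with knn
                    "similarity": similarity,  # assumption: cosine fits this model
                },
                "text": {"type": "text"},
            }
        }
    }


# Print the body so it can be pasted after "PUT collection-with-embeddings".
print(json.dumps(embeddings_mapping(384), indent=2))
```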
Finally, we are ready to reindex. Because reindexing takes some time to process all documents and run inference on them, we run it in the background by invoking the API with the wait_for_completion=false flag.
POST _reindex?wait_for_completion=false
{
"source": {
"index": "collection"
},
"dest": {
"index": "collection-with-embeddings",
"pipeline": "text-embeddings"
}
}
The above returns a task id. We can monitor progress of the task with:
GET _tasks/<task_id>
Alternatively, track progress by watching Inference count increase in the model stats API or model stats UI.
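To turn the task response into a progress figure, a small helper like the one below can be used. It is a sketch that assumes the response has the usual shape for a reindex task, with total, created, updated and deleted counters under task.status; the sample numbers other than the total are invented.

```python
def reindex_progress(task_response):
    """Return the fraction of documents processed so far (0.0 to 1.0)."""
    status = task_response["task"]["status"]
    total = status.get("total", 0)
    if total == 0:
        return 0.0  # the task has not counted its documents yet
    done = status.get("created", 0) + status.get("updated", 0) + status.get("deleted", 0)
    return done / total


# Hypothetical snapshot of GET _tasks/<task_id> part-way through the reindex.
sample = {
    "completed": False,
    "task": {"status": {"total": 182469, "created": 45000, "updated": 0, "deleted": 0}},
}
print(f"{reindex_progress(sample):.1%} of documents reindexed")
```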
The reindexed documents now contain the inference results (vector embeddings). As an example, one of the documents looks like this:
{
"id": "G7PPtn8BjSkJO8zzChzT",
"text": "This is the definition of RNA along with examples of types of RNA molecules. This is the definition of RNA along with examples of types of RNA molecules. RNA Definition",
"text_embedding":
{
"predicted_value":
[
0.057356324046850204,
0.1602816879749298,
-0.18122544884681702,
0.022277727723121643,
....
],
"model_id": "sentence-transformers__msmarco-minilm-l12-cos-v5"
}
}
5. Vector similarity search
From Elasticsearch v 8.7, we support implicit generation of embeddings from query terms during a search request using the query_vector_builder parameter of knn search. For this you simply need to provide your model_id (in our case "sentence-transformers__msmarco-minilm-l12-cos-v5") and model_text, the query string from which the model will generate the dense vector representation.
GET collection-with-embeddings/_search
{
"knn": {
"field": "text_embedding.predicted_value",
"query_vector_builder": {
"text_embedding": {
"model_id": "sentence-transformers__msmarco-minilm-l12-cos-v5",
"model_text": "how is the weather in jamaica"
}
},
"k": 10,
"num_candidates": 100
},
"fields": [
"id",
"text"
],
"_source": false
}
Note: from Elasticsearch v 8.13, it is optional to provide the k and num_candidates parameters for knn search.
The search returns the 10 documents closest to the query, sorted by their proximity to the query:
"hits" : [
{
"_index" : "collection-with-embeddings",
"_id" : "47TPtn8BjSkJO8zzKq_o",
"_score" : 0.94591534,
"fields" : {
"id": [
434125
],
"text": [
"The climate in Jamaica is tropical and humid with warm to hot temperatures all year round. The average temperature in Jamaica is between 80 and 90 degrees Fahrenheit. Jamaican nights are considerably cooler than the days, and the mountain areas are cooler than the lower land throughout the year. Continue Reading."
]
}
},
{
"_index" : "collection-with-embeddings",
"_id" : "3LTPtn8BjSkJO8zzKJO1",
"_score" : 0.94536424,
"fields" : {
"id": [
4498474
],
"text": [
"The climate in Jamaica is tropical and humid with warm to hot temperatures all year round. The average temperature in Jamaica is between 80 and 90 degrees Fahrenheit. Jamaican nights are considerably cooler than the days, and the mountain areas are cooler than the lower land throughout the year"
]
}
},
{
"_index" : "collection-with-embeddings",
"_id" : "KrXPtn8BjSkJO8zzPbDW",
"_score" : 0.9432083,
"fields" : {
"id": [
190804
],
"text": [
"Quick Answer. The climate in Jamaica is tropical and humid with warm to hot temperatures all year round. The average temperature in Jamaica is between 80 and 90 degrees Fahrenheit. Jamaican nights are considerably cooler than the days, and the mountain areas are cooler than the lower land throughout the year. Continue Reading"
]
}
},
...
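When consuming such a response from code, the hits can be flattened into simple rows. The sketch below is illustrative only: the helper name is ours, the sample passage text is shortened, and the conversion from _score back to a cosine value assumes the field uses cosine similarity, for which the score is (1 + cosine) / 2.

```python
def summarize_hits(response):
    """Flatten knn search hits into rows of id, score, cosine and text."""
    rows = []
    for hit in response["hits"]["hits"]:
        fields = hit["fields"]  # present because the request asked for "fields"
        rows.append(
            {
                "id": fields["id"][0],
                "score": hit["_score"],
                # Assumption: cosine similarity, where _score = (1 + cosine) / 2.
                "cosine": 2 * hit["_score"] - 1,
                "text": fields["text"][0],
            }
        )
    return rows


sample_response = {
    "hits": {
        "hits": [
            {
                "_index": "collection-with-embeddings",
                "_id": "47TPtn8BjSkJO8zzKq_o",
                "_score": 0.94591534,
                "fields": {
                    "id": [434125],
                    # Passage text shortened for the example.
                    "text": ["The climate in Jamaica is tropical and humid ..."],
                },
            }
        ]
    }
}
for row in summarize_hits(sample_response):
    print(row["id"], round(row["cosine"], 4), row["text"])
```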
6. Quick verification
As we used only a subset of the MS MARCO dataset, we cannot do a full evaluation. What we can do instead is a simple verification on a few queries, just to get a sense that we are indeed getting relevant results and not random ones. From the TREC 2019 Deep Learning Track judgments for the Passage Ranking Task, we take the last 3 queries, submit them to our vector similarity search, get the top 10 results and consult the TREC judgments to see how relevant the results are. For the Passage Ranking task, passages are judged on a four-point scale: Irrelevant (0), Related (1; the passage is on-topic but does not answer the question), Highly Relevant (2), and Perfectly Relevant (3).
Please note that our verification is not a rigorous evaluation; it is used only for our quick demo. Since we only indexed passages that are known to be related to the queries, it is a much easier task than the original passage retrieval task. In the future we intend to do a rigorous evaluation on the MS MARCO dataset.
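The three verification queries reuse the search body from step 5 with a different model_text each time. As a rough sketch, the bodies could be generated as follows; the helper name is hypothetical, and the check that num_candidates is at least k reflects our understanding of the knn search constraints.

```python
import json

MODEL_ID = "sentence-transformers__msmarco-minilm-l12-cos-v5"


def knn_search_body(model_text, k=10, num_candidates=100, model_id=MODEL_ID):
    """Build a knn search body that embeds model_text at query time."""
    if num_candidates < k:
        raise ValueError("num_candidates must be greater than or equal to k")
    return {
        "knn": {
            "field": "text_embedding.predicted_value",
            "query_vector_builder": {
                "text_embedding": {"model_id": model_id, "model_text": model_text}
            },
            "k": k,
            "num_candidates": num_candidates,
        },
        "fields": ["id", "text"],
        "_source": False,
    }


# Query ids and texts of the three TREC queries used in this section.
queries = {
    1124210: "tracheids are part of _____",
    1129237: "hydrogen is a liquid below what temperature",
    1133167: "how is the weather in jamaica",
}
bodies = {qid: knn_search_body(text) for qid, text in queries.items()}
print(json.dumps(bodies[1133167], indent=2))
```

Each printed body can be sent with GET collection-with-embeddings/_search in the Kibana Console.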
Query #1124210 “tracheids are part of _____” submitted to our vector search returns the following results:
Passage id | Relevance rating | Passage |
---|---|---|
2258591 | 2 - highly relevant | Tracheid of oak shows pits along the walls. It is longer than a vessel element and has no perforation plates. Tracheids are elongated cells in the xylem of vascular plants that serve in the transport of water and mineral salts.Tracheids are one of two types of tracheary elements, vessel elements being the other. Tracheids, unlike vessel elements, do not have perforation plates.racheids provide most of the structural support in softwoods, where they are the major cell type. Because tracheids have a much higher surface to volume ratio compared to vessel elements, they serve to hold water against gravity (by adhesion) when transpiration is not occurring. |
2258592 | 3 - perfectly relevant | Tracheid. a dead lignified plant cell that functions in water conduction. Tracheids are found in the xylem of all higher plants except certain angiosperms, such as cereals and sedges, in which the water-conducting function is performed by vessels, or tracheae.Tracheids are usually polygonal in cross section; their walls have annular, spiral, or scalene thickenings or rimmed pores.racheids are found in the xylem of all higher plants except certain angiosperms, such as cereals and sedges, in which the water-conducting function is performed by vessels, or tracheae. Tracheids are usually polygonal in cross section; their walls have annular, spiral, or scalene thickenings or rimmed pores. |
2258596 | 2 - highly relevant | Woody angiosperms have also vessels. The mature tracheids form a column of superposed, cylindrical dead cells whose end walls have been perforated, resulting in a continuous tube called vessel (trachea). Tracheids are found in all vascular plants and are the only conducting elements in gymnosperms and ferns. Tracheids have Pits on their end walls. Pits are not nearly as efficient for water translocation as Perforation Plates found in vessel elements. Woody angiosperms have also vessels. The mature tracheids form a column of superposed, cylindrical dead cells whose end walls have been perforated, resulting in a continuous tube called vessel (trachea). Tracheids are found in all vascular plants and are the only conducting elements in gymnosperms and ferns |
2258595 | 2 - highly relevant | Summary: Vessels have perforations at the end plates while tracheids do not have end plates. Tracheids are derived from single individual cells while vessels are derived from a pile of cells. Tracheids are present in all vascular plants whereas vessels are confined to angiosperms. Tracheids are thin whereas vessel elements are wide. Tracheids have a much higher surface-to-volume ratio as compared to vessel elements. Vessels are broader than tracheids with which they are associated. Morphology of the perforation plate is different from that in tracheids. Tracheids are thin whereas vessel elements are wide. Tracheids have a much higher surface-to-volume ratio as compared to vessel elements. Vessels are broader than tracheids with which they are associated. Morphology of the perforation plate is different from that in tracheids. |
131190 | 3 - perfectly relevant | Xylem tracheids are pointed, elongated xylem cells, the simplest of which have continuous primary cell walls and lignified secondary wall thickenings in the form of rings, hoops, or reticulate networks. |
7443586 | 2 - highly relevant | 1 The xylem tracheary elements consist of cells known as tracheids and vessel members, both of which are typically narrow, hollow, and elongated. Tracheids are less specialized than the vessel members and are the only type of water-conducting cells in most gymnosperms and seedless vascular plants. |
181177 | 2 - highly relevant | In most plants, pitted tracheids function as the primary transport cells. The other type of tracheary element, besides the tracheid, is the vessel element. Vessel elements are joined by perforations into vessels. In vessels, water travels by bulk flow, as in a pipe, rather than by diffusion through cell membranes. |
2947055 | 0 - irrelevant | Cholesterol belongs to the groups of lipids called _______.holesterol belongs to the groups of lipids called _______. |
6541866 | 2 - highly relevant | In most plants, pitted tracheids function as the primary transport cells. The other type of tracheary element, besides the tracheid, is the vessel element. Vessel elements are joined by perforations into vessels. In vessels, water travels by bulk flow, as in a pipe, rather than by diffusion through cell membranes. In most plants, pitted tracheids function as the primary transport cells. The other type of tracheary element, besides the tracheid, is the vessel element. Vessel elements are joined by perforations into vessels. In vessels, water travels by bulk flow, as in a pipe, rather than by diffusion through cell membranes. |
Query #1129237 “hydrogen is a liquid below what temperature” returns the following results:
Passage id | Relevance rating | Passage |
---|---|---|
8588222 | 0 - irrelevant | Answer to: Hydrogen is a liquid below what temperature? By signing up, you'll get thousands of step-by-step solutions to your homework questions.... for Teachers for Schools for Companies |
128984 | 3 - perfectly relevant | Hydrogen gas has the molecular formula H 2. At room temperature and under standard pressure conditions, hydrogen is a gas that is tasteless, odorless and colorless. Hydrogen can exist as a liquid under high pressure and an extremely low temperature of 20.28 kelvin (−252.87°C, −423.17 °F). Hydrogen is often stored in this way as liquid hydrogen takes up less space than hydrogen in its normal gas form. Liquid hydrogen is also used as a rocket fuel. |
8588219 | 3 - perfectly relevant | User: Hydrogen is a liquid below what temperature? a. 100 degrees C c. -183 degrees C b. -253 degrees C d. 0 degrees C Weegy: Hydrogen is a liquid below 253 degrees C. User: What is the boiling point of oxygen? a. 100 degrees C c. -57 degrees C b. 8 degrees C d. -183 degrees C Weegy: The boiling point of oxygen is -183 degrees C. |
3905057 | 3 - perfectly relevant | Hydrogen is a colorless, odorless, tasteless gas. Its density is the lowest of any chemical element, 0.08999 grams per liter. By comparison, a liter of air weighs 1.29 grams, 14 times as much as a liter of hydrogen. Hydrogen changes from a gas to a liquid at a temperature of -252.77°C (-422.99°F) and from a liquid to a solid at a temperature of -259.2°C (-434.6°F). It is slightly soluble in water, alcohol, and a few other common liquids. |
4254811 | 3 - perfectly relevant | At STP (standard temperature and pressure) hydrogen is a gas. It cools to a liquid at -423 °F, which is only about 37 degrees above absolute zero. Eleven degrees cooler, at … -434 °F, it starts to solidify. |
2697752 | 2 - highly relevant | Hydrogen's state of matter is gas at standard conditions of temperature and pressure. Hydrogen condenses into a liquid or freezes solid at extremely cold... Hydrogen's state of matter is gas at standard conditions of temperature and pressure. Hydrogen condenses into a liquid or freezes solid at extremely cold temperatures. Hydrogen's state of matter can change when the temperature changes, becoming a liquid at temperatures between minus 423.18 and minus 434.49 degrees Fahrenheit. It becomes a solid at temperatures below minus 434.49 F.Due to its high flammability, hydrogen gas is commonly used in combustion reactions, such as in rocket and automobile fuels. |
6080460 | 3 - perfectly relevant | Hydrogen can exist as a liquid under high pressure and an extremely low temperature of 20.28 kelvin (−252.87°C, −423.17 °F). Hydrogen is often stored in this way as liquid hydrogen takes up less space than hydrogen in its normal gas form. Liquid hydrogen is also used as a rocket fuel. Hydrogen is found in large amounts in giant gas planets and stars, it plays a key role in powering stars through fusion reactions. Hydrogen is one of two important elements found in water (H 2 O). Each molecule of water is made up of two hydrogen atoms bonded to one oxygen atom. |
128989 | 3 - perfectly relevant | Confidence votes 11.4K. At STP (standard temperature and pressure) hydrogen is a gas. It cools to a liquid at -423 °F, which is only about 37 degrees above absolute zero. Eleven degrees cooler, at -434 °F, it starts to solidify. |
1959030 | 0 - irrelevant | While below 4 °C the breakage of hydrogen bonds due to heating allows water molecules to pack closer despite the increase in the thermal motion (which tends to expand a liquid), above 4 °C water expands as the temperature increases. Water near the boiling point is about 4% less dense than water at 4 °C (39 °F) |
3905800 | 0 - irrelevant | Hydrogen is the lightest of the elements with an atomic weight of 1.0. Liquid hydrogen has a density of 0.07 grams per cubic centimeter, whereas water has a density of 1.0 g/cc and gasoline about 0.75 g/cc. These facts give hydrogen both advantages and disadvantages. |
Query #1133167 “how is the weather in jamaica” returns the following results:
Passage id | Relevance rating | Passage |
---|---|---|
434125 | 3 - perfectly relevant | The climate in Jamaica is tropical and humid with warm to hot temperatures all year round. The average temperature in Jamaica is between 80 and 90 degrees Fahrenheit. Jamaican nights are considerably cooler than the days, and the mountain areas are cooler than the lower land throughout the year. |
4498474 | 3 - perfectly relevant | The climate in Jamaica is tropical and humid with warm to hot temperatures all year round. The average temperature in Jamaica is between 80 and 90 degrees Fahrenheit. Jamaican nights are considerably cooler than the days, and the mountain areas are cooler than the lower land throughout the year. |
190804 | 3 - perfectly relevant | Quick Answer. The climate in Jamaica is tropical and humid with warm to hot temperatures all year round. The average temperature in Jamaica is between 80 and 90 degrees Fahrenheit. Jamaican nights are considerably cooler than the days, and the mountain areas are cooler than the lower land throughout the year. Continue Reading. |
1824479 | 3 - perfectly relevant | A: The climate in Jamaica is tropical and humid with warm to hot temperatures all year round. The average temperature in Jamaica is between 80 and 90 degrees Fahrenheit. Jamaican nights are considerably cooler than the days, and the mountain areas are cooler than the lower land throughout the year. |
1824480 | 3 - perfectly relevant | Quick Answer. The climate in Jamaica is tropical and humid with warm to hot temperatures all year round. The average temperature in Jamaica is between 80 and 90 degrees Fahrenheit. Jamaican nights are considerably cooler than the days, and the mountain areas are cooler than the lower land throughout the year. |
1824488 | 2 - highly relevant | Learn About the Weather of Jamaica The weather patterns you'll encounter in Jamaica can vary dramatically around the island Regardless of when you visit, the tropical climate and warm temperatures of Jamaica essentially guarantee beautiful weather during your vacation. Average temperatures in Jamaica range between 80 degrees Fahrenheit and 90 degrees Fahrenheit, with July and August being the hottest months and February the coolest. |
4922619 | 2 - highly relevant | Weather. Jamaica averages about 80 degrees year-round, so climate is less a factor in booking travel than other destinations. The days are warm and the nights are cool. Rain usually falls for short periods in the late afternoon, with sunshine the rest of the day. |
190806 | 2 - highly relevant | It is always important to know what the weather in Jamaica will be like before you plan and take your vacation. For the most part, the average temperature in Jamaica is between 80 °F and 90 °F (27 °FCelsius-29 °Celsius). Luckily, the weather in Jamaica is always vacation friendly. You will hardly experience long periods of rain fall, and you will become accustomed to weeks upon weeks of sunny weather. |
2613296 | 2 - highly relevant | Average temperatures in Jamaica range between 80 degrees Fahrenheit and 90 degrees Fahrenheit, with July and August being the hottest months and February the coolest. Temperatures in Jamaica generally vary approximately 10 degrees from summer to winter |
1824486 | 2 - highly relevant | The climate in Jamaica is tropical and humid with warm to hot temperatures all year round. The average temperature in Jamaica is between 80 and 90 degrees Fahrenheit. Jamaican nights are considerably... |
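As we can see, for all 3 queries Elasticsearch returned mostly relevant results: 25 of the 29 passages listed above are rated highly or perfectly relevant. The one notable miss is the first hit for the hydrogen query, which repeats the question without answering it and is judged irrelevant.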
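The ratings in the three tables can be condensed into one number per query. The sketch below computes the share of returned passages rated highly or perfectly relevant; treating grades 2 and 3 as "relevant" is our choice for this quick check, and the grades are copied from the tables in the order shown.

```python
RELEVANT_THRESHOLD = 2  # assumption: grades 2 and 3 count as relevant

# Relevance grades of the returned passages, in rank order, from the tables above.
ratings = {
    "1124210": [2, 3, 2, 2, 3, 2, 2, 0, 2],
    "1129237": [0, 3, 3, 3, 3, 2, 3, 3, 0, 0],
    "1133167": [3, 3, 3, 3, 3, 2, 2, 2, 2, 2],
}


def precision(grades, threshold=RELEVANT_THRESHOLD):
    """Fraction of returned passages whose grade meets the threshold."""
    return sum(1 for grade in grades if grade >= threshold) / len(grades)


for query_id, grades in ratings.items():
    print(f"query {query_id}: precision {precision(grades):.2f} over {len(grades)} results")
```

This is only a sanity check on 3 queries, not a substitute for the rigorous evaluation mentioned earlier.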
Trying it out
NLP is a powerful feature in the Elastic Stack with an exciting roadmap. Discover new features and keep up with the latest developments by building your cluster in Elastic Cloud. Sign up for a free 14-day trial today and try the examples in this blog.
More NLP reads
- How to deploy NLP named entity recognition NER
- How to deploy NLP sentiment analysis
- How to deploy natural language processing: Getting started