Synonyms let us search our documents using different words that have the same meaning, so users find what they are looking for regardless of the exact word they used. You would think that since RAG applications use semantic/vector search, part of the synonym functionality would already be covered by semantic search (since, by definition, synonyms are semantically related words).
Is that true? Does semantic search truly replace synonyms? In this article, we'll analyze the impact of using synonyms in a RAG application.
Steps
Configure the Inference Endpoint
For this example, we'll implement a RAG (Retrieval-Augmented Generation) system with and without synonyms in an HR context. We'll index different documents with variations of the term PTO (Paid Time Off), like "vacation" or "holiday". We'll then configure synonyms to show how these relationships improve relevance and accuracy in search.
First, let's create an endpoint using the ELSER model with the inference API by running the following commands in Kibana DevTools:
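As a minimal sketch of what such a request could look like, the Python snippet below builds the console request line and JSON body without sending anything. The endpoint name `my-elser-endpoint` and the allocation settings are assumptions, not values from this article, and the `elasticsearch` service with `.elser_model_2` is one way to serve ELSER (older versions expose a dedicated `elser` service instead).

```python
import json

# Hypothetical endpoint name; pick whatever fits your deployment.
INFERENCE_ID = "my-elser-endpoint"

# Console-style request line: PUT _inference/<task_type>/<inference_id>
request_line = f"PUT _inference/sparse_embedding/{INFERENCE_ID}"

# Assumed body: ELSER v2 served through the "elasticsearch" inference service.
# The allocation and thread counts are illustrative only.
request_body = {
    "service": "elasticsearch",
    "service_settings": {
        "model_id": ".elser_model_2",
        "num_allocations": 1,
        "num_threads": 1,
    },
}

# Print the request so it can be pasted into Kibana DevTools.
print(request_line)
print(json.dumps(request_body, indent=2))
```

The printed output can be pasted into DevTools as is; the endpoint name is what a `semantic_text` field would later reference.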
Configure synonyms
What are synonyms in Elasticsearch?
In Elasticsearch, synonyms are words or phrases with the same or similar meaning stored as synonym sets that can be managed as files or via API. They allow users to find relevant information, even if they use different terms to refer to the same concept.
For example, if we create a synonym set where "holiday" and "vacation" are synonyms of "Paid Time Off", an employee who searches for any of those terms will find the documents related to all of them.
You can read more about them in this article.
Let's create a set of synonyms using the synonyms API:
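A minimal sketch of a synonyms set request, built in Python, could look like this. The set name `hr-synonyms` and the rule id `pto-rule` are assumptions; the terms come from the HR example in this article, with the plural "holidays" added as an assumption because synonym rules match analyzed tokens and a non-stemming analyzer treats "holiday" and "holidays" as different tokens.

```python
import json

# Hypothetical synonyms set name.
SYNONYMS_SET_ID = "hr-synonyms"

# Terms treated as equivalent (Solr-style, comma-separated rule).
# "holidays" is listed explicitly as an assumption, since no stemmer is used.
terms = ["paid time off", "pto", "vacation", "holiday", "holidays"]

# Console-style request line: PUT _synonyms/<synonyms_set_id>
request_line = f"PUT _synonyms/{SYNONYMS_SET_ID}"

request_body = {
    "synonyms_set": [
        {
            "id": "pto-rule",  # hypothetical rule id
            "synonyms": ", ".join(terms),
        }
    ]
}

print(request_line)
print(json.dumps(request_body, indent=2))
```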
It's important to note that the set of synonyms must be configured before it can be applied to an index.
Now, let's define the settings and mappings for our data:
We'll use the semantic_text field to do semantic search and the synonyms graph token filter to handle multiword synonyms.
We also created both a text_field.synonym version and a text_field version of the field, for more control over whether we query the field with or without synonyms.
Finally, we use copy_to to copy the value of text_field to the semantic_text version of the field to enable both full-text and semantic queries.
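The settings and mappings described above can be sketched as follows. This is an illustration under assumptions rather than the exact configuration used here: the index name, the filter and analyzer names, the `semantic_field` field name, and the synonyms set and inference endpoint ids are all hypothetical.

```python
import json

INDEX_NAME = "hr-documents"         # hypothetical index name
SYNONYMS_SET_ID = "hr-synonyms"     # hypothetical synonyms set
INFERENCE_ID = "my-elser-endpoint"  # hypothetical inference endpoint

index_body = {
    "settings": {
        "analysis": {
            "filter": {
                # Graph filter so multiword synonyms ("paid time off") work.
                "synonyms_filter": {
                    "type": "synonym_graph",
                    "synonyms_set": SYNONYMS_SET_ID,
                    "updateable": True,
                }
            },
            "analyzer": {
                "synonyms_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "synonyms_filter"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            # Target of copy_to; embeddings are generated at index time.
            "semantic_field": {
                "type": "semantic_text",
                "inference_id": INFERENCE_ID,
            },
            "text_field": {
                "type": "text",
                "copy_to": "semantic_field",
                "fields": {
                    # Synonyms are applied at search time only, which is
                    # what an updateable filter requires.
                    "synonym": {
                        "type": "text",
                        "analyzer": "standard",
                        "search_analyzer": "synonyms_analyzer",
                    }
                },
            },
        }
    },
}

print(f"PUT {INDEX_NAME}")
print(json.dumps(index_body, indent=2))
```

Applying synonyms only in the search analyzer means the synonyms set can be changed later without reindexing the documents.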
Index documents
We will now index our documents using the bulk API:
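As a sketch of the bulk request format, the helper below turns a list of documents into the newline-delimited JSON body that the bulk API expects. The index name is an assumption, and only the two documents quoted later in this article are included; the full dataset is not reproduced.

```python
import json

INDEX_NAME = "hr-documents"  # hypothetical index name

# The two documents discussed in the "Synonyms and RAG" section.
documents = [
    {"text_field": "Paid time off: All employees receive 20 days of paid "
                   "vacation annually, with additional days earned for "
                   "tenure milestones."},
    {"text_field": "Holidays: Paid public holidays recognized each "
                   "calendar year."},
]


def to_bulk_ndjson(index_name, docs):
    """Build an NDJSON bulk body: one action line, then one source line."""
    lines = []
    for doc_id, doc in enumerate(docs, start=1):
        action = {"index": {"_index": index_name, "_id": str(doc_id)}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


bulk_body = to_bulk_ndjson(INDEX_NAME, documents)
print("POST _bulk")
print(bulk_body, end="")
```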
We are now ready to start searching! But first, let's make sure that synonyms are working by searching for "holidays".
We adjust the boosts so that synonym matches score lower than matches on the original words.
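One way to express this, sketched here under assumptions, is a bool query with two match clauses: one on text_field with a higher boost and one on text_field.synonym with a lower boost. The boost values are illustrative, not the ones used for the results in this article.

```python
import json


def build_synonym_query(text, original_boost=2.0, synonym_boost=1.0):
    """Match exact terms and synonym expansions, scoring exact terms higher."""
    return {
        "query": {
            "bool": {
                "should": [
                    # Exact wording: no synonyms applied to this field.
                    {"match": {"text_field": {"query": text,
                                              "boost": original_boost}}},
                    # Synonym-expanded wording, with a lower boost.
                    {"match": {"text_field.synonym": {"query": text,
                                                      "boost": synonym_boost}}},
                ]
            }
        }
    }


search_body = build_synonym_query("holidays")
print(json.dumps(search_body, indent=2))
```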
Check the response:
As we can see, when we searched for "holidays", the second document matched through the synonym "Paid Time Off".
Semantic Search
Semantic search interprets the meaning of words and phrases instead of just considering text matches. Unlike synonyms, which depend on pre-defined words, semantic search uses language models to interpret queries and documents on a deeper level.
While we could return the results we need by running a semantic search, we'll see that by using synonyms, relevant results will rank higher, thus allowing us to send fewer docs to the LLM, improving results and lowering costs.
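A semantic search request against a semantic_text field can be sketched with the `semantic` query. The field name `semantic_field` is an assumption, and the `size` of 3 simply mirrors the three hits sent to the LLM later in this article.

```python
import json


def build_semantic_query(text, field="semantic_field", size=3):
    """Build a search body that runs a semantic query on a semantic_text field."""
    return {
        "size": size,
        "query": {
            "semantic": {
                "field": field,  # hypothetical semantic_text field name
                "query": text,
            }
        },
    }


semantic_body = build_semantic_query("holidays")
print(json.dumps(semantic_body, indent=2))
```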
Response:
When using semantic search, the documents that are relevant to the query do not appear in the top positions, because the dataset also includes the term "holidays" in contexts unrelated to "Paid time off".
Hybrid Search
Hybrid search lets us combine the results of full-text and semantic search queries into a single result set, using RRF (Reciprocal Rank Fusion) to balance the rankings from the different retrievers.
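To make the fusion step concrete, here is a small sketch of the RRF formula, where each document's score is the sum of 1 / (rank_constant + rank) over the result lists it appears in, followed by a hybrid request body that uses the `rrf` retriever. The example rankings, field names, and parameter values are assumptions for illustration.

```python
import json


def reciprocal_rank_fusion(rankings, rank_constant=60):
    """Fuse ranked lists of doc ids; return (doc_id, score) sorted by score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank_constant + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up rankings from a full-text retriever and a semantic retriever.
text_ranking = ["2", "1", "3"]
semantic_ranking = ["1", "4", "2"]
fused = reciprocal_rank_fusion([text_ranking, semantic_ranking])
fused_ids = [doc_id for doc_id, _ in fused]

# Assumed hybrid request: full-text on the synonym subfield plus semantic.
hybrid_body = {
    "retriever": {
        "rrf": {
            "retrievers": [
                {"standard": {"query": {
                    "match": {"text_field.synonym": "holidays"}}}},
                {"standard": {"query": {
                    "semantic": {"field": "semantic_field",
                                 "query": "holidays"}}}},
            ],
            "rank_constant": 60,
            "rank_window_size": 100,
        }
    }
}

print(fused_ids)
print(json.dumps(hybrid_body, indent=2))
```

Because RRF works on ranks rather than raw scores, documents that place well in both lists rise to the top even when the two retrievers score on very different scales.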
Response:
This query returns documents matched by the semantic query as well as documents matched by the full-text query.
Synonyms and RAG
In this section, we'll evaluate how synonyms and semantic search improve queries in a RAG system. We'll use a common question about days off for this example:
"How many vacation days are provided for holidays?"
For this question, we're interested in the information in document 1. Document 2 looks close to what we want, but it does not answer the question precisely; it is the result we'll get when we search without synonyms. Let's take a look at their content:
- [1] Paid time off: All employees receive 20 days of paid vacation annually, with additional days earned for tenure milestones.
- [2] Holidays: Paid public holidays recognized each calendar year.
Both documents contain information about days off, but only document 2 specifically uses the term "holidays", so we can use them to test how synonyms and semantic search behave in Playground.
You can access Playground from Search > Playground. From there, you need to configure the LLM you want to use and select the index we've already created, so its documents can be sent as context. You can read more about Playground and its configuration here.
Once Playground has been configured, clicking the query button shows that synonyms are deactivated.
For each question, we'll send the first three hits of the previous query to the LLM as context.
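Playground handles this step for us, but the idea can be sketched in a few lines: take the top three hits from a search response and join their text into a numbered context block for the LLM. The `build_context` helper and the sample response are hypothetical; only the `hits.hits[]._source` layout follows the standard search response shape.

```python
def build_context(search_response, max_hits=3, field="text_field"):
    """Join the text of the top hits into a numbered context string."""
    hits = search_response["hits"]["hits"][:max_hits]
    return "\n".join(
        f"[{position}] {hit['_source'][field]}"
        for position, hit in enumerate(hits, start=1)
    )


# Made-up response with four hits; only the first three are used.
sample_response = {
    "hits": {
        "hits": [
            {"_id": "a", "_source": {"text_field": "First passage."}},
            {"_id": "b", "_source": {"text_field": "Second passage."}},
            {"_id": "c", "_source": {"text_field": "Third passage."}},
            {"_id": "d", "_source": {"text_field": "Fourth passage."}},
        ]
    }
}

context = build_context(sample_response)
print(context)
```

If the relevant document is ranked fourth or lower, it never reaches the LLM, which is why ranking quality matters so much here.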
Now, let's ask Playground the question and check the results with synonyms deactivated.
Since the document specifying the number of vacation days employees get per year is not among the first three hits, the LLM cannot answer the question. In this case, the closest result is in document [2].
Note: By clicking on "Snippet", we can see the specific content in Elasticsearch that the answer comes from.
Let's clear the chat, activate synonyms, and ask the same question again.
Note that when you enable both a semantic_text field and a text field, Playground will automatically generate a hybrid search query.
Let's repeat the question, now with synonyms activated.
Now the answer does draw on the document we were looking for, since the synonym allowed document [1] to be sent to the LLM.
Conclusion
In this article, we saw that synonyms remain a fundamental part of search systems even when using semantic search, since semantic search does not necessarily cover the synonym functionality.
Synonyms allow us to control which documents we want to boost based on our use case, improving accuracy by tuning relevance. Semantic search, on the other hand, is useful for recall, meaning it introduces potentially relevant results without us having to add synonyms for every related term.
With hybrid search, we can use synonyms and semantic search at the same time, getting the best of both worlds. With Playground, if we select a combination of semantic and text fields as search fields, a hybrid query is built automatically for us.