
Evaluating your Elasticsearch LLM applications with Ragas

Assessing the quality of a RAG solution using Ragas metrics and Elasticsearch.



Ragas is an evaluation framework that gives you a deeper understanding of how your LLM applications are performing. It provides established metrics and scoring for a quantitative analysis of your LLM application’s overall performance. Evaluation frameworks such as Ragas reveal issues in LLM systems and provide standardized metrics that guide improvements and pinpoint where problems occur.

Using a framework like Ragas helps eliminate the guesswork of working with LLM systems and lets you measure and improve on past performance. It can tell you whether your application is prone to hallucination in certain areas, whether it is generating answers that stray from the retrieved context, or whether your retrieval is pulling in irrelevant information.

This blog post guides you through an example of assessing the quality of a Retrieval-Augmented Generation (RAG) solution using Ragas metrics and Elasticsearch. You can find the full code and related resources on GitHub.

Overview of the solution

The solution discussed in this blog post first uploads a dataset from a JSON file into Elasticsearch. The dataset, books.json, is a subset of 25 books from Goodreads and includes each book's title, author, description, publication year, and Goodreads URL.

After the book data is loaded, the next step is to index it with vector embeddings in Elasticsearch. You can then create evaluation questions and use them to retrieve the relevant context through vector search.

Once the contexts are retrieved for each question, you can generate both answers and ground truths. Ground truths are essentially ideal answers. Using GPT-4o, you can create answers based on the retrieved context, while a separate function generates ground truths by applying scoring logic to rank and select the most relevant books from the retrieved results, producing recommendations that serve as the evaluation benchmark. In other words, the function works out what the question is really asking for, finds the best-matching response, and explains why.

Finally, you'll evaluate performance by comparing the generated answers against the ground truths using Ragas metrics and viewing the results.

Prerequisites

To follow along, you will need an Elasticsearch deployment and API key, an OpenAI API key, and a Python environment where you can install the packages listed in the setup section below.

Metrics

The metrics used in this example are context precision, faithfulness, and context recall.

  • Context precision measures how much of the retrieved context is actually relevant to the question, rewarding retrieval that ranks relevant chunks highly rather than padding the results with noise.
  • Faithfulness indicates whether the generated answer is factually consistent with the retrieved context, which makes it a way to detect whether your LLM is hallucinating.
  • Context recall measures how much of the information needed to produce the ground truth was present in the retrieved documents, focusing on not missing important results.
  • In practice, a high context_precision score means the retrieved context is mostly relevant to the question, a high faithfulness score means the answer's claims are supported by that context, and a high context_recall score means the retrieval captured all or most of the information the ground truth relies on.

These three metrics were chosen because they provide an actionable signal about both retriever and generator performance. Together they let you determine whether your RAG system retrieved the right things, whether the generated answer sticks to what was retrieved, and whether the answer is correct and well supported.

These are only three of the available metrics from Ragas. You can learn more about other metrics available in the Ragas documentation.

Setting up

You will first want to install the packages required for this application. These include the following:

  • The Elasticsearch Python client – used to authenticate, connect to Elasticsearch, and run vector search.
  • Ragas – used for evaluating the quality of the LLM applications using standard metrics.
  • The Hugging Face datasets library – used to create a robust evaluation dataset.
  • Langchain-OpenAI – used both for generating answers to user questions and for evaluation.

After installing the required packages, you can import os (used for setting environment variables and related tasks), json (for parsing the JSON file containing the books), and getpass (for entering sensitive values such as API keys and tokens). You will also import the Elasticsearch Python client, ragas for evaluation, ragas.metrics for the metrics used to evaluate the RAG application, datasets for creating an evaluation dataset, and langchain_openai for chat model capabilities.
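A minimal sketch of what that setup might look like (exact ragas import paths can vary between versions):

```python
# Install the dependencies first, for example:
#   pip install elasticsearch ragas datasets langchain-openai

import os
import json
from getpass import getpass

from elasticsearch import Elasticsearch
from datasets import Dataset
from langchain_openai import ChatOpenAI
from ragas import evaluate
from ragas.metrics import context_precision, context_recall, faithfulness
```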

Now, you will want to create a variable called es, where you pass in your Elasticsearch host address and your Elasticsearch API key.

You will also need to create a variable called index_name that can be set to whatever name you want to give your index. For this example, you can name your index ragas-books.
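As a rough sketch (the environment variable names are placeholders for however you store your credentials):

```python
# Read the connection details from the environment, prompting if they are missing.
es = Elasticsearch(
    hosts=os.environ.get("ELASTIC_HOST") or getpass("Elasticsearch host: "),
    api_key=os.environ.get("ELASTIC_API_KEY") or getpass("Elasticsearch API key: "),
)

index_name = "ragas-books"
```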

To add vector embeddings to the dataset, you will want to create a function that takes a query string and turns it into a vector using the Elastic machine learning model .multilingual-e5-small-x86_64. Be sure to check out our documentation on E5 models to learn more. If you run into any issues, you may need to deploy the model first before the code can run properly.
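A possible shape for that helper, assuming the E5 model is already deployed in your cluster (adjust the model ID to match the one shown under your trained models):

```python
def embed_query(query: str) -> list:
    """Turn a query string into a dense vector using the deployed E5 model."""
    response = es.ml.infer_trained_model(
        model_id=".multilingual-e5-small-x86_64",  # must match your deployed model ID
        docs=[{"text_field": query}],
    )
    return response["inference_results"][0]["predicted_value"]
```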

Now, you should check whether an index with the same name already exists. If it does, delete the old index, then create a new one with the proper mappings. Since the dataset is a snippet of books from Goodreads, the mappings match the fields in that dataset.
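A sketch of the index setup; the field names and the 384-dimension vector size are assumptions based on the dataset description and the E5 small model, so verify them against your own data and deployment:

```python
if es.indices.exists(index=index_name):
    es.indices.delete(index=index_name)

es.indices.create(
    index=index_name,
    mappings={
        "properties": {
            "title": {"type": "text"},
            "author": {"type": "text"},
            "description": {"type": "text"},
            "rating": {"type": "float"},
            "year": {"type": "integer"},
            "url": {"type": "keyword"},
            "description_embedding": {
                "type": "dense_vector",
                "dims": 384,          # output size of the multilingual E5 small model
                "index": True,
                "similarity": "cosine",
            },
        }
    },
)
```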

At this point, you will want to extract the data from the books.json file, loop through it, generate an embedding for each book description, and load the documents into the Elasticsearch index. Since the original dataset contains the book title, the author's name, the book description, the publication year, and a Goodreads URL, the book description is the best candidate for generating embeddings because it carries meaning beyond keywords. You can check out this LinkedIn post to learn more about which fields are good candidates for semantic search.

While it’s possible to generate embeddings from multiple fields, such as title and description, using the description alone is a simpler approach that avoids noise from a short field like the title. The tradeoff is that you may miss some title-specific matches.
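The ingest loop could look roughly like this, assuming books.json is a list of objects with the fields named above:

```python
with open("books.json") as f:
    books = json.load(f)

for i, book in enumerate(books):
    # Embed only the description, as discussed above.
    book["description_embedding"] = embed_query(book["description"])
    es.index(index=index_name, id=i, document=book)

# Make the new documents searchable immediately.
es.indices.refresh(index=index_name)
```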

Now that the data is loaded into Elasticsearch, you can create a function called vector_search to perform a k-nearest neighbors (kNN) search. A top_k value of 3 was chosen; this is how many nearest-neighbor context chunks are retrieved from Elasticsearch for each question. It tends to be small enough to keep the context relevant, yet large enough to give the model multiple options and avoid missing something due to retrieval errors. top_k is a value you can adjust if your RAG application's results aren't accurate. To learn more about choosing a k value, be sure to check out our blog post on the subject.

This function first generates an embedding for the input query using the embed_query function. It performs the search on the specified index, returns the results, and creates a context based on the book title. Finally, it returns both the text contexts for RAG and the metadata for the books.
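A sketch of vector_search along those lines (num_candidates is a tunable value not mentioned in the post):

```python
def vector_search(query: str, top_k: int = 3):
    """Return (contexts, metadata) for the top_k most similar books."""
    response = es.search(
        index=index_name,
        knn={
            "field": "description_embedding",
            "query_vector": embed_query(query),
            "k": top_k,
            "num_candidates": 50,
        },
    )
    books = [hit["_source"] for hit in response["hits"]["hits"]]
    contexts = [
        f"{b['title']} by {b['author']}: {b['description']}" for b in books
    ]
    return contexts, books
```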

Implementing the RAG generation component

At this point, check whether an environment variable is set for your OpenAI API key; if not, set one using getpass.

After obtaining the OpenAI API key, you can create a variable called chat_llm that will be used for RAG. It calls ChatOpenAI with the model you want to use, the temperature (which controls the randomness/creativity of the LLM's output; lower values tend to be a safer choice), and your OpenAI API key. While this example uses gpt-4o, you can easily switch to another model by adjusting the parameter model="model name".
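Putting both steps together might look like this:

```python
# Prompt for the key only if it isn't already set in the environment.
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")

chat_llm = ChatOpenAI(
    model="gpt-4o",
    temperature=0,  # a low temperature keeps answers grounded and repeatable
    api_key=os.environ["OPENAI_API_KEY"],
)
```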

Next, you can create a function called generate_answer that first joins the context strings into a single block of text. It then builds a prompt instructing the LLM to use only the provided context, sends that prompt to the LLM, and strips the whitespace from the response. This function represents the generation step in RAG and plays a key role when using Ragas to evaluate the quality of generated answers.
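A minimal version of generate_answer; the exact prompt wording here is an assumption:

```python
def generate_answer(question: str, contexts: list) -> str:
    """Generate an answer restricted to the retrieved context."""
    context_block = "\n\n".join(contexts)
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context_block}\n\nQuestion: {question}"
    )
    return chat_llm.invoke(prompt).content.strip()
```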

Before creating ground truths that represent a known correct answer, you can use a function that extracts intent patterns and other key attributes from user questions. The function analyze_question_intent identifies genres, extracts quality preferences, and determines whether the question is author-specific.

You will also want to score each book against the user's query based on the intent data. The scoring function returns a numeric score (higher means a better match), a list of reasons explaining the score, and the book's rating and metadata.

To create ground truths, begin by setting up a fallback option for when no books are found. If books are available, the function selects the top-ranked result and generates a recommendation based on the detected intent. It prioritizes genre matches first, then high-rating preferences, then popularity preferences, followed by author-specific requests; if none of these apply, it falls back to a general recommendation. The function favors higher-rated books when possible and includes a second book as an additional suggestion if its score is close to the top book's score.
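A heavily simplified sketch of that ground-truth logic; the function name is illustrative, and the post's version also handles genre, popularity, and author intent, which are omitted here for brevity:

```python
def create_ground_truth(question: str, books: list) -> str:
    """Simplified sketch of the ground-truth recommendation logic."""
    if not books:
        # Fallback when retrieval returned nothing.
        return "No matching books were found for this question."

    def score_book(book: dict) -> float:
        # Favor higher-rated books, plus a small bonus for keyword overlap
        # between the question and the description.
        score = float(book.get("rating", 0))
        overlap = set(question.lower().split()) & set(
            book.get("description", "").lower().split()
        )
        return score + 0.1 * len(overlap)

    ranked = sorted(books, key=score_book, reverse=True)
    best = ranked[0]
    answer = f"I recommend '{best['title']}' by {best['author']}."
    # Offer a runner-up when its score is close to the top book's score.
    if len(ranked) > 1 and score_book(ranked[1]) >= 0.9 * score_book(best):
        answer += f" '{ranked[1]['title']}' is also a strong match."
    return answer
```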

This approach is a good starting point: it uses retrieval context rather than external knowledge, applies business logic such as rating preferences and genre matching, and generates consistent, reasonable responses. However, manually generating ground truths won't scale well. In production environments, you would want to consider making your ground truths semi-automated with human review, or LLM-generated with validation.

Running the demo

Finally, you are ready to run the demo. This function acts like a main function, tying all the other functions together to produce an evaluation. First, define some demo questions to evaluate the RAG application, then initialize the result lists and loop through each question: for each one, generate an answer from the retrieved context and create a ground truth from the best-matching books. Next, collect the data for evaluation and build a dataset for Ragas. Once the evaluation dataset is created, run the evaluation, print the results, and save them to a CSV file.
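A condensed sketch of that flow, without the error handling mentioned next; the demo questions are placeholders, and the column names expected by evaluate depend on your Ragas version:

```python
questions = [
    "Can you recommend a highly rated fantasy book?",
    "What should I read if I enjoy classic literature?",
]

records = {"question": [], "answer": [], "contexts": [], "ground_truth": []}

for question in questions:
    contexts, books = vector_search(question)
    records["question"].append(question)
    records["contexts"].append(contexts)
    records["answer"].append(generate_answer(question, contexts))
    records["ground_truth"].append(create_ground_truth(question, books))

eval_dataset = Dataset.from_dict(records)
results = evaluate(
    eval_dataset,
    metrics=[context_precision, faithfulness, context_recall],
)

df = results.to_pandas()
print(df)
df.to_csv("ragas_results.csv", index=False)
```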

Finally, we can run the demo with some additional error handling.

Examining the output

In LLM applications, scores above 0.8 on these metrics typically show strong performance, but this could differ depending on the dataset, domain, and use case. More information on each metric is available in the documentation.

In the output below, the average faithfulness score is 0.750, indicating that most answers generally remained consistent with the retrieved context. The context_recall average of 0.500 shows that the system retrieved enough information for a complete answer only half the time. Meanwhile, the context_precision average of 0.625 suggests that a fair portion of the generated content directly matched the retrieved context, but there’s still room for improvement.

When you take a look at the individual questions, both Question 0 and Question 3 scored well on both faithfulness and context precision. This shows an alignment between retrieval and generation. Question 1 had strong recall but only moderate precision, which means that some details were likely added beyond the retrieved context. Question 2 scored low on both precision and recall, showing gaps in retrieval and that the generated content drifted away from the provided context.

To improve these results, you could focus on enhancing retrieval quality by experimenting with different embedding models, refining chunking strategies, or applying context engineering to better match user queries. On the generation side, using stricter prompts and other prompt engineering tactics could be helpful in reducing hallucination, and ensuring that evaluation ground truths are tightly aligned with retrieved content will lead to more accurate scoring.

Common challenges with evaluation frameworks like Ragas

While evaluation frameworks such as Ragas can serve as useful baselines, they are only guidelines and are designed to be part of a broader evaluation strategy rather than definitive measures of system quality.

There are some common issues with evaluation frameworks, including overly simplistic ground truths that don't provide the whole picture, small sample sizes, and circularity problems when LLMs evaluate LLMs.

Additionally, there is sometimes a disconnect between evaluation results and actual performance in the real world. Systems that achieve high evaluation scores may perform inconsistently in actual user scenarios and uncommon situations. Frameworks such as this one can focus heavily on specific dimensions such as factual accuracy or relevance while potentially underweighting others, like user experience, response latency, or handling of ambiguous queries.

To mitigate these challenges, you may want to explore A/B testing with real users, human-in-the-loop reviews, or LLM-as-a-judge ensembles as ways of reducing evaluator bias.

Next steps

The example in this blog post is a starting point for working with Ragas. There are also some key considerations to think about for adapting a solution like this at scale:

  • Regularly re-index your content in Elasticsearch whenever the source data changes.
  • Use a broad, realistic evaluation set to track real-world performance.
  • Measure and optimize your system’s speed and cost at production scale to avoid bottlenecks or budget issues.

Conclusion

Using evaluation methods such as the Ragas framework can help you determine whether your LLM application is performing as intended and give you a sense of its accuracy. It can guide your decision to pivot to another model if performance falls short of expectations, and it can be used for side-by-side comparisons to evaluate how well each model works for your purposes.
