This article provides a practical comparison of Retrieval Augmented Generation (RAG) and fine-tuning by examining their performance in a chatbot scenario for a fictional e-commerce store.
RAG vs. Fine-tuning
RAG

RAG (Retrieval Augmented Generation) combines large language models (LLMs) with information retrieval systems so that generated answers draw on up-to-date, specific data from a knowledge base.
- Advantages: RAG allows us to use external data without modifying the base model and provides precise, safe, and traceable answers.
- Implementation: In Elasticsearch, data can be indexed using optimized indexes for semantic search and document-level security.
- Challenges: RAG relies on external knowledge, making accuracy dependent on retrieved information. Retrieval can be costly in terms of context window size. RAG also faces integration and privacy challenges, especially with sensitive data across different sources.
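Conceptually, the retrieve-then-generate flow can be sketched in a few lines. This is a library-free toy: the keyword scorer below stands in for Elasticsearch's semantic search, and the knowledge-base snippets are invented for illustration:

```python
# Toy RAG flow: retrieve relevant text, then ground the prompt with it.
# A real deployment would use Elasticsearch semantic search instead of
# this word-overlap scorer; the documents here are illustrative.
knowledge_base = [
    "Returns are accepted within 30 days with the original receipt.",
    "If a product is defective, we send a free gift of one kilogram of pears along with the replacement.",
    "Free shipping applies to orders over 50 euros.",
]

def retrieve(question, docs, k=1):
    """Rank docs by word overlap with the question (stand-in for semantic search)."""
    q_words = set(question.lower().split())
    scored = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return scored[:k]

def build_prompt(question):
    context = "\n".join(retrieve(question, knowledge_base))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("What happens if a product is defective?"))
```

Because the retrieved context is pasted into the prompt, the answer is grounded in the knowledge base rather than in the model's weights, which is what makes updates as cheap as re-indexing a document.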
Fine-tuning

Fine-tuning involves training a pre-trained model on a specific dataset. This process adjusts the model's internal weights, enabling it to learn patterns and generate customized answers.
Fine-tuning can also be used for model distillation, a technique where a smaller model is trained on the outputs of a larger model to improve performance on a specific task. This approach allows leveraging the capabilities of a larger model at a reduced cost.
- Advantages: It offers a high level of optimization, adapting answers to specific tasks, making it ideal for static contexts or domains where knowledge does not change frequently.
- Implementation: It requires training the model with structured data in an input-output format. OpenAI's fine-tuning UI simplifies this flow: you upload the dataset (JSONL), then train and test the model in a controlled environment.
- Challenges: The retraining process consumes time and compute resources. Precision depends on the quality and size of the dataset, so small or unbalanced datasets can produce generic or out-of-context answers; getting it right requires expertise and effort. There is also no grounding or per-user data segmentation.
From OpenAI docs: “We recommend first attempting to get good results with prompt engineering, prompt chaining (breaking complex tasks into multiple prompts), and function calling…”
Fine-tuning and RAG comparison
| Aspect | Fine-tuning | RAG |
|---|---|---|
| Supported data | Static | Dynamic |
| Setup cost | High (training and resources) | Low (index configuration) |
| Scalability | Low, requires model retraining | High, real-time updates |
| Update time | Hours/Days | Minutes |
| Precision with recent changes | Low when not trained with new data | High thanks to semantic search |
Chatbot test case: Pear Store
We will use a test case based on a fictional online store called 'Pear Store'.
Pear Store needs an assistant to answer specific questions about its policies, promotions, and products. These answers must be truthful and consistent with the store information and useful for both employees and customers.
Fine-tuning Dataset
We'll use a training dataset with specific questions and their answers regarding products, policies and promotions. For example:
- Question: What happens if a product is defective?
- Answer: If a product is defective, we'll send you a free gift of one kilogram of pears along with the replacement.
RAG Dataset
For the RAG implementation, we will use the same dataset, converted into a PDF and indexed into Elasticsearch using Playground.
Approach 1: Fine-tuning
First, we prepare the dataset in JSONL format, as shown below:
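A minimal sketch of how such a file can be generated with the standard library, using the chat-message format that OpenAI fine-tuning expects. The system prompt and output file name are assumptions:

```python
# Write one training example per line in OpenAI's chat-format JSONL.
# The system prompt and file name are illustrative.
import json

examples = [
    {
        "messages": [
            {"role": "system", "content": "You are the Pear Store assistant."},
            {"role": "user", "content": "What happens if a product is defective?"},
            {"role": "assistant", "content": "If a product is defective, we'll send you a free gift of one kilogram of pears along with the replacement."},
        ]
    },
]

with open("pear_store.jsonl", "w") as f:
    for example in examples:
        # One complete JSON object per line, no trailing commas.
        f.write(json.dumps(example) + "\n")
```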
Make sure each line in the JSONL file is a valid JSON object and there are no trailing commas.
Next, using the OpenAI UI, we can go to Dashboard > Fine-tuning and hit Create.

Then you can upload the JSONL file we just created.

Now click Create to start training.
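The same upload-and-train flow is available programmatically. A hedged sketch with the OpenAI Python SDK (openai >= 1.0): the base model snapshot and file id are illustrative, and the API calls are commented out because they require an `OPENAI_API_KEY`:

```python
# Sketch: creating the fine-tuning job via the OpenAI Python SDK instead
# of the UI. Model name and file id are illustrative placeholders.
job_params = {
    "model": "gpt-4o-mini-2024-07-18",  # illustrative base model snapshot
    "training_file": "file-abc123",     # id returned by the upload step
}

# from openai import OpenAI
# client = OpenAI()
# upload = client.files.create(file=open("pear_store.jsonl", "rb"), purpose="fine-tune")
# job = client.fine_tuning.jobs.create(model=job_params["model"], training_file=upload.id)
# print(job.id)

print(job_params["model"])
```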
After the job is finished, you can hit Playground, and you will have a convenient interface to compare the results with and without the fine-tuned model against a particular question.

On the right side, we can see that the model provided the custom answer about defective products: a free kilogram of pears along with the replacement.
However, the model's response was inconsistent. A subsequent attempt with the same question yielded an unexpected answer.

Although fine-tuning lets us customize the model's answers, the model still deviated and produced generic answers not aligned with our dataset. This likely means fine-tuning needs more adjustment or a larger dataset. And if we ever want to change the source data, we must repeat the entire fine-tuning process.
Approach 2: RAG
To test the dataset using RAG, we will use Playground to create the RAG application and upload the dataset to Kibana.
To upload a PDF using the UI and configure the semantic text field, follow the steps from this video:
To learn more about uploading PDFs and interacting with them using Playground, you can read this article.
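Under the hood, this flow relies on an index with a semantic text field that chunks and embeds the document content at index time. A sketch of such a mapping, where the index and field names are assumptions and the creation call (commented) assumes the elasticsearch Python client against an 8.x cluster:

```python
# Sketch of a mapping with a semantic_text field for RAG retrieval.
# Index and field names are illustrative.
import json

mappings = {
    "properties": {
        "content": {"type": "semantic_text"},  # automatic chunking + embeddings
        "title": {"type": "text"},
    }
}

# from elasticsearch import Elasticsearch
# Elasticsearch("http://localhost:9200").indices.create(
#     index="pear-store-docs", mappings=mappings
# )
print(json.dumps(mappings, indent=2))
```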
Now we're ready to interact with our data using Playground! Using the UI, we can change the AI instructions and check the source of the document used to provide an answer.
When we ask the same question in Playground: "What happens if a product is defective?", we receive the correct answer: "If a product is defective, we send you a free gift of one kilogram of pears along with the replacement." Additionally, we get a citation to verify the answer's source and can review the instructions the model followed:

If we want to change the data, we just have to update the index with the information about the Q/A.
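The whole update cycle can be as small as re-indexing a single document. A hedged sketch, where the index name, document id, new policy text, and client setup are all illustrative, and the indexing call (commented) assumes the elasticsearch Python client:

```python
# Sketch: updating one policy document instead of retraining a model.
# Index name, document id, and content are illustrative.
updated_doc = {
    "title": "Defective products policy",
    "content": "If a product is defective, we send a replacement plus a free gift of pears.",
}

# from elasticsearch import Elasticsearch
# es = Elasticsearch("http://localhost:9200")
# es.index(index="pear-store-docs", id="policy-defective", document=updated_doc)

print(updated_doc["title"])
```

The next query against Playground picks up the new text immediately, with no training job in between.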
Final thoughts
The choice between fine-tuning and RAG depends on the requirements of each system. A common pattern is to start from a domain-specific fine-tuned model, like FinGPT for finance, LEGAL-BERT for legal, or medAlpaca for medical, to capture common terminology and frame the context of answers, and then build a RAG system on top of it with company-specific documents.
Fine-tuning is useful when you want to manage the model's behavior, and doing so through prompt engineering is not possible, or it requires so many tokens that it’s better to add that information to the training. Or perhaps the task is so narrow and structured that model distillation is the best option.
RAG, on the other hand, excels at integrating knowledge through dynamic data and ensuring accurate, up-to-date responses in real time. This makes it especially useful for scenarios like the Pear Store, where policies and promotions change frequently. RAG also grounds its answers in retrieved data and can segment the information delivered to each user via document-level security.
Combining fine-tuning and RAG can also be an effective strategy to leverage the strengths of both approaches and tailor solutions to specific project needs.