Elasticsearch has native integrations with industry-leading Gen AI tools and providers. Check out our webinars on going Beyond RAG Basics, or building prod-ready apps with Elastic Vector Database.
To build the best search solutions for your use case, start a free cloud trial or try Elastic on your local machine now.
Models with context windows of over 1 million tokens are nothing new; more than a year ago, Google announced Gemini 1.5 with 1 million tokens of context. One million tokens is approximately 2,000 A5 pages, which, in many cases, is more than all the data we have stored.
Then the question arises: “What if I just send everything in the prompt?”
In this article, we will compare RAG with simply sending everything to a long-context model and letting the LLM analyze the context and answer a question.
You can find a notebook with the full experiment here.
Initial thoughts
Before we begin, we can make some statements to put to the test:
- Convenience: Not many models offer long-context versions, so our alternatives are limited.
- Performance: Having an LLM process 1M tokens should be slower than retrieving relevant documents from Elasticsearch and having the LLM process a much smaller context.
- Price: The price per question should be significantly higher when sending the full context.
- Precision: A RAG system can effectively help us filter out noise and keep the LLM’s attention on what matters.
While sending everything as context has the advantage of guaranteeing the relevant documents are included, RAG presents the challenge of making sure your query captures every relevant document. Elasticsearch lets you mix and match different strategies to find the right documents: filters, full-text search, semantic search, and hybrid search.
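As a rough illustration (not the exact setup used in the tests below), a hybrid query that blends full-text and semantic search through the Python client might look like the sketch below; the endpoint, credentials, index name, and field names are placeholders, and the RRF retriever assumes a deployment that supports it:

```python
from elasticsearch import Elasticsearch

# Placeholder endpoint, credentials, index, and field names for illustration only.
es = Elasticsearch("http://localhost:9200", api_key="YOUR_API_KEY")

response = es.search(
    index="articles",
    retriever={
        "rrf": {  # reciprocal rank fusion blends the lexical and semantic rankings
            "retrievers": [
                {"standard": {"query": {"match": {"title": "vector database"}}}},
                {
                    "standard": {
                        "query": {
                            "semantic": {"field": "semantic_text", "query": "vector database"}
                        }
                    }
                },
            ]
        }
    },
)
```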
Test definition
Model/RAG specs
- LLM Model: gemini-2.0-flash
- Model provider: Google
- Dataset: Elasticsearch Search Labs articles
For each of the test cases we are going to evaluate:
- LLM Price
- End-to-end latency
- Answer correctness
Test cases
Based on the Elasticsearch Search Labs articles dataset, we are going to test the two strategies, RAG and full-context LLM, on two different types of questions:
- Textual: the question asks about text that appears verbatim in the documents.
- Non-textual: the answer is not written verbatim in any single document, so the LLM has to infer information or combine different pieces.
Running tests
1. Index data
Download the dataset in NDJSON format to run the following steps:
The following steps and screenshots were taken from a Cloud Hosted Deployment. In the deployment, go to “Overview” and scroll down to click “Upload a file.” Then click on “here” since we need to add customized mappings.


In the new view, drag the NDJSON file with the dataset into the upload area and click on Import.

Then, click on Advanced, enter the index name, and add your mappings:
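The exact mappings used in the experiment are defined in the notebook; as a minimal sketch (field names here are assumptions), the index needs a regular text field for lexical matching and highlighting plus a semantic_text field for the semantic query used later:

```python
# Illustrative mappings only; the notebook contains the exact ones used in the test.
# Paste the JSON equivalent of this dict into the "Advanced" mappings box.
# Note: the article body must also be written to "semantic_text" (e.g. duplicated at
# ingest time, or via copy_to on a recent Elasticsearch version).
mappings = {
    "properties": {
        "title": {"type": "text"},
        "text": {"type": "text"},                    # used for match_phrase and highlights
        "semantic_text": {"type": "semantic_text"},  # embeddings generated at index time
        "url": {"type": "keyword"},
    }
}
```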
Click Import to finalize and wait for the data to be indexed.



2. Textual RAG
I’ve extracted a fragment of the article “Elasticsearch in JavaScript the proper way, part II” to use as the query string.
Running match phrase query
This is the query we're going to use to retrieve results from Elasticsearch with its match_phrase search capabilities, passing query_str as the input.
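A minimal sketch of that retrieval step, reusing the hypothetical client and index from above (the text field name is an assumption):

```python
query_str = "..."  # fragment copied from the article (truncated here)

response = es.search(
    index="articles",
    query={"match_phrase": {"text": query_str}},
    highlight={"fields": {"text": {}}},  # highlighted fragments become the LLM context
    size=10,
)
hits = response["hits"]["hits"]
```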
Returned hits:
This prompt template gives the LLM the instructions to answer the question and the context to do so. At the end of the prompt, we're asking for the article that contains the information we are looking for.
The prompt template will be the same for all tests.
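The exact wording lives in the notebook; an illustrative version of such a template could look like this:

```python
# Illustrative prompt template; not the exact wording used in the experiment.
PROMPT_TEMPLATE = """Answer the question using only the context below.

Context (article titles and relevant passages):
{context}

Question:
{question}

At the end of your answer, state the title of the article that contains the information.
"""
```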
Run results through LLM
The Elasticsearch results will be provided as context to the LLM. We extract each article's title and the highlights relevant to the user query, then send the question, article titles, and highlights to the LLM to find the answer.
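A sketch of that step, assuming the google-generativeai SDK and the hits and template from the previous snippets:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_GOOGLE_API_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-2.0-flash")

# Build the context from article titles and the highlighted fragments returned by Elasticsearch.
context = "\n\n".join(
    f"Title: {hit['_source']['title']}\n" + "\n".join(hit.get("highlight", {}).get("text", []))
    for hit in hits
)

question = "Which article contains this text?"  # illustrative phrasing of the task
response = model.generate_content(PROMPT_TEMPLATE.format(context=context, question=question))
print(response.text)
```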
LLM response:
The model finds the right article.
3. Textual LLM
Match all query
To provide context to the LLM, we pull it from the documents indexed in Elasticsearch. We send all 303 indexed articles, which together add up to about 1 million tokens.
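A sketch of building that full context, again with placeholder index and field names:

```python
from elasticsearch import helpers

# Scroll through every indexed article (~303 documents, roughly 1M tokens in total).
all_docs = [
    hit["_source"]
    for hit in helpers.scan(es, index="articles", query={"query": {"match_all": {}}})
]

# Concatenate titles and bodies into a single, very large context string.
full_context = "\n\n".join(f"Title: {doc['title']}\n{doc['text']}" for doc in all_docs)
```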
Run results through LLM
As in the previous step, we're going to provide the context to the LLM and ask for the answer.
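With the full context assembled, the call looks the same as before, only with a much larger prompt:

```python
# Same illustrative template and model as before, but the context is now ~1M tokens.
response = model.generate_content(
    PROMPT_TEMPLATE.format(context=full_context, question=question)
)
print(response.text)
```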
LLM response:
It failed! When multiple articles contain similar information, the LLM might struggle to pinpoint the exact text you’re searching for.
4. Non-textual RAG
For the second test, we're going to use a semantic query to retrieve results from Elasticsearch. For that, we wrote a short synopsis of the “Elasticsearch in JavaScript the proper way, part II” article to use as query_str and provided it as input to the RAG pipeline.
From now on, the code mostly follows the same pattern as the tests with the textual query, so we’ll refer to the code in the notebook for those sections.
Running semantic search
Notebook reference: 2. Run Comparisons > Test 2: Semantic Query > Executing semantic search.
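For reference, a minimal sketch of what that semantic search might look like (the semantic_text field name is an assumption):

```python
query_str = "..."  # short, hand-written synopsis of the article

response = es.search(
    index="articles",
    query={"semantic": {"field": "semantic_text", "query": query_str}},
    size=5,
)
hits = response["hits"]["hits"]
```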
Semantic search response hits:
Run results through LLM
Notebook reference: 2. Run Comparisons > Test 2: Semantic Query > Run results through LLM
LLM response:
5. Non-textual LLM
Match all query
Notebook reference: 2. Run Comparisons > Test 2: Semantic Query > Match all query
Match all query response:
Run results through LLM
Notebook reference: 2. Run Comparisons > Test 2: Semantic Query > Run results through LLM
LLM response:
Since the LLM has more options to choose from, none of which were filtered out at a search stage, it picks every article that looks similar.
Test results
Now we are going to visualize the results of the test.
Textual Query
| Strategy | Answer | Tokens Sent | Time (s) | LLM Cost (USD) |
|---|---|---|---|---|
| Textual RAG | Elasticsearch in JavaScript the proper way, part II - Elasticsearch Labs | 237 | 1.281432 | 0.000029 |
| Textual LLM | The title of the article is "Testing your Java code with mocks and real Elasticsearch" | 1,023,231 | 45.647408 | 0.102330 |
Semantic Query
| Strategy | Answer | Tokens Sent | Time (s) | LLM Cost (USD) |
|---|---|---|---|---|
| Semantic RAG | Elasticsearch in JavaScript the proper way, part II - Elasticsearch Labs | 1,328 | 0.878199 | 0.000138 |
| Semantic LLM | "Elasticsearch in JavaScript the proper way, part II" and "A tutorial on building local agent using LangGraph, LLaMA3 and Elasticsearch vector store from scratch - Elasticsearch Labs" and "Advanced integration tests with real Elasticsearch - Elasticsearch Labs" and "Automatically updating your Elasticsearch index using Node.js and an Azure Function App - Elasticsearch Labs" | 1,023,196 | 44.386912 | 0.102348 |



Conclusion
RAG is still highly relevant. Our tests show that sending unfiltered data into the context window of a massive-context model is inferior to a RAG system in price, latency, and precision. It is common to see models lose attention when processing large amounts of context.
Even with the capabilities of large language models (LLMs), it's crucial to filter data before sending it to them, since excessive tokens can lower the quality of responses. However, large-context LLMs remain valuable when pre-filtering isn't feasible or when answers require drawing from extensive datasets.
Additionally, you still need to make sure you’re using the right queries in your RAG system to get a complete and correct answer. You can test different query parameters to retrieve different numbers of documents until you find what works best for you.
- Convenience: The average number of tokens sent to the LLM using RAG was 783, well below the maximum context window of any mainstream model.
- Performance: RAG delivers significantly faster queries: about 1 second on average, versus roughly 45 seconds for the full-context approach.
- Price: The average cost of a RAG query ($0.00008) was about 1,250 times lower than the full-context approach ($0.10).
- Precision: The RAG system produced accurate responses across all iterations, while the full-context approach led to inaccuracies.