With the fast-paced, evolving nature of AI, new terms and techniques appear all the time. One of the latest discussions is around context engineering. If you are not sure what context engineering is, why it's important, or what techniques you can use to optimize the context your agentic systems use, read on to find out.
What is context engineering?
Context engineering refers to a collection of practices that can be combined to provide the right information to Large Language Models (or LLMs) to help them accomplish the desired task. It's important to ensure that the LLMs we use in agents and MCP tools have the right information sources so that they provide accurate results, and don't hallucinate or fail to give the desired answer. My high school maths teacher always talked about the notion of "rubbish in, rubbish out" in terms of the inputs we provided to our calculations and proofs.

The same goes for LLMs. We can't expect LLMs to accurately provide the answers and automations we need without giving them the right information. As the ChatGPT example above shows, a model can only draw on information it was trained on, or that is provided in the context via the components discussed in subsequent sections.
Components
The visualization below showcases the key components of context that we can use to improve the responses of LLMs invoked by AI agents and called by MCP tools:

Source: https://www.philschmid.de/context-engineering
As Dexter Horthy outlines in his third principle of 12-Factor Agents, it's important to own your context to ensure LLMs generate the best outputs possible.
RAG
RAG is an architectural pattern where data sourced from an information retrieval system, such as Elasticsearch, is provided to an LLM to ground and enhance the results it generates. We've covered RAG in many Elasticsearch Labs blogs, including this one, which provides an overview of the pattern, and this tutorial, which covers building a RAG chatbot with Python, LangChain, React, and Elasticsearch.

Although some suggest that the ever-expanding context window size of newer LLMs means that "RAG is dead", in practice many find their LLM suffers from context confusion, as covered by Drew Breunig. Context confusion refers to the issue where surplus information provided to the LLM leads to a sub-optimal response. RAG helps direct LLMs to the desired result, as it addresses common limitations of general LLMs, including:
- Lack of specific domain knowledge for jargon-heavy disciplines such as financial services or engineering.
- Newer information or events that have happened after the model has been trained.
- Hallucinations, where the LLM generates incorrect answers.
RAG typically involves pulling relevant documents from a data store and passing them via the prompt, or through dedicated AI tools invoked by an LLM. A simple example from the AI SDK Travel Planner, leveraging the Elasticsearch JavaScript client, is given below:
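What follows is a minimal sketch rather than the planner's actual code: the flights index, its field names, and the client configuration are assumptions, while the tool helper and zod schema follow the Vercel AI SDK's standard tool-calling pattern.

```typescript
import { tool } from 'ai';
import { z } from 'zod';
import { Client } from '@elastic/elasticsearch';

// Elasticsearch client configured from environment variables
const client = new Client({
  node: process.env.ELASTICSEARCH_URL!,
  auth: { apiKey: process.env.ELASTICSEARCH_API_KEY! },
});

// A RAG tool that retrieves matching flight documents to ground the LLM's itinerary
export const flightTool = tool({
  description: 'Search for available flights between an origin and a destination',
  parameters: z.object({
    origin: z.string().describe('The departure city'),
    destination: z.string().describe('The arrival city'),
  }),
  execute: async ({ origin, destination }) => {
    const result = await client.search({
      index: 'flights', // hypothetical index of flight documents
      query: {
        bool: {
          must: [{ match: { origin } }, { match: { destination } }],
        },
      },
    });
    // Return only the document sources as grounding context for the LLM
    return result.hits.hits.map((hit) => hit._source);
  },
});
```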
Retrieving relevant information from sources such as Elasticsearch, and even utilizing techniques such as LLM summarization or data aggregation, as my colleague Alex did when building an MCP server to summarize and query his health data, can ensure that the LLM has the precise data it needs to provide the answer. This context can then be passed using emerging protocols such as the Model Context Protocol (MCP) or the Agent2Agent Protocol (known as A2A).
Prompts
Perhaps considered a more established practice, but still very much a subset of context engineering, prompt engineering refers to the practice of refining and crafting effective inputs (or prompts) to an LLM to produce the result we want. Although commonly structured as simple text, prompts can also include other media, such as images and audio. Sander Schulhoff et al., in their survey of prompt engineering, define the following components of a prompt:
- Directive: the instruction or question serving as the main intent of the request.
- Exemplars: demonstrable examples to guide the LLM to accomplish the task.
- Output formatting: the format in which the output is expected to be returned, such as JSON or unstructured text. This is important because, depending on the source of the data, the LLM may need to translate it (for example, from the structured JSON of an Elasticsearch query response into another format, rather than returning the result directly).
- Style instructions: guidance on how to alter the style, rather than the structure, of the output. This is considered a specific type of output formatting.
- Role: the persona the LLM needs to emulate to achieve the task (for example a travel agent).
- Additional information: other useful details needed to complete the task, including context from other sources.
The example below showcases each of these elements in a prompt for a travel planning agent:

All of these elements can be tweaked and evaluated to ensure the optimal result is obtained from the LLM. In addition to these elements, there are numerous techniques that can be used to structure and optimize prompts to gain the answer you need. For example, Wei et al. in their 2022 paper found that standard zero-shot prompts, where we ask an LLM a simple structured question, fare less well than chain-of-thought prompting techniques for arithmetic and reasoning tasks. The differences are summarized in the example below:

Source: https://arxiv.org/pdf/2201.11903
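To make the contrast concrete, here is a small sketch (with made-up numbers, in the spirit of the paper's arithmetic examples): the zero-shot prompt asks the question directly, while the chain-of-thought prompt includes an exemplar demonstrating step-by-step reasoning.

```typescript
// Zero-shot: the question is asked directly, with no demonstration of reasoning
const zeroShotPrompt =
  'Q: A hotel costs $120 per night for 3 nights, plus a $45 booking fee. ' +
  'What is the total cost? A:';

// Chain-of-thought: an exemplar walks through the intermediate steps,
// encouraging the model to reason step by step before answering
const chainOfThoughtPrompt = `Q: A hotel costs $100 per night for 2 nights, plus a $30 booking fee. What is the total cost?
A: The room costs 2 * $100 = $200. Adding the $30 fee gives $200 + $30 = $230. The answer is $230.

Q: A hotel costs $120 per night for 3 nights, plus a $45 booking fee. What is the total cost? A:`;
```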
When deciding on the format of the prompt to provide, you need to consider several factors, including:
- The type of task (for example simple recall or translation compared to complex arithmetic reasoning).
- Task complexity and ambiguity. Ambiguous requests may lead to unpredictable results.
- The inputs you are providing as context, along with the format.
- The output required.
- The capabilities of your chosen LLM.
- The persona you would like the LLM to emulate.
Memory
Much like humans, AI applications rely on both short and long-term memory to recall information. Within context engineering:
- Short-term memory, often referred to as state or chat history, refers to the messages exchanged in the current conversation between the user and the model. This includes the initial and follow-up questions presented by the user.
- Long-term memory, simply referred to as memory, refers to information shared across conversations. Key examples would be relevant common information or recent prior conversations.
Taking our Travel Planner Agent example, the short-term memory would include the travel dates and location, along with any follow-up messages if the user changes their mind and wants to explore another destination. The long-term memory in this case could contain profile information about the user's travel preferences, along with past trips that could be used to inform suggestions of what activities to include in a new itinerary (such as wine tasting opportunities for those who have taken part in such activities on prior vacations).
Most AI frameworks provide the ability to manage both chat history and memory, as it's important to manage the history so that it fits within the context window alongside the other elements of context. Taking LangGraph as an example, short-term memory is managed as part of the agent state using a checkpointer, while long-term memory is persisted to long-term stores:

Source: https://langchain-ai.github.io/langgraphjs/concepts/memory/
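As a minimal sketch of the short-term side, assuming LangGraph.js's prebuilt createReactAgent and its in-memory MemorySaver checkpointer, conversations that share a thread_id share chat history:

```typescript
import { ChatOpenAI } from "@langchain/openai";
import { MemorySaver } from "@langchain/langgraph";
import { createReactAgent } from "@langchain/langgraph/prebuilt";

// Short-term memory: the checkpointer persists agent state per conversation thread
const checkpointer = new MemorySaver();

const agent = createReactAgent({
  llm: new ChatOpenAI({ model: "gpt-4o" }),
  tools: [], // e.g. the flight and weather tools shown elsewhere in this post
  checkpointSaver: checkpointer,
});

// Calls that share a thread_id share chat history (short-term memory)
const config = { configurable: { thread_id: "travel-session-1" } };

await agent.invoke(
  { messages: [{ role: "user", content: "Plan a 3-day trip to Lisbon" }] },
  config,
);

// The follow-up can reference earlier turns because the state was checkpointed
await agent.invoke(
  { messages: [{ role: "user", content: "Actually, make it Porto instead" }] },
  config,
);
```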
As we build multi-agent architectures, we also need to be mindful of the segregation of memory and context. When splitting tasks among sub-agents in a larger flow, each agent may need knowledge of the other agents' results to remain in sync. However, over time these additions can result in a context window overflow:

Source: https://cognition.ai/blog/dont-build-multi-agents
It is important to curate the context stored in both types of memory to ensure relevant and up-to-date context is provided to LLMs. Failure to do so can result in context poisoning. This can come about through malicious intent, as we see in prompt injection and data poisoning attacks per the OWASP LLM Application Top 10. But it can also occur for innocent reasons, such as a buildup of history that distracts the model, or contradictory information that results in clashes.
In the Gemini 2.5 report, researchers found that a Pokémon-playing Gemini agent showed a tendency to repeat actions from its history instead of forming novel approaches, meaning the growing context became more of a hindrance to solving the problem. For these reasons, practices such as trimming and summarizing the chat history, and curating what retrieved information is included, should be applied.
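For example, LangChain's trimMessages utility (one option among many; shown here as a sketch with a stand-in conversation) can cap the history passed to the model while always retaining the system message:

```typescript
import {
  HumanMessage,
  SystemMessage,
  trimMessages,
} from "@langchain/core/messages";
import { ChatOpenAI } from "@langchain/openai";

const llm = new ChatOpenAI({ model: "gpt-4o" });

// Stand-in for the accumulated conversation history
const history = [
  new SystemMessage("You are a helpful travel planning assistant."),
  new HumanMessage("Plan a 3-day trip to Lisbon"),
  // ...the rest of the conversation
];

// Keep only the most recent messages that fit within the token budget
const trimmed = await trimMessages(history, {
  maxTokens: 4000,
  strategy: "last",    // drop the oldest messages first
  includeSystem: true, // always retain the system message
  tokenCounter: llm,   // use the chat model to count tokens
});
```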
Structured Outputs
As we move to complex AI agent architectures, there is a need to ensure that the outputs emitted by LLMs adhere to a schema or contract that makes it easier to parse and integrate with other systems and workflows.

Source: https://js.langchain.com/docs/concepts/structured_outputs/
We are all used to freeform text results, but these formats can be difficult to integrate into dependent systems and agents. Much like designing a set of REST endpoints that adhere not just to best practices such as the OpenAPI standard, but also to a contract compatible with other components, we need to specify the output format and schema that we expect the LLM to return. The example below shows how to specify a schema and generate an object adhering to it using the AI SDK:
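This is a minimal sketch using the AI SDK's generateObject function; the itinerary schema is a hypothetical one for the travel planner example.

```typescript
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';

// Hypothetical itinerary schema the LLM output must conform to
const itinerarySchema = z.object({
  destination: z.string(),
  days: z.array(
    z.object({
      day: z.number(),
      activities: z.array(z.string()),
    }),
  ),
});

const { object } = await generateObject({
  model: openai('gpt-4o'),
  schema: itinerarySchema,
  prompt: 'Plan a 3-day itinerary for Lisbon, Portugal.',
});

// `object` has been parsed and validated against itinerarySchema
console.log(object.days[0].activities);
```

Note that generateObject throws if the model's output cannot be parsed against the schema, so wrapping the call in a try/catch is one way to handle the malformed-JSON case discussed below.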
Structured JSON output for LLMs makes sense: it's common to need to balance processing structured and unstructured data, just as Elasticsearch does internally. For this reason, there is emerging support in some models for generating outputs that adhere to a provided JSON schema, including through the Structured Outputs feature available in the OpenAI platform. When combined with function calling, this allows us to define standard contracts for passing information between tools. However, given that LLMs can generate JSON with syntax issues, it's important to handle potential errors gracefully when processing results.
Available Tools
The final element of context that we can use within context engineering is the set of tools that we give to LLMs to provide data. Tools allow us to perform actions such as booking the trip defined by our itinerary, retrieving data using RAG as discussed previously, or providing information from other sources. We have shown an example of a RAG tool above with our flightTool, but tools can also be used to pull in other sources of information, for example the weather tool below, built with the AI SDK:
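The sketch below fakes the weather lookup for illustration; a real implementation would call a weather API inside execute.

```typescript
import { tool } from 'ai';
import { z } from 'zod';

export const weatherTool = tool({
  description: 'Get the current weather for a given location',
  parameters: z.object({
    location: z.string().describe('The city to get the weather for'),
  }),
  execute: async ({ location }) => {
    // Faked response for illustration; call a weather API here in a real tool
    return {
      location,
      temperature: 22,
      conditions: 'Partly cloudy',
    };
  },
});
```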
Irrespective of the framework used, a tool comprises:
- A description of what the tool does, to inform the LLM.
- The parameters expected by the function, along with defined data types. Here we define these using the TypeScript validation library zod.
- The function to be invoked by the LLM when the tool is used.
If an LLM supports tool calling, it can choose to call (potentially) one or many tools to solve the problem. I have discussed my experiences of model choice before, while building my own multi-tool AI agent. When choosing a model, it's important to investigate the level of tool calling support using resources such as the Hugging Face Open LLM Leaderboard or the Berkeley Function-Calling Leaderboard. The problem is that, given the LLM decides which tools are relevant to the objective, it can be confused by too many tools and call irrelevant ones, as discussed by Drew Breunig. This idea of tool confusion is also discussed in the 2024 paper by Paramanayakam et al., where they found the performance of Llama 3.1 8B improved when provided with fewer tools (19 compared with 46).
Optimizing the number of tools available is an open area of research. Experiments applying RAG architectures to combat tool confusion, such as retrieving relevant tool descriptions to optimize tool selection in MCP (RAG-MCP), suggest that providing only the most relevant tool descriptions to the LLM results in more accurate responses.
Conclusion
This article covered what context engineering is, and gave an overview of the key components of context. If you are interested in learning more, check out the resources below.
Resources
- The New Skill in AI is Not Prompting, It's Context Engineering | Philipp Schmid
- 12-Factor Agents - Principles for building reliable LLM applications | Dexter Horthy
- How Long Contexts Fail | Drew Breunig
- How to Fix Your Context | Drew Breunig
- The Prompt Report: A Systematic Survey of Prompt Engineering Techniques | Schulhoff et al.
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models | Wei et al.
- What is Memory? | LangGraph
- Don't Build Multi-Agents | Cognition
- Structured Outputs | LangChain
- Less is More: Optimizing Function Calling for LLM Execution on Edge Devices | Paramanayakam et al.
- RAG-MCP: Mitigating Prompt Bloat in LLM Tool Selection via Retrieval-Augmented Generation | Gan and Sun