What is context engineering?

Context engineering is the practice of giving AI systems the right information at the right time. Think of it like preparing a briefing for a new colleague: you wouldn't dump every company document on their desk; you'd carefully select the most relevant information for their specific task.

Modern AI agents need access to vast amounts of data (documents, databases, emails, code) but can only process a limited amount at once. Context engineering is the discipline of intelligently selecting, organizing, and delivering exactly what the AI needs to make good decisions, without overwhelming it with unnecessary information. Done well, it's the difference between an AI that gives generic responses and one that provides genuinely helpful, accurate answers grounded in your specific data.


Why context engineering? The limits of raw LLMs

LLMs and reasoning models (RMs) are powerful components in modern applications, but they possess a fundamental limitation: An LLM's performance is not solely a function of its internal, static knowledge. Its practical success is critically dependent on the external information and tools provided to it at the moment of inference.

By default, LLMs have four major constraints:

  • Static knowledge: Their understanding of the world is frozen at their last training date, leaving them unaware of current events.
  • No access to private data: They have no native ability to access your company's live, proprietary data: the documents, metrics, and logs that hold the most valuable context.
  • Hallucinations and lack of grounding: The models function by predicting the next most probable token in a sequence. This process is optimized for linguistic coherence, not factual verification, allowing them to generate plausible-sounding but factually incorrect information.
  • Contextual drift and lack of memory: Agents struggle with multistep tasks because they lack persistent context or memory. Without a way to recall previous decisions, their reasoning "drifts," causing them to re-infer information inconsistently and fail at complex workflows.

This has given rise to context engineering, an emerging practice focused on building reliable, stateful AI agents. Context engineering shifts the focus beyond prompt engineering, which crafts instructions for a single interaction, to managing the full context as agents tackle multistep, complex tasks. Context engineering is the art of managing a model's limited attention. This practice involves architecting the entire information ecosystem surrounding the model: curating its context window at any given moment and strategically deciding what information from user messages, tool outputs, or its own internal thoughts makes it into the agent's limited "working memory."

Context engineering draws inspiration from established software engineering principles. Just as developers architect databases, APIs, and data pipelines to optimize information flow in traditional systems, context engineers design the information architecture that powers intelligent agents. Context engineers are responsible for managing what information occupies the LLM's limited "working memory" (the context window) and what is retrieved from "persistent memory" (like a vector database). Context engineering recognizes that even the most capable LLM cannot compensate for poorly structured, incomplete, or irrelevant context.


The critical distinction: Context vs. prompt engineering

While often used interchangeably, these terms represent different levels of abstraction. Prompt engineering is the tactical craft of writing a single instruction to get a specific, often one-off response.

Ultimately, prompt engineering is a subset of context engineering. The practice of context engineering determines what fills the LLM's context window, while prompt engineering is concerned with crafting the specific instruction within that curated window.

 

| Aspect | Prompt engineering | Context engineering |
| --- | --- | --- |
| Primary goal | Elicit a specific, often one-off response | Ensure consistent, reliable system performance across tasks and sessions |
| Scope | A single interaction or the immediate instruction string | The entire information environment, including memory, tools, and data sources |
| Analogy | Asking a well-phrased question | Building the library and providing the tools for an expert to use |
| Core activity | Wordsmithing, instruction crafting | Systems design, data orchestration, memory management |

What are the building blocks of context engineering?

Critical capabilities of the context engineering practice

Instructions/system prompt

The system prompt establishes the agent's foundational context: its identity, capabilities, constraints, and behavioral guidelines. Unlike user prompts that change with each interaction, the system prompt remains relatively stable and acts as a persistent "personality" and rulebook. Effective system prompts balance three competing demands: specificity (clear enough to prevent ambiguous behavior), flexibility (general enough to handle diverse scenarios), and conciseness (brief enough to preserve context window space). Best practices include:

  • Defining the agent's role explicitly ("You are a financial analyst assistant ...")
  • Providing concrete examples of desired behavior rather than abstract rules
  • Using structured delimiters (XML tags, markdown sections) to organize instructions for better model comprehension
  • Placing critical constraints (safety rules, output format requirements) at prominent positions since models exhibit positional bias

Advanced techniques include conditional instructions that activate based on runtime context (e.g., "If the user asks about personal information, redirect to privacy policy") and meta-instructions that guide the agent's reasoning process (e.g., "Think step-by-step before providing analysis"). The system prompt is particularly vulnerable to context window competition; as conversation history, tool outputs, and retrieved data accumulate, poorly designed system prompts get pushed out of the model's effective attention span, causing behavioral drift where the agent gradually "forgets" its core instructions.
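
Put together, a skeletal system prompt might look like the following sketch. The role, example, and constraints are illustrative placeholders, not a recommended production prompt.

```python
# A minimal sketch of a structured system prompt following the practices
# above. All content here is an illustrative placeholder.
SYSTEM_PROMPT = """\
<role>
You are a financial analyst assistant. Answer questions about company
metrics using only the data provided in context.
</role>

<example>
User: What was Q3 revenue?
Assistant: Q3 revenue was $4.2M, per the retrieved quarterly report.
</example>

<constraints>
- If a figure is not in the provided data, say so instead of guessing.
- If the user asks about personal information, redirect to the privacy policy.
- Respond in JSON with keys "answer" and "sources".
</constraints>
"""
```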

Long-term memory

Long-term memory enables an AI to retain information across multiple sessions or conversations. Unlike short-term memory, which is ephemeral and lost at the end of a session, long-term memory allows an AI to recall user preferences, past interactions, and learned facts for future reference.

State/history (short-term memory)

State and history constitute the agent's working memory of the current session: the record of what has been said, done, and learned within an ongoing interaction. This short-term memory enables conversational continuity; the agent can reference previous exchanges without forcing users to repeat context. However, conversation history grows linearly with interaction length, quickly consuming the context window.

Effective context engineering requires active memory management strategies. Summarization compresses older exchanges into concise representations while preserving key facts and decisions. Windowing keeps only the most recent N messages, discarding earlier history under the assumption that recent context matters most. Selective retention applies heuristics to identify and preserve critical information (user preferences, established facts, open questions) while pruning routine conversational filler.
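
As a concrete illustration, a simple combination of windowing and summarization might look like this sketch; the `summarize` stub stands in for an LLM call, and the thresholds are arbitrary.

```python
# A minimal sketch of windowing plus summarization for short-term memory.
# `summarize` is a stub standing in for an LLM-based compressor.
def summarize(messages: list[dict]) -> str:
    # Placeholder: in practice, ask an LLM to distill these messages.
    return " ".join(m["content"] for m in messages)[:500]

def manage_history(history: list[dict], max_recent: int = 10) -> list[dict]:
    """Keep the newest messages verbatim; compress everything older."""
    if len(history) <= max_recent:
        return history
    older, recent = history[:-max_recent], history[-max_recent:]
    summary = {"role": "system",
               "content": f"Summary of earlier conversation: {summarize(older)}"}
    return [summary] + recent
```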

More sophisticated approaches use episodic memory structures where the agent writes important state to external storage and retrieves it on demand, mimicking how humans don't hold entire conversations in active working memory but can recall specific details when needed. The challenge is maintaining coherence; overly aggressive pruning causes the agent to "forget" key context and repeat mistakes, while insufficient compression leads to context overflow and performance degradation.

Retrieved information (RAG)

Retrieval augmented generation (RAG) involves the AI retrieving external data "just in time" from a knowledge base, such as internal company documents or public websites. RAG enables the AI to answer questions using information it was not originally trained on, thereby ensuring its responses are both current and accurate.
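
The core RAG loop is simple: retrieve relevant chunks, then ground the prompt in them. The sketch below assumes hypothetical `search_knowledge_base` and `llm` callables rather than any specific library's API.

```python
# A minimal sketch of the RAG pattern: retrieve first, then answer from
# the retrieved context only. All callables are illustrative assumptions.
def answer_with_rag(question: str, search_knowledge_base, llm) -> str:
    chunks = search_knowledge_base(question, top_k=5)  # just-in-time retrieval
    context = "\n\n".join(c["text"] for c in chunks)
    prompt = (
        "Answer using only the context below. If the answer is not in the "
        "context, say so.\n\n"
        f"<context>\n{context}\n</context>\n\n"
        f"Question: {question}"
    )
    return llm(prompt)
```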

Semantic chunking

Semantic chunking improves retrieval by structuring information logically. Instead of breaking text into arbitrary, fixed-size pieces, semantic chunking groups related concepts together (e.g., by paragraphs, functions, or logical sections). When a relevant chunk is retrieved, its immediate surroundings are also included. This provides the LLM with more coherent, complete context, which helps it reason more effectively and mitigates issues from fragmented information.
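
A minimal sketch of paragraph-based chunking, plus neighbor expansion at retrieval time, might look like this; the size budget is an arbitrary illustration.

```python
# A minimal sketch of semantic chunking along paragraph boundaries.
def chunk_by_paragraph(text: str, max_chars: int = 1200) -> list[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        if current and len(current) + len(para) > max_chars:
            chunks.append(current)  # close the chunk at a paragraph boundary
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

def with_neighbors(chunks: list[str], hit: int) -> str:
    """Return a retrieved chunk together with its immediate neighbors."""
    return "\n\n".join(chunks[max(0, hit - 1): hit + 2])
```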

Reranking

Reranking solves the "speed vs. accuracy" trade-off inherent in large-scale retrieval. The initial search (like hybrid search) is optimized to quickly retrieve a large set of potentially relevant documents (e.g., the top 100). A reranking model — which is typically more computationally expensive but far more accurate — is then used to re-score only this smaller subset. For context engineering, this is vital because it ensures the absolute best, most relevant snippets are placed at the very top of the context window, which is essential for mitigating the "lost in the middle" problem and focusing the LLM's attention on the highest-quality information.
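
In code, the two-stage pattern reduces to a broad first pass followed by precise re-scoring of a small subset. In this sketch, `fast_search` and `rerank_model.score` are illustrative assumptions standing in for a real retriever and cross-encoder.

```python
# A minimal sketch of two-stage retrieval: fast, broad recall first, then
# an accurate (but slower) reranker over the small candidate set.
def retrieve_and_rerank(query: str, fast_search, rerank_model, top_k: int = 5):
    candidates = fast_search(query, size=100)            # cheap, broad recall
    scored = [(rerank_model.score(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # best first
    # Only the highest-scoring snippets enter the top of the context window.
    return [doc for _, doc in scored[:top_k]]
```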

Available tools

Tools extend an agent's capabilities beyond text generation by enabling interaction with external systems: executing code, querying databases, calling APIs, or manipulating files. From a context engineering perspective, tools create a unique challenge: each tool requires a description (name, purpose, parameters, usage examples) that consumes context window space. As tool libraries grow, this "tool context overhead" becomes significant. A 100-tool agent might spend 30%–40% of its context window just describing available capabilities before the user's actual task begins.

Effective tool engineering follows several principles:

  • Keep tool descriptions concise but unambiguous: Include the tool's purpose, required parameters with types, and one canonical example.
  • Design tools to be composable: Smaller, focused tools (e.g., "search_documents," "summarize_text") combine more flexibly than monolithic tools trying to handle multiple scenarios.
  • Implement tool categories or namespaces to enable selective loading: An agent working on financial analysis doesn't need tools for image processing.
  • Use tool result filtering: Return only essential information to the agent, not raw API responses. A database query tool should return "Found 3 relevant transactions totaling $4,532" rather than complete SQL result sets.

Well-designed tools also include error handling in their descriptions, teaching the agent how to recover from failures gracefully rather than cascading errors through the workflow.
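
A concise tool description following these principles might look like the sketch below, written in the JSON-schema style used by common function-calling APIs. The tool name and fields are illustrative, not a specific product's schema.

```python
# A minimal sketch of a concise, unambiguous tool description: purpose,
# typed parameters, and one canonical usage example.
SEARCH_DOCUMENTS_TOOL = {
    "name": "search_documents",
    "description": (
        "Search the internal knowledge base and return short snippets. "
        "Example: search_documents(query='Q3 revenue', max_results=3)"
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string",
                      "description": "Natural language search query"},
            "max_results": {"type": "integer",
                            "description": "Snippets to return (default 3)"},
        },
        "required": ["query"],
    },
}
```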

Agentic search

Agentic search is a specialized "sub-agent" tool that performs complex, multistep exploration in its own isolated context. For example, it can translate a natural language request into a precise ES|QL query, find the data, and return only a concise summary to the main agent, keeping its working memory clean.

Domain-specific workflows

Domain-specific workflows are predefined, deterministic toolchains designed for high-stakes, predictable business processes where reliability and consistency outweigh exploratory flexibility. Unlike general-purpose agents that reason through each step dynamically, these workflows follow a strict, validated sequence. For example: "Verify Customer Identity → Check Credit History → External Regulatory Screening → Calculate Risk Score → Generate Compliance Report." Each step has explicit success criteria, error handling, and rollback procedures.

This rigidity is intentional; it prevents the unpredictability inherent in LLM-based reasoning from affecting mission-critical operations like financial approvals, medical diagnostics, or regulatory compliance. From a context engineering perspective, domain workflows simplify the agent's task by reducing degrees of freedom. The agent doesn't need context about all possible tools and strategies, only the specific information required for the current workflow step. This focused context improves both accuracy and efficiency.

Implementation typically involves state machines or directed acyclic graphs (DAGs) where the LLM handles variable elements (parsing user input, selecting data sources, generating natural language summaries) while deterministic logic controls the overall process flow. The tradeoff is reduced adaptability; these workflows excel at known scenarios but struggle when edge cases fall outside the predefined path.
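
A minimal sketch of this pattern follows: fixed code controls the sequence, and each step has an explicit failure check. The step functions are illustrative stubs for the compliance example above.

```python
# A minimal sketch of a deterministic workflow with a strict step sequence
# and explicit per-step error handling. Each step is an illustrative stub.
def verify_identity(state: dict) -> dict:
    state["verified"] = True      # stub: call an identity service here
    return state

def check_credit_history(state: dict) -> dict:
    state["credit_score"] = 720   # stub: query a credit bureau here
    return state

def calculate_risk_score(state: dict) -> dict:
    state["risk"] = "low" if state["credit_score"] > 650 else "high"
    return state

WORKFLOW = [verify_identity, check_credit_history, calculate_risk_score]

def run_workflow(state: dict) -> dict:
    for step in WORKFLOW:         # strict, validated sequence
        state = step(state)
        if state.get("error"):    # explicit per-step failure check
            raise RuntimeError(f"Workflow halted at {step.__name__}")
    return state
```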

Dynamic tool discovery

Dynamic tool discovery addresses the "prompt bloat" problem that occurs when agents have access to large tool libraries. Rather than listing hundreds of tool descriptions in the system prompt — which consumes valuable context window space and degrades tool selection accuracy — this strategy uses semantic search over tool metadata to retrieve only relevant capabilities at runtime.

When an agent receives a task, it queries a tool registry using the task description as input, retrieving the 3–5 most semantically similar tools for that specific context. This approach mirrors just-in-time data retrieval: tools remain in external storage until needed, and the agent's attention stays focused on applicable capabilities rather than being diluted across an exhaustive catalog. Protocols like MCP (Model Context Protocol) standardize this pattern by providing registries where tools can be discovered, understood, and invoked dynamically. However, dynamic discovery introduces latency (the search operation itself) and requires careful engineering to prevent the agent from selecting suboptimal tools or chasing dead ends when tool descriptions are ambiguous.
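
Stripped to its essentials, discovery is a similarity search over tool metadata. In the sketch below, `embed` is an illustrative stand-in for any embedding model, and the registry is assumed to be a list of tool dicts with a `description` field.

```python
# A minimal sketch of dynamic tool discovery: rank tools by semantic
# similarity between the task and each tool's description.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def discover_tools(task: str, registry: list[dict], embed, top_k: int = 4):
    """Return the top_k tools whose descriptions best match the task."""
    task_vec = embed(task)
    ranked = sorted(
        registry,
        key=lambda tool: cosine(embed(tool["description"]), task_vec),
        reverse=True,
    )
    return ranked[:top_k]  # only these descriptions enter the context window
```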

User prompt

The user prompt is the direct input that triggers agent behavior and defines the immediate task context. Unlike the system prompt (which remains relatively static), the user prompt varies with each interaction and carries the highest attention weight in most LLM architectures. This positional bias means user prompts often override conflicting information elsewhere in the context.

Effective context engineering treats user prompts as more than simple questions; they can include explicit context hints (timestamps, user preferences, session state) that guide retrieval and tool selection without bloating the system prompt. For stateful agents, the user prompt becomes the entry point where session-specific information gets injected — for example, "given our conversation about quarterly metrics ..." signals the agent to prioritize recently retrieved financial data. However, user prompts also represent the most unpredictable element of context and can be ambiguous, contradictory, or adversarial. Context engineering must account for this variability through query understanding models that reformulate unclear requests, safety filters that detect prompt injection attempts, and fallback strategies when user intent cannot be reliably inferred from the input alone.

Structured output

Structured output refers to information that an AI needs to format in a specific way, such as JSON, XML, or a table. By defining a structured output, AI responses can be consistent and easily used by other programs or systems.
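
One common pattern is to request JSON matching a schema and validate it before handing it to downstream systems. The schema and field names in this sketch are illustrative assumptions.

```python
# A minimal sketch of validating a model's structured output before use.
import json

RESPONSE_SCHEMA = {"answer": str, "confidence": float, "sources": list}

def parse_structured(raw: str) -> dict:
    data = json.loads(raw)  # raises if the model returned malformed JSON
    for key, expected in RESPONSE_SCHEMA.items():
        if not isinstance(data.get(key), expected):
            raise ValueError(f"Field '{key}' missing or not {expected.__name__}")
    return data
```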

For a more in-depth exploration of these concepts, read the full blog post: Context engineering overview.

The context engineering pipeline

The practice of context engineering is best understood as the design of a systematic pipeline built to support the LLM. Rather than just combining various components ad hoc, this pipeline is tailored to a specific task and is designed to manage the entire flow of information to and from the model at every stage of the loop. It is typically broken down into three core stages, illustrated with a short sketch after the list:

  1. Context retrieval and generation: This stage involves actively sourcing raw data from a wide array of potential inputs, such as retrieving documents from a vector database, querying a structured SQL database, or making API calls to external services.
  2. Context processing: Once gathered, the raw information is optimized. This involves transforming the data to maximize its signal-to-noise ratio using techniques like chunking, summarization, compression, and structuring.
  3. Context management: This final stage governs how information is stored, updated, and utilized across multiple interactions. It is crucial for building stateful applications and involves strategies for both short-term (session) and long-term (persistent) memory.
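
The sketch below ties the three stages together in a single function; `retrieve` and `summarize` are illustrative stand-ins for real components, not a specific API.

```python
# A minimal sketch of the three-stage pipeline: retrieve, process, manage.
def build_context(query: str, retrieve, summarize, session_memory: list) -> str:
    docs = retrieve(query)                     # 1. retrieval and generation
    processed = [summarize(d) for d in docs]   # 2. processing (compression)
    session_memory.append({"query": query})    # 3. management (session state)
    return "\n\n".join(processed)
```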

How does context engineering work?

Common to all context engineering pipelines are a set of strategies to dynamically manage what the model "sees." This is a practice that treats the context window as a limited resource that must be actively optimized by selecting, filtering, and ranking data rather than just being passively filled with raw, unfiltered information. These strategies can be grouped into four main categories.

Selection: Retrieving the right information

The most powerful strategy is to keep information outside of the context window and retrieve it "just in time" when the agent needs it. This mirrors how humans work: we don't memorize entire libraries; we use search engines and filing systems to find what we need on demand.

For an AI agent, this means querying an external knowledge base. However, finding the right information is a significant challenge. As data grows, simple semantic search can become unreliable. Effective selection often requires a hybrid approach, blending multiple search techniques, like keyword, semantic, and graph-based retrieval, to pinpoint the exact context needed from vast and complex datasets.
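
As a sketch of what hybrid selection can look like in practice, the example below combines lexical (`match`) and vector (`knn`) retrieval in a single Elasticsearch request. The index name, field names, and the externally computed query vector are illustrative assumptions.

```python
# A minimal sketch of hybrid retrieval with the Elasticsearch Python client,
# assuming an index with a text field "content" and a dense_vector field
# "content_vector" populated at ingest time.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def hybrid_search(question: str, query_vector: list[float], size: int = 5):
    response = es.search(
        index="knowledge-base",
        size=size,
        query={"match": {"content": question}},  # keyword relevance
        knn={
            "field": "content_vector",           # semantic relevance
            "query_vector": query_vector,
            "k": size,
            "num_candidates": 50,
        },
    )
    return response["hits"]["hits"]
```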

Writing: Creating external memory

This strategy gives an agent a place to offload information by writing to an external memory, like a "scratchpad" file or a dedicated database. For example, an agent can save its multistep plan to a file and refer back to it, preventing the plan from being pushed out of a crowded context window. This allows the agent to maintain state and track progress on long-running tasks without cluttering its working memory.
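
A scratchpad can be as simple as a file the agent writes to and reads from between steps. The file path and plan structure in this sketch are illustrative.

```python
# A minimal sketch of an external scratchpad: the agent offloads its plan
# to storage and pulls back only the next step on demand.
import json
from pathlib import Path

SCRATCHPAD = Path("agent_plan.json")

def save_plan(steps: list[str]) -> None:
    SCRATCHPAD.write_text(json.dumps({"steps": steps, "done": []}))

def next_step() -> str | None:
    plan = json.loads(SCRATCHPAD.read_text())
    remaining = [s for s in plan["steps"] if s not in plan["done"]]
    return remaining[0] if remaining else None
```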

Compression: Making context more efficient

Compression techniques reduce the number of tokens in the context window while preserving the essential information; a short trimming sketch follows the list.

  • Summarization: Uses an LLM to distill long conversations or documents into concise summaries. For instance, the complete, token-heavy output of a tool can be replaced by a short summary of its results.
  • Trimming: Filters context using hard-coded rules, such as removing the oldest messages in a conversation or clearing redundant tool outputs that are no longer needed.
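
A minimal sketch of such a rule-based trimming pass, assuming chat-style message dicts and a crude character-based token estimate:

```python
# A minimal sketch of trimming: collapse stale tool outputs, then drop the
# oldest messages until the history fits a token budget.
def trim(history: list[dict], max_tokens: int = 4000) -> list[dict]:
    def tokens(msg: dict) -> int:
        return len(msg["content"]) // 4  # rough estimate, not a real tokenizer

    # Replace verbose tool outputs that are no longer needed with a stub.
    for msg in history[:-5]:
        if msg.get("role") == "tool":
            msg["content"] = "[tool output removed]"

    # Drop the oldest messages until the remainder fits the budget.
    while history and sum(tokens(m) for m in history) > max_tokens:
        history.pop(0)
    return history
```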

Isolation: Separating concerns

For highly complex tasks, a single agent can become overwhelmed. Isolation involves breaking the problem down and assigning sub-tasks to specialized "sub-agents," each with its own clean, focused context window. A lead agent coordinates this team, receiving only the distilled, final outputs from each specialist. This approach keeps each agent's context relevant and manageable, improving overall performance on complex research or analysis tasks.
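
In its simplest form, isolation means each sub-agent gets a fresh context containing only its own sub-task, and the lead agent sees only the distilled results. In this sketch, `run_llm` is an illustrative stand-in for any model call.

```python
# A minimal sketch of context isolation with sub-agents.
def run_subagent(task: str, run_llm) -> str:
    # The sub-agent sees only its own task, not the lead agent's history.
    return run_llm(f"Complete this sub-task and reply with a short summary:\n{task}")

def lead_agent(goal: str, subtasks: list[str], run_llm) -> str:
    findings = [run_subagent(t, run_llm) for t in subtasks]
    # Only the distilled outputs enter the lead agent's context window.
    briefing = "\n".join(f"- {f}" for f in findings)
    return run_llm(f"Goal: {goal}\nFindings:\n{briefing}\nWrite the final answer.")
```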

By following these principles, context engineering aims to provide the LLM with the smallest possible set of high-signal tokens that maximize the chance of a successful outcome: relevant output.


The core technical challenge: The context window

Understanding the context window

At its foundation, context engineering is shaped by a fundamental constraint: LLMs have finite attention budgets. The context window (measured in tokens) defines the maximum amount of information a model can process at once. While modern models support increasingly large context windows (100,000, 1 million, or even 2 million tokens), simply filling this space doesn't guarantee better performance.

LLMs operate on transformer architecture, where every token must attend to every other token. As context grows, this creates computational overhead and what practitioners call "context rot": the model's ability to maintain focus and recall specific details degrades as the information load increases. This phenomenon mirrors human cognitive limits; more information doesn't always mean better decisions.

Attention degradation

Simply expanding the window introduces significant challenges:

  • Increased cost and latency: The computational complexity of the transformer architecture's attention mechanism grows quadratically ($O(n^2)$) with sequence length, making larger contexts substantially more expensive and slower.
  • Performance degradation ("lost in the middle"): LLMs show strong recall for information at the very beginning or end of a long context window but suffer a significant drop in performance for information located in the middle.
  • Noise and distraction: A larger context window increases the likelihood of including irrelevant "noisy" information, which can distract the model and degrade the quality of the output. This is often called the "needle in a haystack" problem.

This paradox reinforces the need for intelligent curation rather than brute force, and it is what makes context engineering a fine craft.


Why context engineering matters for AI agents and applications

The primary challenge for any AI agent is completing its task correctly. The performance-cost-latency tradeoff is a secondary optimization that can only be addressed after the core problem of accuracy is solved. Context engineering addresses this hierarchy of needs in order:

Accuracy and reliability

The main driver for context engineering is ensuring an agent can successfully and reliably complete its task. Without accurate, relevant context and the correct tools, an agent will fail by hallucinating, selecting the wrong tool, or being unable to execute a multistep plan. This is the foundational problem that context engineering solves.

Quality of output

Output quality in context-engineered systems refers to how well the agent's responses align with user intent, factual accuracy, and task requirements, as distinct from mere fluency or coherence, which LLMs achieve naturally. High-quality output depends critically on high-quality input context; the "garbage in, garbage out" principle applies directly.

Context engineering improves output quality through several mechanisms:

  • Retrieval quality ensures the agent accesses accurate, relevant source material rather than hallucinating or relying on outdated training data.
  • Context structure affects how effectively the model can extract and synthesize information: well-chunked, semantically coherent context produces more accurate reasoning than fragmented snippets.
  • Signal-to-noise ratio matters: Including five highly relevant documents outperforms including those same five plus twenty marginally related ones, as irrelevant information distracts the model's attention.

Output quality also depends on instruction clarity in the system prompt and explicit formatting requirements (structured outputs like JSON reduce parsing errors). Measuring quality requires task-specific evaluation: factual accuracy for RAG systems, task completion rates for agents, user satisfaction scores for conversational systems. Context engineering enables systematic quality improvement by making the input-output relationship observable and tunable; you can measure which context combinations produce better outputs and optimize retrieval, ranking, and filtering accordingly.

The performance-cost-latency tradeoff

Every token in the context window carries cost: computational resources, API charges, and latency. Context engineering directly impacts all three:

  • Cost optimization: Reducing unnecessary tokens in prompts can lower API costs by orders of magnitude for high-volume applications.
  • Latency reduction: Smaller, focused contexts mean faster inference times and more responsive applications.
  • Quality improvement: Targeted, high-signal context consistently outperforms large, unfocused information dumps.

[Diagram: the context engineering performance triangle of context quality, cost, and latency]

Reliability and error recovery

Production AI systems must be resilient. Poor context engineering leads to several failure modes:

  • Context poisoning: When hallucinations or errors become embedded in the context and compound across subsequent interactions
  • Goal drift: When accumulating irrelevant information causes agents to lose track of their original objectives
  • Capacity overflow: When critical information gets truncated as the context window fills with lower-priority data

Good context engineering prevents these issues through validation, pruning, and structured memory management, treating context as a carefully curated resource rather than a passive accumulator of information.


Getting started with context engineering on Elasticsearch

Elasticsearch is an ideal platform for implementing context engineering because it unifies many of the required components into a single, cohesive system. It is a vector database, a search engine, a NoSQL document store, and more, all in one. This allows you to store all your data in one place and use the industry's most powerful query language to provide the most relevant context for any kind of question.

Elastic Agent Builder is available now as a technical preview. Start implementing context engineering with Elasticsearch: