Context Engineering for AI Agents 2026: The Complete Guide
If you have built an AI agent that works brilliantly in demos but collapses in production, the problem is almost certainly not the model. The model is fine. The problem is context — what you put into the model's context window, in what order, in what format, and what you leave out. This guide covers context engineering: the discipline that separates toy agents from production systems.
TL;DR
- Context engineering is the practice of deliberately designing everything that enters an LLM's context window: instructions, memory, tool results, and conversation state.
- A well-engineered context window can raise agent task-completion rates from ~30% to ~90% on the same underlying model.
- The four layers of context are: system instructions, memory (short- and long-term), tools, and current state.
- RAG (Retrieval Augmented Generation) is the standard pattern for injecting relevant external knowledge without overflowing the window.
- Multi-turn conversations need active management: summarization, sliding windows, and selective retention.
- Measure context quality with evals — not vibes.
1. What Is Context Engineering?
In 2025, Andrej Karpathy posted a widely circulated observation: "prompt engineering" was the wrong frame for what serious LLM practitioners were actually doing. The real work was not writing clever one-liners or magic words. It was the careful curation, compression, and structuring of everything that flows into the context window over the lifetime of an agent's task.
He called this context engineering: the discipline of designing the full information environment in which a model operates.
The distinction matters. Prompt engineering implies a fixed string you write once. Context engineering implies a dynamic system — one that selects information, manages memory, retrieves documents, invokes tools, compresses history, and updates state as the agent works. It is engineering in the same sense as database engineering or network engineering: a set of design decisions with measurable performance consequences.
The reason context engineering has displaced prompt engineering as the central skill in applied AI work is that modern agents are not single-turn question-answering systems. They run for many steps, call tools, observe results, make plans, recover from errors, and build up state. Every one of those steps produces output that feeds back into the next step's context. A context design that degrades gracefully under this pressure is the difference between an agent that finishes tasks and one that hallucinates its way into an error spiral.
2. The Context Window: What Fits, What Doesn't
Every LLM processes a fixed-length sequence of tokens. As of mid-2026, context windows range from 8K tokens (older or specialized models) to over 1M tokens (Gemini 1.5 Pro, Claude's extended context mode). Bigger windows sound better, but the economics and attention dynamics are more complicated.
Token costs and latency
Pricing for frontier models in 2026 is driven primarily by input tokens: a 200K-token context costs 25–50x more per call than an 8K-token context. For agents that make dozens or hundreds of LLM calls per task, this compounds fast. An agent that indiscriminately stuffs every document it has ever seen into the context window will be both slow and expensive.
The "lost in the middle" problem
Research (Liu et al., 2023; replicated repeatedly since) shows that transformer attention is strongest at the beginning and end of the context. Information buried in the middle of a very long context is statistically less likely to influence the output. This means a 500 K-token context window does not give you 500 K tokens of equally weighted working memory — it gives you something more like a scroll where the edges are high-fidelity and the center is lower-fidelity.
Practical sizing heuristics
| Content type | Rough token budget |
|---|---|
| System prompt | 500–2,000 |
| Retrieved documents (RAG) | 2,000–20,000 |
| Conversation history (compressed) | 1,000–5,000 |
| Tool schemas | 500–3,000 |
| Current task state | 500–2,000 |
| Available headroom for output | 2,000–8,000 |
Design for the model's practical sweet spot, not its advertised maximum.
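One way to make these heuristics operational is to give each content type a budget and trim it to fit before assembling the final context. The sketch below is illustrative only: the budget values mirror the table above and the characters-per-token estimate is a rough assumption; a real system would use the model's own tokenizer.
BUDGETS = {
    "system": 2_000,
    "retrieved_docs": 20_000,
    "history": 5_000,
    "tool_schemas": 3_000,
    "state": 2_000,
}

def estimate_tokens(text: str) -> int:
    # Rough assumption: about 4 characters per token.
    return len(text) // 4

def trim_to_budget(text: str, budget_tokens: int) -> str:
    """Naive tail truncation; production systems summarize instead of cutting."""
    max_chars = budget_tokens * 4
    return text if len(text) <= max_chars else text[:max_chars]

def assemble_context(sections: dict[str, str]) -> str:
    """Join budgeted sections in a fixed order, important material first and last."""
    return "\n\n".join(
        trim_to_budget(sections[name], budget)
        for name, budget in BUDGETS.items()
        if name in sections
    )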
3. The Four Context Layers
A well-designed agent context has four distinct layers, each serving a different purpose. Conflating them is the most common structural mistake.
Layer 1: Instructions
The stable definition of who the agent is, what it is for, what it must never do, and how it should format its output. This goes in the system prompt. It should change rarely — ideally never within a single session.
Layer 2: Memory
Information about the world and the ongoing task, drawn from both short-term (in-context) and long-term (external) storage. This is dynamic: it changes as the task progresses and as retrieval surfaces new facts.
Layer 3: Tools
Descriptions of the actions the agent can take: function signatures, parameter schemas, and usage guidance. The model uses these to decide when and how to invoke external capabilities.
Layer 4: State
The current snapshot of where the agent is in its task: the plan it is following, what it has already done, what it is about to do, and any intermediate results it needs to carry forward. State is ephemeral and task-specific.
Keeping these layers conceptually separate makes context design tractable. When something goes wrong, you can ask: "Is this a bad instruction? A memory gap? A missing tool? A state tracking failure?" and debug accordingly.
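One lightweight way to enforce that separation in code is to keep the layers as distinct fields and merge them only at call time. The sketch below is a minimal illustration; the field names and message layout are assumptions, not a standard schema.
from dataclasses import dataclass, field

@dataclass
class AgentContext:
    """Holds the four layers separately; they are combined only when building a call."""
    instructions: str                                  # Layer 1: stable system prompt
    memory: list[str] = field(default_factory=list)    # Layer 2: recalled facts, retrieved docs
    tools: list[dict] = field(default_factory=list)    # Layer 3: tool schemas (passed via the API's tools parameter)
    state: str = ""                                    # Layer 4: current task snapshot

    def to_messages(self, user_input: str) -> list[dict]:
        memory_block = "\n".join(f"- {fact}" for fact in self.memory)
        user_content = (
            f"## Memory\n{memory_block or '(none)'}\n\n"
            f"## Task state\n{self.state or '(fresh task)'}\n\n"
            f"## Request\n{user_input}"
        )
        return [
            {"role": "system", "content": self.instructions},
            {"role": "user", "content": user_content},
        ]
With the layers isolated like this, the debugging questions above map directly onto fields you can inspect.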
4. Writing Effective System Prompts
The system prompt is the most important text in your entire context. It runs on every single call. A vague system prompt is a tax you pay on every token the agent ever processes.
Be specific about the agent's role
Vague:
You are a helpful assistant that helps with coding tasks.
Specific:
You are a senior Python engineer working inside a CI/CD pipeline. Your
job is to analyze failing test output, identify the root cause, and
propose a minimal code fix. You have access to the repository filesystem
and a bash execution environment. You do not have internet access.
The specific version constrains the model's prior over what kinds of responses are appropriate. It narrows the output distribution toward what you actually want.
State constraints explicitly
Tell the model what it must not do. LLMs have wide prior distributions; you need to cut off the tails.
Constraints:
- Never modify files outside the /workspace directory.
- Never install packages with pip or apt without first showing the user
the install command and waiting for confirmation.
- If you are uncertain about the correct fix, say so and propose two
alternatives rather than guessing.
- Output only valid Python. Never output pseudocode as if it were
runnable.
Specify output format explicitly
Response format:
1. ROOT CAUSE: One sentence identifying the bug.
2. EXPLANATION: Two to four sentences explaining why the bug occurs.
3. FIX: The corrected code block, complete and runnable.
4. TEST: One pytest test that would catch this bug if added to the suite.
A structured output format has two benefits: it makes the response machine-parseable, and it nudges the model toward structured reasoning, since the format acts as an implicit chain of thought.
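The machine-parseable half of that claim is easy to demonstrate. A minimal parser for the format above, assuming the model follows the numbered headings exactly:
import re

def parse_structured_response(text: str) -> dict[str, str]:
    """Split a response into its ROOT CAUSE / EXPLANATION / FIX / TEST sections."""
    parts = re.split(r"\d+\.\s*(ROOT CAUSE|EXPLANATION|FIX|TEST):\s*", text)
    # re.split with a capturing group yields [preamble, label, body, label, body, ...]
    return {label: body.strip() for label, body in zip(parts[1::2], parts[2::2])}
If the parser returns fewer than four keys, the model drifted from the format, which is itself a useful eval signal.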
5. Short-Term vs Long-Term Memory
Agents need two kinds of memory, and mixing them up is a common source of both failures and unnecessary cost.
Short-term memory (in-context)
Everything inside the current context window. It is fast, precise, and perfectly consistent — the model sees exactly what you put there. It is also limited by the context window size and lost the moment the session ends. Use short-term memory for:
- The current conversation thread
- The current task's intermediate results
- Retrieved documents relevant to the current step
- The current tool call chain
Long-term memory (external storage)
Information persisted outside the model: databases, vector stores, key-value stores, file systems. It survives across sessions. It can scale to arbitrary size. It requires a retrieval mechanism to get information back into the context window. Use long-term memory for:
- User preferences and history
- Knowledge bases and documentation
- Previous task outputs that might be relevant again
- Learned facts about the user's codebase or domain
import json
from pathlib import Path
from datetime import datetime
class AgentMemory:
"""Simple file-backed long-term memory for an agent."""
def __init__(self, path: str = "agent_memory.json"):
self.path = Path(path)
self._store: dict = {}
if self.path.exists():
self._store = json.loads(self.path.read_text())
def remember(self, key: str, value: str) -> None:
self._store[key] = {
"value": value,
"timestamp": datetime.utcnow().isoformat(),
}
self.path.write_text(json.dumps(self._store, indent=2))
def recall(self, key: str) -> str | None:
entry = self._store.get(key)
return entry["value"] if entry else None
def recall_recent(self, n: int = 5) -> list[dict]:
sorted_entries = sorted(
self._store.items(),
key=lambda x: x[1]["timestamp"],
reverse=True,
)
return [{"key": k, **v} for k, v in sorted_entries[:n]]
For production systems, replace the file backend with Redis, a vector database, or a dedicated memory service.
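As a rough sketch of that swap, the same remember/recall interface backed by Redis might look like the following (the host, port, and key prefix are placeholder assumptions; requires the redis package):
import json
from datetime import datetime, timezone

import redis  # pip install redis

class RedisAgentMemory:
    """Same interface as AgentMemory, but persisted in Redis instead of a local file."""
    def __init__(self, host: str = "localhost", port: int = 6379, prefix: str = "agent:"):
        self.client = redis.Redis(host=host, port=port, decode_responses=True)
        self.prefix = prefix

    def remember(self, key: str, value: str) -> None:
        entry = {"value": value, "timestamp": datetime.now(timezone.utc).isoformat()}
        self.client.set(self.prefix + key, json.dumps(entry))

    def recall(self, key: str) -> str | None:
        raw = self.client.get(self.prefix + key)
        return json.loads(raw)["value"] if raw else None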
6. RAG: Retrieval Augmented Generation
RAG is the standard architecture for giving an agent access to a large knowledge base without blowing the context window. The pattern is:
- At indexing time: chunk documents, embed each chunk, store embeddings in a vector database.
- At query time: embed the current query, find the most similar chunks, inject them into the context.
Why RAG instead of a huge context window?
Even with 1 M token context windows, RAG remains valuable for three reasons:
- Cost: Retrieving 10 relevant chunks costs a fraction of loading 10 000 chunks.
- Precision: Retrieval surfaces the most relevant information; stuffing everything in dilutes signal.
- Freshness: Vector indexes can be updated continuously; a long context prompt cannot.
RAG implementation with LlamaIndex
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.anthropic import Anthropic
from llama_index.embeddings.openai import OpenAIEmbedding
# Configure models
Settings.llm = Anthropic(model="claude-sonnet-4-5")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
# Index a directory of documents
documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents)
# Create a query engine
query_engine = index.as_query_engine(similarity_top_k=5)
# Query — retrieval and generation happen automatically
response = query_engine.query(
"What are the rate limits for the payments API?"
)
print(response)
RAG implementation with LangChain
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader
from langchain.chains import RetrievalQA
# Load and chunk documents
loader = DirectoryLoader("./docs", glob="**/*.md")
documents = loader.load()
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
)
chunks = splitter.split_documents(documents)
# Build vector store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(chunks, embeddings)
# Build RAG chain
llm = ChatOpenAI(model="gpt-4o", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
return_source_documents=True,
)
result = qa_chain.invoke({"query": "How do I handle webhook retries?"})
print(result["result"])
print("Sources:", [d.metadata["source"] for d in result["source_documents"]])
Chunking strategy matters
The most common RAG failure is poor chunking. A chunk that cuts a code example in half, splits a table from its header, or severs a paragraph from the sentence that gives it context may still be retrieved, but it arrives too mangled to be useful. Use the following, combined in the sketch after this list:
- Semantic chunking: split on sentence boundaries, paragraph breaks, and section headers rather than fixed character counts.
- Overlap: include 10–20% overlap between adjacent chunks so that context near boundaries is not lost.
- Metadata: attach the document title, section heading, and URL to every chunk so the model knows where the information came from.
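A minimal paragraph-aware chunker that combines the three ideas; the size limit, overlap amount, and header detection rule below are illustrative assumptions, not tuned values.
def chunk_markdown(text: str, doc_title: str, max_chars: int = 2000, overlap_paragraphs: int = 1) -> list[dict]:
    """Split on paragraph boundaries, carry overlap between chunks, attach metadata."""
    chunks: list[dict] = []
    heading = ""
    buffer: list[str] = []

    def flush() -> None:
        if buffer:
            chunks.append({
                "text": "\n\n".join(buffer),
                "metadata": {"title": doc_title, "section": heading},
            })

    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        if para.startswith("#"):  # markdown headers mark section boundaries
            heading = para.lstrip("# ").strip()
        if buffer and sum(len(p) for p in buffer) + len(para) > max_chars:
            flush()
            buffer = buffer[-overlap_paragraphs:]  # overlap with the previous chunk
        buffer.append(para)
    flush()
    return chunks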
7. Tool Calling: Giving Agents Verifiable Actions
A tool is a function the model can invoke. The key property of a tool call is that its result is verifiable — you can check whether the file was created, whether the API returned 200, whether the test passed. This is what separates tool-using agents from pure text generators.
Defining tools clearly
import anthropic
import subprocess
client = anthropic.Anthropic()
tools = [
{
"name": "run_bash",
"description": (
"Execute a bash command in the workspace and return stdout and stderr. "
"Use for running tests, checking file contents, and inspecting the environment. "
"Commands time out after 30 seconds. Do not use for long-running processes."
),
"input_schema": {
"type": "object",
"properties": {
"command": {
"type": "string",
"description": "The bash command to execute.",
}
},
"required": ["command"],
},
},
{
"name": "write_file",
"description": (
"Write content to a file in the workspace. "
"Creates the file if it does not exist; overwrites if it does. "
"Always use absolute paths within /workspace."
),
"input_schema": {
"type": "object",
"properties": {
"path": {
"type": "string",
"description": "Absolute path to the file.",
},
"content": {
"type": "string",
"description": "The full content to write.",
},
},
"required": ["path", "content"],
},
},
]
def run_bash(command: str) -> str:
    try:
        result = subprocess.run(
            command,
            shell=True,
            capture_output=True,
            text=True,
            timeout=30,
            cwd="/workspace",
        )
    except subprocess.TimeoutExpired:
        # Return a readable error instead of crashing the agent loop.
        return "Error: command timed out after 30 seconds."
    return f"stdout:\n{result.stdout}\nstderr:\n{result.stderr}\nreturncode: {result.returncode}"
def write_file(path: str, content: str) -> str:
from pathlib import Path
p = Path(path)
p.parent.mkdir(parents=True, exist_ok=True)
p.write_text(content)
return f"Written {len(content)} bytes to {path}"
Tool description quality
A tool description is part of your context. A vague description produces vague tool usage. Each description should answer:
- What does this tool do?
- When should the agent use it (vs alternatives)?
- What are its limits and failure modes?
- What does the return value look like?
8. Multi-Turn Conversation Management
A naive agent appends every message to the history and eventually runs out of context window. There are three standard strategies for managing growing conversations.
Strategy 1: Sliding window
Keep only the last N turns. Fast and simple. Loses older context completely.
def sliding_window(messages: list[dict], max_turns: int = 20) -> list[dict]:
"""Keep system message plus the last max_turns message pairs."""
system = [m for m in messages if m["role"] == "system"]
non_system = [m for m in messages if m["role"] != "system"]
return system + non_system[-max_turns * 2:]
Strategy 2: Summarization
When the conversation exceeds a token threshold, summarize older turns into a compact summary, then continue with the summary plus recent history.
import anthropic
client = anthropic.Anthropic()
def summarize_history(messages: list[dict], keep_last_n: int = 6) -> list[dict]:
"""Summarize all but the last keep_last_n messages into a single summary."""
if len(messages) <= keep_last_n + 1: # +1 for system
return messages
system = [m for m in messages if m["role"] == "system"]
non_system = [m for m in messages if m["role"] != "system"]
to_summarize = non_system[:-keep_last_n]
to_keep = non_system[-keep_last_n:]
history_text = "\n".join(
f"{m['role'].upper()}: {m['content']}" for m in to_summarize
)
summary_response = client.messages.create(
model="claude-haiku-4-5",
max_tokens=500,
messages=[
{
"role": "user",
"content": (
f"Summarize this conversation history concisely, "
f"preserving all decisions made and facts established:\n\n{history_text}"
),
}
],
)
summary_message = {
"role": "user",
"content": f"[Previous conversation summary]: {summary_response.content[0].text}",
}
return system + [summary_message] + to_keep
def count_tokens_approx(messages: list[dict]) -> int:
"""Rough token estimate: 1 token ≈ 4 characters."""
total_chars = sum(len(str(m.get("content", ""))) for m in messages)
return total_chars // 4
def manage_context(
messages: list[dict],
token_budget: int = 50_000,
) -> list[dict]:
while count_tokens_approx(messages) > token_budget:
messages = summarize_history(messages)
if len(messages) <= 3:
break
return messages
Strategy 3: Selective retention
Keep only messages that contain "important" information (decisions, discovered facts, errors encountered). Requires tagging messages as they are created.
def tag_message(message: dict, important: bool = False) -> dict:
return {**message, "important": important}
def selective_retain(messages: list[dict], keep_recent: int = 10) -> list[dict]:
    system = [m for m in messages if m.get("role") == "system"]
    important = [m for m in messages if m.get("important") and m.get("role") != "system"]
    recent = [m for m in messages if not m.get("important") and m.get("role") != "system"][-keep_recent:]
    return system + important + recent
9. Context Compression Techniques
When you have more information than fits — retrieved documents, long tool outputs, verbose histories — compression is the tool.
Map-reduce summarization
Process each chunk independently (map), then combine the summaries (reduce). Useful for long documents that exceed a single LLM call.
import anthropic
from typing import Generator
client = anthropic.Anthropic()
def chunk_text(text: str, chunk_size: int = 3000) -> Generator[str, None, None]:
"""Yield successive chunks of approximately chunk_size characters."""
words = text.split()
current_chunk: list[str] = []
current_length = 0
for word in words:
current_chunk.append(word)
current_length += len(word) + 1
if current_length >= chunk_size:
yield " ".join(current_chunk)
current_chunk = []
current_length = 0
if current_chunk:
yield " ".join(current_chunk)
def map_summarize(chunk: str, focus: str) -> str:
response = client.messages.create(
model="claude-haiku-4-5",
max_tokens=300,
messages=[
{
"role": "user",
"content": (
f"Summarize this text, focusing on information relevant to: {focus}\n\n{chunk}"
),
}
],
)
return response.content[0].text
def reduce_summaries(summaries: list[str], focus: str) -> str:
combined = "\n\n---\n\n".join(summaries)
response = client.messages.create(
model="claude-haiku-4-5",
max_tokens=600,
messages=[
{
"role": "user",
"content": (
f"Combine these summaries into one coherent summary. "
f"Focus on: {focus}\n\n{combined}"
),
}
],
)
return response.content[0].text
def compress_document(text: str, focus: str) -> str:
chunks = list(chunk_text(text))
if len(chunks) == 1:
return map_summarize(chunks[0], focus)
summaries = [map_summarize(chunk, focus) for chunk in chunks]
return reduce_summaries(summaries, focus)
Hierarchical summarization
Build a tree of summaries: summarize sections, then summarize the section summaries. Preserves more structure than flat map-reduce.
def hierarchical_summarize(sections: list[str], focus: str, max_depth: int = 2) -> str:
"""Recursively summarize sections until a single summary remains."""
if len(sections) == 1:
return sections[0]
if max_depth == 0:
return reduce_summaries(sections, focus)
# Summarize pairs of sections
paired: list[str] = []
for i in range(0, len(sections), 2):
if i + 1 < len(sections):
pair_summary = reduce_summaries([sections[i], sections[i + 1]], focus)
paired.append(pair_summary)
else:
paired.append(sections[i])
return hierarchical_summarize(paired, focus, max_depth - 1)
10. Measuring Context Quality: Evals
You cannot improve what you do not measure. Context engineering without evals is guesswork.
The core metric: task success rate
For any agent, define a binary or graded success criterion. Run the agent against a test suite of tasks. Track the pass rate across context design iterations.
from dataclasses import dataclass
from typing import Callable
@dataclass
class EvalCase:
task_id: str
input: str
expected_output: str | None = None
grader: Callable[[str, str | None], float] | None = None # returns 0.0–1.0
@dataclass
class EvalResult:
task_id: str
agent_output: str
score: float
passed: bool
def exact_match_grader(output: str, expected: str | None) -> float:
if expected is None:
return 0.0
return 1.0 if output.strip() == expected.strip() else 0.0
def run_eval(
agent_fn: Callable[[str], str],
cases: list[EvalCase],
threshold: float = 0.7,
) -> dict:
results: list[EvalResult] = []
for case in cases:
try:
output = agent_fn(case.input)
grader = case.grader or exact_match_grader
score = grader(output, case.expected_output)
except Exception as e:
output = f"ERROR: {e}"
score = 0.0
results.append(
EvalResult(
task_id=case.task_id,
agent_output=output,
score=score,
passed=score >= threshold,
)
)
pass_rate = sum(r.passed for r in results) / len(results) if results else 0.0
avg_score = sum(r.score for r in results) / len(results) if results else 0.0
return {
"pass_rate": pass_rate,
"avg_score": avg_score,
"total": len(results),
"passed": sum(r.passed for r in results),
"failed": sum(not r.passed for r in results),
"results": [vars(r) for r in results],
}
What to measure beyond pass rate
- Context utilization: Are retrieved documents actually cited in answers? High retrieval, low citation = retrieval is off-target. (A rough check is sketched after this list.)
- Tool call accuracy: Does the agent call the right tool with the right parameters?
- Failure mode distribution: Does the agent fail by hallucinating, by refusing, by tool error, or by logic error? Each has a different fix.
- Context window usage: What fraction of the budget is consumed? Consistently near 100% means you need compression.
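A crude proxy for the first of these metrics, context utilization, is to check how many retrieved chunks leave a lexical trace in the answer. The word-overlap threshold below is an assumption; a more reliable approach is to ask a grader model.
def context_utilization(answer: str, retrieved_chunks: list[str], min_overlap: int = 5) -> float:
    """Fraction of retrieved chunks sharing at least min_overlap distinctive words with the answer."""
    answer_words = set(answer.lower().split())
    used = 0
    for chunk in retrieved_chunks:
        distinctive = {w for w in chunk.lower().split() if len(w) > 4}  # skip short, common words
        if len(distinctive & answer_words) >= min_overlap:
            used += 1
    return used / len(retrieved_chunks) if retrieved_chunks else 0.0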
11. Common Failure Modes and Fixes
Failure: The agent ignores retrieved documents
Symptom: RAG retrieves relevant chunks but the model answers from prior knowledge instead.
Fix: Explicitly instruct the model to ground its answer in the provided context. Add to the system prompt: "Base your answer only on the documents provided in the context. If the answer is not in the documents, say so."
Failure: The agent gets confused mid-task
Symptom: The agent starts a task correctly but loses track of what it has done and starts repeating steps or contradicting itself.
Fix: Add an explicit state block to the context at each step. Before each LLM call, inject a structured summary of completed steps, current goal, and remaining steps.
def build_state_block(completed: list[str], current_goal: str, remaining: list[str]) -> str:
completed_str = "\n".join(f" - [DONE] {s}" for s in completed)
remaining_str = "\n".join(f" - [ ] {s}" for s in remaining)
return (
f"## Current Task State\n"
f"Completed:\n{completed_str or ' (none yet)'}\n\n"
f"Current goal: {current_goal}\n\n"
f"Remaining:\n{remaining_str or ' (none)'}\n"
)
Failure: Tool calls with wrong parameters
Symptom: The model calls tools with missing or incorrectly typed arguments, causing repeated failures.
Fix: Improve tool descriptions (see section 7). Add parameter-level validation that returns a helpful error message rather than a stack trace. The model can often self-correct when given a clear error.
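A minimal validator along those lines, checking tool inputs against the JSON schema and returning a model-readable message instead of raising; the wording and type map are illustrative choices.
def validate_tool_input(schema: dict, inputs: dict) -> str | None:
    """Return an error string the model can act on, or None if the inputs look valid."""
    props = schema.get("properties", {})
    required = schema.get("required", [])
    type_map = {"string": str, "integer": int, "number": (int, float), "boolean": bool, "object": dict}

    missing = [name for name in required if name not in inputs]
    if missing:
        return f"Error: missing required parameter(s): {', '.join(missing)}. Expected parameters: {list(props)}."
    for name, value in inputs.items():
        if name not in props:
            return f"Error: unknown parameter '{name}'. Expected parameters: {list(props)}."
        expected = type_map.get(props[name].get("type"))
        if expected and not isinstance(value, expected):
            return f"Error: parameter '{name}' should be of type {props[name]['type']}."
    return None
In the tool dispatcher, return this string as the tool result whenever it is not None; the model usually corrects itself on the next turn.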
Failure: Context window overflow
Symptom: API errors about exceeding max tokens, or truncated responses.
Fix: Implement proactive context management. Monitor token usage before each call and compress or summarize before hitting the limit, not after.
def estimate_tokens(text: str) -> int:
# Rough estimate; use tiktoken or the model's tokenizer for precision
return len(text) // 4
def safe_context(messages: list[dict], max_tokens: int = 180_000) -> list[dict]:
total = sum(estimate_tokens(str(m.get("content", ""))) for m in messages)
while total > max_tokens * 0.85 and len(messages) > 2:
# Drop the oldest non-system message
for i, m in enumerate(messages):
if m.get("role") != "system":
messages.pop(i)
break
total = sum(estimate_tokens(str(m.get("content", ""))) for m in messages)
return messages
Failure: Instruction conflict
Symptom: The model behaves inconsistently because the system prompt says one thing and a retrieved document says another.
Fix: Establish a clear precedence order in the system prompt. "In case of conflict between these instructions and retrieved documents, these instructions take precedence."
12. Real Example: Building a Coding Agent with Proper Context Design
Here is a minimal but production-honest coding agent that applies the principles from this guide.
import anthropic
import subprocess
from pathlib import Path
client = anthropic.Anthropic()
SYSTEM_PROMPT = """You are a Python coding agent working inside a sandboxed workspace at /workspace.
Your capabilities:
- Read files in /workspace using the read_file tool.
- Write or modify files using the write_file tool.
- Execute bash commands using the run_bash tool (30-second timeout, no internet access).
Your process for any coding task:
1. Read relevant existing files before making changes.
2. Make changes incrementally and verify each step by running tests.
3. Report what you did, what changed, and the test results.
Constraints:
- Only modify files inside /workspace.
- Never run commands that take more than 30 seconds.
- If a test fails, analyze the failure before trying another fix.
- If you are uncertain, explain your uncertainty rather than guessing.
Output format for task completion:
SUMMARY: One sentence describing what was done.
CHANGES: Bulleted list of files modified and what changed.
VERIFICATION: Output of the test run confirming success.
"""
TOOLS = [
{
"name": "read_file",
"description": "Read a file from /workspace. Returns the file contents as a string.",
"input_schema": {
"type": "object",
"properties": {
"path": {"type": "string", "description": "Absolute path inside /workspace."}
},
"required": ["path"],
},
},
{
"name": "write_file",
"description": "Write content to a file. Creates parent directories as needed.",
"input_schema": {
"type": "object",
"properties": {
"path": {"type": "string"},
"content": {"type": "string"},
},
"required": ["path", "content"],
},
},
{
"name": "run_bash",
"description": "Run a bash command. Returns stdout, stderr, and exit code.",
"input_schema": {
"type": "object",
"properties": {
"command": {"type": "string"}
},
"required": ["command"],
},
},
]
def handle_tool(name: str, inputs: dict) -> str:
if name == "read_file":
p = Path(inputs["path"])
if not p.exists():
return f"Error: {inputs['path']} does not exist."
return p.read_text()
if name == "write_file":
p = Path(inputs["path"])
p.parent.mkdir(parents=True, exist_ok=True)
p.write_text(inputs["content"])
return f"Wrote {len(inputs['content'])} bytes to {inputs['path']}."
if name == "run_bash":
try:
result = subprocess.run(
inputs["command"],
shell=True,
capture_output=True,
text=True,
timeout=30,
cwd="/workspace",
)
return (
f"stdout:\n{result.stdout}\n"
f"stderr:\n{result.stderr}\n"
f"returncode: {result.returncode}"
)
except subprocess.TimeoutExpired:
return "Error: command timed out after 30 seconds."
return f"Error: unknown tool '{name}'."
def run_coding_agent(task: str, max_iterations: int = 20) -> str:
messages: list[dict] = [{"role": "user", "content": task}]
completed_steps: list[str] = []
    for iteration in range(max_iterations):
        # On the first call, prepend an explicit state block to the task message.
        # On later calls, progress is carried forward through the accumulated
        # message history (assistant turns and tool results).
        if iteration == 0:
            state_note = build_state_block(
                completed=completed_steps,
                current_goal=task,
                remaining=[],
            )
            messages[0]["content"] = f"{state_note}\n\nTask: {task}"
response = client.messages.create(
model="claude-sonnet-4-5",
max_tokens=4096,
system=SYSTEM_PROMPT,
tools=TOOLS,
messages=messages,
)
# Append assistant response
messages.append({"role": "assistant", "content": response.content})
# Check for completion
if response.stop_reason == "end_turn":
final_text = next(
(b.text for b in response.content if hasattr(b, "text")), ""
)
return final_text
# Handle tool calls
if response.stop_reason == "tool_use":
tool_results = []
for block in response.content:
if block.type == "tool_use":
result = handle_tool(block.name, block.input)
completed_steps.append(f"{block.name}({list(block.input.keys())})")
tool_results.append({
"type": "tool_result",
"tool_use_id": block.id,
"content": result,
})
messages.append({"role": "user", "content": tool_results})
return "Max iterations reached without completion."
def build_state_block(completed: list[str], current_goal: str, remaining: list[str]) -> str:
completed_str = "\n".join(f" - [DONE] {s}" for s in completed)
remaining_str = "\n".join(f" - [ ] {s}" for s in remaining)
return (
f"## Current Task State\n"
f"Completed:\n{completed_str or ' (none yet)'}\n\n"
f"Current goal: {current_goal}\n\n"
f"Remaining:\n{remaining_str or ' (none)'}\n"
)
# Usage
if __name__ == "__main__":
result = run_coding_agent(
"Write a Python function called `fibonacci(n)` that returns the "
"nth Fibonacci number using memoization. Save it to /workspace/fib.py "
"and add a pytest test file at /workspace/test_fib.py. Run the tests."
)
print(result)
This agent demonstrates: a specific system prompt with explicit constraints, clear tool descriptions, state tracking across iterations, and tool result injection into the conversation history.
FAQ
Q: Is context engineering the same as prompt engineering?
No. Prompt engineering typically refers to crafting a single prompt for a single call. Context engineering covers the full dynamic system: how instructions, memory, retrieved knowledge, tool results, and conversation state are curated, compressed, and composed across the lifetime of an agent task. Prompt engineering is a subset.
Q: Do I still need RAG with 1 M token context windows?
Yes, for most production systems. Cost, latency, and the lost-in-the-middle attention degradation all make selective retrieval preferable to loading everything. RAG also allows the knowledge base to be updated without changing the prompt.
Q: How many retrieved chunks should I inject?
Start with 3–7 chunks. More is not always better — irrelevant chunks add noise. Measure retrieval precision with your eval suite and tune similarity_top_k based on task performance, not intuition.
Q: How do I know if my system prompt is good enough?
Run an eval suite. A good system prompt consistently produces outputs that meet your success criteria. If pass rate is below ~70% on a well-defined task set, the system prompt is likely too vague, missing key constraints, or conflicting with retrieved context.
Q: Should the agent manage its own context?
No. The agent (the LLM) should not be responsible for deciding what to remember or forget — it cannot see its own context limits, and asking it to manage memory adds latency and failure modes. Context management should be handled by the orchestration layer (your Python code) before each LLM call.
Q: What is the most common context engineering mistake?
Vague system prompts combined with no state tracking. This produces agents that hallucinate their task progress and loop indefinitely. Fix with specificity in instructions and an explicit state block injected at each step.
Q: How do I handle very long tool outputs (e.g., a full test suite log)?
Truncate and summarize before injecting. Keep the first and last N lines (where error information typically appears) and summarize the middle. Never inject raw, unbounded tool output into the context.
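A head-and-tail truncation helper along those lines (the line counts and elision marker are arbitrary choices):
def truncate_tool_output(output: str, head_lines: int = 30, tail_lines: int = 30) -> str:
    """Keep the first and last lines of a long tool output and mark what was cut."""
    lines = output.splitlines()
    if len(lines) <= head_lines + tail_lines:
        return output
    omitted = len(lines) - head_lines - tail_lines
    return "\n".join(
        lines[:head_lines]
        + [f"... [{omitted} lines omitted; summarize separately if needed] ..."]
        + lines[-tail_lines:]
    )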
Sources
- Karpathy, Andrej. "Context Engineering." X (formerly Twitter), 2025. (Widely cited observation that context engineering supersedes prompt engineering as the central skill in applied LLM work.)
- Liu, Nelson F. et al. "Lost in the Middle: How Language Models Use Long Contexts." Transactions of the Association for Computational Linguistics, 2023.
- Lewis, Patrick et al. "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." NeurIPS, 2020.
- Anthropic. "Claude API Documentation: Tool Use." docs.anthropic.com, 2026.
- LlamaIndex documentation. llamaindex.ai, 2026.
- LangChain documentation. python.langchain.com, 2026.
- OpenAI. "Prompt Engineering Guide." platform.openai.com/docs/guides/prompt-engineering, 2026.
- Shinn, Noah et al. "Reflexion: Language Agents with Verbal Reinforcement Learning." NeurIPS, 2023. (Foundational work on agent self-correction loops.)
- Yao, Shunyu et al. "ReAct: Synergizing Reasoning and Acting in Language Models." ICLR, 2023. (Foundational work on tool-using agent architecture.)
- Wang, Xuezhi et al. "Self-Consistency Improves Chain of Thought Reasoning in Language Models." ICLR, 2023.