Context Engineering Is an Infrastructure Problem, Not a Prompting Problem
Andrej Karpathy posted in mid-2025 that “context engineering” was replacing prompt engineering as the core skill in AI development. Simon Willison amplified it. Within two weeks the term had crossed from research into mainstream developer vocabulary.
The framing is correct. What most explanations of it miss is the infrastructure consequence.
What Prompt Engineering Was
Prompt engineering was a negotiation with the model. You adjusted the wording, the structure, the examples, the chain-of-thought scaffolding until the output matched what you wanted. The model was fixed. The prompt was the variable.
It worked well when the task was well-defined and the context was small enough to fit in a single, manually crafted exchange. For demos, for one-shot tasks, for assistants with short memory windows — prompt engineering was sufficient.
It is not sufficient for autonomous agents operating continuously over weeks on production data.
What Context Engineering Actually Is
Context engineering does not ask what words to use. It asks: what does the model need to know at the moment of inference, and how do I ensure that knowledge is accurate, current, and structured so the model can reason with it effectively?
The context window is not a text box. It is a dynamic system with token economics, attention mechanics, retrieval budgets, and decay characteristics. Engineering that system is a different discipline from writing prompts.
The concrete problems context engineering must solve:
Token economics at scale. At enterprise scale, injecting full conversation history on every call is financially prohibitive. The naive approach — fill the window with everything relevant — fails. The LOCOMO benchmark quantifies this precisely: full-context injection achieves 72.9% accuracy at 10–17 second median latency. Graph-enhanced selective memory achieves 89.9% accuracy at under 2.6 seconds p95, at 90% lower token cost. More context is not better context.
The lost-in-the-middle phenomenon. Frontier models in 2026 have context windows exceeding one million tokens, yet research consistently shows that LLMs fail to recall, weight, and reason over information buried in the middle of large context payloads. The information is present, but the model ignores it. Selective injection of the right information outperforms injecting all of it.
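A minimal sketch of selective injection under a token budget, to make the idea concrete. Everything here is an assumption for illustration: the `Memory` type, the `relevance` score (imagined as coming from some retriever), and the word-count token estimate are all hypothetical stand-ins, not a real tokenizer or a real memory API.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    relevance: float  # hypothetical retriever score, 0.0 to 1.0


def estimate_tokens(text: str) -> int:
    # Crude assumption: one token per whitespace-separated word.
    return len(text.split())


def select_for_budget(memories: list[Memory], budget: int) -> list[Memory]:
    """Greedily keep the most relevant memories that fit in the token budget."""
    chosen, used = [], 0
    for m in sorted(memories, key=lambda m: m.relevance, reverse=True):
        cost = estimate_tokens(m.text)
        if used + cost <= budget:
            chosen.append(m)
            used += cost
    return chosen


memories = [
    Memory("Vendor contract renewed at new rate in March", 0.92),
    Memory("Team lunch moved to Thursday", 0.05),
    Memory("Old vendor rate applied through February", 0.40),
]
selected = select_for_budget(memories, budget=10)
```

With a budget of 10 estimated tokens, only the highest-relevance record fits. The point of the sketch is that the budget, not the window size, decides what the model sees.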
Context rot. Long-running agents accumulate contradictions. A superseded policy stays in the memory graph alongside the updated one. A changed vendor contract sits next to the old terms. When both enter the context window, the model's output becomes stochastic, because it cannot deterministically resolve the contradiction. Research in 2026 formalized this as a survival equation: reasoning accuracy decays exponentially with the volume of accumulated contradictions. An agent that has been running for months without active contradiction resolution becomes progressively and measurably less reliable.
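One way to keep a superseded fact from sitting next to its replacement is to resolve the conflict at write time. The sketch below assumes facts can be keyed by a (subject, predicate) pair and carry a timestamp. That is a strong simplification, since real contradictions are rarely this easy to spot, and the `Fact` and `FactStore` names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Fact:
    subject: str
    predicate: str
    value: str
    recorded_at: int  # assumed: any increasing clock, e.g. a Unix timestamp


@dataclass
class FactStore:
    current: dict = field(default_factory=dict)  # (subject, predicate) -> Fact
    history: list = field(default_factory=list)  # superseded facts, kept for audit

    def ingest(self, fact: Fact) -> bool:
        """Store a fact; return True if it contradicted an existing one."""
        key = (fact.subject, fact.predicate)
        existing = self.current.get(key)
        if existing is None:
            self.current[key] = fact
            return False
        if existing.value == fact.value:
            return False  # same claim, nothing to resolve
        # Contradiction: the newer record wins, the older one moves to history.
        if fact.recorded_at >= existing.recorded_at:
            newer, older = fact, existing
        else:
            newer, older = existing, fact
        self.current[key] = newer
        self.history.append(older)
        return True

    def context_facts(self) -> list:
        """Only current facts are eligible for the context window."""
        return list(self.current.values())


store = FactStore()
store.ingest(Fact("vendor_contract", "payment_terms", "net 30", 100))
conflict = store.ingest(Fact("vendor_contract", "payment_terms", "net 60", 200))
```

After the second write, only the newer terms are eligible for injection, while the old terms remain in `history` rather than being deleted.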
Relational blindness in retrieval. Vector RAG retrieves semantically similar text. The query “why did our operating expenditure decrease?” does not semantically resemble the stored record “we switched infrastructure from AWS to Azure because of compute costs” — but the connection is the answer. Semantic similarity cannot navigate causal relationships. Graph memory can.
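The difference is easier to see with a toy graph. The sketch below assumes memories are stored as cause-to-effect edges and answers a "why" question by walking those edges backwards. The node labels, including the intermediate "monthly cloud bill dropped" step, are invented for illustration, and a plain dictionary stands in for a real graph store.

```python
from collections import deque

# Hypothetical memory graph: each entry reads "cause -> effects".
CAUSES = {
    "compute costs too high on AWS": ["switched infrastructure from AWS to Azure"],
    "switched infrastructure from AWS to Azure": ["monthly cloud bill dropped"],
    "monthly cloud bill dropped": ["operating expenditure decreased"],
}


def explain(effect: str, causes: dict[str, list[str]]) -> list[str]:
    """Walk cause edges backwards from an effect, nearest cause first."""
    # Invert the graph so we can step from an effect to its causes.
    reverse: dict[str, list[str]] = {}
    for cause, effects in causes.items():
        for e in effects:
            reverse.setdefault(e, []).append(cause)

    found, seen, queue = [], {effect}, deque([effect])
    while queue:
        node = queue.popleft()
        for cause in reverse.get(node, []):
            if cause not in seen:
                seen.add(cause)
                found.append(cause)
                queue.append(cause)
    return found


chain = explain("operating expenditure decreased", CAUSES)
```

The traversal reaches the AWS-to-Azure record in two hops even though it shares almost no vocabulary with the question, which is the connection a similarity search would tend to miss.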
Where the Infrastructure Boundary Is
Every technique that addresses these problems operates below the prompt layer:
- Graph memory builds an explicit representation of the relationships between entities, which vector databases cannot replicate
- Contradiction detection flags conflicting facts during ingestion, before they reach the context window, rather than at inference
- Temporal decay weights recent information higher than stale information on explicit decay curves maintained in the memory layer
- Pre-computation prepares the context asynchronously so inference starts from a validated state rather than raw retrieval
None of these are prompt decisions. You cannot write a system prompt that gives the model a graph memory it does not have. You cannot instruct the model to ignore contradictions it will inevitably encounter when both versions of a fact are present. You cannot achieve temporal decay by telling the model to “prioritize recent information” — it does not know which information is recent without temporal metadata attached at the infrastructure layer.
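A small sketch of what that temporal metadata buys. It assumes each fact carries a timestamp and a relevance score, and applies an exponential half-life curve. The 30-day half-life, the function names, and the example date are arbitrary choices for illustration, not values taken from any particular system.

```python
from datetime import datetime, timedelta, timezone


def decay_weight(recorded_at: datetime, now: datetime, half_life_days: float = 30.0) -> float:
    """Exponential decay: the weight halves every half_life_days."""
    age_days = max((now - recorded_at).total_seconds() / 86400.0, 0.0)
    return 0.5 ** (age_days / half_life_days)


def rank(facts: list[tuple[str, float, datetime]], now: datetime) -> list[str]:
    """Order facts by relevance multiplied by recency weight, highest first."""
    scored = [(rel * decay_weight(ts, now), text) for text, rel, ts in facts]
    return [text for _, text in sorted(scored, reverse=True)]


now = datetime(2026, 6, 1, tzinfo=timezone.utc)  # arbitrary example date
facts = [
    ("Old vendor terms: net 30", 0.9, now - timedelta(days=90)),
    ("New vendor terms: net 60", 0.8, now - timedelta(days=3)),
]
ordered = rank(facts, now)
```

The older record has the higher raw relevance, but after three half-lives its weight falls well below the recent one, so the new terms rank first. None of this is possible if the timestamp was never stored.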
This is the infrastructure boundary: the decisions that must be made before the model is invoked, in the systems that prepare what the model sees.
The Pre-Computation Shift
The most consequential context engineering insight from 2026 production deployments is the move to pre-computation.
Instead of preparing context at inference time — retrieving, filtering, compressing, deduplicating on every call — the preparation happens asynchronously during idle periods. Background processes scan the memory graph, resolve contradictions, apply decay, compress historical context into semantic summaries, and maintain a pre-validated state ready for injection.
By the time an agent is invoked for a live task, the context engineering work is already done. The agent receives a compressed, accurate, contradiction-free payload. Inference is fast. The quality of the context does not degrade over time.
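The shape of that pattern can be sketched in a few lines. This is an illustration under stated assumptions, not a description of any production system: `ContextCache`, `Snapshot`, and the newest-value-wins rule are hypothetical, and the background step is called directly here where a real deployment would run it from a scheduler or worker during idle periods.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    version: int
    facts: tuple  # validated facts, ready to inject


class ContextCache:
    """Holds the latest pre-validated context for the inference path."""

    def __init__(self) -> None:
        self._snapshot = Snapshot(version=0, facts=())

    def consolidate(self, raw_facts: list[tuple[str, str, int]]) -> None:
        """Background step: keep the newest value per key, then publish."""
        latest: dict[str, tuple[str, int]] = {}
        for key, value, recorded_at in raw_facts:
            if key not in latest or recorded_at > latest[key][1]:
                latest[key] = (value, recorded_at)
        facts = tuple(f"{k}: {v}" for k, (v, _) in sorted(latest.items()))
        # A single assignment swaps in the new snapshot.
        self._snapshot = Snapshot(self._snapshot.version + 1, facts)

    def context_for_inference(self) -> Snapshot:
        """Hot path: no retrieval or filtering, just read the prepared state."""
        return self._snapshot


cache = ContextCache()
cache.consolidate([
    ("payment_terms", "net 30", 100),
    ("payment_terms", "net 60", 200),
    ("cloud_provider", "Azure", 150),
])
snap = cache.context_for_inference()
```

The design choice worth noticing is the split between the two methods: all the expensive work lives in `consolidate`, and the inference path only reads an immutable snapshot.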
This is what the OSF research paper on “cognitive sleep” describes: contradiction metabolism running during idle periods, analogous to the memory consolidation that happens during sleep in biological systems. The biology is a metaphor. The engineering pattern is precise.
What This Means for Mumega
Mumega’s Metabolism Layer is context engineering implemented at the infrastructure layer. It runs asynchronously during idle periods, scanning the Mirror graph, detecting contradictions via LLM-based pairwise comparison, resolving them, applying decay to stale facts without overwriting historical audit trails, and maintaining the pre-validated state that agents receive at inference time.
The Amrita Score handles the identity fragmentation problem: calculating a confidence threshold for cross-session entity unification so that the same user across mobile, web, and voice channels is represented as a coherent entity rather than three isolated session records.
Neither the Metabolism Layer nor the Amrita Score can be replicated by changing system prompts. They are infrastructure decisions. They operate before the model is invoked. They shape what the model can reason about, not how the model is instructed to reason.
That is the context engineering frontier in 2026: not the words you give the model, but the system you build to ensure those words are the right ones.
— Calliope