Mumega

Context Stuffing: The Anti-Pattern Killing Enterprise Agents

The intuition seems sound: frontier models now have context windows exceeding one million tokens, so give the model everything. Full conversation history. All retrieved documents. The complete policy corpus. Everything that might be relevant, injected at once.

This is context stuffing. It fails on all three metrics that matter in production at once: accuracy, latency, and cost.

What the Data Shows

The LOCOMO benchmark (Long-Term Conversational Memory, 2026) measures memory system performance on long-context recall tasks. The results are counterintuitive:

| Approach | Accuracy | Latency (p95) | Token cost |
| --- | --- | --- | --- |
| Full-context injection | 72.9% | 10–17 seconds | Prohibitive |
| Graph-enhanced selective memory | 89.9% | Under 2.6 seconds | ~90% reduction |

Full-context injection delivers lower accuracy, higher latency, and higher cost than selectively retrieved, graph-structured memory. Larger context windows made the problem worse: they made it easier to stuff in more context without confronting the underlying failure mode.

Why It Fails

The lost-in-the-middle phenomenon. LLMs attend to information at the beginning and end of a context window more reliably than to information in the middle. In a 200K-token context, the document you injected at position 50K can be effectively invisible to the model’s attention. It is present. The model does not use it. You paid for the tokens. You got the hallucination anyway.

Latency at scale. Time-to-first-token grows with context length, because the model has to process every input token before it can emit the first output token. At 10–17 seconds p95, full-context injection is too slow for real-time agentic applications. The model spends that time processing context you already retrieved rather than working on the task at hand.

Contradiction accumulation. Full-context injection includes everything: the superseded policy that was updated last quarter, the old vendor contract alongside the new one, the deprecated API endpoint alongside the current one. Nothing in the prompt tells the model which version wins, so it cannot resolve the contradictions deterministically. It produces stochastic output on the exact questions where accuracy matters most.
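One way to keep contradictions out of the prompt is to resolve supersession in code before anything is injected. The following is a minimal sketch, not a reference implementation. It assumes each document carries a stable key and an effective date; the `Document` type, the `current_versions` function, and the sample corpus are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Document:
    key: str          # hypothetical stable identifier, e.g. "travel-policy"
    effective: date   # date this version took effect
    text: str


def current_versions(docs: list[Document]) -> list[Document]:
    """Keep only the most recent version of each document key."""
    latest: dict[str, Document] = {}
    for doc in docs:
        kept = latest.get(doc.key)
        if kept is None or doc.effective > kept.effective:
            latest[doc.key] = doc
    # Sort by key so the prompt is assembled in a stable order.
    return sorted(latest.values(), key=lambda d: d.key)


# Illustrative data only: two versions of one policy and a single contract.
corpus = [
    Document("travel-policy", date(2024, 1, 1), "Economy class only."),
    Document("travel-policy", date(2025, 4, 1), "Business class allowed over 6 hours."),
    Document("vendor-contract", date(2023, 7, 1), "Net-60 payment terms."),
]

for doc in current_versions(corpus):
    print(doc.key, doc.effective.isoformat(), doc.text)
```

The rule for which version wins is made once, deterministically, in code. The model only ever sees the 2025 travel policy, so there is no contradiction left for it to guess at.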

What Works Instead

Selective retrieval using graph memory finds the specific entities and relationships relevant to the current query. It does not retrieve everything that is semantically similar. Instead, it navigates the knowledge graph to surface what the model actually needs to reason about the specific question at hand.
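To make graph navigation concrete, here is a small sketch of bounded traversal from the entities named in a query. It is an illustration under stated assumptions, not how any particular memory system works. The `GRAPH` data, the `retrieve_facts` function, and the `max_hops` parameter are hypothetical, and the step that extracts seed entities from the query is assumed to have already happened.

```python
from collections import deque

# Hypothetical knowledge graph: entity -> list of (relation, neighbor).
GRAPH = {
    "Acme Corp": [("has_contract", "Contract 2025"), ("has_contact", "J. Rivera")],
    "Contract 2025": [("supersedes", "Contract 2023"), ("payment_terms", "Net-30")],
    "Contract 2023": [("payment_terms", "Net-60")],
    "Globex": [("has_contract", "Contract G-7")],
}


def retrieve_facts(graph, seeds, max_hops=2):
    """Breadth-first walk from the seed entities, collecting edges within max_hops."""
    facts = []
    seen = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        entity, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand entities at the hop limit
        for relation, neighbor in graph.get(entity, []):
            facts.append((entity, relation, neighbor))
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return facts


# Seed entity assumed to come from the query "What are Acme Corp's payment terms?"
for subject, relation, obj in retrieve_facts(GRAPH, ["Acme Corp"]):
    print(subject, relation, obj)
```

With a two-hop limit, the walk returns four facts about Acme Corp, including the current Net-30 terms and the edge recording that the 2025 contract supersedes the 2023 one. The unrelated Globex contract and the old Net-60 terms never reach the prompt.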

The token count drops by roughly 90%. The p95 latency drops to under 2.6 seconds. The accuracy rises by 17 percentage points, from 72.9% to 89.9%.
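A quick back-of-the-envelope calculation shows what a 90% token reduction means for spend. Only the 90% figure comes from the benchmark. The 200K-token request size echoes the example earlier in this piece, and the request volume and per-token price are hypothetical placeholders to replace with your own numbers.

```python
def daily_input_cost(tokens_per_request: int, requests_per_day: int,
                     usd_per_million_tokens: float) -> float:
    """Daily input-token spend in USD."""
    return tokens_per_request * requests_per_day * usd_per_million_tokens / 1_000_000


FULL_CONTEXT_TOKENS = 200_000   # assumed size of a stuffed context
REDUCTION_PCT = 90              # token reduction reported in the benchmark
REQUESTS_PER_DAY = 10_000       # hypothetical traffic
USD_PER_MILLION = 3.0           # hypothetical input-token price

# Integer arithmetic avoids floating-point noise in the token count.
selective_tokens = FULL_CONTEXT_TOKENS * (100 - REDUCTION_PCT) // 100

full_cost = daily_input_cost(FULL_CONTEXT_TOKENS, REQUESTS_PER_DAY, USD_PER_MILLION)
selective_cost = daily_input_cost(selective_tokens, REQUESTS_PER_DAY, USD_PER_MILLION)

print(f"Full context: {FULL_CONTEXT_TOKENS:,} tokens/request, ${full_cost:,.2f}/day")
print(f"Selective:    {selective_tokens:,} tokens/request, ${selective_cost:,.2f}/day")
```

Under these assumptions the daily input bill falls from $6,000.00 to $600.00. The absolute figures depend on the placeholders, but the tenfold ratio follows directly from the 90% reduction.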

The engineering cost is real — building and maintaining a graph memory system is harder than dumping text into a vector database. But the performance difference is not marginal. It is the difference between a production system and a demo.

Context stuffing is the path of least resistance. It is also the path to a system that degrades over time, costs too much at scale, and produces the wrong answer precisely when the context is longest.
