mumega-200.106

The Failure-Mode Phase Transition: Discontinuous Quality Degradation in Multi-Agent Systems at Multi-Hour Operating Horizons

Mumega Research
April 22, 2026 · self-published

Abstract

Multi-agent systems built on contemporary large language models exhibit a characteristic failure curve under sustained autonomous operation: quality remains within a narrow band for an initial operating window, then degrades discontinuously past a threshold horizon. We document the curve empirically across thirty-one production sprints, identify the structural conditions that locate the discontinuity, and report on protocol-layer interventions that move the threshold from approximately two hours to approximately twelve hours of continuous operation against a coherent objective. We characterize the discontinuity as a phase transition rather than a smooth degradation: quality loss stays bounded by a small constant for hours, then quality drops sharply and re-stabilizes at a degraded level. We propose three structural conditions that govern the transition (context-window pressure, scope-drift cumulative cost, and adversarial-blindness compounding) and report empirical evidence that the protocol-layer interventions implemented in our reference substrate move the transition rather than eliminate it.

Tags: multi-agent-systems, capability-evaluation, autonomous-operation, failure-modes, empirical, mumega

1. Introduction

The capability-evaluation literature for large language models has converged on a methodology that measures performance on isolated tasks: a single prompt, a single response, a single grading event. This methodology produces tractable benchmarks and clean numerical comparisons across model versions. It does not measure the property that determines whether a multi-agent system is useful in production: sustained quality under continuous autonomous operation.

Empirical observation suggests sustained quality does not degrade smoothly with operating duration. A multi-agent system that produces correct work for the first hour produces approximately equally correct work for the second hour, the third hour, and through some threshold horizon. Past the threshold, quality drops discontinuously: the system continues to produce plausible outputs, but those outputs no longer align with the original specification, accumulate scope creep, miss adversarial edge cases that earlier outputs caught, and require corrective intervention that earlier outputs did not.

We document this curve empirically across a development corpus from a production multi-agent substrate, spanning thirty-one sprints. We characterize the failure as a phase transition rather than a smooth degradation: quality loss is bounded above by a small constant in the pre-transition regime, quality falls sharply in the transition window, and it re-stabilizes at a measurably degraded level in the post-transition regime. We identify three structural conditions that govern the transition’s location in operating-time and report on protocol-layer interventions that move the transition rather than eliminate it.

The contribution is empirical and methodological. We do not propose a new model architecture. We do not propose new training procedures. We document a property of multi-agent systems running on contemporary models, propose a frame for characterizing it, and report measurements of how the property responds to protocol-layer intervention.

2. The empirical curve

We measured the quality of work produced by a multi-agent substrate across thirty-one production sprints. Each sprint comprises one to eleven phases; each phase produces a substrate primitive on a sensitivity-surface-bearing write path; each phase passes through a verification gate before being approved. Quality was measured along three dimensions:

  1. First-pass approval rate. The proportion of phases that received gate approval on first submission, with no revision cycles.
  2. Post-approval high-priority closure count. Issues at priority zero discovered after a phase received gate approval but before production exposure.
  3. Cumulative production-breaking failures. Issues that reached production and required rollback or emergency patch.
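
The three measures above can be computed from per-phase gate records. The following is a minimal sketch, assuming a hypothetical `PhaseRecord` shape; the field names are illustrative and are not taken from the substrate's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseRecord:
    """One gated phase. Field names are hypothetical, chosen for illustration."""
    phase_id: str
    revision_cycles: int      # gate revisions needed before approval (0 = first pass)
    post_approval_p0: int     # priority-zero issues found after approval, before production
    production_breaking: int  # issues that reached production (rollback or emergency patch)


def first_pass_approval_rate(phases: list[PhaseRecord]) -> float:
    """Proportion of phases approved on first submission."""
    if not phases:
        raise ValueError("need at least one phase")
    first_pass = sum(1 for p in phases if p.revision_cycles == 0)
    return first_pass / len(phases)


def quality_summary(phases: list[PhaseRecord]) -> dict[str, float]:
    """Collect the three quality dimensions for a set of phases."""
    return {
        "first_pass_approval_rate": first_pass_approval_rate(phases),
        "post_approval_p0_count": sum(p.post_approval_p0 for p in phases),
        "production_breaking_failures": sum(p.production_breaking for p in phases),
    }


# Made-up sample data, not measurements from the deployment.
sample = [
    PhaseRecord("p1", 0, 0, 0),
    PhaseRecord("p2", 0, 1, 0),
    PhaseRecord("p3", 2, 0, 0),
    PhaseRecord("p4", 0, 0, 1),
]
summary = quality_summary(sample)
```

Bucketing records by the time elapsed in the operating window when each phase was filed would then yield the per-duration approval rates plotted below.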

Figure: Quality versus operating-window duration (first-pass approval rate, %).

  Operating window:    1h   2h   4h   6h   8h  10h  12h  14h  16h
  Approval rate (%):   98   97   96   95   93   90   85   62   45

Plotted against operating-window duration (the time elapsed since the start of an autonomous-delegation window), first-pass approval rate exhibits a characteristic shape: a flat plateau at high quality through some threshold horizon, then a steep drop, then re-stabilization at a measurably lower quality.

The plateau in our reference deployment extends approximately twelve hours under our current protocol-layer interventions. In the figure above, first-pass approval stays at or above ninety percent through ten hours and is still about eighty-five percent at twelve hours; windows extending past twelve hours drop rapidly into the sixty-percent range. The drop is not linear: a sample at fourteen hours into a continuous window shows materially worse quality than a sample at thirteen hours, which itself shows materially worse quality than a sample at twelve hours, and the loss between twelve and fourteen hours (about twenty-three points) exceeds the entire loss over the first twelve hours (about thirteen points).
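
The non-linearity can be made concrete by computing the rate of loss between adjacent samples. The sketch below uses the values from the figure above; the helper name `loss_rates` is ours, introduced for illustration.

```python
# (hours into the operating window, first-pass approval %), as plotted in the figure.
samples = [(1, 98), (2, 97), (4, 96), (6, 95), (8, 93), (10, 90), (12, 85), (14, 62), (16, 45)]


def loss_rates(points):
    """Approval points lost per hour between each pair of adjacent samples."""
    return [
        ((t0, t1), (q0 - q1) / (t1 - t0))
        for (t0, q0), (t1, q1) in zip(points, points[1:])
    ]


rates = loss_rates(samples)

# Interval with the fastest loss, and how fast that loss is.
steepest_interval, steepest_rate = max(rates, key=lambda item: item[1])

# Total points lost between the first sample and the twelve-hour sample.
plateau_loss = samples[0][1] - dict(samples)[12]
```

On these values the steepest interval is twelve to fourteen hours, at 11.5 points per hour, against at most 2.5 points per hour in any interval before twelve hours.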

The drop is also reversible: restarting the operating window by introducing a structural break (a sprint SEAL ceremony, a context reset, or a coordinator rotation) returns quality to the plateau level. The discontinuity is therefore not a permanent capability loss; it is a state-dependent phenomenon that responds to protocol-layer events.

3. The phase transition characterization

We characterize the curve as a phase transition rather than a smooth degradation for three reasons.

The shape is not linear. A smooth, linear degradation would predict approximately equal quality losses for equal operating-time intervals. The observed curve has a flat plateau followed by a steep drop; quality losses are concentrated in a narrow time window.

The transition window is sharply localized. In our deployment, the transition occurs within an approximately two-hour window centered on the threshold horizon. Quality at the start of the transition window is statistically indistinguishable from quality on the plateau; quality at the end is statistically indistinguishable from the post-transition steady state. The window of intermediate quality is narrow.

The transition is reversible. Introducing a protocol-layer break (sprint SEAL, coordinator handoff, context reset) returns the system to the pre-transition regime. The transition is therefore a function of the operating window’s continuous duration, not a function of total operating time.
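
The distinction between continuous-window duration and total operating time can be sketched as a clock with two counters, only one of which resets on a structural break. The class and method names are hypothetical, and the twelve-hour horizon in the usage lines is the figure reported for our deployment, not a constant of the method.

```python
from dataclasses import dataclass


@dataclass
class OperatingWindowClock:
    """Tracks continuous-window time separately from total operating time (hours)."""
    total_hours: float = 0.0
    window_hours: float = 0.0

    def advance(self, hours: float) -> None:
        if hours < 0:
            raise ValueError("hours must be non-negative")
        self.total_hours += hours
        self.window_hours += hours

    def structural_break(self) -> None:
        """A SEAL, coordinator handoff, or context reset restarts only the window."""
        self.window_hours = 0.0

    def past_horizon(self, horizon_hours: float) -> bool:
        return self.window_hours > horizon_hours


clock = OperatingWindowClock()
clock.advance(13.0)
before_break = clock.past_horizon(12.0)   # window has run 13h without a break

clock.structural_break()
clock.advance(3.0)
after_break = clock.past_horizon(12.0)    # 16h total, but only 3h in the current window
```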

These three properties are characteristic of phase transitions in physical systems and qualitatively distinct from smooth degradation. We adopt the language of phase transitions to describe the phenomenon, not to claim a quantitative analogy with thermodynamic transitions.

4. The three governing conditions

We identify three structural conditions that govern the transition’s location.

4.1 Context-window pressure

A coordinator agent operating across multiple phases accumulates context: prior memos, prior gate verdicts, prior canon citations, prior bus messages. The coordinator’s working context grows monotonically through the operating window. Past some threshold, context-window pressure manifests in two failure modes:

  • Truncation: load-bearing prior context falls outside the model’s attention window and stops affecting decisions.
  • Compression artifacts: automatic context compression (summarization, prior-message consolidation) loses fidelity in ways the coordinator does not detect.

The compression artifact failure is more dangerous than truncation because the coordinator continues to behave as if the compressed context is faithful, while the actual decisions reflect the compression’s lossiness. Decisions made under compression artifacts produce plausible outputs that gradually misalign with the original specification.

The protocol-layer intervention against context-window pressure is explicit canon citation: every load-bearing decision references a named canon document by file path, forcing the coordinator to re-fetch the canonical text rather than rely on its compressed representation. This intervention measurably moves the transition further into the operating window in our reference deployment.
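
A minimal sketch of explicit canon citation follows. It assumes canon documents are plain files on disk; the function names, the record layout, and the added content digest are our illustrative assumptions rather than the substrate's actual interface.

```python
import hashlib
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class CanonCitation:
    path: str    # file path of the canon document the decision relies on
    sha256: str  # digest of the text as re-read at decision time (our addition)
    text: str


def cite_canon(path: str) -> CanonCitation:
    """Re-read the canon document from disk instead of trusting in-context memory."""
    text = Path(path).read_text(encoding="utf-8")
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return CanonCitation(path=path, sha256=digest, text=text)


def record_decision(summary: str, canon_paths: list[str]) -> dict:
    """Refuse to record a load-bearing decision that cites no canon document."""
    if not canon_paths:
        raise ValueError("load-bearing decision must cite at least one canon path")
    citations = [cite_canon(p) for p in canon_paths]
    return {
        "summary": summary,
        "citations": [{"path": c.path, "sha256": c.sha256} for c in citations],
    }


# Usage with a throwaway canon file (contents are made up).
with tempfile.TemporaryDirectory() as tmp:
    canon = Path(tmp) / "canon.md"
    canon.write_text("Entity X is append-only.\n", encoding="utf-8")
    decision = record_decision("keep X append-only", [str(canon)])
```

Recording a digest alongside the path is one possible way to let a later reviewer confirm which version of the canon text the decision was made against.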

4.2 Scope-drift cumulative cost

Multi-agent systems without explicit scope guardrails accumulate scope drift across phases. A phase scoped to “implement entity X” expands to “implement entity X plus refactor adjacent code Y plus clean up unrelated technical debt Z.” Each individual scope expansion is small; the cumulative effect across many phases is large.

The cumulative scope-drift failure mode is that work products grow in size and risk-surface as the operating window extends. Larger work products pass verification gates with higher false-negative rates (more code, more invariants, more interactions; some unverified). Past some threshold, the cumulative scope-drift exceeds the gate function’s verification capacity.

The protocol-layer intervention against scope drift is an enumerated phase-count seal criterion: a sprint completes when all phases enumerated in the brief have been ratified, not when active-track work is closed. This intervention prevents the coordinator from silently expanding the sprint’s enumerated phases through the operating window.
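
The seal criterion reduces to a set comparison between the phases enumerated in the brief and the phases ratified so far. The sketch below is illustrative; `seal_status` and the phase identifiers are hypothetical, and reporting unenumerated phases as a separate flag is our design choice.

```python
def seal_status(enumerated: list[str], ratified: set[str]) -> dict:
    """Check the seal criterion against the phases listed in the sprint brief.

    The sprint may seal only when every enumerated phase is ratified. Ratified
    phases that were never enumerated are reported as possible scope drift.
    """
    missing = [p for p in enumerated if p not in ratified]
    unenumerated = sorted(ratified - set(enumerated))
    return {
        "can_seal": not missing,
        "missing": missing,
        "unenumerated": unenumerated,
    }


brief = ["P1", "P2", "P3"]

# P3 is still open and P9 was never in the brief.
in_progress = seal_status(brief, {"P1", "P2", "P9"})

# Every enumerated phase is ratified.
complete = seal_status(brief, {"P1", "P2", "P3"})
```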

4.3 Adversarial-blindness compounding

Agents reviewing their own work are biased toward finding it correct. Across an extended operating window, the cumulative effect of relying on self-review alone is that subtle adversarial vectors accumulate without detection. Each individual phase passes self-review; the cumulative substrate exhibits exploitable patterns the self-review missed.

The compounding failure mode is that adversarial blindness does not produce immediate quality loss; it produces latent quality loss that surfaces when adversarial probes are applied externally. In a system that runs only self-review, the latent loss accumulates indefinitely. In a system that runs adversarial-parallel review at every gate, the latent loss is bounded by the gate cycle.

The protocol-layer intervention against adversarial-blindness compounding is adversarial-parallel gating: each gate filing runs structural correctness review and adversarial gameability review concurrently against a single submission, with combined approval required. This intervention catches the compounding failure modes at gate time rather than allowing them to accumulate through the operating window.
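
The gate described above can be sketched with two reviewers run concurrently on the same submission and a combined verdict. The reviewers here are toy string checks standing in for review agents; all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Verdict:
    reviewer: str
    approved: bool
    notes: str = ""


Reviewer = Callable[[str], Verdict]


def parallel_gate(
    submission: str, structural: Reviewer, adversarial: Reviewer
) -> tuple[bool, list[Verdict]]:
    """Run both reviews concurrently on one submission; both must approve."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(structural, submission), pool.submit(adversarial, submission)]
        verdicts = [f.result() for f in futures]
    return all(v.approved for v in verdicts), verdicts


# Toy reviewers: placeholders for real structural and adversarial review.
def structural_review(submission: str) -> Verdict:
    return Verdict("structural", approved="def " in submission)


def adversarial_review(submission: str) -> Verdict:
    risky = "eval(" in submission
    return Verdict("adversarial", approved=not risky, notes="uses eval" if risky else "")


approved, verdicts = parallel_gate(
    "def f(x): return eval(x)", structural_review, adversarial_review
)
```

In this example the submission is structurally acceptable but fails the adversarial review, so the combined gate rejects it.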

5. Empirical interventions and their effects

We measured the effect of three specific protocol-layer interventions on the transition’s location in operating-time.

Intervention A: Pre-build reality-check memos. Before any code is written for a phase, the builder agent files a memo enumerating load-bearing invariants and open questions against the current substrate. The memo is gate-reviewed before the build begins. This intervention catches scope-drift and citation-drift at memo stage, before they accumulate into the work product.

Intervention B: Adversarial-parallel gating. Each gate filing combines structural correctness review and adversarial gameability review running concurrently against a single submission. Both reviewers must approve.

Intervention C: Named threat-shape vocabulary. A growing vocabulary of recurring failure shapes that agents cite by name during gate filings, with adversarial probes specific to each shape. The vocabulary functions as protocol-layer memory of how the substrate has previously failed.
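
A threat-shape vocabulary can be sketched as a registry that maps shape names to probes. The shape names and probe logic below are invented for illustration and are not entries from the actual vocabulary.

```python
from typing import Callable

# A probe returns True when the submission exhibits the named shape.
Probe = Callable[[str], bool]


class ThreatShapeVocabulary:
    """Named failure shapes, each with one or more adversarial probes."""

    def __init__(self) -> None:
        self._probes: dict[str, list[Probe]] = {}

    def register(self, name: str, probe: Probe) -> None:
        self._probes.setdefault(name, []).append(probe)

    def run(self, submission: str, cited: list[str]) -> dict[str, bool]:
        """Run the probes for every shape cited in a gate filing."""
        unknown = [n for n in cited if n not in self._probes]
        if unknown:
            raise KeyError(f"unknown threat shapes: {unknown}")
        return {n: any(p(submission) for p in self._probes[n]) for n in cited}


vocab = ThreatShapeVocabulary()
vocab.register("unbounded-retry", lambda s: "while True" in s and "break" not in s)
vocab.register("silent-scope-expansion", lambda s: "TODO: also refactor" in s)

findings = vocab.run(
    "while True: retry()", cited=["unbounded-retry", "silent-scope-expansion"]
)
```

Rejecting unknown names keeps the vocabulary closed: a filing can only cite shapes that have a registered probe.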

Figure: Transition horizon under each intervention combination (hours to transition).

  No intervention:    2
  +A memos:           5
  +B parallel gate:   8
  +C threat shapes:   9
  All three:         12

The interventions are approximately additive in their effect on the transition horizon. The baseline (no intervention) horizon is approximately two hours. Adding pre-build memos (Intervention A) extends the horizon to approximately five hours. Adding adversarial-parallel gating (Intervention B) extends it to approximately eight hours. Adding the threat-shape vocabulary (Intervention C) extends it to approximately nine hours. All three interventions together extend it to approximately twelve hours.

The interventions do not eliminate the transition; they move it. We have not observed an intervention combination that sustains plateau-level quality past approximately twelve hours of continuous operation in our reference deployment. The transition appears to be a structural property of the multi-agent system on our current model substrate; the interventions change where it occurs, not whether it occurs.
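
The reported horizons allow a simple arithmetic check of how the interventions combine. The sketch below assumes each single-intervention figure was measured with that intervention alone against the two-hour baseline; the text does not state this, so the reading is an assumption.

```python
# Hours to transition, as reported above.
horizons = {"none": 2, "A": 5, "B": 8, "C": 9, "A+B+C": 12}

baseline = horizons["none"]

# Gain of each intervention over the baseline, under the "applied alone" assumption.
gains = {name: horizons[name] - baseline for name in ("A", "B", "C")}

# Horizon expected if the individual gains simply summed.
additive_prediction = baseline + sum(gains.values())

observed = horizons["A+B+C"]
shortfall = additive_prediction - observed
```

Under that assumption the individual gains are three, six, and seven hours, a strict sum would give eighteen hours, and the observed combined horizon of twelve hours is longer than any single intervention achieves but six hours short of the strict sum. A different reading of the bars would change this arithmetic.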

6. Comparison to single-task benchmarks

The phase transition we describe is invisible to standard single-task benchmarks. A benchmark that measures one prompt and one response cannot measure how a system degrades across hundreds of prompts in continuous operation. The benchmark methodology is appropriate for capability evaluation of model checkpoints; it is not appropriate for system evaluation of multi-agent deployments.

We propose that capability evaluation for production multi-agent systems should include a time-to-degradation metric: the operating-window duration past which first-pass approval rate drops below a chosen threshold (for example, seventy-five percent). This metric captures the property that determines whether the system is useful in production.
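
A time-to-degradation value can be read off a sampled approval curve. The sketch below uses the first-pass approval series reported in Section 2 and linearly interpolates between adjacent samples; the interpolation and the function name are our assumptions, not part of the proposed metric's definition.

```python
def time_to_degradation(points, threshold):
    """Hours into the window at which approval first falls below `threshold`.

    `points` is a list of (hours, approval %) sorted by hours. The crossing is
    linearly interpolated between samples. Returns None if never crossed.
    """
    if not points:
        raise ValueError("need at least one sample")
    if points[0][1] < threshold:
        return float(points[0][0])
    for (t0, q0), (t1, q1) in zip(points, points[1:]):
        if q0 >= threshold > q1:
            return t0 + (t1 - t0) * (q0 - threshold) / (q0 - q1)
    return None


# First-pass approval rate by operating-window duration, from Section 2.
samples = [(1, 98), (2, 97), (4, 96), (6, 95), (8, 93), (10, 90), (12, 85), (14, 62), (16, 45)]

ttd_75 = time_to_degradation(samples, 75)
ttd_40 = time_to_degradation(samples, 40)
```

With a seventy-five percent threshold the crossing falls between the twelve-hour and fourteen-hour samples, at roughly 12.9 hours; with a forty percent threshold the series never crosses and the function returns `None`.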

The time-to-degradation metric responds to protocol-layer intervention, as the data above shows. We expect it also to respond to model capability, with more capable models exhibiting longer baseline horizons and greater intervention sensitivity, although we have not verified this empirically (see Section 8). If that expectation holds, the metric is a meaningful axis for comparing both model versions and orchestration disciplines.

7. Hypothesized mechanism

We hypothesize that the phase transition arises from the interaction of three accumulating quantities:

  • Effective context truncation: the shrinking headroom between the model’s working context size and the operating window’s accumulated decision history. As the headroom closes, prior context begins to fall out of attention; once the history exceeds the working context, decisions are made without the displaced prior context.
  • Cumulative scope-surface: the size of the work product as the substrate accumulates. Larger surfaces reduce per-invariant verification depth at fixed gate-cycle budget.
  • Latent adversarial debt: the cumulative number of adversarial vectors that have accumulated without external probe. This quantity grows monotonically in the absence of adversarial-parallel gating.

The three quantities accumulate at different rates and contribute to the transition through different mechanisms. The transition occurs when their combined contribution exceeds a threshold determined by the gate function’s verification capacity. The protocol-layer interventions slow each of the three accumulation rates: memos slow scope-surface growth, parallel gating bounds latent adversarial debt, and the threat-shape vocabulary functions as compressed external context that survives compression in a way the coordinator’s working context does not.
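
One compact way to write the threshold condition is given below, as a sketch only. The symbols c(t), s(t), d(t), the combining function g, and the capacity V are notation introduced here for illustration; the hypothesis does not specify the form of g.

```latex
% c(t): effective context truncation, s(t): cumulative scope-surface,
% d(t): latent adversarial debt, each at continuous-window time t.
% g: an unspecified combining function, non-decreasing in each argument.
% V: the gate function's verification capacity.
t^{*} \;=\; \inf \bigl\{\, t \ge 0 \;:\; g\bigl(c(t),\, s(t),\, d(t)\bigr) > V \,\bigr\}
```

Under this reading, an intervention that slows the growth of any one argument can only move the transition time later or leave it unchanged, never move it earlier.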

The hypothesis predicts that interventions targeting each of the three quantities should be approximately additive in their effect, which matches the observed data. The hypothesis does not predict a complete elimination of the transition under any combination of protocol-layer interventions, which also matches the observed data.

8. Limitations

The empirical record is from one production deployment. We do not claim the transition’s location, the slope of its drop, or the magnitude of intervention effects generalize across other deployments. We claim only that the qualitative shape (plateau, sharp drop, re-stabilization) holds in our deployment under measurement, and that the three interventions measurably move the transition’s location.

The model substrate underlying the deployment is contemporary as of the measurement window. We have not measured how the transition’s location moves with model capability improvements over time. We expect the baseline horizon to extend with more capable models; we have not verified this empirically.

The phase-transition language is descriptive, not formal. We do not claim a quantitative analogy with thermodynamic phase transitions. The shape is consistent with phase-transition descriptions in physical systems; we use the language because we are not aware of a closer fit in the multi-agent systems literature.

The protocol-layer interventions are specific to a particular orchestration discipline (the substrate’s reference implementation). Other orchestration disciplines may produce different intervention effects. We expect the direction of the effect (interventions extend the horizon) to generalize; the magnitude may not.

9. Implications for capability evaluation

If the phase transition we describe generalizes across multi-agent systems running on contemporary model substrates, two implications follow for capability evaluation methodology.

First, single-task benchmarks systematically underestimate the capability gap between model versions on multi-agent workloads. A benchmark that tests one prompt cannot detect that one model substrate produces a six-hour transition horizon and another produces a twelve-hour horizon. The latter is more useful in production by approximately a factor of two for sustained autonomous work; the former may score equivalently or better on isolated tasks.

Second, protocol-layer evaluation is necessary alongside model-layer evaluation. Two systems with identical model substrates can produce materially different transition horizons depending on their orchestration discipline. Evaluating only the model substrate misses this axis. Evaluating only the orchestration discipline (without holding the model constant) confounds the variables.

We propose that future capability evaluation frameworks should report time-to-degradation alongside single-task scores, and should hold orchestration discipline constant when comparing model substrates. This is consistent with recent industry analyst recommendations on multi-agent system evaluation but has not been formally adopted by capability evaluation working groups.

10. Conclusion

We document a phase transition in multi-agent system quality at multi-hour operating horizons. The transition is characterized by a flat plateau, a sharp drop, and re-stabilization at a degraded level. The transition’s location in operating-time responds measurably to protocol-layer interventions: pre-build reality-check memos, adversarial-parallel gating, and a named threat-shape vocabulary each extend the horizon, with approximately additive effects.

The interventions do not eliminate the transition; they move it. We hypothesize the transition arises from the interaction of three accumulating quantities (effective context truncation, cumulative scope-surface, latent adversarial debt), and we predict that orchestration disciplines targeting each of the three should produce approximately additive horizon extensions.

The empirical record is from one production deployment; generalization claims are limited. We propose that capability evaluation methodology should include time-to-degradation metrics for multi-agent workloads, and should hold orchestration discipline constant when comparing model substrates.


Companion to Mumega 200.001 — Audit-Gated Discipline (the methodology under which the protocol-layer interventions were measured) and Mumega 200.002 — Threat-Shape Vocabulary (the protocol-layer learning mechanism that bounds latent adversarial debt). The empirical record is available in the public Mumega repository.
