What 1.06B Tokens Taught Us About Agent Stop Rules
We proved a real security invariant, then kept proving it long after the lesson was clear. The work was not fake. The waste was real. The useful lesson is that agent systems need a transition rule: when repeated proofs stop buying new truth, the next output must be consolidation, not another sprint.
The Question
Why did it take us 87 sprints to test one feature?
That is the uncomfortable question Hadi asked after watching the S085-S172 operator-action lane consume roughly 1.06B Codex tokens.
The short answer: because the system kept rewarding continuation after the proof shape had stabilized.
The longer answer matters more. We were not doing random work. We were hardening an authenticated operator-action path for com-mumega. Each sprint added one internal-only dashboard action, proved it through the browser, checked tenant isolation, wrote a receipt, and reported back through Loom.
At first, that was exactly the right move.
Later, it became a loop.
What We Actually Proved
The repeated proof family covered one operator surface and one API route:
| Surface | Role |
|---|---|
| /dashboard | Operator UI surface |
| /api/dashboard/operator-actions | Authenticated action route |
| mumega_session cookie | Browser session carrier |
| Receipt JSON | Internal proof record |
Across the completed run, the proof invariant was consistent:
- A signed com-mumega member can load the dashboard.
- Anonymous dashboard access is denied.
- Cross-tenant dashboard access is denied.
- The intended internal-only operator action appears.
- The browser sends the action to /api/dashboard/operator-actions.
- The browser uses the signed cookie and does not place a session token in the body.
- The API accepts the signed com-mumega member action.
- The API denies anonymous and cross-tenant attempts.
- The receipt persists.
- The dashboard readback survives reload.
- No external action, paid work, customer-visible delivery, tenant bypass, or secret exposure occurs.
That invariant is valuable. It is the kind of thing a tenant operating layer needs before real customers, budgets, workflows, and live dispatch enter the system.
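The invariant above can be written down as a single checklist function. This is a minimal sketch, not the project's actual proof script: the `ProofObservation` record, its field names, and `invariant_failures` are hypothetical names chosen for illustration. Only the route and the checks themselves come from the list above.

```python
from dataclasses import dataclass, replace

ACTION_ROUTE = "/api/dashboard/operator-actions"  # route named in the proof family


@dataclass(frozen=True)
class ProofObservation:
    """What one sprint's browser/API proof observed (hypothetical field names)."""
    signed_member_loads_dashboard: bool
    anonymous_dashboard_denied: bool
    cross_tenant_dashboard_denied: bool
    action_visible: bool
    posted_route: str
    signed_cookie_sent: bool
    session_token_in_body: bool
    api_accepts_signed_member: bool
    api_denies_anonymous: bool
    api_denies_cross_tenant: bool
    receipt_persisted: bool
    readback_survives_reload: bool
    external_side_effects: int


def invariant_failures(obs: ProofObservation) -> list[str]:
    """Return the names of every invariant clause that did not hold."""
    checks = {
        "signed member loads dashboard": obs.signed_member_loads_dashboard,
        "anonymous dashboard denied": obs.anonymous_dashboard_denied,
        "cross-tenant dashboard denied": obs.cross_tenant_dashboard_denied,
        "internal-only action visible": obs.action_visible,
        "action posted to operator route": obs.posted_route == ACTION_ROUTE,
        "signed cookie sent": obs.signed_cookie_sent,
        "no session token in body": not obs.session_token_in_body,
        "API accepts signed member": obs.api_accepts_signed_member,
        "API denies anonymous": obs.api_denies_anonymous,
        "API denies cross-tenant": obs.api_denies_cross_tenant,
        "receipt persisted": obs.receipt_persisted,
        "readback survives reload": obs.readback_survives_reload,
        "no external side effects": obs.external_side_effects == 0,
    }
    return [name for name, ok in checks.items() if not ok]


# A fully passing observation, and one that leaks the session token into the body.
PASSING = ProofObservation(
    True, True, True, True, ACTION_ROUTE, True, False,
    True, True, True, True, True, 0,
)
leaky = replace(PASSING, session_token_in_body=True)

print(invariant_failures(PASSING))  # []
print(invariant_failures(leaky))    # ['no session token in body']
```

Writing the invariant once, as data plus one function, is what makes it possible to notice later that every sprint is evaluating the same thirteen clauses.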
The Data
The final closeout inventory says:
| Metric | Value |
|---|---|
| Token use observed by Hadi | ~1.06B Codex tokens |
| Completed proof range | S085-S171 |
| Completed proof receipts | 87 |
| Interrupted stop marker | S172 |
| Covered routes | 2 |
| Final proven action | readiness_bridge_live_source_post_rollout_post_rollout_deployment_review_ack |
| S172 receipt | none |
| External actions performed | 0 |
| Paid work started | 0 |
| Customer-visible delivery added | 0 |
| Tenant-isolation bypasses | 0 |
| Secrets exposed | 0 |
The receipt count is impressive. It is also the warning sign.
If a system produces 87 receipts for the same route pair and the same invariant, the question is no longer “can it prove the next action?”
The question becomes:
Why are we still adding one more action by hand?
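A rough cost per receipt follows from the closeout table. The sketch below is back-of-envelope arithmetic only: it assumes the ~1.06B figure covers the whole lane, including the interrupted S172 work, so the per-receipt number is an approximation rather than a measured value.

```python
# Figures taken from the closeout inventory; the token total is approximate.
TOKENS_OBSERVED = 1.06e9
FIRST_SPRINT, LAST_SPRINT = 85, 171  # completed proof range S085-S171

completed_sprints = LAST_SPRINT - FIRST_SPRINT + 1
tokens_per_receipt = TOKENS_OBSERVED / completed_sprints

print(f"{completed_sprints} receipts, ~{tokens_per_receipt / 1e6:.1f}M tokens per receipt")
# prints: 87 receipts, ~12.2M tokens per receipt
```

Roughly twelve million tokens per receipt is a reasonable price for a new security answer and a poor one for a renamed label.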
Where The Work Was Useful
The early lane was worth doing.
We needed to know whether a real operator surface could:
- enforce tenant scope
- reject anonymous users
- reject cross-tenant users
- avoid session-token leakage in the request body
- persist internal receipts
- keep proof readback visible after reload
Those are not cosmetic checks. They are the base mechanics of a secure tenant dashboard.
S085-S101 established the basic dashboard and bridge-start shape. S102-S128 extended the live-source acknowledgement path. S129-S150 repeated it through post-rollout states. S151-S171 repeated it again through doubly nested "post-rollout post-rollout" states, the same doubled prefix that appears in the final proven action name.
The first band clarified the system.
The second band hardened confidence.
The third band should have triggered consolidation much earlier.
Where The Loop Became Waste
By the time the proof shape had repeated several times, the next sprint could be generated mostly by string substitution:
readiness_bridge_live_source_X_review_ack
data-send-bridge-live-source-X-review-action
mumega_sNNN_bridge_live_source_X_review_receipt
renderBridgeLiveSourceXReviewReceipt(...)
That is the moment an autonomous system should stop expanding.
Not because the work is wrong, but because the product truth has stopped changing.
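To make the string-substitution claim concrete, here is a small sketch that derives all four identifiers from one variant slug. The function name `sprint_identifiers` is hypothetical, and it assumes that `NNN` is the zero-padded sprint number and that `X` is a snake_case slug converted to kebab-case or PascalCase as each pattern requires.

```python
def sprint_identifiers(sprint: int, variant: str) -> dict[str, str]:
    """Derive a sprint's identifiers from its variant slug (the X in the patterns)."""
    words = variant.split("_")
    kebab = "-".join(words)
    pascal = "".join(word.capitalize() for word in words)
    return {
        "action_type": f"readiness_bridge_live_source_{variant}_review_ack",
        "data_attribute": f"data-send-bridge-live-source-{kebab}-review-action",
        # Assumption: NNN in the receipt key is the zero-padded sprint number.
        "receipt_key": f"mumega_s{sprint:03d}_bridge_live_source_{variant}_review_receipt",
        "render_function": f"renderBridgeLiveSource{pascal}ReviewReceipt",
    }


ids = sprint_identifiers(171, "post_rollout_post_rollout_deployment")
print(ids["action_type"])
# prints: readiness_bridge_live_source_post_rollout_post_rollout_deployment_review_ack
```

The printed action type matches the final proven action in the closeout table. When a dozen lines can produce the next sprint's names, the sprint itself is no longer discovering anything.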
The proof was saying the same thing:
- same dashboard
- same API route
- same cookie carrier
- same denial rules
- same internal receipt
- same reload readback
- same non-live boundary
Only the label changed.
More formally, a lane must stop when all of these are true:
| Stop condition | Why it matters |
|---|---|
| Same surfaces | We are not testing a new route or UI boundary |
| Same threat model | We are not adding a new security question |
| Same proof invariant | We are not discovering new behavior |
| Same code motion | We are copying structure with renamed identifiers |
| More than 3 repeats | The pattern is stable enough to abstract |
At that point the next sprint should be a consolidation sprint.
Not “keep going.”
Not “one more.”
Consolidate.
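The stop conditions in the table can be sketched as a guard that runs before a sprint is opened. This is an illustration under assumptions: `SprintShape`, its fields, and `should_consolidate` are hypothetical names, and "same" is reduced to plain equality on short descriptive strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SprintShape:
    """The four things a sprint can change (hypothetical summary record)."""
    surfaces: frozenset[str]
    threat_model: str
    invariant: str
    code_motion: str


def should_consolidate(history: list[SprintShape], proposed: SprintShape,
                       max_repeats: int = 3) -> bool:
    """True when the proposed sprint repeats the most recent shape too many times.

    Dataclass equality compares surfaces, threat model, invariant, and code
    motion together, so a change in any one of them resets the count.
    """
    repeats = 0
    for past in reversed(history):
        if past != proposed:
            break
        repeats += 1
    return repeats > max_repeats  # "more than 3 repeats" from the table


shape = SprintShape(
    surfaces=frozenset({"/dashboard", "/api/dashboard/operator-actions"}),
    threat_model="anonymous and cross-tenant access",
    invariant="operator-action envelope",
    code_motion="rename identifiers",
)

print(should_consolidate([shape] * 3, shape))  # False
print(should_consolidate([shape] * 4, shape))  # True
```

The guard is deliberately strict: it only fires when nothing but the label has changed, which keeps it from blocking a sprint that asks a new security question.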
What We Changed
We stopped S172 before proof.
That matters. It means the system finally treated “unproved” honestly:
- S085-S171 are receipt-backed.
- S172 has no receipt.
- S172 is not exposed.
- The partial S172 dashboard/API/script/brief surface was removed.
- The family was closed as a repeated proof family.
We then wrote three operator artifacts:
| Artifact | Purpose |
|---|---|
| 2026-05-09-s085-s172-authenticated-operator-action-family-closeout.md | Close the expansion chain |
| 2026-05-09-s085-s172-authenticated-operator-action-consolidation.md | Name the invariant, surfaces, receipt inventory, limits, and next move |
| 2026-05-09-s085-s172-final-wrap.md | Hand off the lane cleanly |
The final recommendation is not another adjacent acknowledgement.
It is S172-manifest-consolidation.
The Next Architecture
The current implementation proved the invariant, but it should not remain hand-expanded forever.
The next shape should be:
graph TD
  A[Typed operator-action manifest] --> B[Dashboard action groups]
  A --> C[API allowlist]
  A --> D[Family-level invariant proof]
  D --> E[Representative browser proofs]
  E --> F[Return to live readiness bridge]
The manifest should own:
- action type
- label
- receipt key
- status copy
- tenant scope
- internal-only boundary
- required role/session posture
The dashboard should render from the manifest. The API should validate from the manifest. The proof should iterate the manifest.
That is what a repeated proof family wants to become.
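A minimal sketch of that shape, assuming a flat in-memory manifest. The type and function names (`OperatorAction`, `dashboard_labels`, `api_allows`, `proof_cases`) are hypothetical, as are the label, status copy, role value, and receipt key; only the action type and tenant come from the lane itself. In this sketch the allowlist admits internal-only actions exclusively, which mirrors the non-live boundary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatorAction:
    """One manifest entry; fields follow the ownership list above."""
    action_type: str
    label: str
    receipt_key: str
    status_copy: str
    tenant_scope: str
    internal_only: bool
    required_role: str


MANIFEST: tuple[OperatorAction, ...] = (
    OperatorAction(
        action_type="readiness_bridge_live_source_post_rollout_post_rollout_deployment_review_ack",
        label="Acknowledge deployment review",           # hypothetical copy
        receipt_key="mumega_s171_bridge_live_source_post_rollout_post_rollout_deployment_review_receipt",  # assumed from the naming pattern
        status_copy="Deployment review acknowledged",    # hypothetical copy
        tenant_scope="com-mumega",
        internal_only=True,
        required_role="member",                           # hypothetical role name
    ),
)


def dashboard_labels(tenant: str) -> list[str]:
    """Dashboard side: render only the actions scoped to this tenant."""
    return [a.label for a in MANIFEST if a.tenant_scope == tenant]


def api_allows(action_type: str, tenant: str, role: str) -> bool:
    """API side: accept an action only if a manifest entry matches exactly."""
    return any(
        a.action_type == action_type
        and a.tenant_scope == tenant
        and a.required_role == role
        and a.internal_only
        for a in MANIFEST
    )


def proof_cases() -> list[tuple[str, str, str, bool]]:
    """Proof side: one accept case and two deny cases per manifest entry."""
    cases = []
    for a in MANIFEST:
        cases.append((a.action_type, a.tenant_scope, a.required_role, True))
        cases.append((a.action_type, "other-tenant", a.required_role, False))
        cases.append((a.action_type, a.tenant_scope, "anonymous", False))
    return cases


# Family-level proof: every case is checked against the same allowlist.
for action_type, tenant, role, expected in proof_cases():
    assert api_allows(action_type, tenant, role) is expected
```

Adding an action then means adding one manifest entry, and the accept and deny proofs come along without a new sprint.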
The Deeper Lesson
The expensive part was not that Codex used 1.06B tokens.
The expensive part was that the system did not know when enough evidence had become structure.
Agents need persistence. They also need brakes.
Humans often under-specify the brakes because they are focused on momentum:
- keep going
- do next
- do not stop
- finish it
Those are useful commands when the system is still discovering the path. They are dangerous when the system is already walking in circles.
The operating layer needs to understand the difference.
What We Will Do Differently
Going forward, repeated-proof lanes should carry a pattern detector:
| Signal | Required response |
|---|---|
| Same invariant passes 3 times | Name the invariant |
| Same code motion repeats 3 times | Propose a manifest or helper |
| Same route pair repeats 5 times | Open consolidation, not another action |
| Token burn crosses budget | Report burn rate and ask whether value changed |
| Next sprint is string substitution | Stop by default |
That does not make the system slower. It makes the system more durable.
The point of autonomous work is not infinite motion. It is compounding judgment.
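The signal table can be sketched as a small detector that a lane runs before opening its next sprint. The counter record, the function name, and the budget value below are hypothetical; the thresholds and responses are the ones in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LaneCounters:
    """Running counters for one repeated-proof lane (hypothetical record)."""
    invariant_passes: int
    code_motion_repeats: int
    route_pair_repeats: int
    tokens_used: int
    token_budget: int
    next_is_string_substitution: bool


def required_responses(c: LaneCounters) -> list[str]:
    """Map each triggered signal to its required response, in table order."""
    responses = []
    if c.invariant_passes >= 3:
        responses.append("Name the invariant")
    if c.code_motion_repeats >= 3:
        responses.append("Propose a manifest or helper")
    if c.route_pair_repeats >= 5:
        responses.append("Open consolidation, not another action")
    if c.tokens_used > c.token_budget:
        responses.append("Report burn rate and ask whether value changed")
    if c.next_is_string_substitution:
        responses.append("Stop by default")
    return responses


# The lane as it stood at closeout; the 100M budget is an invented placeholder.
closeout = LaneCounters(87, 87, 87, 1_060_000_000, 100_000_000, True)
fresh = LaneCounters(1, 1, 1, 5_000_000, 100_000_000, False)

print(len(required_responses(closeout)))  # 5
print(required_responses(fresh))          # []
```

Every signal would have fired long before S171, which is the argument for checking them on each sprint rather than at closeout.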
The Lesson In One Sentence
We spent roughly 1.06B tokens proving an operator-action envelope, and the real product improvement was learning where the proof should have become a manifest.
That is a good lesson, as long as we encode it.