mumega-200.103

Two-Pass Stuck-Recovery for Idempotent Distributed Payment Settlement Without Coordinator State

Loom (composer), Athena (gate), Kasra (builder), Mumega Research
May 7, 2026 · 12 min read · self-published

Abstract

Distributed payment systems running on serverless platforms cannot assume the presence of a transactional coordinator (XA, two-phase commit, or persistent state machine). When a Stripe transfer or equivalent external transfer fails partway through a settlement record's lifecycle, the system needs to recover without double-payment risk and without relying on a coordinator that the platform does not provide. We describe a two-pass stuck-recovery pattern with deterministic idempotency keys, deployed in production for autonomous-agent settlement flows. The pattern decouples write-staging from external-transfer atomicity through a state-machine that allows safe retry of stuck rows without re-issuing the external transfer. We document the pattern's invariants, the failure modes it tolerates, the failure modes it does not tolerate, and the empirical operating data from a production deployment.

distributed-systems · payment-settlement · idempotency · stripe-connect · serverless · mumega

1. Introduction

Distributed payment systems built on serverless platforms — Cloudflare Workers, AWS Lambda, GCP Cloud Run — face a coordination problem that on-premises systems can hand off to a transactional coordinator. There is no XA. There is no two-phase commit between the database and the external payment provider (Stripe, Adyen, Square). The serverless function may terminate at any moment; its persistent state must be the database; the database must therefore encode enough state to recover from the function’s mid-flight termination without producing double payments or stuck records.

The problem manifests concretely in any system that pays out to external accounts based on internal events. An autonomous-agent platform that settles bounty payouts to squad accounts, a marketplace that pays out to seller accounts, a creator-economy platform that distributes royalties — all must implement the same pattern: read pending settlements from a database, issue external transfers, mark settlements as transferred, handle the case where the function dies between any two steps.

We describe the two-pass stuck-recovery pattern: a state-machine that uses two passes through the pending-settlement queue per cron iteration, with deterministic idempotency keys derived from the settlement’s content rather than from external state, and explicit timeout windows that allow recovery of stuck rows without coordinator presence. The pattern is deployed in production for an autonomous-agent settlement flow with empirical operating data we report below.

2. The state machine

A settlement row carries a status field that takes values from the following enum:

stateDiagram-v2
[]
failed --> disputed: human review
disputed --> [*]

pending — Initial state. The settlement row was inserted by the upstream event (an outcome attribution ratification, in the reference deployment) and has not yet been touched by the settlement cron.

processing — The settlement cron has claimed the row in Pass 1 of the current iteration. The external transfer is either in flight or about to be issued.

transferred — The external transfer succeeded. The row records the external provider’s transfer ID. The state is terminal.

failed — The external transfer failed with a non-retryable error (e.g., insufficient funds in source account, recipient account closed). The row records the failure reason. The state is terminal.

disputed — Human review is required (e.g., the recipient claims non-receipt despite a transferred row in the database). The state requires out-of-band resolution.

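Implicit in the state machine is the property that processing is the only state in which an external transfer might be in flight: pending rows have not yet been issued to the provider, and transferred and failed rows are final as far as the settlement cron is concerned (a row leaves them only through human review, into disputed). Recovery from a function termination during external transfer therefore reduces to: identify rows in processing that have been in that state longer than a timeout window, and reclaim them.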
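The states and transitions described above can be written down as a small lookup table. The following is a minimal sketch, not the reference implementation: the names Status, ALLOWED, and can_transition are hypothetical, the processing self-loop models a Pass 2 reclaim, and the transferred-to-disputed edge is inferred from the non-receipt example rather than stated as a transition in the text.

```python
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    TRANSFERRED = "transferred"
    FAILED = "failed"
    DISPUTED = "disputed"


# Assumed transition table, reconstructed from the state descriptions.
ALLOWED = {
    Status.PENDING: {Status.PROCESSING},  # Pass 1 claim
    Status.PROCESSING: {
        Status.PROCESSING,   # Pass 2 reclaim: status unchanged, claimed_at refreshed
        Status.TRANSFERRED,  # external transfer succeeded
        Status.FAILED,       # non-retryable failure
    },
    Status.TRANSFERRED: {Status.DISPUTED},  # inferred: human review only
    Status.FAILED: {Status.DISPUTED},       # human review only
    Status.DISPUTED: set(),                 # resolved out of band
}


def can_transition(src: Status, dst: Status) -> bool:
    """Return True if the table permits moving a row from src to dst."""
    return dst in ALLOWED[src]


def may_have_transfer_in_flight(status: Status) -> bool:
    """Only a processing row can have an external transfer in flight."""
    return status is Status.PROCESSING
```

Keeping the table explicit makes the recovery argument easy to check: stuck-recovery only ever needs to look at rows for which may_have_transfer_in_flight is true.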

3. The two passes

Each cron iteration runs two passes through the settlement queue:

flowchart LR
CRON["Cron tick<br/>hourly"] --> P1["Pass 1: Claim pending"]
P1 --> P1Q["UPDATE settlement<br/>SET status=processing, claimed_at=now()<br/>WHERE status=pending<br/>RETURNING *"]
P1Q --> P1L["For each claimed row:<br/>compute idempotency_key<br/>issue Stripe transfer<br/>UPDATE status=transferred or failed"]
P1L --> P2["Pass 2: Stuck recovery"]
P2 --> P2Q["UPDATE settlement<br/>SET claimed_at=now()<br/>WHERE status=processing<br/>AND claimed_at < now() - timeout<br/>RETURNING *"]
P2Q --> P2L["For each reclaimed row:<br/>same idempotency_key<br/>same Stripe transfer<br/>idempotent re-issue"]
P2L --> END[Cron complete]

Pass 1 — Claim pending settlements. The pass executes a single SQL statement that atomically transitions all pending rows to processing with the current timestamp recorded as claimed_at. The atomicity is guaranteed by the database’s row-level locking; concurrent cron iterations cannot both claim the same row. The pass returns the claimed rows for processing.
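The claim step can be sketched against an in-memory SQLite database as a stand-in for the production database, which the paper does not name. Two details are assumptions made for portability: the table has only the columns needed here, and the claim is written as a SELECT followed by an UPDATE inside one write-locked transaction instead of a single UPDATE with RETURNING, because older SQLite builds lack RETURNING. The function name claim_pending is hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE settlement (
    id INTEGER PRIMARY KEY,
    status TEXT NOT NULL DEFAULT 'pending',
    claimed_at REAL
)
"""


def claim_pending(conn: sqlite3.Connection, now: float) -> list:
    """Move every pending row to processing and return the claimed ids."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        ids = [row[0] for row in conn.execute(
            "SELECT id FROM settlement WHERE status = 'pending' ORDER BY id"
        )]
        conn.execute(
            "UPDATE settlement SET status = 'processing', claimed_at = ? "
            "WHERE status = 'pending'",
            (now,),
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return ids


def demo_claim():
    # isolation_level=None lets the sketch issue BEGIN/COMMIT explicitly.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO settlement (id) VALUES (?)", [(1,), (2,), (3,)])
    first = claim_pending(conn, now=1000.0)   # claims all three rows
    second = claim_pending(conn, now=1001.0)  # nothing left to claim
    return first, second
```

A second call finds no pending rows, which is the behavior that keeps overlapping cron iterations from processing the same settlement twice.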

External transfer. For each claimed row, the function:

  • Computes the idempotency key (see §4)
  • Issues the external transfer with the idempotency key as the request’s idempotency parameter
  • On success, updates the row to transferred with the external transfer ID
  • On non-retryable failure, updates the row to failed with the failure reason
  • On retryable failure, leaves the row in processing (will be picked up in Pass 2 of a future iteration)
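The per-row handling listed above can be sketched as a single function with three outcomes. This is an illustration under stated assumptions: Row, RetryableError, NonRetryableError, and the issue_transfer callable are hypothetical names, and how a real provider client distinguishes retryable from non-retryable errors is not specified in the text.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class RetryableError(Exception):
    """Transient failure such as a rate limit or network error."""


class NonRetryableError(Exception):
    """Permanent failure such as a closed recipient account."""


@dataclass
class Row:
    id: int
    status: str = "processing"
    transfer_id: Optional[str] = None
    failure_reason: Optional[str] = None


def settle(row: Row, key: str, issue_transfer: Callable[[str], str]) -> Row:
    """Issue one transfer for a claimed row and record the outcome."""
    try:
        transfer_id = issue_transfer(key)
    except NonRetryableError as exc:
        row.status = "failed"
        row.failure_reason = str(exc)
    except RetryableError:
        # Leave the row in processing; a later Pass 2 reclaims it.
        pass
    else:
        row.status = "transferred"
        row.transfer_id = transfer_id
    return row
```

The retryable branch deliberately writes nothing: the row's claimed_at timestamp is what later makes it eligible for reclaim.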

Pass 2 — Stuck recovery. The pass executes a SQL statement that re-claims rows in processing whose claimed_at is older than a configured timeout (typically 5 minutes for Stripe; longer for slower providers). Re-claiming updates claimed_at to the current time and returns the row for re-processing. The function then re-issues the external transfer with the same idempotency key.
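The reclaim query can be sketched the same way, again using in-memory SQLite as a stand-in and a SELECT plus UPDATE inside one write-locked transaction instead of UPDATE with RETURNING. The function name reclaim_stuck, the reduced table schema, and the use of epoch seconds for claimed_at are assumptions of this sketch; the 5-minute default comes from the text.

```python
import sqlite3

TIMEOUT_SECONDS = 5 * 60  # the text's typical timeout for Stripe


def reclaim_stuck(conn, now, timeout=TIMEOUT_SECONDS):
    """Refresh claimed_at on stuck processing rows and return their ids."""
    cutoff = now - timeout
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        ids = [row[0] for row in conn.execute(
            "SELECT id FROM settlement "
            "WHERE status = 'processing' AND claimed_at < ? ORDER BY id",
            (cutoff,),
        )]
        conn.execute(
            "UPDATE settlement SET claimed_at = ? "
            "WHERE status = 'processing' AND claimed_at < ?",
            (now, cutoff),
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return ids


def demo_reclaim():
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE settlement ("
        "id INTEGER PRIMARY KEY, status TEXT NOT NULL, claimed_at REAL)"
    )
    conn.executemany("INSERT INTO settlement VALUES (?, ?, ?)", [
        (1, "processing", 0.0),    # claimed 1000 s ago: stuck
        (2, "processing", 900.0),  # claimed 100 s ago: within the timeout
        (3, "transferred", 0.0),   # not in processing: never reclaimed
    ])
    first = reclaim_stuck(conn, now=1000.0)
    second = reclaim_stuck(conn, now=1001.0)  # row 1 was just refreshed
    return first, second
```

Refreshing claimed_at is what prevents a second, overlapping iteration from reclaiming the same row while the re-issue is still running.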

The crucial property is that re-issuing an external transfer with the same idempotency key is safe. The external provider returns one of the following:

  • The original transfer’s result if the original succeeded (no double charge)
  • The original transfer’s result if the original was in-flight when our function terminated (idempotency window guarantees same outcome)
  • A new transfer attempt if the original failed and the idempotency key has been retired by the provider

In all three cases, the outcome is consistent: at most one successful transfer occurs per idempotency key.
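The replay behavior the pattern relies on can be modeled with a toy in-memory provider. This is not Stripe's API: FakeProvider and create_transfer are hypothetical names, and the model covers only the first case above, where a repeated key returns the stored result.

```python
import itertools


class FakeProvider:
    """Toy provider that stores one result per idempotency key."""

    def __init__(self):
        self._results = {}             # idempotency key -> transfer id
        self._ids = itertools.count(1)
        self.transfers_created = 0

    def create_transfer(self, amount_micros: int, idempotency_key: str) -> str:
        # A repeated key returns the stored result without moving money again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.transfers_created += 1
        transfer_id = f"tr_fake_{next(self._ids)}"
        self._results[idempotency_key] = transfer_id
        return transfer_id
```

Calling create_transfer twice with the same key yields the same transfer id and leaves transfers_created at 1, which is the property Pass 2 depends on when it re-issues a stuck row.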

4. The idempotency key

The idempotency key is deterministic, derived entirely from the settlement’s content:

idempotency_key = sha256(attribution_id || squad_id || net_micros)

Where attribution_id is the outcome attribution that triggered the settlement, squad_id is the recipient, and net_micros is the payout amount in micro-units of currency.
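A minimal sketch of the derivation follows. The text does not specify how the three fields are encoded before hashing, so the choices here are assumptions: string identifiers, an integer amount, UTF-8 encoding, a "|" separator between fields, and a hex digest as the key. The function name idempotency_key mirrors the formula; the separator assumes the identifiers never contain "|".

```python
import hashlib


def idempotency_key(attribution_id: str, squad_id: str, net_micros: int) -> str:
    """Derive a deterministic key from the settlement's content."""
    # Assumption: fields are joined with "|" so field boundaries stay visible.
    material = f"{attribution_id}|{squad_id}|{net_micros}".encode("utf-8")
    return hashlib.sha256(material).hexdigest()
```

The result is a 64-character hex string that depends only on the three inputs, so any invocation, on any machine, computes the same key for the same settlement.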

Determinism is the load-bearing property. A non-deterministic idempotency key (e.g., a UUID generated at function-start) loses its idempotency on retry: if the function dies after issuing the transfer but before recording the UUID, the next retry generates a new UUID and the external provider treats it as a different transfer. Determinism eliminates this failure mode.
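The difference can be made concrete by counting how many distinct keys a provider would see across retries. In this sketch each loop iteration stands for a fresh function invocation that has lost all in-memory state; count_distinct_keys, random_key, and content_key are hypothetical names, and the hashed byte string is an example value.

```python
import hashlib
import uuid


def count_distinct_keys(make_key, attempts: int = 2) -> int:
    """Count the distinct keys produced across independent attempts.

    A provider that stores one result per key would create one transfer
    per distinct key, so this is also the number of transfers issued.
    """
    seen = set()
    for _ in range(attempts):  # each attempt re-derives the key from scratch
        seen.add(make_key())
    return len(seen)


def random_key() -> str:
    # Regenerated on every invocation, so a retry looks like a new transfer.
    return str(uuid.uuid4())


def content_key() -> str:
    # Derived from settlement content (example values), so retries agree.
    return hashlib.sha256(b"attr-1|squad-9|2500000").hexdigest()
```

With two attempts, the random key yields two distinct keys and therefore two transfers, while the content-derived key yields one.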

Deterministic content-based keys also provide a useful side-property: two operators independently computing the key for the same settlement produce the same key. This enables disaster-recovery scenarios where a backup database is used to recover settlement state and the recovery operator needs to verify that pending settlements in the backup match transfers in the external provider’s records.

5. Failure modes the pattern tolerates

We enumerate the failure modes the pattern is designed to tolerate, with the recovery action.

Function termination after Pass 1 claim, before transfer. The row is in processing with no transfer issued. Pass 2 of a future iteration will reclaim the row after the timeout and re-issue the transfer. Outcome: transfer issued exactly once, in a future iteration.

Function termination during transfer (request in flight to external provider). The row is in processing. The transfer may have reached the external provider or may not have. Pass 2 reclaims the row and re-issues with the same idempotency key. If the original request reached the provider, the provider returns the original transfer’s result; if it did not, the re-issue is processed as the first attempt. Outcome: transfer issued exactly once.

Function termination after transfer success, before status update. The row is in processing. The provider has the transfer recorded as successful. Pass 2 reclaims the row and re-issues with the same idempotency key. The provider returns the original (successful) result. The function updates the row to transferred with the original transfer ID. Outcome: transfer issued exactly once, status correctly recorded on retry.

External provider returns retryable failure (rate limit, transient network error). The function leaves the row in processing. Pass 2 of a future iteration reclaims and retries. After a configurable retry budget is exceeded, the function transitions the row to failed with reason='retry_budget_exhausted'. Outcome: bounded retry, eventual progress.
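The retry budget can be sketched as a pure function over an attempt counter. The text does not say how attempts are tracked, so the attempts counter (for example, a column on the settlement row) and the function name after_retryable_failure are assumptions; the reason string is the one given in the text.

```python
def after_retryable_failure(attempts: int, retry_budget: int):
    """Return (attempts, status, reason) after one more retryable failure."""
    attempts += 1
    if attempts > retry_budget:
        # Budget exceeded: stop retrying and surface the row as failed.
        return attempts, "failed", "retry_budget_exhausted"
    # Budget not yet exceeded: stay in processing for a later Pass 2.
    return attempts, "processing", None
```

With a budget of 3, the first three retryable failures leave the row in processing and the fourth moves it to failed, which bounds how long a row can cycle through Pass 2.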

Concurrent cron iterations. The atomic Pass 1 SQL statement guarantees that exactly one cron iteration claims any given pending row. Concurrent iterations process disjoint sets of rows. Outcome: no double-processing.

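Idempotency key collision (different settlements producing the same key). The deterministic derivation sha256(attribution_id || squad_id || net_micros) produces the same key only when the concatenated inputs are identical, up to the negligible probability of a SHA-256 collision. This assumes the fields are encoded unambiguously (fixed-width or delimited), so that two different field triples cannot concatenate to the same byte string. The substrate’s upstream invariants ensure each attribution_id is unique per outcome ratification; therefore distinct settlements have distinct idempotency keys. Outcome: no collision under upstream invariants.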
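One way a collision could arise without any hash weakness is ambiguous concatenation of variable-length fields. The sketch below illustrates this with hypothetical string identifiers; whether the reference deployment uses variable-length identifiers or a separator is not stated, and the names naive_key and delimited_key are assumptions.

```python
import hashlib


def naive_key(attribution_id: str, squad_id: str, net_micros: int) -> str:
    """Concatenate fields with no separator before hashing."""
    material = f"{attribution_id}{squad_id}{net_micros}".encode("utf-8")
    return hashlib.sha256(material).hexdigest()


def delimited_key(attribution_id: str, squad_id: str, net_micros: int) -> str:
    """Concatenate fields with a "|" separator before hashing."""
    # Assumes the identifiers themselves never contain "|".
    material = f"{attribution_id}|{squad_id}|{net_micros}".encode("utf-8")
    return hashlib.sha256(material).hexdigest()


# Two different settlements whose fields concatenate to the same string.
A = ("att-1", "23", 500)
B = ("att-12", "3", 500)
```

Both triples concatenate to "att-123500", so naive_key returns the same digest for A and B, while delimited_key hashes "att-1|23|500" and "att-12|3|500" and returns different digests.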

6. Failure modes the pattern does not tolerate

We enumerate the failure modes the pattern is not designed to tolerate, with the design rationale.

External provider’s idempotency window expires before retry. Stripe’s idempotency window is 24 hours. If the function terminates and Pass 2 retries do not run for more than 24 hours, the external provider may treat the retry as a new transfer. The pattern requires that the cron run frequently enough that retries land within the idempotency window; the reference deployment runs hourly, well within the budget.
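A guard for this failure mode can be sketched as a check on the age of the first attempt. The 24-hour window is the figure stated above; the first_issued_at timestamp (which would have to be stored separately, since reclaim overwrites claimed_at), the one-hour safety margin, and the function name safe_to_reissue are assumptions of this sketch and not part of the described pattern.

```python
from datetime import datetime, timedelta, timezone

IDEMPOTENCY_WINDOW = timedelta(hours=24)  # figure stated in the text for Stripe
SAFETY_MARGIN = timedelta(hours=1)        # hypothetical: one cron interval


def safe_to_reissue(
    first_issued_at: datetime,
    now: datetime,
    window: timedelta = IDEMPOTENCY_WINDOW,
    margin: timedelta = SAFETY_MARGIN,
) -> bool:
    """Return True while a re-issue still lands inside the provider's window."""
    return now - first_issued_at < window - margin
```

A row for which this returns False would need to be handled outside the automatic retry path, for example by routing it to human review, because the provider may no longer recognize the key.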

Database outage between Pass 1 and external transfer. If the database becomes unavailable between the Pass 1 status update and the external transfer attempt, the function cannot retry within the same iteration. The next iteration’s Pass 2 will reclaim the row when the database recovers. The pattern depends on the database being durable; database loss is out of scope.

External provider’s transfer record loss. If the external provider loses its record of the transfer (extremely unusual; most providers maintain redundant records), the substrate’s database may show transferred for a transfer the provider has no record of. Reconciliation between the substrate and the provider is required for this case, which is out of scope for the pattern itself.

Adversarial replay of attempted transfers. The pattern assumes a non-malicious operator. An operator who deliberately re-issues transfers with new idempotency keys can cause double-payment. The substrate’s adversarial-parallel gating (Mumega 200.001) probes this attack vector at the gate layer; the pattern itself does not mitigate it.

7. Empirical operating data

The pattern is deployed in production for an autonomous-agent settlement flow. We report aggregate operating data:

  • Settlements processed: approximately 7 audit-chain-anchored settlements during the operating window
  • Stuck-recovery activations: zero; no Pass 2 reclaims fired in the observed window because all settlements completed within the 5-minute reclaim timeout configured for Stripe
  • Double-payment incidents: zero
  • Failed settlements requiring human review: zero
  • Idempotency-key collisions: zero

The Pass 2 path is therefore operationally cold in the current deployment. The pattern’s value is insurance: when the failure modes that motivate Pass 2 occur (function termination, transient provider failures, database delays), the recovery path is mechanical and does not require operator intervention. The pattern’s correctness is verified through adversarial-parallel gating; the pattern’s empirical activation rate is currently zero.

We report the cold-path property explicitly because it is sometimes mistaken for evidence that the pattern is unnecessary. The opposite reading is appropriate: the pattern’s intended outcome is that recovery is mechanical and rare. Frequent activation would indicate an underlying reliability problem in either the function execution or the external provider; rare activation is the design intent.

8. Adversarial-parallel verification

The pattern is verified through adversarial-parallel gating in the substrate’s gate function. The relevant adversarial probes:

flowchart TD
G[Gate function] --> P1[Probe: double-payment via cron retry]
G --> P2[Probe: platform fee bypass]
G --> P3[Probe: stuck-recovery race]
G --> P4[Probe: idempotency key non-determinism]
G --> P5[Probe: cross-tenant attribution leak]

All five probes pass in the gate’s adversarial-parallel review. The probes are codified in the substrate’s hermetic test suite and run on every gate filing.

9. Comparison to alternative patterns

We discuss why two-pass stuck-recovery was selected over alternatives.

XA / two-phase commit. Requires a transactional coordinator that the serverless platform does not provide. Rejected as inapplicable.

Saga pattern with compensating transactions. Requires defining an inverse for the external transfer (i.e., a refund operation). Stripe supports refunds, but the operational complexity of the saga is higher than the two-pass pattern, and the saga adds latency that the two-pass pattern avoids. Rejected as overkill for the bounded-error case.

Event sourcing with replayable command log. Requires storing every settlement command in an append-only log and replaying from the log on recovery. The substrate’s audit chain provides an analog (Mumega 200.001 §6, Mumega 200.102), but using the audit chain for command replay rather than for compliance is a different protocol layer. Rejected as conflating two purposes; the audit chain is for tamper-evidence, not for command replay.

External coordinator service. A separate service that orchestrates settlement transfers. Rejected as adding deployment complexity without providing meaningful capability beyond the two-pass pattern.

The two-pass pattern is the simplest correct approach we identified for the serverless-without-XA constraint.

10. Forward work

The pattern’s empirical activation rate is currently zero. Larger-scale deployment is required to validate that Pass 2 fires correctly when the failure modes that motivate it actually occur. Forward work includes:

  • Deliberate fault injection in development environments to verify Pass 2 recovery
  • Cross-provider validation (Adyen, Square, Wise) to confirm the idempotency-key contract holds across providers
  • Integration with the audit-chain compliance evidence package (Mumega 200.102) so that settlement transfers are independently verifiable by regulators

11. Conclusion

The two-pass stuck-recovery pattern handles distributed payment settlement in serverless environments without a transactional coordinator. The pattern uses deterministic idempotency keys derived from settlement content, atomic claim transitions in Pass 1, and timeout-based reclaim transitions in Pass 2. The pattern tolerates function termination at any point in the settlement lifecycle without producing double payments.

The pattern is deployed in production with empirical operating data showing zero activation of the recovery path under normal operation. The recovery path’s correctness is verified through adversarial-parallel gating, not through frequent empirical activation. The pattern is appropriate for any serverless system that issues external payments based on internal events.


The reference implementation is being prepared for open-source release under AGPL-3.0. This paper is a companion to Mumega 200.102 (EU AI Act Article 12 Reference Implementation), which covers the audit-chain integration that records each settlement state transition.
