Mumega

What We Learned Building an AI Coordination Substrate From Scratch

We shipped 12 sprints of production infrastructure in 48 hours. Not prototypes — production systems with cryptographic audit trails, tenant isolation, and adversarial security reviews. Along the way, we learned things about building AI coordination systems that we haven’t seen written down anywhere else.

This isn’t a product pitch. These are engineering and architectural lessons for anyone building AI agent systems — whether you’re using Claude Code, OpenAI’s Workspace Agents, Google’s ADK, CrewAI, or rolling your own.

1. Adversarial Review Catches What Correctness Review Can’t

We run every piece of code through two parallel gates: a correctness review (does it work?) and an adversarial review (can it be broken?). These are orthogonal concerns. You can write perfectly correct code that has exploitable attack surfaces.

In 12 sprints, the adversarial gate caught 21 real issues that passed correctness review:

  • A .format() call on user-controlled Discord messages that allowed template injection via {
  • TOTP MFA with a 90-second window instead of 30-second (replay-vulnerable)
  • SHA-256-hashed backup codes brute-forceable on a GPU in under 5 seconds
  • Row-level security policies with USING(true) that provided zero tenant isolation
  • A delivery-confirmation check that returned delivered=True for any online agent regardless of whether the message was actually read

None of these were bugs in the traditional sense. The code worked. It just wasn’t safe.

Lesson: If you’re building multi-tenant AI agent systems, run adversarial review in parallel with correctness review, not after it. The cost of sequential review is that you ship the vulnerability before you find it.
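To make the first item in that list concrete, here is a minimal sketch of how a .format() call turns user text into a template. The function names and the message shape are hypothetical; the point is only that the unsafe version passes a correctness review (it renders normal messages fine) while the safe version keeps the template constant.

```python
def render_unsafe(user_message: str, author: str) -> str:
    # BUG: the user's text becomes part of the format template itself.
    template = "[{author}] " + user_message
    return template.format(author=author)


def render_safe(user_message: str, author: str) -> str:
    # The template is a constant; user text is only ever a substituted value.
    return "[{author}] {body}".format(author=author, body=user_message)


if __name__ == "__main__":
    probe = "hi {author.__class__}"
    print(render_unsafe(probe, "bob"))  # the placeholder is evaluated
    print(render_safe(probe, "bob"))    # the placeholder stays literal text
    try:
        render_unsafe("unbalanced {", "bob")
    except ValueError as exc:
        print("handler crashed:", exc)
```

A single stray brace is enough to crash the unsafe handler, and a crafted placeholder can read attributes of whatever objects are passed to format().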

2. The Savepoint Pattern for Idempotent Webhooks

When a Stripe webhook fires payment_intent.succeeded, it might fire multiple times. Your handler needs to be idempotent. The naive approach — check if a row exists, then insert — has a TOCTOU race: two concurrent webhook deliveries both pass the check before either inserts.

The pattern that works with PostgreSQL and asyncpg:

import asyncpg

# Assumes a UNIQUE constraint on webhook_processed.event_id.
# The wrapper name and the final return value are illustrative.
async def handle_webhook(conn: asyncpg.Connection, new_id, event_id) -> dict:
    async with conn.transaction():  # outer transaction
        try:
            async with conn.transaction():  # inner = SAVEPOINT
                await conn.execute(
                    "INSERT INTO webhook_processed (id, event_id, status) "
                    "VALUES ($1, $2, 'processing')",
                    new_id, event_id,
                )
        except asyncpg.UniqueViolationError:
            # Someone else got here first: check their status
            existing = await conn.fetchrow(
                "SELECT status FROM webhook_processed WHERE event_id = $1",
                event_id,
            )
            if existing["status"] == "processed":
                return {"ok": True, "reason": "idempotent_skip"}
            elif existing["status"] == "processing":
                return {"ok": False, "reason": "retry_in_flight"}
            else:
                return {"ok": False, "reason": "prior_attempt_failed"}

        # ... do the actual work inside the outer transaction ...

        await conn.execute(
            "UPDATE webhook_processed SET status='processed' WHERE id=$1",
            new_id,
        )
    return {"ok": True, "reason": "processed"}

The inner async with conn.transaction() creates a SAVEPOINT. If the INSERT hits a UniqueViolationError, the savepoint rolls back but the outer transaction survives. You can then query the existing row to make a status-aware decision.

Key insight: if the work fails mid-way, the outer transaction rolls back, the processing row disappears, and Stripe’s next retry gets a fresh INSERT. Retry-safe by construction.

We use this pattern for payment webhooks, subscription creation, agent minting, and any state transition that external systems might trigger multiple times.
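The retry-safety claim can be exercised without a PostgreSQL server. The sketch below uses SQLite from the standard library purely as a stand-in, with explicit SAVEPOINT statements in place of asyncpg's nested transaction; the table is simplified and the work callback is hypothetical.

```python
import sqlite3
from typing import Callable


def open_db() -> sqlite3.Connection:
    # isolation_level=None puts sqlite3 in autocommit mode, so BEGIN,
    # SAVEPOINT and COMMIT are issued explicitly below.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE webhook_processed ("
        "event_id TEXT PRIMARY KEY, status TEXT NOT NULL)"
    )
    return conn


def handle_event(conn: sqlite3.Connection, event_id: str,
                 work: Callable[[], None]) -> dict:
    conn.execute("BEGIN")  # outer transaction
    try:
        conn.execute("SAVEPOINT claim")  # inner savepoint
        try:
            conn.execute(
                "INSERT INTO webhook_processed (event_id, status) "
                "VALUES (?, 'processing')",
                (event_id,),
            )
            conn.execute("RELEASE SAVEPOINT claim")
        except sqlite3.IntegrityError:
            # Duplicate delivery: undo only the savepoint, then inspect the row.
            conn.execute("ROLLBACK TO SAVEPOINT claim")
            conn.execute("RELEASE SAVEPOINT claim")
            (status,) = conn.execute(
                "SELECT status FROM webhook_processed WHERE event_id = ?",
                (event_id,),
            ).fetchone()
            conn.execute("COMMIT")
            if status == "processed":
                return {"ok": True, "reason": "idempotent_skip"}
            return {"ok": False, "reason": "retry_in_flight"}

        work()  # the actual side effects, inside the outer transaction

        conn.execute(
            "UPDATE webhook_processed SET status = 'processed' "
            "WHERE event_id = ?",
            (event_id,),
        )
        conn.execute("COMMIT")
        return {"ok": True, "reason": "processed"}
    except Exception:
        # Failed work rolls back the claim row too, so a retry starts fresh.
        conn.execute("ROLLBACK")
        raise
```

Calling handle_event twice with the same event_id runs the work once; if the work raises, the claim row is gone and the next delivery gets a clean INSERT.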

3. The Seed Pattern: Minimum Viable Agent Deployment

When deploying an AI agent into a new business, the temptation is to configure everything upfront: connect all their tools, set up all the rules, customize the personality. This doesn’t scale and it delays the first value delivery.

Instead, we use the “seed” pattern:

Day 0: Plant. Deploy the agent with a name, an identity, and a connection to one channel (Discord, Slack, or Teams). That’s it. One tool connection. One communication surface.

Day 1: Introduce. The agent sends one message: “Hi, I’m [name]. I’ll be quiet for a few days while I learn your rhythm.”

Day 3: First insight. After observing 20+ messages, the agent shares its first observation: “I noticed you get most leads on Tuesday mornings.” This earns trust before the agent tries to intervene.

Week 2: First intervention. Only after the business has accepted the agent’s observations does it start actively nudging: “You haven’t followed up with [contact] in 8 days.”

Ongoing: Earn trust. Accepted suggestions increase intervention frequency. Rejected ones decrease it. The agent adapts to the business, not the other way around.
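One way to picture the accept/reject feedback loop is a bounded counter that caps how many nudges the agent may send per day. This is a sketch under assumed numbers (starting level, floor, ceiling, step of one), not a description of the production scoring.

```python
from dataclasses import dataclass


@dataclass
class NudgeBudget:
    level: int = 2    # hypothetical: nudges allowed per day right now
    floor: int = 0    # the agent can be silenced entirely
    ceiling: int = 5  # hypothetical upper bound on daily nudges

    def record(self, accepted: bool) -> None:
        # Accepted suggestions raise the budget, rejected ones lower it,
        # always clamped to [floor, ceiling].
        step = 1 if accepted else -1
        self.level = max(self.floor, min(self.ceiling, self.level + step))

    def may_nudge(self, sent_today: int) -> bool:
        return sent_today < self.level
```

A business that keeps rejecting suggestions drives the budget to zero and the agent goes quiet without anyone touching a setting.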

The technical implementation is a series of small, independent rules (we call them “ruliads” after Wolfram’s concept) that each fire independently based on state:

from datetime import datetime, timezone

def ruliad_stale_deal(deal: dict, stale_days: int = 7) -> dict | None:
    # last_action_at is expected to be a timezone-aware datetime
    last_action = deal.get("last_action_at")
    if not last_action:
        return None
    age = (datetime.now(timezone.utc) - last_action).days
    if age <= stale_days:
        return None
    if deal.get("stage") in ("closed-won", "closed-lost"):
        return None
    return {
        "action": "send_message",
        "text": f"Stale deal: {deal['contact_name']}, {age} days since last action.",
    }

Each ruliad is 3-10 lines. Dumb alone. Together, 24 of them produce intelligent organizational coordination — like cellular automata producing complex behavior from simple rules.
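How the rules combine can be sketched as a plain loop: every rule sees the same state, and whatever they return is collected. The two rules and the field names below are hypothetical stand-ins, chosen only to show independent rules firing on one record.

```python
from typing import Callable, Optional

Ruliad = Callable[[dict], Optional[dict]]


def ruliad_missing_owner(deal: dict) -> Optional[dict]:
    # Fires when nobody is assigned to the deal.
    if deal.get("owner"):
        return None
    return {"action": "send_message", "text": f"No owner on {deal['contact_name']}."}


def ruliad_large_deal(deal: dict, threshold: int = 10_000) -> Optional[dict]:
    # Fires when the deal value crosses a threshold.
    if deal.get("value", 0) < threshold:
        return None
    return {"action": "send_message", "text": f"Large deal: {deal['contact_name']}."}


def run_ruliads(state: dict, ruliads: list[Ruliad]) -> list[dict]:
    # Each rule is evaluated independently; None means "nothing to say".
    signals = []
    for rule in ruliads:
        result = rule(state)
        if result is not None:
            signals.append(result)
    return signals
```

Because no rule knows about any other, adding a twenty-fifth is a matter of appending one function to the list.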

Lesson: Don’t try to deploy a fully-configured AI system on day one. Deploy a seed. Let it earn trust. The business will tell you what it needs through its acceptance and rejection patterns.

4. Signals, Not Actions: The Nervous System Architecture

Early on, we built systems where the AI agent would DO things: send emails, manage ads, create content, update CRM records. This created three problems:

  1. Blame surface. If the AI sends a bad email, it’s your fault. If the AI writes wrong content, it’s your fault. Every action is a liability.

  2. Support burden. “The AI sent the wrong thing to my client.” Now you’re debugging someone else’s business context at 2am.

  3. Replaceability. If you’re doing the work, you’re an agency. Agencies get fired and replaced by cheaper agencies.

We restructured everything around signals:

  • The agent OBSERVES that a deal is stale. It doesn’t send the follow-up — it tells the human.
  • The agent NOTICES a competitor changed their pricing page. It doesn’t adjust prices — it alerts.
  • The agent DETECTS that a team member’s response time is increasing. It doesn’t diagnose burnout — it privately asks if everything is okay (with explicit opt-in consent).

The agent is the nervous system. It detects. It signals. It remembers. The human is the muscles. They act.

SIGNAL (what we build):
  "Deal with Acme is stale — 9 days no action."
  → Customer decides what to do. Our responsibility: accuracy of the signal.

ACTION (what we stopped building):
  "I sent a follow-up email to Acme on your behalf."
  → If the email is wrong: our fault. Our responsibility: the entire outcome.

Lesson: AI coordination systems should provide awareness, not agency. The moment your AI takes actions on behalf of humans, you inherit the liability for those actions. Signals scale. Actions don’t.
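In code, the boundary can be made structural: a signal type that can be rendered and delivered to a person, with no method that touches the customer's systems. A minimal sketch with hypothetical names follows; the notify callable stands in for whatever posts to the team's channel.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Signal:
    subject: str  # what the observation is about
    finding: str  # what was observed

    def render(self) -> str:
        return f"{self.subject}: {self.finding}"


def detect_stale_deal(contact: str, idle_days: int,
                      threshold: int = 7) -> Optional[Signal]:
    # Pure observation: returns a Signal or nothing, never acts.
    if idle_days <= threshold:
        return None
    return Signal(f"Deal with {contact} is stale", f"{idle_days} days no action.")


def deliver(signal: Signal, notify: Callable[[str], None]) -> None:
    # The only side effect is a message to a human.
    notify(signal.render())
```

There is deliberately no send_email or update_crm here; the human reading the message decides what happens next.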

5. Filesystem as Registry: Kill the Configuration Database

Every multi-tenant platform eventually faces the question: where do you store the configuration for each tenant type, vertical, or deployment variant?

The obvious answer is a database table or a JSON config file with a schema. We tried this. It created a coordination problem: every new vertical required a database migration, a code change to the config loader, and a UI update to the admin panel.

The answer that works: the filesystem IS the registry.

sos/services/seeds/packs/
├── generic/
│   └── seed.json          ← config for generic businesses
├── real-estate/
│   └── seed.json          ← config for real estate
├── dental/
│   └── seed.json          ← config for dental clinics
└── grants/
    └── seed.json          ← config for grant funding

Adding a new vertical = creating a directory with a seed.json file. No migration. No code change. No admin panel update. The deployment script scans the directory:

from pathlib import Path

PACKS_DIR = Path("sos/services/seeds/packs")  # the directory shown above

def _scan_available_verticals() -> list[str]:
    return [d.name for d in PACKS_DIR.iterdir()
            if d.is_dir() and not d.name.startswith(".")]

Lesson: If your registry is small (tens to hundreds of entries, not millions), use the filesystem. It’s versioned by git, reviewable in PRs, deployable with no migration, and scannable with three lines of code. Databases are for data that changes at runtime. Configuration changes at deploy time.
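Loading a pack is equally small. The sketch below assumes each pack directory holds a seed.json as in the tree above; the function names and the error handling are illustrative, and the packs directory is passed in so the code can be run against a temporary folder.

```python
import json
from pathlib import Path


def available_verticals(packs_dir: Path) -> list[str]:
    # Same scan as the deployment script, sorted for stable output.
    return sorted(d.name for d in packs_dir.iterdir()
                  if d.is_dir() and not d.name.startswith("."))


def load_seed(packs_dir: Path, vertical: str) -> dict:
    # Fail loudly at deploy time if the vertical has no pack.
    if vertical not in available_verticals(packs_dir):
        raise ValueError(f"unknown vertical: {vertical!r}")
    with (packs_dir / vertical / "seed.json").open(encoding="utf-8") as fh:
        return json.load(fh)
```

An unknown vertical raises before anything is deployed, which is the registry's only validation step.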

6. The 16D Identity Vector: Beyond API Keys

Every agent in our system has a 16-dimensional identity vector derived from a SHA-256 hash of its name and role. This isn’t decoration — it determines:

  • Visual identity. The vector maps to aesthetic parameters (color temperature, geometry style, contrast, texture) that produce a unique generative art portrait for each agent.

  • Behavioral character. Different dimensions map to communication style, intervention frequency, formality level, and domain expertise weighting.

  • Resonance routing. When a new signal arrives (a deal, a lead, an event), the signal’s characteristics can be matched against agent vectors to determine which agent should handle it.

The vector is deterministic (same name + role always produces the same vector), immutable (the identity doesn’t change), and cryptographically verifiable (the QNFT registry entry is hash-chained).
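A minimal sketch of the derivation, under assumptions the text does not spell out: name and role are joined with a colon, the 32-byte SHA-256 digest is split into 16 two-byte chunks, and each chunk is scaled to the range 0 to 1. The cosine-similarity routing is likewise one plausible reading of "resonance", not the confirmed method.

```python
import hashlib
import math


def identity_vector(name: str, role: str) -> list[float]:
    # Assumed encoding: "name:role" as UTF-8. SHA-256 yields 32 bytes,
    # which split evenly into 16 two-byte dimensions.
    digest = hashlib.sha256(f"{name}:{role}".encode("utf-8")).digest()
    return [int.from_bytes(digest[i:i + 2], "big") / 65535
            for i in range(0, 32, 2)]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def route(signal_vector: list[float], agents: dict[str, list[float]]) -> str:
    # Pick the agent whose identity vector points most nearly the same way.
    return max(agents, key=lambda name: cosine(signal_vector, agents[name]))
```

Because the vector is a pure function of name and role, any party can recompute it and compare against the registry entry.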

Lesson: Agent identity should be richer than an API key. When agents have persistent, verifiable, multi-dimensional identity, you can build routing, trust, and coordination systems that aren’t possible with flat authentication tokens.

7. Mock Transactions Suppress Exceptions by Default

This one cost us hours of debugging. When mocking asyncpg transactions in Python tests:

from unittest.mock import AsyncMock

mock_tx = AsyncMock()
mock_tx.__aenter__ = AsyncMock(return_value=None)
mock_tx.__aexit__ = AsyncMock(return_value=False)  # ← CRITICAL: falsy, so exceptions propagate

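The trap is an __aexit__ that returns a mock instead of a real boolean. A bare AsyncMock() with no return_value hands back another mock object when awaited, and mock objects are truthy. In Python’s context manager protocol, a truthy return from __aexit__ means “suppress the exception.” So if the transaction mock’s __aexit__ ends up as a bare AsyncMock(), your test’s raise RuntimeError("mint_failed") gets silently swallowed, the transaction appears to succeed, and your rollback test passes when it shouldn’t.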

Always explicitly set __aexit__ = AsyncMock(return_value=False) to propagate exceptions correctly.
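The difference is easy to reproduce in isolation. The sketch below builds two transaction mocks that differ only in what __aexit__ returns and reports whether an exception raised inside the async with block escapes; the helper names are hypothetical.

```python
import asyncio
from unittest.mock import AsyncMock


def make_tx(aexit: AsyncMock) -> AsyncMock:
    tx = AsyncMock()
    tx.__aenter__ = AsyncMock(return_value=None)
    tx.__aexit__ = aexit
    return tx


async def exception_escapes(tx: AsyncMock) -> bool:
    """Return True if an error raised inside the block propagates out."""
    try:
        async with tx:
            raise RuntimeError("mint_failed")
    except RuntimeError:
        return True
    return False


if __name__ == "__main__":
    swallowing = make_tx(AsyncMock())                 # awaited result is a truthy mock
    correct = make_tx(AsyncMock(return_value=False))  # falsy: exception propagates
    print(asyncio.run(exception_escapes(swallowing)))
    print(asyncio.run(exception_escapes(correct)))
```

Asserting on exception_escapes in a fixture-level test is a cheap guard against the swallowing variant creeping back in.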

8. The WORM Audit Chain as Trust Infrastructure

We write every state-changing action to an append-only, hash-chained audit log stored in R2 with Object Lock (7-year COMPLIANCE retention). Each entry references the previous entry’s SHA-256 hash. A secondary verifier process independently re-reads and re-computes hashes every 15 minutes.

This isn’t a compliance checkbox. It’s trust infrastructure.

When a customer asks “what did the AI do with my data?”, we hand them the chain. When a regulator asks for proof, we hand them the chain. When we onboard an enterprise customer whose CISO needs to sign off, we hand them the chain.

The cost is one additional DB write per action (in the same transaction — if the audit write fails, the action rolls back). The value is that every customer interaction is cryptographically provable. In a world where AI trust is the bottleneck to adoption, this is the difference between “trust us” and “verify it yourself.”

Lesson: Build audit infrastructure before you need it. Adding it retroactively to a production system is architecturally painful. Building it from day one makes every future compliance conversation trivial.
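The chaining itself is only a few lines. Here is an in-memory sketch of the idea, assuming JSON-serializable payloads and an all-zero genesis hash; storage, Object Lock, and the scheduled verifier are out of scope, and the field names are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash for the first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    # Canonical JSON so the same entry always hashes to the same value.
    body = json.dumps({"prev": prev_hash, "payload": payload},
                      sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append(chain: list[dict], payload: dict) -> dict:
    prev = chain[-1]["hash"] if chain else GENESIS
    entry = {"prev": prev, "payload": payload, "hash": entry_hash(prev, payload)}
    chain.append(entry)
    return entry


def verify(chain: list[dict]) -> bool:
    # Recompute every hash and check each entry points at its predecessor.
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev:
            return False
        if entry["hash"] != entry_hash(entry["prev"], entry["payload"]):
            return False
        prev = entry["hash"]
    return True
```

Editing or removing an entry in the middle breaks every hash after it, which is what lets an independent verifier detect tampering by re-reading the log.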

9. The Channel Is the Interface

We spent weeks planning a dashboard. Then we realized: our customers already have their team on Discord or Slack. They already check it 50 times a day. Building a separate dashboard means asking them to check a 51st thing.

Instead, the agent lives in their channel. The daily briefing IS the dashboard. The nudges ARE the notifications. The weekly report link is the “depth” layer for when they want more detail.

Zero new tabs. Zero new logins. Zero new habits to form.

The customers who succeed with AI coordination aren’t the ones with the best dashboards. They’re the ones whose AI meets them where they already are.

10. Deploy-Time Decisions, Not Runtime Decisions

Every decision that can be made at deploy time should be made at deploy time. Runtime decisions are:

  • Harder to debug (state-dependent)
  • Harder to audit (which path was taken?)
  • Harder to test (combinatorial explosion)

Our vertical pack system resolves all vertical-specific behavior at deploy time:

{
  "agent_role": "operations coordinator",
  "ruliads_enabled": ["stale-deal-nudge", "hot-opportunity-flag", ...],
  "compliance": ["RECO", "REBBA", "PIPEDA"],
  "collections": ["listings", "neighborhoods", "blog", "team"]
}

At runtime, the agent doesn’t check “am I a real estate agent or a dental agent?” It was DEPLOYED with the real estate ruliads. It only knows what it was born with. The deploy-time configuration is the identity.

Lesson: Push decisions as early as possible in the lifecycle. Build-time > deploy-time > runtime. Each step earlier reduces the state space your system needs to reason about.
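As a sketch of what "resolved at deploy time" can look like: the ruliads_enabled names from the pack are looked up once, unknown names fail the deploy, and the runtime loop never asks which vertical it is. The registry contents and rule bodies below are hypothetical.

```python
from typing import Callable, Optional

Ruliad = Callable[[dict], Optional[dict]]


def stale_deal_nudge(state: dict) -> Optional[dict]:
    if state.get("idle_days", 0) > 7:
        return {"text": "Stale deal."}
    return None


def hot_opportunity_flag(state: dict) -> Optional[dict]:
    if state.get("score", 0) >= 90:
        return {"text": "Hot opportunity."}
    return None


# Hypothetical name-to-function registry, keyed by the names used in seed.json.
RULIAD_REGISTRY: dict[str, Ruliad] = {
    "stale-deal-nudge": stale_deal_nudge,
    "hot-opportunity-flag": hot_opportunity_flag,
}


def build_agent(config: dict) -> Callable[[dict], list[dict]]:
    # Deploy time: resolve names to functions once, and fail fast on typos.
    names = config["ruliads_enabled"]
    unknown = [n for n in names if n not in RULIAD_REGISTRY]
    if unknown:
        raise ValueError(f"unknown ruliads in config: {unknown}")
    rules = [RULIAD_REGISTRY[n] for n in names]

    def tick(state: dict) -> list[dict]:
        # Runtime: no vertical checks, only the rules the agent was built with.
        return [s for s in (rule(state) for rule in rules) if s is not None]

    return tick
```

A misspelled ruliad name surfaces as a failed deploy instead of a rule that silently never fires.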


These lessons came from building a coordination substrate from scratch — but they apply to anyone building AI agent systems at any scale. The common thread: AI systems that last are the ones that earn trust through transparency (audit chains), respect boundaries (signals not actions), and grow gradually (seeds not big-bang deployments).

The models will keep getting smarter every quarter. The architecture that makes smart models useful for teams of humans — that’s the harder problem, and the more durable one.
