Technical

How to ship a multi-agent system that survives contact with production

The previous post ended on a promise: building multi-agent systems poorly is worse than not building them at all. That sounds like rhetoric. It isn’t. A bad single-agent script disappoints once and gets quietly retired. A bad multi-agent system in production runs continuously, makes decisions you don’t see, accumulates errors across handoffs, and burns money in ways that don’t show up until the monthly invoice. The cost of poor execution is asymmetric, and in most cases the team that commissioned the build won’t realise what’s wrong until the system has already shaped six months of decisions on flawed output.

This post is about the disciplines that separate an agent system that survives contact with production from one that quietly poisons your operations. None of them are about prompting. All of them are familiar to anyone who has shipped distributed systems before. The mistake we see most often is teams approaching multi-agent work as a prompt-engineering problem, then discovering, three months in, that the actual hard parts are the parts they assumed would be free.

The premise: this is a distributed system that happens to use language models

Once you have more than one agent, with state passing between them and external tools being called, you have a small distributed system. Not metaphorically. Mechanically. You have asynchronous components, partial failures, retries, state synchronisation, observability requirements, and cost containment. Every one of these is a solved problem in the broader software engineering canon, with established patterns, well-known failure modes, and a body of practice older than transformer architectures.

The teams that succeed at multi-agent work treat the language model as the interesting but constrained substrate, and the orchestration around it as the engineering. The teams that fail treat the orchestration as plumbing they can figure out as they go. By the time they realise the plumbing is the product, the system is already running and the cost of fixing it approaches the cost of replacing it.

Six disciplines decide whether your system survives. We treat them as non-negotiable.

1. Inter-agent contracts must be typed, validated, and versioned

When agent A hands work to agent B, the handoff is an interface. If that interface is a free-form natural language blob, you have no contract. You have a hope. Hopes work in demos and fail in production.

Every handoff in a system we ship is a structured object: typed at the schema level, validated at the boundary, and versioned so that an upgrade to one agent doesn’t silently break the agent downstream. We use JSON Schema or Pydantic, depending on the stack. The choice doesn’t matter. The discipline does.
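A minimal sketch of what such a contract can look like with Pydantic; the research-to-drafting handoff, the field names, and the version value are illustrative, not a prescription:

```python
from typing import Literal

from pydantic import BaseModel, Field


# Illustrative handoff contract between a research agent and a drafting agent.
# The schema_version field lets the downstream agent reject payloads it was not
# built for, instead of silently misreading them.
class ResearchFindings(BaseModel):
    schema_version: Literal["2"] = "2"
    topic: str
    findings: list[str] = Field(min_length=1)
    confidence: float = Field(ge=0.0, le=1.0)
    source_ids: list[str] = Field(default_factory=list)


def handoff_to_drafter(raw_output: str) -> ResearchFindings:
    # Validation happens at the boundary: if agent A's output does not parse into
    # the contract, the orchestrator fails loudly here rather than letting agent B
    # guess at the meaning of free-form prose.
    return ResearchFindings.model_validate_json(raw_output)
```

The same shape works with plain JSON Schema if Pydantic isn't in the stack; what matters is that the validation call sits in the orchestration layer, on every handoff, rather than inside either agent's prompt.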

This sounds obvious until you see how often it isn’t done. The temptation to let agent B parse agent A’s prose output is enormous because it works in the first ten test runs. It works in the eleventh too. Around the hundredth, agent A produces a slight phrasing variation, agent B parses it differently, downstream actions diverge, and nobody notices for a week. The fix, after the fact, is the same as the fix would have been on day one: enforce a schema. The cost difference is two orders of magnitude.

Versioning matters because agents evolve. When you tighten agent A’s prompt to fix one behaviour, its output schema can shift in ways that propagate downstream. Versioned contracts let you upgrade one agent in isolation, run the new version against held-out evaluations, and detect regressions before they reach the agents that depend on it.

2. Every action must be idempotent or transactional

Agents retry. Networks fail. Models occasionally produce duplicate tool calls. In a production system, every external action your agents take will eventually be invoked twice for the same logical event. The question is not whether this happens; it is whether your system handles it gracefully or cheerfully sends two invoices to the same customer.

Idempotency is the cheap fix: every action carries an idempotency key derived from the logical event, and the downstream system deduplicates. For actions you control (database writes, internal API calls), this is straightforward. For actions you don’t (sending email, calling third-party APIs that don’t natively support idempotency), you need a transactional outbox: write the intended action to a durable outbox table in the same database transaction as the state change that triggered it, then have a separate worker drain the outbox and execute the action. The outbox record is not deleted until the action confirms success. This pattern guarantees at-least-once delivery without blocking the agent on the external call.
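A minimal sketch of the pattern, using SQLite for brevity; the table names, the refund action, and the assumed UNIQUE constraint on the idempotency key are all illustrative:

```python
import json
import sqlite3
import uuid


def record_refund_decision(conn: sqlite3.Connection, order_id: str, amount: int) -> None:
    # The state change and the intended action commit in the same transaction, so
    # either both persist or neither does. The idempotency key is derived from the
    # logical event, and a UNIQUE constraint on it deduplicates retries.
    idempotency_key = f"refund:{order_id}"
    with conn:
        conn.execute("UPDATE orders SET status = 'refund_pending' WHERE id = ?", (order_id,))
        conn.execute(
            "INSERT OR IGNORE INTO outbox (id, idempotency_key, action, payload, status) "
            "VALUES (?, ?, ?, ?, 'pending')",
            (str(uuid.uuid4()), idempotency_key, "issue_refund",
             json.dumps({"order_id": order_id, "amount": amount})),
        )


def drain_outbox(conn: sqlite3.Connection, execute_action) -> None:
    # A separate worker executes pending actions. A row is only marked done after
    # the external call confirms success, which gives at-least-once delivery
    # without blocking the agent on the external system.
    for row_id, action, payload in conn.execute(
        "SELECT id, action, payload FROM outbox WHERE status = 'pending'"
    ).fetchall():
        execute_action(action, json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET status = 'done' WHERE id = ?", (row_id,))
```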

The agent itself cannot be the source of truth on whether an action happened. It will hallucinate completion. It will retry after partial failures. It will be restarted by the orchestrator and resume from a checkpoint that doesn’t reflect reality. The orchestration layer holds the truth; agents act and report; the outbox closes the loop.

3. Observability has to be readable by operators, not engineers

The default observability story for an agent system is engineer-grade: structured logs, traces, span attributes, all consumable in a tool the operations team has never opened. This is not enough.

Operators (the people whose function the agent system runs, not the engineers who built it) need to be able to answer three questions in the moment, without help:

  • What did the system just do?
  • Why did it do that?
  • What is it about to do?

If they can’t answer those three from a UI you built for them, the system is not really theirs. It is a black box wearing a friendly face. They will not trust it. When something goes wrong (and something always goes wrong), they will not know whether to intervene or wait, and they will make the wrong call either way.

We build the operator surface as a first-class deliverable. Not a logs viewer; a feed of decisions, with the input that drove each decision, the agent that made it, and the action it produced. Filterable, replayable, and granular enough to reconstruct any single piece of work end to end. This is harder than it sounds because the underlying execution is genuinely complex; the surface has to compress that complexity into something a human can scan.
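The shape of a single entry in that feed might look something like the sketch below; the fields are illustrative, but each exists to answer one of the three operator questions:

```python
from dataclasses import dataclass
from datetime import datetime


# Illustrative shape of one entry in the operator-facing decision feed.
@dataclass
class DecisionEvent:
    timestamp: datetime
    work_item_id: str      # lets the operator reconstruct one piece of work end to end
    agent: str             # which agent made the decision
    input_summary: str     # the input that drove the decision, in plain language
    decision: str          # what the agent decided, and why, in plain language
    action_taken: str      # the action it produced, or "none"
    next_step: str         # what the system is about to do with this work item
```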

The right test is brutal: if your operations lead, the day after handover, can investigate any anomaly without paging an engineer, the observability is sufficient. If they can’t, you shipped a system you still own.

4. Cost ceilings and circuit breakers are not optional

Agents in a loop can spend money quickly. A single agent that gets stuck in a retry cycle, or that recursively decomposes a task into sub-tasks without bound, can produce a four-figure inference bill in an afternoon. Multi-agent systems compound this risk because the loop spans agents, and an individual agent’s safety check doesn’t see the global behaviour.

Every system we ship has hard cost ceilings at three levels: per-task (no single piece of work can exceed a budget without explicit escalation), per-agent (no agent can exceed a daily spend without halting), and system-wide (the orchestrator stops accepting new work if global spend trajectory exceeds the monthly budget). Hitting any ceiling is a circuit breaker, not a soft warning. The system pauses, alerts, and waits for a human.
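A minimal sketch of the three ceilings as code; the names, the per-call charge point, and the exception-based halt are illustrative choices, and the daily and monthly counters are assumed to be reset by a scheduled job:

```python
class BudgetExceeded(Exception):
    """Raised when any ceiling is hit; the orchestrator pauses the work and alerts a human."""


class CostGuard:
    def __init__(self, per_task: float, per_agent_daily: float, system_monthly: float):
        self.per_task = per_task
        self.per_agent_daily = per_agent_daily
        self.system_monthly = system_monthly
        self.task_spend: dict[str, float] = {}
        self.agent_spend: dict[str, float] = {}   # reset daily by a scheduled job
        self.system_spend = 0.0                   # reset monthly by a scheduled job

    def charge(self, task_id: str, agent: str, cost: float) -> None:
        # Called after every model or tool invocation. A breached ceiling raises,
        # which is a circuit breaker, not a log line to be ignored.
        self.task_spend[task_id] = self.task_spend.get(task_id, 0.0) + cost
        self.agent_spend[agent] = self.agent_spend.get(agent, 0.0) + cost
        self.system_spend += cost
        if self.task_spend[task_id] > self.per_task:
            raise BudgetExceeded(f"task {task_id} exceeded its per-task budget")
        if self.agent_spend[agent] > self.per_agent_daily:
            raise BudgetExceeded(f"agent {agent} exceeded its daily budget")
        if self.system_spend > self.system_monthly:
            raise BudgetExceeded("system-wide monthly budget exceeded")
```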

This frustrates teams in the first month because the ceilings will trip on legitimate work. That is the point. Tuning the ceilings is part of operating the system; setting them generously and ignoring them is how budgets get destroyed. The first time the system saves you from a runaway, the discipline pays for itself many times over.

5. Human-in-the-loop boundaries are explicit, not implied

“Human in the loop” is one of those phrases everyone agrees with and nobody implements consistently. In a system we ship, every action is classified at design time into one of three tiers:

  • Auto: the agent acts; the operator sees it after the fact.
  • Confirm: the agent prepares the action and waits for one-click confirmation.
  • Draft: the agent produces a recommendation; a human takes the action manually.

The classification is per action type, not per agent. The same agent might auto-send internal notifications, confirm-tier external emails, and draft-tier customer refunds. The boundaries are explicit, encoded in configuration, and visible to the operator in real time.
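Encoded in configuration, the classification can be as plain as a table from action type to tier; the action names below are illustrative:

```python
# Per-action-type classification. The same agent can sit in all three tiers
# depending on the action, and the operator UI renders this same table.
ACTION_TIERS = {
    "send_internal_notification": "auto",     # agent acts; operator sees it afterwards
    "send_external_email":        "confirm",  # agent prepares; waits for one-click approval
    "issue_customer_refund":      "draft",    # agent recommends; a human takes the action
}


def tier_for(action_type: str) -> str:
    # Unknown action types fall to the most conservative tier by default.
    return ACTION_TIERS.get(action_type, "draft")
```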

The trap is letting the classification drift. Teams under pressure to scale will quietly promote actions from confirm to auto because the operator is becoming a bottleneck. Sometimes this is right. Often it is the agent system being asked to compensate for a tuning problem the team hasn’t diagnosed. We treat any reclassification as a reviewable change, with an explicit decision recorded, because the difference between “the agent sends 200 emails a day with one-click approval” and “the agent sends 200 emails a day autonomously” is the entire difference between a system you trust and a system you fear.

6. Versioning, evaluation, and rollback are part of the system

Agents change. Prompts get tightened. Tool sets evolve. Models are upgraded by the provider whether you ask for it or not. Every change is a chance to introduce a regression that nobody notices until it has been compounding for weeks.

The defence is unglamorous. Every agent has a versioned prompt and tool set, stored in a repository, deployed through a pipeline. Every change runs against a held-out evaluation set before it reaches production. Every deployed version is rollbackable in seconds, not hours. None of this is novel; all of it is routinely skipped by teams who treat agent prompts as configuration to be edited live in a console.
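In practice that can be as simple as an agent definition that lives in the repository and pins everything the agent depends on; the fields and values below are illustrative:

```python
# Illustrative versioned agent definition, stored in the repository and deployed
# through the same pipeline as any other artefact. Rolling back is redeploying
# the previous version of this file.
AGENT_SPEC = {
    "agent": "triage",
    "version": "1.7.2",
    "prompt_file": "prompts/triage/1.7.2.md",  # immutable once released
    "tools": ["lookup_order", "summarise_thread"],
    "model": "provider-model-2025-01",          # pinned explicitly, never "latest"
    "eval_suite": "evals/triage",               # must pass before promotion
}
```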

The hard part of evaluation in agent systems is that the right metric is rarely “did the agent produce the expected output for this input.” It is “across a hundred runs of this task, what is the distribution of outcomes, and where on that distribution does the new version sit.” Stochastic systems require statistical evaluation. We invest more in eval infrastructure than in any individual agent, because eval is what makes evolution safe, and evolution is the entire promise of an agent system that learns its function over time.
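A sketch of what that looks like mechanically: run both versions many times on the same held-out tasks and gate on the distributions rather than on single runs. The scorer, the thresholds, and the percentile check are placeholders for whatever the system actually measures:

```python
import statistics
from typing import Callable


def outcome_distribution(score_run: Callable[[str, str], float],
                         version: str, tasks: list[str], runs: int = 100) -> list[float]:
    # score_run is whatever task-level scorer the system uses (a placeholder here);
    # the point is that each task is run many times, not once.
    return [score_run(version, task) for task in tasks for _ in range(runs)]


def no_regression(old: list[float], new: list[float], tolerance: float = 0.01) -> bool:
    # A crude gate: the new version must not pull the mean down or fatten the low
    # tail. A real pipeline would use a proper statistical test in place of this.
    old_p10 = statistics.quantiles(old, n=10)[0]
    new_p10 = statistics.quantiles(new, n=10)[0]
    return (statistics.mean(new) >= statistics.mean(old) - tolerance
            and new_p10 >= old_p10 - tolerance)
```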

Two failure modes you only see in production

Two pathologies are essentially invisible in development and merciless once a system is live.

Cascading hallucinations. A specialist agent makes a small factual error. A downstream agent treats that error as ground truth and reasons confidently from it. A third agent acts on the resulting plan. The original error is now embedded in a sequence of seemingly correct decisions, undetectable from the final output. The defence is inter-agent suspicion: every agent treats inputs from sibling agents with the same scepticism it would apply to inputs from a user, and the orchestrator maintains provenance so that any final output can be traced back to its sources for audit. Provenance also lets you do something teams routinely skip: when you discover a class of factual error, you can find every downstream decision that was contaminated by it and remediate, rather than guessing at the blast radius.
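A sketch of what provenance can look like in code, with a forward walk that finds the blast radius of a known-bad claim; the record shape is illustrative:

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    claim_id: str
    text: str
    produced_by: str                                        # agent name and version
    derived_from: list[str] = field(default_factory=list)   # upstream claim ids or source documents


def blast_radius(bad_claim_id: str, claims: list[Claim]) -> set[str]:
    # Walk the provenance graph forward from a known-bad claim to find every
    # downstream claim, and hence every decision, that depends on it.
    contaminated = {bad_claim_id}
    changed = True
    while changed:
        changed = False
        for c in claims:
            if c.claim_id not in contaminated and any(d in contaminated for d in c.derived_from):
                contaminated.add(c.claim_id)
                changed = True
    return contaminated - {bad_claim_id}
```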

Prompt injection through tool outputs. An agent calls a tool. The tool returns content that includes adversarial instructions. The agent, dutifully following its system prompt, treats the tool output as data to reason about and ends up acting on the injected instructions. This is not theoretical. It is the most reliable way to compromise an agent system, and it gets more reliable as agents are given more powerful tools. Three defences are required in combination. First, parse tool outputs into structured types before they reach the model context; an agent that never sees raw tool text cannot be instructed by it. Second, keep tool-output content in a distinct message role (tool result, not user or system), and instruct the model explicitly that tool results are data to reason about, not instructions to follow. Third, audit the full reasoning chain when an action looks anomalous; the injection point is usually traceable from the event log if tool calls and handoffs are instrumented as structured events. Treat every byte that did not originate inside your own system as untrusted, even if it came from an API you control. Especially then.
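A sketch of the first two defences together, with Pydantic standing in for the structured parse; the tool, its fields, and the message shape are illustrative:

```python
from pydantic import BaseModel


class WeatherResult(BaseModel):
    location: str
    temperature_c: float
    conditions: str


def tool_result_message(raw_tool_output: str) -> dict:
    # Parse before the model sees anything: fields the schema does not declare are
    # dropped, and the content re-enters the context under the tool role, which the
    # system prompt instructs the model to treat as data, never as instructions.
    parsed = WeatherResult.model_validate_json(raw_tool_output)
    return {"role": "tool", "content": parsed.model_dump_json()}
```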

What “shipped” means

A multi-agent system is shipped when an operator on your team can run it, monitor it, and explain its behaviour without help. Not when it works in a demo. Not when it passes acceptance tests. Not when it produced its first useful output. Those are milestones; none of them are shipping.

Shipping means the system has been operating for long enough that its failure modes are known, its cost profile is stable, its observability is sufficient for the team that owns it, and the team that built it can step away without the system degrading. The interval between “first useful output” and “shipped” by this definition is usually weeks, not days. Teams that compress it are storing risk, not saving time.

This is the discipline. It is unglamorous, it borrows from a half-century of distributed systems practice, and it has very little to do with the parts of agent work that get talked about. But it is the difference between a system that earns its place inside your operations and one that quietly rots there. In our experience, the teams that take it seriously are the ones whose agent systems are still running, profitably, two years later. The teams that don’t are the ones writing the next RFP to replace what they bought.

The next post will look at the orchestration layer itself: what it does, how to architect it, and why the choice of orchestration framework matters less than the choice to take the orchestration layer seriously in the first place.