Stop building “multi-agent orchestration”; build workflow contracts

I keep seeing the same failure mode: you demo a “three-agent” workflow that looks great in a notebook, then it dies in prod because nobody can answer one boring question—what exactly did Agent B receive, and what did it promise back?

The orchestration layer is where systems go to become folklore. People add another agent, another prompt, another “manager” call, and the behavior gets harder to reason about. The number of agents is not the problem. The missing piece is the workflow contract.

A workflow contract is the part you can test without invoking the model. It’s the shape of the job, the handoff semantics, and the failure semantics. Everything else is optional.

The demo problem: agents are cheap, contracts aren’t

Toy multi-agent demos usually share a pattern:

The “manager” writes a plan in free text.
Workers read that text and do tool calls.
The manager stitches results back together.
When it breaks, you re-run the demo and hope the model behaves.

That’s not a system. It’s a cinematic loop.

In a real system, the orchestration layer has to survive:

tool failures (timeouts, partial results, malformed tool outputs)
model variability (different formats, missing fields, refusal paths)
concurrency (same workflow running twice, out-of-order events)
cost controls (budget exceeded mid-run)

If your “contract” is “LLM will probably format it right,” you don’t have orchestration. You have wishful execution.

The correct metric is: can you replay a workflow run and get the same state transitions given the same recorded tool outcomes and model outputs? If the answer is “not really,” you’re building a demo generator.

What a workflow contract actually includes

Treat agents like role-bound workers. The workflow contract is the boundary between the workflow engine and the model.

Concretely, a contract has to define:

1) Inputs and outputs (schemas, not prose)

Required fields
Optional fields
Validation rules
Canonical formats (dates, IDs, units)
Output invariants (“must include citations array even if empty”)

If Agent B can return confidence as text sometimes and as a number other times, you’ve already lost.
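
Here is a minimal sketch of what “schemas, not prose” can look like, using only the Python standard library. The field names (order_id, confidence, citations), the ContractViolation class, and the error strings are assumptions made up for illustration; a real system would more likely lean on a schema library.

```python
from dataclasses import dataclass, field
from typing import Any


class ContractViolation(ValueError):
    """Raised when a worker's output does not match the contract."""


@dataclass(frozen=True)
class WorkerOutput:
    order_id: str
    confidence: float  # always a number in [0, 1], never text
    citations: list[str] = field(default_factory=list)


def parse_worker_output(raw: dict[str, Any]) -> WorkerOutput:
    """Validate a raw dict against the (hypothetical) Agent B output contract."""
    # Output invariant: every field must be present, even if citations is empty.
    for name in ("order_id", "confidence", "citations"):
        if name not in raw:
            raise ContractViolation(f"missing_field={name}")

    order_id = raw["order_id"]
    if not isinstance(order_id, str) or not order_id:
        raise ContractViolation("invalid_type=order_id")

    confidence = raw["confidence"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ContractViolation("invalid_type=confidence")
    if not 0.0 <= confidence <= 1.0:
        raise ContractViolation("out_of_range=confidence")

    citations = raw["citations"]
    if not isinstance(citations, list) or not all(isinstance(c, str) for c in citations):
        raise ContractViolation("invalid_type=citations")

    return WorkerOutput(order_id, float(confidence), list(citations))
```

The point is that this function can be unit tested with no model in the loop: feed it "high" for confidence and it fails the same way every time.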

2) Role constraints

A role is not a prompt. It’s a constraint on allowed actions.

Agent X may only call tools A/B/C
Agent Y may only transform data, not fetch external facts
No hidden “manager” behavior inside workers

You don’t need perfect autonomy. You need predictable capability boundaries.
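
One way to make a role a constraint rather than a prompt is an allowlist the engine checks before any tool call runs. This is a sketch under assumptions: the role names, tool names, and the RoleViolation class are hypothetical.

```python
# Hypothetical role-to-tool allowlist. The engine owns this table, not the prompt.
ROLE_TOOLS: dict[str, frozenset[str]] = {
    "researcher": frozenset({"search", "fetch_url"}),
    "writer": frozenset(),  # transform-only: may not call any tool
}


class RoleViolation(PermissionError):
    """Raised when an agent requests a tool outside its role."""


def authorize_tool_call(agent: str, tool: str) -> None:
    """Raise RoleViolation unless `agent` is allowed to call `tool`."""
    allowed = ROLE_TOOLS.get(agent)
    if allowed is None:
        raise RoleViolation(f"contract_error: unknown_role agent={agent}")
    if tool not in allowed:
        raise RoleViolation(f"contract_error: role_violation agent={agent} tool={tool}")
```

The check runs in the engine, so a worker that tries to act like a manager gets a contract error instead of quietly succeeding.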

3) Handoff semantics (what “done” means)

Define the transitions:

When does Agent A hand off? (on status=ready, not “when it feels done”)
How do you represent “needs clarification”? (structured clarification_request)
What does “success” look like? (required fields present + validation pass)

Handoffs should be deterministic: given the same input state and the same validated worker output, the next state is the same.
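
A handoff rule like this can be a pure function of the worker's structured output. The sketch below assumes a status field with three hypothetical values and made-up step names; the only claim is that nothing in it asks a model what to do next.

```python
from enum import Enum


class Status(str, Enum):
    READY = "ready"
    NEEDS_CLARIFICATION = "needs_clarification"
    FAILED = "failed"


def next_step(output: dict) -> str:
    """Map a worker's structured output to the next workflow step."""
    # Status(...) raises ValueError for anything outside the enum,
    # so "feels done" can never trigger a handoff.
    status = Status(output.get("status"))
    if status is Status.READY:
        return "handoff_to_agent_b"
    if status is Status.NEEDS_CLARIFICATION:
        if not output.get("clarification_request"):
            raise ValueError("validation_error: missing_field=clarification_request")
        return "await_input"
    return "dead_letter"
```

Same output in, same step out. That property is what makes the handoff testable.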

4) State model

A workflow engine needs a state machine, not a narrative.

Explicit states: pending -> running -> succeeded | failed | needs_input | needs_tool_retry
Correlation IDs that tie every event, tool call, and log line to one run
Versioning for contract changes, recorded on each run
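
As a rough sketch, the state model fits in a transition table plus a small run record. The state names come from the list above; the resume edges (needs_input and needs_tool_retry going back to running) and the Run fields are my assumptions about how such an engine might be wired.

```python
from dataclasses import dataclass, field

# Allowed transitions. The edges back to "running" are an assumption.
ALLOWED: dict[str, set[str]] = {
    "pending": {"running"},
    "running": {"succeeded", "failed", "needs_input", "needs_tool_retry"},
    "needs_input": {"running"},
    "needs_tool_retry": {"running"},
    "succeeded": set(),  # terminal
    "failed": set(),     # terminal
}


@dataclass
class Run:
    correlation_id: str    # ties events and logs to this run
    contract_version: str  # which contract this run was created under
    state: str = "pending"
    history: list[str] = field(default_factory=lambda: ["pending"])


def transition(run: Run, new_state: str) -> None:
    """Move the run to new_state, or raise if the contract forbids it."""
    if new_state not in ALLOWED.get(run.state, set()):
        raise ValueError(f"illegal transition {run.state} -> {new_state}")
    run.state = new_state
    run.history.append(new_state)
```

The history list is the narrative, but it is derived from the state machine rather than standing in for it.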

5) Observability hooks

Contracts should include traceable artifacts:

tool call logs (inputs/outputs)
model prompt/version identifiers
validation errors
redaction rules

If you can’t see where the contract broke, you’ll “fix the prompt” forever.
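
A trace record does not need to be fancy. The sketch below emits one JSON line per event and applies a redaction rule first; the key names, the redacted-key set, and the prompt_version field are hypothetical.

```python
import json

# Hypothetical redaction rule: these payload keys are never written to logs.
REDACTED_KEYS = frozenset({"api_key", "email"})


def redact(payload: dict) -> dict:
    """Return a copy of payload with sensitive top-level values masked."""
    return {k: ("[REDACTED]" if k in REDACTED_KEYS else v) for k, v in payload.items()}


def trace_event(correlation_id: str, kind: str, payload: dict, prompt_version: str) -> str:
    """Serialize one trace event as a JSON line with stable key order."""
    record = {
        "correlation_id": correlation_id,
        "kind": kind,  # e.g. "tool_call" or "validation_error"
        "prompt_version": prompt_version,
        "payload": redact(payload),
    }
    return json.dumps(record, sort_keys=True)
```

Note that this only masks top-level keys; nested payloads would need a recursive rule.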

Determinism where possible (and where you can’t)

I’m opinionated here: orchestration should be deterministic where possible.

That means:

Routing rules are explicit (no “manager decides” without a rule)
Tool calls are bounded (timeouts, budgets, max calls)
Output formats are enforced (schema validation + repair loop)
Plans are either structured data or recorded verbatim, so a run can be replayed

But you still have to deal with nondeterminism.

So separate it:

Deterministic parts: state transitions, tool selection, retry policy, compensation
Nondeterministic parts: text generation, classification, summarization

Then wrap nondeterminism with contracts.

Example pattern:

Worker returns structured output with a decision enum.
Engine validates.
If invalid: run a repair step that only fixes formatting, not meaning.
If tool failed: engine follows a deterministic retry/compensation path.

You’re not trying to make the model deterministic. You’re making the system deterministic around the model.
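
The validate-then-repair pattern can be sketched with the model stubbed out as a plain callable. Everything here is an assumption for illustration: the decision values, the function names, and the repair step, which in this sketch only strips text wrapped around a JSON object and never touches its contents.

```python
import json
from typing import Callable

DECISIONS = frozenset({"approve", "reject", "escalate"})  # hypothetical enum


def validate(text: str) -> dict:
    """Parse worker output and enforce the decision enum."""
    data = json.loads(text)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict) or data.get("decision") not in DECISIONS:
        raise ValueError("validation_error: invalid_field=decision")
    return data


def strip_wrapping(text: str) -> str:
    """Format-only repair: keep the outermost {...} span, drop surrounding chatter."""
    start, end = text.find("{"), text.rfind("}")
    return text[start:end + 1] if start != -1 and end > start else text


def run_step(worker: Callable[[], str],
             repair: Callable[[str], str] = strip_wrapping,
             max_repairs: int = 1) -> dict:
    """Call the worker once, then validate with a bounded number of repairs."""
    text = worker()
    for attempt in range(max_repairs + 1):
        try:
            return validate(text)
        except ValueError:
            if attempt == max_repairs:
                raise  # out of repair budget: surface the validation error
            text = repair(text)
    raise ValueError("max_repairs must be >= 0")
```

An output with an invalid decision still fails after repair, because the repair step cannot invent meaning. That is the intended behavior.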

Failure modes are the product

If your workflow contract doesn’t define failure modes, you’re outsourcing reliability to the runtime.

A real contract includes:

Retry semantics: which errors are retryable, with what backoff
Timeout semantics: max wall time per stage
Partial completion: what gets committed, what gets rolled back
Compensation: how to undo side effects
Dead-letter paths: where “cannot recover” goes
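
Retry and dead-letter rules are easy to write down as data plus two small functions. The error codes, the default attempt limit, and the state names returned below are assumptions; the sketch deliberately omits jitter so the schedule stays reproducible, which a production system may not want.

```python
# Hypothetical classification of tool errors.
RETRYABLE = frozenset({"provider_timeout", "rate_limited"})


def backoff_schedule(base: float, factor: float, max_attempts: int, cap: float) -> list[float]:
    """Exponential backoff delays in seconds, capped, with no jitter."""
    return [min(cap, base * factor ** i) for i in range(max_attempts)]


def on_tool_error(error_code: str, attempt: int, max_attempts: int = 3) -> str:
    """Decide the next state after a tool error on the given 1-based attempt."""
    if error_code in RETRYABLE and attempt < max_attempts:
        return "needs_tool_retry"
    return "dead_letter"  # non-retryable, or retry budget exhausted
```

With these two pieces, “which errors are retryable, with what backoff” is something you can read in a diff instead of inferring from logs.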

Also: idempotency.

If the same workflow event arrives twice, you need a rule for deduping. Otherwise your “agent system” becomes a side-effect generator.
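
A deduping rule can be as small as “one side effect per (correlation_id, event_id).” The sketch below keeps seen keys in memory, which is only good enough for a demo; a real engine would need a durable store, and the event field names are assumptions.

```python
from typing import Callable


class Deduper:
    """Run a side effect at most once per (correlation_id, event_id)."""

    def __init__(self) -> None:
        self._seen: set[tuple[str, str]] = set()

    def handle(self, event: dict, effect: Callable[[dict], object]) -> bool:
        """Return True if the effect ran, False if the event was a duplicate."""
        key = (event["correlation_id"], event["event_id"])
        if key in self._seen:
            return False
        # Marking before running gives at-most-once semantics: if the effect
        # raises, the event is not retried. Swap the order for at-least-once.
        self._seen.add(key)
        effect(event)
        return True
```

Whichever order you choose, the choice belongs in the contract, not in whatever the handler happens to do today.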

And you need failure to be legible.

Instead of “LLM failed,” you want:

validation_error: missing_field=order_id
tool_error: provider_timeout tool=search
contract_error: role_violation agent=writer tool=filesystem_write

Those are actionable. “It didn’t work” isn’t.
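
Errors in that shape are easy to produce if they start life as structured values and only become strings at the edge. A small sketch, where the WorkflowError class and its fields are hypothetical:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class WorkflowError:
    kind: str                   # e.g. "validation_error", "tool_error"
    code: Optional[str] = None  # e.g. "provider_timeout"
    details: dict[str, str] = field(default_factory=dict)

    def render(self) -> str:
        """Format as 'kind: [code] key=value ...' for logs."""
        parts = [f"{self.kind}:"]
        if self.code:
            parts.append(self.code)
        parts += [f"{k}={v}" for k, v in self.details.items()]
        return " ".join(parts)


errors = [
    WorkflowError("validation_error", details={"missing_field": "order_id"}),
    WorkflowError("tool_error", "provider_timeout", {"tool": "search"}),
    WorkflowError("contract_error", "role_violation",
                  {"agent": "writer", "tool": "filesystem_write"}),
]
rendered = [e.render() for e in errors]
```

Because kind and code are fields, you can count, alert, and route on them without parsing log text.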

A practical build pattern: agent harness + relay with memory

The market signal I buy is scaffolding: harnesses and relays that make workflows provider-agnostic.

The key is not “more clever orchestration.” The key is a harness that:

enforces contract schemas
normalizes tool outputs across providers
records run artifacts for replay
exposes deterministic workflow transitions

Then add a relay layer that learns from past runs, but only through the contract.

So the relay can:

choose routing based on past success/failure
adjust repair strategies for specific validation errors
surface “known bad” prompt/format combinations

Not by letting the model freestyle. By updating policy keyed on contract outcomes.
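
To make “policy keyed on contract outcomes” concrete, here is a toy routing policy that only ever sees pass/fail results from contract validation. The class name, the min_samples threshold, and the route labels are assumptions, and the sketch does no exploration: a route that never gets traffic never gets scored.

```python
from collections import defaultdict


class RoutingPolicy:
    """Pick a route from recorded contract outcomes, deterministically."""

    def __init__(self, routes: list[str], min_samples: int = 5) -> None:
        self._routes = list(routes)  # declared order doubles as the tie-break
        self._min_samples = min_samples
        self._stats: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # [ok, total]

    def record(self, route: str, succeeded: bool) -> None:
        """Record whether a run on `route` passed contract validation."""
        stats = self._stats[route]
        stats[0] += int(succeeded)
        stats[1] += 1

    def choose(self) -> str:
        """Return the best-scoring route, or the default until there is enough data."""
        scored = [r for r in self._routes if self._stats[r][1] >= self._min_samples]
        if not scored:
            return self._routes[0]
        # max() keeps the first maximum, so ties resolve to declared order.
        return max(scored, key=lambda r: self._stats[r][0] / self._stats[r][1])
```

The relay here has memory, but no opinions: given the same recorded outcomes, it always picks the same route.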

If you do it right, your multi-agent system starts to look less like a swarm and more like a workflow engine with role-bound workers.

Stop building multi-agent orchestration as a vibe layer.

Build workflow contracts, and your agents become boring—in the best way.