A multi-agent orchestration layer with evals before autonomy

Overview

A prototype agent can look impressive in a demo and still fail in production. The useful work was turning the system into an orchestrated, testable workflow with explicit routing and regression checks.

The operating problem

The agent was wasting tokens on repeated context, weak routing, and tool calls for questions that could have been resolved earlier in the workflow.

The team needed to catch hallucination, retrieval drift, and tool-call regressions before user-facing release.

Reliability had to improve without making the system so rigid that it lost the flexibility that made agents useful.

The BuildModal implementation

Orchestration core

We split the workflow into planner, retrieval, tool-use, verifier, and response steps, then controlled which context each step was allowed to see.
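As a rough illustration of step-level context control, the sketch below runs the five steps in order and hands each one only the state keys on its allow-list. The key names, the allow-list contents, and the stub step functions are hypothetical and stand in for the real planner, retriever, tools, verifier, and responder.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]
Step = Callable[[State], State]

# Order taken from the text: planner, retrieval, tool-use, verifier, response.
STEP_ORDER: List[str] = ["planner", "retrieval", "tool_use", "verifier", "response"]

# Hypothetical allow-list: the state keys each step is permitted to read.
ALLOWED_CONTEXT: Dict[str, List[str]] = {
    "planner": ["question"],
    "retrieval": ["question", "plan"],
    "tool_use": ["plan", "documents"],
    "verifier": ["question", "documents", "tool_results"],
    "response": ["question", "documents", "verdict"],
}


def scoped_view(state: State, step_name: str) -> State:
    """Return only the keys this step is allowed to see."""
    return {key: state[key] for key in ALLOWED_CONTEXT[step_name] if key in state}


def run_workflow(question: str, steps: Dict[str, Step]) -> State:
    """Run each step on its scoped view and merge its output into shared state."""
    state: State = {"question": question}
    for name in STEP_ORDER:
        updates = steps[name](scoped_view(state, name))
        state.update(updates)
    return state


# Stub steps so the sketch runs end to end; real steps would call models and tools.
STEPS: Dict[str, Step] = {
    "planner": lambda ctx: {"plan": f"look up: {ctx['question']}"},
    "retrieval": lambda ctx: {"documents": ["doc-1"]},
    "tool_use": lambda ctx: {"tool_results": []},
    "verifier": lambda ctx: {"verdict": "grounded" if ctx["documents"] else "unsupported"},
    "response": lambda ctx: {"answer": f"{ctx['verdict']}: {ctx['question']}"},
}
```

Calling `run_workflow("refund policy", STEPS)` returns the full state, while each step only ever received its own slice. Keeping the allow-list in one table makes it easy to review which step pays for which context.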

Evaluation harness

A RAGAS-based test suite scored retrieval relevance, factual grounding, answer quality, and regression behavior against curated task sets.
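One way to turn such scores into a regression check is to compare each run against a stored baseline. The sketch below assumes the per-metric scores have already been computed (for example by the RAGAS suite) and arrive as plain numbers; it does not call the RAGAS API. The metric names, the tolerance, and the function name are hypothetical.

```python
from typing import Dict, List

# Hypothetical tolerance: how far a score may drop before it counts as a regression.
TOLERANCE = 0.02


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = TOLERANCE,
) -> List[str]:
    """List every baseline metric that is missing or has dropped beyond tolerance."""
    regressions: List[str] = []
    for metric, base_score in baseline.items():
        new_score = current.get(metric)
        if new_score is None:
            regressions.append(f"{metric}: missing from current run")
        elif new_score < base_score - tolerance:
            regressions.append(f"{metric}: {base_score:.2f} -> {new_score:.2f}")
    return regressions
```

A metric that improves or stays within tolerance produces no entry, so an empty list means the candidate did not regress on the curated task set.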

Release gates

The system treated eval failures, tool-call anomalies, and citation gaps as release blockers instead of observations someone might inspect later.
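A gate of this kind can be sketched as a function that collects all three blocker types and returns a non-zero exit code when any are present. The report structure and field names below are assumptions for illustration, not the actual release tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CandidateReport:
    """Hypothetical summary of one release candidate's checks."""
    eval_failures: List[str]
    tool_call_anomalies: List[str]
    uncited_claims: List[str]


def release_blockers(report: CandidateReport) -> List[str]:
    """Every finding in any category counts as a blocker, not as a warning."""
    blockers: List[str] = []
    blockers += [f"eval failure: {item}" for item in report.eval_failures]
    blockers += [f"tool-call anomaly: {item}" for item in report.tool_call_anomalies]
    blockers += [f"citation gap: {item}" for item in report.uncited_claims]
    return blockers


def gate(report: CandidateReport) -> int:
    """Return a process exit code: 0 allows the release, 1 blocks it."""
    blockers = release_blockers(report)
    for blocker in blockers:
        print("BLOCKED:", blocker)
    return 1 if blockers else 0
```

In a build pipeline the return value could be passed to `sys.exit`, so a single citation gap fails the build the same way a failing unit test would.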

Rollout and controls

First, we instrumented traces so the team could see token waste, bad branches, and repeated context.
Next, we refactored the orchestration layer around explicit state, tool boundaries, and verifier passes.
Finally, we added eval suites to the build process so reliability checks ran before production changes shipped.
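The first step, spotting repeated context in traces, can be approximated with a small pass over recorded spans. The sketch below assumes each span is a step name plus the context chunks sent to that step, and it uses a whitespace word count as a crude stand-in for real token counts. The span shape and the function name are hypothetical.

```python
import hashlib
from typing import List, Set, Tuple

# Hypothetical span shape: (step name, context chunks sent to that step).
Span = Tuple[str, List[str]]


def repeated_context_tokens(spans: List[Span]) -> int:
    """Count approximate tokens in chunks that were already sent in an earlier position."""
    seen: Set[str] = set()
    wasted = 0
    for _step, chunks in spans:
        for chunk in chunks:
            key = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
            if key in seen:
                # Whitespace split is only a rough proxy for a tokenizer.
                wasted += len(chunk.split())
            else:
                seen.add(key)
    return wasted
```

A non-zero result does not prove waste on its own, since some steps legitimately need the same chunk, but it shows where to look first.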

Results

Token usage dropped by roughly thirty percent through context control and better routing.

The team gained a regression harness for hallucination, retrieval quality, and tool-call behavior.

The agent moved from demo logic toward production operating discipline: traceable, testable, and cheaper to run.

What this proves

Autonomy without evals is theater. The production work is orchestration, context control, release gates, and making failures visible early enough to fix.
