A prototype agent can look impressive in a demo and still fail in production. The useful work was turning the system into an orchestrated, testable workflow with explicit routing and regression checks.
The agent was spending tokens on repeated context, weak routing, and tool calls for questions an earlier step in the workflow could already have answered.
The team needed to catch hallucination, retrieval drift, and tool-call regressions before user-facing release.
Reliability had to improve without making the system so rigid that it lost the flexibility that made agents useful.
Orchestration core
We split the workflow into planner, retrieval, tool-use, verifier, and response steps, then controlled which context each step was allowed to see.
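A minimal sketch of that context control, assuming a shared state dictionary and a per-step allow-list of readable keys. The step functions, key names, and the sample question are hypothetical stand-ins for model and tool calls, not the production code.

```python
from typing import Any, Callable, Dict, FrozenSet, List, Tuple

State = Dict[str, Any]
# A step receives only its scoped view and returns the new keys it produced.
Step = Callable[[State], State]


def scoped_view(state: State, allowed: FrozenSet[str]) -> State:
    """Return only the keys a step is allowed to read."""
    return {key: value for key, value in state.items() if key in allowed}


def run_pipeline(
    question: str, steps: List[Tuple[str, FrozenSet[str], Step]]
) -> Tuple[State, List[Tuple[str, List[str]]]]:
    """Run steps in order; each sees only its allow-listed slice of state."""
    state: State = {"question": question}
    trace: List[Tuple[str, List[str]]] = []
    for name, allowed, step in steps:
        view = scoped_view(state, allowed)
        trace.append((name, sorted(view)))  # record what the step actually saw
        state.update(step(view))
    return state, trace


# Hypothetical stub steps; real ones would call a model, a retriever, or a tool.
def planner(view: State) -> State:
    return {"plan": f"look up: {view['question']}"}


def retrieval(view: State) -> State:
    return {"documents": ["doc-1", "doc-2"]}


def tool_use(view: State) -> State:
    return {"tool_results": {"doc_count": len(view["documents"])}}


def verifier(view: State) -> State:
    has_evidence = bool(view["documents"]) and view["tool_results"]["doc_count"] > 0
    return {"verified": has_evidence}


def response(view: State) -> State:
    return {"answer": "grounded answer" if view["verified"] else "cannot answer"}


# Assumed allow-lists: each step reads only what it needs.
PIPELINE = [
    ("planner", frozenset({"question"}), planner),
    ("retrieval", frozenset({"question", "plan"}), retrieval),
    ("tool_use", frozenset({"plan", "documents"}), tool_use),
    ("verifier", frozenset({"documents", "tool_results"}), verifier),
    ("response", frozenset({"question", "verified"}), response),
]

state, trace = run_pipeline("What is our refund policy?", PIPELINE)
```

In this sketch the response step never sees the raw documents, only the verifier's verdict, so retrieved text is not re-sent to every later step. The recorded trace also makes it easy to audit which keys each step read.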
Evaluation harness
A RAGAS-based test suite scored retrieval relevance, factual grounding, answer quality, and regression behavior against curated task sets.
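The regression side of such a harness can be sketched without any scoring library. The example below does not call RAGAS. It assumes per-case scores between 0 and 1 have already been produced, and the metric labels, sample scores, and tolerance are illustrative choices, not RAGAS identifiers or the team's actual thresholds.

```python
from statistics import mean
from typing import Dict, List

# Labels chosen for this sketch to mirror the scored areas in the text.
METRICS = ("retrieval_relevance", "factual_grounding", "answer_quality")


def aggregate(case_scores: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each metric over all cases in a curated task set."""
    return {m: mean(case[m] for case in case_scores) for m in METRICS}


def find_regressions(
    baseline: Dict[str, float], current: Dict[str, float], tolerance: float = 0.02
) -> Dict[str, float]:
    """Return metrics whose mean fell more than `tolerance` below baseline."""
    return {
        m: round(baseline[m] - current[m], 4)
        for m in METRICS
        if baseline[m] - current[m] > tolerance
    }


# Made-up scores for two cases, before and after a change.
baseline_cases = [
    {"retrieval_relevance": 0.75, "factual_grounding": 0.875, "answer_quality": 0.75},
    {"retrieval_relevance": 0.75, "factual_grounding": 0.875, "answer_quality": 0.75},
]
current_cases = [
    {"retrieval_relevance": 0.75, "factual_grounding": 0.75, "answer_quality": 0.75},
    {"retrieval_relevance": 0.75, "factual_grounding": 0.5, "answer_quality": 0.75},
]

regressions = find_regressions(aggregate(baseline_cases), aggregate(current_cases))
```

Comparing against a stored baseline, instead of only against a fixed threshold, is what lets the suite flag a change that is still "passing" but measurably worse than the last release.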
Release gates
The system treated eval failures, tool-call anomalies, and citation gaps as release blockers instead of observations someone might inspect later.
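One way to make those three signals blocking is to collapse them into a single exit code that a build step can act on. This is a sketch under assumptions: the `ReleaseReport` fields and the example entries are hypothetical, and a real gate would populate them from the eval suite and trace data.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReleaseReport:
    """Hypothetical summary of one candidate build."""
    eval_failures: List[str] = field(default_factory=list)        # failed eval cases
    tool_call_anomalies: List[str] = field(default_factory=list)  # unexpected or malformed calls
    uncited_claims: List[str] = field(default_factory=list)       # claims with no citation


def release_blockers(report: ReleaseReport) -> List[str]:
    """List every finding that should stop the release."""
    groups = (
        ("eval failure", report.eval_failures),
        ("tool-call anomaly", report.tool_call_anomalies),
        ("citation gap", report.uncited_claims),
    )
    return [f"{label}: {item}" for label, items in groups for item in items]


def gate_exit_code(report: ReleaseReport) -> int:
    """Return 1 if anything blocks the release, 0 otherwise."""
    return 1 if release_blockers(report) else 0


clean = ReleaseReport()
blocked = ReleaseReport(uncited_claims=["refund window is 14 days"])
```

A build script could end with `sys.exit(gate_exit_code(report))`, so a single uncited claim fails the pipeline the same way a failing unit test would.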
- Instrumented traces first so the team could see token waste, bad branches, and repeated context.
- Refactored the orchestration layer around explicit state, tool boundaries, and verifier passes.
- Added eval suites to the build process so reliability checks ran before production changes shipped.
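The first step in the list above, tracing for repeated context, can be approximated with very little code. The sketch below assumes a trace is a list of (step name, context chunks) pairs and uses whitespace word counts as a crude stand-in for a real tokenizer. The sample prompt and document text are invented.

```python
from typing import Dict, List, Set, Tuple


def count_tokens(text: str) -> int:
    """Crude proxy for a tokenizer: whitespace-separated words."""
    return len(text.split())


def repeated_context_report(trace: List[Tuple[str, List[str]]]) -> Dict[str, int]:
    """Count total tokens sent and tokens spent re-sending earlier chunks."""
    seen: Set[str] = set()
    total = 0
    repeated = 0
    for _step, chunks in trace:
        for chunk in chunks:
            tokens = count_tokens(chunk)
            total += tokens
            if chunk in seen:
                repeated += tokens  # this exact chunk was already sent earlier
            seen.add(chunk)
    return {"total_tokens": total, "repeated_tokens": repeated}


SYSTEM = "You are a support agent"
DOC = "Refunds are issued within 14 days"

sample_trace = [
    ("planner", [SYSTEM]),
    ("retrieval", [SYSTEM, DOC]),
    ("response", [SYSTEM, DOC]),
]

report = repeated_context_report(sample_trace)
```

Not every repeated chunk is waste, since some steps genuinely need the same instructions. The point of the report is to make the repetition visible so each case can be judged.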
Token usage dropped by roughly thirty percent through context control and better routing.
The team gained a regression harness for hallucination, retrieval quality, and tool-call behavior.
The agent moved from demo logic toward production operating discipline: traceable, testable, and cheaper to run.
Autonomy without evals is theater. The production work is orchestration, context control, release gates, and making failures visible early enough to fix.
