Evaluating agents when there's no single right answer
- William Jacob
- Evaluation, Agents
- 05 May, 2026
Evaluating a single prompt is hard. Evaluating an agent that runs ten tool calls before answering is a different category of hard. Trajectories that produce a correct answer rarely match each other step for step. Trajectories that produce a wrong answer often look reasonable until step seven. Standard exact-match scoring is useless here, and reviewers burn out fast on long-form trace inspection.
What actually works
Three signals do the heavy lifting. Outcome correctness — did the final answer match the ground truth — is necessary but not sufficient. Trajectory cost — number of steps, total tokens, total tool calls — catches the agents that get the right answer the wrong way. And subgoal progress — did the agent advance through expected milestones — catches the silent-failure cases where the agent reaches the answer by accident.
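A minimal sketch of how these three signals might be scored together. The Trajectory and EvalCase shapes, the field names, and the 0.5 suspicion threshold are all illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    """One agent run: ordered tool calls plus the final answer."""
    final_answer: str
    tool_calls: list[str]              # tool names, in call order
    total_tokens: int
    milestones_hit: set[str] = field(default_factory=set)

@dataclass
class EvalCase:
    """Ground truth for one task, including expected subgoals."""
    expected_answer: str
    expected_milestones: set[str]      # subgoals a sound trajectory passes through
    step_budget: int                   # soft cap on tool calls for this task

def score(traj: Trajectory, case: EvalCase) -> dict:
    """Combine outcome, cost, and subgoal progress into one record."""
    outcome = traj.final_answer.strip() == case.expected_answer.strip()
    # Cost ratio above 1.0 means the agent blew past the step budget.
    cost_ratio = len(traj.tool_calls) / max(case.step_budget, 1)
    # Fraction of expected subgoals the trajectory actually reached.
    progress = len(traj.milestones_hit & case.expected_milestones) / max(
        len(case.expected_milestones), 1
    )
    # Right answer with low subgoal progress is the "reached it by accident" case.
    return {
        "outcome": outcome,
        "cost_ratio": cost_ratio,
        "subgoal_progress": progress,
        "suspicious": outcome and progress < 0.5,
    }
```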
Building the eval set
Hand-curate twenty trajectories before you spend a dollar on automation. The first twenty teach you what signals matter for your task. After that, LLM-as-judge with a careful rubric scales further than human review, but only if you’ve calibrated the judge against your hand-labeled set. Skip that calibration and the judge will agree with itself confidently and wrongly.
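One way to make that calibration concrete: score the judge's verdicts against the hand labels with a chance-corrected agreement statistic, and refuse to scale until it clears a floor. Everything here (the judge_fn callable, the record shape, the 0.7 kappa floor) is an assumption for illustration:

```python
def cohens_kappa(human: list[bool], judge: list[bool]) -> float:
    """Chance-corrected agreement between two binary raters."""
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    p_human = sum(human) / n
    p_judge = sum(judge) / n
    # Agreement the two raters would reach by chance alone.
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

def calibrate_judge(hand_labeled: list[dict], judge_fn, kappa_floor: float = 0.7) -> float:
    """Gate the judge behind agreement with the hand-labeled set."""
    human = [case["human_label"] for case in hand_labeled]
    verdicts = [judge_fn(case["trajectory"]) for case in hand_labeled]
    kappa = cohens_kappa(human, verdicts)
    if kappa < kappa_floor:
        raise ValueError(
            f"Judge kappa {kappa:.2f} is below {kappa_floor}; "
            "revise the rubric before trusting it past hand review."
        )
    return kappa
```

Raw percent agreement flatters a judge on imbalanced sets (a judge that always says "pass" agrees often by luck); kappa corrects for that, which is why it gates here.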
Agent eval looks like a metrics problem but is actually a labeling problem. The teams that ship reliable agents are the ones that invested in hand-labeled trajectory datasets, the kind of work the rest of the field dismisses as tedious.
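For a sense of what one record in such a dataset might hold, a hand-labeled entry could look like the following; every field name and value is hypothetical:

```python
# One hand-labeled record in a trajectory dataset (all fields hypothetical).
labeled_example = {
    "task_id": "refund-lookup-017",
    "trajectory": {
        "final_answer": "$42.10",
        "tool_calls": ["search_orders", "get_order", "compute_refund"],
        "total_tokens": 8431,
    },
    "human_label": True,   # reviewer judged the outcome correct
    "milestones_hit": ["found_order", "applied_policy"],
    "notes": "Correct answer, but skipped the policy-version check.",
}
```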