Evaluating agents when there's no single right answer
- William Jacob
- Evaluation, Agents
- 05 May, 2026
Evaluating a single prompt is hard. Evaluating an agent that runs ten tool calls before answering is a different category of hard. Trajectories that produce a correct answer rarely match each other step for step. Trajectories that produce a wrong answer often look reasonable until step seven. Standard exact-match scoring is useless here, and reviewers burn out fast on long-form trace inspection.
What actually works
Three signals do the heavy lifting. Outcome correctness — did the final answer match the ground truth — is necessary but not sufficient. Trajectory cost — number of steps, total tokens, total tool calls — catches the agents that get the right answer the wrong way. And subgoal progress — did the agent advance through expected milestones — catches the silent-failure cases where the agent reaches the answer by accident.
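A minimal sketch of how these three signals might be scored together. The Trajectory and EvalCase shapes, the field names, and the 0.5 suspicion threshold are all illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    """One agent run: ordered tool calls plus the final answer."""
    final_answer: str
    tool_calls: list[str]              # tool names, in call order
    total_tokens: int
    milestones_hit: set[str] = field(default_factory=set)

@dataclass
class EvalCase:
    """Ground truth for one task, including expected subgoals."""
    expected_answer: str
    expected_milestones: set[str]      # subgoals a sound trajectory passes through
    step_budget: int                   # soft cap on tool calls for this task

def score(traj: Trajectory, case: EvalCase) -> dict:
    """Combine outcome, cost, and subgoal progress into one record."""
    outcome = traj.final_answer.strip() == case.expected_answer.strip()
    # Cost ratio above 1.0 means the agent blew past the step budget.
    cost_ratio = len(traj.tool_calls) / max(case.step_budget, 1)
    # Fraction of expected subgoals the trajectory actually reached.
    progress = len(traj.milestones_hit & case.expected_milestones) / max(
        len(case.expected_milestones), 1
    )
    # Right answer with low subgoal progress is the "reached it by accident" case.
    return {
        "outcome": outcome,
        "cost_ratio": cost_ratio,
        "subgoal_progress": progress,
        "suspicious": outcome and progress < 0.5,
    }
```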
Building the eval set
Hand-curate twenty trajectories before you spend a dollar on automation. The first twenty teach you what signals matter for your task. After that, LLM-as-judge with a careful rubric scales further than human review, but only if you’ve calibrated the judge against your hand-labeled set. Skip that calibration and the judge will agree with itself confidently and wrongly.
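One way to make that calibration concrete: score the judge's verdicts against the hand labels with a chance-corrected agreement statistic, and refuse to scale until it clears a floor. Everything here (the judge_fn callable, the record shape, the 0.7 kappa floor) is an assumption for illustration:

```python
def cohens_kappa(human: list[bool], judge: list[bool]) -> float:
    """Chance-corrected agreement between two binary raters."""
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    p_human = sum(human) / n
    p_judge = sum(judge) / n
    # Agreement the two raters would reach by chance alone.
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

def calibrate_judge(hand_labeled: list[dict], judge_fn, kappa_floor: float = 0.7) -> float:
    """Gate the judge behind agreement with the hand-labeled set."""
    human = [case["human_label"] for case in hand_labeled]
    verdicts = [judge_fn(case["trajectory"]) for case in hand_labeled]
    kappa = cohens_kappa(human, verdicts)
    if kappa < kappa_floor:
        raise ValueError(
            f"Judge kappa {kappa:.2f} is below {kappa_floor}; "
            "revise the rubric before trusting it past hand review."
        )
    return kappa
```

Raw percent agreement flatters a judge on imbalanced sets (a judge that always says "pass" agrees often by luck); kappa corrects for that, which is why it gates here.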
Agent eval looks like a metrics problem but is actually a labeling problem. The teams that ship reliable agents are the ones that invested in hand-labeled trajectory datasets, the kind of work the rest of the field dismisses as tedious.
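For a sense of what one record in such a dataset might hold, a hand-labeled entry could look like the following; every field name and value is hypothetical:

```python
# One hand-labeled record in a trajectory dataset (all fields hypothetical).
labeled_example = {
    "task_id": "refund-lookup-017",
    "trajectory": {
        "final_answer": "$42.10",
        "tool_calls": ["search_orders", "get_order", "compute_refund"],
        "total_tokens": 8431,
    },
    "human_label": True,   # reviewer judged the outcome correct
    "milestones_hit": ["found_order", "applied_policy"],
    "notes": "Correct answer, but skipped the policy-version check.",
}
```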