Evaluating agents when there's no single right answer

Evaluating a single prompt is hard. Evaluating an agent that runs ten tool calls before answering is a different category of hard. The trajectories that produce a correct answer rarely match exactly. The trajectories that produce a wrong answer often look reasonable until step seven. Standard exact-match scoring is useless here, and reviewers burn out fast on long-form trace inspection.

What ends up actually working

Three signals do the heavy lifting. Outcome correctness — did the final answer match the ground truth — is necessary but not sufficient. Trajectory cost — number of steps, total tokens, total tool calls — catches the agents that get the right answer the wrong way. And subgoal progress — did the agent advance through expected milestones — catches the silent-failure cases where the agent reaches the answer by accident.
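The three signals can be combined into one record per trajectory. A minimal sketch, assuming a simple `Trajectory` structure and milestone sets; the field names and the step budget are illustrative, not from any particular framework:

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    final_answer: str
    steps: list                                   # tool-call records, one per step
    total_tokens: int
    milestones_hit: set = field(default_factory=set)  # subgoals the agent reached

def score(traj: Trajectory, ground_truth: str,
          expected_milestones: set, step_budget: int = 10) -> dict:
    """Score one trajectory on the three signals described above."""
    return {
        # Outcome correctness: necessary but not sufficient.
        "correct": traj.final_answer.strip() == ground_truth.strip(),
        # Trajectory cost: >1.0 flags right-answer-the-wrong-way runs.
        "cost_ratio": len(traj.steps) / step_budget,
        # Subgoal progress: fraction of expected milestones reached,
        # which catches answers arrived at by accident.
        "progress": len(traj.milestones_hit & expected_milestones)
                    / max(len(expected_milestones), 1),
    }
```

A trajectory that scores `correct=True` but `progress=0.3` is exactly the silent-failure case worth pulling for manual review.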

Building the eval set

Hand-curate twenty trajectories before you spend a dollar on automation. The first twenty teach you what signals matter for your task. After that, LLM-as-judge with a careful rubric scales further than human review, but only if you’ve calibrated the judge against your hand-labeled set. Skip that calibration and the judge will agree with itself confidently and wrongly.
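Calibration here just means checking judge labels against the hand-labeled set before trusting the judge at scale. A sketch using Cohen's kappa for binary pass/fail labels; the 0.7 threshold is an assumption to tune per task, not a standard:

```python
def cohens_kappa(human: list, judge: list) -> float:
    """Chance-corrected agreement between human and judge binary labels."""
    n = len(human)
    agree = sum(h == j for h, j in zip(human, judge)) / n
    # Expected agreement by chance, from each rater's label frequencies.
    p_h = sum(human) / n
    p_j = sum(judge) / n
    expected = p_h * p_j + (1 - p_h) * (1 - p_j)
    return (agree - expected) / (1 - expected) if expected < 1 else 1.0

def judge_is_calibrated(human: list, judge: list, min_kappa: float = 0.7) -> bool:
    # Raw percent agreement looks flattering when one label dominates;
    # kappa corrects for that, which is the failure mode of a judge
    # that agrees with itself confidently and wrongly.
    return cohens_kappa(human, judge) >= min_kappa
```

Raw agreement is the wrong gate: a judge that always says "pass" agrees with an 80%-pass dataset 80% of the time while carrying zero signal, and kappa exposes that.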

Agent eval looks like a metrics problem and is actually a labeling problem. The teams that ship reliable agents have invested in hand-labeled trajectory datasets that the rest of the field would consider too tedious to build.
