Eval

Evaluating agents when there's no single right answer

Evaluating agents when there's no single right answer

Evaluating a single prompt is hard. Evaluating an ...