Testing LLM apps when the output is non-deterministic

Testing an LLM app with the testing patterns from a deterministic codebase produces flaky tests that fail on weekends and pass on Mondays. The model's output for the same input is not stable across calls, and even temperature-zero output drifts because the model itself gets updated. Snapshot tests break on every run and stop being read; exact-match assertions fail on whitespace and capitalization differences. The team learns to ignore CI, which is worse than having no tests at all.

The test types that actually catch regressions

- Property-based assertions: the output contains a number between X and Y, the output mentions the user's name, the output is valid JSON. These tests don't care about the exact text, and they catch real regressions.
- Reference-set evaluations: a curated set of inputs with expected categorical outputs (correct/incorrect, not exact strings).
- Calibration tests: at temperature zero, the same input should produce semantically identical outputs across runs. Check semantic similarity, not string equality.
- LLM-as-judge for the open-ended cases, but only after calibrating the judge against a hand-labeled set.
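
To make the first and third of these concrete, here is a minimal pytest sketch. `generate_summary` is a hypothetical wrapper around your app's LLM call, and the field names and bounds are placeholders, not a real API:

```python
# test_llm_properties.py -- a sketch of property-based assertions,
# assuming a hypothetical app function `generate_summary(user_name, month)`
# that returns a JSON string.
import json

from my_app import generate_summary  # hypothetical wrapper around the LLM call


def test_output_is_valid_json():
    out = generate_summary(user_name="Priya", month="2024-03")
    json.loads(out)  # property: parses as JSON; the exact text is irrelevant


def test_output_mentions_user_name():
    out = generate_summary(user_name="Priya", month="2024-03")
    assert "Priya" in out  # property: personalization survived the prompt


def test_total_is_in_plausible_range():
    payload = json.loads(generate_summary(user_name="Priya", month="2024-03"))
    # property: the number lands between known bounds, not at an exact value
    assert 0 <= payload["total_spend"] <= 50_000
```

A calibration test can follow the same pattern: rerun the same input at temperature zero and require high semantic similarity rather than string equality. `embed` stands in for whatever embedding call you already have, and the 0.9 threshold is an assumption to tune against your own data:

```python
# test_llm_calibration.py -- a sketch assuming a hypothetical
# `embed(text) -> list[float]` embedding helper.
import math

from my_app import generate_summary, embed  # hypothetical


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def test_temperature_zero_is_semantically_stable():
    runs = [
        generate_summary(user_name="Priya", month="2024-03", temperature=0)
        for _ in range(3)
    ]
    vectors = [embed(r) for r in runs]
    # property: every pair of runs stays semantically close; equality not required
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            assert cosine(vectors[i], vectors[j]) > 0.9  # threshold is an assumption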

What to give up on

Stop testing for exact wording unless the wording is the contract — for example, structured output where the schema is the test. Stop trying to make tests deterministic with seeds and prompt-caching tricks. Embrace the variance and test the properties that matter.
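
When the wording is the contract, the schema check can be the whole test. A sketch assuming pydantic v2 and a hypothetical `extract_invoice` call that is supposed to return JSON matching a fixed schema:

```python
# test_invoice_schema.py -- sketch assuming pydantic v2 and a hypothetical
# `extract_invoice(text)` that returns a JSON string.
from pydantic import BaseModel

from my_app import extract_invoice  # hypothetical LLM call


class Invoice(BaseModel):
    vendor: str
    total: float
    currency: str


def test_extraction_matches_schema():
    out = extract_invoice("ACME Corp invoice, total $1,200.00 USD")
    # the schema is the contract; any wording around it is free to vary
    Invoice.model_validate_json(out)
```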

The tests that survive in an LLM codebase are the tests that look least like the tests in your other codebases. That feels wrong, and it’s correct.
