Retry, backoff, and the ghosts in your latency graph

Retry logic for LLM calls is one of those things that feels obvious until it nearly takes down a service. A 429 from the model API is not the same as a 429 from a cache lookup. A timeout that fires after 30 seconds because the model is generating unusually slowly is not the same as a network blip. Treating them the same is how a five-minute incident on the upstream API becomes a forty-minute incident on yours.
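
To make that distinction concrete, here is a minimal sketch of classifying a failure before deciding whether to retry. The RetryDecision buckets and the classify helper are hypothetical names for illustration, not any particular SDK's API; only the status codes are standard HTTP.

```python
from enum import Enum

class RetryDecision(Enum):
    RETRY_WITH_BACKOFF = "retry_with_backoff"  # transient: 429, 5xx, slow generation
    RETRY_ONCE_FAST = "retry_once_fast"        # genuine network blip
    FAIL_FAST = "fail_fast"                    # will not succeed on retry

def classify(status: int | None, timed_out: bool) -> RetryDecision:
    if timed_out:
        # A 30s timeout mid-generation usually means the model is slow,
        # not down; an immediate retry just doubles the load on it.
        return RetryDecision.RETRY_WITH_BACKOFF
    if status is None:
        # Connection reset or DNS failure before any response arrived.
        return RetryDecision.RETRY_ONCE_FAST
    if status == 429 or 500 <= status < 600:
        return RetryDecision.RETRY_WITH_BACKOFF
    # Other 4xx responses, including content-policy rejections,
    # are deterministic: retrying sends the same request to the same answer.
    return RetryDecision.FAIL_FAST
```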

What naive retry hides

The classic anti-pattern is exponential backoff with no jitter and no cap. When the upstream rate-limits, every client retries on the same schedule, the second wave hits the same limit, and the system enters a thundering herd that the upstream cannot recover from. Your latency dashboard shows a clean spike followed by tails that don’t decay, and the cause looks like the model API but is actually your retry policy.
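
A toy simulation makes the wave effect visible. The client count and delays below are illustrative numbers, not measurements from a real incident.

```python
import random

def retry_times(jitter: bool, attempts: int = 3, base: float = 1.0) -> list[float]:
    t, times = 0.0, []
    for i in range(attempts):
        window = base * (2 ** i)
        # Full jitter: wait a uniform random fraction of the backoff window.
        t += random.uniform(0, window) if jitter else window
        times.append(t)
    return times

# Without jitter, all 100 clients retry at exactly t = 1, 3, 7: three
# synchronized waves hitting the recovering upstream at the same instant.
assert all(retry_times(jitter=False) == [1.0, 3.0, 7.0] for _ in range(100))

# With jitter, even the first wave spreads across a full second.
first = sorted(retry_times(jitter=True)[0] for _ in range(100))
print(f"first retries land between {first[0]:.2f}s and {first[-1]:.2f}s")
```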

What works in practice

Cap retries at two for interactive paths, three for background paths. Add jitter — a uniform random delay between zero and the backoff window. Distinguish error classes: rate limits, timeouts, content-policy refusals, and transient 5xx all deserve different policies. And budget retries against a deadline, not a count — a request that has already taken eight seconds is not improved by another four-second retry.
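
Put together, a deadline-budgeted retry loop might look like the following sketch. The call signature, the exception types, and the specific numbers are assumptions for illustration; adapt them to your client library.

```python
import random
import time

class RateLimited(Exception): ...   # placeholder for your client's 429 error
class Transient(Exception): ...     # placeholder for transient 5xx errors

def call_with_budget(call, deadline_s: float = 10.0,
                     max_retries: int = 2, base: float = 0.5):
    start = time.monotonic()
    for attempt in range(max_retries + 1):
        remaining = deadline_s - (time.monotonic() - start)
        if remaining <= 0:
            raise TimeoutError("retry budget exhausted")
        try:
            # Never let a single attempt run past the overall deadline.
            return call(timeout=remaining)
        except (RateLimited, Transient):
            if attempt == max_retries:
                raise
            # Full jitter over the backoff window, capped by the time left.
            window = min(base * (2 ** attempt), remaining)
            time.sleep(random.uniform(0, window))
```

Note that content-policy refusals are deliberately not caught here: they fail fast, because no amount of retrying changes a deterministic rejection.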

Most production LLM incidents I’ve reviewed had a working primary path and a broken retry path. The retry path is the one that catches you.
