When the agent fails: recovery patterns that don't loop forever
- Sam Wilson
- Reliability, Agents
- 12 May, 2026
Most agent failures don’t throw exceptions. They produce plausible-looking output that’s wrong, or quietly retry the same broken approach with slight variations. Wrapping agents in try/catch is the wrong mental model: the agent didn’t crash, it just kept going in a useless direction. Recovery has to be designed in, not bolted on.
The failure modes that need different recovery
Tool failures — the API returned an error or timed out — are the easiest case: the agent should see the error and try a different approach. Reasoning failures — the agent is confidently wrong about what step comes next — are harder, because the agent doesn’t know it’s wrong. Loop failures — the agent retries the same approach over and over — are the worst, because each iteration looks productive in isolation.
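For the tool-failure case, one way to make sure the agent "sees the error" is to turn every tool exception into an ordinary observation that goes back into its context. The sketch below is a minimal illustration, not a prescribed design: `Observation` and `run_tool` are hypothetical names, and it assumes tools are plain Python callables. The try/except here wraps a single tool call, not the agent as a whole.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Observation:
    """What the agent reads back after a tool call (hypothetical type)."""
    ok: bool
    content: str


def run_tool(tool: Callable[..., Any], **kwargs: Any) -> Observation:
    """Run one tool call and convert any failure into text the agent can read."""
    try:
        return Observation(ok=True, content=str(tool(**kwargs)))
    except Exception as exc:
        # Deliberately broad: the goal is that the agent sees the failure
        # as an observation instead of the whole run dying on it.
        return Observation(ok=False, content=f"{type(exc).__name__}: {exc}")
```

The failed observation is appended to the action history like any other result, which leaves the choice of a different approach to the agent. It does nothing for reasoning or loop failures, which never raise in the first place.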
Recovery patterns that survive contact with reality
Cap iteration count, always. Detect repetition in the action history — if the agent has called the same tool with similar arguments three times in a row, escalate or abort. For reasoning failures, a separate “is the current plan still right?” check, run periodically by a smaller model on the action log, catches the worst cases. None of this is glamorous, and all of it gets cut from the first version of every agent because it feels paranoid until the first time it fires.
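The first two patterns, the iteration cap and the repetition check, fit in a few dozen lines. What follows is a sketch under stated assumptions rather than a reference implementation: `LoopGuard` is a hypothetical name, the default cap of 25 steps and the 0.9 similarity threshold are placeholder values, and "similar arguments" is approximated by string similarity over the JSON-serialized arguments using the standard library's `difflib`.

```python
import json
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import Any


@dataclass
class LoopGuard:
    max_steps: int = 25       # hard cap on total actions (assumed default)
    max_repeats: int = 3      # consecutive near-identical calls before escalating
    similarity: float = 0.9   # threshold for "similar arguments" (assumed default)
    history: list[tuple[str, str]] = field(default_factory=list)

    def record(self, tool: str, args: dict[str, Any]) -> str:
        """Log one action; return 'continue', 'escalate', or 'abort'."""
        self.history.append((tool, json.dumps(args, sort_keys=True)))
        if len(self.history) >= self.max_steps:
            return "abort"
        recent = self.history[-self.max_repeats:]
        if len(recent) == self.max_repeats and self._all_similar(recent):
            return "escalate"
        return "continue"

    def _all_similar(self, calls: list[tuple[str, str]]) -> bool:
        """True if every call uses the same tool with near-identical arguments."""
        first_tool, first_args = calls[0]
        return all(
            tool == first_tool
            and SequenceMatcher(None, first_args, args, autojunk=False).ratio()
            >= self.similarity
            for tool, args in calls[1:]
        )
```

A usage sketch: call `record` after each action and let the return value decide whether the loop keeps going.

```python
guard = LoopGuard(max_steps=10)
guard.record("search", {"q": "quarterly revenue 2025 report"})   # 'continue'
guard.record("search", {"q": "quarterly revenue 2025 reports"})  # 'continue'
guard.record("search", {"q": "quarterly revenue 2025 report"})   # 'escalate'
```

String similarity is a crude proxy. It will miss loops where the agent rephrases heavily, which is where the periodic plan check by a second model earns its place.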
Agent failure recovery is the part of the system that exists to keep small failures from becoming catastrophic. Skipping it is how you discover that “agentic” and “autonomous” are not the same word.