When the agent fails: recovery patterns that don't loop forever

Most agent failures don’t throw exceptions. They produce plausible-looking output that’s wrong, or quietly retry the same broken approach in a slightly different way. Wrapping agents in try/catch is the wrong mental model: the agent didn’t crash, it just kept going in a useless direction. Recovery has to be designed in, not bolted on.

The failure modes that need different recovery

Tool failures — the API returned an error or timed out — are the easiest case: the agent should see the error and try a different approach. Reasoning failures — the agent is confidently wrong about what step comes next — are harder, because the agent doesn’t know it’s wrong. Loop failures — the agent retries the same approach over and over — are the worst, because each iteration looks productive in isolation.
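For the tool-failure case, one way to make sure the agent actually sees the error is to turn it into an ordinary observation instead of letting it propagate. The following is a minimal sketch, not a prescribed implementation: `Observation` and `call_tool` are hypothetical names, and it assumes tools are plain Python callables that raise on failure.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Observation:
    """What the agent reads after a tool call (hypothetical type)."""
    ok: bool
    content: str


def call_tool(tool: Callable[..., Any], **kwargs: Any) -> Observation:
    """Run a tool and convert any failure into text the agent can read."""
    try:
        return Observation(ok=True, content=str(tool(**kwargs)))
    except TimeoutError as exc:
        # Timeouts get their own message so the agent can tell them apart.
        return Observation(ok=False, content=f"tool timed out: {exc}")
    except Exception as exc:
        # Any other failure is reported by exception type and message.
        return Observation(
            ok=False, content=f"tool error: {type(exc).__name__}: {exc}"
        )
```

The failed observation goes into the agent's context like any other result, so the next step can choose a different approach. This only covers tool failures; reasoning and loop failures produce no exception to catch.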

Recovery patterns that survive contact with reality

Cap iteration count, always. Detect repetition in the action history — if the agent has called the same tool with similar arguments three times in a row, escalate or abort. For reasoning failures, a separate “is the current plan still right?” check, run periodically by a smaller model on the action log, catches the worst cases. None of this is glamorous, and all of it gets cut from the first version of every agent because it feels paranoid until the first time it fires.
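The iteration cap and the repetition check can be sketched in a few lines. This is an illustration under stated assumptions: `LoopGuard` and its parameters are hypothetical, the default limits are arbitrary, and "similar arguments" is approximated by an exact match after crude normalization (sorted keys, lowercased text). A real system would probably want fuzzier matching. The periodic plan check is reduced to a "due" flag; the call to the smaller model is left to the caller.

```python
import json
from dataclasses import dataclass, field


@dataclass
class LoopGuard:
    """Tracks the action history and decides whether the agent may go on."""
    max_steps: int = 20        # hard cap on actions (assumed default)
    max_repeats: int = 3       # identical actions in a row before escalating
    plan_check_every: int = 5  # run the plan check every N actions
    history: list = field(default_factory=list)

    @staticmethod
    def _key(tool: str, args: dict) -> tuple:
        # Crude normalization: stable key order, case-insensitive text.
        return tool, json.dumps(args, sort_keys=True, default=str).lower()

    def check(self, tool: str, args: dict) -> str:
        """Record an action; return 'continue', 'escalate', or 'abort'."""
        self.history.append(self._key(tool, args))
        if len(self.history) > self.max_steps:
            return "abort"
        tail = self.history[-self.max_repeats:]
        if len(tail) == self.max_repeats and len(set(tail)) == 1:
            return "escalate"
        return "continue"

    def plan_check_due(self) -> bool:
        """True when the action log should go to the plan reviewer."""
        return bool(self.history) and len(self.history) % self.plan_check_every == 0
```

In the agent loop, call `check` before executing each action, stop or hand off on anything other than `"continue"`, and call `plan_check_due` after each step to decide whether to send the log to the reviewing model.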

Agent failure recovery is the part of the system that exists to keep small failures from becoming catastrophic. Skipping it is how you discover that “agentic” and “autonomous” are not the same word.
