Tool selection: when the model should pick, and when you should

Tool-using agents look powerful in demos because the model is choosing what to do next. They look fragile in production for the same reason. The number of available tools grows linearly with features, but the overlaps between them grow quadratically — past about a dozen tools, the model starts conflating their roles and picking based on surface similarity in the tool names.

What goes wrong as tool count grows

Beyond ten or fifteen tools, the descriptions blur together in the model’s representation. The model picks a search tool when a database lookup was correct, because both have “lookup” in their description. It picks the simpler tool when the complex one was needed, because the simpler one matched the user phrasing. None of this shows up in single-call testing — it surfaces when one of the tools quietly handles a request another tool was supposed to handle, and the answer is technically valid but operationally wrong.

Architectural answers, not prompt answers

Group tools by purpose and route each request to a sub-agent that only sees the relevant subset. Surface fewer tools to the top-level model than you actually expose internally — five visible tools with clear purposes outperform twenty undifferentiated ones. For destructive or expensive tools, require an exact, explicitly confirmed name match rather than accepting the model's choice on its own.
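One way to implement the grouping is a small router that holds the full tool inventory but only exposes the subset matching a request's purpose, and that refuses to run destructive tools without an explicit name confirmation. A minimal sketch — the `Tool` and `Router` names, the purpose categories, and the confirmation mechanism are all illustrative, not any particular framework's API:

```python
from dataclasses import dataclass

@dataclass
class Tool:
    name: str
    purpose: str          # coarse category used for routing, e.g. "data" or "comms"
    destructive: bool = False

class Router:
    """Expose only the tool subset relevant to a request's purpose,
    instead of handing every tool to the top-level model."""

    def __init__(self, tools):
        self.groups = {}
        for t in tools:
            self.groups.setdefault(t.purpose, []).append(t)

    def visible_tools(self, purpose):
        # The sub-agent handling this purpose sees only its own subset.
        return self.groups.get(purpose, [])

    def invoke(self, purpose, tool_name, confirmed_name=None):
        tools = {t.name: t for t in self.visible_tools(purpose)}
        tool = tools.get(tool_name)
        if tool is None:
            raise LookupError(f"{tool_name!r} is not available for purpose {purpose!r}")
        # Destructive tools require an exact name match supplied by the
        # caller (or a confirmation step), not a model-chosen string.
        if tool.destructive and confirmed_name != tool.name:
            raise PermissionError(f"{tool.name!r} requires explicit confirmation")
        return tool
```

The point of the sketch is the asymmetry: the model still picks among the visible tools, but the set it picks from is decided outside the model, and the dangerous calls need a second, exact signal before they run.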

The number of tools an agent should choose from is much smaller than the number of tools you’d like to give it. Past a threshold, every additional tool makes every other choice worse.
