Caching LLM responses: not just by prompt hash
- William Jacob
- Performance, Caching
- 09 May, 2026
The first cache anyone adds to an LLM application is a key-value store mapping prompt hash to response. The hit rate looks reasonable in development and disappointing in production, because real users phrase the same question fourteen different ways and a SHA hash treats them all as different requests. The naive cache hits one in twenty. The thoughtful cache hits one in three.
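For reference, the naive version really is just a few lines. This is a minimal sketch, not any particular library's API, and the class and method names are made up for illustration:

```python
import hashlib


class ExactMatchCache:
    """Naive LLM cache: the key is a SHA-256 hash of the exact prompt string."""

    def __init__(self):
        self._store: dict[str, str] = {}

    @staticmethod
    def _key(prompt: str) -> str:
        # Any change in whitespace, casing, or phrasing produces a new key,
        # which is exactly why the hit rate collapses with real users.
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get(self, prompt: str) -> str | None:
        return self._store.get(self._key(prompt))

    def put(self, prompt: str, response: str) -> None:
        self._store[self._key(prompt)] = response
```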
Where the hit rate actually comes from
Semantic caching — embed the prompt, look up by similarity, return the cached response if the match is close enough — recovers most of the misses the hash-based cache loses. The threshold matters: too loose and you serve wrong answers, too strict and you’re back to one in twenty. Combining exact-match and semantic-match gives you the speed of the first and the recall of the second, with the wrongness of neither. Tier the cache: keep recent hot prompts in memory, fall through to a vector store for the long tail.
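A minimal sketch of the two-tier idea, assuming an `embed()` callable that returns unit-normalised vectors; in production the second tier would be a real vector store with eviction, and the threshold below is a placeholder you'd tune against your own traffic's false-hit rate:

```python
import hashlib

import numpy as np

SIMILARITY_THRESHOLD = 0.92  # too low serves wrong answers, too high misses paraphrases


class TieredCache:
    """Exact-match tier for hot prompts, semantic tier for paraphrases."""

    def __init__(self, embed):
        self._embed = embed                   # assumed: prompt -> unit-normalised vector
        self._exact: dict[str, str] = {}      # prompt hash -> response
        self._vectors: list[np.ndarray] = []  # embeddings of cached prompts
        self._responses: list[str] = []

    @staticmethod
    def _key(prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get(self, prompt: str) -> str | None:
        # Tier 1: exact match, no embedding call, effectively free.
        hit = self._exact.get(self._key(prompt))
        if hit is not None:
            return hit
        # Tier 2: semantic match over stored embeddings.
        if not self._vectors:
            return None
        query = self._embed(prompt)
        sims = np.stack(self._vectors) @ query  # cosine similarity, since vectors are unit length
        best = int(np.argmax(sims))
        if sims[best] >= SIMILARITY_THRESHOLD:
            return self._responses[best]
        return None

    def put(self, prompt: str, response: str) -> None:
        self._exact[self._key(prompt)] = response
        self._vectors.append(self._embed(prompt))
        self._responses.append(response)
```

The ordering is the point: the exact tier answers the hot, repeated prompts without ever paying for an embedding, and the semantic tier only sees the traffic the hash already failed on.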
What never to cache
Function-calling responses for tools that have side effects. Anything where the response depends on time, user state, or external data the cache doesn’t see. Streaming responses, unless you’re caching the full sequence and re-streaming it — half-cached half-live is the worst of both. The discipline is the same as any cache: invalidation is the hard part, and “I’ll figure it out later” means “I’ll get paged at 3am.”
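That discipline is easiest to enforce as a single gate in front of the cache write. A sketch, with every field name an assumption about your own request and response shapes rather than any provider's schema:

```python
def should_cache(request: dict, response: dict) -> bool:
    """Decide whether a completed response is safe to write to the cache."""
    # Tool calls may have side effects; replaying a cached result would skip them.
    if response.get("tool_calls"):
        return False
    # Anything personalised or time-sensitive is stale the moment it's stored.
    if request.get("user_state") or request.get("depends_on_now"):
        return False
    # Only cache a stream once it has fully completed; re-stream it on a hit.
    if response.get("partial_stream"):
        return False
    return True
```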
The LLM cache that actually saves money is not the one in the tutorial. It’s the one your team built after six months of profiling.