Rate limits that protect users, not just upstream

Rate limiting in an LLM app means solving three problems at once, and most implementations solve only one. There’s the upstream problem: the model API has a quota and you must not exceed it. There’s the cost problem: a runaway loop in your own code can spend a thousand dollars in an hour. And there’s the user-protection problem: a single user, malicious or accidental, should not be able to degrade the service for everyone else.

What separate limits look like

Per-key limits at the upstream boundary, sized to the API quota with headroom. Per-user limits in your app, tighter than the upstream limit, sized to “how much can a single user reasonably consume in an hour.” Per-conversation token budgets, because conversations grow over time and an unbounded conversation is an unbounded cost. Each of these limits operates on a different timescale and a different unit; collapsing them into a single rate limiter is how you end up surprised at the bill.
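A minimal sketch of what keeping those limits separate can look like in code. Everything here is illustrative: the thresholds, the sliding-window approach, and names like check_limits are placeholders, not a prescription for any particular framework.

```python
import time
from collections import defaultdict, deque

# Hypothetical thresholds; size these to your own quota and product limits.
UPSTREAM_RPM = 450                   # e.g. upstream allows 500 req/min; keep headroom
USER_REQUESTS_PER_HOUR = 60          # what one user can reasonably consume
CONVERSATION_TOKEN_BUDGET = 50_000   # hard cap on token spend per conversation

class SlidingWindowLimiter:
    """Counts events in a trailing window; one instance per scope (API key, user)."""
    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.events: dict[str, deque] = defaultdict(deque)

    def allow(self, scope: str) -> bool:
        now = time.monotonic()
        q = self.events[scope]
        while q and now - q[0] > self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True

# Three limiters, three timescales, three units.
upstream_limiter = SlidingWindowLimiter(UPSTREAM_RPM, window_seconds=60)
per_user_limiter = SlidingWindowLimiter(USER_REQUESTS_PER_HOUR, window_seconds=3600)
conversation_tokens: dict[str, int] = defaultdict(int)  # token spend, not request count

def check_limits(api_key: str, user_id: str, conversation_id: str,
                 estimated_tokens: int) -> str | None:
    """Return the name of the limit that tripped, or None if the call may proceed."""
    if conversation_tokens[conversation_id] + estimated_tokens > CONVERSATION_TOKEN_BUDGET:
        return "conversation_budget"
    if not per_user_limiter.allow(user_id):
        return "per_user"
    if not upstream_limiter.allow(api_key):
        return "upstream"
    conversation_tokens[conversation_id] += estimated_tokens
    return None
```

Note that the per-user and per-conversation checks run before the upstream check, so one user exhausting their own budget never consumes headroom reserved for everyone else.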

What to do when limits trip

Surface budget exhaustion as a first-class state in your app, not a generic error. Users tolerate “you’ve used your quota for the hour” much better than “something went wrong.” For developer-facing APIs, return rate-limit headers so clients can back off correctly without you reverse-engineering their retry logic. For internal services, page on rate-limit hits at the upstream boundary; that signal is usually a bug, not legitimate load.
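One way that might look in practice, continuing the sketch above: turn the tripped limit into a distinct 429 response with conventional rate-limit headers, rather than a generic failure. The field names and messages are illustrative.

```python
import time

# Which limit tripped determines what the user sees; a generic error hides this.
RETRY_HINTS = {
    "per_user": "You've used your quota for the hour.",
    "conversation_budget": "This conversation has reached its size limit; start a new one.",
    "upstream": "We're at capacity right now; please retry shortly.",
}

def limit_response(limit_name: str, limit: int, remaining: int, reset_epoch: float) -> dict:
    """Build a 429 payload with rate-limit headers so clients can back off correctly."""
    return {
        "status": 429,
        "headers": {
            "Retry-After": str(max(0, int(reset_epoch - time.time()))),
            "X-RateLimit-Limit": str(limit),
            "X-RateLimit-Remaining": str(max(0, remaining)),
            "X-RateLimit-Reset": str(int(reset_epoch)),
        },
        "body": {
            "error": "rate_limited",
            "limit": limit_name,  # which limit tripped, not just "something went wrong"
            "message": RETRY_HINTS.get(limit_name, "Rate limit exceeded."),
        },
    }
```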

A good rate limiter is invisible when things are normal and decisive when they aren’t. The bad rate limiter is the one you only think about after an incident.
