Rate limits that protect users, not just upstream
- Sam Wilson
- Reliability, Rate Limiting
- 14 May, 2026
Rate limiting in an LLM app means solving three problems at once, and most implementations solve only one. There’s the upstream problem: the model API has a quota and you must not exceed it. There’s the cost problem: a runaway loop in your own code can spend a thousand dollars in an hour. And there’s the user-protection problem: a single user, malicious or accidental, should not be able to degrade the service for everyone else.
What separate limits look like
Per-key limits at the upstream boundary, sized to the API quota with headroom. Per-user limits in your app, tighter than the upstream limit, sized to “how much can a single user reasonably consume in an hour.” Per-conversation token budgets, because conversations grow over time and an unbounded conversation is an unbounded cost. Each of these limits operates on a different timescale and a different unit; collapsing them into a single rate limiter is how you end up surprised at the bill.
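A minimal sketch of what keeping the layers separate might look like. All names here (`TokenBucket`, `LayeredLimiter`, the specific limits) are illustrative assumptions, not a prescribed design; the point is that each layer has its own unit and timescale, and a request is checked against all of them.

```python
import time


class TokenBucket:
    """Token bucket: `capacity` tokens, refilled at `refill_per_sec`."""

    def __init__(self, capacity, refill_per_sec):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def try_acquire(self, cost=1.0):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class LayeredLimiter:
    """Three separate limits, each in its own unit on its own timescale."""

    def __init__(self, upstream_rps, user_per_hour, convo_token_budget):
        # Per-key upstream limit: requests per second, with headroom
        # built into upstream_rps (size it below the real API quota).
        self.upstream = TokenBucket(upstream_rps, upstream_rps)
        # Per-user limit: requests per hour, tighter than upstream.
        self.user_per_hour = user_per_hour
        self.users = {}
        # Per-conversation budget: a hard token cap, not a rate at all.
        self.convo_token_budget = convo_token_budget
        self.convo_spent = {}

    def check(self, user_id, convo_id, tokens_requested):
        # Conversation budget first: unbounded conversations are
        # unbounded cost regardless of request rate.
        spent = self.convo_spent.get(convo_id, 0)
        if spent + tokens_requested > self.convo_token_budget:
            return "conversation_budget_exhausted"
        # Per-user bucket: one user cannot drain the shared quota.
        bucket = self.users.setdefault(
            user_id, TokenBucket(self.user_per_hour, self.user_per_hour / 3600))
        if not bucket.try_acquire():
            return "user_rate_limited"
        # Upstream key limit last: protects the API quota itself.
        if not self.upstream.try_acquire():
            return "upstream_rate_limited"
        self.convo_spent[convo_id] = spent + tokens_requested
        return "ok"
```

Ordering matters: the cheapest, most user-specific checks run first, so a single over-budget conversation never consumes shared upstream tokens.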
What to do when limits trip
Surface budget exhaustion as a first-class state in your app, not a generic error. Users tolerate “you’ve used your quota for the hour” much better than “something went wrong.” For developer-facing APIs, return rate-limit headers so clients can back off correctly without you reverse-engineering their retry logic. For internal services, page on rate-limit hits at the upstream — that signal is usually a bug, not legitimate load.
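One way to make those distinctions concrete, sketched as a plain function (the names, status codes, and `X-RateLimit-*` header set are assumptions for illustration): user quota exhaustion becomes a 429 with a human-readable message and a `Retry-After`, while upstream exhaustion becomes a 503, because that one is your problem, not the client's.

```python
def rate_limit_response(outcome, limit, remaining, reset_after_sec):
    """Translate a limiter outcome into a response clients can act on.

    `outcome` is assumed to be one of: "ok", "user_rate_limited",
    "upstream_rate_limited" (hypothetical states for this sketch).
    """
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_after_sec),
    }
    if outcome == "ok":
        return {"status": 200, "headers": headers, "body": None}
    if outcome == "user_rate_limited":
        # First-class budget exhaustion: tell the user what happened
        # and when it resets, and tell clients how long to back off.
        headers["Retry-After"] = str(reset_after_sec)
        return {
            "status": 429,
            "headers": headers,
            "body": {
                "error": "quota_exhausted",
                "message": ("You've used your quota for the hour. "
                            f"It resets in {reset_after_sec} seconds."),
            },
        }
    # Upstream exhaustion: a capacity problem on our side, so don't
    # blame the caller; return 503 (and page on it internally).
    headers["Retry-After"] = str(reset_after_sec)
    return {
        "status": 503,
        "headers": headers,
        "body": {
            "error": "temporarily_unavailable",
            "message": "Service is briefly over capacity. Please retry.",
        },
    }
```

Keeping the two failure modes on different status codes also makes the paging rule from the paragraph above trivial to implement: alert on 503s from this path, not on 429s.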
A good rate limiter is invisible when things are normal and decisive when they aren’t. The bad rate limiter is the one you only think about after an incident.