Why Every API Has Rate Limits
Send too many requests to GitHub, Stripe, or AWS and you will hit a wall. The server returns HTTP 429 Too Many Requests and your script grinds to a halt. This is rate limiting in action, and every public API depends on it.
Rate limiting protects systems from four distinct threats:
Abuse and DDoS: a single attacker (or a botnet) trying to overwhelm the service.
Runaway scripts: a misbehaving client stuck in a loop, hammering an endpoint thousands of times per second by accident.
Cost control: downstream services cost money. Without limits, one user could rack up a $50,000 bill in a few hours.
Fairness: shared infrastructure should distribute capacity evenly, not let one user starve everyone else.
This article walks through how to design a rate limiter that handles all of this, scales across many servers, and adds almost no latency to legitimate requests. We will cover the major algorithms, where to deploy the limiter, how to coordinate state across servers, and the production gotchas nobody warns you about.
What We Are Building
Functional Requirements
Limit how many requests each client (a user, API key, or IP) can make within a time window.
Reject requests over the limit with HTTP 429 and headers that tell the client when to retry.
Support different limits per user and per endpoint, adjustable without a redeploy.
Non-Functional Requirements
Very low latency: the rate check itself must add under 3 ms p95. Anything more and you slow every single request through the system.
High availability: if the rate limiter is down, the API behind it might be too. Pick a fail mode (open or closed) and stick with it.
Distributed: multiple servers must share state correctly. A user making 50 requests to server A and 50 to server B must count as 100 total.
Accurate: the count cannot drift. Race conditions across servers must not let users sneak past the limit.
The Algorithms
Five algorithms dominate the rate-limiting space. Each one trades accuracy, memory, and complexity differently. Let us walk through each.
Algorithm 1: Fixed Window Counter
The simplest possible rate limiter. Divide time into fixed windows (one minute, one hour) and count requests in each window. When the count hits the limit, reject. When the window ends, reset.
Simple to implement. Easy to reason about. But it has a critical flaw at window boundaries.
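A minimal in-memory sketch, assuming a single process; the window_counts dictionary and the defaults are illustrative:
import time
from collections import defaultdict

# window_counts[(user_id, window_index)] -> requests seen in that window
window_counts = defaultdict(int)

def is_allowed(user_id, limit=100, window_seconds=60):
    window = int(time.time() // window_seconds)   # which fixed window we are in
    key = (user_id, window)
    if window_counts[key] >= limit:
        return False
    window_counts[key] += 1
    return True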
The Boundary Bursting Problem
A user makes 100 requests in the last 10 seconds of one window, then 100 more in the first 10 seconds of the next window. Both windows see exactly 100 requests, both are within the limit. But across that 20-second span, 200 requests went through. The user effectively doubled their rate without violating any window's count.
Window 1: quiet ......... 100 reqs | Window 2: 100 reqs ......... quiet
This is not an edge case. It happens at every single window boundary, all the time. For any rate limit that matters, fixed window alone is not enough.
Algorithm 2: Sliding Window Log
Instead of counting per window, store a timestamp for every request the user makes. To check the rate, count how many timestamps fall within the last N seconds.
import time
from collections import defaultdict, deque

# One deque of request timestamps per user (in-memory, single-process)
user_logs = defaultdict(deque)

def is_allowed(user_id, limit=100, window_seconds=60):
    now = time.time()
    cutoff = now - window_seconds
    log = user_logs[user_id]
    # Drop timestamps that have aged out of the window
    while log and log[0] < cutoff:
        log.popleft()
    # Reject if the last window_seconds already holds `limit` requests
    if len(log) >= limit:
        return False
    log.append(now)
    return True
Pros: exactly accurate. No boundary problem. The "window" really is the last 60 seconds.
Cons: stores a timestamp per request. A user making 100 requests per minute keeps roughly 800 bytes of raw timestamps (100 eight-byte values) at any moment, before any per-entry overhead. At millions of users, this gets expensive fast.
Use this when accuracy matters more than memory. For most APIs, the next algorithm is a better trade-off.
Algorithm 3: Sliding Window Counter
A clever hybrid. Keep two counters: the current window and the previous window. Compute a weighted estimate based on how far into the current window we are.
import time
from collections import defaultdict

# counters[user_id][window_index] -> request count (in-memory, single-process)
counters = defaultdict(lambda: defaultdict(int))

def estimate_count(user_id, window_seconds=60):
    now = time.time()
    current_window = int(now // window_seconds)
    previous_window = current_window - 1
    fraction_into_current = (now % window_seconds) / window_seconds
    current_count = counters[user_id][current_window]
    previous_count = counters[user_id][previous_window]
    # Weight the previous window by how much of it still overlaps the sliding window
    return previous_count * (1 - fraction_into_current) + current_count
The math: if we are 30 seconds into a 60-second window, we count 50% of last window's requests + 100% of this window's. As time progresses through the current window, the previous window's contribution fades.
Pros: almost as accurate as a sliding window log, but with constant memory per user (just two counters).
Cons: approximate, not exact. Assumes uniform distribution within each window, which is not always true.
This is a strong default for most production systems. It eliminates the boundary problem at very low memory cost.
Algorithm 4: Token Bucket (The Industry Standard)
This is what AWS, Stripe, and most modern APIs actually use. It is intuitive and handles bursts gracefully.
Imagine a bucket with a maximum capacity (say 100 tokens). Tokens are added at a steady rate (say 100 per minute). Every API request consumes one token. If the bucket is empty, the request is rejected.
Two parameters control the behavior:
Capacity determines how big a burst you allow. A capacity of 100 lets a quiet user suddenly fire 100 requests at once.
Refill rate determines the long-term sustained rate. 100 per minute means over any extended period, the user averages 100 per minute.
The clever bit: tokens accumulate during quiet periods. A user who has been silent for 5 minutes can burst 100 immediate requests because the bucket is full. But they cannot exceed the refill rate over the long term.
This naturally handles the boundary problem. Bursts are allowed, but they always come at the cost of accumulated tokens, which then need to refill.
import time
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Bucket:
    tokens: float = 100.0                                   # new buckets start full
    last_refill: float = field(default_factory=time.time)

buckets = defaultdict(Bucket)   # one bucket per user (in-memory, single-process)

def is_allowed(user_id, capacity=100, refill_per_sec=1.66):  # 1.66 ≈ 100 tokens / 60 s
    now = time.time()
    bucket = buckets[user_id]
    # Refill based on time elapsed since the last check, capped at capacity
    elapsed = now - bucket.last_refill
    bucket.tokens = min(capacity, bucket.tokens + elapsed * refill_per_sec)
    bucket.last_refill = now
    # Spend one token if available
    if bucket.tokens >= 1:
        bucket.tokens -= 1
        return True
    return False
Algorithm 5: Leaky Bucket
A different mental model: requests fill a queue (the bucket), and the queue drains at a fixed rate (the leak). If the bucket overflows, new requests are rejected.
The key difference from token bucket: leaky bucket smooths the output rate. Even if 100 requests arrive in a burst, the system processes them one at a time at the leak rate. Token bucket allows the burst to pass through.
Use leaky bucket when downstream systems cannot handle bursts. Use token bucket when bursts are okay (most APIs).
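As a rough illustration, here is the "leaky bucket as a meter" variant, which tracks the fill level rather than holding an actual queue; a real smoothing implementation would enqueue requests and drain them at the leak rate. Names and defaults here are illustrative:
import time
from collections import defaultdict

class LeakyBucket:
    def __init__(self, capacity=100, leak_per_sec=100 / 60):
        self.capacity = capacity
        self.leak_per_sec = leak_per_sec
        self.level = 0.0                  # current fill level of the bucket
        self.last_leak = time.time()

    def allow(self):
        now = time.time()
        # Drain the bucket at the fixed leak rate since the last check
        self.level = max(0.0, self.level - (now - self.last_leak) * self.leak_per_sec)
        self.last_leak = now
        if self.level + 1 > self.capacity:
            return False                  # bucket would overflow: reject
        self.level += 1
        return True

buckets = defaultdict(LeakyBucket)        # one bucket per user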
Algorithm Comparison
| Algorithm | Accuracy | Memory | Bursts Allowed? | Complexity |
|---|---|---|---|---|
| Fixed Window | Poor (boundary issue) | Tiny | Yes (uncontrolled) | Trivial |
| Sliding Window Log | Exact | High | Yes | Easy |
| Sliding Window Counter | Approximate | Tiny | Yes | Medium |
| Token Bucket | Good | Tiny | Yes (controlled) | Easy |
| Leaky Bucket | Good | Medium (queue) | No (smoothed) | Medium |
For most APIs, token bucket is the right default. Use sliding window counter when you need stricter accuracy. Use leaky bucket when downstream systems need a smooth, paced flow.
Where to Place the Rate Limiter
You have an algorithm. Now where does the code that runs it live?
Middleware is the standard. It can be a dedicated rate-limiting service (a sidecar, an API gateway like Kong or Envoy, or a custom service) sitting between the load balancer and the API servers.
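If the limiter lives in application middleware rather than a standalone gateway, the shape is roughly this. A Flask-style sketch, where the limiter object is a placeholder for any of the algorithms above:
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.before_request
def check_rate_limit():
    # Identify the caller; falling back to the client IP is a common (if imperfect) default
    user_id = request.headers.get("X-API-Key", request.remote_addr)
    allowed, retry_after = limiter.check(user_id)   # `limiter` wraps one of the algorithms above
    if not allowed:
        response = jsonify(error="rate_limit_exceeded")
        response.status_code = 429
        response.headers["Retry-After"] = str(retry_after)
        return response   # returning a response here short-circuits the actual handler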
The Distributed Architecture
Now we put it all together. A typical production setup:
Rate limiter middleware: checks the bucket, allows or rejects.
Redis: token bucket state, shared across all middleware instances.
Config service: limits per user and per endpoint.
The middleware sits in front. It uses Redis as shared state so all instances see the same counts. Config service holds the rules (which user gets which limits, which endpoints are limited differently). When a request arrives, the middleware looks up the user's bucket in Redis, checks if they have tokens, and either forwards the request or returns 429.
The Race Condition Problem
This is where naive implementations break. Imagine two rate limiter instances serving requests for the same user at the exact same moment. Both read the current count (say 99), both compare it against the limit of 100, both decide to allow, and both write back 100.
This is a classic lost-update race condition: the total should be 101, but the stored count is 100, wrong by one. Repeat this thousands of times per second across many servers and the rate limit becomes meaningless.
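To make it concrete, here is what the broken version looks like as code: a naive fixed-window check that does the read, the check, and the write as separate Redis calls. The key layout is illustrative:
import time
import redis

r = redis.Redis()

def is_allowed_naive(user_id, limit=100, window_seconds=60):
    # NOT SAFE: read, check, and write are three separate round-trips,
    # so two servers can interleave between them and both decide to allow.
    key = f"rl:{user_id}:{int(time.time() // window_seconds)}"
    count = int(r.get(key) or 0)                # server A and server B both read 99
    if count >= limit:
        return False
    r.set(key, count + 1, ex=window_seconds)    # both write back 100; one increment is lost
    return True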
The Fix: Atomic Operations with Lua
The whole read-check-increment-write sequence must be atomic. Redis supports this through Lua scripts, which Redis executes as a single indivisible operation. No other client can run anything against Redis while a Lua script is running.
-- token_bucket.lua
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2]) -- tokens per second
local now = tonumber(ARGV[3])
local cost = tonumber(ARGV[4]) or 1
local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now
-- Refill based on elapsed time
local elapsed = now - last_refill
tokens = math.min(capacity, tokens + elapsed * refill_rate)
local allowed = 0
if tokens >= cost then
tokens = tokens - cost
allowed = 1
end
redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, 3600)
return {allowed, tokens}
Then from your application:
import redis
import time
r = redis.Redis()
script = r.register_script(open('token_bucket.lua').read())
def is_allowed(user_id, capacity=100, refill_per_sec=1.66):
result = script(
keys=[f'rl:{user_id}'],
args=[capacity, refill_per_sec, time.time(), 1]
)
allowed, tokens_left = result
return bool(allowed), int(tokens_left)
The entire bucket update happens atomically inside Redis. No race conditions are possible, even with thousands of servers hitting the same key simultaneously.
For simpler fixed-window counting, Redis's built-in INCR command is already atomic and avoids the need for Lua.
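A sketch of that pattern; the key naming is illustrative:
import time
import redis

r = redis.Redis()

def fixed_window_allowed(user_id, limit=100, window_seconds=60):
    key = f"rl:{user_id}:{int(time.time() // window_seconds)}"
    count = r.incr(key)                   # atomic: no read-check-write race
    if count == 1:
        r.expire(key, window_seconds)     # first request in the window sets its expiry
    return count <= limit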
HTTP Response Headers
Telling clients they hit a limit is only half the job. Tell them how long until they can try again, and include how many requests they have left on every response. This lets clients self-throttle gracefully instead of retrying blindly into more 429s.
Standard Headers
X-RateLimit-Limit: the request ceiling for the current window.
X-RateLimit-Remaining: how many requests are left in the window.
X-RateLimit-Reset: Unix timestamp at which the window resets.
Retry-After: how many seconds to wait before retrying (sent with 429 responses).
Example 429 Response
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1714564800
Retry-After: 47
{
"error": "rate_limit_exceeded",
"message": "Try again in 47 seconds.",
"documentation_url": "https://api.example.com/docs/rate-limits"
}
Well-behaved clients (and SDKs that wrap your API) will read these headers and back off automatically.
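A client-side sketch of that behavior, using the requests library; the retry policy and defaults are illustrative:
import time
import requests

def get_with_backoff(url, max_attempts=5):
    for _ in range(max_attempts):
        resp = requests.get(url)
        if resp.status_code != 429:
            return resp
        # Honor the server's hint instead of retrying immediately
        wait = int(resp.headers.get("Retry-After", 1))
        time.sleep(wait)
    return resp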
Production Concerns
Hot Keys
Most users send a few requests. A few users send millions. Their bucket entry in Redis becomes a hot key: every request from that one user routes to the same Redis node, overwhelming it.
Fix: for very high-volume keys, shard the bucket itself. Instead of one bucket for the user, split into 10 buckets and route requests randomly across them. Each bucket holds 1/10th the rate. The aggregate is the same. The load on any single Redis node drops 10x.
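A sketch of that idea, reusing the Lua-backed is_allowed() from earlier; the shard count and key naming are illustrative:
import random

def is_allowed_sharded(user_id, capacity=100, refill_per_sec=1.66, shards=10):
    # Route this request to one of N sub-buckets, each holding 1/N of the rate
    shard_id = f"{user_id}:shard:{random.randrange(shards)}"
    allowed, _ = is_allowed(
        shard_id,
        capacity=capacity / shards,
        refill_per_sec=refill_per_sec / shards,
    )
    return allowed
The trade-off of random routing: one shard can run dry while others still hold tokens, so bursts may be clipped slightly below the aggregate capacity.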
Geographic Latency
If your Redis cluster is in us-east-1 and your user is in Asia, every rate check pays a cross-region round-trip (100+ ms). That blows your latency budget.
Fix: regional Redis clusters. Each region has its own rate limiter state. The trade-off: a user could exceed the global limit by sending traffic across regions. For most APIs, this is acceptable because the regional caps still bound abuse.
Failover: Fail Open or Fail Closed?
What happens if Redis is unreachable? You have two choices:
Fail open: if the rate limiter cannot check, let requests through. Your API stays up. Risk: during an outage, abusers can flood freely.
Fail closed: if the rate limiter cannot check, block all requests. The rate limiter never lets bad traffic through. Risk: a Redis outage causes a full API outage.
Most public APIs (GitHub, Stripe) fail open: availability matters more than perfect rate enforcement during a rare incident. Internal systems or anti-abuse paths often fail closed.
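Whichever you choose, make the decision explicit in code rather than letting an unhandled exception decide for you. A minimal sketch wrapping the Redis-backed check from earlier:
import redis

def check_rate_limit(user_id, fail_open=True):
    try:
        allowed, _ = is_allowed(user_id)         # the Redis-backed check from earlier
        return allowed
    except redis.exceptions.RedisError:
        # Redis unreachable: fail open keeps the API up, fail closed blocks everything
        return fail_open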
Dynamic Rule Updates
You will eventually want to change limits without redeploying. A spammer is hammering one endpoint? Drop their limit to 1 per minute on the fly. A premium customer needs a higher tier? Raise their limit immediately.
Fix: a config service (etcd, Consul, or even Redis itself) holds the limits. The middleware caches them locally with short TTLs (5 to 30 seconds) and re-fetches when expired. Updates propagate quickly without redeploys.
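A sketch of the local cache; fetch_limit_from_config_service is a hypothetical helper standing in for an etcd, Consul, or Redis lookup:
import time

_config_cache = {}        # user_id -> (limit, fetched_at)
CONFIG_TTL = 10           # seconds; short enough that rule changes propagate quickly

def get_limit(user_id, default=100):
    cached = _config_cache.get(user_id)
    if cached and time.time() - cached[1] < CONFIG_TTL:
        return cached[0]
    # Hypothetical helper: fetch the user's limit from the config service
    limit = fetch_limit_from_config_service(user_id, default)
    _config_cache[user_id] = (limit, time.time())
    return limit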
Distributed Redis
One Redis server is a bottleneck and a single point of failure. In production, run Redis in cluster mode (sharded across nodes) with replication (each shard has a replica). Lua scripts work fine in cluster mode as long as all keys touched by the script are in the same hash slot. For per-user buckets this is automatic.
Decision Guide
Default: token bucket backed by Redis with a Lua script.
Need exact counts: sliding window log, and pay the memory cost.
Need tighter accuracy than fixed windows at constant memory: sliding window counter.
Downstream systems cannot absorb bursts: leaky bucket for a smooth, paced flow.
Redis outage on a public API: fail open. Abuse-sensitive or internal path: fail closed.
Users far from your Redis cluster: regional limiter state, accepting slightly looser global limits.
The One Thing to Remember
A rate limiter looks like a simple counter behind an if-statement. Behind that surface, the real engineering is about getting four things right at the same time: pick the right algorithm for your traffic shape, deploy it in middleware so it is centrally managed, share state through Redis with atomic operations to survive concurrency, and surface the limit clearly to clients through HTTP headers so well-behaved consumers self-throttle.
Get this right and you get a system that protects itself from abuse, distributes capacity fairly, and adds barely 1 ms of latency to every request. Get it wrong and you get either an outage from one runaway client or a service that legitimate users find unusable because limits are too aggressive in the wrong places.
Token bucket on top of Redis with a Lua script is the answer 80% of the time. The other 20% is knowing why your case is different.