Why Every API Has Rate Limits

Send too many requests to GitHub, Stripe, or AWS and you will hit a wall. The server returns HTTP 429 Too Many Requests and your script grinds to a halt. This is rate limiting in action, and every public API depends on it.

Rate limiting protects systems from four distinct threats:

Abuse and DDoS: a single attacker (or a botnet) trying to overwhelm the service.
Runaway scripts: a misbehaving client stuck in a loop, hammering an endpoint thousands of times per second by accident.
Cost control: downstream services cost money. Without limits, one user could rack up a $50,000 bill in a few hours.
Fairness: shared infrastructure should distribute capacity evenly, not let one user starve everyone else.

This article walks through how to design a rate limiter that handles all of this, scales across many servers, and adds almost no latency to legitimate requests. We will cover the major algorithms, where to deploy the limiter, how to coordinate state across servers, and the production gotchas nobody warns you about.

What We Are Building

Functional Requirements

Configurable Rules
Limits like "100 requests per minute per user" should be defined in config, not hard-coded.
Reject with 429
Over-limit requests get an HTTP 429 response with helpful headers.
Helpful Headers
Every response includes remaining quota and reset time so clients can self-regulate.
Multiple Limit Types
Per user, per IP, per API key, per endpoint. Sometimes combinations.

Non-Functional Requirements

Very low latency: the rate check itself must add under 3 ms p95. Anything more and you slow every single request through the system.
High availability: if the rate limiter is down, the API behind it might be too. Pick a fail mode (open or closed) and stick with it.
Distributed: multiple servers must share state correctly. A user making 50 requests to server A and 50 to server B must count as 100 total.
Accurate: the count cannot drift. Race conditions across servers must not let users sneak past the limit.

The Algorithms

Five algorithms dominate the rate-limiting space. Each one trades accuracy, memory, and complexity differently. Let us walk through each.

Algorithm 1: Fixed Window Counter

The simplest possible rate limiter. Divide time into fixed windows (one minute, one hour) and count requests in each window. When the count hits the limit, reject. When the window ends, reset.

Fixed Window: A Counter Per Window

  12:00 - 12:01   100 / 100 used   Limit hit. Reject.
  12:01 - 12:02    35 / 100        OK
  12:02 - 12:03    60 / 100        OK

Simple to implement. Easy to reason about. But it has a critical flaw at window boundaries.
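A minimal in-memory sketch (single process; the `counters` state and names are illustrative, not from any particular library):

```python
import time
from collections import defaultdict

# user_id -> {window index -> request count}; illustrative in-memory state
counters = defaultdict(lambda: defaultdict(int))

def is_allowed(user_id, limit=100, window_seconds=60):
    # All requests in the same fixed window share one counter
    window = int(time.time() // window_seconds)
    if counters[user_id][window] >= limit:
        return False
    counters[user_id][window] += 1
    return True
```

Old windows are never cleaned up here; a real implementation would expire stale entries.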

The Boundary Bursting Problem

A user makes 100 requests in the last 10 seconds of one window, then 100 more in the first 10 seconds of the next window. Both windows see exactly 100 requests, both are within the limit. But across that 20-second span, 200 requests went through. The user effectively doubled their rate without violating any window's count.

The Boundary Burst Flaw

  12:00:00   quiet
  12:00:50   100 reqs
  ---------- window boundary ----------
  12:01:10   100 reqs
  12:01:20   quiet

200 requests in 20 seconds. Each window's counter sees only 100. Limit silently violated.

This is not an edge case. It happens at every single window boundary, all the time. For any rate limit that matters, fixed window alone is not enough.

Algorithm 2: Sliding Window Log

Instead of counting per window, store a timestamp for every request the user makes. To check the rate, count how many timestamps fall within the last N seconds.

import time
from collections import defaultdict, deque

user_logs = defaultdict(deque)  # user_id -> timestamps of recent requests

def is_allowed(user_id, limit=100, window_seconds=60):
    now = time.time()
    cutoff = now - window_seconds

    # Drop timestamps that have fallen out of the window
    log = user_logs[user_id]
    while log and log[0] < cutoff:
        log.popleft()

    # Reject if the window is already full
    if len(log) >= limit:
        return False

    log.append(now)
    return True

Pros: exactly accurate. No boundary problem. The "window" really is the last 60 seconds.

Cons: stores a timestamp per request. A user making 100 requests per minute uses 800 bytes of state, every minute. At millions of users, this gets expensive fast.

Use this when accuracy matters more than memory. For most APIs, the next algorithm is a better trade-off.

Algorithm 3: Sliding Window Counter

A clever hybrid. Keep two counters: the current window and the previous window. Compute a weighted estimate based on how far into the current window we are.

import time
from collections import defaultdict

counters = defaultdict(lambda: defaultdict(int))  # user_id -> {window: count}

def estimate_count(user_id, window_seconds=60):
    now = time.time()
    current_window = int(now // window_seconds)
    previous_window = current_window - 1

    fraction_into_current = (now % window_seconds) / window_seconds

    current_count = counters[user_id][current_window]
    previous_count = counters[user_id][previous_window]

    # Weight the previous window by how much of it still overlaps
    # the sliding window ending now
    return (
        previous_count * (1 - fraction_into_current)
        + current_count
    )

The math: if we are 30 seconds into a 60-second window, we count 50% of last window's requests + 100% of this window's. As time progresses through the current window, the previous window's contribution fades.

Pros: almost as accurate as a sliding window log, but with constant memory per user (just two counters).
Cons: approximate, not exact. Assumes uniform distribution within each window, which is not always true.

This is a strong default for most production systems. It eliminates the boundary problem at very low memory cost.

Algorithm 4: Token Bucket (The Industry Standard)

This is what AWS, Stripe, and most modern APIs actually use. It is intuitive and handles bursts gracefully.

Imagine a bucket with a maximum capacity (say 100 tokens). Tokens are added at a steady rate (say 100 per minute). Every API request consumes one token. If the bucket is empty, the request is rejected.

Token Bucket: Tokens Refill Steadily

  Refill (100 tokens / minute)  --drips into-->  Bucket (capacity: 100 tokens)
  Each request consumes 1 token from the bucket.

Two parameters control the behavior:

Capacity determines how big a burst you allow. A capacity of 100 lets a quiet user suddenly fire 100 requests at once.
Refill rate determines the long-term sustained rate. 100 per minute means over any extended period, the user averages 100 per minute.

The clever bit: tokens accumulate during quiet periods. A user who has been silent for 5 minutes can burst 100 immediate requests because the bucket is full. But they cannot exceed the refill rate over the long term.

This naturally handles the boundary problem. Bursts are allowed, but they always come at the cost of accumulated tokens, which then need to refill.

import time

class Bucket:
    def __init__(self, capacity):
        self.tokens = capacity          # start full
        self.last_refill = time.time()

buckets = {}  # user_id -> Bucket

def is_allowed(user_id, capacity=100, refill_per_sec=100 / 60):
    now = time.time()
    bucket = buckets.setdefault(user_id, Bucket(capacity))
    elapsed = now - bucket.last_refill

    # Refill based on elapsed time, capped at capacity
    bucket.tokens = min(
        capacity,
        bucket.tokens + elapsed * refill_per_sec
    )
    bucket.last_refill = now

    if bucket.tokens >= 1:
        bucket.tokens -= 1
        return True
    return False

Algorithm 5: Leaky Bucket

A different mental model: requests fill a queue (the bucket), and the queue drains at a fixed rate (the leak). If the bucket overflows, new requests are rejected.

The key difference from token bucket: leaky bucket smooths the output rate. Even if 100 requests arrive in a burst, the system processes them one at a time at the leak rate. Token bucket allows the burst to pass through.

Token Bucket vs Leaky Bucket

  Token Bucket: allows bursts through. Smooths over time, not in real time.
    Input:  ▮ ▮ ▮ ▮ ▮
    Output: ▮ ▮ ▮ ▮ ▮   (passes through)

  Leaky Bucket: forces a constant output rate. Bursts queue up; excess is rejected.
    Input:  ▮ ▮ ▮ ▮ ▮
    Output: ▮ . ▮ . ▮ . ▮ . ▮   (paced)

Use leaky bucket when downstream systems cannot handle bursts. Use token bucket when bursts are okay (most APIs).
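As a sketch, here is the "meter" variant of leaky bucket, which rejects on overflow rather than queueing (the queue variant instead pairs the bucket with a worker that dequeues at the leak rate); the class name and parameters are illustrative:

```python
import time

class LeakyBucket:
    """Leaky bucket as a meter: the water level drains at a fixed
    leak rate; each request adds 1 unit. Overflow means reject."""

    def __init__(self, capacity=100, leak_per_sec=100 / 60):
        self.capacity = capacity
        self.leak_per_sec = leak_per_sec
        self.level = 0.0
        self.last_leak = time.time()

    def allow(self):
        now = time.time()
        # Drain at the fixed leak rate since the last check
        elapsed = now - self.last_leak
        self.level = max(0.0, self.level - elapsed * self.leak_per_sec)
        self.last_leak = now

        if self.level + 1 > self.capacity:
            return False  # bucket would overflow
        self.level += 1
        return True
```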

Algorithm Comparison

Algorithm               | Accuracy              | Memory         | Bursts Allowed?    | Complexity
Fixed Window            | Poor (boundary issue) | Tiny           | Yes (uncontrolled) | Trivial
Sliding Window Log      | Exact                 | High           | Yes                | Easy
Sliding Window Counter  | Approximate           | Tiny           | Yes                | Medium
Token Bucket            | Good                  | Tiny           | Yes (controlled)   | Easy
Leaky Bucket            | Good                  | Medium (queue) | No (smoothed)      | Medium

For most APIs, token bucket is the right default. Use sliding window counter when you need stricter accuracy. Use leaky bucket when downstream systems need a smooth, paced flow.

Where to Place the Rate Limiter

You have an algorithm. Now where does the code that runs it live?

Three Placement Options

Client (inside the client app): logic lives in the user's browser or app. Cannot be trusted: a malicious client just removes the check. Verdict: bad.
Server (inside each API server): works, but mixes rate-limiting code with business logic, and every team has to remember to apply it. Verdict: okay.
Middleware (dedicated layer in front of the APIs): single responsibility, centrally managed policies, and the limiter can be deployed and scaled independently. Verdict: best.

Middleware is the standard. It can be a dedicated rate-limiting service (a sidecar, an API gateway like Kong or Envoy, or a custom service) sitting between the load balancer and the API servers.

The Distributed Architecture

Now we put it all together. A typical production setup:

Distributed Rate Limiter Architecture

  Client (user / app)
        | requests
  Load Balancer (edge)
        | routes to
  Rate Limiter Middleware (gate: checks bucket, allows or rejects)
        |  reads / writes -> Redis (shared token bucket state)
        |  fetches rules  -> Config Service (limits per user, per endpoint)
        | if allowed
  API Servers (app tier, horizontally scaled)

The middleware sits in front. It uses Redis as shared state so all instances see the same counts. Config service holds the rules (which user gets which limits, which endpoints are limited differently). When a request arrives, the middleware looks up the user's bucket in Redis, checks if they have tokens, and either forwards the request or returns 429.

The Race Condition Problem

This is where naive implementations break. Imagine two rate limiter instances both serving requests for the same user at the exact same moment.

The Race Condition

  Time | Server A             | Server B
  T1   | read count: 3        |
  T2   |                      | read count: 3
  T3   | check: 3 < 100, OK   |
  T4   |                      | check: 3 < 100, OK
  T5   | write count: 4       |
  T6   |                      | write count: 4

Both writes set count to 4. Two requests counted as one. The lost increment lets users sneak past the limit.

This is a classic lost update race condition. Both servers read the same value, both decided to allow, both wrote back. The total count is wrong by one. Repeat this thousands of times per second across many servers and the rate limit becomes meaningless.

The Fix: Atomic Operations with Lua

The whole read-check-increment-write sequence must be atomic. Redis supports this through Lua scripts, which Redis executes as a single indivisible operation. No other client can run anything against Redis while a Lua script is running.

-- token_bucket.lua
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])  -- tokens per second
local now = tonumber(ARGV[3])
local cost = tonumber(ARGV[4]) or 1

local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now

-- Refill based on elapsed time
local elapsed = now - last_refill
tokens = math.min(capacity, tokens + elapsed * refill_rate)

local allowed = 0
if tokens >= cost then
    tokens = tokens - cost
    allowed = 1
end

redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, 3600)

return {allowed, tokens}

Then from your application:

import redis
import time

r = redis.Redis()
script = r.register_script(open('token_bucket.lua').read())

def is_allowed(user_id, capacity=100, refill_per_sec=100 / 60):
    result = script(
        keys=[f'rl:{user_id}'],
        args=[capacity, refill_per_sec, time.time(), 1]
    )
    allowed, tokens_left = result
    return bool(allowed), int(tokens_left)

The entire bucket update happens atomically inside Redis. No race conditions are possible, even with thousands of servers hitting the same key simultaneously.

For simpler fixed-window counting, Redis's built-in INCR command is already atomic and avoids the need for Lua.
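A fixed-window sketch on top of INCR might look like this (the client is passed in; with redis-py, `r = redis.Redis()` provides the `incr` and `expire` methods used here):

```python
import time

def fixed_window_allowed(r, user_id, limit=100, window_seconds=60):
    """Fixed-window check using Redis's atomic INCR.

    `r` is any client exposing incr/expire, e.g. redis.Redis().
    """
    window = int(time.time() // window_seconds)
    key = f"rl:{user_id}:{window}"  # one key per user per window

    count = r.incr(key)  # atomic: no lost updates across servers
    if count == 1:
        # First hit in this window: schedule cleanup of the key
        r.expire(key, window_seconds * 2)
    return count <= limit
```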

HTTP Response Headers

Telling clients they hit a limit is only half the job. Every response should say how many requests remain and when the quota resets, and 429 responses should say how long to wait. This lets clients self-throttle gracefully instead of slamming into 429 errors.

Standard Headers

Header                 | Type           | Meaning
X-RateLimit-Limit      | integer        | Total requests allowed in this window.
X-RateLimit-Remaining  | integer        | Requests left before hitting the limit.
X-RateLimit-Reset      | unix timestamp | When the limit resets / refills.
Retry-After            | seconds        | Sent with 429 responses. How long to wait before retrying.

Example 429 Response

HTTP/1.1 429 Too Many Requests
Content-Type: application/json
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1714564800
Retry-After: 47

{
  "error": "rate_limit_exceeded",
  "message": "Try again in 47 seconds.",
  "documentation_url": "https://api.example.com/docs/rate-limits"
}

Well-behaved clients (and SDKs that wrap your API) will read these headers and back off automatically.
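On the client side, the core of that back-off logic is a few lines; a sketch of the decision helpers (header names from the table above; the function names are hypothetical):

```python
def backoff_delay(headers, attempt):
    """Delay (seconds) before retrying after a 429: honor Retry-After
    if the server sent it, else fall back to capped exponential backoff."""
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        return float(retry_after)
    return min(2 ** attempt, 60)  # cap the exponential fallback

def should_throttle(headers):
    """Proactively slow down when the quota is exhausted,
    before the server starts returning 429s."""
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    return remaining == 0
```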

Production Concerns

Hot Keys

Most users send a few requests. A few users send millions. Their bucket entry in Redis becomes a hot key: every request from that one user routes to the same Redis node, overwhelming it.

Fix: for very high-volume keys, shard the bucket itself. Instead of one bucket for the user, split into 10 buckets and route requests randomly across them. Each bucket holds 1/10th the rate. The aggregate is the same. The load on any single Redis node drops 10x.
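A sketch of the sharding scheme (the shard count and key format are illustrative):

```python
import random

NUM_SHARDS = 10

def shard_key(user_id):
    """Spread one hot user's bucket across NUM_SHARDS Redis keys.
    Random routing keeps each shard near 1/NUM_SHARDS of the traffic."""
    return f"rl:{user_id}:{random.randrange(NUM_SHARDS)}"

def per_shard_limit(total_limit):
    """Each shard enforces its share of the total limit."""
    return total_limit // NUM_SHARDS
```

The aggregate limit holds approximately: a request can be falsely rejected only when its randomly chosen shard fills early, which is rare at the volumes where sharding matters.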

Geographic Latency

If your Redis cluster is in us-east-1 and your user is in Asia, every rate check pays a cross-region round-trip (100+ ms). That blows your latency budget.

Fix: regional Redis clusters. Each region has its own rate limiter state. The trade-off: a user could exceed the global limit by sending traffic across regions. For most APIs, this is acceptable because the regional caps still bound abuse.

Failover: Fail Open or Fail Closed?

What happens if Redis is unreachable? You have two choices:

Fail open: if the rate limiter cannot check, let requests through. Your API stays up. Risk: during an outage, abusers can flood freely.
Fail closed: if the rate limiter cannot check, block all requests. The rate limiter never lets bad traffic through. Risk: a Redis outage causes a full API outage.

Most public APIs (GitHub, Stripe) fail open: availability matters more than perfect rate enforcement during a rare incident. Internal systems or anti-abuse paths often fail closed.
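The choice can be isolated in a small wrapper; a fail-open sketch (real code would catch the client library's exception type, e.g. redis.exceptions.ConnectionError for redis-py, rather than the builtin used here for self-containedness):

```python
def check_with_fail_open(check_fn, user_id):
    """Run the rate check; if the backing store is unreachable,
    fail open and let the request through."""
    try:
        return check_fn(user_id)
    except ConnectionError:
        # Fail open: availability over perfect enforcement.
        # A fail-closed variant would return False here instead.
        return True
```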

Dynamic Rule Updates

You will eventually want to change limits without redeploying. A spammer is hammering one endpoint? Drop their limit to 1 per minute on the fly. A premium customer needs a higher tier? Raise their limit immediately.

Fix: a config service (etcd, Consul, or even Redis itself) holds the limits. The middleware caches them locally with short TTLs (5 to 30 seconds) and re-fetches when expired. Updates propagate quickly without redeploys.
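A minimal local cache with a TTL; `fetch_fn` stands in for whatever call reads the config store (etcd, Consul, or Redis):

```python
import time

class RuleCache:
    """Cache rate-limit rules locally, re-fetching from the
    config store (fetch_fn) once the TTL expires."""

    def __init__(self, fetch_fn, ttl_seconds=10):
        self.fetch_fn = fetch_fn
        self.ttl = ttl_seconds
        self.rules = {}
        self.fetched_at = 0.0

    def get(self, rule_key, default=100):
        if time.time() - self.fetched_at > self.ttl:
            self.rules = self.fetch_fn()  # e.g. read from etcd/Consul
            self.fetched_at = time.time()
        return self.rules.get(rule_key, default)
```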

Distributed Redis

One Redis server is a bottleneck and a single point of failure. In production, run Redis in cluster mode (sharded across nodes) with replication (each shard has a replica). Lua scripts work fine in cluster mode as long as all keys touched by the script are in the same hash slot. For per-user buckets this is automatic.

Decision Guide

Picking an Algorithm

Do you need exact (no approximation) accuracy?
  Yes -> Is memory cheap (small user base or short windows)?
           Yes -> Sliding Window Log: exact accuracy, stores a timestamp per request.
           No  -> Sliding Window Counter: approximate but very accurate, constant memory.
  No  -> Should bursts be allowed to pass through?
           Yes -> Token Bucket: the industry default. Allows bursts up to capacity, paces over time.
           No  -> Leaky Bucket: smooths output to a constant rate. Use when downstream cannot tolerate spikes.

The One Thing to Remember

A rate limiter looks like a simple counter behind an if-statement. Behind that surface, the real engineering is about getting four things right at the same time: pick the right algorithm for your traffic shape, deploy it in middleware so it is centrally managed, share state through Redis with atomic operations to survive concurrency, and surface the limit clearly to clients through HTTP headers so well-behaved consumers self-throttle.

Get this right and you get a system that protects itself from abuse, distributes capacity fairly, and adds only a millisecond or two of latency to every request. Get it wrong and you get either an outage caused by one runaway client or a service that legitimate users find unusable because the limits are too aggressive in the wrong places.

Token bucket on top of Redis with a Lua script is the answer 80% of the time. The other 20% is knowing why your case is different.