Why Every API Has Rate Limits
Send too many requests to GitHub, Stripe, or AWS and you will hit a wall. The server returns HTTP 429 Too Many Requests and your script grinds to a halt. This is rate limiting in action, and every public API depends on it.
Rate limiting protects systems from four distinct threats:
Abuse and DDoS: a single attacker (or a botnet) trying to overwhelm the service.
Runaway scripts: a misbehaving client stuck in a loop, hammering an endpoint thousands of times per second by accident.
Cost control: downstream services cost money. Without limits, one user could rack up a $50,000 bill in a few hours.
Fairness: shared infrastructure should distribute capacity evenly, not let one user starve everyone else.
This article walks through how to design a rate limiter that handles all of this, scales across many servers, and adds almost no latency to legitimate requests. We will cover the major algorithms, where to deploy the limiter, how to coordinate state across servers, and the production gotchas nobody warns you about.
What We Are Building
Functional Requirements
Limit how many requests each client (a user, API key, or IP) can make within a time window.
Reject requests over the limit with HTTP 429 and headers that tell the client when to retry.
Support different limits per user and per endpoint, adjustable without a redeploy.
Non-Functional Requirements
Very low latency: the rate check itself must add under 3 ms p95. Anything more and you slow every single request through the system.
High availability: if the rate limiter is down, the API behind it might be too. Pick a fail mode (open or closed) and stick with it.
Distributed: multiple servers must share state correctly. A user making 50 requests to server A and 50 to server B must count as 100 total.
Accurate: the count cannot drift. Race conditions across servers must not let users sneak past the limit.
The Algorithms
Five algorithms dominate the rate-limiting space. Each one trades accuracy, memory, and complexity differently. Let us walk through each.
Algorithm 1: Fixed Window Counter
The simplest possible rate limiter. Divide time into fixed windows (one minute, one hour) and count requests in each window. When the count hits the limit, reject. When the window ends, reset.
Simple to implement. Easy to reason about. But it has a critical flaw at window boundaries.
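A minimal in-memory sketch, assuming a single process; the window_counts dictionary and the defaults are illustrative:
import time
from collections import defaultdict

# window_counts[(user_id, window_index)] -> requests seen in that window
window_counts = defaultdict(int)

def is_allowed(user_id, limit=100, window_seconds=60):
    window = int(time.time() // window_seconds)   # which fixed window we are in
    key = (user_id, window)
    if window_counts[key] >= limit:
        return False
    window_counts[key] += 1
    return True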
The Boundary Bursting Problem
A user makes 100 requests in the last 10 seconds of one window, then 100 more in the first 10 seconds of the next window. Both windows see exactly 100 requests, both are within the limit. But across that 20-second span, 200 requests went through. The user effectively doubled their rate without violating any window's count.
Window 1: quiet ......... 100 reqs | Window 2: 100 reqs ......... quiet
This is not an edge case. It happens at every single window boundary, all the time. For any rate limit that matters, fixed window alone is not enough.
Algorithm 2: Sliding Window Log
Instead of counting per window, store a timestamp for every request the user makes. To check the rate, count how many timestamps fall within the last N seconds.
import time
from collections import defaultdict, deque

# One deque of request timestamps per user (in-memory, single-process)
user_logs = defaultdict(deque)

def is_allowed(user_id, limit=100, window_seconds=60):
    now = time.time()
    cutoff = now - window_seconds
    log = user_logs[user_id]
    # Drop timestamps that have aged out of the window
    while log and log[0] < cutoff:
        log.popleft()
    # Reject if the last window_seconds already holds `limit` requests
    if len(log) >= limit:
        return False
    log.append(now)
    return True
Pros: exactly accurate. No boundary problem. The "window" really is the last 60 seconds.
Cons: stores a timestamp per request. A user making 100 requests per minute keeps roughly 800 bytes of raw timestamps (100 eight-byte values) at any moment, before any per-entry overhead. At millions of users, this gets expensive fast.
Use this when accuracy matters more than memory. For most APIs, the next algorithm is a better trade-off.
Algorithm 3: Sliding Window Counter
A clever hybrid. Keep two counters: the current window and the previous window. Compute a weighted estimate based on how far into the current window we are.
import time
from collections import defaultdict

# counters[user_id][window_index] -> request count (in-memory, single-process)
counters = defaultdict(lambda: defaultdict(int))

def estimate_count(user_id, window_seconds=60):
    now = time.time()
    current_window = int(now // window_seconds)
    previous_window = current_window - 1
    fraction_into_current = (now % window_seconds) / window_seconds
    current_count = counters[user_id][current_window]
    previous_count = counters[user_id][previous_window]
    # Weight the previous window by how much of it still overlaps the sliding window
    return previous_count * (1 - fraction_into_current) + current_count
The math: if we are 30 seconds into a 60-second window, we count 50% of last window's requests + 100% of this window's. As time progresses through the current window, the previous window's contribution fades.
Pros: almost as accurate as a sliding window log, but with constant memory per user (just two counters).
Cons: approximate, not exact. Assumes uniform distribution within each window, which is not always true.
This is a strong default for most production systems. It eliminates the boundary problem at very low memory cost.
Algorithm 4: Token Bucket (The Industry Standard)
This is what AWS, Stripe, and most modern APIs actually use. It is intuitive and handles bursts gracefully.
Imagine a bucket with a maximum capacity (say 100 tokens). Tokens are added at a steady rate (say 100 per minute). Every API request consumes one token. If the bucket is empty, the request is rejected.
Two parameters control the behavior:
Capacity determines how big a burst you allow. A capacity of 100 lets a quiet user suddenly fire 100 requests at once.
Refill rate determines the long-term sustained rate. 100 per minute means over any extended period, the user averages 100 per minute.
The clever bit: tokens accumulate during quiet periods. A user who has been silent for 5 minutes can burst 100 immediate requests because the bucket is full. But they cannot exceed the refill rate over the long term.
This naturally handles the boundary problem. Bursts are allowed, but they always come at the cost of accumulated tokens, which then need to refill.
import time
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Bucket:
    tokens: float = 100.0                                   # new buckets start full
    last_refill: float = field(default_factory=time.time)

buckets = defaultdict(Bucket)   # one bucket per user (in-memory, single-process)

def is_allowed(user_id, capacity=100, refill_per_sec=1.66):  # 1.66 ≈ 100 tokens / 60 s
    now = time.time()
    bucket = buckets[user_id]
    # Refill based on time elapsed since the last check, capped at capacity
    elapsed = now - bucket.last_refill
    bucket.tokens = min(capacity, bucket.tokens + elapsed * refill_per_sec)
    bucket.last_refill = now
    # Spend one token if available
    if bucket.tokens >= 1:
        bucket.tokens -= 1
        return True
    return False
Algorithm 5: Leaky Bucket
A different mental model: requests fill a queue (the bucket), and the queue drains at a fixed rate (the leak). If the bucket overflows, new requests are rejected.
The key difference from token bucket: leaky bucket smooths the output rate. Even if 100 requests arrive in a burst, the system processes them one at a time at the leak rate. Token bucket allows the burst to pass through.
Use leaky bucket when downstream systems cannot handle bursts. Use token bucket when bursts are okay (most APIs).
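As a rough illustration, here is the "leaky bucket as a meter" variant, which tracks the fill level rather than holding an actual queue; a real smoothing implementation would enqueue requests and drain them at the leak rate. Names and defaults here are illustrative:
import time
from collections import defaultdict

class LeakyBucket:
    def __init__(self, capacity=100, leak_per_sec=100 / 60):
        self.capacity = capacity
        self.leak_per_sec = leak_per_sec
        self.level = 0.0                  # current fill level of the bucket
        self.last_leak = time.time()

    def allow(self):
        now = time.time()
        # Drain the bucket at the fixed leak rate since the last check
        self.level = max(0.0, self.level - (now - self.last_leak) * self.leak_per_sec)
        self.last_leak = now
        if self.level + 1 > self.capacity:
            return False                  # bucket would overflow: reject
        self.level += 1
        return True

buckets = defaultdict(LeakyBucket)        # one bucket per user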
Algorithm Comparison
| Algorithm | Accuracy | Memory | Bursts Allowed? | Complexity |
|---|---|---|---|---|
| Fixed Window | Poor (boundary issue) | Tiny | Yes (uncontrolled) | Trivial |
| Sliding Window Log | Exact | High | Yes | Easy |
| Sliding Window Counter | Approximate | Tiny | Yes | Medium |
| Token Bucket | Good | Tiny | Yes (controlled) | Easy |
| Leaky Bucket | Good | Medium (queue) | No (smoothed) | Medium |
For most APIs, token bucket is the right default. Use sliding window counter when you need stricter accuracy. Use leaky bucket when downstream systems need a smooth, paced flow.
Where to Place the Rate Limiter
You have an algorithm. Now where does the code that runs it live?
Middleware is the standard. It can be a dedicated rate-limiting service (a sidecar, an API gateway like Kong or Envoy, or a custom service) sitting between the load balancer and the API servers.
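If the limiter lives in application middleware rather than a standalone gateway, the shape is roughly this. A Flask-style sketch, where the limiter object is a placeholder for any of the algorithms above:
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.before_request
def check_rate_limit():
    # Identify the caller; falling back to the client IP is a common (if imperfect) default
    user_id = request.headers.get("X-API-Key", request.remote_addr)
    allowed, retry_after = limiter.check(user_id)   # `limiter` wraps one of the algorithms above
    if not allowed:
        response = jsonify(error="rate_limit_exceeded")
        response.status_code = 429
        response.headers["Retry-After"] = str(retry_after)
        return response   # returning a response here short-circuits the actual handler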
The Distributed Architecture
Now we put it all together. A typical production setup:
Rate limiter middleware: checks the bucket, allows or rejects.
Redis: token bucket state, shared across all middleware instances.
Config service: limits per user and per endpoint.
The middleware sits in front. It uses Redis as shared state so all instances see the same counts. Config service holds the rules (which user gets which limits, which endpoints are limited differently). When a request arrives, the middleware looks up the user's bucket in Redis, checks if they have tokens, and either forwards the request or returns 429.
The Race Condition Problem
This is where naive implementations break. Imagine two rate limiter instances serving requests for the same user at the exact same moment. Both read the current count (say 99), both compare it against the limit of 100, both decide to allow, and both write back 100.
This is a classic lost-update race condition: the total should be 101, but the stored count is 100, wrong by one. Repeat this thousands of times per second across many servers and the rate limit becomes meaningless.
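To make it concrete, here is what the broken version looks like as code: a naive fixed-window check that does the read, the check, and the write as separate Redis calls. The key layout is illustrative:
import time
import redis

r = redis.Redis()

def is_allowed_naive(user_id, limit=100, window_seconds=60):
    # NOT SAFE: read, check, and write are three separate round-trips,
    # so two servers can interleave between them and both decide to allow.
    key = f"rl:{user_id}:{int(time.time() // window_seconds)}"
    count = int(r.get(key) or 0)                # server A and server B both read 99
    if count >= limit:
        return False
    r.set(key, count + 1, ex=window_seconds)    # both write back 100; one increment is lost
    return True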
The Fix: Atomic Operations with Lua
The whole read-check-increment-write sequence must be atomic. Redis supports this through Lua scripts, which Redis executes as a single indivisible operation. No other client can run anything against Redis while a Lua script is running.
-- token_bucket.lua
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2]) -- tokens per second
local now = tonumber(ARGV[3])
local cost = tonumber(ARGV[4]) or 1
local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now
-- Refill based on elapsed time
local elapsed = now - last_refill
tokens = math.min(capacity, tokens + elapsed * refill_rate)
local allowed = 0
if tokens >= cost then
tokens = tokens - cost
allowed = 1
end
redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, 3600)
return {allowed, tokens}
Then from your application:
import redis
import time
r = redis.Redis()
script = r.register_script(open('token_bucket.lua').read())
def is_allowed(user_id, capacity=100, refill_per_sec=1.66):
result = script(
keys=[f'rl:{user_id}'],
args=[capacity, refill_per_sec, time.time(), 1]
)
allowed, tokens_left = result
return bool(allowed), int(tokens_left)
The entire bucket update happens atomically inside Redis. No race conditions are possible, even with thousands of servers hitting the same key simultaneously.
For simpler fixed-window counting, Redis's built-in INCR command is already atomic and avoids the need for Lua.
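A sketch of that pattern; the key naming is illustrative:
import time
import redis

r = redis.Redis()

def fixed_window_allowed(user_id, limit=100, window_seconds=60):
    key = f"rl:{user_id}:{int(time.time() // window_seconds)}"
    count = r.incr(key)                   # atomic: no read-check-write race
    if count == 1:
        r.expire(key, window_seconds)     # first request in the window sets its expiry
    return count <= limit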
HTTP Response Headers
Telling clients they hit a limit is only half the job. Tell them how long until they can try again, and include how many requests they have left on every response. This lets clients self-throttle gracefully instead of retrying blindly into more 429s.
Standard Headers
X-RateLimit-Limit: the request ceiling for the current window.
X-RateLimit-Remaining: how many requests are left in the window.
X-RateLimit-Reset: Unix timestamp at which the window resets.
Retry-After: how many seconds to wait before retrying (sent with 429 responses).
Example 429 Response
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1714564800
Retry-After: 47
{
"error": "rate_limit_exceeded",
"message": "Try again in 47 seconds.",
"documentation_url": "https://api.example.com/docs/rate-limits"
}
Well-behaved clients (and SDKs that wrap your API) will read these headers and back off automatically.
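A client-side sketch of that behavior, using the requests library; the retry policy and defaults are illustrative:
import time
import requests

def get_with_backoff(url, max_attempts=5):
    for _ in range(max_attempts):
        resp = requests.get(url)
        if resp.status_code != 429:
            return resp
        # Honor the server's hint instead of retrying immediately
        wait = int(resp.headers.get("Retry-After", 1))
        time.sleep(wait)
    return resp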
Production Concerns
Hot Keys
Most users send a few requests. A few users send millions. Their bucket entry in Redis becomes a hot key: every request from that one user routes to the same Redis node, overwhelming it.
Fix: for very high-volume keys, shard the bucket itself. Instead of one bucket for the user, split into 10 buckets and route requests randomly across them. Each bucket holds 1/10th the rate. The aggregate is the same. The load on any single Redis node drops 10x.
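A sketch of that idea, reusing the Lua-backed is_allowed() from earlier; the shard count and key naming are illustrative:
import random

def is_allowed_sharded(user_id, capacity=100, refill_per_sec=1.66, shards=10):
    # Route this request to one of N sub-buckets, each holding 1/N of the rate
    shard_id = f"{user_id}:shard:{random.randrange(shards)}"
    allowed, _ = is_allowed(
        shard_id,
        capacity=capacity / shards,
        refill_per_sec=refill_per_sec / shards,
    )
    return allowed
The trade-off of random routing: one shard can run dry while others still hold tokens, so bursts may be clipped slightly below the aggregate capacity.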
Geographic Latency
If your Redis cluster is in us-east-1 and your user is in Asia, every rate check pays a cross-region round-trip (100+ ms). That blows your latency budget.
Fix: regional Redis clusters. Each region has its own rate limiter state. The trade-off: a user could exceed the global limit by sending traffic across regions. For most APIs, this is acceptable because the regional caps still bound abuse.
Failover: Fail Open or Fail Closed?
What happens if Redis is unreachable? You have two choices:
Fail open: if the rate limiter cannot check, let requests through. Your API stays up. Risk: during an outage, abusers can flood freely.
Fail closed: if the rate limiter cannot check, block all requests. The rate limiter never lets bad traffic through. Risk: a Redis outage causes a full API outage.
Most public APIs (GitHub, Stripe) fail open: availability matters more than perfect rate enforcement during a rare incident. Internal systems or anti-abuse paths often fail closed.
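Whichever you choose, make the decision explicit in code rather than letting an unhandled exception decide for you. A minimal sketch wrapping the Redis-backed check from earlier:
import redis

def check_rate_limit(user_id, fail_open=True):
    try:
        allowed, _ = is_allowed(user_id)         # the Redis-backed check from earlier
        return allowed
    except redis.exceptions.RedisError:
        # Redis unreachable: fail open keeps the API up, fail closed blocks everything
        return fail_open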
Dynamic Rule Updates
You will eventually want to change limits without redeploying. A spammer is hammering one endpoint? Drop their limit to 1 per minute on the fly. A premium customer needs a higher tier? Raise their limit immediately.
Fix: a config service (etcd, Consul, or even Redis itself) holds the limits. The middleware caches them locally with short TTLs (5 to 30 seconds) and re-fetches when expired. Updates propagate quickly without redeploys.
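A sketch of the local cache; fetch_limit_from_config_service is a hypothetical helper standing in for an etcd, Consul, or Redis lookup:
import time

_config_cache = {}        # user_id -> (limit, fetched_at)
CONFIG_TTL = 10           # seconds; short enough that rule changes propagate quickly

def get_limit(user_id, default=100):
    cached = _config_cache.get(user_id)
    if cached and time.time() - cached[1] < CONFIG_TTL:
        return cached[0]
    # Hypothetical helper: fetch the user's limit from the config service
    limit = fetch_limit_from_config_service(user_id, default)
    _config_cache[user_id] = (limit, time.time())
    return limit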
Distributed Redis
One Redis server is a bottleneck and a single point of failure. In production, run Redis in cluster mode (sharded across nodes) with replication (each shard has a replica). Lua scripts work fine in cluster mode as long as all keys touched by the script are in the same hash slot. For per-user buckets this is automatic.
Decision Guide
Default: token bucket backed by Redis with a Lua script.
Need exact counts: sliding window log, and pay the memory cost.
Need tighter accuracy than fixed windows at constant memory: sliding window counter.
Downstream systems cannot absorb bursts: leaky bucket for a smooth, paced flow.
Redis outage on a public API: fail open. Abuse-sensitive or internal path: fail closed.
Users far from your Redis cluster: regional limiter state, accepting slightly looser global limits.
The One Thing to Remember
A rate limiter looks like a simple counter behind an if-statement. Behind that surface, the real engineering is about getting four things right at the same time: pick the right algorithm for your traffic shape, deploy it in middleware so it is centrally managed, share state through Redis with atomic operations to survive concurrency, and surface the limit clearly to clients through HTTP headers so well-behaved consumers self-throttle.
Get this right and you get a system that protects itself from abuse, distributes capacity fairly, and adds barely 1 ms of latency to every request. Get it wrong and you get either an outage from one runaway client or a service that legitimate users find unusable because limits are too aggressive in the wrong places.
Token bucket on top of Redis with a Lua script is the answer 80% of the time. The other 20% is knowing why your case is different.