You've built an API. It works. Users are happy. Then one morning, your database CPU spikes to 100%, your server starts returning 503s, and your on-call phone won't stop buzzing.
The culprit might be a malicious actor hammering your endpoints. Or an enthusiastic developer running a script that loops 10,000 times without a sleep. Or simply your own growth catching you off guard.
Rate limiting is the mechanism that prevents any single client — intentional or accidental — from consuming a disproportionate share of your resources. It's a foundational piece of API infrastructure that protects:
Your infrastructure from overload and cascading failures
Your other users from noisy-neighbor degradation
Your costs from runaway compute bills
Your API contract from abuse that violates your terms of service
This guide covers the algorithms behind rate limiting, practical implementation patterns, and production-grade code examples across the most common stacks.
The Core Algorithms
Before writing a single line of code, it's worth understanding the four canonical rate limiting algorithms. Each makes different trade-offs between precision, memory usage, and burst tolerance.
1. Fixed Window Counter
The simplest approach. Divide time into fixed windows (e.g., one minute) and count requests per client per window. When the count exceeds the limit, reject the request.
Pros: Trivial to implement; requires only one counter per client per window.
Cons: Vulnerable to the boundary burst problem. A client can make 100 requests at 12:00:59 and another 100 at 12:01:01 — 200 requests in 2 seconds, technically within the rules.
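As a minimal illustration, a fixed window counter can be a map of counters keyed by client and window number (an in-memory sketch only; the Redis-backed versions later in this guide are what you would actually deploy):

const counters = new Map<string, number>();

function fixedWindowAllow(clientId: string, limit: number, windowSeconds: number): boolean {
  // Identify the current window by its index since the epoch
  const window = Math.floor(Date.now() / (windowSeconds * 1000));
  const key = `${clientId}:${window}`;
  const count = (counters.get(key) ?? 0) + 1;
  counters.set(key, count);
  // Note: old window keys are never evicted here; a real store would expire them
  return count <= limit;
}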
2. Sliding Window Log
Instead of counting in fixed buckets, store a timestamped log of each request. When a new request arrives, discard log entries older than the window size and check if the remaining count exceeds the limit.
Limit: 5 requests per minute
Log for Client A: [12:00:10, 12:00:25, 12:00:40, 12:00:55]
New request at 12:01:20:
→ Discard entries older than 12:00:20
→ Remaining log: [12:00:25, 12:00:40, 12:00:55]
→ Count: 3 → ✅ allowed, add 12:01:20 to log
Pros: Precise — no boundary burst problem.
Cons: Memory-intensive. Each client's log can grow to limit entries. At high request volumes, this becomes costly.
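The log itself is straightforward to sketch in memory (hypothetical helper names; a production version would typically keep the log in a Redis sorted set so it survives restarts and is shared across instances):

const requestLogs = new Map<string, number[]>(); // clientId -> request timestamps (ms)

function slidingLogAllow(clientId: string, limit: number, windowSeconds: number): boolean {
  const now = Date.now();
  const cutoff = now - windowSeconds * 1000;
  // Drop entries that have aged out of the rolling window
  const log = (requestLogs.get(clientId) ?? []).filter((t) => t > cutoff);
  if (log.length >= limit) {
    requestLogs.set(clientId, log);
    return false;
  }
  log.push(now);
  requestLogs.set(clientId, log);
  return true;
}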
3. Sliding Window Counter (Hybrid)
A practical middle ground. Uses two fixed-window counters — the current window and the previous — and computes a weighted estimate of requests in the rolling window.
Pros: Low memory (two counters per client), smooth behavior, good approximation of true sliding window.
Cons: It's an approximation — can allow slightly more or fewer requests than the exact limit under edge conditions.
4. Token Bucket
Clients have a "bucket" that fills with tokens at a fixed rate. Each request consumes one token. If the bucket is empty, the request is rejected. The bucket has a maximum capacity, which defines the burst limit.
Fill rate: 10 tokens/second
Capacity: 50 tokens (burst)
Client makes 50 requests instantly → ✅ bucket drains to 0
Client makes request 1 second later → ✅ 10 new tokens available
Client makes 60 requests instantly → ❌ only 10 tokens available
Pros: Naturally handles bursty traffic. Clients that stay under their average rate accumulate credit for occasional spikes.
Cons: Slightly more state to maintain (current token count + last refill timestamp).
The Leaky Bucket is a close cousin: instead of allowing bursts, it smooths traffic by processing requests at a constant outflow rate, queuing excess requests up to a maximum queue depth.
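The only state a token bucket needs is the current token count and the time of the last refill, as this in-memory sketch shows (illustrative names and structure, not production code):

interface Bucket {
  tokens: number;
  lastRefill: number; // ms timestamp of the last refill calculation
}

const buckets = new Map<string, Bucket>();

function tokenBucketAllow(clientId: string, ratePerSecond: number, capacity: number): boolean {
  const now = Date.now();
  const bucket = buckets.get(clientId) ?? { tokens: capacity, lastRefill: now };
  // Refill based on elapsed time, capped at capacity (the burst limit)
  const elapsedSeconds = (now - bucket.lastRefill) / 1000;
  bucket.tokens = Math.min(capacity, bucket.tokens + elapsedSeconds * ratePerSecond);
  bucket.lastRefill = now;
  if (bucket.tokens < 1) {
    buckets.set(clientId, bucket);
    return false; // bucket is empty: reject
  }
  bucket.tokens -= 1; // each request consumes one token
  buckets.set(clientId, bucket);
  return true;
}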
Choosing Your Rate Limit Key
A rate limit is only meaningful relative to an identity. The key you choose determines who is being limited.
Key                | Use Case                        | Granularity
-------------------|---------------------------------|----------------------------------------------------------
IP address         | Public/unauthenticated APIs     | Coarse — shared IPs (NAT, proxies) cause false positives
API key            | Authenticated APIs              | Good — maps directly to a developer/client
User ID            | User-facing APIs                | Fine — limits individual users regardless of IP
Tenant ID          | Multi-tenant SaaS               | Limits entire organizations
Endpoint + API key | Differentiated limits per route | Fine-grained — different limits for /search vs /export
In practice, most APIs use a composite key: api_key + endpoint or user_id + action_type. This lets you set a global limit of 1,000 requests/minute per API key while also capping expensive endpoints like /reports/generate to 10 requests/minute.
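Building the composite key is usually nothing more than string concatenation before the value is handed to whatever limiter you use. A small sketch (the helper name is an assumption; checkRateLimit refers to the function implemented later in this guide):

// Build a rate limit key from the client identity and the route being called
function rateLimitKey(apiKey: string, endpoint: string): string {
  return `${apiKey}:${endpoint}`;
}

// Example: a global per-key budget plus a stricter limit on an expensive route
// await checkRateLimit(apiKey, 1000, 60);
// await checkRateLimit(rateLimitKey(apiKey, "/reports/generate"), 10, 60);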
Implementation: Node.js with Redis
Redis is the industry-standard backing store for rate limiting. Its atomic operations, TTL support, and sub-millisecond latency make it ideal.
Setup
npm install ioredis express
Sliding Window Counter in Redis
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL);

interface RateLimitResult {
  allowed: boolean;
  remaining: number;
  resetAt: number; // Unix timestamp
}

async function checkRateLimit(
  key: string,
  limit: number,
  windowSeconds: number
): Promise<RateLimitResult> {
  const now = Date.now();
  const windowMs = windowSeconds * 1000;
  const currentWindow = Math.floor(now / windowMs);
  const prevWindow = currentWindow - 1;

  const currentKey = `rl:${key}:${currentWindow}`;
  const prevKey = `rl:${key}:${prevWindow}`;

  const [prevCount, currentCount] = await redis.mget(prevKey, currentKey);
  const prev = parseInt(prevCount ?? "0", 10);
  const curr = parseInt(currentCount ?? "0", 10);

  // How far into the current window we are (0.0 → 1.0)
  const elapsed = (now % windowMs) / windowMs;

  // Weighted estimate of requests in the rolling window
  const rate = Math.floor(prev * (1 - elapsed) + curr);

  if (rate >= limit) {
    return {
      allowed: false,
      remaining: 0,
      resetAt: Math.floor(((currentWindow + 1) * windowMs) / 1000),
    };
  }

  // Increment current window counter atomically
  const pipeline = redis.pipeline();
  pipeline.incr(currentKey);
  pipeline.expire(currentKey, windowSeconds * 2); // Keep for 2 windows
  await pipeline.exec();

  return {
    allowed: true,
    remaining: limit - rate - 1,
    resetAt: Math.floor(((currentWindow + 1) * windowMs) / 1000),
  };
}
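Wiring this into Express might look like the following (a sketch under the assumption of a 100-requests-per-minute limit; the header names follow the X-RateLimit-* convention covered later in this guide):

import express from "express";

const app = express();

app.use("/api/", async (req, res, next) => {
  // Prefer the API key as identity; fall back to the client IP
  const clientKey = (req.headers["x-api-key"] as string) ?? req.ip ?? "unknown";
  const result = await checkRateLimit(clientKey, 100, 60);

  res.setHeader("X-RateLimit-Limit", "100");
  res.setHeader("X-RateLimit-Remaining", String(result.remaining));
  res.setHeader("X-RateLimit-Reset", String(result.resetAt));

  if (!result.allowed) {
    res.setHeader("Retry-After", String(result.resetAt - Math.floor(Date.now() / 1000)));
    res.status(429).json({ error: "Too Many Requests" });
    return;
  }
  next();
});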
If you'd rather not maintain this logic yourself, the express-rate-limit package paired with a Redis store provides the same protection as drop-in middleware:

import rateLimit from "express-rate-limit";
import { RedisStore } from "rate-limit-redis";
import Redis from "ioredis";
import express from "express";

const app = express();
const client = new Redis(process.env.REDIS_URL);

const limiter = rateLimit({
  windowMs: 60 * 1000, // 1 minute
  max: 100, // 100 requests per window
  standardHeaders: "draft-7", // Return RateLimit headers
  legacyHeaders: false,
  store: new RedisStore({
    sendCommand: (...args: string[]) => client.call(...args),
  }),
  keyGenerator: (req) => (req.headers["x-api-key"] as string) ?? req.ip,
  handler: (req, res) => {
    res.status(429).json({
      error: "Too Many Requests",
      retryAfter: res.getHeader("RateLimit-Reset"),
    });
  },
});

app.use("/api/", limiter);
Python: slowapi
pip install slowapi redis
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded
from fastapi import FastAPI, Request

limiter = Limiter(key_func=get_remote_address, storage_uri="redis://localhost:6379")
app = FastAPI()
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.get("/api/data")
@limiter.limit("100/minute")
async def get_data(request: Request):
    return {"data": "..."}
The HTTP Response Contract
A well-behaved rate limiter communicates clearly with its clients. The X-RateLimit-* headers below are the widely used de facto convention; the emerging IETF draft (draft-ietf-httpapi-ratelimit-headers) standardizes equivalent unprefixed RateLimit fields.
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1719532800
Retry-After: 47
{
  "error": "Too Many Requests",
  "message": "You have exceeded 100 requests per minute.",
  "retryAfter": 47
}
Header                | Meaning
----------------------|---------------------------------------------------
X-RateLimit-Limit     | Maximum requests allowed in the window
X-RateLimit-Remaining | Requests left in the current window
X-RateLimit-Reset     | Unix timestamp when the window resets
Retry-After           | Seconds (or HTTP date) until the client may retry
Never silently drop requests. Always return 429 Too Many Requests with a Retry-After header. Clients that don't know they're being limited will retry immediately, making the problem worse.
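On the client side, honoring these headers is just as important. A minimal sketch of a fetch wrapper that backs off when it sees a 429 (illustrative only; real clients should cap total wait time and add jitter):

async function fetchWithRetry(url: string, maxRetries = 3): Promise<Response> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const res = await fetch(url);
    if (res.status !== 429) {
      return res;
    }
    // Retry-After may be seconds or an HTTP date; this sketch handles the seconds form
    const header = res.headers.get("Retry-After");
    const seconds = header && Number.isFinite(Number(header)) ? Number(header) : 2 ** attempt;
    await new Promise((resolve) => setTimeout(resolve, seconds * 1000));
  }
  throw new Error(`Rate limited after ${maxRetries} retries: ${url}`);
}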
Advanced Patterns
Tiered Rate Limits
Different customers warrant different limits. Map your API key or plan tier to a limit configuration at lookup time:
const TIER_LIMITS: Record<string, { limit: number; window: number }> = {
  free: { limit: 60, window: 60 },
  pro: { limit: 1000, window: 60 },
  enterprise: { limit: 10000, window: 60 },
};

async function getTierForKey(apiKey: string): Promise<string> {
  // Look up the tier from your database or a fast cache
  return (await redis.hget("api_key_tiers", apiKey)) ?? "free";
}

// In your middleware:
const tier = await getTierForKey(apiKey);
const { limit, window } = TIER_LIMITS[tier];
const result = await checkRateLimit(apiKey, limit, window);
Distributed Rate Limiting with Lua Scripts
For high-throughput scenarios, race conditions in multi-step Redis operations can cause limit enforcement gaps. Use a Lua script to make the check-and-increment atomic:
-- rate_limit.lua
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local count = redis.call("INCR", key)
if count == 1 then
  redis.call("EXPIRE", key, window)
end

local ttl = redis.call("TTL", key)
if count > limit then
  return {0, 0, now + ttl}
end
return {1, limit - count, now + ttl}
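Loading and invoking the script from Node.js might look like this with ioredis (a sketch; defineCommand registers the script and handles EVALSHA caching, and the wrapper names are assumptions):

import { readFileSync } from "node:fs";
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL);

// Register the script once; ioredis exposes it as a command and caches it server-side
redis.defineCommand("rateLimit", {
  numberOfKeys: 1,
  lua: readFileSync("./rate_limit.lua", "utf8"),
});

async function checkRateLimitAtomic(key: string, limit: number, windowSeconds: number) {
  const now = Math.floor(Date.now() / 1000);
  // The script returns [allowed (0/1), remaining, resetAt]
  const [allowed, remaining, resetAt] = (await (redis as any).rateLimit(
    `rl:${key}`,
    limit,
    windowSeconds,
    now
  )) as [number, number, number];
  return { allowed: allowed === 1, remaining, resetAt };
}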
Soft Limits and Graceful Degradation
Hard limits (reject at threshold) aren't always the right choice. Consider:
Soft limits with degraded responses: Return a simplified or cached response instead of a 429 for users slightly over their limit
Queuing: For async-friendly endpoints, queue excess requests and return a 202 Accepted with a job ID rather than rejecting outright
Throttling: Introduce artificial latency before responding to requests that exceed a soft threshold — slows abusers without breaking legitimate clients (see the sketch after this list)
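As one example of the last item, a throttling middleware could delay rather than reject once most of the budget is spent (a sketch with assumed names and thresholds, building on the checkRateLimit helper from earlier):

import type { Request, Response, NextFunction } from "express";

async function softThrottle(req: Request, res: Response, next: NextFunction) {
  const clientKey = (req.headers["x-api-key"] as string) ?? req.ip ?? "unknown";
  const result = await checkRateLimit(clientKey, 100, 60); // hard limit: 100/minute

  if (!result.allowed) {
    res.status(429).json({ error: "Too Many Requests" });
    return;
  }
  if (result.remaining < 20) {
    // In the last stretch of the budget: add artificial latency instead of rejecting
    await new Promise((resolve) => setTimeout(resolve, 1000));
  }
  next();
}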
Where to Put Rate Limiting in Your Stack
Rate limiting can live at multiple layers. The right answer depends on your architecture.
Internet
│
▼
[CDN / WAF] ← IP-based, DDoS-level limiting (Cloudflare, AWS WAF)
│
▼
[API Gateway] ← Key-based limiting per route (Kong, AWS API Gateway, Nginx)
│
▼
[Application Layer] ← Business logic limiting (user-level, tenant-level, quota tracking)
│
▼
[Database Layer] ← Query rate limits as last resort (pgBouncer, connection pooling)
Best practice: apply rate limiting at the API Gateway layer for broad protection and at the application layer for fine-grained business rules. Don't rely solely on the application layer — by the time a request reaches your app, you've already paid the compute cost.
Testing Your Rate Limiter
Don't ship rate limiting logic you haven't stress-tested. At minimum, verify:
// Jest + supertest example
import request from "supertest";
import { app } from "./app"; // wherever your Express app is exported

describe("Rate Limiter", () => {
  it("allows requests under the limit", async () => {
    for (let i = 0; i < 99; i++) {
      const res = await request(app)
        .get("/api/data")
        .set("x-api-key", "test-key-1");
      expect(res.status).toBe(200);
      expect(res.headers["x-ratelimit-remaining"]).toBe(String(99 - i));
    }
  });

  it("rejects requests over the limit", async () => {
    // Exhaust the limit first
    for (let i = 0; i < 100; i++) {
      await request(app).get("/api/data").set("x-api-key", "test-key-2");
    }
    const res = await request(app)
      .get("/api/data")
      .set("x-api-key", "test-key-2");
    expect(res.status).toBe(429);
    expect(res.headers["retry-after"]).toBeDefined();
  });

  it("does not share limits between clients", async () => {
    for (let i = 0; i < 100; i++) {
      await request(app).get("/api/data").set("x-api-key", "heavy-user");
    }
    const res = await request(app)
      .get("/api/data")
      .set("x-api-key", "other-user");
    expect(res.status).toBe(200); // Other user unaffected
  });
});
Also consider load testing with k6 or wrk to verify behavior under realistic concurrent load — race conditions in rate limiters often only appear under concurrency.
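A minimal k6 script for that kind of concurrency check might look like the following (k6 scripts are plain JavaScript; the URL, API key, and expected statuses here are placeholders for your own setup):

// load_test.js (run with: k6 run load_test.js)
import http from "k6/http";
import { check } from "k6";

export const options = {
  vus: 50, // 50 concurrent virtual users
  duration: "30s",
};

export default function () {
  const res = http.get("https://api.example.com/api/data", {
    headers: { "x-api-key": "load-test-key" },
  });
  // Under sustained load we expect a mix of 200s and 429s, never 5xx
  check(res, {
    "status is 200 or 429": (r) => r.status === 200 || r.status === 429,
    "429s include Retry-After": (r) => r.status !== 429 || r.headers["Retry-After"] !== undefined,
  });
}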
Common Mistakes to Avoid
Limiting by IP alone. IPv6, NAT, and shared office networks mean one IP can represent thousands of legitimate users — or one user can cycle through thousands of IPs.
No headers on successful responses. Clients should be able to track their consumption proactively, not just discover limits when they hit them.
Forgetting OPTIONS requests. CORS preflight requests count as requests. Don't accidentally rate-limit your own frontend's preflight checks.
Storing rate limit state in memory. In-process memory doesn't survive restarts and doesn't work across horizontally scaled instances. Always use a shared store (Redis, Memcached).
Setting limits too low for legitimate use cases. Analyze your actual usage patterns before choosing a number. A mobile app that polls for notifications every 30 seconds will make 2 requests/minute — an SDK that fetches data on page load might burst to 20 in 500ms. Set limits that protect without punishing normal behavior.
Conclusion
Rate limiting is one of those infrastructure concerns that's easy to defer and expensive to retrofit. The core algorithm — count requests, compare to a threshold, reject or allow — is simple. But the details matter: which algorithm fits your traffic shape, which key identifies your clients, how you communicate limits to callers, and where in your stack enforcement lives.
Start with a sliding window counter backed by Redis, set conservative limits informed by your actual traffic data, and expose clear headers so your clients can adapt. Layer in tiered limits and per-endpoint differentiation as your API matures.
Your future self — paged at 3am while a runaway script hammers production — will be grateful you did it right the first time.