What are common rate limiting strategies in system design?
Rate limiting controls how many requests a client can make within a time period. Common strategies include fixed window, sliding window, token bucket, and leaky bucket.
The Short Answer
Rate limiting controls how often a client can call an API or perform an action within a time period.
It protects systems from overload, abuse, scraping, brute-force attempts, runaway clients, and unfair resource usage.
The Real Problem It Solves
Imagine one client suddenly sending thousands of requests per second to your API.
Without rate limiting, that single client can consume database connections, CPU, memory, network bandwidth, downstream service capacity, and cache resources that should be shared by everyone.
Without Rate Limiting
With Rate Limiting
Where Rate Limiting Usually Lives
Rate limiting can be enforced at different layers depending on what you are protecting.
API Gateway
Load Balancer / Edge
Application Service
Strategy 1: Fixed Window Counter
Fixed window is the simplest strategy. Divide time into windows and count requests in the current window.
Limit: 100 requests per minute
12:00:00 - 12:00:59 → allow up to 100
12:01:00 - 12:01:59 → counter resets
The weakness is the boundary problem. A client may send 100 requests at the end of one minute and another 100 at the start of the next, effectively creating a burst of 200 requests in a very short time.
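The fixed-window counter described here can be sketched in a few lines of Python. This is a minimal in-memory illustration, not a production limiter: the class name `FixedWindowLimiter`, its parameters, and the choice to pass `now` in explicitly (so the behavior is easy to reproduce) are all assumptions made for the example.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """Allow at most `limit` requests per client in each fixed window."""

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        # (client_id, window_index) -> request count. Old windows are never
        # evicted in this sketch; a real limiter would need to expire them.
        self.counts = defaultdict(int)

    def allow(self, client_id: str, now: float) -> bool:
        # All timestamps in the same window share one window index.
        window_index = int(now // self.window_seconds)
        key = (client_id, window_index)
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True


limiter = FixedWindowLimiter(limit=100, window_seconds=60)
end_of_window = sum(limiter.allow("client-a", now=59.0) for _ in range(100))
start_of_next = sum(limiter.allow("client-a", now=60.0) for _ in range(100))
print(end_of_window + start_of_next)  # 200 requests allowed about one second apart
```

The last three lines reproduce the boundary problem: both batches fall in different windows, so every request passes even though they arrive almost back to back.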
Strategy 2: Sliding Window
Sliding window tries to smooth out the fixed-window boundary problem by looking at a rolling time range instead of a hard calendar boundary.
Limit: 100 requests per 60 seconds
At 12:01:20:
count requests from 12:00:20 to 12:01:20
This is more accurate and fair than a fixed window, but usually costs more memory or computation depending on implementation.
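One way to implement a rolling window is to keep a log of recent request timestamps per client. The sketch below assumes that approach; the name `SlidingWindowLog` and the explicit `now` argument are illustrative choices. Other variants exist that store less data at the cost of some precision.

```python
from collections import defaultdict, deque


class SlidingWindowLog:
    """Allow at most `limit` requests per client in any rolling window."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window_seconds = window_seconds
        # client_id -> timestamps of allowed requests, oldest first.
        self.log = defaultdict(deque)

    def allow(self, client_id: str, now: float) -> bool:
        timestamps = self.log[client_id]
        cutoff = now - self.window_seconds
        # Drop timestamps that have fallen out of the rolling window.
        while timestamps and timestamps[0] <= cutoff:
            timestamps.popleft()
        if len(timestamps) >= self.limit:
            return False
        timestamps.append(now)
        return True


limiter = SlidingWindowLog(limit=100, window_seconds=60)
first_batch = sum(limiter.allow("client-a", now=59.0) for _ in range(100))
second_batch = sum(limiter.allow("client-a", now=60.0) for _ in range(100))
print(first_batch, second_batch)  # 100 0
```

Unlike a fixed window, the second batch is rejected because the first 100 requests are still inside the last 60 seconds. The cost is visible in the data structure: one stored timestamp per allowed request.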
Strategy 3: Token Bucket
Token bucket is one of the most useful mental models.
Imagine each client has a bucket. Tokens are added at a steady rate. Each request needs one token. If the bucket has a token, the request is allowed. If not, the request is rejected or delayed.
If the bucket is empty, the request cannot go through immediately.
Token bucket is good when you want to allow normal traffic plus some reasonable bursts.
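The bucket model translates fairly directly into code. The following is a minimal single-client sketch under a few assumptions: tokens are refilled lazily when a request arrives rather than by a background timer, the bucket starts full, and the names `TokenBucket`, `capacity`, and `refill_rate` are chosen for illustration.

```python
class TokenBucket:
    """Token bucket for one client: `capacity` bounds the burst size,
    `refill_rate` (tokens per second) sets the long-term average rate."""

    def __init__(self, capacity: float, refill_rate: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity  # assumption: the bucket starts full
        self.last_refill = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Lazy refill: add the tokens earned since the last call, capped
        # at the bucket capacity.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = max(self.last_refill, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(capacity=10, refill_rate=1.0, now=0.0)
burst = sum(bucket.allow(now=0.0) for _ in range(15))
later = sum(bucket.allow(now=3.0) for _ in range(5))
print(burst, later)  # 10 3
```

A burst of 10 passes immediately because the bucket is full. Three seconds later only 3 tokens have been refilled, so only 3 of the next 5 requests pass.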
Strategy 4: Leaky Bucket
Leaky bucket smooths traffic by processing requests at a steady outflow rate.
Think of incoming requests entering a queue. Requests leave the queue at a controlled fixed rate. If the queue is full, extra requests are dropped or rejected.
Leaky bucket is useful when downstream systems need a smooth, predictable request rate.
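The queue-based description above can be simulated as follows. This is a sketch under stated assumptions: the class `LeakyBucketQueue` and its `offer` and `drain` methods are hypothetical names, time is passed in explicitly, and a caller is expected to invoke `drain` regularly to forward the released requests downstream.

```python
from collections import deque


class LeakyBucketQueue:
    """Bounded queue whose requests leave at a fixed rate."""

    def __init__(self, capacity: int, leak_rate: float, now: float = 0.0):
        self.capacity = capacity
        self.leak_interval = 1.0 / leak_rate  # seconds between departures
        self.queue = deque()
        self.next_leak = now  # earliest time the next request may leave

    def offer(self, request, now: float) -> bool:
        """Enqueue a request; return False if the queue is full."""
        if not self.queue:
            # Idle time earns no credit: an empty queue cannot save up
            # departures and release them later as a burst.
            self.next_leak = max(self.next_leak, now)
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(request)
        return True

    def drain(self, now: float) -> list:
        """Return the requests due to leave by `now`, in FIFO order."""
        released = []
        while self.queue and self.next_leak <= now:
            released.append(self.queue.popleft())
            self.next_leak += self.leak_interval
        return released


bucket = LeakyBucketQueue(capacity=3, leak_rate=2.0)  # 2 requests per second
accepted = [bucket.offer(name, now=0.0) for name in ("r1", "r2", "r3", "r4")]
print(accepted)           # [True, True, True, False]
print(bucket.drain(0.0))  # ['r1']
print(bucket.drain(1.0))  # ['r2', 'r3']
```

The fourth request is dropped because the queue is full, and the accepted ones leave half a second apart regardless of how quickly they arrived.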
Choosing the Right Strategy
| Strategy | Best For | Tradeoff |
|---|---|---|
| Fixed Window | Simple limits | Boundary bursts |
| Sliding Window | Fairer rolling limits | More memory or computation |
| Token Bucket | Allowing controlled bursts | Needs refill logic |
| Leaky Bucket | Smoothing downstream traffic | Can queue old requests |
Distributed Rate Limiting
Rate limiting becomes harder when your service runs on many servers.
If each server keeps its own local counter, one user may exceed the real global limit by spreading requests across servers.
Local Counters Only
Shared Counter
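The gap between local counters and a shared counter can be shown with a small simulation. Everything here is illustrative: `SharedStore` is an in-memory stand-in for a shared store such as Redis, the limiters cover a single window with no expiry, and requests are spread round-robin across three hypothetical servers.

```python
import itertools


class LocalLimiter:
    """Each server counts only the requests it sees itself."""

    def __init__(self, limit: int):
        self.limit = limit
        self.counts = {}

    def allow(self, client_id: str) -> bool:
        count = self.counts.get(client_id, 0)
        if count >= self.limit:
            return False
        self.counts[client_id] = count + 1
        return True


class SharedStore:
    """In-memory stand-in for a shared counter store (assumption)."""

    def __init__(self):
        self.counts = {}

    def increment(self, key: str) -> int:
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]


class SharedLimiter:
    """Every server increments the same counter in the shared store."""

    def __init__(self, limit: int, store: SharedStore):
        self.limit = limit
        self.store = store

    def allow(self, client_id: str) -> bool:
        # Increment first, then compare, so the check and the update are
        # a single store operation rather than a read followed by a write.
        return self.store.increment(client_id) <= self.limit


def count_allowed(servers, client_id: str, requests: int) -> int:
    """Send `requests` calls round-robin across `servers`; count passes."""
    rotation = itertools.cycle(servers)
    return sum(next(rotation).allow(client_id) for _ in range(requests))


local_servers = [LocalLimiter(limit=100) for _ in range(3)]
store = SharedStore()
shared_servers = [SharedLimiter(limit=100, store=store) for _ in range(3)]

print(count_allowed(local_servers, "client-a", 500))   # 300
print(count_allowed(shared_servers, "client-a", 500))  # 100
```

With local counters the client gets three times the intended limit, one full quota per server. With a shared counter the global limit holds, at the price of a store round trip on every request.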
What Should Happen When the Limit Is Exceeded?
A rate limiter should not just silently fail. The system should communicate clearly.
HTTP/1.1 429 Too Many Requests
Retry-After: 30
Status code 429 tells the client it made too many requests. A Retry-After header can tell the client when to try again.
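A small helper can build such a response. The sketch below is framework-agnostic and returns a plain `(status, headers, body)` tuple; the function name `rate_limited_response` and the tuple shape are assumptions for illustration, and a real service would adapt this to its web framework's response type.

```python
import math
from http import HTTPStatus


def rate_limited_response(retry_after_seconds: float):
    """Build a 429 response as a (status, headers, body) tuple."""
    status = HTTPStatus.TOO_MANY_REQUESTS
    headers = {
        # Retry-After is sent here as a whole number of seconds, so round
        # up to avoid telling the client to retry slightly too early.
        "Retry-After": str(max(0, math.ceil(retry_after_seconds))),
        "Content-Type": "text/plain; charset=utf-8",
    }
    body = f"{status.value} {status.phrase}\n"
    return status.value, headers, body


status, headers, body = rate_limited_response(29.2)
print(status, headers["Retry-After"])  # 429 30
```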
Common Technologies and Libraries
In real systems, rate limiting is often implemented using infrastructure components or specialized libraries rather than handwritten logic from scratch.
Redis
NGINX
API Gateways
Bucket4j
Resilience4j
Guava RateLimiter
The Interview-Friendly Explanation
Common Interview Follow-Ups
Where should rate limiting be implemented?
Common places are API gateways, edge/load balancer layers, and application services. Gateways are good for broad API limits, while application services are better for business-specific limits.
What is the problem with fixed window rate limiting?
It is simple, but it has a boundary problem. A client can send many requests at the end of one window and many more at the beginning of the next.
Why is token bucket popular?
It allows controlled bursts while still enforcing a long-term average rate. This matches many real API traffic patterns.
Why is distributed rate limiting harder?
With multiple servers, local counters can disagree. To enforce a global limit, servers usually need a shared store like Redis or centralized enforcement at a gateway.
What HTTP response should be returned when a client is rate limited?
Usually HTTP 429 Too Many Requests, often with a Retry-After header.