System Design

What are common rate limiting strategies in system design?

Rate limiting controls how many requests a client can make within a time period. Common strategies include fixed window, sliding window, token bucket, and leaky bucket.

Tags: System Design, Rate Limiting, APIs, Scalability, Distributed Systems

The Short Answer

Rate limiting controls how often a client can call an API or perform an action within a time period.

It protects systems from overload, abuse, scraping, brute-force attempts, runaway clients, and unfair resource usage.

The key idea: a rate limiter is a gatekeeper. It decides whether a request should be allowed now, delayed, or rejected.

The Real Problem It Solves

Imagine one client suddenly sending thousands of requests per second to your API.

Without rate limiting, that single client can consume database connections, CPU, memory, network bandwidth, downstream service capacity, and cache resources that should be shared by everyone.

Without Rate Limiting

Aggressive client → Floods API → Backend overload

With Rate Limiting

Client request → Rate limiter checks quota → Allow or reject

Where Rate Limiting Usually Lives

Rate limiting can be enforced at different layers depending on what you are protecting.

API Gateway

Good for global API limits before traffic reaches services.

Load Balancer / Edge

Useful for coarse limits, bot protection, and DDoS-style defense.

Application Service

Useful for business-specific limits like login attempts or per-user actions.

A serious system may use more than one layer: coarse limits at the edge, and fine-grained business limits inside the application.

Strategy 1: Fixed Window Counter

Fixed window is the simplest strategy. Divide time into windows and count requests in the current window.

Limit: 100 requests per minute

12:00:00 - 12:00:59 → allow up to 100
12:01:00 - 12:01:59 → counter resets

Window 1: 100 requests → Window 2: counter resets

The weakness is the boundary problem. A client may send 100 requests at the end of one minute and another 100 at the start of the next, so 200 requests get through within a few seconds, twice the intended per-minute limit.
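The boundary problem can be reproduced with a minimal in-memory sketch. The class name `FixedWindowLimiter`, the `allow` method, and the explicit `now` argument (seconds, passed in so the example is deterministic) are illustrative assumptions, not part of any particular library.

```python
class FixedWindowLimiter:
    """Per-client counter that resets at fixed, clock-aligned window boundaries."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self._state = {}  # client_id -> (window_index, count)

    def allow(self, client_id, now):
        window = int(now // self.window_seconds)
        stored_window, count = self._state.get(client_id, (window, 0))
        if stored_window != window:
            count = 0  # a new window has started, so the counter resets
        if count >= self.limit:
            return False
        self._state[client_id] = (window, count + 1)
        return True


fixed = FixedWindowLimiter(limit=100, window_seconds=60)

# 100 requests just before the boundary, then 100 just after it.
late = sum(fixed.allow("client-a", now=59.5) for _ in range(100))
early = sum(fixed.allow("client-a", now=60.5) for _ in range(100))
print(late + early)  # 200 accepted within about one second
```

Both bursts are accepted in full because each one falls into a different window, even though they are only one second apart.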

Strategy 2: Sliding Window

Sliding window tries to smooth out the fixed-window boundary problem by looking at a rolling time range instead of a hard calendar boundary.

Limit: 100 requests per 60 seconds

At 12:01:20:
count requests from 12:00:20 to 12:01:20

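This is more accurate and fairer than a fixed window, but it costs more memory or computation. A sliding log stores one timestamp per request, while a sliding counter keeps only per-window counts and estimates the rolling total from them.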

Sliding window is often a strong default when accuracy matters more than absolute simplicity.
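One way to implement a rolling range is a sliding window log, sketched below under the assumption that everything fits in one process. `SlidingWindowLog` and its `allow` method are hypothetical names, and `now` is passed in explicitly so the behavior is easy to follow.

```python
from collections import deque


class SlidingWindowLog:
    """Keeps one timestamp per accepted request and counts those in the rolling window."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self._log = {}  # client_id -> deque of accepted request timestamps

    def allow(self, client_id, now):
        timestamps = self._log.setdefault(client_id, deque())
        # Evict timestamps that have fallen out of the rolling window.
        while timestamps and timestamps[0] <= now - self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.limit:
            return False
        timestamps.append(now)
        return True


sliding = SlidingWindowLog(limit=100, window_seconds=60)

# The same boundary burst that a fixed window would accept twice.
first_burst = sum(sliding.allow("client-a", now=59.5) for _ in range(100))
second_burst = sum(sliding.allow("client-a", now=60.5) for _ in range(100))
print(first_burst, second_burst)  # 100 0
```

The memory cost is visible here: the log holds up to `limit` timestamps per client, which is the price of the extra accuracy.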

Strategy 3: Token Bucket

Token bucket is one of the most useful mental models.

Imagine each client has a bucket with a fixed capacity. Tokens are added at a steady rate until the bucket is full. Each request needs one token. If the bucket has a token, the request is allowed. If not, the request is rejected or delayed. The capacity caps how large a burst can be.

Tokens refill → Request arrives → Spend token → Allow request

If the bucket is empty, the request cannot go through immediately.

Token bucket is good when you want to allow normal traffic plus some reasonable bursts.
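The refill-and-spend logic can be sketched for a single client as follows. The `TokenBucket` class, its `capacity` and `refill_rate` parameters, and the explicit `now` argument are assumptions made for illustration. Production libraries such as Bucket4j expose their own APIs.

```python
class TokenBucket:
    """Single-client token bucket that refills lazily whenever a request arrives."""

    def __init__(self, capacity, refill_rate, now=0.0):
        self.capacity = capacity        # maximum tokens, which caps the burst size
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = float(capacity)   # assumption: the bucket starts full
        self.last_refill = now

    def allow(self, now):
        # Add the tokens earned since the last call, never exceeding capacity.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = max(self.last_refill, now)
        if self.tokens >= 1:
            self.tokens -= 1  # spend one token for this request
            return True
        return False


bucket = TokenBucket(capacity=5, refill_rate=1.0)

burst = [bucket.allow(now=0.0) for _ in range(7)]
print(burst.count(True))  # 5: the burst is capped by the bucket capacity

later = [bucket.allow(now=2.0) for _ in range(3)]
print(later)  # [True, True, False]: two tokens were refilled in two seconds
```

Refilling lazily on each call avoids a background timer. A real limiter would keep one bucket per client key.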

Strategy 4: Leaky Bucket

Leaky bucket smooths traffic by processing requests at a steady outflow rate.

Think of incoming requests entering a queue. Requests leave the queue at a controlled fixed rate. If the queue is full, extra requests are dropped or rejected.

Incoming requests → Queue bucket → Fixed outflow → Stable processing

Leaky bucket is useful when downstream systems need a smooth, predictable request rate.
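The queue-based description above can be sketched as a bounded queue plus a `leak` step. This is a simplified model: `LeakyBucket`, `offer`, and `leak` are hypothetical names, and the sketch assumes some scheduler or worker calls `leak` once per fixed interval, which is what sets the outflow rate.

```python
from collections import deque


class LeakyBucket:
    """Bounded FIFO queue drained one request at a time by a fixed-interval worker."""

    def __init__(self, capacity):
        self.capacity = capacity  # maximum number of queued requests
        self.queue = deque()

    def offer(self, request):
        """Enqueue a request, or return False if the bucket is full."""
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(request)
        return True

    def leak(self):
        """Release at most one request. Assumed to be called once per interval."""
        return self.queue.popleft() if self.queue else None


leaky = LeakyBucket(capacity=3)

# A burst of five requests arrives at once: three are queued, two are dropped.
accepted = [leaky.offer(f"req-{i}") for i in range(5)]
print(accepted)  # [True, True, True, False, False]

# Four ticks of the worker: requests leave in arrival order, one per tick.
released = [leaky.leak() for _ in range(4)]
print(released)  # ['req-0', 'req-1', 'req-2', None]
```

However bursty the arrivals are, downstream sees at most one request per tick.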

Choosing the Right Strategy

| Strategy | Best For | Tradeoff |
| --- | --- | --- |
| Fixed Window | Simple limits | Boundary bursts |
| Sliding Window | Fairer rolling limits | More memory or computation |
| Token Bucket | Allowing controlled bursts | Needs refill logic |
| Leaky Bucket | Smoothing downstream traffic | Can queue old requests |

Distributed Rate Limiting

Rate limiting becomes harder when your service runs on many servers.

If each server keeps its own local counter, one user may exceed the real global limit by spreading requests across servers.

Local Counters Only

User hits Server A, B, and C. Each server thinks the user is still under the limit.

Shared Counter

Servers coordinate through Redis or another shared store to enforce a global limit.

In distributed systems, rate limiting often needs a shared store like Redis, or enforcement at a centralized gateway.
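The difference between local and shared counters can be simulated in a single process. In this sketch `CounterStore` is an in-memory stand-in for a shared store such as Redis, and `Server`, `handle`, and `accepted_count` are hypothetical names. A real deployment would need an atomic increment (for example the Redis `INCR` command) and key expiry, which are left out here.

```python
class CounterStore:
    """In-memory stand-in for a counter store (an assumption, not a Redis client)."""

    def __init__(self):
        self._counts = {}

    def increment(self, key):
        self._counts[key] = self._counts.get(key, 0) + 1
        return self._counts[key]


class Server:
    """Application server that checks a per-user, per-window counter."""

    def __init__(self, store, limit):
        self.store = store
        self.limit = limit

    def handle(self, user_id, window):
        return self.store.increment(f"{user_id}:{window}") <= self.limit


def accepted_count(servers, requests=30):
    # Spread one user's requests round-robin across servers within one window.
    return sum(
        servers[i % len(servers)].handle("user-1", window=0)
        for i in range(requests)
    )


LIMIT = 10

# Local counters only: every server has its own private store.
local_servers = [Server(CounterStore(), LIMIT) for _ in range(3)]

# Shared counter: all servers increment the same store.
shared_store = CounterStore()
shared_servers = [Server(shared_store, LIMIT) for _ in range(3)]

local_accepted = accepted_count(local_servers)
shared_accepted = accepted_count(shared_servers)
print(local_accepted)   # 30: each server allows its own 10
print(shared_accepted)  # 10: the global limit holds
```

With local counters the effective limit scales with the number of servers, which is exactly the failure mode described above.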

What Should Happen When the Limit Is Exceeded?

A rate limiter should not just silently fail. The system should communicate clearly.

HTTP/1.1 429 Too Many Requests
Retry-After: 30

Status code 429 tells the client it has sent too many requests. A Retry-After header tells the client how long to wait before trying again, either as a number of seconds (30 here) or as an HTTP date.
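As a framework-neutral sketch, the rejection could be built like this. The helper name `rate_limited_response` and its `retry_after_seconds` parameter are hypothetical. Only `http.HTTPStatus` and `math.ceil` come from the Python standard library.

```python
import math
from http import HTTPStatus


def rate_limited_response(retry_after_seconds):
    """Build the status line and headers for a rate-limited request."""
    status = HTTPStatus.TOO_MANY_REQUESTS  # 429
    status_line = f"HTTP/1.1 {status.value} {status.phrase}"
    # Round up so the client never retries slightly too early.
    headers = {"Retry-After": str(math.ceil(retry_after_seconds))}
    return status_line, headers


status_line, headers = rate_limited_response(retry_after_seconds=29.2)
print(status_line)  # HTTP/1.1 429 Too Many Requests
print(headers)      # {'Retry-After': '30'}
```

In practice the wait time would come from the limiter itself, for example the time until the current window resets or until the next token is refilled.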

Common Technologies and Libraries

In real systems, rate limiting is often implemented using infrastructure components or specialized libraries rather than handwritten logic from scratch.

Redis

Very common for distributed rate limiting because multiple servers can share counters and expiration windows.

NGINX

Can enforce request-per-second or connection limits at the edge before traffic reaches the application.

API Gateways

Kong, Envoy, Spring Cloud Gateway, and AWS API Gateway all support rate limiting policies.

Bucket4j

Popular Java library implementing token bucket algorithms with support for local and distributed limits.

Resilience4j

Java resilience library that includes rate limiting alongside retries, circuit breakers, and bulkheads.

Guava RateLimiter

Lightweight in-process Java rate limiter from Google Guava, useful for controlling local throughput.

In production systems, Redis-backed distributed rate limiting is extremely common because it allows multiple application servers to coordinate shared request limits.

The Interview-Friendly Explanation

Rate limiting protects a system by controlling how many requests a client can make within a time period. Fixed window is simple but can allow boundary bursts. Sliding window is fairer but more expensive. Token bucket allows controlled bursts. Leaky bucket smooths traffic at a steady rate. In distributed systems, counters usually need to be shared through something like Redis or enforced at an API gateway.

Common Interview Follow-Ups

Where should rate limiting be implemented?

Common places are API gateways, edge/load balancer layers, and application services. Gateways are good for broad API limits, while application services are better for business-specific limits.

What is the problem with fixed window rate limiting?

It is simple, but it has a boundary problem. A client can send many requests at the end of one window and many more at the beginning of the next.

Why is token bucket popular?

It allows controlled bursts while still enforcing a long-term average rate. This matches many real API traffic patterns.

Why is distributed rate limiting harder?

With multiple servers, local counters can disagree. To enforce a global limit, servers usually need a shared store like Redis or centralized enforcement at a gateway.

What HTTP response should be returned when a client is rate limited?

Usually HTTP 429 Too Many Requests, often with a Retry-After header.

Final Takeaway

Rate limiting is not only about rejecting traffic. It is about protecting system stability, preserving fairness, shaping bursts, and deciding how your system behaves under pressure.