How to Design Rate Limiting That Actually Works

Every so often I step back from writing code to think about the bigger picture.

Modern systems handle thousands, sometimes millions, of requests per minute. Some are legitimate: real users doing real things. Others are malicious: probing for vulnerabilities, scraping data, or simply trying to exhaust your resources.

Rate limiting is one of the most effective defenses against these threats. In plain terms: it defines the maximum number of requests a user can make to a system within a given time window.

But there's a critical question worth asking first.

Does One Approach Fit Every System?

No.

Every system has different dynamics and internal constraints. Some allow thousands of requests per day; others need to throttle within minutes. Even different microservices inside the same system may require different strategies.

There is no single "best" rate limiting strategy. The right one depends on what your system needs to protect, and how it needs to behave under load.

Here are the four foundational algorithms.

1. Token Bucket

The most common approach — and the easiest to reason about.

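The idea: every user gets a bucket that holds a fixed maximum number of "tokens". Each request consumes one token, and tokens are added back at a steady rate until the bucket is full again. If a user runs out of tokens, requests are rejected until new tokens arrive.

Example: "Tuncer can make 20 requests per minute." As a token bucket, that is a capacity of 20 tokens with one token added back every 3 seconds.

How it works:

  • Each user gets a token bucket with a fixed capacity
  • Each request deducts one token
  • Tokens are added back at a fixed rate, never exceeding the capacity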
  • No tokens → 429 Too Many Requests
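
The steps above can be sketched in a few lines of Python. This is a minimal, single-process sketch, not a production implementation: the class name `TokenBucket`, the `allow` method, and the injectable `clock` parameter are all assumptions made for illustration, and it assumes tokens are refilled continuously based on elapsed time.

```python
import time


class TokenBucket:
    """Holds up to `capacity` tokens, refilled at `refill_rate` tokens per second."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.clock = clock              # injectable so the limiter can be tested
        self.tokens = float(capacity)   # the bucket starts full
        self.last_refill = clock()

    def allow(self):
        """Return True if the request may proceed, False if it should get a 429."""
        now = self.clock()
        elapsed = now - self.last_refill
        # Add the tokens earned since the last call, capped at the capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

For the "20 requests per minute" example, one bucket per user would be created as `TokenBucket(capacity=20, refill_rate=20 / 60)`. A burst of 20 calls to `allow()` passes immediately, and further calls are rejected until roughly 3 seconds have elapsed.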

Advantages:

  • Simple to implement
  • Absorbs short bursts — if there are tokens available, spikes go through

Disadvantage:

  • Limits are per user, so under heavy load, if many users spend their tokens at the same moment, the system as a whole can still get overwhelmed

2. Leaky Bucket

Built on a physical metaphor: imagine a bucket with a small hole at the bottom.

No matter how fast you pour water in, the bucket drains at a constant rate. If it overflows, water is lost — requests are dropped.

In system terms: regardless of how fast requests arrive, the system processes them at a fixed rate. If the queue fills up, new requests are rejected.
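
As a rough illustration, here is a minimal sketch of a leaky bucket modeled as a bounded queue. The names `LeakyBucket`, `offer`, and `drain` are hypothetical, and the sketch assumes a worker calls `drain()` periodically to pick up the requests that are due for processing.

```python
import time
from collections import deque


class LeakyBucket:
    """Bounded queue that releases requests at `leak_rate` requests per second."""

    def __init__(self, capacity, leak_rate, clock=time.monotonic):
        self.capacity = capacity    # maximum number of queued requests
        self.leak_rate = leak_rate  # requests released per second
        self.clock = clock          # injectable so the limiter can be tested
        self.queue = deque()
        self.last_leak = clock()

    def offer(self, request):
        """Queue a request; return False (overflow) if the bucket is full."""
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(request)
        return True

    def drain(self):
        """Return the requests due for processing since the last drain."""
        now = self.clock()
        due = int((now - self.last_leak) * self.leak_rate)
        released = []
        if due > 0:
            for _ in range(min(due, len(self.queue))):
                released.append(self.queue.popleft())
            # Consume the elapsed time even if the queue was empty, so idle
            # periods do not build up credit for a later burst.
            self.last_leak += due / self.leak_rate
        return released
```

However quickly `offer()` is called, `drain()` never hands out more than `leak_rate` requests per second of elapsed time, which is what keeps the downstream load constant.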

Token Bucket  → Accept spikes, process all of them
Leaky Bucket  → Absorb spikes, process at a constant rate

Where this shines:

Consider a logging service writing to Elasticsearch. Under normal load, everything is fine. Then traffic spikes 10x.

  • Token Bucket: "Sure, come on in." Elasticsearch gets overwhelmed.
  • Leaky Bucket: "I'll write at my usual pace." The downstream system is protected.

With a priority-aware queue, critical logs (ERROR, WARN) are kept, while low-priority logs (INFO, DEBUG) are the first to be dropped when the queue is full.
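
One possible way to get that behavior is a bounded queue that evicts low-priority entries to make room for critical ones. The sketch below is an assumption about how such a policy could look, not a description of any particular logging library; `PriorityLogQueue` and `offer` are hypothetical names. Note that if the queue is full of critical logs only, even a critical log has to be rejected.

```python
from collections import deque

CRITICAL_LEVELS = {"ERROR", "WARN"}


class PriorityLogQueue:
    """Bounded log queue that sacrifices INFO/DEBUG entries before ERROR/WARN."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()

    def offer(self, level, message):
        """Return True if the log entry was queued, False if it was dropped."""
        if len(self.entries) < self.capacity:
            self.entries.append((level, message))
            return True
        if level not in CRITICAL_LEVELS:
            return False  # queue is full: low-priority logs are dropped
        # Queue is full and the new entry is critical: find the oldest
        # low-priority entry and evict it to make room.
        victim = next(
            (i for i, (lvl, _) in enumerate(self.entries)
             if lvl not in CRITICAL_LEVELS),
            None,
        )
        if victim is None:
            return False  # only critical entries are queued; nothing to evict
        del self.entries[victim]
        self.entries.append((level, message))
        return True
```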

Advantage: Protects downstream systems; output is predictable and stable

Disadvantage: Delays legitimate spikes too, and drops them once the queue is full. Not suitable for systems that must serve sudden bursts immediately


3. Fixed Window

This is the approach you encounter in banking and finance.

"You can make a maximum of 5 wire transfers per day." That's Fixed Window.

How it works:

  • Time is divided into equal windows (hour, day, month, year)
  • Requests are counted within each window
  • Exceed the limit → requests are rejected for the remainder of the window
  • New window starts → counter resets
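
A minimal in-memory sketch of these steps follows. The names `FixedWindowLimiter` and `allow` are hypothetical, and the sketch assumes windows are aligned to multiples of `window_seconds` on the clock (for a one-day window and a UTC epoch clock, that means midnight UTC).

```python
import time


class FixedWindowLimiter:
    """Counts requests per user inside fixed, clock-aligned windows."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock   # injectable so the limiter can be tested
        self.state = {}      # user -> (window_index, count)

    def allow(self, user):
        """Return True if the request fits in the current window."""
        window_index = int(self.clock() // self.window_seconds)
        stored_index, count = self.state.get(user, (window_index, 0))
        if stored_index != window_index:
            count = 0        # a new window has started: reset the counter
        if count >= self.limit:
            return False     # rejected for the remainder of this window
        self.state[user] = (window_index, count + 1)
        return True
```

The "5 wire transfers per day" rule would correspond to `FixedWindowLimiter(limit=5, window_seconds=86400)`. Only one small tuple is stored per user, which is why the memory cost stays low.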

The problem: Boundary Exploitation

There's a meaningful security gap here.

Say your limit is 100 requests per day. A user sends 100 requests at 11:59 PM, then 100 more at 12:01 AM. The system treats these as separate days — all 200 pass.

200 requests in 2 minutes. Limit was 100.

This is called boundary exploitation. It's the biggest weakness of Fixed Window.
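
The scenario is easy to reproduce with a small simulation. The helper below is a hypothetical sketch, with timestamps expressed as seconds since the start of day 0 and windows assumed to begin at midnight.

```python
DAY = 24 * 60 * 60  # seconds in one day


def accepted_by_fixed_window(timestamps, limit, window_seconds):
    """Return the timestamps that a fixed-window counter would accept."""
    counts = {}
    accepted = []
    for ts in timestamps:
        window = ts // window_seconds
        if counts.get(window, 0) < limit:
            counts[window] = counts.get(window, 0) + 1
            accepted.append(ts)
    return accepted


# 100 requests at 11:59 PM on day 0, then 100 more at 12:01 AM on day 1.
burst_before_midnight = [DAY - 60] * 100
burst_after_midnight = [DAY + 60] * 100

accepted = accepted_by_fixed_window(
    burst_before_midnight + burst_after_midnight,
    limit=100,
    window_seconds=DAY,
)

print(len(accepted))                  # 200
print(max(accepted) - min(accepted))  # 120 (seconds between first and last)
```

Both bursts land in different windows, so neither counter ever exceeds 100 even though all 200 requests arrive within 120 seconds.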


4. Sliding Window Log

Developed specifically to close the gap in Fixed Window.

The difference: instead of a fixed starting point, the window is calculated backwards from each request.

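Example: Your limit is 100 requests per day. A request arriving at 4:00 PM today is checked against every request made since 4:00 PM yesterday, not since midnight. Each logged request stops counting exactly 24 hours after it was made.

The window slides forward with every request, so there is no fixed boundary to straddle. Boundary exploitation becomes impossible.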
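
A minimal in-memory sketch of the idea, assuming one timestamp log per user, is shown below. The names `SlidingWindowLog` and `allow` are hypothetical. A shared store such as Redis would typically replace the in-process dictionary in a multi-server setup.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLog:
    """Keeps a timestamp per accepted request and counts the trailing window."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock               # injectable so the limiter can be tested
        self.logs = defaultdict(deque)   # user -> timestamps, oldest first

    def allow(self, user):
        """Return True if fewer than `limit` requests fall in the trailing window."""
        now = self.clock()
        log = self.logs[user]
        cutoff = now - self.window_seconds
        # Drop timestamps that have aged out of the window.
        while log and log[0] <= cutoff:
            log.popleft()
        if len(log) >= self.limit:
            return False
        log.append(now)
        return True
```

With `limit=100` and a one-day window, 100 requests at 11:59 PM followed by 100 more at 12:01 AM results in the second batch being rejected, because the first 100 timestamps are still inside the trailing 24 hours.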

Advantage: 100% accuracy. Time boundaries can never be gamed.

Disadvantage: Every request requires storing a timestamp. At 1 million, 10 million, or 100 million operations, this creates significant storage overhead. In-memory solutions like Redis help, but don't eliminate the cost.
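
To get a feel for the scale, here is a back-of-the-envelope estimate. It assumes 8 bytes per stored timestamp and a worst case where every active user has filled a 100-request log; both numbers are illustrative assumptions, and real stores add per-entry overhead on top of this lower bound.

```python
def log_bytes(active_users, limit_per_window, bytes_per_timestamp=8):
    """Lower-bound memory for the logs if every user has used their full limit."""
    return active_users * limit_per_window * bytes_per_timestamp


for users in (1_000_000, 10_000_000, 100_000_000):
    gib = log_bytes(users, limit_per_window=100) / 2**30
    print(f"{users:>11,} users -> at least {gib:.2f} GiB of raw timestamps")
```

By comparison, a fixed-window counter needs a single counter per user regardless of the limit.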


Which Algorithm Should You Use?

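| Algorithm          | Implementation | Spike Tolerance | Accuracy | Memory Cost |
|--------------------|----------------|-----------------|----------|-------------|
| Token Bucket       | ✅ Easy        | ✅ High         | Medium   | Low         |
| Leaky Bucket       | Medium         | ❌ None         | Medium   | Low         |
| Fixed Window       | ✅ Easy        | Medium          | ⚠️ Low   | Low         |
| Sliding Window Log | Complex        | Medium          | ✅ High  | ⚠️ High     |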

The answer depends on your system:

  • General API rate limits → Token Bucket is usually sufficient
  • Protecting downstream services → Leaky Bucket
  • Banking / compliance rules → Fixed Window (simplicity is the priority)
  • Security-critical systems requiring exact accuracy → Sliding Window Log

Final Thought

What matters here is properly analyzing the need and choosing a structure that fits your system.

The question "which algorithm is best?" is the wrong question. The right question is: which algorithm fits your specific problem?

Every solution has benefits and costs. Being able to make that decision consciously — seeing all the trade-offs and choosing deliberately — is what engineering actually looks like.

Tuncer Bağçabaşı
Software Engineer & AI Researcher