Problem: Your LLM API Is Getting Hammered
LLM rate limiting is the difference between a $50/month API bill and a $5,000 one. Without it, a single runaway client, a botched retry loop, or a bad actor can burn through your OpenAI or Anthropic quota in minutes.
You'll learn:
- Three battle-tested rate limiting strategies — token bucket, sliding window, and fixed window
- A Redis-backed limiter that works across multiple FastAPI workers
- Per-user and per-IP enforcement with graceful 429 responses
- How to avoid false positives that block legitimate users
Time: 20 min | Difficulty: Intermediate
Why LLM APIs Need Rate Limiting Differently
Standard REST endpoints cost microseconds per call. An LLM inference call costs $0.002–$0.06 per request and can take 5–30 seconds. A single user spamming POST /chat at 10 req/s doesn't just hurt your bill — it starves every other user in the queue.
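To make that concrete, here's the back-of-envelope burn rate for one misbehaving client. The $0.02/request figure is an assumed mid-point of the cost range above, not a quoted price:

```python
# Burn rate for a single runaway client at 10 req/s.
# $0.02/request is an assumed mid-point of the $0.002–$0.06 range.
req_per_s = 10
cost_per_req = 0.02
per_minute = req_per_s * 60 * cost_per_req
per_hour = per_minute * 60
print(f"${per_minute:.0f}/minute, ${per_hour:.0f}/hour")  # → $12/minute, $720/hour
```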
Symptoms of a missing rate limiter:
- OpenAI 429 — You exceeded your current quota errors in production
- Monthly API spend spiking with no traffic growth
- One tenant consuming 80%+ of your LLM capacity
- Anthropic overloaded_error during peak hours because your own upstream calls pile up
The fix is layered: enforce limits at the edge (per IP), at the application layer (per user/API key), and optionally at the model tier (per model slug).
Three-layer defense: IP throttle at the edge, per-user token bucket in FastAPI, and upstream quota guard before hitting OpenAI/Anthropic.
Solution
Step 1: Install Dependencies
# Using uv (recommended) — installs in ~2s vs pip's 20s
uv pip install fastapi uvicorn redis hiredis openai anthropic
# Or pip (inside a virtual environment)
pip install fastapi uvicorn redis hiredis openai anthropic
Verify Redis is running. Use Docker if you don't have a local instance:
docker run -d -p 6379:6379 --name redis-limiter redis:7-alpine
redis-cli ping # → PONG
Expected output: PONG
If it fails:
- Connection refused → Redis isn't running. Start it with the Docker command above.
- command not found: redis-cli → install with brew install redis (macOS) or apt install redis-tools (Ubuntu).
Step 2: Choose Your Strategy
Three algorithms cover almost every LLM rate-limiting need. Pick the one that matches your traffic shape.
| Strategy | Best For | Burst Allowed | Memory |
|---|---|---|---|
| Token Bucket | Smooth steady traffic | ✅ Yes | Low |
| Sliding Window | Strict per-minute SLAs | ❌ No | Medium |
| Fixed Window | Simple billing-aligned limits | ✅ At boundary | Lowest |
For LLM APIs, token bucket is the default choice. Users can burst up to the bucket capacity quickly (10 requests in the configuration below), but sustained abuse is capped. Fixed window is dangerous for LLMs — a user can double-burst at the window boundary.
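The boundary problem is easy to demonstrate with a short simulation. This is a sketch — the 10 req/min limit and the timestamps are illustrative, not from the configs below:

```python
# Simulate the fixed-window boundary problem: a naive 10-req/min
# counter admits 10 requests just before the minute boundary and
# 10 more just after it — 20 requests inside a 1-second span.
def fixed_window_allowed(timestamps, limit=10, window=60):
    """Count how many of the given request timestamps a naive
    fixed-window counter admits."""
    counts = {}      # window index -> requests seen in that window
    admitted = 0
    for t in sorted(timestamps):
        w = int(t // window)  # window boundaries at t=60, 120, ...
        counts[w] = counts.get(w, 0) + 1
        if counts[w] <= limit:
            admitted += 1
    return admitted

# 10 requests at t=59.5s and 10 more at t=60.5s — all 20 get through
burst = [59.5] * 10 + [60.5] * 10
print(fixed_window_allowed(burst))  # → 20
```

A token bucket with capacity 10 would admit the first 10 and refuse the rest until refill catches up.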
Step 3: Implement the Redis Token Bucket
# rate_limiter.py
import time

import redis.asyncio as aioredis
from fastapi import HTTPException, Request

REDIS_URL = "redis://localhost:6379"

# One client shared across workers — hiredis parser makes this ~3x faster
_redis: aioredis.Redis | None = None

async def get_redis() -> aioredis.Redis:
    global _redis
    if _redis is None:
        _redis = aioredis.from_url(REDIS_URL, decode_responses=True)
    return _redis

async def token_bucket_check(
    key: str,
    capacity: int = 10,        # max tokens (burst ceiling)
    refill_rate: float = 1.0,  # tokens added per second
    cost: int = 1,             # tokens consumed per LLM call
) -> bool:
    """
    Returns True if the request is allowed, False if rate-limited.
    Uses a Lua script for atomic read-modify-write — critical under concurrency.
    """
    r = await get_redis()
    now = time.time()
    lua_script = """
    local key = KEYS[1]
    local capacity = tonumber(ARGV[1])
    local refill_rate = tonumber(ARGV[2])
    local cost = tonumber(ARGV[3])
    local now = tonumber(ARGV[4])
    local data = redis.call('HMGET', key, 'tokens', 'last_refill')
    local tokens = tonumber(data[1]) or capacity
    local last_refill = tonumber(data[2]) or now
    -- Refill based on elapsed time
    local elapsed = now - last_refill
    tokens = math.min(capacity, tokens + elapsed * refill_rate)
    if tokens >= cost then
        tokens = tokens - cost
        redis.call('HSET', key, 'tokens', tokens, 'last_refill', now)
        redis.call('EXPIRE', key, 3600)  -- TTL: 1 hour of inactivity
        return 1  -- allowed
    else
        redis.call('HSET', key, 'tokens', tokens, 'last_refill', now)
        redis.call('EXPIRE', key, 3600)
        return 0  -- blocked
    end
    """
    result = await r.eval(lua_script, 1, key, capacity, refill_rate, cost, now)
    return bool(result)

async def require_rate_limit(request: Request, user_id: str) -> None:
    """
    Dependency for FastAPI routes. Raises 429 if rate-limited.
    Enforces both per-user and per-IP limits.
    """
    # Behind a proxy/load balancer, run uvicorn with --proxy-headers
    # so this reflects the real client IP, not the proxy's
    ip = request.client.host if request.client else "unknown"
    allowed_by_ip = await token_bucket_check(
        key=f"rl:ip:{ip}",
        capacity=20,      # 20-req burst per IP
        refill_rate=2.0,  # 2 req/s sustained per IP
    )
    if not allowed_by_ip:
        raise HTTPException(
            status_code=429,
            detail="Too many requests from this IP. Retry after a few seconds.",
            headers={"Retry-After": "5"},
        )
    allowed_by_user = await token_bucket_check(
        key=f"rl:user:{user_id}",
        capacity=10,      # 10-req burst per user
        refill_rate=0.5,  # 30 req/min sustained per user — sweet spot for chat UX
    )
    if not allowed_by_user:
        raise HTTPException(
            status_code=429,
            detail="Rate limit exceeded. You can send ~30 messages per minute.",
            headers={"Retry-After": "10"},
        )
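One refinement worth noting: `token_bucket_check` accepts a `cost` parameter that `require_rate_limit` leaves at 1. You can charge heavier requests more tokens. This is a sketch — the character-count rule and the premium multiplier are assumptions to tune, not part of the limiter above:

```python
# Hypothetical cost heuristic: long prompts and premium models drain
# the bucket faster. Thresholds here are illustrative.
def request_cost(message: str, premium_model: bool = False) -> int:
    cost = 1 + len(message) // 2000  # +1 bucket token per ~2k chars
    if premium_model:
        cost *= 2
    return cost

# Then pass the message through to the per-user check, e.g.:
#   allowed_by_user = await token_bucket_check(
#       key=f"rl:user:{user_id}",
#       capacity=10,
#       refill_rate=0.5,
#       cost=request_cost(message),
#   )
```

With this in place, a user pasting 8,000-character prompts burns the bucket five times faster than one sending short chat turns.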
Step 4: Wire It Into FastAPI
# main.py
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
from pydantic import BaseModel
from openai import AsyncOpenAI
from rate_limiter import require_rate_limit

app = FastAPI()
openai_client = AsyncOpenAI()  # reads OPENAI_API_KEY from env

class ChatRequest(BaseModel):
    message: str
    user_id: str

@app.post("/chat")
async def chat(
    payload: ChatRequest,
    request: Request,
):
    # Rate limit check — runs before any LLM call touches the wire
    await require_rate_limit(request, user_id=payload.user_id)
    response = await openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": payload.message}],
        max_tokens=1024,
    )
    return {"reply": response.choices[0].message.content}

@app.exception_handler(429)
async def rate_limit_handler(request: Request, exc):
    return JSONResponse(
        status_code=429,
        content={"error": exc.detail, "code": "rate_limited"},
        headers=exc.headers or {},
    )
Start the server:
uvicorn main:app --host 0.0.0.0 --port 8000 --workers 4
Expected output:
INFO: Started server process [12345]
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
If it fails:
- ModuleNotFoundError → a dependency is missing; rerun the install command from Step 1.
- redis.exceptions.ConnectionError → Redis isn't reachable. Check redis-cli ping.
Step 5: Add the Sliding Window for Strict Quota Enforcement
Token bucket allows bursts. If you sell tiered plans (e.g., Starter: 100 calls/day, Pro: 2,000 calls/day), use a sliding window instead — it prevents any window boundary abuse.
# sliding_window.py
import time

from rate_limiter import get_redis  # reuse the shared Redis client

async def sliding_window_check(
    key: str,
    limit: int,                   # max calls in the window
    window_seconds: int = 86400,  # 86400 = 1 day
) -> tuple[bool, int]:
    """
    Returns (allowed: bool, remaining: int).
    Uses Redis sorted sets — each call is scored by Unix timestamp.
    """
    r = await get_redis()
    now = time.time()
    window_start = now - window_seconds
    pipe = r.pipeline()
    # Remove expired entries
    pipe.zremrangebyscore(key, 0, window_start)
    # Count calls in window
    pipe.zcard(key)
    # Add this call
    pipe.zadd(key, {str(now): now})
    # Keep key alive for one window
    pipe.expire(key, window_seconds)
    results = await pipe.execute()
    current_count = results[1]  # count before adding this call
    if current_count >= limit:
        # Remove the call we just added — don't count refused calls
        await r.zrem(key, str(now))
        return False, 0
    remaining = limit - current_count - 1
    return True, remaining
Use it for daily quota checks per API key:
# In your route, after the token bucket check:
allowed, remaining = await sliding_window_check(
    key=f"quota:daily:{payload.user_id}",
    limit=100,  # Starter plan: 100 calls/day
    window_seconds=86400,
)
if not allowed:
    raise HTTPException(
        status_code=429,
        detail="Daily quota exceeded. Upgrade to Pro for 2,000 calls/day — starts at $29/month.",
        headers={"X-RateLimit-Remaining": "0", "Retry-After": "86400"},
    )
Verification
Run a quick load test with curl in a loop:
for i in $(seq 1 15); do
curl -s -o /dev/null -w "%{http_code}\n" \
-X POST http://localhost:8000/chat \
-H "Content-Type: application/json" \
-d '{"message": "hello", "user_id": "test-user-1"}'
done
You should see:
200
200
...
200 ← 10 allowed (burst capacity)
429
429
429
429
429 ← remaining calls blocked
Then wait 10 seconds and retry — tokens refill at 0.5/s, so you'll get ~5 new requests through.
sleep 10
curl -s -w "%{http_code}\n" -X POST http://localhost:8000/chat \
-H "Content-Type: application/json" \
-d '{"message": "hello", "user_id": "test-user-1"}'
# → 200
What You Learned
- Token bucket is the right default for LLM chat — it allows natural burst behavior while capping sustained abuse. Fixed window is dangerous at the boundary.
- Lua scripts are mandatory for Redis rate limiters under concurrency. Non-atomic check-then-set causes race conditions that let abusers slip through at scale.
- Layer your limits: IP limits stop bots, user limits enforce fair use, sliding windows enforce paid quotas. Don't rely on a single limit.
- Always return Retry-After in 429 responses — well-behaved clients (and your own frontend) will back off automatically instead of hammering harder.
- When NOT to use this: if you're on a managed API gateway (AWS API Gateway, Cloudflare Workers AI), use their built-in rate limiting. Rolling your own Redis solution on top adds latency for no gain.
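On the client side, honoring Retry-After takes only a few lines. A sketch with an injected send callable and sleep function so the retry logic stands alone — wrap your real HTTP call (requests, httpx) in send:

```python
import time

def post_with_backoff(send, max_attempts: int = 4, sleep=time.sleep) -> int:
    """Call send() until it stops returning 429, sleeping the
    server-advertised Retry-After (default 1s) between attempts.
    send() is any callable returning (status_code, headers)."""
    for attempt in range(max_attempts):
        status, headers = send()
        if status != 429:
            return status
        if attempt < max_attempts - 1:
            sleep(float(headers.get("Retry-After", "1")))
    return 429
```

Injecting sleep keeps the function testable without real waiting; in production, leave the default.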
Tested on Python 3.12, FastAPI 0.115, redis-py 5.x, Redis 7.2, Ubuntu 24.04 & macOS Sequoia
FAQ
Q: Does this work with Anthropic's API as well as OpenAI?
A: Yes — the limiter runs in your application layer before any upstream call. Swap openai_client.chat.completions.create for anthropic.Anthropic().messages.create and the limiting logic is identical.
Q: What's the difference between rate limiting and throttling?
A: Rate limiting rejects requests that exceed a threshold with a 429. Throttling queues them and slows delivery. For LLM APIs, rejection is almost always better — queuing causes memory buildup and unpredictable latency spikes.
Q: Can I run this without Redis, on a single-worker server?
A: Yes. Replace the Redis calls with an in-memory dict and asyncio.Lock. This works fine for single-process deployments but breaks across multiple Uvicorn workers — each worker has its own memory space.
Q: How much does Redis add to latency?
A: A local Redis call takes ~0.1–0.3ms. Against an LLM call that takes 500–5000ms, this is noise. On Redis Cloud (AWS us-east-1), expect ~1–2ms round-trip — still negligible.
Q: What's the minimum plan for Redis Cloud to support this in production?
A: The Free tier (30MB) is sufficient for up to ~50,000 active user keys. Paid plans start at $7/month for 250MB on AWS us-east-1 — more than enough for most early-stage LLM products.