Your LLM API has no rate limits. One user runs a batch job, burns $500 in 10 minutes, and your OpenAI account gets suspended. Your CTO is now asking why your "production-grade" API doesn't have the most basic guardrail in software. Rate limiting isn't a nice-to-have; it's your financial circuit breaker and system stability guarantee. And if you're not using Redis for it, you're either over-engineering with a separate service or under-engineering with a local counter that falls apart the moment you scale beyond one server.
Redis is the Swiss Army knife for this job. The second most popular NoSQL database among professional developers, used by roughly 30% of them (Stack Overflow Developer Survey 2025), it's the de facto standard for stateful coordination between stateless application servers. With single-node throughput above 1M operations/sec when pipelined (Redis Labs benchmark, 2025), your rate limiter will never be the bottleneck. Let's build one that actually works in production.
Why Your Fixed Window Counter is Leaking Requests
You've probably seen—or built—the naive rate limiter: a Redis key like rate_limit:user_123 with an INCR and a TTL. If the count exceeds 100, block. This is the Fixed Window algorithm. Its fatal flaw is the boundary problem. A user can make 100 requests at 11:59:59, another 100 at 12:00:01, and hammer your system with 200 requests in two seconds. For an LLM API where each call can cost cents to dollars, this leak is a budget killer.
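You can watch that leak happen without any infrastructure. Here's a minimal Python simulation of the fixed-window counter; the values and names are illustrative, and no Redis is involved:

```python
# Naive fixed-window counter: one bucket per user per 60-second window.
# This is a local sketch of the flaw, not the Redis implementation.
from collections import defaultdict

WINDOW = 60        # seconds
MAX_REQUESTS = 100

counters = defaultdict(int)

def fixed_window_allow(user: str, now: int) -> bool:
    bucket = now // WINDOW              # 11:59:59 and 12:00:01 land in different buckets
    counters[(user, bucket)] += 1
    return counters[(user, bucket)] <= MAX_REQUESTS

# 100 requests just before the window boundary, 100 just after:
allowed = sum(fixed_window_allow("alice", 719) for _ in range(100))
allowed += sum(fixed_window_allow("alice", 721) for _ in range(100))
print(allowed)  # 200 -- twice the limit, inside two seconds
```

The two bursts fall into different buckets, so neither counter ever trips, even though the user just doubled their limit in two seconds.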
The professional alternatives are Sliding Window and Token Bucket.
- Sliding Window looks at a rolling window of time (e.g., the last 60 seconds). It's precise but requires more storage.
- Token Bucket allows for bursts up to a capacity, with tokens refilling at a steady rate. It's more flexible for variable workloads.
The choice isn't academic. For cost-enforcement on an LLM API, you need the precision of a sliding window. For managing general API load with some burst tolerance, the token bucket is more forgiving. We'll implement both.
Implementing a Precise Sliding Window with Sorted Sets
The fixed window's problem is its lack of memory. The sliding window remembers every relevant request within the time window. In Redis, the perfect data structure for this is the Sorted Set (ZSET). We'll use timestamps as scores.
Here's the logic expressed as a shell script of redis-cli commands, which you can adapt to redis-py or ioredis:
# Limit: 10 requests per 60 seconds
KEY="rate_limit:sliding:alice:/v1/chat"
NOW=$(date +%s)
WINDOW_SECONDS=60
MAX_REQUESTS=10
# 1. Remove all requests older than the window
redis-cli ZREMRANGEBYSCORE "$KEY" 0 $((NOW - WINDOW_SECONDS))
# 2. Count the remaining requests (those within the window)
CURRENT_COUNT=$(redis-cli ZCARD "$KEY")
# 3. Check if the limit is exceeded
if [ "$CURRENT_COUNT" -ge "$MAX_REQUESTS" ]; then
  echo "Rate limit exceeded. Requests in window: $CURRENT_COUNT"
  exit 1
fi
# 4. Add the current request to the set
redis-cli ZADD "$KEY" "$NOW" "$NOW"
# 5. Expire the entire key to clean up memory after the window passes
redis-cli EXPIRE "$KEY" "$WINDOW_SECONDS"
This works, but it's not atomic across multiple commands: a user could sneak in extra requests between steps 1 and 4. For production, you must use a Lua script to guarantee atomicity. Open your VS Code integrated terminal (Ctrl+`) and create `sliding_window.lua`:
-- KEYS[1] = rate limit key (e.g., rate_limit:sliding:alice)
-- ARGV[1] = window size in seconds
-- ARGV[2] = max requests per window
-- ARGV[3] = current timestamp in seconds
local key = KEYS[1]
local window = tonumber(ARGV[1])
local max_requests = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local clear_before = now - window
redis.call('ZREMRANGEBYSCORE', key, 0, clear_before)
local current_count = redis.call('ZCARD', key)
if current_count >= max_requests then
  -- Return the time until the oldest request expires, for a Retry-After header
  local oldest = redis.call('ZRANGE', key, 0, 0, 'WITHSCORES')
  return {0, tonumber(oldest[2]) + window - now}
end
-- Add the new request with the timestamp as both score and member.
-- Caveat: two requests in the same second share a member and count once;
-- pass a unique request ID (e.g., a UUID) as the member for full precision.
redis.call('ZADD', key, now, now)
-- Renew the TTL
redis.call('EXPIRE', key, window)
return {1, max_requests - current_count - 1}
Call it from your Node.js/Python app:
// Using ioredis
const Redis = require('ioredis');
const client = new Redis();
const script = `...`; // The Lua script above
const [allowed, retryAfterOrRemaining] = await client.eval(
  script, 1, 'rate_limit:sliding:alice:/v1/chat', 60, 10, Math.floor(Date.now() / 1000)
);
if (allowed === 0) {
  res.setHeader('Retry-After', Math.ceil(retryAfterOrRemaining));
  res.status(429).send('Too Many Requests');
}
Why a Sorted Set? Because ZREMRANGEBYSCORE, ZCARD, and ZADD are all cheap: Sorted Set operations run in O(log N), so even a set holding millions of timestamped entries responds in microseconds. The rate limiter will never be the slow part of a request that ends in a multi-second LLM call.
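If you want to unit-test the window logic before wiring up Redis, the same algorithm can be expressed in plain Python. This is a local, single-process stand-in for the ZSET; the class and method names are ours:

```python
# Pure-Python reference of the sliding-window logic. A sorted list plays
# the role of the Redis Sorted Set; the Lua script keeps this atomic on
# the server in production.
import bisect

class SlidingWindowLimiter:
    def __init__(self, window: int, max_requests: int):
        self.window = window
        self.max_requests = max_requests
        self.timestamps: list[int] = []   # stand-in for the ZSET

    def allow(self, now: int) -> tuple:
        """Returns (allowed, remaining_or_retry_after)."""
        # ZREMRANGEBYSCORE: drop entries older than the window
        cutoff = now - self.window
        del self.timestamps[:bisect.bisect_right(self.timestamps, cutoff)]
        # ZCARD: count what's left
        if len(self.timestamps) >= self.max_requests:
            oldest = self.timestamps[0]
            return False, oldest + self.window - now   # Retry-After seconds
        # ZADD: record this request
        bisect.insort(self.timestamps, now)
        return True, self.max_requests - len(self.timestamps)

limiter = SlidingWindowLimiter(window=60, max_requests=10)
results = [limiter.allow(100)[0] for _ in range(11)]
print(results.count(True))   # 10 allowed; the 11th call is rejected
```

The sorted list of timestamps mirrors the ZSET exactly, which makes this a convenient fixture for testing edge cases like the window boundary.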
The Token Bucket: Allowing Controlled Bursts
Sometimes you want to let users burst. The token bucket algorithm gives you this: a bucket holds tokens (up to a capacity). Each request consumes one token. Tokens refill at a steady refill_rate per second. A user who hasn't made requests can burst up to the full capacity at once.
Implementing this requires tracking the current tokens and the last refill time. We need atomicity for the "check tokens, decrement, calculate refill" operation. Again, Lua is non-negotiable.
-- KEYS[1] = bucket key
-- ARGV[1] = bucket capacity
-- ARGV[2] = refill tokens per second
-- ARGV[3] = tokens to consume (usually 1)
-- ARGV[4] = current timestamp in seconds
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_per_second = tonumber(ARGV[2])
local requested = tonumber(ARGV[3])
local now = tonumber(ARGV[4])
local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1] or capacity)
local last_refill = tonumber(bucket[2] or now)
-- Calculate refill since last call
local seconds_passed = math.max(0, now - last_refill)
tokens = math.min(capacity, tokens + (seconds_passed * refill_per_second))
if tokens < requested then
  -- Not enough tokens. Persist the refilled count and advance last_refill,
  -- otherwise the next call would credit the same elapsed time twice.
  redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
  redis.call('EXPIRE', key, math.ceil(capacity / refill_per_second) * 2)
  return {0, tokens, (requested - tokens) / refill_per_second} -- allowed, remaining, wait_time
end
-- Consume the tokens
tokens = tokens - requested
redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, math.ceil(capacity / refill_per_second) * 2)
return {1, tokens, 0} -- allowed, remaining, wait_time
This uses a Redis Hash (HMSET, HMGET) to store the bucket state atomically. The EXPIRE is a safety net to clean up abandoned buckets, set to roughly double the time it takes to fully refill the bucket from empty. One caveat: Redis truncates Lua numbers to integers in script replies, so if you need fractional token counts or wait times, return them with tostring().
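The refill math is easy to get subtly wrong, so it's worth sanity-checking outside Redis. Here's the same token-bucket logic as a plain-Python reference; the class name and API are ours, for illustration only:

```python
# Pure-Python reference of the token-bucket math from the Lua script.
class TokenBucket:
    def __init__(self, capacity: float, refill_per_second: float):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity        # start full, like the Lua default
        self.last_refill = None

    def consume(self, requested: float, now: float) -> tuple:
        """Returns (allowed, remaining_tokens, wait_seconds)."""
        if self.last_refill is None:
            self.last_refill = now
        # Credit tokens for the time elapsed since the last call
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now   # always advance, or refill double-counts
        if self.tokens < requested:
            wait = (requested - self.tokens) / self.refill_per_second
            return False, self.tokens, wait
        self.tokens -= requested
        return True, self.tokens, 0.0

bucket = TokenBucket(capacity=5, refill_per_second=1.0)
burst = [bucket.consume(1, now=0.0)[0] for _ in range(6)]
print(burst)             # first 5 allowed (the burst), 6th denied
ok, _, _ = bucket.consume(1, now=2.0)   # 2 seconds later: 2 tokens refilled
print(ok)                # True
```

A test like this catches the classic bug of persisting refilled tokens without advancing `last_refill`, which silently grants extra capacity.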
Namespacing Keys for Per-User, Per-Org, and Per-Endpoint Quotas
Your Redis keys are your rate limit schema. Design them poorly and you'll hit `WRONGTYPE Operation against a key holding the wrong kind of value` the first time you accidentally ZADD to a key that holds a Hash. The fix is disciplined namespacing.
Use a consistent, segmented key pattern:
- rate_limit:sliding:{user_id}:{endpoint_path}
- rate_limit:bucket:{org_id}:global
- rate_limit:fixed:{ip_address}:login
This does two things: it prevents key-type collisions, and it allows for flexible invalidation. Need to reset all limits for a rogue organization? Use redis-cli --scan --pattern 'rate_limit:*:{org_abc}:*' | xargs redis-cli DEL. Want to shard limits across a Redis Cluster? The {user_id} part inside curly braces is a hash tag: it ensures all keys for that user land on the same hash slot, critical for multi-key operations.
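A small helper keeps the scheme honest. This is a hypothetical function of our own, not a library API; note the hash-tag braces around the subject:

```python
# Build namespaced rate-limit keys: rate_limit:<algorithm>:{<subject>}:<scope>
# The braces form a Redis Cluster hash tag so all of a subject's keys
# hash to the same slot.
def rate_limit_key(algorithm: str, subject: str, scope: str) -> str:
    for part in (algorithm, subject, scope):
        if ":" in part or "{" in part or "}" in part:
            raise ValueError(f"illegal characters in key part: {part!r}")
    return f"rate_limit:{algorithm}:{{{subject}}}:{scope}"

print(rate_limit_key("sliding", "alice", "/v1/chat"))
# rate_limit:sliding:{alice}:/v1/chat
```

Rejecting colons and braces in the parts guarantees that the segment layout, and therefore your scan patterns, can never be spoofed by a malicious user ID.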
Returning Rate Limit Headers and the Correct 429 Response
A good API tells clients when they're being limited. Standard headers are:
- X-RateLimit-Limit: The request limit (e.g., 10).
- X-RateLimit-Remaining: Requests left in the current window.
- X-RateLimit-Reset: Unix timestamp when the window resets (for fixed window) or the oldest request expires (for sliding).
- Retry-After: For a 429 response, the seconds to wait (crucial for LLM batch jobs to back off).
Our Lua scripts already return the data needed for these. The sliding window script returns the remaining count and the oldest request's expiry time. The token bucket script returns the remaining tokens and the wait time for new tokens. Populate your HTTP response headers accordingly.
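As a sketch, here's a framework-agnostic helper (names are ours) that maps the sliding-window script's reply onto those headers:

```python
# Translate the sliding-window script's {allowed, value} reply into an
# HTTP status and header dict. "value" is remaining requests when allowed,
# or seconds-until-oldest-expiry when denied.
import math

def rate_limit_headers(limit: int, allowed: int, value: float,
                       now: int, window: int) -> tuple:
    if allowed == 1:
        return 200, {
            "X-RateLimit-Limit": str(limit),
            "X-RateLimit-Remaining": str(int(value)),
            "X-RateLimit-Reset": str(now + window),
        }
    return 429, {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": "0",
        "Retry-After": str(math.ceil(value)),   # round up: never retry early
    }

status, headers = rate_limit_headers(10, 0, 42.3, now=1700000000, window=60)
print(status, headers["Retry-After"])   # 429 43
```

Rounding Retry-After up rather than down matters: a client that retries one second early just burns another denied request.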
Making It Distributed: The Whole Point of Using Redis
This is why you're here. Your two API servers in different availability zones need to share a rate limit counter. With the local-memory express-rate-limit, user requests would be split across servers, each allowing the full limit. With Redis as the central, shared state, both servers are reading and writing to the same counters.
Critical Consideration: Network latency. A naive implementation doing 2-3 Redis commands per request can add milliseconds. The solution is pipelining and Lua scripts (which we're already using). Sending the entire logic as one script is a single round-trip. Redis 7.x achieves 1M+ operations/sec on a single node with pipelining, but even without, the overhead is minimal compared to the LLM API call you're protecting.
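The back-of-the-envelope arithmetic, assuming an illustrative 0.5 ms same-zone round-trip:

```python
# Illustrative numbers only: round-trip count dominates per-request cost.
rtt_ms = 0.5              # assumed same-AZ Redis round-trip
naive = 5 * rtt_ms        # five sequential commands, five round-trips
lua = 1 * rtt_ms          # one EVAL, one round-trip
print(naive, lua)         # 2.5 0.5
```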
| Strategy | Operations per Request | Network Round-Trips | Atomic? | Suited For |
|---|---|---|---|---|
| Multi-Command (Naive) | 3-5 | 3-5 | ❌ No | Learning, never production |
| Lua Script (Our Method) | 1 | 1 | ✅ Yes | Production, distributed systems |
| Redis Module (redis-cell) | 1 | 1 | ✅ Yes | Production, if you control Redis |
The table shows the stark difference. The Lua script approach collapses a five-round-trip operation into one, turning a few milliseconds of Redis overhead into well under a millisecond on a same-zone network. That overhead is noise next to the LLM call you're protecting, which is measured in seconds.
Monitoring and Alerting: Knowing When You're Under Siege
You've deployed your rate limiter. Now you need to know when it's firing. Use redis-cli MONITOR sparingly in staging to see the commands. For production, track:
- Rate of 429 responses in your API metrics (Prometheus, Datadog).
- Memory usage of rate-limit keys. A surge can indicate a bug or an attack. Set `maxmemory` in `redis.conf` with an `allkeys-lru` eviction policy for this cache-like data. If you hit `OOM command not allowed when used memory > 'maxmemory'`, that's your signal to either increase memory or review your key expiration strategy.
- Redis CPU and connection count. Use the `INFO` command or RedisInsight.
Set up a Grafana dashboard with an alert rule: sum(rate(http_requests_total{status="429"}[5m])) > 10. When a user's batch job goes wild or a script kiddie finds your endpoint, you'll get a Slack alert before your cloud bill does.
Next Steps: From Rate Limiter to Full-Fledged Queue
You've stopped the financial bleed. But what about the 429s? Telling a user "no" is better than bankruptcy, but queuing the requests is better UX. This is where BullMQ or Celery with Redis as the broker comes in.
Instead of rejecting the 101st LLM request, push it into a Redis Stream or a BullMQ queue. A separate worker process consumes jobs at the sustainable rate. BullMQ can process 50,000 jobs/min on a single Redis instance with 8 workers. This pattern turns a hard limit into a smooth, throttled workflow.
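In miniature, the reject-versus-queue distinction looks like this (a pure-Python stand-in; in the real architecture the deque is a Redis Stream or BullMQ queue and a separate worker drains it):

```python
# Over-limit requests are deferred into a queue instead of getting a 429.
from collections import deque

MAX_PER_WINDOW = 3
accepted, queued = [], deque()

def submit(request_id: int):
    if len(accepted) < MAX_PER_WINDOW:
        accepted.append(request_id)   # within limit: run now
    else:
        queued.append(request_id)     # over limit: defer, don't reject

for i in range(5):
    submit(i)

print(accepted)       # [0, 1, 2]
print(list(queued))   # [3, 4] -- drained later at the sustainable rate
```

The client experience changes from "try again later" to "your job is queued", which is almost always what batch LLM users actually want.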
For your LLM API, the final architecture might be:
- Sliding Window Rate Limiter (Redis Sorted Set + Lua): Hard stop for cost protection.
- Priority Queue (BullMQ): For requests within the limit, manage execution order.
- Token Bucket (Redis Hash + Lua): For internal calls to the LLM provider, respecting their quotas.
Redis is the connective tissue for all three. With Redis 8.0 introducing native vector search, you could even start storing and searching embeddings right beside your rate limit counters. The journey from a single INCR command to a coordinated, distributed, and observable rate limiting system is a masterclass in using Redis for what it does best: providing fast, atomic, and shared state for stateless applications. Now go plug that $500-per-minute leak.