Problem: Your AI SaaS Is Bleeding Money Without Rate Limiting
You launched your AI-powered app. Users love it. Then one heavy user (or a bot) hammers your endpoint and your OpenAI bill spikes by $800 overnight. No rate limiting means no protection.
You'll learn:
- How to limit requests per user per minute/day
- How to track token usage, not just request counts
- How to enforce different limits per pricing tier
Time: 25 min | Level: Intermediate
Why This Happens
AI APIs charge per token, not per request. A single user can send one request that consumes 10,000 tokens. Simple "5 requests per minute" limits miss this entirely. You need layered limits: request rate, token budget, and per-tier quotas.
Common symptoms:
- Unexpected OpenAI/Anthropic bills at end of month
- A single free-tier user consuming 80% of your quota
- No visibility into who's using what
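To put numbers on the gap between request counts and token costs, here's a back-of-the-envelope sketch. The $10-per-million-token price is a hypothetical blend for illustration, not any provider's actual rate:

```typescript
// A user who stays under a 20 req/min cap can still burn serious money
// if each request carries a heavy prompt. Pricing below is hypothetical.
const requestsPerMinute = 20;     // your request-rate cap
const tokensPerRequest = 10_000;  // one heavy prompt + completion
const pricePerMillionTokens = 10; // hypothetical blended $/1M tokens

const tokensPerMinute = requestsPerMinute * tokensPerRequest; // 200,000
const dollarsPerHour =
  (tokensPerMinute * 60 * pricePerMillionTokens) / 1_000_000;
console.log(dollarsPerHour); // 120: one "compliant" user, $120/hour
```

That's why the request-rate limit in Step 2 is only the first layer, and the token budget in Step 3 does the real cost control.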
Solution
Step 1: Set Up Redis for Rate Limit State
Rate limiting needs shared state across server instances. Redis is the standard choice.
npm install ioredis rate-limiter-flexible
// lib/redis.ts
import Redis from "ioredis";
// Use a single connection — don't create per-request
export const redis = new Redis(process.env.REDIS_URL!, {
maxRetriesPerRequest: 3,
lazyConnect: true,
});
Expected: Redis connects on first use. No errors on startup.
If it fails:
- ECONNREFUSED: Your REDIS_URL is wrong or Redis isn't running
- ETIMEDOUT: Use lazyConnect: true to prevent startup crashes
Step 2: Implement Request Rate Limiting
Limit how often a user can hit your endpoint at all. This is your first line of defense.
// middleware/requestRateLimit.ts
import type { NextFunction, Request, Response } from "express";
import { RateLimiterRedis } from "rate-limiter-flexible";
import { redis } from "../lib/redis";
const limiterByUser = new RateLimiterRedis({
storeClient: redis,
keyPrefix: "rl_req",
points: 20, // 20 requests...
duration: 60, // ...per 60 seconds
blockDuration: 60, // Block for 60s if exceeded
});
export async function requestRateLimit(req: Request, res: Response, next: NextFunction) {
const userId = req.user?.id ?? req.ip; // Fall back to IP for anon users
try {
await limiterByUser.consume(userId);
next();
} catch {
// RateLimiterRedis throws when limit exceeded — not an error
res.set("Retry-After", "60");
res.status(429).json({
error: "Too many requests",
retryAfter: 60,
});
}
}
Why blockDuration: Without it, a user hitting the limit still consumes a Redis call per request. Blocking for 60s cuts those wasted lookups.
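To make the points/duration/blockDuration semantics concrete, here's a toy fixed-window counter in plain TypeScript. This is illustration only; rate-limiter-flexible does the equivalent atomically in Redis, and makeLimiter is a made-up name:

```typescript
// Toy single-process model of points / duration / blockDuration.
// Returns a function that says whether a call at time nowMs is allowed.
function makeLimiter(points: number, durationMs: number, blockMs: number) {
  let count = 0;
  let windowStart = 0;
  let blockedUntil = 0;
  return (nowMs: number): boolean => {
    if (nowMs < blockedUntil) return false; // inside the block period
    if (nowMs - windowStart >= durationMs) {
      windowStart = nowMs; // start a fresh window
      count = 0;
    }
    count += 1;
    if (count > points) {
      blockedUntil = nowMs + blockMs; // over the limit: block further calls
      return false;
    }
    return true;
  };
}

// 20 calls per 60s window, 60s block on overflow
const allow = makeLimiter(20, 60_000, 60_000);
```

During the block period, every rejected call costs a single timestamp comparison instead of re-counting the window, which is the same saving blockDuration buys you against Redis.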
Step 3: Track Token Usage Per User
Request limits aren't enough. Add a daily token budget that resets at midnight UTC.
// lib/tokenBudget.ts
import { redis } from "./redis";
const DAILY_LIMITS: Record<string, number> = {
free: 50_000, // ~25 average completions
pro: 500_000,
enterprise: -1, // -1 = unlimited
};
export async function checkAndConsumeTokens(
userId: string,
tier: "free" | "pro" | "enterprise",
estimatedTokens: number
): Promise<{ allowed: boolean; remaining: number }> {
const limit = DAILY_LIMITS[tier];
// Don't count for unlimited tiers
if (limit === -1) return { allowed: true, remaining: -1 };
const today = new Date().toISOString().slice(0, 10); // "2026-02-21"
const key = `tokens:${userId}:${today}`;
const used = await redis.incrby(key, estimatedTokens);
// Set TTL to 48h on first write — Redis cleans up automatically
if (used === estimatedTokens) {
await redis.expire(key, 172_800);
}
const remaining = limit - used;
return { allowed: remaining >= 0, remaining: Math.max(0, remaining) };
}
Expected: checkAndConsumeTokens("user_123", "free", 1500) returns { allowed: true, remaining: 48500 } on first call of the day.
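The allow/deny math can be exercised without a Redis instance. Here's a hypothetical in-memory variant (checkAndConsumeTokensInMemory is my own name) that mirrors the INCRBY logic, handy for unit tests:

```typescript
// In-memory mirror of the Redis token budget, for testing the math only.
const DAILY_LIMITS: Record<string, number> = {
  free: 50_000,
  pro: 500_000,
  enterprise: -1, // -1 = unlimited
};

const usage = new Map<string, number>(); // key -> tokens used today

function checkAndConsumeTokensInMemory(
  userId: string,
  tier: "free" | "pro" | "enterprise",
  estimatedTokens: number
): { allowed: boolean; remaining: number } {
  const limit = DAILY_LIMITS[tier];
  if (limit === -1) return { allowed: true, remaining: -1 };

  const today = new Date().toISOString().slice(0, 10);
  const key = `tokens:${userId}:${today}`;
  const used = (usage.get(key) ?? 0) + estimatedTokens; // mirrors INCRBY
  usage.set(key, used);

  const remaining = limit - used;
  return { allowed: remaining >= 0, remaining: Math.max(0, remaining) };
}

console.log(checkAndConsumeTokensInMemory("user_123", "free", 1500));
// { allowed: true, remaining: 48500 }
```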
Step 4: Wire It Into Your AI Route
Put the rate limit checks before you call your AI provider.
// routes/chat.ts
import { Router } from "express";
import OpenAI from "openai";
import { checkAndConsumeTokens } from "../lib/tokenBudget";
import { requestRateLimit } from "../middleware/requestRateLimit";
import { estimateTokens } from "../lib/tokenEstimator";
const router = Router();
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
router.post("/chat", requestRateLimit, async (req, res) => {
const { messages } = req.body;
const user = req.user!;
// Estimate tokens before calling the API to fail fast
const estimated = estimateTokens(messages);
const { allowed, remaining } = await checkAndConsumeTokens(
user.id,
user.tier,
estimated
);
if (!allowed) {
return res.status(429).json({
error: "Daily token limit reached",
resetAt: "midnight UTC",
upgradeUrl: "/pricing",
});
}
// Call your AI provider
const completion = await openai.chat.completions.create({
model: "gpt-4o",
messages,
});
// Reconcile with actual usage after the call; the delta can be
// negative, which Redis INCRBY handles correctly
const actualTokens = completion.usage?.total_tokens ?? estimated;
await checkAndConsumeTokens(user.id, user.tier, actualTokens - estimated);
res.json({ message: completion.choices[0].message, tokensRemaining: remaining });
});
Why estimate first: You prevent the API call entirely if the user is over budget. This saves real money, not just a Redis write.
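The estimateTokens helper imported in the route isn't defined anywhere in this guide. Here's a minimal sketch using the rough "four characters per token" heuristic; the per-message overhead and completion headroom numbers are assumptions, and tiktoken will give you accurate counts for OpenAI models:

```typescript
// lib/tokenEstimator.ts (sketch): rough pre-call token estimate.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

function estimateTokens(messages: ChatMessage[]): number {
  // ~4 characters per token is a common English-text approximation
  const chars = messages.reduce((sum, m) => sum + m.content.length, 0);
  const promptTokens = Math.ceil(chars / 4);
  const overheadPerMessage = 4;   // role/formatting tokens, approximate
  const expectedCompletion = 500; // headroom for the model's reply
  return promptTokens + messages.length * overheadPerMessage + expectedCompletion;
}
```

Overestimating slightly is the safe direction here: the reconciliation step after the API call credits back the difference.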
Step 5: Add Headers So Clients Know Their Limits
Good APIs tell clients what's happening. This prevents confused users and support tickets.
// middleware/rateLimitHeaders.ts
import type { Response } from "express";
export function addRateLimitHeaders(res: Response, remaining: number, resetAt: Date) {
res.setHeader("X-RateLimit-Remaining-Tokens", remaining);
res.setHeader("X-RateLimit-Reset", Math.floor(resetAt.getTime() / 1000));
res.setHeader("X-RateLimit-Limit-Tokens", "50000"); // Or pull from user tier
}
Include these in your /chat route response before sending. Standard headers mean your users can build retry logic without guessing.
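The resetAt argument above should match the daily key rollover from Step 3, i.e. the next midnight UTC. A small helper (nextMidnightUtc is my own name):

```typescript
// Next midnight UTC, matching the daily token-budget key rollover.
function nextMidnightUtc(now: Date = new Date()): Date {
  return new Date(Date.UTC(
    now.getUTCFullYear(),
    now.getUTCMonth(),
    now.getUTCDate() + 1 // Date.UTC normalizes month/year overflow
  ));
}
```

Passing `now.getUTCDate() + 1` works even at month or year boundaries because Date.UTC normalizes overflowing fields.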
Verification
# Send 21 requests in 60 seconds and confirm the 21st returns 429
for i in $(seq 1 21); do curl -s -o /dev/null -w "%{http_code}\n" \
-X POST http://localhost:3000/chat \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"messages":[{"role":"user","content":"hi"}]}'; done
You should see: 20 lines of 200 followed by one 429.
# Check Redis to confirm token tracking
redis-cli GET "tokens:user_123:2026-02-21"
You should see: A number matching your total token usage for the day.
What You Learned
- Request rate limiting (20 req/min) and token budgets (50k/day) solve different problems — you need both
- Estimate tokens before the API call to fail fast and save money
- Use Redis TTL instead of cron jobs to reset daily budgets
- Always return Retry-After and token headers so clients can handle limits gracefully
Limitation: Token estimation is approximate. tiktoken gives accurate counts for OpenAI models, but you'll still want to record actual usage after each call and reconcile the difference.
When NOT to use this: If you're on a fixed-credit system (like Anthropic's Workspaces), the provider handles quotas for you. Only build this yourself if you're reselling AI access or need per-user billing visibility.
Tested on Node.js 22.x, Redis 7.2, rate-limiter-flexible 5.x, ioredis 5.x