Step-by-Step: Setting up Rate Limiting for Your Public AI SaaS

Protect your AI SaaS from abuse and runaway costs by implementing token, request, and user-tier rate limiting in production.

Problem: Your AI SaaS Is Bleeding Money Without Rate Limiting

You launched your AI-powered app. Users love it. Then one heavy user — or a bot — hammers your endpoint and your OpenAI bill spikes $800 overnight. No rate limiting means no protection.

You'll learn:

  • How to limit requests per user per minute/day
  • How to track token usage, not just request counts
  • How to enforce different limits per pricing tier

Time: 25 min | Level: Intermediate


Why This Happens

AI APIs charge per token, not per request. A single user can send one request that consumes 10,000 tokens. Simple "5 requests per minute" limits miss this entirely. You need layered limits: request rate, token budget, and per-tier quotas.
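To see why a request-only limit falls short, here's a rough cost sketch. The per-token price is a placeholder for illustration, not a real quote from any provider:

```typescript
// A request-count limit treats these two users identically,
// but their token cost differs by 100x. Price is hypothetical.
const PRICE_PER_1K_TOKENS = 0.01; // placeholder blended rate in USD

function costUSD(requests: number, tokensPerRequest: number): number {
  return (requests * tokensPerRequest * PRICE_PER_1K_TOKENS) / 1000;
}

console.log(costUSD(5, 100));    // 5 small requests
console.log(costUSD(5, 10_000)); // same request count, 100x the cost
```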

Common symptoms:

  • Unexpected OpenAI/Anthropic bills at end of month
  • A single free-tier user consuming 80% of your quota
  • No visibility into who's using what

Solution

Step 1: Set Up Redis for Rate Limit State

Rate limiting needs shared state across server instances. Redis is the standard choice.

npm install ioredis rate-limiter-flexible
// lib/redis.ts
import Redis from "ioredis";

// Use a single connection — don't create per-request
export const redis = new Redis(process.env.REDIS_URL!, {
  maxRetriesPerRequest: 3,
  lazyConnect: true,
});

Expected: Redis connects on first use. No errors on startup.

If it fails:

  • ECONNREFUSED: Your REDIS_URL is wrong or Redis isn't running
  • ETIMEDOUT: A firewall or network rule is blocking the connection; lazyConnect: true (already set above) defers connecting until the first command, so startup won't crash

Step 2: Implement Request Rate Limiting

Limit how often a user can hit your endpoint at all. This is your first line of defense.

// middleware/requestRateLimit.ts
import type { Request, Response, NextFunction } from "express";
import { RateLimiterRedis } from "rate-limiter-flexible";
import { redis } from "../lib/redis";

const limiterByUser = new RateLimiterRedis({
  storeClient: redis,
  keyPrefix: "rl_req",
  points: 20,        // 20 requests...
  duration: 60,      // ...per 60 seconds
  blockDuration: 60, // Block for 60s if exceeded
});

export async function requestRateLimit(req: Request, res: Response, next: NextFunction) {
  const userId = req.user?.id ?? req.ip; // Fall back to IP for anon users

  try {
    await limiterByUser.consume(userId);
    next();
  } catch (rej) {
    // RateLimiterRedis rejects with a RateLimiterRes when the limit is exceeded — not an error
    const retryAfter = Math.ceil(((rej as { msBeforeNext?: number }).msBeforeNext ?? 60_000) / 1000);
    res.set("Retry-After", String(retryAfter));
    res.status(429).json({
      error: "Too many requests",
      retryAfter,
    });
  }
}

Why blockDuration: Without it, a user who exceeds the limit can resume the moment the window rolls over. Blocking for 60s adds a real penalty; if you also want to skip the Redis lookup for already-blocked users, pair it with the inMemoryBlockOnConsumed option.
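The points/duration/blockDuration semantics can be sketched with a minimal in-memory fixed-window limiter. This is an illustration only; the Redis-backed limiter above is what shares state across server instances:

```typescript
// Minimal in-memory fixed-window limiter mirroring points/duration/blockDuration.
// Illustration only — use RateLimiterRedis in production.
type Bucket = { count: number; windowStart: number; blockedUntil: number };

class TinyLimiter {
  private buckets = new Map<string, Bucket>();
  constructor(
    private points: number,     // allowed hits per window
    private durationMs: number, // window length
    private blockMs: number     // penalty once exceeded
  ) {}

  consume(key: string, now: number): boolean {
    const b = this.buckets.get(key) ?? { count: 0, windowStart: now, blockedUntil: 0 };
    if (now < b.blockedUntil) return false;       // still serving the penalty
    if (now - b.windowStart >= this.durationMs) { // window rolled over
      b.count = 0;
      b.windowStart = now;
    }
    b.count += 1;
    if (b.count > this.points) b.blockedUntil = now + this.blockMs;
    this.buckets.set(key, b);
    return b.count <= this.points;
  }
}
```

`new TinyLimiter(20, 60_000, 60_000)` mirrors the Step 2 config: 20 requests per 60 seconds, then a 60-second block.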


Step 3: Track Token Usage Per User

Request limits aren't enough. Add a daily token budget that resets at midnight UTC.

// lib/tokenBudget.ts
import { redis } from "./redis";

const DAILY_LIMITS: Record<string, number> = {
  free: 50_000,       // ~25 average completions
  pro: 500_000,
  enterprise: -1,     // -1 = unlimited
};

export async function checkAndConsumeTokens(
  userId: string,
  tier: "free" | "pro" | "enterprise",
  estimatedTokens: number
): Promise<{ allowed: boolean; remaining: number }> {
  const limit = DAILY_LIMITS[tier];

  // Don't count for unlimited tiers
  if (limit === -1) return { allowed: true, remaining: -1 };

  const today = new Date().toISOString().slice(0, 10); // "2026-02-21"
  const key = `tokens:${userId}:${today}`;

  const used = await redis.incrby(key, estimatedTokens);

  // Set TTL to 48h on first write — Redis cleans up automatically
  if (used === estimatedTokens) {
    await redis.expire(key, 172_800);
  }

  const remaining = limit - used;
  return { allowed: remaining >= 0, remaining: Math.max(0, remaining) };
}

Expected: checkAndConsumeTokens("user_123", "free", 1500) returns { allowed: true, remaining: 48500 } on first call of the day.
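The budget math can be checked in isolation with an in-memory stand-in for the Redis counter (an illustrative sketch, not the production path):

```typescript
// In-memory stand-in for the Redis daily counter, mirroring checkAndConsumeTokens.
const usage = new Map<string, number>();

function consumeBudget(
  key: string,
  limit: number, // -1 = unlimited, as in DAILY_LIMITS
  tokens: number
): { allowed: boolean; remaining: number } {
  if (limit === -1) return { allowed: true, remaining: -1 };
  const used = (usage.get(key) ?? 0) + tokens; // INCRBY equivalent
  usage.set(key, used);
  const remaining = limit - used;
  return { allowed: remaining >= 0, remaining: Math.max(0, remaining) };
}
```

`consumeBudget("user_123:2026-02-21", 50_000, 1500)` on a fresh key returns `{ allowed: true, remaining: 48500 }`, matching the expected output above. Note that, like the Redis version, tokens are counted even when the call is disallowed.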


Step 4: Wire It Into Your AI Route

Put the rate limit checks before you call your AI provider.

// routes/chat.ts
import { Router } from "express";
import OpenAI from "openai";
import { checkAndConsumeTokens } from "../lib/tokenBudget";
import { requestRateLimit } from "../middleware/requestRateLimit";
import { estimateTokens } from "../lib/tokenEstimator";

const router = Router();
const openai = new OpenAI(); // Reads OPENAI_API_KEY from the environment

router.post("/chat", requestRateLimit, async (req, res) => {
  const { messages } = req.body;
  const user = req.user!;

  // Estimate tokens before calling the API to fail fast
  const estimated = estimateTokens(messages);

  const { allowed, remaining } = await checkAndConsumeTokens(
    user.id,
    user.tier,
    estimated
  );

  if (!allowed) {
    return res.status(429).json({
      error: "Daily token limit reached",
      resetAt: "midnight UTC",
      upgradeUrl: "/pricing",
    });
  }

  // Call your AI provider
  const completion = await openai.chat.completions.create({
    model: "gpt-4o",
    messages,
  });

  // Reconcile with actual usage after the call (the adjustment may be negative)
  const actualTokens = completion.usage?.total_tokens ?? estimated;
  await checkAndConsumeTokens(user.id, user.tier, actualTokens - estimated);

  // `remaining` reflects the pre-call estimate; the reconciled value applies from the next request
  res.json({ message: completion.choices[0].message, tokensRemaining: remaining });
});

Why estimate first: You prevent the API call entirely if the user is over budget. This saves real money, not just a Redis write.
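estimateTokens is imported above but not shown. A common rough heuristic is about 4 characters per token for English text; this is a sketch of one possible implementation, and a real tokenizer such as tiktoken should be used for accurate OpenAI counts:

```typescript
// lib/tokenEstimator.ts
// Heuristic sketch: ~4 characters per token for English text.
// Use a real tokenizer (e.g. tiktoken) when accuracy matters.
type ChatMessage = { role: string; content: string };

export function estimateTokens(messages: ChatMessage[]): number {
  const chars = messages.reduce((sum, m) => sum + m.content.length, 0);
  // Small per-message overhead for role/formatting tokens
  return Math.ceil(chars / 4) + messages.length * 4;
}
```

Since the route reconciles with actual usage after each call, a slightly high estimate is safer than a low one.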


Step 5: Add Headers So Clients Know Their Limits

Good APIs tell clients what's happening. This prevents confused users and support tickets.

// middleware/rateLimitHeaders.ts
import type { Response } from "express";

export function addRateLimitHeaders(res: Response, remaining: number, resetAt: Date) {
  res.setHeader("X-RateLimit-Remaining-Tokens", String(remaining));
  res.setHeader("X-RateLimit-Reset", String(Math.floor(resetAt.getTime() / 1000)));
  res.setHeader("X-RateLimit-Limit-Tokens", "50000"); // Or pull from the user's tier
}

Include these in your /chat route response before sending. Standard headers mean your users can build retry logic without guessing.
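For the resetAt argument, the next midnight UTC (when the daily token key from Step 3 rolls over) can be computed like this (a small sketch, consistent with the key format above):

```typescript
// Next midnight UTC — the moment the daily token key from Step 3 rolls over.
function nextMidnightUtc(from: Date = new Date()): Date {
  return new Date(Date.UTC(
    from.getUTCFullYear(),
    from.getUTCMonth(),
    from.getUTCDate() + 1, // Date.UTC normalizes day/month/year overflow
    0, 0, 0, 0
  ));
}
```

Then call `addRateLimitHeaders(res, remaining, nextMidnightUtc())` before sending the response.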


Verification

# Send 21 requests in 60 seconds and confirm the 21st returns 429
for i in $(seq 1 21); do curl -s -o /dev/null -w "%{http_code}\n" \
  -X POST http://localhost:3000/chat \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"hi"}]}'; done

You should see: 20 lines of 200 followed by one 429.

# Check Redis to confirm token tracking
redis-cli GET "tokens:user_123:2026-02-21"

You should see: A number matching your total token usage for the day.


What You Learned

  • Request rate limiting (20 req/min) and token budgets (50k/day) solve different problems — you need both
  • Estimate tokens before the API call to fail fast and save money
  • Use Redis TTL instead of cron jobs to reset daily budgets
  • Always return Retry-After and token headers so clients can handle limits gracefully

Limitation: Token estimation is approximate. tiktoken gives accurate counts for OpenAI models, but you'll still want to record actual usage after each call and reconcile the difference.

When NOT to use this: If you're on a fixed-credit system (like Anthropic's Workspaces), the provider handles quotas for you. Only build this yourself if you're reselling AI access or need per-user billing visibility.


Tested on Node.js 22.x, Redis 7.2, rate-limiter-flexible 5.x, ioredis 5.x