Multi-Tenant LLM SaaS Architecture: Isolating Prompts, Costs, and Rate Limits Per Customer

Design a multi-tenant LLM backend where each customer's prompts, spend, and rate limits are fully isolated — preventing data leakage, cost overruns, and noisy neighbour problems.

Tenant A's aggressive usage just ate 80% of your shared rate limit pool and caused Tenant B to get 429 errors for 12 minutes. In B2B SaaS, that's a contract conversation. Here's the isolation architecture that prevents it.

You built a slick AI feature, hooked it up to GPT-4, and watched your first ten customers sign up. Then, on Tuesday at 2 PM, your Slack lights up. Tenant B's CTO is furious—their mission-critical workflow is broken, throwing "rate limit exceeded" errors. You check the logs. Tenant A, running an automated script with a malformed loop, just hammered the OpenAI API 2,000 times in 60 seconds, draining your shared pool. You've just graduated from a technical hiccup to a business-critical SLA violation. Welcome to multi-tenant LLM hell.

The naive "single API key to rule them all" model works until your first enterprise customer asks for an uptime guarantee and a detailed usage report. Without isolation, you're flying blind: a Pillar survey in 2025 found that multi-tenant LLM SaaS projects see an average 340% cost overrun without per-tenant token tracking. Your margins evaporate, your reliability tanks, and you start getting invoices for conversations you never meant to have.

This guide is the architectural intervention. We're building the walls between tenants—for prompts, costs, rate limits, and data—using a model-agnostic backend. The goal isn't just to stop the bleeding; it's to build the foundation for a white-label AI SaaS product that can command 60-75% gross margins, precisely because you have control, visibility, and isolation.

The Four Isolation Problems: Prompts, Costs, Rate Limits, and Data

Isolation in a multi-tenant LLM system isn't one problem; it's four intertwined problems that will bite you in different phases of growth.

  1. Prompt & Context Isolation: The most insidious leak. Tenant A's proprietary data, embedded in a prompt template or retrieved from a vector store, must never, under any circumstance, be sent in a request for Tenant B. This is a data breach waiting to happen.
  2. Cost Isolation: If you can't attribute every penny of your OpenAI, Anthropic, or Groq bill to a specific tenant, you cannot price profitably. Shared costs mean one tenant's inefficiency (or malice) is subsidized by all the others.
  3. Rate Limit Isolation: Provider-level rate limits (e.g., 10k RPM on GPT-4o) are a shared resource. A single tenant's burst must be throttled before it consumes the shared pool and causes collateral damage for everyone else, triggering those 429 errors.
  4. Data & Cache Isolation: Caching LLM responses is a huge performance and cost win. But a cached response for Tenant A must be inaccessible to Tenant B. Similarly, any tenant-specific data (chat history, embeddings) needs strict namespace separation.

Fail at any one, and your product isn't enterprise-ready. Let's start building the fences.

Tenant Context Injection: The Middleware That Knows Who's Who

Every request entering your system must be stamped with a tenant identity before it touches anything else. This is non-negotiable. Your API gateway or first piece of middleware should resolve the tenant context from the request—be it a JWT claim, a subdomain, or an API key header.

Here’s a FastAPI middleware pattern that sets the stage for everything to come. It extracts a tenant ID and user ID and injects them into a request-state object. Every subsequent layer—rate limiter, prompt renderer, LLM router—will have access to this context.


from fastapi import Request, HTTPException
from typing import Optional
import uuid

# `redis` is assumed to be an async Redis client (e.g. redis.asyncio.Redis)
# created at application startup.

class TenantContext:
    def __init__(self, tenant_id: str, user_id: Optional[str], request_id: str):
        self.tenant_id = tenant_id
        self.user_id = user_id
        self.request_id = request_id  # For distributed tracing

async def tenant_middleware(request: Request, call_next):
    # 1. Resolve Tenant Identity
    api_key = request.headers.get("X-API-Key")
    tenant_id = None

    # Example: Look up API key in Supabase (cached in Redis for production)
    if api_key:
        # Pseudocode for a cached lookup
        tenant_id = await redis.get(f"api_key:{api_key}:tenant_id")
        if not tenant_id:
            # Hit your database (e.g., Supabase)
            # tenant_id = await supabase.get_tenant_by_key(api_key)
            # await redis.setex(f"api_key:{api_key}:tenant_id", 3600, tenant_id)
            tenant_id = "tenant_abc123"  # Placeholder
    else:
        # No API key present. In production, fall back to JWT or subdomain
        # resolution here before rejecting the request.
        raise HTTPException(status_code=401, detail="Tenant identity required")

    # 2. Resolve User (if authenticated)
    user_id = request.headers.get("X-User-Id", "system")

    # 3. Generate a request ID for tracing
    request_id = str(uuid.uuid4())

    # 4. Attach context to request state
    request.state.tenant_context = TenantContext(
        tenant_id=tenant_id,
        user_id=user_id,
        request_id=request_id
    )

    # 5. Proceed with the enriched request
    response = await call_next(request)

    # 6. Attach the request ID to response headers for client-side debugging
    response.headers["X-Request-ID"] = request_id
    return response

This context is your golden thread. Pass it explicitly through function calls or use a carefully scoped contextvar. Don't lose it.
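If you'd rather not thread the context through every function signature, Python's contextvars module gives each asyncio task its own scoped value. A minimal sketch, storing just the tenant_id string for brevity (the helper names are illustrative):

```python
import contextvars
from typing import Optional

# One ContextVar per process; each asyncio task sees its own value.
_tenant_ctx: contextvars.ContextVar[Optional[str]] = contextvars.ContextVar(
    "tenant_ctx", default=None
)

def set_current_tenant(tenant_id: str) -> contextvars.Token:
    """Call in the middleware, right after resolving the tenant."""
    return _tenant_ctx.set(tenant_id)

def get_current_tenant() -> str:
    """Call anywhere downstream; raises if the middleware was bypassed."""
    tenant_id = _tenant_ctx.get()
    if tenant_id is None:
        raise RuntimeError("No tenant context set -- request bypassed middleware?")
    return tenant_id

def clear_current_tenant(token: contextvars.Token) -> None:
    """Reset in a finally block so pooled workers don't leak tenant identity."""
    _tenant_ctx.reset(token)
```

Set the value in the middleware right after resolving the tenant, and reset it in a finally block so one tenant's identity never bleeds into the next request handled by the same worker.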

Per-Tenant Redis Namespacing: Your Rate and Cost Command Center

Redis is the operational brain for real-time isolation. We'll use it for two critical patterns: the Token Bucket Algorithm for rate limiting and atomic counters for cost tracking. The cardinal rule: every Redis key must be prefixed with the tenant_id.

Rate Limiting: A token bucket per tenant, per feature. This ensures Tenant A's script can be throttled to, say, 100 requests per minute without affecting Tenant B's allowance.

Cost Tracking: An atomic counter for tokens used. Every time you call an LLM, you add the usage (prompt + completion tokens) to this counter. This is your real-time cost ledger.

# services/rate_limit_and_cost.py
from fastapi import HTTPException
from litellm import acompletion  # async variant of litellm.completion

# `redis` is assumed to be an async Redis client (the same one the
# middleware uses), created at application startup.

async def make_isolated_llm_call(
    tenant_context: TenantContext,
    messages: list,
    model: str = "gpt-4o-mini"
):
    tenant_id = tenant_context.tenant_id
    feature = "chat"  # Could be "summarize", "generate_sql", etc.

    # --- 1. RATE LIMIT CHECK ---
    rate_limit_key = f"{tenant_id}:rate_limit:{feature}"
    # Use the redis-cell module's `CL.THROTTLE` (or a Lua script) for a token bucket.
    # Args: key, max_burst, count, period, quantity -> here, 100 requests per 60 s.
    # Reply: [limited, limit, remaining, retry_after, reset]; limited is 0 when allowed.
    throttle = await redis.execute_command(
        'CL.THROTTLE', rate_limit_key, 100, 100, 60, 1
    )
    if throttle[0] == 1:  # 1 means throttled
        raise HTTPException(status_code=429, detail="Tenant rate limit exceeded")

    # --- 2. MAKE LLM CALL WITH LITELLM (Model-Agnostic) ---
    # LiteLLM router adds ~8ms overhead but enables cost-saving fallbacks.
    try:
        response = await acompletion(
            model=model,
            messages=messages,
            # LiteLLM automatically returns usage metrics
        )
    except Exception as e:
        # Critical: Implement model fallbacks per tenant tier.
        # A timeout to one provider shouldn't break the request.
        # Fix: set fallbacks=['gpt-4o', 'claude-3-sonnet', 'ollama/llama3']
        raise

    # --- 3. ATOMIC COST TRACKING ---
    if response.usage:
        total_tokens = response.usage.total_tokens
        cost_key = f"{tenant_id}:spend:total_tokens"
        # Atomically increment the counter. Pipeline this with the per-feature
        # INCRBY below to save a round trip.
        await redis.incrby(cost_key, total_tokens)

        # Optional: Track per-feature spend
        feature_cost_key = f"{tenant_id}:spend:{feature}:tokens"
        await redis.incrby(feature_cost_key, total_tokens)

    # --- 4. TENANT-AWARE CACHING (Optional) ---
    # Cache key MUST include tenant_id to prevent leakage. Check the cache
    # before step 2 and write the response here. (Prefer a stable digest such
    # as hashlib.sha256 over hash(), which is salted per Python process.)
    cache_key = f"{tenant_id}:cache:{model}:{hash(str(messages))}"
    # ... cache logic ...

    return response

Real Error & Fix:

Error: Tenant A's prompt leaking to Tenant B — You discover cached responses from Tenant A appearing in Tenant B's requests. Fix: The root cause is almost always a cache key without a tenant prefix. Prefix all Redis keys with tenant_id and validate this pattern in a security middleware audit.
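If your Redis deployment doesn't ship the redis-cell module, the token bucket algorithm itself is small. Here it is as an in-process Python sketch with illustrative names; in production you'd port allow() to a Lua script so the read-modify-write runs atomically inside Redis:

```python
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TokenBucket:
    capacity: float                 # max burst, e.g. 100 requests
    refill_rate: float              # tokens added per second, e.g. 100 / 60
    tokens: Optional[float] = None  # current fill; starts full
    last_refill: float = field(default_factory=time.monotonic)

    def __post_init__(self):
        if self.tokens is None:
            self.tokens = self.capacity

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# One bucket per tenant, per feature -- mirroring the tenant-prefixed key layout.
_buckets: dict[str, TokenBucket] = {}

def check_rate_limit(tenant_id: str, feature: str, rpm: int = 100) -> bool:
    key = f"{tenant_id}:rate_limit:{feature}"
    bucket = _buckets.setdefault(key, TokenBucket(capacity=rpm, refill_rate=rpm / 60))
    return bucket.allow()
```

Because each bucket lives under a tenant-prefixed key, exhausting tenant_a's bucket leaves tenant_b's allowance untouched, which is the whole point.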

Prompt Isolation: Versioned Templates and Context Boundaries

Your prompts are application logic. Treat them as code. Prompt versioning reduces regression incidents by 67% compared to ad-hoc prompt management (LangSmith data). Store them in a Git repository, not just a database column. A database is fine for the active version pointer, but Git gives you history, diffs, and rollbacks.

The isolation trick is in the renderer. When you inject tenant-specific context (like retrieved documents, user data, or product info), you must guarantee that data source is scoped to the tenant.

# services/prompt_service.py
import jinja2
from typing import Dict, Any

class PromptRenderer:
    def __init__(self):
        self.env = jinja2.Environment(loader=jinja2.FileSystemLoader("./prompt_templates"))

    async def render_for_tenant(
        self,
        tenant_context: TenantContext,
        template_name: str,
        template_vars: Dict[str, Any]
    ) -> str:
        # 1. Load the correct version of the template.
        #    Fetch the Git commit hash or version ID associated with this tenant's plan.
        version = await self._get_template_version(tenant_context.tenant_id, template_name)
        template = self.env.get_template(f"{template_name}/{version}.j2")

        # 2. **CRITICAL: Enrich template vars with TENANT-SCOPED data only.**
        #    This function must only query databases using `tenant_context.tenant_id` in the WHERE clause.
        tenant_scoped_data = await self._fetch_tenant_data(tenant_context.tenant_id)
        all_vars = {**template_vars, **tenant_scoped_data}

        # 3. Render.
        return template.render(**all_vars)

    async def _get_template_version(self, tenant_id: str, template_name: str) -> str:
        # Look up the active version pointer (a Git tag or commit hash) for
        # this tenant's plan. The pointer lives in the database; the full
        # template history lives in Git.
        return "v1"  # Placeholder

    async def _fetch_tenant_data(self, tenant_id: str) -> Dict[str, Any]:
        # Example: fetch the tenant's document chunks from a vector store.
        # The query MUST filter on tenant_id, e.g. with supabase-py:
        # supabase.table("tenant_documents").select("*").eq("tenant_id", tenant_id).execute()
        return {"company_name": "Acme Inc.", "kb_articles": [...]}

Real Error & Fix:

Error: Prompt regression after update — A new prompt template deployed on Friday causes a 30% drop in output quality for all customers. Fix: Run an evaluation suite against a golden dataset before promoting to production. Use LangSmith to track chain-of-thought, correctness, and cost metrics for each prompt version across a representative dataset.
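The promotion gate itself doesn't need a framework to start. Here's a sketch of the idea, with an exact-match grader standing in for whatever scorer you actually use (LangSmith evals, semantic similarity, LLM-as-judge); all names are illustrative:

```python
from typing import Callable, Iterable

def evaluate_prompt_version(
    generate: Callable[[str], str],          # candidate prompt -> model output
    golden_set: Iterable[tuple[str, str]],   # (input, expected) pairs
    grade: Callable[[str, str], float],      # returns 0.0-1.0 per example
    pass_threshold: float = 0.9,
) -> dict:
    """Score a candidate prompt version; gate promotion on the mean score."""
    scores = [grade(generate(inp), expected) for inp, expected in golden_set]
    mean = sum(scores) / len(scores) if scores else 0.0
    return {"mean_score": mean, "passed": mean >= pass_threshold, "n": len(scores)}

# Stand-in grader: exact match. Swap for semantic similarity or LLM-as-judge.
def exact_match(output: str, expected: str) -> float:
    return 1.0 if output.strip() == expected.strip() else 0.0
```

Wire this into CI so a prompt version is only promoted when `passed` is true against the golden dataset, and track cost per example alongside quality.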

Tier-Based Limits: Enforcing the Contract at the Gateway

Your Free, Pro, and Enterprise plans aren't just marketing. They are technical contracts enforced at the API gateway. These checks should happen before any heavy LLM or retrieval work.

| Limit Type       | Free Tier | Pro Tier             | Enterprise (Custom) | Enforcement Point         |
|------------------|-----------|----------------------|---------------------|---------------------------|
| Requests/Day     | 100       | 10,000               | 100,000+            | Redis Counter (Fast)      |
| Tokens/Month     | 10k       | 1M                   | Unlimited           | Redis Counter (Post-Call) |
| Max Input Length | 2k chars  | 10k chars            | 100k chars          | Middleware (Pre-Call)     |
| Supported Models | GPT-3.5   | GPT-4o, Claude Haiku | All + Fine-tunes    | LiteLLM Router Config     |
| Rate Limit (RPM) | 10        | 100                  | 500                 | Redis Token Bucket        |
| Offline Cache    | N/A       | 10MB                 | 50MB                | IndexedDB Quota           |

This table should be stored as configuration, not hardcoded. Your tenant middleware should attach a tenant_tier object to the request context, and subsequent services check against it.
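As a sketch of that configuration, a dict keyed by tier plus a cheap pre-call guard. The numbers mirror the table above; the names TIER_LIMITS and check_tier_limits are illustrative:

```python
# Tier limits as data, not code. None means "unlimited" / "all models".
TIER_LIMITS = {
    "free": {"max_input_chars": 2_000, "rpm": 10,
             "models": {"gpt-3.5-turbo"}},
    "pro": {"max_input_chars": 10_000, "rpm": 100,
            "models": {"gpt-4o", "claude-3-haiku"}},
    "enterprise": {"max_input_chars": 100_000, "rpm": 500,
                   "models": None},
}

def check_tier_limits(tier: str, model: str, input_chars: int) -> None:
    """Pre-call guard: run these cheap checks before any LLM or retrieval work."""
    limits = TIER_LIMITS[tier]
    if input_chars > limits["max_input_chars"]:
        raise ValueError(
            f"Input exceeds {limits['max_input_chars']} chars for tier '{tier}'")
    if limits["models"] is not None and model not in limits["models"]:
        raise ValueError(f"Model '{model}' not available on tier '{tier}'")
```

In practice this dict would be loaded from your database or a config file, so sales can negotiate custom enterprise limits without a deploy.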

Cost Attribution Dashboard: From Redis Counters to Real-Time Dollars

Those atomic Redis counters are raw tokens. You need to convert them to dollars and expose them. Build a real-time dashboard that shows:

  • Current Month Spend: (Total Tokens * Per-Token Cost)
  • Spend by Feature: Chat vs. Summarization vs. Code Generation.
  • Spend by Model: GPT-4 vs. Claude vs. a cheaper local model.

Calculate costs by pulling the latest per-model, per-provider token prices (store these in config). A scheduled job can roll up the high-frequency Redis counters into a durable store (e.g. a Supabase/Postgres table) for historical reporting and invoicing via Stripe.

# services/cost_calculator.py
async def get_tenant_spend_dashboard(tenant_id: str):
    # Fetch raw token counts from Redis
    total_tokens = int(await redis.get(f"{tenant_id}:spend:total_tokens") or 0)
    chat_tokens = int(await redis.get(f"{tenant_id}:spend:chat:tokens") or 0)

    # Per-token prices (example; fetch from config). The simplified estimate
    # below uses a blended average; with per-model token counters you would
    # apply these prices exactly.
    model_costs = {
        "gpt-4o": {"input": 2.50 / 1_000_000, "output": 10.00 / 1_000_000},
        "gpt-4o-mini": {"input": 0.15 / 1_000_000, "output": 0.60 / 1_000_000},
        "claude-3-5-sonnet": {"input": 3.00 / 1_000_000, "output": 15.00 / 1_000_000},
    }

    avg_cost_per_token = 5.0 / 1_000_000  # blended average: $5 per million tokens
    estimated_spend = total_tokens * avg_cost_per_token

    return {
        "total_tokens": total_tokens,
        "estimated_spend_usd": round(estimated_spend, 2),
        "breakdown": {
            "chat": chat_tokens,
            # ... other features
        }
    }
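Once you track tokens per model (e.g. separate Redis counters per model, an extension of the key layout above), the dollar conversion becomes exact instead of a blended average. A sketch; the prices are illustrative and belong in config:

```python
# $ per token, derived from $-per-million list prices (illustrative; keep in config).
MODEL_COSTS = {
    "gpt-4o":      {"input": 2.50 / 1_000_000, "output": 10.00 / 1_000_000},
    "gpt-4o-mini": {"input": 0.15 / 1_000_000, "output": 0.60 / 1_000_000},
}

def spend_usd(usage_by_model: dict) -> float:
    """usage_by_model: {"gpt-4o": {"input": 120_000, "output": 40_000}, ...}"""
    total = 0.0
    for model, usage in usage_by_model.items():
        prices = MODEL_COSTS[model]
        total += usage.get("input", 0) * prices["input"]
        total += usage.get("output", 0) * prices["output"]
    return round(total, 4)
```

Input and output tokens are priced differently by every major provider, so tracking them as separate counters (rather than one total) is what makes the invoice line items defensible.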

The Non-Negotiable Audit Log

Every LLM call must be logged to an immutable audit trail. This is for debugging, compliance, and answering the "what happened?" question when a tenant complains. Log to a structured data store (a Supabase table) or a dedicated logging service.

Each log entry should contain:

  • tenant_id, user_id, request_id (from your initial middleware)
  • Timestamp
  • Feature/endpoint invoked
  • Model requested and model used (fallbacks change this)
  • Prompt tokens, completion tokens, total tokens
  • The exact prompt sent (sanitized of PII if necessary) and the response received.
  • Latency and any errors.

This log is your source of truth. It allows you to reconstruct any session, prove isolation, and generate itemized bills.
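As a concrete shape for those entries, here's a helper that builds one record per call. The field names are assumptions to adapt to your own schema; insert rows append-only and never update them:

```python
from datetime import datetime, timezone
from typing import Optional

def build_audit_entry(
    tenant_id: str,
    user_id: str,
    request_id: str,
    feature: str,
    model_requested: str,
    model_used: str,
    prompt_tokens: int,
    completion_tokens: int,
    latency_ms: float,
    prompt_text: str,
    response_text: str,
    error: Optional[str] = None,
) -> dict:
    """One immutable row per LLM call; insert-only, never updated."""
    return {
        "tenant_id": tenant_id,
        "user_id": user_id,
        "request_id": request_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "feature": feature,
        "model_requested": model_requested,
        "model_used": model_used,            # differs when fallbacks fire
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
        "prompt_text": prompt_text,          # sanitize PII before storing
        "response_text": response_text,
        "latency_ms": latency_ms,
        "error": error,
    }
```

Note that model_requested and model_used are separate fields: when a fallback fires, both are evidence you'll want at billing and debugging time.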

The Offline-First Edge: Isolating Data in the Browser

For web apps, consider an offline-first architecture using IndexedDB. Offline-first AI apps retain 89% of functionality without a network vs. 23% for server-dependent apps. Cache vector embeddings, prompt templates, and even small models locally, but namespace everything by tenant_id.

// frontend/storage/tenant-aware-indexeddb.js
import { openDB } from 'idb';

const DB_NAME = 'AI_WORKER';
const STORE_VECTORS = 'vectors';

async function getTenantDB(tenantId) {
    // Database name includes tenantId to ensure total separation at the storage level.
    const dbName = `${DB_NAME}_${tenantId}`;
    const db = await openDB(dbName, 1, {
        upgrade(db) {
            if (!db.objectStoreNames.contains(STORE_VECTORS)) {
                const store = db.createObjectStore(STORE_VECTORS, { keyPath: 'id' });
                store.createIndex('by-timestamp', 'timestamp');
            }
        },
    });
    return db;
}

async function cacheEmbedding(tenantId, key, embedding) {
    const db = await getTenantDB(tenantId);
    const tx = db.transaction(STORE_VECTORS, 'readwrite');
    await tx.store.put({
        id: key,
        vector: embedding,
        timestamp: Date.now(),
        tenant_id: tenantId // Redundant but safe
    });
    await tx.done;
    // Evict least-recently-used entries to stay under quota.
    await enforceLRU(db);
}

// Keep at most MAX_ENTRIES vectors per tenant DB, evicting oldest first.
const MAX_ENTRIES = 1000;

async function enforceLRU(db) {
    const tx = db.transaction(STORE_VECTORS, 'readwrite');
    const index = tx.store.index('by-timestamp');
    let excess = (await tx.store.count()) - MAX_ENTRIES;
    // The timestamp index yields the oldest entries first.
    let cursor = await index.openCursor();
    while (cursor && excess > 0) {
        await cursor.delete();
        excess--;
        cursor = await cursor.continue();
    }
    await tx.done;
}

Real Error & Fix:

Error: IndexedDB QuotaExceededError — The browser throws this when your app tries to cache too much data for a tenant. Fix: Implement LRU (Least Recently Used) eviction for your vector cache and set a hard max storage limit per tenant (e.g., 50MB). Proactively clean up old entries.

Next Steps: From Isolation to Scale

You now have the blueprint. The isolation architecture is the table stakes for a serious multi-tenant LLM SaaS. What's next?

  1. Automate the Pipeline: Use GitHub Actions to test prompt versions against your golden dataset. Deploy only versions that pass quality and cost thresholds.
  2. Implement Circuit Breakers: In your LiteLLM router, add circuit breakers for LLM providers. If Anthropic's API starts throwing 5xx errors, fail fast and route all traffic to your fallback provider for that tenant.
  3. Build the Billing Integration: Connect your real-time cost aggregators to Stripe or Chargebee. Generate metered invoices automatically. This turns cost tracking into revenue.
  4. Plan for Data Residency: For global enterprises, isolation might need to be geographic. This means tenant-aware deployment of your backend and vector databases to specific cloud regions (e.g., EU data stays in EU).
  5. Simulate Load: Use k6 or Locust to simulate the "Tenant A runaway script" scenario. Verify that your token buckets hold, Tenant B's requests succeed, and your dashboard updates in real-time.
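Step 2's circuit breaker is a small state machine worth sketching. Here's the underlying idea in plain Python, one instance per provider (LiteLLM's router ships its own cooldown handling, so treat this as illustration rather than a drop-in):

```python
import time
from typing import Optional

class CircuitBreaker:
    """Fail fast after repeated provider errors; probe again after a cooldown.

    States: closed (normal) -> open (fail fast) -> half-open (one probe).
    """
    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True                                        # closed: normal
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            return True                                        # half-open: probe
        return False                                           # open: use fallback

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None                                  # back to closed

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()                  # trip open
```

On a 5xx from a provider, call record_failure() and route to the fallback; whenever allow_request() returns False, skip that provider entirely instead of burning latency on a request that will time out.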

The shift from a shared playground to a gated community is the defining architectural transition for an AI SaaS. It's what lets you sleep at night, price with confidence, and hand an enterprise customer an audit log that proves their data never left their designated silo. Stop letting one tenant's traffic spike become everyone's emergency. Build the walls.