You built an AI writing tool. An agency wants to resell it under their brand with their OpenAI key. Your current architecture has your API key hardcoded in the backend. Here's the BYOK (Bring Your Own Key) architecture.
Your first instinct is to fork the repo, swap the API key, and hand it over. Then the next agency asks. And the next. Suddenly you’re managing twelve nearly-identical codebases and drowning in git merge conflicts. You’ve accidentally built a consultancy, not a scalable white-label AI SaaS.
The alternative—keeping a shared key and eating the cost—is a financial death spiral. Without per-tenant token tracking, multi-tenant LLM costs routinely blow far past projections. That agency running 10k blog outlines a month? That’s your margin evaporating.
The escape hatch is a BYOK (Bring Your Own Key) model. The customer brings their API key from OpenAI, Anthropic, or Azure. You provide the polished application, the routing logic, the analytics, and the brand—their key pays the raw LLM bill. Your gross margin climbs toward pure-software territory, instead of being dragged down by inference costs you shoulder yourself. You trade raw revenue for pure, high-margin software.
This is the architecture for a hardened, white-label AI SaaS that doesn’t bleed money or leak prompts between tenants.
BYOK vs. Shared Key: The Margin and Liability Knife-Edge
Let’s crystallize the tradeoff. With a shared key, you are a utility. You buy LLM compute wholesale and resell it retail. Every token consumed is a direct, variable cost to you. Your pricing is a guessing game against your burn rate.
With BYOK, you are a platform. Your customer’s key is their fuel. Your cost base becomes fixed (infrastructure, support), while your revenue is primarily subscription-based. The risk of a customer’s viral tweet racking up a $10k OpenAI bill transfers from your P&L to theirs.
Here’s the breakpoint. Calculate your effective cost per 1k tokens (including failures, retries, support). If you can mark that up more than your customers value the convenience, shared key works. For most, the math is brutal. BYOK aligns incentives: customers are mindful of usage, and you’re not sweating the API logs.
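That breakpoint math fits in a few lines. Here is a back-of-the-envelope sketch; the function and the plugged-in figures are illustrative assumptions, not benchmarks:

```python
def gross_margin(price_per_month: float, tokens_per_month: int,
                 cost_per_1k: float, fixed_cost: float) -> float:
    """Gross margin per customer: revenue minus LLM spend minus fixed costs."""
    llm_cost = tokens_per_month / 1000 * cost_per_1k
    return (price_per_month - llm_cost - fixed_cost) / price_per_month

# Shared key: you eat 5M tokens/mo at $0.50/1k on a $99 plan
shared = gross_margin(99.0, 5_000_000, 0.50, fixed_cost=8.0)

# BYOK: the customer's key eats the tokens; you keep only fixed costs
byok = gross_margin(99.0, 0, 0.50, fixed_cost=8.0)
```

Run it with your own numbers; for most per-seat pricing, the shared-key result goes deeply negative long before BYOK does.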
The Critical Liability Shift: With a shared key, you own GDPR compliance for AI-generated data, you are liable for ToS violations (e.g., generating harmful content via your key), and you must implement and pay for exhaustive rate limiting. With BYOK, the API key holder (your customer) is the controller of that AI interaction. Your role shifts to a processor. This changes everything in your Terms of Service (more on that later).
Encrypting Customer Keys: Not Your Database, Not Your Problem
You will store customer API keys. This is non-negotiable for a good UX. The cardinal sin is storing them in plaintext in your users table.
The blueprint: Encrypt at the application level with AES-256-GCM before the data touches your database. The encryption key (a data encryption key or DEK) is itself encrypted by a key encryption key (KEK) managed in a cloud KMS (Key Management Service). This is envelope encryption.
```python
import os
import boto3
from cryptography.fernet import Fernet
from supabase import create_client, Client

# Initialize clients
supabase: Client = create_client(os.getenv('SUPABASE_URL'), os.getenv('SUPABASE_KEY'))
kms_client = boto3.client('kms', region_name='us-east-1')

def encrypt_with_kms(plaintext_key: bytes) -> tuple[bytes, bytes]:
    """Envelope-encrypts a payload. Returns (encrypted_payload, encrypted_data_key)."""
    # Generate a local data encryption key (DEK)
    data_key = Fernet.generate_key()
    # Encrypt the DEK with the KMS-managed KEK
    kms_response = kms_client.encrypt(
        KeyId=os.getenv('KMS_KEY_ARN'),
        Plaintext=data_key
    )
    encrypted_dek = kms_response['CiphertextBlob']
    # Use the local data key to encrypt the payload
    f = Fernet(data_key)
    encrypted_payload = f.encrypt(plaintext_key)
    return encrypted_payload, encrypted_dek

async def store_tenant_api_key(tenant_id: str, provider_key: str):
    """
    Stores an encrypted API key for a tenant.
    The encrypted key and its encrypted data key are stored separately.
    """
    provider_key_bytes = provider_key.encode('utf-8')
    encrypted_key, encrypted_dek = encrypt_with_kms(provider_key_bytes)
    # Store in Supabase. In a real setup, these might be in separate tables or vaults.
    response = supabase.table('tenant_credentials').upsert({
        'tenant_id': tenant_id,
        'encrypted_api_key': encrypted_key.hex(),  # Store as hex string
        'encrypted_data_key': encrypted_dek.hex(),
        'key_id': os.getenv('KMS_KEY_ARN')  # Track which KEK was used
    }).execute()
    if not response.data:
        raise Exception("Failed to store encrypted key")
```
The Supabase table tenant_credentials now holds useless ciphertext without access to your KMS. Even a full database breach yields no API keys.
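The middleware below calls a decrypt_key helper that reverses this flow. A minimal sketch, assuming AWS KMS and the same Fernet DEK scheme; the optional kms_client parameter is an addition of this sketch, there purely so a stub client can be injected in tests:

```python
from cryptography.fernet import Fernet

def decrypt_key(encrypted_payload: bytes, encrypted_dek: bytes, kms_client=None) -> str:
    """Reverse of encrypt_with_kms: KMS unwraps the DEK, Fernet decrypts the payload."""
    if kms_client is None:
        import boto3  # deferred import so a stub client can be injected in tests
        kms_client = boto3.client('kms', region_name='us-east-1')
    # KMS decrypts the wrapped data encryption key
    data_key = kms_client.decrypt(CiphertextBlob=encrypted_dek)['Plaintext']
    # The recovered DEK decrypts the stored provider key
    return Fernet(data_key).decrypt(encrypted_payload).decode('utf-8')
```

The plaintext provider key exists only as a local variable for the lifetime of the request.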
Per-Request Key Injection: Decrypt on Demand, Not on Boot
Never decrypt all customer keys at application startup and hold them in memory. Decrypt just-in-time for the request, use it, and let the garbage collector shred the plaintext.
This is where LiteLLM shines. It’s a model-agnostic router that can dynamically set the api_key per request. We’ll build a middleware that:
- Authenticates the request (via JWT or API key) to identify the tenant_id.
- Fetches and decrypts the tenant’s provider key.
- Injects it into the request context for LiteLLM.
```python
# FastAPI middleware for per-request key decryption and injection
import litellm
from fastapi import FastAPI, Request, HTTPException
from starlette.middleware.base import BaseHTTPMiddleware

app = FastAPI()

class TenantKeyMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        # 1. Extract tenant_id from request (e.g., from JWT in Authorization header)
        tenant_id = get_tenant_id_from_request(request)  # Your auth logic here
        if not tenant_id:
            raise HTTPException(status_code=401, detail="Invalid tenant")
        # 2. Fetch encrypted credentials
        creds = supabase.table('tenant_credentials')\
            .select('*')\
            .eq('tenant_id', tenant_id)\
            .single()\
            .execute()
        if not creds.data:
            raise HTTPException(status_code=400, detail="Tenant API key not configured")
        # 3. Decrypt the key (reverse of the encrypt_with_kms function)
        decrypted_provider_key = decrypt_key(
            bytes.fromhex(creds.data['encrypted_api_key']),
            bytes.fromhex(creds.data['encrypted_data_key'])
        )
        # 4. Attach to request state
        request.state.tenant_id = tenant_id
        request.state.provider_api_key = decrypted_provider_key
        # Proceed with the request
        response = await call_next(request)
        return response

app.add_middleware(TenantKeyMiddleware)

# In your LiteLLM completion endpoint
@app.post("/v1/completions")
async def completion(request: Request, completion_request: dict):
    tenant_id = request.state.tenant_id
    api_key = request.state.provider_api_key
    # Configure LiteLLM to use the tenant's key for this call
    # This isolates costs and quotas perfectly per tenant
    response = await litellm.acompletion(
        model="gpt-4o",
        messages=completion_request["messages"],
        api_key=api_key,  # Injected per-request
        metadata={
            "tenant_id": tenant_id,  # For logging
            "project": "white_label_app"
        }
    )
    return response
```
Real Error & Fix:
LiteLLM Error: provider timeout, no fallback configured
Your agency’s OpenAI key might hit a rate limit. Without a fallback, their users see an error. With LiteLLM, you define cascading fallbacks using their configured keys.
```python
# Fix: Configure fallbacks in the LiteLLM call
response = await litellm.acompletion(
    model="gpt-4o",
    messages=messages,
    api_key=api_key,
    fallbacks=[
        "gpt-4o",           # Primary
        "claude-3-sonnet",  # Fallback to Anthropic (requires key setup)
        "ollama/llama3"     # Final fallback to a local model
    ],
    metadata={"tenant_id": tenant_id}
)
```
The fallback check adds a few milliseconds when a switch actually happens, but it turns a hard failure into a slightly slower success, which does far more for perceived reliability than any latency optimization.
Usage Metering: Tracking What You Don't Pay For
If you’re not paying the bill, why track tokens? Three reasons: feature gating, abuse prevention, and value analytics.
You might offer a "Professional" plan limited to 1M tokens/month. You need to meter usage to enforce that limit, even though the underlying cost isn't yours. You also need to detect a tenant whose key is suddenly generating 10x the volume—it could be a bug in your app or them reselling your white-label service.
Use LiteLLM’s built-in callback system to log every response’s token count to a Redis stream keyed by tenant_id.
```python
# Callback to log usage per tenant to Redis
from datetime import datetime
import redis
import litellm

r = redis.Redis(host='localhost', port=6379, decode_responses=True)

def token_logger(
    kwargs,               # kwargs passed to completion
    completion_response,  # response from the LLM
    start_time, end_time  # start/end time
):
    tenant_id = kwargs.get("metadata", {}).get("tenant_id")
    usage = completion_response.get("usage", {})
    prompt_tokens = usage.get("prompt_tokens", 0)
    completion_tokens = usage.get("completion_tokens", 0)
    if tenant_id:
        # Log to a Redis sorted set for rolling monthly totals
        key = f"usage:{tenant_id}:{datetime.utcnow().strftime('%Y-%m')}"
        r.zincrby(key, prompt_tokens + completion_tokens, "total_tokens")
        # Also log to a stream for real-time alerts
        log_entry = {
            "tenant": tenant_id,
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "model": kwargs.get("model"),
            "timestamp": end_time.isoformat()
        }
        r.xadd(f"usage_stream:{tenant_id}", log_entry)

# Attach the callback
litellm.success_callback = [token_logger]
```
Now you can implement rate limiting.
A per-tenant Redis token bucket typically costs well under a millisecond per request in a same-region deployment, a trivial price for catching a runaway tenant before they destabilize your platform (or blow up their own bill).
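Here is a minimal in-process sketch of that token bucket. In production you would keep the bucket state in Redis (e.g., via a Lua script) so all workers share it; the class and function names here are hypothetical:

```python
import time

class TokenBucket:
    """Per-tenant token bucket: `rate` requests/sec refill, `capacity` burst size."""
    def __init__(self, rate: float, capacity: int):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.ts = float(capacity), time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.ts) * self.rate)
        self.ts = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per tenant_id, created lazily on first request
buckets: dict[str, TokenBucket] = {}

def allow_request(tenant_id: str, rate: float = 5.0, capacity: int = 20) -> bool:
    bucket = buckets.setdefault(tenant_id, TokenBucket(rate, capacity))
    return bucket.allow()
```

Call allow_request(tenant_id) in the middleware before decrypting keys; rejecting early is also cheaper than rejecting late.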
The White-Labeling Layer: Domain, Logo, and System Prompt
The UI is Next.js. White-labeling means:
- Domain: They point app.theiragency.com to your Vercel project. Use Vercel's multi-tenant configuration to map domains to tenant_id.
- Logo & CSS: A tenants table in Supabase stores logo_url, primary_color, and favicon. Your Next.js app fetches this on load via the tenant_id derived from the hostname.
- System Prompt Branding: This is where the AI itself becomes "theirs." You cannot let Tenant A’s custom instructions leak into Tenant B’s session.
Real Error & Fix:
Tenant A's prompt leaking to Tenant B
This happens if you cache system prompts in Redis without tenant isolation.
```python
# Fix: Prefix all Redis keys with tenant_id
def get_tenant_system_prompt(tenant_id: str) -> str:
    cache_key = f"{tenant_id}:system_prompt:v1"  # CRITICAL PREFIX
    cached = r.get(cache_key)
    if cached:
        return cached
    # Fetch from DB, cache, return
    prompt = fetch_prompt_from_db(tenant_id)
    r.setex(cache_key, 3600, prompt)
    return prompt
```
Store prompt templates in Git for version control, not just a database. Versioned prompts give you diffs, reviews, and one-command rollbacks, which makes regressions far rarer than ad-hoc edits to a database row. The tradeoff is retrieval speed: pulling from Git is slower than a database read, but for system prompts loaded once per session and cached, the audit trail is worth the latency.
BYOK vs. Shared Key: A Margin Analysis at Scale
Let’s model this with real numbers. Assume a “Pro” plan priced at $99/month.
| Cost Factor | Shared Key Model (You Pay) | BYOK Model (Customer Pays) |
|---|---|---|
| LLM API cost (5M tokens/mo @ $0.50/1K, per customer) | $2,500 | $0 |
| Infrastructure, total (Redis, Vercel, LiteLLM proxy) | ~$300 | ~$300 |
| Support & operations, total | ~$500 (high: cost alerts, billing disputes) | ~$200 |
| Total monthly cost for 100 customers | ~$250,800 | ~$500 |
| Monthly revenue (100 customers @ $99) | $9,900 | $9,900 |
| Gross margin | ~ -2400% (catastrophic loss) | ~95% |
The shared key model is only viable if you charge by usage (per token) with a hefty markup and have perfect cost controls. BYOK turns a variable cost nightmare into a predictable SaaS business.
Legal: The Three Clauses Your ToS Must Have for BYOK
- API Key Liability & Acceptable Use: “Customer is solely responsible for the compliance of all Content generated via their provided API keys with the terms of the underlying AI Provider (e.g., OpenAI Usage Policies). Customer indemnifies [Your SaaS] against violations arising from their API key usage.”
- Data Processing Agreement (DPA): Explicitly state you act as a processor on their behalf concerning the AI-generated content. The customer is the controller. This is crucial for GDPR/CPRA.
- Suspension for Abuse: “We may suspend service if activity via your API key threatens the stability of our platform or violates our Acceptable Use Policy, with notice where feasible.”
Consult a lawyer. This is not optional.
Next Steps: From Architecture to Implementation
Your path forward is clear:
- Refactor Key Storage: Implement the envelope encryption pattern tonight. Move keys out of your .env file and into a tenant_credentials table.
- Deploy LiteLLM Proxy: Run the LiteLLM proxy in a Docker container. It’s the workhorse that will handle model routing, fallbacks, and per-request key injection with minimal overhead.
- Implement the Middleware: Build the authentication and key-decryption middleware. This is the core of your tenant isolation.
- Wire Up Telemetry: Connect LiteLLM callbacks to Redis. Start tracking usage per tenant immediately, even if you don’t yet bill on it. The data is invaluable.
- Create the Onboarding Flow: Build a secure UI where a tenant can paste their OpenAI key. This is their first act of trust—make it feel like a bank vault.
- Version Your Prompts: Create a Git repo for your system prompts. Use a CI/CD pipeline to run an evaluation suite against a golden dataset before promoting to production. This is how you catch prompt regressions before an update ships them.
You’re no longer selling AI tokens. You’re selling a branded, reliable, intelligent workspace. The API key is their problem. The experience is your product. Go build the platform.