Problem: Your LLM Goes Down and Takes Your App With It
An LLM fallback chain is the production pattern that keeps your AI app alive when OpenAI throws a 503, Anthropic rate-limits you at 2 AM, or Groq's free tier hits its daily cap.
Single-provider apps fail silently or loudly — either way, users churn. A fallback chain routes each request through an ordered list of providers, tries the next one on failure, and returns the first successful response. No change to your API contract, no manual intervention.
You'll learn:
- How to wire a priority-ordered fallback chain across OpenAI, Anthropic, and Groq
- How to classify errors (rate limit vs. hard failure) to skip or retry the right provider
- How to add response-time budgets so a slow primary doesn't eat your latency SLA
Time: 20 min | Difficulty: Intermediate
Why LLM APIs Fail in Production
Production LLM traffic hits four failure modes constantly:
- Rate limits (429) — You exceeded RPM/TPM. The provider is fine; you're just moving too fast.
- Service outages (503 / 502) — The provider's API is degraded or down.
- Timeout — The model is overloaded. The request never completes within your budget.
- Context overflows (400) — Your prompt is too long for the selected model.
Each failure type calls for a different response. A 429 from OpenAI means "wait or switch." A 503 means "skip immediately." A 400 means the same prompt will fail on any model with the same context window — you need a model with a larger context, not just a different provider.
A naive try/except retries everything the same way. A proper fallback chain classifies the error first, then routes accordingly.
Request flow: primary provider → error classifier → fallback queue → first successful response returned
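The classify-then-route idea can be sketched with a plain status-code table. This is a simplified stand-in for the exception-based classifier built in Step 3 (`route_action` is an illustrative name, not a library API):

```python
# Simplified sketch: route on HTTP status alone. Real code should
# classify exception types instead (timeouts never return a status).
def route_action(status: int) -> str:
    if status == 429:
        return "retry_after_delay"       # rate limit: provider is healthy
    if status in (502, 503):
        return "skip_provider"           # outage: waiting won't help
    if status == 400:
        return "skip_to_larger_context"  # prompt exceeds this model's window
    return "skip_provider"               # unknown: don't burn the latency budget
```

The point is that a 429 and a 503 get different treatment, which is the core of the pattern.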
Solution
Step 1: Install Dependencies
This guide uses LiteLLM as the unified provider interface. The retry backoff is implemented inline with asyncio; tenacity is installed in case you prefer its declarative retry decorators.
```bash
# uv shown here; any Python package manager works
uv add litellm tenacity python-dotenv
```
Set your provider keys in .env:
```bash
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GROQ_API_KEY=gsk_...
```
Step 2: Define the Fallback Chain Config
Keep the provider priority list in one place. Changing the order or adding a provider should be a one-line edit.
```python
# config.py
from dataclasses import dataclass


@dataclass
class ProviderConfig:
    model: str
    timeout: float        # seconds before we give up on this provider
    max_retries: int = 1  # retries within the same provider before moving on


FALLBACK_CHAIN: list[ProviderConfig] = [
    ProviderConfig(model="openai/gpt-4o-mini", timeout=8.0, max_retries=1),
    ProviderConfig(model="anthropic/claude-haiku-4-5-20251001", timeout=10.0, max_retries=1),
    ProviderConfig(model="groq/llama-3.1-8b-instant", timeout=5.0, max_retries=0),
]
```
The primary is GPT-4o Mini — cheap, fast, and broadly available through OpenAI's pay-as-you-go API. Anthropic Claude Haiku is the first fallback: still budget-tier pricing, with a larger context window. Groq's Llama 3.1 8B is the last resort — free tier and sub-second latency, but daily request caps that rule it out as a primary.
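If you want to reorder the chain without a deploy, the list can also be parsed from an environment variable. A minimal sketch — the spec format and the `chain_from_spec` helper are illustrative, not part of LiteLLM:

```python
from dataclasses import dataclass


@dataclass
class ProviderConfig:
    model: str
    timeout: float
    max_retries: int = 1


def chain_from_spec(spec: str) -> list[ProviderConfig]:
    """Parse 'model:timeout,model:timeout' into an ordered chain.
    rsplit keeps provider prefixes like 'openai/...' intact."""
    chain = []
    for entry in spec.split(","):
        model, timeout = entry.rsplit(":", 1)
        chain.append(ProviderConfig(model=model.strip(), timeout=float(timeout)))
    return chain
```

Usage would look like `chain_from_spec("openai/gpt-4o-mini:8,groq/llama-3.1-8b-instant:5")`, read from an env var of your choosing.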
Step 3: Build the Error Classifier
```python
# errors.py
import httpx
import litellm


class ErrorKind:
    RATE_LIMIT = "rate_limit"  # 429 — retry after delay or skip
    OVERLOAD = "overload"      # 503/502 — skip immediately
    TIMEOUT = "timeout"        # no response in budget — skip
    CONTEXT = "context"        # 400 context length — skip, need bigger model
    UNKNOWN = "unknown"        # anything else — skip


def classify(exc: Exception) -> str:
    if isinstance(exc, litellm.RateLimitError):
        return ErrorKind.RATE_LIMIT
    if isinstance(exc, litellm.ServiceUnavailableError):
        return ErrorKind.OVERLOAD
    if isinstance(exc, (litellm.Timeout, TimeoutError)):
        return ErrorKind.TIMEOUT
    if isinstance(exc, litellm.ContextWindowExceededError):
        return ErrorKind.CONTEXT
    # Catch raw HTTP errors from provider SDKs
    if isinstance(exc, httpx.TimeoutException):
        return ErrorKind.TIMEOUT
    return ErrorKind.UNKNOWN
```
Step 4: Write the Fallback Chain Runner
```python
# fallback.py
import asyncio
import logging

from litellm import acompletion

from config import FALLBACK_CHAIN, ProviderConfig
from errors import ErrorKind, classify

logger = logging.getLogger(__name__)


async def _call_provider(cfg: ProviderConfig, messages: list[dict]) -> str:
    """Single provider attempt — raises on any failure."""
    response = await asyncio.wait_for(
        acompletion(model=cfg.model, messages=messages),
        timeout=cfg.timeout,  # hard wall-clock budget per provider
    )
    return response.choices[0].message.content


async def run_fallback_chain(messages: list[dict]) -> str:
    """
    Try each provider in order. Return the first successful response.
    Raises RuntimeError only if every provider in the chain fails.
    """
    last_exc: Exception | None = None
    for cfg in FALLBACK_CHAIN:
        attempts = cfg.max_retries + 1  # retries are extra attempts, not total
        for attempt in range(attempts):
            try:
                content = await _call_provider(cfg, messages)
                if attempt > 0 or cfg is not FALLBACK_CHAIN[0]:
                    logger.warning("Served by fallback: %s (attempt %d)", cfg.model, attempt + 1)
                return content
            except Exception as exc:
                kind = classify(exc)
                last_exc = exc
                # Context overflow: this model can never handle the prompt — skip chain entry
                if kind == ErrorKind.CONTEXT:
                    logger.info("Context overflow on %s — skipping to next provider", cfg.model)
                    break
                # Rate limit with retries remaining: wait, then retry same provider
                if kind == ErrorKind.RATE_LIMIT and attempt < cfg.max_retries:
                    wait = 2 ** attempt  # 1s, 2s, 4s…
                    logger.info("Rate limited on %s — waiting %ds", cfg.model, wait)
                    await asyncio.sleep(wait)
                    continue
                # Any other error, or rate limit with no retries left: next provider
                logger.warning("Provider %s failed (%s): %s", cfg.model, kind, exc)
                break
    raise RuntimeError(
        f"All providers in fallback chain exhausted. Last error: {last_exc}"
    ) from last_exc
```
The critical design decision: break exits the inner retry loop and advances to the next provider. continue retries the same provider after a delay. Context overflows always break — the prompt is the problem, not the provider's availability.
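To see the break/continue semantics in isolation, here is a stripped-down synchronous version with stub providers — outcome strings stand in for real exceptions, and all names are illustrative:

```python
def run_chain(providers: list[tuple[str, list[str]]]) -> str:
    """Each provider is (name, outcomes), one outcome per allowed attempt.
    'ok' succeeds, 'rate_limit' retries the same provider while budget
    remains, and anything else skips straight to the next provider."""
    for name, outcomes in providers:
        for attempt, outcome in enumerate(outcomes):
            if outcome == "ok":
                return name
            if outcome == "rate_limit" and attempt < len(outcomes) - 1:
                continue  # budget left: retry the same provider
            break         # hard failure (or retries exhausted): next provider
    raise RuntimeError("chain exhausted")

# run_chain([("openai", ["overload"]), ("anthropic", ["rate_limit", "ok"])])
# → "anthropic": openai is skipped immediately, anthropic is retried once
```

The same two keywords drive the real runner: `continue` stays on the provider, `break` abandons it.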
Step 5: Wire It Into Your App
```python
# main.py
import asyncio

from dotenv import load_dotenv

from fallback import run_fallback_chain

load_dotenv()


async def main():
    messages = [
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "Explain exponential backoff in two sentences."},
    ]
    try:
        reply = await run_fallback_chain(messages)
        print(reply)
    except RuntimeError as e:
        # All providers failed — log, alert, return a graceful error to the user
        print(f"Service unavailable: {e}")


asyncio.run(main())
```
Expected output:
```
Exponential backoff delays retries by doubling the wait time after each failure,
reducing load on an overloaded service. It prevents thundering herd problems
by spreading retry attempts over time rather than slamming the API all at once.
```
If it fails:
- `AuthenticationError` → Check that `.env` is loaded and your key is not expired
- `ModuleNotFoundError: litellm` → Run `uv add litellm` and confirm your venv is active
- `RuntimeError: All providers exhausted` → All three APIs are failing; add a fourth provider or cache last-known-good responses
Step 6: Add Observability (Optional but Recommended)
In production, you want to know which provider served each request and how often you're hitting fallbacks. A simple counter tells you when your primary is degraded before your users notice.
```python
# metrics.py — drop-in counter calls; wire these to your existing telemetry
from collections import defaultdict

_hits: dict[str, int] = defaultdict(int)
_errors: dict[str, int] = defaultdict(int)


def record_hit(model: str):
    _hits[model] += 1


def record_error(model: str, kind: str):
    _errors[f"{model}:{kind}"] += 1


def report() -> dict:
    return {"hits": dict(_hits), "errors": dict(_errors)}
```
Add record_hit(cfg.model) on success and record_error(cfg.model, kind) on each caught exception in run_fallback_chain. Ship these counters to CloudWatch or Datadog. Set an alert when fallback_hit_rate > 5% on any 5-minute window — that's your early warning that the primary is degrading.
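The alert threshold above can be computed directly from the hit counters. A sketch — `fallback_hit_rate` is an illustrative helper; in practice your metrics backend computes this over the 5-minute window:

```python
def fallback_hit_rate(hits: dict[str, int], primary: str) -> float:
    """Fraction of successful requests served by anything but the primary."""
    total = sum(hits.values())
    if total == 0:
        return 0.0  # no traffic yet: nothing to alert on
    return 1.0 - hits.get(primary, 0) / total

# 95 primary hits and 5 fallback hits → rate 0.05, right at the alert line
```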
Verification
```bash
python main.py
```
To force a fallback, temporarily set OPENAI_API_KEY=invalid and rerun. You should see:
```
WARNING Provider openai/gpt-4o-mini failed (unknown): AuthenticationError...
WARNING Served by fallback: anthropic/claude-haiku-4-5-20251001 (attempt 1)
Exponential backoff delays retries...
```
The response comes through cleanly despite the broken primary.
Provider Comparison for Fallback Chains
| Provider | Best role | Context window | Pricing (input) | Timeout sweet spot |
|---|---|---|---|---|
| OpenAI GPT-4o Mini | Primary | 128K | $0.15/MTok | 8s |
| Anthropic Claude Haiku | First fallback | 200K | $1.00/MTok | 10s |
| Groq Llama 3.1 8B | Last resort | 128K | Free tier / $0.05/MTok | 5s |
| Google Gemini Flash | Alternative primary | 1M | $0.075/MTok | 8s |
All USD pricing reflects public API tiers as of March 2026. Groq's free tier is capped at 30 RPM and 14,400 requests/day — adequate as a safety net, not a primary.
Choose a 2-provider chain if you want simplicity and your traffic is under 1M tokens/day. Choose a 3-provider chain if you serve production traffic around the clock and can't afford gaps when two providers have overlapping incidents or maintenance windows.
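To sanity-check the under-1M-tokens/day threshold, a back-of-envelope cost helper (illustrative only; it uses the input prices from the table and ignores output tokens entirely):

```python
def monthly_input_cost_usd(tokens_per_day: int, price_per_mtok: float) -> float:
    """Rough monthly input-token spend, assuming 30 days/month."""
    return tokens_per_day * 30 / 1_000_000 * price_per_mtok

# 1M tokens/day on GPT-4o Mini at $0.15/MTok ≈ $4.50/month of input tokens
```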
What You Learned
- Error classification drives routing — treating all failures identically is the most common mistake in naive retry implementations
- `asyncio.wait_for()` gives you a hard per-provider timeout that `litellm`'s built-in timeout doesn't always enforce reliably across provider SDKs
- Context overflow errors should skip the chain entry immediately — a larger-context model further down the chain can still serve the request
- Fallback hit rate is the metric to watch; alert at 5% to catch primary degradation before it becomes an outage
Tested on Python 3.12.3, LiteLLM 1.35, macOS Sequoia 15 & Ubuntu 24.04
FAQ
Q: Does this pattern work with streaming responses?
A: Yes, but asyncio.wait_for() wraps the entire stream. For streaming, set a timeout on time-to-first-token using a separate asyncio.Task that cancels if no chunk arrives within your budget (typically 3–5s).
Q: What is the difference between max_retries here and LiteLLM's built-in retry?
A: LiteLLM's built-in retry retries the same model with identical handling for every error type. The classifier here lets you retry rate limits (worth waiting) but skip immediately on service unavailability (not worth waiting) — they address different failure modes.
Q: Minimum Python version for this pattern?
A: Python 3.10+ for the `X | Y` union syntax used in the type hints; Python 3.11+ recommended for asyncio's timeout and exception-handling improvements.
Q: Can I use this with a synchronous FastAPI endpoint?
A: Yes. Wrap run_fallback_chain in asyncio.run() if you're outside an async context, or declare your FastAPI route as async def and await it directly — the latter is strongly preferred for throughput.