Problem: Your LLM Goes Down and Takes Your App With It
An LLM fallback chain is the production pattern that keeps your AI app alive when OpenAI throws a 503, Anthropic rate-limits you at 2 AM, or Groq's free tier hits its daily cap.
Single-provider apps fail silently or loudly — either way, users churn. A fallback chain routes each request through an ordered list of providers, tries the next one on failure, and returns the first successful response. No change to your API contract, no manual intervention.
You'll learn:
- How to wire a priority-ordered fallback chain across OpenAI, Anthropic, and Groq
- How to classify errors (rate limit vs. hard failure) to skip or retry the right provider
- How to add response-time budgets so a slow primary doesn't eat your latency SLA
Time: 20 min | Difficulty: Intermediate
Why LLM APIs Fail in Production
Production LLM traffic hits four failure modes constantly:
- Rate limits (429) — You exceeded RPM/TPM. The provider is fine; you're just moving too fast.
- Service outages (503 / 502) — The provider's API is degraded or down.
- Timeout — The model is overloaded. The request never completes within your budget.
- Context overflows (400) — Your prompt is too long for the selected model.
Each failure type calls for a different response. A 429 from OpenAI means "wait or switch." A 503 means "skip immediately." A 400 means the same prompt will fail on any model with the same context window — you need a model with a larger context, not just a different provider.
A naive try/except retries everything the same way. A proper fallback chain classifies the error first, then routes accordingly.
Request flow: primary provider → error classifier → fallback queue → first successful response returned
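The classify-then-route idea can be sketched with a plain status-code table. This is a simplified stand-in for the exception-based classifier built in Step 3 (`route_action` is an illustrative name, not a library API):

```python
# Simplified sketch: route on HTTP status alone. Real code should
# classify exception types instead (timeouts never return a status).
def route_action(status: int) -> str:
    if status == 429:
        return "retry_after_delay"       # rate limit: provider is healthy
    if status in (502, 503):
        return "skip_provider"           # outage: waiting won't help
    if status == 400:
        return "skip_to_larger_context"  # prompt exceeds this model's window
    return "skip_provider"               # unknown: don't burn the latency budget
```

The point is that a 429 and a 503 get different treatment, which is the core of the pattern.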
Solution
Step 1: Install Dependencies
This guide uses LiteLLM as the unified provider interface. The retry backoff is implemented inline with asyncio; tenacity is installed in case you prefer its declarative retry decorators.
```bash
# uv shown here; any Python package manager works
uv add litellm tenacity python-dotenv
```
Set your provider keys in .env:
```bash
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GROQ_API_KEY=gsk_...
```
Step 2: Define the Fallback Chain Config
Keep the provider priority list in one place. Changing the order or adding a provider should be a one-line edit.
```python
# config.py
from dataclasses import dataclass


@dataclass
class ProviderConfig:
    model: str
    timeout: float        # seconds before we give up on this provider
    max_retries: int = 1  # retries within the same provider before moving on


FALLBACK_CHAIN: list[ProviderConfig] = [
    ProviderConfig(model="openai/gpt-4o-mini", timeout=8.0, max_retries=1),
    ProviderConfig(model="anthropic/claude-haiku-4-5-20251001", timeout=10.0, max_retries=1),
    ProviderConfig(model="groq/llama-3.1-8b-instant", timeout=5.0, max_retries=0),
]
```
The primary is GPT-4o Mini — cheap, fast, and broadly available through OpenAI's pay-as-you-go API. Anthropic Claude Haiku is the first fallback: still budget-tier pricing, with a larger context window. Groq's Llama 3.1 8B is the last resort — free tier and sub-second latency, but daily request caps that rule it out as a primary.
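If you want to reorder the chain without a deploy, the list can also be parsed from an environment variable. A minimal sketch — the spec format and the `chain_from_spec` helper are illustrative, not part of LiteLLM:

```python
from dataclasses import dataclass


@dataclass
class ProviderConfig:
    model: str
    timeout: float
    max_retries: int = 1


def chain_from_spec(spec: str) -> list[ProviderConfig]:
    """Parse 'model:timeout,model:timeout' into an ordered chain.
    rsplit keeps provider prefixes like 'openai/...' intact."""
    chain = []
    for entry in spec.split(","):
        model, timeout = entry.rsplit(":", 1)
        chain.append(ProviderConfig(model=model.strip(), timeout=float(timeout)))
    return chain
```

Usage would look like `chain_from_spec("openai/gpt-4o-mini:8,groq/llama-3.1-8b-instant:5")`, read from an env var of your choosing.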
Step 3: Build the Error Classifier
```python
# errors.py
import httpx
import litellm


class ErrorKind:
    RATE_LIMIT = "rate_limit"  # 429 — retry after delay or skip
    OVERLOAD = "overload"      # 503/502 — skip immediately
    TIMEOUT = "timeout"        # no response in budget — skip
    CONTEXT = "context"        # 400 context length — skip, need bigger model
    UNKNOWN = "unknown"        # anything else — skip


def classify(exc: Exception) -> str:
    if isinstance(exc, litellm.RateLimitError):
        return ErrorKind.RATE_LIMIT
    if isinstance(exc, litellm.ServiceUnavailableError):
        return ErrorKind.OVERLOAD
    if isinstance(exc, (litellm.Timeout, TimeoutError)):
        return ErrorKind.TIMEOUT
    if isinstance(exc, litellm.ContextWindowExceededError):
        return ErrorKind.CONTEXT
    # Catch raw HTTP errors from provider SDKs
    if isinstance(exc, httpx.TimeoutException):
        return ErrorKind.TIMEOUT
    return ErrorKind.UNKNOWN
```
Step 4: Write the Fallback Chain Runner
```python
# fallback.py
import asyncio
import logging

from litellm import acompletion

from config import FALLBACK_CHAIN, ProviderConfig
from errors import ErrorKind, classify

logger = logging.getLogger(__name__)


async def _call_provider(cfg: ProviderConfig, messages: list[dict]) -> str:
    """Single provider attempt — raises on any failure."""
    response = await asyncio.wait_for(
        acompletion(model=cfg.model, messages=messages),
        timeout=cfg.timeout,  # hard wall-clock budget per provider
    )
    return response.choices[0].message.content


async def run_fallback_chain(messages: list[dict]) -> str:
    """
    Try each provider in order. Return the first successful response.
    Raises RuntimeError only if every provider in the chain fails.
    """
    last_exc: Exception | None = None
    for cfg in FALLBACK_CHAIN:
        attempts = cfg.max_retries + 1  # retries are extra attempts, not total
        for attempt in range(attempts):
            try:
                content = await _call_provider(cfg, messages)
                if attempt > 0 or cfg is not FALLBACK_CHAIN[0]:
                    logger.warning("Served by fallback: %s (attempt %d)", cfg.model, attempt + 1)
                return content
            except Exception as exc:
                kind = classify(exc)
                last_exc = exc
                # Context overflow: this model can never handle the prompt — skip chain entry
                if kind == ErrorKind.CONTEXT:
                    logger.info("Context overflow on %s — skipping to next provider", cfg.model)
                    break
                # Rate limit with retries remaining: wait, then retry same provider
                if kind == ErrorKind.RATE_LIMIT and attempt < cfg.max_retries:
                    wait = 2 ** attempt  # 1s, 2s, 4s…
                    logger.info("Rate limited on %s — waiting %ds", cfg.model, wait)
                    await asyncio.sleep(wait)
                    continue
                # Any other error, or rate limit with no retries left: next provider
                logger.warning("Provider %s failed (%s): %s", cfg.model, kind, exc)
                break
    raise RuntimeError(
        f"All providers in fallback chain exhausted. Last error: {last_exc}"
    ) from last_exc
```
The critical design decision: break exits the inner retry loop and advances to the next provider. continue retries the same provider after a delay. Context overflows always break — the prompt is the problem, not the provider's availability.
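To see the break/continue semantics in isolation, here is a stripped-down synchronous version with stub providers — outcome strings stand in for real exceptions, and all names are illustrative:

```python
def run_chain(providers: list[tuple[str, list[str]]]) -> str:
    """Each provider is (name, outcomes), one outcome per allowed attempt.
    'ok' succeeds, 'rate_limit' retries the same provider while budget
    remains, and anything else skips straight to the next provider."""
    for name, outcomes in providers:
        for attempt, outcome in enumerate(outcomes):
            if outcome == "ok":
                return name
            if outcome == "rate_limit" and attempt < len(outcomes) - 1:
                continue  # budget left: retry the same provider
            break         # hard failure (or retries exhausted): next provider
    raise RuntimeError("chain exhausted")

# run_chain([("openai", ["overload"]), ("anthropic", ["rate_limit", "ok"])])
# → "anthropic": openai is skipped immediately, anthropic is retried once
```

The same two keywords drive the real runner: `continue` stays on the provider, `break` abandons it.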
Step 5: Wire It Into Your App
```python
# main.py
import asyncio

from dotenv import load_dotenv

from fallback import run_fallback_chain

load_dotenv()


async def main():
    messages = [
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "Explain exponential backoff in two sentences."},
    ]
    try:
        reply = await run_fallback_chain(messages)
        print(reply)
    except RuntimeError as e:
        # All providers failed — log, alert, return a graceful error to the user
        print(f"Service unavailable: {e}")


asyncio.run(main())
```
Expected output:
```
Exponential backoff delays retries by doubling the wait time after each failure,
reducing load on an overloaded service. It prevents thundering herd problems
by spreading retry attempts over time rather than slamming the API all at once.
```
If it fails:
- `AuthenticationError` → Check that `.env` is loaded and your key is not expired
- `ModuleNotFoundError: litellm` → Run `uv add litellm` and confirm your venv is active
- `RuntimeError: All providers exhausted` → All three APIs are failing; add a fourth provider or cache last-known-good responses
Step 6: Add Observability (Optional but Recommended)
In production, you want to know which provider served each request and how often you're hitting fallbacks. A simple counter tells you when your primary is degraded before your users notice.
```python
# metrics.py — drop-in counter calls; wire these to your existing telemetry
from collections import defaultdict

_hits: dict[str, int] = defaultdict(int)
_errors: dict[str, int] = defaultdict(int)


def record_hit(model: str):
    _hits[model] += 1


def record_error(model: str, kind: str):
    _errors[f"{model}:{kind}"] += 1


def report() -> dict:
    return {"hits": dict(_hits), "errors": dict(_errors)}
```
Add record_hit(cfg.model) on success and record_error(cfg.model, kind) on each caught exception in run_fallback_chain. Ship these counters to CloudWatch or Datadog. Set an alert when fallback_hit_rate > 5% on any 5-minute window — that's your early warning that the primary is degrading.
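The alert threshold above can be computed directly from the hit counters. A sketch — `fallback_hit_rate` is an illustrative helper; in practice your metrics backend computes this over the 5-minute window:

```python
def fallback_hit_rate(hits: dict[str, int], primary: str) -> float:
    """Fraction of successful requests served by anything but the primary."""
    total = sum(hits.values())
    if total == 0:
        return 0.0  # no traffic yet: nothing to alert on
    return 1.0 - hits.get(primary, 0) / total

# 95 primary hits and 5 fallback hits → rate 0.05, right at the alert line
```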
Verification
```bash
python main.py
```
To force a fallback, temporarily set OPENAI_API_KEY=invalid and rerun. You should see:
```
WARNING Provider openai/gpt-4o-mini failed (unknown): AuthenticationError...
WARNING Served by fallback: anthropic/claude-haiku-4-5-20251001 (attempt 1)
Exponential backoff delays retries...
```
The response comes through cleanly despite the broken primary.
Provider Comparison for Fallback Chains
| Provider | Best role | Context window | Pricing (input) | Timeout sweet spot |
|---|---|---|---|---|
| OpenAI GPT-4o Mini | Primary | 128K | $0.15/MTok | 8s |
| Anthropic Claude Haiku | First fallback | 200K | $1.00/MTok | 10s |
| Groq Llama 3.1 8B | Last resort | 128K | Free tier / $0.05/MTok | 5s |
| Google Gemini Flash | Alternative primary | 1M | $0.075/MTok | 8s |
All USD pricing reflects public API tiers as of March 2026. Groq's free tier is capped at 30 RPM and 14,400 requests/day — adequate as a safety net, not a primary.
Choose a 2-provider chain if you want simplicity and your traffic is under 1M tokens/day. Choose a 3-provider chain if you serve production traffic around the clock and can't afford gaps when two providers have overlapping incidents or maintenance windows.
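To sanity-check the under-1M-tokens/day threshold, a back-of-envelope cost helper (illustrative only; it uses the input prices from the table and ignores output tokens entirely):

```python
def monthly_input_cost_usd(tokens_per_day: int, price_per_mtok: float) -> float:
    """Rough monthly input-token spend, assuming 30 days/month."""
    return tokens_per_day * 30 / 1_000_000 * price_per_mtok

# 1M tokens/day on GPT-4o Mini at $0.15/MTok ≈ $4.50/month of input tokens
```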
What You Learned
- Error classification drives routing — treating all failures identically is the most common mistake in naive retry implementations
- `asyncio.wait_for()` gives you a hard per-provider timeout that `litellm`'s built-in timeout doesn't always enforce reliably across provider SDKs
- Context overflow errors should skip the chain entry immediately — a larger-context model further down the chain can still serve the request
- Fallback hit rate is the metric to watch; alert at 5% to catch primary degradation before it becomes an outage
Tested on Python 3.12.3, LiteLLM 1.35, macOS Sequoia 15 & Ubuntu 24.04
FAQ
Q: Does this pattern work with streaming responses?
A: Yes, but asyncio.wait_for() wraps the entire stream. For streaming, set a timeout on time-to-first-token using a separate asyncio.Task that cancels if no chunk arrives within your budget (typically 3–5s).
Q: What is the difference between max_retries here and LiteLLM's built-in retry?
A: LiteLLM's built-in retry retries the same model with identical handling for every error type. The classifier here lets you retry rate limits (worth waiting) but skip immediately on service unavailability (not worth waiting) — they address different failure modes.
Q: Minimum Python version for this pattern?
A: Python 3.10+ for the `X | Y` union syntax used in the type hints; Python 3.11+ recommended for asyncio's timeout and exception-handling improvements.
Q: Can I use this with a synchronous FastAPI endpoint?
A: Yes. Wrap run_fallback_chain in asyncio.run() if you're outside an async context, or declare your FastAPI route as async def and await it directly — the latter is strongly preferred for throughput.