Build an LLM Fallback Chain: Multi-Provider Reliability Pattern 2026

Build a production-ready LLM fallback chain across OpenAI, Anthropic, and Groq using Python 3.12. Prevent outages with automatic provider switching and retry logic.

Problem: Your LLM Goes Down and Takes Your App With It

An LLM fallback chain is the production pattern that keeps your AI app alive when OpenAI throws a 503, Anthropic rate-limits you at 2 AM, or Groq's free tier hits its daily cap.

Single-provider apps fail silently or loudly — either way, users churn. A fallback chain routes each request through an ordered list of providers, tries the next one on failure, and returns the first successful response. No change to your API contract, no manual intervention.

You'll learn:

  • How to wire a priority-ordered fallback chain across OpenAI, Anthropic, and Groq
  • How to classify errors (rate limit vs. hard failure) to skip or retry the right provider
  • How to add response-time budgets so a slow primary doesn't eat your latency SLA

Time: 20 min | Difficulty: Intermediate


Why LLM APIs Fail in Production

Production LLM traffic hits four failure modes constantly:

  • Rate limits (429) — You exceeded RPM/TPM. The provider is fine; you're just moving too fast.
  • Service outages (503 / 502) — The provider's API is degraded or down.
  • Timeout — The model is overloaded. The request never completes within your budget.
  • Context overflows (400) — Your prompt is too long for the selected model.

Each failure type calls for a different response. A 429 from OpenAI means "wait or switch." A 503 means "skip immediately." A 400 means the same prompt will fail on any model with the same context window — you need a model with a larger context, not just a different provider.

A naive try/except retries everything the same way. A proper fallback chain classifies the error first, then routes accordingly.

[Diagram: multi-provider request routing. Request flow: primary provider → error classifier → fallback queue → first successful response returned]


Solution

Step 1: Install Dependencies

This guide uses LiteLLM as the unified provider interface. The runner below implements its own exponential backoff; tenacity is installed as an alternative if you prefer decorator-based retry logic.

# uv is the recommended package manager for Python 3.12 in 2026
uv add litellm tenacity python-dotenv

Set your provider keys in .env:

OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GROQ_API_KEY=gsk_...

Step 2: Define the Fallback Chain Config

Keep the provider priority list in one place. Changing the order or adding a provider should be a one-line edit.

# config.py
from dataclasses import dataclass, field

@dataclass
class ProviderConfig:
    model: str
    timeout: float          # seconds before we give up on this provider
    max_retries: int = 1    # retries within the same provider before moving on

FALLBACK_CHAIN: list[ProviderConfig] = [
    ProviderConfig(model="openai/gpt-4o-mini",      timeout=8.0,  max_retries=1),
    ProviderConfig(model="anthropic/claude-haiku-4-5-20251001", timeout=10.0, max_retries=1),
    ProviderConfig(model="groq/llama-3.1-8b-instant", timeout=5.0, max_retries=0),
]

The primary is GPT-4o Mini — cheap at $0.15/MTok input and fast. Anthropic Claude Haiku is the first fallback: comparable cost at $0.25/MTok input, with a larger 200K context window. Groq's Llama 3.1 8B is the last resort — free tier, sub-second latency, but a smaller context window and tight rate limits.


Step 3: Build the Error Classifier

# errors.py
import httpx
import litellm

class ErrorKind:
    RATE_LIMIT   = "rate_limit"    # 429 — retry after delay or skip
    OVERLOAD     = "overload"      # 503/502 — skip immediately
    TIMEOUT      = "timeout"       # no response in budget — skip
    CONTEXT      = "context"       # 400 context length — skip, need bigger model
    UNKNOWN      = "unknown"       # anything else — skip


def classify(exc: Exception) -> str:
    if isinstance(exc, litellm.RateLimitError):
        return ErrorKind.RATE_LIMIT
    if isinstance(exc, litellm.ServiceUnavailableError):
        return ErrorKind.OVERLOAD
    if isinstance(exc, (litellm.Timeout, TimeoutError)):
        return ErrorKind.TIMEOUT
    if isinstance(exc, litellm.ContextWindowExceededError):
        return ErrorKind.CONTEXT
    # Catch raw HTTP errors from provider SDKs
    if isinstance(exc, httpx.TimeoutException):
        return ErrorKind.TIMEOUT
    return ErrorKind.UNKNOWN

Step 4: Write the Fallback Chain Runner

# fallback.py
import asyncio
import logging
from litellm import acompletion

from config import FALLBACK_CHAIN, ProviderConfig
from errors import classify, ErrorKind

logger = logging.getLogger(__name__)


async def _call_provider(cfg: ProviderConfig, messages: list[dict]) -> str:
    """Single provider attempt — raises on any failure."""
    response = await asyncio.wait_for(
        acompletion(model=cfg.model, messages=messages),
        timeout=cfg.timeout,           # hard wall clock budget per provider
    )
    return response.choices[0].message.content


async def run_fallback_chain(messages: list[dict]) -> str:
    """
    Try each provider in order. Return the first successful response.
    Raises RuntimeError only if every provider in the chain fails.
    """
    last_exc: Exception | None = None

    for cfg in FALLBACK_CHAIN:
        attempts = cfg.max_retries + 1   # retries are extra attempts, not total

        for attempt in range(attempts):
            try:
                content = await _call_provider(cfg, messages)
                if attempt > 0 or cfg != FALLBACK_CHAIN[0]:
                    logger.warning("Served by fallback: %s (attempt %d)", cfg.model, attempt + 1)
                return content

            except Exception as exc:
                kind = classify(exc)
                last_exc = exc

                # Context overflow: this model can never handle the prompt — skip chain entry
                if kind == ErrorKind.CONTEXT:
                    logger.info("Context overflow on %s — skipping to next provider", cfg.model)
                    break

                # Rate limit with retries remaining: wait, then retry same provider
                if kind == ErrorKind.RATE_LIMIT and attempt < cfg.max_retries:
                    wait = 2 ** attempt        # 1s, 2s, 4s…
                    logger.info("Rate limited on %s — waiting %ds", cfg.model, wait)
                    await asyncio.sleep(wait)
                    continue

                # Any other error, or rate limit with no retries left: move to next provider
                logger.warning("Provider %s failed (%s): %s", cfg.model, kind, exc)
                break

    raise RuntimeError(
        f"All providers in fallback chain exhausted. Last error: {last_exc}"
    ) from last_exc

The critical design decision: break exits the inner retry loop and advances to the next provider. continue retries the same provider after a delay. Context overflows always break — the prompt is the problem, not the provider's availability.


Step 5: Wire It Into Your App

# main.py
import asyncio
import os
from dotenv import load_dotenv
from fallback import run_fallback_chain

load_dotenv()

async def main():
    messages = [
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user",   "content": "Explain exponential backoff in two sentences."},
    ]

    try:
        reply = await run_fallback_chain(messages)
        print(reply)
    except RuntimeError as e:
        # All providers failed — log, alert, return graceful error to user
        print(f"Service unavailable: {e}")

asyncio.run(main())

Expected output:

Exponential backoff delays retries by doubling the wait time after each failure,
reducing load on an overloaded service. It prevents thundering herd problems
by spreading retry attempts over time rather than slamming the API all at once.

If it fails:

  • AuthenticationError → Check .env is loaded and your key is not expired
  • ModuleNotFoundError: litellm → Run uv add litellm and confirm your venv is active
  • RuntimeError: All providers exhausted → All three APIs are failing; add a fourth provider or cache last-known-good responses

In production, you want to know which provider served each request and how often you're hitting fallbacks. A simple counter tells you when your primary is degraded before your users notice.

# metrics.py — drop-in replacement calls; wire to your existing telemetry
from collections import defaultdict

_hits: dict[str, int] = defaultdict(int)
_errors: dict[str, int] = defaultdict(int)

def record_hit(model: str):
    _hits[model] += 1

def record_error(model: str, kind: str):
    _errors[f"{model}:{kind}"] += 1

def report() -> dict:
    return {"hits": dict(_hits), "errors": dict(_errors)}

Add record_hit(cfg.model) on success and record_error(cfg.model, kind) on each caught exception in run_fallback_chain. Ship these counters to CloudWatch or Datadog. Set an alert when fallback_hit_rate > 5% on any 5-minute window — that's your early warning that the primary is degrading.
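Here is that wiring in miniature — the metrics helpers inlined and the chain walk simulated so the example runs standalone; the outcome table is invented for the demo:

```python
from collections import defaultdict

_hits: dict[str, int] = defaultdict(int)
_errors: dict[str, int] = defaultdict(int)

def record_hit(model: str) -> None:
    _hits[model] += 1

def record_error(model: str, kind: str) -> None:
    _errors[f"{model}:{kind}"] += 1

def report() -> dict:
    return {"hits": dict(_hits), "errors": dict(_errors)}

# Simulated chain walk: the primary rate-limits, the first fallback serves.
chain = ["openai/gpt-4o-mini", "anthropic/claude-haiku-4-5-20251001"]
outcomes = {
    "openai/gpt-4o-mini": ("error", "rate_limit"),
    "anthropic/claude-haiku-4-5-20251001": ("hit", None),
}

for model in chain:
    status, kind = outcomes[model]
    if status == "error":
        record_error(model, kind)  # mirrors the except branch in run_fallback_chain
    else:
        record_hit(model)          # mirrors the success path
        break

print(report())
# {'hits': {'anthropic/claude-haiku-4-5-20251001': 1},
#  'errors': {'openai/gpt-4o-mini:rate_limit': 1}}
```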


Verification

python main.py

To force a fallback, temporarily set OPENAI_API_KEY=invalid and rerun. You should see:

WARNING  Provider openai/gpt-4o-mini failed (unknown): AuthenticationError...
WARNING  Served by fallback: anthropic/claude-haiku-4-5-20251001 (attempt 1)
Exponential backoff delays retries...

The response comes through cleanly despite the broken primary.


Provider Comparison for Fallback Chains

Provider               | Best role           | Context window | Pricing (input)        | Timeout sweet spot
OpenAI GPT-4o Mini     | Primary             | 128K           | $0.15/MTok             | 8s
Anthropic Claude Haiku | First fallback      | 200K           | $0.25/MTok             | 10s
Groq Llama 3.1 8B      | Last resort         | 128K           | Free tier / $0.05/MTok | 5s
Google Gemini Flash    | Alternative primary | 1M             | $0.075/MTok            | 8s
All USD pricing reflects public API tiers as of March 2026. Groq's free tier is capped at 30 RPM and 14,400 requests/day — adequate as a safety net, not a primary.

Choose a 2-provider chain if you want simplicity and your traffic is under 1M tokens/day. Choose a 3-provider chain if you serve production traffic around the clock and can't afford coverage gaps during provider maintenance windows.
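For the simpler option, a trimmed two-provider chain looks like this — the `ProviderConfig` dataclass from Step 2 is redefined here so the snippet stands alone:

```python
from dataclasses import dataclass

@dataclass
class ProviderConfig:
    model: str
    timeout: float          # seconds before we give up on this provider
    max_retries: int = 1    # retries within the same provider before moving on

# Two providers: one paid primary, one free-tier safety net.
FALLBACK_CHAIN: list[ProviderConfig] = [
    ProviderConfig(model="openai/gpt-4o-mini",        timeout=8.0, max_retries=1),
    ProviderConfig(model="groq/llama-3.1-8b-instant", timeout=5.0, max_retries=0),
]

print([cfg.model for cfg in FALLBACK_CHAIN])
# ['openai/gpt-4o-mini', 'groq/llama-3.1-8b-instant']
```

The runner from Step 4 works unchanged; only the list shrinks.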


What You Learned

  • Error classification drives routing — treating all failures identically is the most common mistake in naive retry implementations
  • asyncio.wait_for() gives you a hard per-provider timeout that litellm's built-in timeout doesn't always enforce reliably across provider SDKs
  • Context overflow errors should skip the entire chain entry — a larger-context model further down the chain can still serve the request
  • Fallback hit rate is the metric to watch; alert at 5% to catch primary degradation before it becomes an outage

Tested on Python 3.12.3, LiteLLM 1.35, macOS Sequoia 15 & Ubuntu 24.04


FAQ

Q: Does this pattern work with streaming responses? A: Yes, but asyncio.wait_for() wraps the entire stream. For streaming, set a timeout on time-to-first-token using a separate asyncio.Task that cancels if no chunk arrives within your budget (typically 3–5s).
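A sketch of that time-to-first-token guard, demonstrated against a fake async stream — `first_chunk_within` and `fake_stream` are illustrative names, not LiteLLM API:

```python
import asyncio

async def first_chunk_within(stream, budget: float):
    """Await the first chunk; fail fast if it misses the TTFT budget."""
    it = stream.__aiter__()
    try:
        first = await asyncio.wait_for(it.__anext__(), timeout=budget)
    except asyncio.TimeoutError:
        raise TimeoutError("no chunk within time-to-first-token budget")

    async def rest():
        yield first                # replay the chunk we already awaited
        async for chunk in it:
            yield chunk            # remaining chunks pass through untimed
    return rest()

async def fake_stream():
    # Stand-in for a provider's streaming response
    for part in ["Hello", ", ", "world"]:
        await asyncio.sleep(0.01)
        yield part

async def main() -> str:
    stream = await first_chunk_within(fake_stream(), budget=1.0)
    return "".join([chunk async for chunk in stream])

print(asyncio.run(main()))  # Hello, world
```

Only the first chunk is held to the budget; once the stream starts flowing, it is passed through untouched.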

Q: What is the difference between max_retries here and LiteLLM's built-in retry? A: LiteLLM's built-in retry retries the same model with the same error handling for all error types. The classifier here lets you retry rate limits (worth waiting) but skip immediately on service unavailability (not worth waiting) — they serve different failure modes.

Q: Minimum Python version for this pattern? A: Python 3.11+, which introduced ExceptionGroup and tightened asyncio cancellation semantics; Python 3.12 is recommended and is what this guide was tested on.

Q: Can I use this with a synchronous FastAPI endpoint? A: Yes. Wrap run_fallback_chain in asyncio.run() if you're outside an async context, or declare your FastAPI route as async def and await it directly — the latter is strongly preferred for throughput.