OpenAI's status page logged a string of incidents in Q1 2025, some lasting the better part of an hour. Every minute your app was down during those incidents cost you users. A circuit breaker with an Ollama fallback means your users never notice. You're not just adding a retry loop—you're building a system that fails gracefully instead of catastrophically. When the primary LLM API starts throwing 429s or 503s, your service shouldn't join the dogpile, hammering a dying endpoint until your own connection pools are exhausted and your users see a generic 500. That's a cascade failure, and containing it is exactly what the circuit breaker pattern (made famous by Netflix's Hystrix) exists for in microservice architectures. Let's build one.
## What Circuit Breakers Actually Do (And Why Retries Alone Make It Worse)
A naive retry is an act of violence against a struggling system. You get a 429 Too Many Requests from OpenAI, your code sleeps for two seconds, and tries again. And again. Meanwhile, 100 other user requests are doing the exact same thing. You’ve just amplified the load on the failing provider, guaranteed your users will wait for multiple timeouts, and likely exhausted your own database connections while these pending requests hold resources open.
A circuit breaker is a stateful proxy for your external calls. It watches for failures. When failures exceed a sensible threshold, it trips open. In the Open state, it fails immediately without calling the external service at all, returning a fallback response or error. This gives the downstream service time to recover and protects your system’s resources. After a configured reset timeout, it moves to a Half-Open state to test the waters with a single request before fully resuming.
Think of it like a real-world circuit breaker. When your toaster shorts out (the API starts timing out), the breaker flips (state goes to Open). You don’t keep jamming the toaster handle down (retrying). You go make toast in the oven (use your fallback) while an electrician (the health check) figures out the problem.
## Implementing the Circuit Breaker Pattern in Python with tenacity
We’ll use tenacity, a robust retrying library that, ironically, gives us the tools to stop retrying intelligently. Combined with a state machine, it forms a complete circuit breaker. First, the core pattern.
```python
import logging
import time
from enum import Enum
from typing import Any, Callable, Optional

logger = logging.getLogger(__name__)


class CircuitBreakerOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitState(Enum):
    CLOSED = "CLOSED"        # Operating normally, requests pass through.
    OPEN = "OPEN"            # Tripped: fail immediately, no external calls.
    HALF_OPEN = "HALF_OPEN"  # Testing if the problem is resolved.


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        reset_timeout: float = 60.0,
        expected_exception: tuple = (Exception,),
        name: str = "default",
    ):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.expected_exception = expected_exception
        self.name = name
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.last_failure_time: Optional[float] = None
        self._open_until: Optional[float] = None

    def call(self, func: Callable, *args, **kwargs) -> Any:
        """Execute the function within the circuit breaker logic."""
        # 1. If the circuit is OPEN, fail fast until the reset timeout passes.
        if self.state == CircuitState.OPEN:
            if time.monotonic() < self._open_until:
                logger.warning(f"Circuit breaker '{self.name}' is OPEN. Failing fast.")
                raise CircuitBreakerOpenError(f"CircuitBreakerOpen: {self.name}")
            logger.info(f"Circuit breaker '{self.name}' reset timeout elapsed. Moving to HALF_OPEN.")
            self.state = CircuitState.HALF_OPEN

        # 2. Attempt the call.
        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except self.expected_exception as e:
            self._on_failure(e)
            raise

    def _on_success(self):
        """Handle a successful call: reset counters, close the circuit."""
        if self.state == CircuitState.HALF_OPEN:
            logger.info(f"Circuit breaker '{self.name}' probe succeeded. Moving to CLOSED.")
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.last_failure_time = None
        self._open_until = None

    def _on_failure(self, exception: Exception):
        """Handle a failed call: count it, trip the breaker if warranted."""
        self.failure_count += 1
        self.last_failure_time = time.monotonic()
        logger.warning(
            f"Circuit breaker '{self.name}' failure "
            f"{self.failure_count}/{self.failure_threshold}: {exception}"
        )
        if self.state == CircuitState.HALF_OPEN:
            logger.info(f"Circuit breaker '{self.name}' probe failed. Returning to OPEN.")
            self.state = CircuitState.OPEN
            self._open_until = time.monotonic() + self.reset_timeout
        elif self.state == CircuitState.CLOSED and self.failure_count >= self.failure_threshold:
            logger.error(f"Circuit breaker '{self.name}' tripped to OPEN. Failing fast for {self.reset_timeout}s.")
            self.state = CircuitState.OPEN
            self._open_until = time.monotonic() + self.reset_timeout
```
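A quick sanity check that the state machine behaves as advertised. This is a condensed, self-contained stand-in for the breaker above (class and names are illustrative, thresholds shrunk so the demo runs in milliseconds):

```python
import time
from enum import Enum

class State(Enum):
    CLOSED = "CLOSED"
    OPEN = "OPEN"
    HALF_OPEN = "HALF_OPEN"

class MiniBreaker:
    """Condensed breaker, just enough to show the transitions."""
    def __init__(self, threshold: int = 3, reset: float = 0.05):
        self.threshold, self.reset = threshold, reset
        self.state, self.failures, self.open_until = State.CLOSED, 0, 0.0

    def call(self, fn):
        if self.state is State.OPEN:
            if time.monotonic() < self.open_until:
                raise RuntimeError("CircuitBreakerOpen")
            self.state = State.HALF_OPEN  # Reset timeout elapsed: allow one probe.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state is State.HALF_OPEN or self.failures >= self.threshold:
                self.state = State.OPEN
                self.open_until = time.monotonic() + self.reset
            raise
        self.state, self.failures = State.CLOSED, 0
        return result

breaker = MiniBreaker(threshold=3, reset=0.05)

def always_fails():
    raise ConnectionError("upstream down")

for _ in range(3):  # Three consecutive failures trip the breaker.
    try:
        breaker.call(always_fails)
    except ConnectionError:
        pass

assert breaker.state is State.OPEN          # Now failing fast, no external calls.
time.sleep(0.06)                            # Wait out the reset timeout.
assert breaker.call(lambda: "pong") == "pong"  # HALF_OPEN probe succeeds.
assert breaker.state is State.CLOSED
```

Failures trip it open, fast-fails protect the upstream, and a single successful probe closes it again.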
This is the brain. It tracks state and makes the go/no-go decision. Now, let’s wrap an actual LLM call with it.
## Three States: Closed, Open, and Half-Open — Configuration Values That Work
The magic is in the configuration. Get these numbers wrong, and you’re just building a fancy retry.
- Closed State (Normal Operation): `failure_threshold=5`, `reset_timeout=60.0`. Five consecutive failures trip the breaker. Why 5? It's high enough to avoid tripping on a brief glitch, but low enough to react before slow, timed-out requests exhaust your connection pools. The 60-second reset gives the provider a full minute to recover.
- Open State (Fail Fast): No configuration beyond the reset timeout. All calls immediately raise `CircuitBreakerOpen`. This is your system's defensive crouch.
- Half-Open State (Probing): After the reset timeout, a single call is allowed. Its success or failure determines the next state. This is critical—you don't want to flood a recovering service.
Here’s how you integrate it with a real LLM client and a task queue. Notice we combine it with tenacity's retry for transient errors within the Closed state.
```python
# llm_service.py
import hashlib
import json
import logging

import openai
import redis
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

from .circuit_breaker import CircuitBreaker

logger = logging.getLogger(__name__)

# Use a shared connection pool -- critical to avoid 'max number of clients reached'.
redis_pool = redis.ConnectionPool(host="localhost", port=6379, max_connections=50)
redis_client = redis.Redis(connection_pool=redis_pool)

openai_client = openai.OpenAI()  # Reads OPENAI_API_KEY from the environment.

openai_circuit = CircuitBreaker(
    failure_threshold=5,
    reset_timeout=60.0,
    expected_exception=(openai.APIError, openai.APITimeoutError, openai.APIConnectionError),
    name="openai_primary",
)


class LLMService:
    def __init__(self):
        self.fallback_chain = ["openai", "anthropic", "ollama"]  # Simplified for example
        self.current_provider_idx = 0

    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=2, max=10),
        retry=retry_if_exception_type((openai.APITimeoutError, openai.APIConnectionError)),
    )
    def _call_openai(self, prompt: str) -> str:
        """Wrapped call with retry for transient network errors."""
        response = openai_client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            timeout=30.0,  # Essential: never wait forever on a dying endpoint.
        )
        return response.choices[0].message.content

    def get_completion(self, prompt: str, user_id: str) -> str:
        """Main entry point with circuit breaker and fallback."""
        # 1. Check for a cached response in Redis first (idempotency pattern).
        # hashlib, not the built-in hash(): hash() is salted per process.
        digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        cache_key = f"llm_res:{user_id}:{digest}"
        cached = redis_client.get(cache_key)
        if cached:
            return json.loads(cached)

        # 2. Attempt the primary provider via the circuit breaker.
        try:
            result = openai_circuit.call(self._call_openai, prompt)
        except Exception as e:
            # Circuit open, retries exhausted, or anything else unexpected:
            # fail over rather than surface a 500 to the user.
            logger.warning(f"Primary provider unavailable ({e}). Proceeding to fallback.")
            return self._activate_fallback(prompt, user_id)

        # Cache the successful result with a 5-minute TTL.
        redis_client.setex(cache_key, 300, json.dumps(result))
        return result

    def _activate_fallback(self, prompt: str, user_id: str) -> str:
        """Move to the next provider in the chain."""
        # Logic to call Anthropic Claude, then local Ollama.
        # For Ollama, you might POST to a local endpoint:
        #   requests.post("http://localhost:11434/api/generate", json={...})
        return "[Fallback] Response from Ollama"
```
## Fallback Chain: OpenAI → Anthropic → Local Ollama
Your fallback isn’t a consolation prize; it’s a strategic asset. The chain should degrade gracefully in cost and latency, but not in user experience.
- Primary (OpenAI GPT-4): Best quality, highest cost, external.
- Secondary (Anthropic Claude): Comparable quality, similar cost, different infrastructure (reduces correlated failure).
- Tertiary (Local Ollama with Llama 3.1): Free, always available, possibly slower or lower quality. This is your keep-the-lights-on tier. When the cloud is on fire, your users still get a response. Route internal tooling and lower-priority features here first during an incident.
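A minimal sketch of that chain, with each provider exposed as a callable so the ordering logic stands alone (the provider functions here are stand-ins, not real clients):

```python
from typing import Callable, List, Tuple

def call_with_fallback(
    prompt: str,
    providers: List[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try each provider in order; return (provider_name, response)."""
    last_error: Exception | None = None
    for name, provider in providers:
        try:
            return name, provider(prompt)
        except Exception as e:  # In production, catch provider-specific errors.
            last_error = e
    raise RuntimeError(f"All providers failed; last error: {last_error}")

# Stand-ins: imagine OpenAI/Anthropic clients and a local Ollama HTTP call.
def openai_call(p: str) -> str:
    raise ConnectionError("503 from OpenAI")

def anthropic_call(p: str) -> str:
    raise ConnectionError("Anthropic overloaded")

def ollama_call(p: str) -> str:
    return f"[ollama] answer to: {p}"

name, answer = call_with_fallback("hello", [
    ("openai", openai_call),
    ("anthropic", anthropic_call),
    ("ollama", ollama_call),
])
assert name == "ollama"  # Both cloud providers down; local tier answers.
```

In a real deployment each callable would wrap its own circuit breaker, so a tripped provider fails in microseconds instead of burning a timeout.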
## Health-Check Recovery: Probing the Primary Provider Before Re-Opening
Don’t let the circuit reset blindly. When the breaker is HALF_OPEN, the probing request should be a canary—a simple, cheap, idempotent call. For LLMs, this could be a call to a small, fast model like gpt-3.5-turbo with a trivial prompt ("Respond with 'OK'"). Success here indicates the API is functional enough to resume traffic.
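One way to structure that canary, sketched with the probe call injected so it is cheap to test (the model name and prompt in the docstring are the assumptions from above, not a fixed API):

```python
from typing import Callable

def probe_primary(probe_call: Callable[[], str]) -> bool:
    """Cheap, idempotent canary: True only if the provider answers sensibly.

    In production, probe_call might hit a small, fast model such as
    gpt-3.5-turbo with the prompt "Respond with 'OK'" and a short timeout.
    """
    try:
        return "OK" in probe_call()
    except Exception:
        # Any failure (timeout, 429, connection error) means: stay OPEN.
        return False

def dead_provider() -> str:
    raise TimeoutError("probe timed out")

assert probe_primary(lambda: "OK") is True
assert probe_primary(dead_provider) is False
```

The breaker's HALF_OPEN branch can consult this before moving to CLOSED, so real user traffic never serves as the guinea pig.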
## Benchmark: System Availability With vs Without Circuit Breaker During Provider Outage
Let’s model a 5-minute OpenAI outage. Assume 100 requests per minute, each timing out after 30 seconds.
| Metric | No Circuit Breaker | With Circuit Breaker & Fallback |
|---|---|---|
| User-visible Errors | ~500 (All requests during outage fail) | <50 (Only requests in the 5-failure window fail) |
| LLM API Calls Made | ~1000 (Retries hammer the endpoint) | ~5 (Stops after threshold) |
| Mean Response Time | >30s (Timeout city) | <2s (Fast fail to fallback) |
| Internal Resource Strain | Catastrophic (Connection pools exhausted) | Minimal (Fail-fast protects pools) |
| Recovery Time | Extended (Your system is also degraded) | Immediate (Fallback active, primary recovers silently) |
The circuit breaker turns a total service blackout into a minor blip. Idempotent retries with exponential backoff all but eliminate duplicate LLM requests, but only when combined with a breaker that prevents the retry storm in the first place.
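Idempotency hinges on a stable request key. Note that Python's built-in `hash()` is salted per process, so the same prompt hashes differently in every worker; use `hashlib` instead. A sketch (the key layout is illustrative):

```python
import hashlib

def cache_key(user_id: str, prompt: str) -> str:
    """Deterministic across processes and restarts, unlike built-in hash()."""
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:16]
    return f"llm_res:{user_id}:{digest}"

k1 = cache_key("u42", "summarize this")
k2 = cache_key("u42", "summarize this")
assert k1 == k2  # Same key in every worker: retries dedupe correctly.
assert cache_key("u42", "different prompt") != k1
```

With a salted `hash()`, a retry landing on a different worker would miss the cache and fire a duplicate LLM call, which is exactly the storm you are trying to prevent.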
## Observability: Logging State Transitions for Post-Incident Analysis
If you can’t see it, it didn’t happen. Log every state transition (CLOSED → OPEN, OPEN → HALF_OPEN, HALF_OPEN → CLOSED) with a timestamp and the failure count. This is your first tool during post-mortems.
```python
# In CircuitBreaker._on_failure and _on_success, capture the state before
# the transition, then emit a structured entry your log pipeline can parse:
old_state = self.state.value
# ... transition logic runs here ...
logger.info(json.dumps({
    "event": "circuit_breaker_tripped",
    "breaker_name": self.name,
    "old_state": old_state,
    "new_state": self.state.value,
    "failure_count": self.failure_count,
    "last_error": str(exception),
    "timestamp": time.time(),
}))
```
Ship these logs to your observability platform. A dashboard showing breaker states across services gives you an instant, visual map of a cascading failure.
## Real Errors and Fixes You'll Encounter
When you implement this, you’ll hit these. Bookmark them.
`redis.exceptions.ConnectionError: max number of clients reached`
- Cause: Every failed LLM request holds a Redis connection open, exhausting the pool.
- Fix: Implement connection pooling (`redis.ConnectionPool`) and ensure your circuit breaker fails fast, releasing connections immediately. Also increase `maxclients` in `redis.conf`.

`celery.exceptions.SoftTimeLimitExceeded`
- Cause: Your LLM task is running too long, blocking your worker.
- Fix: For LLM tasks, set `task_soft_time_limit=280` and `task_time_limit=300`. The soft limit allows a graceful cleanup; the hard limit is the killer. The circuit breaker should prevent most of these by routing to faster fallbacks.
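The soft-limit idea generalizes beyond Celery. Here is a framework-agnostic sketch using the standard library, where a blocking call is cut off at a deadline so the caller can clean up or fall back (the limits are illustrative, and note the worker thread itself keeps running until it finishes):

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def run_with_deadline(fn, soft_limit: float):
    """Run fn in a worker thread; raise FutureTimeout past the limit.

    Analogous to Celery's task_soft_time_limit: the caller regains control
    and can fall back, even though the worker may still be finishing up.
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn)
        return future.result(timeout=soft_limit)

assert run_with_deadline(lambda: "fast", soft_limit=1.0) == "fast"

timed_out = False
try:
    run_with_deadline(lambda: time.sleep(0.3) or "slow", soft_limit=0.05)
except FutureTimeout:
    timed_out = True
assert timed_out  # Caller was released at the soft limit and can fall back.
```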
## Next Steps: From Pattern to System
You’ve now got a single service with a robust circuit breaker. The next evolution is system-wide resilience.
- Implement at the API Gateway: Use Kong or Traefik’s built-in circuit breakers for all downstream LLM calls. This protects your entire stack at the edge.
- Create a Model Router: Build a service that routes requests based on circuit breaker state, cost, latency, and required capability. It’s the circuit breaker pattern applied to routing logic.
- Adopt Event-Driven Fallbacks: Instead of synchronous chains, publish an `llm.request.failed` event to Kafka when the primary fails. Let a separate consumer service handle the fallback logic asynchronously; the extra broker hop costs some latency, but you gain replay capability for debugging these failure paths.
- Benchmark Your Stack: Test it. Kill your OpenAI container or simulate 429s with a proxy, then drive 100 concurrent LLM requests at your FastAPI `async` endpoints and measure how well the breaker preserves throughput while the primary is down.
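As a starting point for the model router idea above, it can be sketched as a thin layer that consults each provider's breaker state before dispatching (the `Provider` record and its fields are made up for illustration; a real router would also weigh cost, latency, and capability):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Provider:
    name: str
    breaker_open: bool        # In production: breaker.state == CircuitState.OPEN
    cost_per_1k_tokens: float  # A real router would factor this into the choice.

def route(providers: List[Provider]) -> Provider:
    """Return the first provider in preference order whose breaker is closed."""
    for p in providers:
        if not p.breaker_open:
            return p
    raise RuntimeError("No healthy providers; shed load or queue the request.")

fleet = [
    Provider("openai", breaker_open=True, cost_per_1k_tokens=0.03),
    Provider("anthropic", breaker_open=False, cost_per_1k_tokens=0.025),
    Provider("ollama", breaker_open=False, cost_per_1k_tokens=0.0),
]
assert route(fleet).name == "anthropic"  # OpenAI's breaker is open; skip it.
```

This is the circuit breaker pattern applied to routing: the breaker state becomes a routing input instead of just an exception.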
Resilience isn’t about preventing failure; it’s about containing it. A circuit breaker is your strategic containment vessel. It turns a catastrophic, user-losing, pager-firing outage into a silent, automated failover that you discover in your logs the next morning over coffee. Your users get their answers, your infrastructure stays healthy, and you can finally sleep through someone else’s downtime.