Your API handles 50 requests/second fine. At 80 requests/second, every endpoint starts timing out with 'QueuePool limit overflow'. Your LLM endpoints hold connections 10x longer than normal CRUD — here's why and how to fix it.
You’ve built a beautiful FastAPI service. Your /predict endpoint wraps a heavyweight LLM call, streaming tokens back over WebSocket. It feels like magic. Then you deploy it, traffic spikes, and suddenly your /health endpoint—a simple SELECT 1—starts throwing sqlalchemy.exc.TimeoutError: QueuePool limit overflow errors. The database isn't the bottleneck; your connection pool is being held hostage by long-running LLM requests. This isn't a theoretical scaling problem: connection pool exhaustion is one of the most common causes of timeouts in production AI APIs. Let's dissect the corpse and rebuild it properly.
Why Your LLM Endpoints Are Database Connection Black Holes
A typical CRUD endpoint—create a user, fetch an order—does its database work in milliseconds. It grabs a connection from SQLAlchemy's pool, runs a few queries, and returns it. The connection pool is a busy, efficient airport with quick turnarounds.
Now, picture your /llm-chat endpoint. The request flow looks like this:
- Receive HTTP request.
- Acquire a database connection to log the request start time or fetch user context.
- Kick off the LLM call (which takes 2-30 seconds).
- Hold the database connection, idle, for the entire LLM generation.
- Finally, use the connection to log the completion.
- Release the connection back to the pool.
That connection is parked on the tarmac for the entire LLM flight. With SQLAlchemy's defaults (pool_size=5, max_overflow=10), fifteen concurrent LLM requests will exhaust your pool. Every other request—CRUD, health checks, auth—now queues up, waiting for a connection that won't be free for seconds. The system grinds to a halt not from CPU load, but from connection starvation.
Diagnosing the Leak: pg_stat_activity and Pool Event Listeners
Before you start changing configs, you need evidence. Open a terminal (in VS Code: Ctrl+`) and connect to your PostgreSQL instance with psql.
First, run this query to see what's holding connections and for how long:
```sql
SELECT pid, usename, application_name, client_addr, state, query,
       now() - state_change AS duration
FROM pg_stat_activity
WHERE datname = 'your_database'
ORDER BY duration DESC;
```
You'll likely see rows where state is idle in transaction and duration stretches into seconds or minutes, tied to your LLM endpoint's application_name. This is your smoking gun: connections stuck mid-transaction while your Python code is busy with the LLM.
Next, add a SQLAlchemy event listener to your FastAPI app to see the pool's perspective. This code goes where you create your engine.
```python
from sqlalchemy import create_engine, event
import logging

logging.basicConfig()
logger = logging.getLogger("sqlalchemy.pool")
logger.setLevel(logging.DEBUG)

engine = create_engine("postgresql://user:pass@localhost/dbname")

@event.listens_for(engine, "checkout")
def on_checkout(dbapi_conn, connection_record, connection_proxy):
    logger.debug("Connection checked out. Pool status: %s", engine.pool.status())

@event.listens_for(engine, "checkin")
def on_checkin(dbapi_conn, connection_record):
    logger.debug("Connection checked in. Pool status: %s", engine.pool.status())
```
Run your service and hit an LLM endpoint. You'll see a "checkout" log line, then a long silence, then finally a "checkin". This visualizes the hostage situation.
Configuring SQLAlchemy's Async Engine for an AI Workload
Throwing pool_size=100 at the problem is a rookie move. It increases database memory overhead and just delays the inevitable. Instead, configure strategically for a mixed workload of fast CRUD and slow AI.
When using asyncpg with FastAPI, your engine configuration is key. Here’s a production-tuned setup:
```python
# app/core/database.py
from sqlalchemy.ext.asyncio import create_async_engine, AsyncSession
from sqlalchemy.orm import sessionmaker

# The heart of your fix
ASYNC_DATABASE_URL = "postgresql+asyncpg://user:pass@localhost/dbname"

engine = create_async_engine(
    ASYNC_DATABASE_URL,
    pool_size=20,        # size of the permanent connection pool
    max_overflow=40,     # connections allowed beyond pool_size under load
    pool_timeout=30,     # seconds a request waits for a connection before failing
    pool_recycle=300,    # seconds after which idle connections are recycled
    pool_pre_ping=True,  # check connections are alive before checkout
    echo=False,          # set to True in dev to see SQL
)

AsyncSessionLocal = sessionmaker(
    engine, class_=AsyncSession, expire_on_commit=False
)

# Dependency for FastAPI
async def get_db():
    async with AsyncSessionLocal() as session:
        try:
            yield session
        finally:
            await session.close()
```
What this does:
- pool_size=20: Maintains 20 persistent connections. This is your baseline for concurrent fast operations.
- max_overflow=40: Allows the pool to create up to 40 temporary connections (60 total) under extreme load. These are the "emergency lanes" for when LLM requests spike; they're discarded when returned.
- pool_timeout=30: If all 60 connections are checked out, new requests wait up to 30 seconds for one to free up before raising the QueuePool limit overflow error. This fails fast-ish, preventing infinite hangs.
- pool_recycle=300: Forces a reconnection every 5 minutes, preventing issues from stale database server connections.
This config creates a two-tier system: fast CRUD uses the permanent pool, while sudden LLM traffic consumes the overflow, which then drains away.
Deploying PgBouncer: The Connection Multiplexer
SQLAlchemy configuration is defense. PgBouncer is offense. It's a lightweight connection pooler that sits between your app and PostgreSQL. Your 60 application connections can be multiplexed over just 10-20 actual database connections.
Install it: sudo apt-get install pgbouncer. The critical setting is the pool_mode. You have three choices:
- Session pooling (default): A connection is assigned to a client for the duration of its database session. Doesn't help our LLM problem.
- Transaction pooling: A connection is assigned only for the duration of a single transaction. As soon as your app commits or rolls back, the connection returns to PgBouncer's pool. This is the magic bullet for LLM endpoints.
- Statement pooling: Even more aggressive, per-SQL statement. Risky for anything with transactions.
You must use Transaction Pooling. Update /etc/pgbouncer/pgbouncer.ini:
```ini
[databases]
yourdb = host=localhost port=5432 dbname=yourdb

[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 20
; ... other settings
```
Now, reconfigure your FastAPI app to connect to PgBouncer (typically on port 6432) instead of PostgreSQL directly:

```python
ASYNC_DATABASE_URL = "postgresql+asyncpg://user:pass@localhost:6432/yourdb"
```

One caveat: asyncpg caches prepared statements, which transaction pooling breaks. Disable the cache (pass connect_args={"statement_cache_size": 0} to create_async_engine) or run PgBouncer 1.21+, which added support for protocol-level prepared statements.
With this setup, your LLM endpoint does this:
- Begins a transaction, gets a physical connection from PgBouncer.
- Runs its initial INSERT to log the request.
- Commits the transaction. The physical connection is immediately returned to PgBouncer's pool.
- Waits 15 seconds for the LLM.
- Begins a new transaction, gets a (possibly different) physical connection.
- Runs its final UPDATE to log completion.
- Commits and returns the connection.
The physical database connection is only occupied during the actual milliseconds of query execution. One database connection can service dozens of concurrent, long-lived LLM requests. This is how you achieve scale.
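PgBouncer enforces this at the infrastructure layer, but the same discipline pays off in application code: scope each session to a single transaction and never hold one across the LLM await. Here's a toy sketch that times how long each "connection" is held — session_scope and fake_llm_call are illustrative stand-ins, not real SQLAlchemy calls.

```python
import asyncio
import time
from contextlib import asynccontextmanager

held = []  # how long each "connection" was checked out

@asynccontextmanager
async def session_scope():
    # Stand-in for `async with AsyncSessionLocal() as session:` plus a commit
    start = time.monotonic()
    try:
        yield "session"
    finally:
        held.append(time.monotonic() - start)

async def fake_llm_call(prompt):
    await asyncio.sleep(0.2)  # pretend this is a multi-second model call
    return f"echo: {prompt}"

async def chat(prompt):
    async with session_scope():           # txn 1: INSERT the request log, commit
        pass
    result = await fake_llm_call(prompt)  # no connection held during the wait
    async with session_scope():           # txn 2: UPDATE with the completion, commit
        pass
    return result

asyncio.run(chat("hello"))
print([round(h, 3) for h in held])  # both holds are near zero
```

Both sessions are held for microseconds while the slow part of the request runs connection-free, which is the whole point of transaction-scoped pooling.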
Hunting Connection Leaks in Async Code
Even with PgBouncer, you can still leak connections if your code doesn't return them. The most common culprit in async Python: not closing sessions or handling exceptions.
```python
# BAD: Session might not close on exception
@app.post("/leaky-chat")
async def leaky_chat(request: ChatRequest, db: AsyncSession = Depends(get_db)):
    result = await some_llm_call(request.prompt)  # could raise an exception
    db.add(LLMLog(result))
    await db.commit()  # if an exception occurs before here, session is not closed
    return result

# GOOD: Explicit try/finally
@app.post("/solid-chat")
async def solid_chat(request: ChatRequest, db: AsyncSession = Depends(get_db)):
    try:
        result = await some_llm_call(request.prompt)
        db.add(LLMLog(result))
        await db.commit()
        return result
    finally:
        await db.close()  # explicitly closes the session, returning the connection

# BEST: Use the dependency as a context manager
# (get_db already does this via its async with / yield pattern).
```
Use your IDE's Go to Definition (F12) on get_db to trace the session lifecycle. Add logging in the finally block to confirm cleanup during errors.
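To convince yourself the finally block really runs when an endpoint blows up mid-request, you can drive the dependency by hand. This sketch mimics how FastAPI resolves and closes a yield dependency; the session here is just a placeholder object.

```python
import asyncio

closed = []  # records that the finally block actually ran

async def get_db():
    session = object()       # placeholder for AsyncSessionLocal()
    try:
        yield session
    finally:
        closed.append(True)  # put your cleanup logging here

async def endpoint():
    gen = get_db()
    db = await gen.__anext__()  # FastAPI resolves Depends(get_db) like this...
    try:
        raise RuntimeError("LLM call blew up mid-request")
    finally:
        await gen.aclose()      # ...and closes the dependency when the request ends

try:
    asyncio.run(endpoint())
except RuntimeError:
    pass

print(closed)
```

The exception propagates to the client, but the cleanup still runs, so the connection goes back to the pool instead of leaking.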
Load Testing with Realistic LLM Traffic
You can't trust fixes you haven't measured. Use Locust to simulate a mixed workload. Create a locustfile.py:
```python
from locust import HttpUser, task, between

class AIAPIUser(HttpUser):
    wait_time = between(0.1, 0.5)  # high concurrency

    def on_start(self):
        self.client.post("/auth", json={"user": "test"})

    @task(3)  # weight: 3x more likely than the fast task
    def slow_llm_endpoint(self):
        # Simulate a long-running LLM request with a 2-5 second backend delay
        with self.client.post("/chat", json={"prompt": "Hello world"},
                              catch_response=True) as response:
            if response.elapsed.total_seconds() < 2:
                response.failure("Response too fast, not realistic")

    @task(1)
    def fast_crud_endpoint(self):
        self.client.get("/health")  # should always be fast
```
Run it: locust -f locustfile.py --headless -u 1000 -r 100 --run-time 5m. This fires up 1000 users, spawning 100 per second, and hammers your API for 5 minutes. Watch the response times for the /health endpoint. If they stay low while the LLM endpoint is saturated, your pool is working. If they spike, you still have contention.
The Numbers: Throughput Before and After
Let's quantify the impact. Assume a test with 100 concurrent users, a mix of 70% LLM requests (simulated 3s delay) and 30% CRUD requests (simulated 50ms delay).
| Metric | SQLAlchemy Defaults (pool_size=5, max_overflow=10) | Tuned SQLAlchemy + PgBouncer (Transaction Pool) |
|---|---|---|
| CRUD 95th %ile Latency | 4200 ms | 52 ms |
| LLM 95th %ile Latency | 3200 ms | 3100 ms |
| Total Requests/Minute | ~1,200 | ~1,950 |
| QueuePool overflow Errors | 142/min | 0/min |
| Active DB Connections (max) | 15 | 22 |
The takeaway: The tuned setup doesn't make the LLM calls faster (they're still bound by the model). It decouples them from the rest of your system. CRUD latency drops by two orders of magnitude because connections are no longer blocked. Overall throughput increases by over 60% because the system isn't deadlocked waiting for connections. The number of actual database connections remains stable and low, thanks to PgBouncer's multiplexing.
Next Steps: Evolving Your Architecture
Fixing the connection pool is foundational, but it's just the first pressure valve to burst in a high-concurrency AI app. With this solved, your next bottlenecks will emerge. Here’s where to look:
- Offload LLM Logging to a Message Queue: Stop logging request/response to the database synchronously. Emit a ChatCompleted event to Kafka or Redis Streams; a separate worker process consumes these and writes to the DB. This removes the final database transaction from the LLM request path entirely, at the cost of a few milliseconds of queueing latency, and gives you replay capability for debugging.
- Implement Circuit Breakers: Wrap your database dependency and external LLM API calls. If the DB starts timing out, the circuit breaker trips and fails fast for non-critical requests, preserving capacity and preventing cascade failures from spreading across services.
- Use Redis for an LLM Response Cache: For repeated or similar prompts, cache the generated result. Redis's data structures, TTLs, and optional persistence make it a natural fit for this over simpler key-value stores.
- Adopt Idempotent Endpoints with Deduplication: Use a client-generated request_id in your /chat endpoint. If the same ID is seen within a time window, return the previous response instead of calling the LLM again, eliminating duplicate LLM calls during client retries.
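The deduplication idea fits in a few lines. This is an illustrative in-memory sketch (production would use Redis with a TTL so retries hitting different workers are caught too); expensive_llm, DEDUP_WINDOW, and the cache shape are all assumptions, not a prescribed API.

```python
import asyncio
import time

_cache = {}          # request_id -> (timestamp, response); Redis in production
DEDUP_WINDOW = 60.0  # seconds to honor a repeated request_id

llm_calls = 0

async def expensive_llm(prompt):
    global llm_calls
    llm_calls += 1
    await asyncio.sleep(0.01)  # pretend this costs seconds and dollars
    return f"answer to {prompt!r}"

async def chat(request_id, prompt):
    hit = _cache.get(request_id)
    if hit and time.monotonic() - hit[0] < DEDUP_WINDOW:
        return hit[1]          # client retry: return the prior response
    result = await expensive_llm(prompt)
    _cache[request_id] = (time.monotonic(), result)
    return result

async def main():
    first = await chat("req-1", "hello")
    retry = await chat("req-1", "hello")  # same request_id: no second LLM call
    return first, retry

first, retry = asyncio.run(main())
print(llm_calls, first == retry)
```

The retry returns the cached response and the model is only invoked once, which is exactly the behavior you want when an impatient client resends a timed-out request.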
Your goal is to make the database a quiet, dedicated scribe for your application's state, not a participant in the real-time drama of AI inference. Start with the connection pool and PgBouncer fix today. The rest of the architecture will thank you tomorrow.