Problem: AI Requests Block Your Django App
Your Django 6.0 app integrates AI features but response times spike to 10+ seconds, concurrent requests pile up, and your server runs out of connections during peak traffic.
You'll learn:
- Why synchronous AI calls destroy Django performance
- How to implement async views for LLM requests
- Connection pooling strategies for AI APIs
- Database query optimization for AI workloads
Time: 20 min | Level: Intermediate
Why This Happens
Django's default WSGI deployment dedicates one worker (thread or process) to each in-flight request. When that worker waits 3-8 seconds for an LLM API response, it's blocked and can't handle other requests. With 10 concurrent AI requests, you need 10 workers just sitting idle.
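A quick back-of-envelope calculation makes the ceiling concrete (the worker count and latency here are illustrative):

```python
# Max throughput of a blocking deployment: workers / time-per-request
workers = 10            # e.g. gunicorn sync workers
ai_latency_s = 5.0      # seconds each request spends waiting on the LLM
max_throughput = workers / ai_latency_s
print(f"{max_throughput:.1f} req/s")  # 2.0 req/s -- anything beyond that queues
```

At 2 requests per second, even modest traffic saturates the pool while the CPU sits idle.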
Common symptoms:
- Response times: 200ms → 8000ms when AI is involved
- TimeoutError under moderate load (50+ requests/min)
- Database connection pool exhaustion
- CPU at 5% while requests queue
Solution
Step 1: Enable ASGI and Async Views
Django 6.0 has mature async support. Switch from WSGI to ASGI to handle AI calls without blocking.
# Install an async-compatible server (ideally inside a virtualenv)
pip install "uvicorn[standard]"
Update your ASGI config:
# asgi.py
import os
from django.core.asgi import get_asgi_application
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'myproject.settings')
application = get_asgi_application()
Run with Uvicorn:
uvicorn myproject.asgi:application --workers 4 --loop uvloop
Why this works: Uvicorn uses an event loop that can handle thousands of concurrent connections while waiting for I/O (like AI API calls).
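You can see the event-loop effect in isolation with a toy script, where `asyncio.sleep` stands in for the network wait of an AI call:

```python
import asyncio
import time

async def fake_llm_call(i: int) -> int:
    await asyncio.sleep(1)  # stands in for a 1-second AI API round trip
    return i

async def main():
    start = time.monotonic()
    # 100 "calls" overlap on a single thread instead of occupying 100 workers
    results = await asyncio.gather(*(fake_llm_call(i) for i in range(100)))
    elapsed = time.monotonic() - start
    print(f"{len(results)} calls in {elapsed:.1f}s")  # ~1s total, not 100s

asyncio.run(main())
```

The same principle applies when the sleep is replaced by a real HTTP request to an AI API: the loop services other requests while each one awaits.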
Step 2: Convert Views to Async
# views.py - BEFORE (blocks an entire worker)
from django.http import JsonResponse
from openai import OpenAI

def chat_view(request):
    client = OpenAI()
    response = client.chat.completions.create(  # Blocks for 3-8 seconds
        model="gpt-4",
        messages=[{"role": "user", "content": request.POST['prompt']}]
    )
    return JsonResponse({"reply": response.choices[0].message.content})
# views.py - AFTER (non-blocking)
from django.http import JsonResponse
from openai import AsyncOpenAI

async def chat_view(request):
    client = AsyncOpenAI()
    # The worker handles other requests during this await
    response = await client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": request.POST['prompt']}]
    )
    return JsonResponse({"reply": response.choices[0].message.content})
Expected: 4 workers can now handle 100+ concurrent AI requests instead of just 4.
If it fails:
- Error "SynchronousOnlyOperation": you're calling sync code (usually the ORM) from an async view. Wrap it with sync_to_async()
- Import error: use AsyncOpenAI, not OpenAI, for async support
Step 3: Add Connection Pooling
AI APIs have rate limits. Pool connections to avoid hitting limits and reuse TCP connections.
# settings.py
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.redis.RedisCache',
        'LOCATION': 'redis://127.0.0.1:6379/1',
        'OPTIONS': {
            'CLIENT_CLASS': 'django_redis.client.DefaultClient',
            'CONNECTION_POOL_KWARGS': {
                'max_connections': 50,
                'retry_on_timeout': True
            }
        }
    }
}

# AI client configuration (custom settings, read by our client module below)
OPENAI_MAX_CONNECTIONS = 20  # Stay under the provider's rate limit
OPENAI_TIMEOUT = 30  # Fail fast instead of hanging
Update your AI client:
# utils/ai_client.py
from openai import AsyncOpenAI
from django.conf import settings
import httpx

# Reuse connections across requests
_http_client = httpx.AsyncClient(
    limits=httpx.Limits(
        max_connections=settings.OPENAI_MAX_CONNECTIONS,
        max_keepalive_connections=10
    ),
    timeout=settings.OPENAI_TIMEOUT
)

async_openai_client = AsyncOpenAI(http_client=_http_client)
Why this works: Reusing TCP connections saves 100-300ms per request. Connection limits prevent rate limit errors.
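httpx's limits cap open sockets; you can also cap in-flight requests at the application level with a semaphore. A minimal sketch (`limited_completion` is an illustrative wrapper, and the limit mirrors the setting above):

```python
import asyncio

_ai_semaphore = asyncio.Semaphore(20)  # mirror OPENAI_MAX_CONNECTIONS

async def limited_completion(client, **kwargs):
    # At most 20 calls run concurrently; extras wait here instead of
    # piling onto the API and triggering 429 rate-limit errors
    async with _ai_semaphore:
        return await client.chat.completions.create(**kwargs)
```

Waiting callers queue inside your process, where backpressure is cheap, rather than at the provider, where it costs a failed request.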
Step 4: Optimize Database Queries
AI features often trigger N+1 queries when loading context.
# BEFORE (extra queries)
async def generate_summary(request, doc_id):
    document = await Document.objects.aget(id=doc_id)
    # Separate query for the chunks; repeat this per document and it's N+1
    chunks = [chunk.text async for chunk in document.chunks.all()]
    prompt = f"Summarize: {' '.join(chunks)}"
    # ... AI call

# AFTER (fixed number of queries)
async def generate_summary(request, doc_id):
    # Load the document and all its chunks up front (two queries total)
    document = await Document.objects.prefetch_related('chunks').aget(id=doc_id)
    chunks = [chunk.text for chunk in document.chunks.all()]  # served from the prefetch cache
    prompt = f"Summarize: {' '.join(chunks)}"
    # ... AI call
Measure queries:
from django.db import connection, reset_queries
from django.test import AsyncRequestFactory
from django.test.utils import override_settings

@override_settings(DEBUG=True)  # connection.queries is only recorded when DEBUG=True
async def test_view():
    reset_queries()
    request = AsyncRequestFactory().get('/')
    await generate_summary(request, doc_id=1)
    print(f"Queries: {len(connection.queries)}")  # Should stay constant, not grow per chunk
Step 5: Implement Response Caching
Cache AI responses to avoid redundant expensive calls.
from django.core.cache import cache
import hashlib

async def cached_ai_completion(prompt: str, model: str = "gpt-4"):
    # Key on model + prompt so responses from different models don't collide
    digest = hashlib.sha256(prompt.encode()).hexdigest()[:16]
    cache_key = f"ai:{model}:{digest}"
    # Check cache first
    cached = await cache.aget(cache_key)
    if cached is not None:
        return cached
    # Make the AI call
    response = await async_openai_client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    result = response.choices[0].message.content
    # Cache for 1 hour
    await cache.aset(cache_key, result, timeout=3600)
    return result
Expected: Identical prompts return in <10ms instead of 3000ms.
If it fails:
- Cache misses: ensure Redis is running: redis-cli ping
- Memory issues: reduce the cache timeout or configure an LRU eviction policy
Verification
Load test before and after:
# Install the load testing tool (ideally inside a virtualenv)
pip install locust
# Create test file
cat > locustfile.py << 'EOF'
from locust import HttpUser, task, between

class AIUser(HttpUser):
    wait_time = between(1, 3)

    @task
    def chat_request(self):
        self.client.post("/api/chat/",
                         json={"prompt": "Explain async Django"})
EOF
# Run test (--users/--spawn-rate take effect in headless mode)
locust --host=http://localhost:8000 --users 50 --spawn-rate 10 --headless --run-time 60s
You should see:
| Metric | Before | After |
|---|---|---|
| p95 latency | 8500ms | 3200ms |
| Throughput | 7 req/s | 48 req/s |
| Error rate | 12% | 0.2% |
What You Learned
- Async views prevent AI calls from blocking workers
- Connection pooling reduces latency and avoids rate limits
- Database prefetching eliminates N+1 queries in AI context loading
- Caching identical prompts saves cost and improves UX
Limitations:
- ASGI requires compatible middleware (check Django 6.0 compatibility)
- Async views can't use legacy sync-only Django features
- Cache invalidation needed if AI responses should update
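On the middleware point: Django wraps sync-only middleware in thread adapters under ASGI, which costs a context switch per request. Your own middleware can declare itself async-capable; a minimal sketch (the class name is illustrative):

```python
class AIContextMiddleware:
    # Tell Django this middleware runs natively on the event loop,
    # so no sync/async adapter is inserted around it
    async_capable = True
    sync_capable = False

    def __init__(self, get_response):
        self.get_response = get_response

    async def __call__(self, request):
        # e.g. attach per-request AI context here before the view runs
        response = await self.get_response(request)
        return response
```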
Production considerations:
# settings.py - Production config
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': 'myapp',
        # Built-in connection pooling (requires psycopg 3). Leave
        # CONN_MAX_AGE at its default of 0: persistent connections
        # and the pool are mutually exclusive.
        'OPTIONS': {
            'pool': {
                'min_size': 5,
                'max_size': 20,
            }
        }
    }
}
# Rate limiting (illustrative custom settings, read by your own middleware)
RATELIMIT_ENABLE = True
RATELIMIT_AI_ENDPOINTS = '10/m'  # 10 requests per minute per user

# Monitoring (custom settings for your cost-tracking code)
OPENAI_TRACK_COSTS = True
OPENAI_ALERT_THRESHOLD = 100  # Alert if daily cost exceeds $100
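The RATELIMIT_* settings above assume enforcement code of your own. The core of a per-user limiter is a token bucket; here is an in-process sketch (production would keep this state in Redis so all workers share it):

```python
import time

class TokenBucket:
    """Allows rate_per_min requests per minute, refilling continuously."""

    def __init__(self, rate_per_min: int):
        self.capacity = rate_per_min
        self.tokens = float(rate_per_min)
        self.fill_rate = rate_per_min / 60.0  # tokens per second
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill based on elapsed time, capped at the bucket capacity
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.fill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A view would keep one bucket per user and reject the request (HTTP 429) when allow() returns False.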
Tested on Django 6.0.1, Python 3.12, OpenAI Python SDK 1.50.0, Ubuntu 24.04