Your mobile client retries on network timeout. Your server already processed the request and charged the user 2,000 tokens. The retry charges them again. Idempotency keys fix this — here's the implementation.
This isn't just about wasting tokens; it's about breaking user trust. A double-charge on a $0.02 LLM call is annoying. A double-charge on a $200 Claude Opus API call for a 100k context legal document summary is a support ticket, a refund, and a lost customer. In AI backends, the side effects are expensive, non-transactional, and painfully visible. Let's build the safety net.
Why Your LLM Endpoint is a Duplicate Request Magnet
Think about your standard CRUD endpoint. A POST /users might check for an existing email, throw a 409 Conflict, and be done. The side effect—a row in a database—is protected by a UNIQUE constraint. It's naturally idempotent on retry if you use the same data.
Now consider POST /chat/completions. Your side effects are:
- Token Consumption: Direct, irreversible cost.
- Downstream Actions: The LLM's output might have triggered a send_email() task, a database update, or a call to a payment API. Running it twice sends two emails, updates twice, charges twice.
- State Confusion: The user sees the same response appear twice in their UI.
The vulnerability comes from the client-server timing mismatch. An LLM call isn't a 50ms database insert; it's 2-15 seconds of streaming tokens, and client network timeouts are often set lower than that. The client gives up, fires a retry, and your server now has two concurrent expensive processes racing to do the same thing. Exponential backoff tames how fast clients retry, but only server-side idempotency stops the duplicate work.
The Mechanics: Client-Generated Key, Server-Enforced Uniqueness
The pattern is simple in theory, fiddly in practice.
- Client Responsibility: Generates a unique string (UUID v4 is perfect) and sends it as an Idempotency-Key: <uuid> header with every mutable request (POST, PATCH, PUT).
- Server Responsibility: Before doing any work, checks its cache (Redis) for this key.
  - Key exists, request complete: Returns the cached response with whatever status code it originally had (200, 409, etc.).
  - Key exists, request in-flight: Returns a 409 Conflict or 425 Too Early, telling the client to wait and retry with the same key.
  - Key does not exist: Locks the key, processes the request, stores the response and final status code, then unlocks.
The critical nuance is the "in-flight" state. You must handle the race condition where two identical requests from a flaky client arrive milliseconds apart.
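On the client side, the key must be generated once per logical operation and reused on every retry attempt — a fresh UUID per attempt defeats the whole scheme. A minimal sketch of that discipline (the `send_request` stub stands in for a real HTTP client such as httpx or requests):

```python
import uuid

def send_request(payload: dict, headers: dict) -> dict:
    # Stand-in for a real HTTP call; a production client would set a
    # timeout and raise ConnectionError/TimeoutError on network failure.
    return {"echo": payload, "key": headers["Idempotency-Key"]}

def post_with_retries(payload: dict, max_attempts: int = 3) -> dict:
    # Generate the key ONCE for this logical operation...
    headers = {"Idempotency-Key": str(uuid.uuid4())}
    last_error = None
    for attempt in range(max_attempts):
        try:
            # ...and reuse it unchanged on every retry attempt.
            return send_request(payload, headers)
        except ConnectionError as exc:
            last_error = exc
    raise last_error
```

Because the header is built outside the retry loop, the server sees one logical request no matter how many times the network forces a resend.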
Building the Redis State Machine
You need more than a simple cache. You need to store the request's state and its eventual result. We'll use Redis because we need persistence (in case the worker dies) and its data structures are perfect for this.
We'll track three states per key: in_flight, completed, or not_found. The completed state will also hold the cached response.
```python
from datetime import datetime, timedelta
from typing import Optional, Dict, Any

import aioredis
from pydantic import BaseModel


class IdempotencyRecord(BaseModel):
    status: str  # "in_flight" or "completed"
    http_status_code: int
    response_body: Optional[Dict[str, Any]] = None
    created_at: datetime


class IdempotencyStore:
    def __init__(self, redis_client: aioredis.Redis):
        self.redis = redis_client
        self.ttl = timedelta(hours=24)  # How long to keep completed requests

    def _key(self, idempotency_key: str) -> str:
        return f"idempotency:{idempotency_key}"

    async def create_in_flight(self, idempotency_key: str) -> bool:
        """Try to claim this key for processing. Returns True if claimed, False if it already exists."""
        key = self._key(idempotency_key)
        record = IdempotencyRecord(
            status="in_flight",
            http_status_code=0,
            created_at=datetime.utcnow(),
        )
        # Use SET with NX (only set if Not eXists) for atomic lock acquisition
        was_set = await self.redis.set(
            key,
            record.json(),
            ex=int(self.ttl.total_seconds()),
            nx=True,  # <-- The atomic lock
        )
        return was_set is not None

    async def store_result(self, idempotency_key: str, http_status_code: int, response_body: Dict[str, Any]):
        """Store the final result of a completed request."""
        key = self._key(idempotency_key)
        record = IdempotencyRecord(
            status="completed",
            http_status_code=http_status_code,
            response_body=response_body,
            created_at=datetime.utcnow(),
        )
        await self.redis.setex(
            key,
            int(self.ttl.total_seconds()),
            record.json(),
        )

    async def get_result(self, idempotency_key: str) -> Optional[IdempotencyRecord]:
        """Retrieve a completed record, if it exists."""
        key = self._key(idempotency_key)
        data = await self.redis.get(key)
        if not data:
            return None
        record = IdempotencyRecord.parse_raw(data)
        return record if record.status == "completed" else None
```
The Atomic Lock: SET key value NX EX is your best friend. It's a single, thread-safe Redis operation that only sets the key if it doesn't exist, simultaneously setting its expiry. This prevents two worker processes from both thinking they claimed the key.
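To see why NX gives you a race-free claim, here's its semantics modeled with a plain dict — an in-memory stand-in for Redis, for illustration only (real Redis gives you this atomicity across processes and machines):

```python
import time

class FakeRedis:
    """In-memory stand-in for the two Redis behaviors the lock relies on."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at or None)

    def set(self, key, value, ex=None, nx=False):
        now = time.monotonic()
        entry = self._data.get(key)
        alive = entry is not None and (entry[1] is None or entry[1] > now)
        if nx and alive:
            return None  # mirrors redis-py: NX on a live key returns None
        self._data[key] = (value, now + ex if ex else None)
        return True

r = FakeRedis()
first = r.set("idempotency:abc", "in_flight", ex=30, nx=True)   # claims the key
second = r.set("idempotency:abc", "in_flight", ex=30, nx=True)  # loses the race
# first is True, second is None
```

Whichever worker gets True proceeds; the other backs off and returns 409, exactly as in the middleware below.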
FastAPI Middleware: The Transparent Enforcer
The goal is to make endpoints idempotent without polluting your business logic. Middleware or a dependency is ideal.
```python
# idempotency/middleware.py
import asyncio
import json

import aioredis
from fastapi import FastAPI, Request, status
from fastapi.responses import JSONResponse

from .redis_store import IdempotencyStore

app = FastAPI()

# Initialize a Redis connection pool (CRITICAL for production: without pooling
# you can hit "max number of clients reached" under load — if you do, raise
# maxclients in redis.conf).
redis = aioredis.from_url("redis://localhost:6379", max_connections=20, decode_responses=False)
idempotency_store = IdempotencyStore(redis)


@app.middleware("http")
async def idempotency_middleware(request: Request, call_next):
    # Only apply to mutating methods
    if request.method not in ("POST", "PUT", "PATCH", "DELETE"):
        return await call_next(request)

    idempotency_key = request.headers.get("Idempotency-Key")
    # If no key provided, let it proceed without idempotency protection
    if not idempotency_key:
        return await call_next(request)

    # 1. Check for an existing completed result
    existing_record = await idempotency_store.get_result(idempotency_key)
    if existing_record:
        # Return the cached response
        return JSONResponse(
            content=existing_record.response_body,
            status_code=existing_record.http_status_code,
        )

    # 2. Try to claim the key for a new request
    claim_success = await idempotency_store.create_in_flight(idempotency_key)
    if not claim_success:
        # Another request with this key is currently in-flight (or just started).
        # Return a response directly rather than raising HTTPException — exceptions
        # raised in middleware bypass FastAPI's exception handlers.
        return JSONResponse(
            status_code=status.HTTP_409_CONFLICT,
            content={"detail": f"Request with idempotency key '{idempotency_key}' is already in progress."},
        )

    # 3. Process the request
    try:
        response = await call_next(request)
        # Capture the response body. This gets tricky with streaming responses;
        # for simplicity, we only handle plain JSON responses here.
        if hasattr(response, "body"):
            response_body = json.loads(response.body.decode())
            await idempotency_store.store_result(
                idempotency_key,
                response.status_code,
                response_body,
            )
        return response
    except Exception:
        # If the request fails, delete the in-flight key to allow a retry.
        # Use a shorter TTL for failure states in production.
        await redis.delete(idempotency_store._key(idempotency_key))
        raise


# Your actual LLM endpoint remains clean
@app.post("/v1/chat/completions")
async def chat_completion(request_body: dict):
    # Your existing logic to call OpenAI, Anthropic, or a local model.
    # This only executes once per unique Idempotency-Key.
    await asyncio.sleep(2)  # Simulate a long-running LLM call
    return {"choices": [{"message": {"content": "This is the cached response."}}]}
```
The Gotcha: Middleware has a hard time reading and caching the final response body. For production, consider a dependency injection approach that wraps the endpoint, giving you cleaner access to the response object before it's sent.
The Worst-Case Scenario: Success on Server, Timeout on Client
This is the core problem we're solving. The LLM finished, you stored the result in Redis, but the client never got the response. Their retry arrives with the same Idempotency-Key.
Our middleware handles this perfectly:
- Retry arrives.
- get_result(idempotency_key) finds the completed record.
- The middleware immediately returns the cached JSON response with the original 200 status code.
- Zero tokens are consumed. The user sees the same answer.
The system is now safe for network hiccups, mobile backgrounding, and impatient users hammering the submit button.
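The whole flow can be exercised without a network by swapping a dict in for Redis; the point to notice is that the expensive call runs exactly once for two identical submissions:

```python
cache = {}       # idempotency_key -> cached response (stand-in for Redis)
call_count = 0   # counts how many times the "LLM" actually runs

def expensive_llm_call(prompt: str) -> dict:
    global call_count
    call_count += 1
    return {"content": f"answer to: {prompt}"}

def handle(idempotency_key: str, prompt: str) -> dict:
    # Completed result cached? Return it without touching the LLM.
    if idempotency_key in cache:
        return cache[idempotency_key]
    response = expensive_llm_call(prompt)
    cache[idempotency_key] = response
    return response

first = handle("key-1", "summarize this contract")
retry = handle("key-1", "summarize this contract")  # client retry after timeout
# first == retry, and call_count == 1
```

This is the success-on-server, timeout-on-client case in miniature: the retry is served entirely from the cache.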
How Long Should You Keep the Keys? A TTL Strategy
Your Redis isn't infinite. You need a Time-To-Live strategy.
- Short TTL (5 minutes): Good for interactive chat where a retry will happen immediately. Frees memory fast.
- Long TTL (24 hours): Essential for any request with real-world side effects (e.g., "generate and send a contract"). The client might retry much later.
- Very Long TTL (7 days): Required for compliance or audit trails. Pair this with Redis persistence or move old records to cold storage.
Implement a staggered TTL: use a short TTL for in_flight states (e.g., 30 seconds) to clean up abandoned locks from crashed workers, and a longer TTL for completed states. The IdempotencyStore class above uses a single TTL for simplicity, but it's straightforward to extend.
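One way to sketch that extension is to derive the expiry from the record's state. The 30-second and 24-hour values are the illustrative defaults from above, not requirements:

```python
from datetime import timedelta

# Illustrative defaults: short lock TTL, long result TTL.
TTL_BY_STATUS = {
    "in_flight": timedelta(seconds=30),  # abandoned locks self-clean quickly
    "completed": timedelta(hours=24),    # cached responses survive late retries
}

def ttl_seconds(status: str) -> int:
    """TTL to pass as the EX argument when writing a record in this state."""
    return int(TTL_BY_STATUS[status].total_seconds())

# ttl_seconds("in_flight") -> 30; ttl_seconds("completed") -> 86400
```

In `create_in_flight` you'd pass `ttl_seconds("in_flight")` as the `ex` argument, and in `store_result` the completed value — so a crashed worker's lock evaporates in seconds while cached responses stick around for a day.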
Streaming Responses: The Idempotency Nightmare
You can't cache a 10-second Server-Sent Events (SSE) or WebSocket stream in the same way. The pattern changes:
- First Request: Generates a unique stream_id, starts the LLM generation task (in Celery), stores the stream_id and task ID in Redis with an in_flight status, and returns the stream_id immediately (HTTP 202 Accepted).
- Client uses the stream_id to connect to a /stream/{stream_id} endpoint.
- Retry with same Idempotency-Key: Middleware sees the in_flight record, fetches the existing stream_id, and returns it (HTTP 202). The client reconnects to the same stream.
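A sketch of that first-request path, with a dict standing in for Redis and the function name `claim_stream` my own invention; what matters is that a retry gets the original stream_id back:

```python
import uuid

streams = {}  # idempotency_key -> stream_id (stand-in for the Redis record)

def claim_stream(idempotency_key: str) -> tuple:
    """Return (stream_id, http_status). Retries get the original stream_id."""
    if idempotency_key in streams:
        # Retry: hand back the existing stream so the client reconnects to it.
        return streams[idempotency_key], 202
    stream_id = str(uuid.uuid4())
    streams[idempotency_key] = stream_id
    # Here you'd enqueue the generation task (e.g., in Celery) keyed by stream_id.
    return stream_id, 202

sid1, _ = claim_stream("key-42")
sid2, _ = claim_stream("key-42")  # retry with the same Idempotency-Key
# sid1 == sid2: both point at the one in-flight stream
```

No second generation task is ever started; the duplicate request just gets a second ticket to the same show.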
Benchmark: Choosing Your Stream Protocol
| Protocol | Approx. Latency (Handshake + Per Message) | Best For | Idempotency Approach |
|---|---|---|---|
| WebSocket | ~12ms (2-way overhead) | Interactive chat, bidirectional control | Reconnect with same stream_id |
| SSE (EventSource) | ~8ms (simpler, one-way) | Simple token streaming, browser-native | Reconnect with same stream_id |
| Long Polling | 15-30ms per poll | Firewall-friendly, simplest clients | Return cached final result after LLM finishes |
Redis pub/sub latency is typically sub-millisecond on a local network, making it a solid backbone for relaying tokens from your Celery worker to your WebSocket manager — preferable to internal HTTP polling.
Testing: Simulating the Chaos
You must test this under failure. Use pytest and mocking to simulate races and timeouts.
```python
# test_idempotency.py
from fastapi.testclient import TestClient

import main  # Your FastAPI app


def test_duplicate_request_returns_cached_response():
    client = TestClient(main.app)
    idempotency_key = "test-key-123"

    # First request
    response1 = client.post(
        "/v1/chat/completions",
        json={"prompt": "Hello"},
        headers={"Idempotency-Key": idempotency_key},
    )
    assert response1.status_code == 200
    original_data = response1.json()

    # Duplicate request with same key (simulating retry)
    response2 = client.post(
        "/v1/chat/completions",
        json={"prompt": "Hello"},  # Same payload
        headers={"Idempotency-Key": idempotency_key},
    )
    assert response2.status_code == 200
    cached_data = response2.json()

    # Should be identical
    assert cached_data == original_data
    # Here you would also assert your LLM was called ONLY ONCE (mock it).


def test_concurrent_requests_trigger_conflict():
    # This test requires more careful async mocking:
    # use asyncio or mock the Redis lock to simulate a claimed key.
    pass
```
Also, test the error cleanup. Force a celery.exceptions.SoftTimeLimitExceeded in your worker and verify the idempotency key is released for a retry. In your Celery config, set task_soft_time_limit=280 and task_time_limit=300 for LLM tasks, giving them a 20-second window to clean up before the hard kill.
Next Steps: Integrating Into Your AI Architecture
Idempotency keys are one piece of the resilient AI backend. Here’s how they fit:
- API Gateway: Add the idempotency key validation layer here, before traffic even hits your application servers. This offloads the Redis check.
- Task Queues (Celery): The idempotency key should be passed to your Celery task. The first step of the task should check (using the same Redis pattern) if the task has already been completed by another worker. This prevents queue-level duplicates.
- Event-Driven Pipelines (Kafka): Use the idempotency key as the Kafka message key. This ensures all messages for the same logical request go to the same partition, giving you order and a simpler path to deduplication.
- Prompt Versioning: Your idempotency key can be derived from a hash of (user_id + prompt_template_version + input). This makes retries safe even across deployments.
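The derived-key idea from that last bullet might look like this — SHA-256 over the joined components, with the field names being illustrative:

```python
import hashlib

def derive_idempotency_key(user_id: str, template_version: str, user_input: str) -> str:
    # Deterministic: the same logical request yields the same key,
    # across retries AND across deployments.
    material = f"{user_id}:{template_version}:{user_input}".encode("utf-8")
    return hashlib.sha256(material).hexdigest()

key_a = derive_idempotency_key("user-7", "v3", "Summarize this contract")
key_b = derive_idempotency_key("user-7", "v3", "Summarize this contract")
# key_a == key_b — a retry after a deploy still hits the cached result
```

The trade-off versus client-generated UUIDs: two genuinely distinct submissions of identical input now collide, so only derive keys when "same input means same request" actually holds for your product.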
The final step is monitoring. Log the Idempotency-Key and track metrics: cache hit rate (good, saves tokens), 409 Conflict rate (indicates client retrying too fast), and the volume of stored keys. An unexpected climb in key count might mean your TTLs are too long or you have a memory leak.
Build this once, and you stop treating network reliability as a fantasy. You start building your AI backend for the real world, where packets get lost, phones switch towers, and users have itchy trigger fingers. The cost of implementation is a few hundred lines of code. The cost of not having it is angry users, spiraling API bills, and side effects you can't take back.