Fix LLM API Timeout Errors in Production in 15 Minutes

Stop large language model API calls from timing out in production. Fix streaming, retry logic, and gateway config for reliable LLM responses.

Problem: Your LLM API Calls Keep Timing Out in Production

Everything works locally, but in production your LLM API calls fail with ETIMEDOUT, 504 Gateway Timeout, or just hang indefinitely. Larger models with long outputs are the worst offenders.

You'll learn:

  • Why LLM APIs time out differently from regular HTTP calls
  • How to fix this with streaming and proper timeout config
  • How to add retry logic that doesn't make things worse

Time: 15 min | Level: Intermediate


Why This Happens

LLM APIs generate tokens sequentially. A response that takes 45 seconds to complete holds a connection open the entire time. Most HTTP clients, proxies, and load balancers default to timeouts in the 30–100 second range — often shorter than a long LLM response.
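A quick back-of-envelope check makes the mismatch concrete (the throughput figure is an assumption for illustration, not a measured number):

```typescript
// Back-of-envelope: generation is sequential, so long outputs outlive
// common proxy defaults. Throughput below is illustrative, not measured.
const tokensPerSecond = 30;   // assumed average generation speed
const responseTokens = 2000;  // a long-form answer
const secondsToComplete = responseTokens / tokensPerSecond; // ~67s

const nginxDefaultReadTimeout = 60; // Nginx proxy_read_timeout default, in seconds
console.log(secondsToComplete > nginxDefaultReadTimeout); // true: the proxy gives up first
```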

Common symptoms:

  • Requests succeed locally, fail in production behind a load balancer
  • Errors appear on longer prompts, not short ones
  • 504 Gateway Timeout from Nginx/Cloudflare, not the LLM provider
  • Client-side ETIMEDOUT or socket hang up

The real culprits are usually your gateway timeout (not the LLM provider) and your client not using streaming.
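One way to tell the culprits apart before changing anything is to classify the error you actually catch. A minimal sketch, assuming the status/code fields exposed by Node's fetch and socket errors (classifyTimeout is a hypothetical helper, not part of any SDK):

```typescript
type CaughtError = { status?: number; code?: string };

// Map common failure shapes to the layer that most likely caused them.
// This is a heuristic: a 504 can also originate upstream, but in practice
// it usually means your own proxy gave up first.
function classifyTimeout(err: CaughtError): string {
  if (err.status === 504 || err.status === 502) {
    return "gateway: your proxy/load balancer gave up -- raise its read/idle timeout";
  }
  if (err.code === "ETIMEDOUT" || err.code === "ECONNRESET") {
    return "client: socket-level timeout -- enable streaming and raise the client timeout";
  }
  if (err.status === 429 || (err.status !== undefined && err.status >= 500)) {
    return "provider: upstream overload -- back off and retry";
  }
  return "unknown: log the full error and inspect";
}
```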


Solution

Step 1: Switch to Streaming

Without streaming, no bytes reach the client until the full response is generated, so every idle timeout along the path keeps ticking. With streaming, tokens arrive incrementally — each chunk resets idle timers and lets you process output as it arrives.

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

async function streamLLMResponse(prompt: string): Promise<string> {
  let fullText = "";

  // Streaming keeps the connection alive by receiving chunks continuously
  // instead of waiting for one large response to complete
  const stream = await client.messages.stream({
    model: "claude-opus-4-6",
    max_tokens: 4096,
    messages: [{ role: "user", content: prompt }],
  });

  for await (const chunk of stream) {
    if (
      chunk.type === "content_block_delta" &&
      chunk.delta.type === "text_delta"
    ) {
      fullText += chunk.delta.text;
      process.stdout.write(chunk.delta.text); // Optional: real-time output
    }
  }

  return fullText;
}

Expected: Tokens appear progressively; no more hanging on large responses.

If it fails:

  • stream is not async iterable: Update your SDK — older versions don't support for await on streams.
  • Still timing out: Your gateway is killing the connection. Move to Step 2.
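Upgrading the SDK is usually a one-liner (assuming npm; use your package manager's equivalent):

```shell
# Pull the latest SDK, which supports for await on streams
npm install @anthropic-ai/sdk@latest
```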

Step 2: Fix Your Gateway Timeout

Even with streaming, a proxy sitting in front of your app will kill connections that idle too long between chunks. Fix this at the infrastructure level.

Nginx:

location /api/llm {
    proxy_pass http://your-app;

    # Standard timeout — not enough for LLMs
    # proxy_read_timeout 60s;

    # Extend to handle long LLM responses
    proxy_read_timeout 300s;
    proxy_send_timeout 300s;
    proxy_connect_timeout 30s;

    # Critical: disable buffering so streaming chunks pass through immediately
    proxy_buffering off;
    proxy_cache off;
}

AWS ALB / Cloudflare: Set idle timeout to 300s in your load balancer settings. Cloudflare's default 100s timeout requires a paid plan to extend — if you're on the free plan, bypassing the Cloudflare proxy (DNS-only mode) for that route is your only option.
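For ALB specifically, the idle timeout can be raised with the AWS CLI (the ARN below is a placeholder for your load balancer):

```shell
# Raise the ALB idle timeout to 300s so streaming LLM responses survive
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn arn:aws:elasticloadbalancing:REGION:ACCOUNT:loadbalancer/app/NAME/ID \
  --attributes Key=idle_timeout.timeout_seconds,Value=300
```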

Expected: Gateway no longer kills long-running LLM connections.


Step 3: Add Smart Retry Logic

Retrying blindly on timeout makes things worse — you'll stack duplicate requests during a slow period. Use exponential backoff and only retry on specific error types.

type RetryableError = {
  status?: number;
  code?: string;
};

async function llmWithRetry(
  prompt: string,
  maxRetries = 3
): Promise<string> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await streamLLMResponse(prompt);
    } catch (err) {
      const error = err as RetryableError;
      const isLastAttempt = attempt === maxRetries - 1;

      // Don't retry on auth errors or invalid requests — they won't self-heal
      const isFatal = error.status === 401 || error.status === 400;

      if (isLastAttempt || isFatal) throw err;

      // Only retry on rate limits (429), server errors (5xx), or timeouts
      const isRetryable =
        error.status === 429 ||
        (error.status !== undefined && error.status >= 500) ||
        error.code === "ETIMEDOUT";

      if (!isRetryable) throw err;

      // Exponential backoff: 1s, 2s, 4s...
      const delay = Math.pow(2, attempt) * 1000;
      console.log(`Attempt ${attempt + 1} failed. Retrying in ${delay}ms...`);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }

  throw new Error("Max retries exceeded");
}

If it fails:

  • Retries triggering rate limits: Add jitter — delay + Math.random() * 1000 — to prevent synchronized retries across multiple instances.
  • Duplicate responses in DB: Add an idempotency key per request before retrying.
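The jitter fix can be sketched as a small helper using "full jitter" (a delay drawn uniformly between zero and the exponential cap), which desynchronizes retries more thoroughly than adding a fixed random offset:

```typescript
// Full-jitter backoff: random delay in [0, min(cap, base * 2^attempt)].
// Concurrent instances pick different delays, avoiding retry stampedes.
function backoffWithJitter(
  attempt: number,
  baseMs = 1000,
  capMs = 30_000
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.random() * ceiling;
}
```

Drop this in place of the fixed Math.pow(2, attempt) * 1000 delay inside llmWithRetry.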

Step 4: Set a Client-Side Abort Timeout

Give every request a hard ceiling so a truly stuck call doesn't block your server indefinitely.

async function llmWithTimeout(
  prompt: string,
  timeoutMs = 120_000 // 2 minutes max per request
): Promise<string> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);

  try {
    // Pass the signal through to your HTTP client
    const stream = await client.messages.stream(
      {
        model: "claude-opus-4-6",
        max_tokens: 4096,
        messages: [{ role: "user", content: prompt }],
      },
      { signal: controller.signal }
    );

    let fullText = "";
    for await (const chunk of stream) {
      if (
        chunk.type === "content_block_delta" &&
        chunk.delta.type === "text_delta"
      ) {
        fullText += chunk.delta.text;
      }
    }
    return fullText;
  } finally {
    clearTimeout(timer); // Always clean up the timer
  }
}

Expected: Requests abort cleanly after timeoutMs instead of hanging.


Verification

Run a stress test with a long prompt to confirm everything holds:

# Send 5 concurrent long-output requests
for i in {1..5}; do
  curl -X POST http://localhost:3000/api/llm \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Write a detailed 2000-word technical essay on distributed systems."}' \
    --max-time 180 &
done
wait
echo "All requests completed"

You should see: All 5 requests complete without timeout errors, with streaming output arriving progressively.

Check your gateway logs for any lingering 504s:

# Nginx
grep "504" /var/log/nginx/error.log | tail -20

# Or watch live
tail -f /var/log/nginx/error.log | grep -E "timeout|504"

What You Learned

  • LLM APIs need longer timeouts than standard HTTP because generation is sequential and slow
  • Streaming is the most effective fix — it keeps connections alive and unblocks gateways
  • Proxy buffering must be disabled or streaming chunks won't pass through in real time
  • Only retry on 429 and 5xx errors — retrying on 400 or 401 wastes time and quota
  • Always pair retries with exponential backoff and a hard client-side abort timer

Limitation: Streaming requires your client (frontend or downstream service) to handle chunked responses. If you're returning LLM output to a mobile app or legacy client, you may need a buffering layer that streams from the LLM but returns a complete response downstream.

When NOT to use this approach: If your use case requires the full response before any processing (e.g., JSON parsing the entire output), streaming complicates your code. In that case, focus on gateway timeout fixes and retry logic instead, and keep prompts short enough to complete within 60 seconds.


Tested on Anthropic SDK 0.39+, OpenAI SDK 4.x, Node.js 22, Nginx 1.26, AWS ALB