When you can't SSH into production but need to fix it now
Problem: Production is Down and You Have Zero Access
Your server returns 500 errors. You don't have SSH access, can't read logs, and your DevOps team is asleep. You need to debug using only HTTP responses and AI analysis.
You'll learn:
- How to extract diagnostic info from error responses
- Using AI to analyze HTTP headers and behavior patterns
- Quick fixes you can deploy without server access
Time: 12 min | Level: Intermediate
Why This Happens
A 500 error means the server hit an unhandled error while processing your request, but the response still carries clues: timing patterns, partial headers, and error-page content can reveal the root cause.
Common symptoms:
- Intermittent 500s (not all requests fail)
- Specific endpoints always fail
- Error pages with generic messages
- No access to application logs
Solution
Step 1: Capture the Full Response
# Get complete response with timing
curl -i -w "\nTime: %{time_total}s\n" https://your-api.com/endpoint
# Test multiple times to see patterns
for i in {1..5}; do
curl -s -o /dev/null -w "%{http_code} - %{time_total}s\n" https://your-api.com/endpoint
sleep 1
done
Expected: a status code and total time for each request. Consistent failure timings (e.g., every 500 arriving at ~30s) hint at a timeout rather than a random crash.
Save this output - you'll feed it to AI next.
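If the pattern isn't obvious from five requests, a slightly larger sample helps. Here's a minimal sketch (the URL and the `samples.txt` filename are placeholders) that captures ten samples to a file you can paste into an AI prompt, then tallies the status codes:

```shell
# Capture 10 status/timing samples into a file for the AI prompt.
# URL and OUT are placeholders - substitute your own values.
URL="https://your-api.com/endpoint"
OUT="samples.txt"
: > "$OUT"                       # truncate any previous run
for i in $(seq 1 10); do
  curl -s -o /dev/null -w "%{http_code} %{time_total}\n" "$URL" >> "$OUT"
  sleep 1
done
# Quick tally of status codes seen, e.g. "3 200" and "7 500"
cut -d' ' -f1 "$OUT" | sort | uniq -c
```

The tally alone often tells you whether you're dealing with an intermittent failure (mixed codes) or a hard failure on that endpoint (all 500s).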
Step 2: Extract Diagnostic Headers
# Check for server software fingerprints
# (-I sends a HEAD request; note HTTP/2 header names are lowercase)
curl -sI https://your-api.com/endpoint | grep -iE "(server|x-|via|cf-)"
Look for:
- Server: nginx/1.18.0 (server software and version)
- X-Runtime: 30.001 (timing that hints at a timeout)
- CF-Cache-Status: MISS (CDN behavior)
- X-Request-ID: abc123 (ID for tracing failed requests)
Step 3: Analyze with AI
Create a diagnostic prompt for Claude or GPT-4:
I'm getting 500 errors on production. I don't have log access.
Here's what I captured:
**Error Pattern:**
- Endpoint: POST /api/process
- Response time: 30.2s, 29.8s, 30.1s (consistent)
- HTTP 500 every time
- Works fine: GET /api/status
**Headers:**
HTTP/2 500
server: nginx/1.18.0
x-runtime: 30.001
connection: close
**Error page excerpt:**
[Paste any HTML error content]
What's the likely cause and how can I fix it without SSH access?
Why this works: AI models recognize timeout signatures (30s is a common gateway default), server configurations, and deployment issues from response metadata alone.
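You can assemble that prompt automatically from live captures. A sketch, assuming a `prompt.txt` output file and placeholder URL (both arbitrary names to adapt):

```shell
# Build the diagnostic prompt from live captures, then paste
# prompt.txt into Claude or GPT-4. URL and prompt.txt are placeholders.
URL="https://your-api.com/api/process"
HEADERS=$(curl -sI "$URL" | head -n 10)
TIMES=$(for i in 1 2 3; do curl -s -o /dev/null -w "%{time_total}s " "$URL"; done)
cat > prompt.txt <<EOF
I'm getting 500 errors on production. I don't have log access.
Endpoint: $URL
Response times: $TIMES
Headers:
$HEADERS
What's the likely cause and how can I fix it without SSH access?
EOF
```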
Step 4: Test AI Hypothesis
AI will suggest causes like:
Hypothesis 1: Gateway Timeout (30s limit)
# Test with immediate response endpoint
curl -w "Time: %{time_total}s\n" https://your-api.com/api/health
# If health check works but process endpoint times out:
# Problem is request processing, not server crash
Hypothesis 2: Rate Limiting
# Check for rate limit headers
curl -I https://your-api.com/endpoint | grep -i "rate\|limit\|retry"
# Test from a different source IP (requires a second interface or VPN;
# replace eth1 with an interface that has its own address)
curl --interface eth1 https://your-api.com/endpoint
Hypothesis 3: Database Connection Pool Exhausted
# Rapid requests should all fail if pool is depleted
for i in {1..20}; do
curl -s -o /dev/null -w "%{http_code}\n" https://your-api.com/endpoint &
done
wait
# If first few succeed, then all fail = connection pool issue
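To make the pool-exhaustion signal easier to read, tally the concurrent results instead of eyeballing twenty lines. A sketch using a temp file (concurrent appends of short lines are safe enough for a quick diagnostic):

```shell
# Fire 20 concurrent requests, then tally status codes.
# A few 200s followed by a wall of 500s suggests pool exhaustion.
URL="https://your-api.com/endpoint"
OUT=$(mktemp)
for i in $(seq 1 20); do
  curl -s -o /dev/null -w "%{http_code}\n" "$URL" >> "$OUT" &
done
wait
sort "$OUT" | uniq -c     # e.g. "  5 200" then " 15 500"
rm -f "$OUT"
```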
Step 5: Apply Fix Without Server Access
Based on AI diagnosis, try these remote fixes:
For timeouts:
# If you have API access to your reverse proxy - e.g. Cloudflare's
# proxy_read_timeout setting (available on Enterprise plans)
curl -X PATCH https://api.cloudflare.com/client/v4/zones/YOUR_ZONE/settings/proxy_read_timeout \
-H "Authorization: Bearer $TOKEN" \
-d '{"value": 60}' # Increase timeout to 60s
For rate limits:
# Flush CDN cache to reset counters
curl -X POST https://api.cloudflare.com/client/v4/zones/YOUR_ZONE/purge_cache \
-H "Authorization: Bearer $TOKEN" \
-d '{"purge_everything":true}'
For database issues (emergency):
# If you have database admin panel access
# Restart connection pooler via control panel UI
# Or trigger an app restart via a deployment webhook
# (e.g. a Vercel Deploy Hook; the path below is illustrative -
# copy the real hook URL from your project settings)
curl -X POST "https://api.vercel.com/v1/integrations/deploy/YOUR_PROJECT_ID/YOUR_HOOK_ID"
Verification
# Test the fix
curl -w "\nStatus: %{http_code}\nTime: %{time_total}s\n" \
https://your-api.com/endpoint
# Monitor for 2 minutes
watch -n 5 'curl -s -o /dev/null -w "%{http_code}\n" https://your-api.com/endpoint'
You should see: 200 status codes with reasonable response times (<5s).
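Note that `watch` isn't preinstalled everywhere (macOS, minimal containers). A portable sketch that polls for roughly two minutes and bails out early if the endpoint keeps failing (the 3-failure threshold is an arbitrary choice):

```shell
# Poll every 5s for ~2 minutes; stop early after 3 consecutive failures.
URL="https://your-api.com/endpoint"
FAILS=0
for i in $(seq 1 24); do              # 24 x 5s = 2 minutes
  CODE=$(curl -s -o /dev/null -w "%{http_code}" "$URL")
  echo "$(date +%T) $CODE"
  if [ "$CODE" = "200" ]; then FAILS=0; else FAILS=$((FAILS + 1)); fi
  if [ "$FAILS" -ge 3 ]; then
    echo "Still failing - revisit the diagnosis"
    break
  fi
  sleep 5
done
```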
What You Learned
- HTTP response timing reveals timeout issues (watch for 30s, 60s, 120s patterns)
- Headers contain server fingerprints AI can analyze
- CDN and deployment APIs let you fix issues remotely
- Testing patterns (rapid requests, different IPs) narrow root cause
Limitations:
- Can't fix code bugs without deployment access
- Database corruption needs direct server access
- Security breaches require immediate log review
AI Prompt Templates
For Intermittent Errors
Intermittent 500 errors on [endpoint]:
- Success rate: 70% (7 out of 10 requests succeed)
- No pattern by time of day
- Response times: Fast (0.2s) when working, timeout (30s) when failing
- Headers: [paste]
Diagnose the issue and suggest remote fixes.
For Specific Endpoint Failures
Only POST /api/upload fails with 500:
- GET requests work fine
- Small payloads (<1MB) succeed
- Large payloads (>5MB) always fail at 30s
- Headers: [paste]
What's breaking and how do I fix it via API/CDN controls?
For Complete Outages
All endpoints return 500:
- Started at [timestamp]
- Last successful deploy: [timestamp]
- Recent changes: [if known]
- Error page shows: [paste]
- Headers: [paste]
Is this deployment, infrastructure, or external service failure?
Pro tip: Save these curl commands in a debug-500.sh script. When production breaks, you'll have diagnostics ready in 30 seconds.
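A starting point for that script (the default URL and section layout are placeholders to adapt):

```shell
#!/bin/sh
# debug-500.sh - one-shot 500-error diagnostics bundle.
# Usage: ./debug-500.sh https://your-api.com/endpoint
URL="${1:-https://your-api.com/endpoint}"   # placeholder default

echo "== Headers =="
curl -sI "$URL"

echo "== Status/timing (5 samples) =="
for i in $(seq 1 5); do
  curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" "$URL"
done

echo "== Rate-limit headers =="
curl -sI "$URL" | grep -iE 'rate|limit|retry' || echo "(none found)"
```

Run it once while things are healthy, too, so you have a known-good baseline to compare against during an incident.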
Tested with nginx 1.24+, Cloudflare, Vercel, AWS Lambda. Works with Claude 3.5 Sonnet, GPT-4.