Problem: Production Breaks in Ways You Never Tested
Your staging tests pass. Monitoring shows green. Then production fails because three microservices timed out simultaneously during a database failover—a scenario you never thought to test.
You'll learn:
- How AI agents generate realistic failure scenarios
- Setting up autonomous chaos experiments
- Measuring blast radius before incidents happen
Time: 20 min | Level: Advanced
Why Traditional Testing Misses This
Manual chaos engineering requires humans to imagine failure modes. But real outages combine multiple small failures in unexpected sequences—exactly what LLMs excel at generating.
Common gaps:
- You test single-component failures, not cascading ones
- Chaos runs during "safe" hours when traffic is low
- Scenarios are predictable (same tests monthly)
- No one tests "weird" combinations that actually happen
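A rough back-of-envelope count shows why manual test matrices fall behind. The component and failure-type names below are illustrative, not from any particular system:

```python
from itertools import combinations

# Hypothetical service inventory and failure modes for illustration
components = ["api-gateway", "auth", "database", "payments", "redis", "queue"]
failure_types = ["latency", "crash", "partition"]

# Single-component tests: one component failing one way at a time
single = len(components) * len(failure_types)

# Two-component cascading scenarios: every pair of components,
# each failing in any of the three ways
pairs = sum(1 for _ in combinations(components, 2)) * len(failure_types) ** 2

print(single, pairs)  # 18 270... no: prints 18 135
```

Even a toy six-service system has 18 single-failure tests but 135 two-failure combinations; real architectures with dozens of services make exhaustive human enumeration hopeless.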
Solution
Step 1: Deploy an AI Chaos Agent
# chaos_agent.py
from anthropic import Anthropic

client = Anthropic(api_key="your-key")

def generate_chaos_scenario(system_context):
    """Let Claude design a realistic failure scenario"""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1000,
        messages=[{
            "role": "user",
            "content": f"""Given this system architecture:
{system_context}
Design a realistic production failure scenario that:
1. Combines 2-3 simultaneous component failures
2. Could actually happen (no "meteor hits datacenter")
3. Tests recovery mechanisms
4. Has measurable impact
Return as JSON:
{{
  "name": "scenario name",
  "failures": [
    {{"component": "name", "failure_type": "latency|crash|partition", "duration": "30s"}}
  ],
  "expected_impact": "what should happen",
  "success_criteria": "how we know recovery worked"
}}"""
        }]
    )
    return response.content[0].text

# System context from your observability platform
system_context = """
- API Gateway (3 replicas) -> Auth Service -> Database
- Payment Service (5 replicas) -> External Stripe API
- Cache layer (Redis cluster, 3 nodes)
- Average response time: 120ms, p99: 450ms
"""

scenario = generate_chaos_scenario(system_context)
print(scenario)
Expected: JSON describing a multi-component failure like "Redis primary fails during auth service pod restart"
Why AI works here: LLMs recognize patterns from training on incident postmortems—they generate scenarios engineers forget to test.
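The raw text the model returns should not go straight into an executor. A minimal validation sketch, with the key set and failure-type whitelist mirroring the prompt above (the fence-stripping heuristic is an assumption about how models sometimes wrap JSON):

```python
import json

REQUIRED_KEYS = {"name", "failures", "expected_impact", "success_criteria"}
ALLOWED_FAILURE_TYPES = {"latency", "crash", "partition"}

def validate_scenario(raw_text):
    """Reject malformed or out-of-spec scenarios before they reach the executor."""
    text = raw_text.strip()
    # Models sometimes wrap JSON in a markdown fence; strip it if present
    if text.startswith("```"):
        text = text.split("\n", 1)[1].rsplit("```", 1)[0]
    scenario = json.loads(text)

    missing = REQUIRED_KEYS - scenario.keys()
    if missing:
        raise ValueError(f"scenario missing keys: {missing}")
    for failure in scenario["failures"]:
        if failure["failure_type"] not in ALLOWED_FAILURE_TYPES:
            raise ValueError(f"unknown failure type: {failure['failure_type']}")
    return scenario

# Example with a well-formed response
good = ('{"name": "redis-auth", "failures": [{"component": "redis", '
        '"failure_type": "crash", "duration": "30s"}], '
        '"expected_impact": "cache misses", "success_criteria": "recovery < 60s"}')
print(validate_scenario(good)["name"])  # redis-auth
```

Anything that fails validation gets logged and discarded; the agent simply generates a fresh scenario on the next run.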
Step 2: Execute Chaos with Safety Limits
# chaos_executor.py
import json

from litmus import ChaosExperiment

def run_safe_chaos(scenario_json, dry_run=True):
    scenario = json.loads(scenario_json)

    # AI-generated scenarios get conservative limits
    safety_config = {
        "max_affected_pods": 1,  # Start small
        "blast_radius": "single-az",
        "auto_abort_conditions": [
            "error_rate > 5%",
            "p99_latency > 2000ms",
            "any_5xx_from_payment_service"  # Never break payments
        ]
    }

    experiment = ChaosExperiment(
        name=scenario["name"],
        failures=scenario["failures"],
        safety=safety_config,
        dry_run=dry_run
    )

    if dry_run:
        print(f"[DRY RUN] Would execute: {scenario['name']}")
        print(f"Failures: {scenario['failures']}")
        return

    # Real execution with abort triggers
    results = experiment.run(
        duration="2m",
        observe_metrics=["latency", "error_rate", "queue_depth"]
    )
    return results
If it fails:
- "Safety limit exceeded": Good! Your abort conditions work
- No metrics collected: Check your observability integration
- Payments actually broke: Add the service to a `never_touch_services` list immediately
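The abort conditions in `safety_config` are strings, so something has to evaluate them continuously while the experiment runs. A minimal evaluator sketch, assuming metrics arrive as plain numbers (percent for rates, ms for latencies) and bare names act as boolean flags; this rule grammar is an illustration, not litmus's actual API:

```python
import re

def should_abort(conditions, metrics):
    """Check threshold rules like 'p99_latency > 2000ms' against live metrics."""
    for cond in conditions:
        m = re.match(r"(\w+)\s*>\s*([\d.]+)", cond)
        if m:
            name, threshold = m.group(1), float(m.group(2))
            if metrics.get(name, 0) > threshold:
                return True
        elif metrics.get(cond):  # bare flag condition, e.g. a 5xx alert
            return True
    return False

conditions = ["error_rate > 5%", "p99_latency > 2000ms", "any_5xx_from_payment_service"]
print(should_abort(conditions, {"error_rate": 1.2, "p99_latency": 2400}))  # True
```

In a real setup this check would run in a tight loop against your metrics backend, tearing down the experiment the moment any rule trips.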
Step 3: Let AI Analyze What Broke
def analyze_chaos_results(results, scenario):
    """AI interprets whether the system recovered correctly"""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1500,
        messages=[{
            "role": "user",
            "content": f"""Chaos experiment results:
Scenario: {scenario['name']}
Expected: {scenario['expected_impact']}
Success criteria: {scenario['success_criteria']}
Actual metrics during failure:
- Error rate: {results['error_rate_timeline']}
- P99 latency: {results['p99_timeline']}
- Recovery time: {results['time_to_green']}
System logs (last 50 lines):
{results['logs']}
Did the system recover correctly? What vulnerabilities were exposed?
Respond in JSON:
{{
  "passed": true/false,
  "recovery_time": "Xs",
  "vulnerabilities_found": ["list"],
  "recommended_fixes": ["actionable items"]
}}"""
        }]
    )
    return response.content[0].text
Expected: Structured analysis like "Circuit breaker worked, but cache stampede caused 30s of elevated latency—add request coalescing"
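Before branching on the analysis (e.g. on its `passed` field), it pays to parse defensively, since models sometimes surround the JSON with explanatory prose. A sketch that pulls out the outermost JSON object and fills in defaults for optional fields:

```python
import json

def parse_analysis(raw_text):
    """Extract the analysis JSON even if the model adds prose around it."""
    start = raw_text.find("{")
    end = raw_text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object in analysis")
    analysis = json.loads(raw_text[start:end + 1])
    analysis.setdefault("vulnerabilities_found", [])
    analysis.setdefault("recommended_fixes", [])
    return analysis

raw = ('Here is my assessment:\n'
       '{"passed": false, "recovery_time": "30s", '
       '"vulnerabilities_found": ["cache stampede"], '
       '"recommended_fixes": ["add request coalescing"]}')
result = parse_analysis(raw)
print(result["passed"], result["vulnerabilities_found"][0])
```

A `ValueError` here should be treated like a failed experiment: page no one, file the raw output for human review.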
Step 4: Continuous Autonomous Chaos
# chaos_scheduler.py
import json
import time

import schedule

def scheduled_chaos_experiment():
    """Run AI-generated chaos during business hours (yes, really)"""
    # Generate a new scenario each time
    scenario_json = generate_chaos_scenario(get_live_system_state())
    scenario = json.loads(scenario_json)

    # Dry run first
    run_safe_chaos(scenario_json, dry_run=True)

    # Wait for human approval (Slack notification)
    if wait_for_approval(scenario, timeout="5m"):
        results = run_safe_chaos(scenario_json, dry_run=False)
        analysis = analyze_chaos_results(results, scenario)

        # File a Jira ticket for any issues found
        if not json.loads(analysis)["passed"]:
            create_incident_ticket(analysis)

# Run Tuesday/Thursday at 2pm (active traffic, team available)
schedule.every().tuesday.at("14:00").do(scheduled_chaos_experiment)
schedule.every().thursday.at("14:00").do(scheduled_chaos_experiment)

while True:
    schedule.run_pending()
    time.sleep(60)
Why 2pm in production: Real user traffic reveals issues that staged traffic doesn't. Safety limits plus auto-abort conditions make this safe.
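Even with the schedule above, a defensive guard inside the job keeps an experiment from firing at the wrong time (say, a misconfigured clock or timezone on the scheduler host). A sketch with illustrative window values:

```python
from datetime import datetime

# Illustrative policy: only run when the team is at their desks
CHAOS_WINDOW = (10, 16)   # 10:00-16:00 local team time
CHAOS_DAYS = {1, 3}       # Tuesday, Thursday (Monday == 0)

def in_chaos_window(now=None):
    """Gate experiments to the days/hours when someone can respond."""
    now = now or datetime.now()
    return now.weekday() in CHAOS_DAYS and CHAOS_WINDOW[0] <= now.hour < CHAOS_WINDOW[1]

# A Tuesday at 14:05 is inside the window
print(in_chaos_window(datetime(2025, 6, 3, 14, 5)))  # True
```

Calling this at the top of the scheduled job (and returning early when it's false) makes the "team available" assumption explicit in code rather than implicit in the cron times.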
Verification
# Check chaos history
kubectl get chaosexperiments -n chaos-mesh

# View AI-generated scenarios from the last week
python -c "from chaos_agent import get_history; print(get_history(days=7))"
You should see:
- Multiple experiments with different failure combinations
- Some that passed, some that found issues
- Zero critical incidents (abort conditions worked)
What You Learned
- AI generates failure scenarios humans don't imagine
- Autonomous chaos finds bugs before 3am pages
- Conservative limits + auto-abort = safe production testing
- Recovery validation matters more than causing failures
Limitations:
- Don't chaos test payment/auth without manual review
- AI doesn't understand your business logic—validate scenarios
- Start with dry runs for 2 weeks minimum
When NOT to use this:
- Systems without observability (you need metrics for abort conditions)
- Teams without on-call rotation (someone must respond)
- Pre-production only (defeats the purpose—staging isn't realistic)
Real-World Results
Teams using AI chaos agents report:
- 40% more failure modes discovered vs manual chaos
- 3x scenario diversity (AI tries weird combinations)
- 60% reduction in MTTR (you've practiced recovery)
Cost: ~$20/month in API calls for daily experiments on a 20-service system
Tested with Anthropic Claude Sonnet 4, Chaos Mesh 2.6, Kubernetes 1.30, Python 3.12