Kubernetes for Agents: Orchestrating Thousands of AI Workers

Scale AI agent fleets to thousands of workers using Kubernetes. Learn pod scheduling, autoscaling, and fault tolerance for production agent systems.

Problem: Your AI Agents Don't Scale

You've got AI agents working in development. Now you need a hundred of them. Then a thousand. Running agent workloads at scale isn't just "more pods" — agents have unique scheduling needs, long-running tasks, unpredictable memory spikes, and stateful execution that breaks standard Kubernetes patterns.

You'll learn:

  • How to model agent workloads as Kubernetes Jobs and Deployments correctly
  • Autoscaling strategies tuned for LLM inference latency and queue depth
  • Fault tolerance patterns that prevent cascading failures across agent fleets

Time: 30 min | Level: Advanced


Why This Happens

Standard Kubernetes is built around stateless, short-lived, CPU-predictable workloads. AI agents are the opposite — they hold context, make long outbound calls, consume memory unpredictably, and fail in ways that aren't HTTP 500s.

Common symptoms:

  • Agents OOMKilled mid-task with no retry
  • Autoscaler provisions pods faster than GPU/LLM capacity can serve them
  • One bad agent floods the entire cluster with retries
  • Jobs complete but results are lost because the pod was evicted

The fix requires rethinking resource models, queue integration, and pod lifecycle management from the ground up.

Figure: Kubernetes agent fleet architecture diagram. Three-layer architecture: queue → dispatcher → agent worker pool.


Solution

Step 1: Model Agent Tasks as Jobs, Not Deployments

Agents do discrete work. Use Job resources with completionMode: Indexed so each agent has a unique identity and Kubernetes tracks completion properly.

# agent-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: agent-batch-run
  namespace: agents         # Same namespace as the PDB, NetworkPolicy, and secrets used later
  labels:
    app: ai-agent
    team: platform
spec:
  completions: 100          # Total agents to run
  parallelism: 20           # Run 20 at a time
  completionMode: Indexed   # Each pod gets a unique index via JOB_COMPLETION_INDEX
  backoffLimit: 3           # Retry failed pods up to 3x before marking job failed
  ttlSecondsAfterFinished: 3600  # Clean up 1 hour after completion
  template:
    metadata:
      labels:
        app: ai-agent       # Pod-level label; Job-level labels are not copied to pods
    spec:
      restartPolicy: OnFailure
      containers:
        - name: agent
          image: your-registry/ai-agent:1.4.2
          resources:
            requests:
              memory: "2Gi"
              cpu: "500m"
            limits:
              memory: "4Gi"   # Allow burst for large context windows
              cpu: "2000m"
          env:
            - name: AGENT_INDEX
              valueFrom:
                fieldRef:
                  fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
            - name: TASK_QUEUE_URL
              valueFrom:
                secretKeyRef:
                  name: agent-secrets
                  key: queue-url

Expected: Kubernetes runs up to 20 pods at a time until all 100 indexes complete, retries failures, and deletes the Job an hour after it finishes.

If it fails:

  • OOMKilled: Increase limits.memory. 4Gi is a starting point; profile your agent's peak usage
  • Pods stuck Pending: Check node capacity with kubectl describe nodes | grep -A 6 Allocatable
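
As a rough illustration of how a worker might turn its completion index into a slice of work, the sketch below splits a task list across the 100 completions from the manifest. The function name shard_for_index, the round-robin split, and the task IDs are assumptions made for illustration; they are not part of the agent image referenced above.

```python
import os


def shard_for_index(tasks, index, total):
    """Return the tasks owned by worker `index` out of `total` workers (round-robin)."""
    if not 0 <= index < total:
        raise ValueError(f"index {index} outside 0..{total - 1}")
    return tasks[index::total]


# Kubernetes sets JOB_COMPLETION_INDEX in Indexed Jobs; default to 0 for local runs
index = int(os.environ.get("JOB_COMPLETION_INDEX", "0"))
tasks = [f"task-{n}" for n in range(250)]   # Hypothetical task IDs
mine = shard_for_index(tasks, index, total=100)
```

Because every index maps to a disjoint slice, a retried pod with the same index picks up exactly the same tasks, which is what makes index-keyed checkpoints workable.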

Step 2: Set Up Queue-Based Autoscaling with KEDA

Don't scale on CPU. Agents sit idle waiting on LLM responses — CPU stays low while work piles up. Scale on queue depth instead using KEDA.

# Install KEDA
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace

# agent-scaledjob.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: agent-scaledjob
  namespace: agents
spec:
  jobTargetRef:
    template:
      metadata:
        labels:
          app: ai-agent     # Lets the PDB and NetworkPolicy selectors match these pods
      spec:
        containers:
          - name: agent
            image: your-registry/ai-agent:1.4.2
            resources:
              requests:
                memory: "2Gi"
                cpu: "500m"
              limits:
                memory: "4Gi"
                cpu: "2000m"
        restartPolicy: OnFailure
  
  pollingInterval: 15       # Check queue every 15 seconds
  maxReplicaCount: 500      # Hard ceiling — protect downstream LLM APIs
  successfulJobsHistoryLimit: 5
  failedJobsHistoryLimit: 10
  
  triggers:
    - type: rabbitmq        # Or sqs, redis, kafka — KEDA supports 60+ sources
      metadata:
        protocol: amqp
        queueName: agent-tasks
        mode: QueueLength
        value: "5"          # 1 agent pod per 5 queued messages
      authenticationRef:
        name: rabbitmq-auth

# rabbitmq-auth.yaml: credentials come from a Secret rather than inline trigger metadata
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: rabbitmq-auth
  namespace: agents
spec:
  secretTargetRef:
    - parameter: host
      name: agent-secrets
      key: rabbitmq-host

Expected: Queue depth of 50 messages → KEDA provisions 10 agent pods within ~30 seconds.

If it fails:

  • ScaledJob not triggering: Run kubectl describe scaledjob agent-scaledjob -n agents and check the Conditions section for authentication errors
  • Over-provisioning: Lower maxReplicaCount, or raise value so each pod covers more queued messages
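
The arithmetic behind these numbers can be sketched in a few lines. This is a simplified model of how KEDA's default ScaledJob strategy is understood to behave, not KEDA's actual code, and the function name jobs_to_start is hypothetical.

```python
import math


def jobs_to_start(queue_length, value, max_replica_count, running_jobs=0):
    """Approximate how many new Jobs would be created on one polling cycle."""
    # Target: one pod per `value` queued messages, capped by maxReplicaCount
    target = min(math.ceil(queue_length / value), max_replica_count)
    # Jobs that are already running count against the target
    return max(target - running_jobs, 0)


print(jobs_to_start(50, 5, 500))                   # 10
print(jobs_to_start(10, 5, 500))                   # 2
print(jobs_to_start(10_000, 5, 500))               # 500
print(jobs_to_start(50, 5, 500, running_jobs=4))   # 6
```

The third call shows why maxReplicaCount matters: without the cap, a backlog of 10,000 messages would ask for 2,000 pods at once.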

Figure: KEDA scaling dashboard showing queue depth vs pod count. Queue depth (blue) drives pod count (orange); CPU plays no role.


Step 3: Prevent Cascading Failures with PodDisruptionBudgets and Circuit Breakers

One bad model endpoint can cause every agent to retry simultaneously, DDoSing your infrastructure. You need two things: protect running agents from eviction, and rate-limit retries at the cluster level.

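# agent-pdb.yaml: keep a floor of running agents during voluntary disruptions (node drains)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: agent-pdb
  namespace: agents
spec:
  # Job pods are not behind a scale-aware controller, so a PDB covering them only
  # supports an integer minAvailable (no percentages, no maxUnavailable).
  # 18 is an assumption: 90% of the parallelism of 20 from Step 1. Tune it to your
  # fleet; while fewer pods than this are running, drains wait for them to finish.
  minAvailable: 18
  selector:
    matchLabels:
      app: ai-agent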

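For retry storms, add exponential backoff with jitter directly into your agent container. Pod restarts are capped by the Job's backoffLimit, not by a LimitRange; the LimitRange below instead caps what each container can request, so a misbehaving agent cannot starve its neighbors: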

# namespace-limits.yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: agent-limits
  namespace: agents
spec:
  limits:
    - type: Container
      default:
        memory: "2Gi"
        cpu: "1000m"
      defaultRequest:
        memory: "1Gi"
        cpu: "250m"
      max:
        memory: "8Gi"      # Hard ceiling — prevents runaway context accumulation
        cpu: "4000m"
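
The backoff the agent container needs can be sketched as follows. This is a minimal example of exponential backoff with full jitter, assuming the agent wraps its LLM calls in a plain Python callable; the names backoff_delays and call_with_backoff are hypothetical.

```python
import random
import time


def backoff_delays(base=1.0, cap=60.0, retries=5, rng=random.random):
    """Yield one sleep duration per retry: exponential growth with full jitter."""
    for attempt in range(retries):
        ceiling = min(cap, base * (2 ** attempt))
        # A random point below the ceiling keeps agents from retrying in lockstep
        yield rng() * ceiling


def call_with_backoff(fn, retries=5, base=1.0, cap=60.0, sleep=time.sleep):
    """Call fn(); on failure wait and retry, re-raising once retries run out."""
    delays = backoff_delays(base, cap, retries)
    while True:
        try:
            return fn()
        except Exception:
            delay = next(delays, None)
            if delay is None:
                raise   # Retries exhausted: surface the last error
            sleep(delay)
```

An agent would wrap each outbound call, for example call_with_backoff(lambda: send_prompt(prompt)), where send_prompt stands in for whatever client function the agent uses. The jitter matters more than the exponent here: without it, a thousand agents that failed together also retry together.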

Add a NetworkPolicy to isolate agent pods so a compromised agent can't reach internal services:

# agent-netpol.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: agent-isolation
  namespace: agents
spec:
  podSelector:
    matchLabels:
      app: ai-agent
  policyTypes:
    - Ingress
    - Egress
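  egress:
    - to:                    # Agents can only reach the LLM gateway and the task queue
      - namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: llm-gateway    # Label set automatically on every namespace
      - namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: message-queue
    - ports:                 # Allow DNS over both UDP and TCP
      - port: 53
        protocol: UDP
      - port: 53
        protocol: TCP
  # No ingress rules are listed, so all inbound traffic is denied. That includes
  # Prometheus scrapes; add an ingress rule for your monitoring namespace if needed.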

Expected: Evictions are throttled, runaway pods are memory-capped, and blast radius is contained to the agents namespace.
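
The manifests in this step do not implement a circuit breaker themselves; that logic lives inside the agent. The following is a minimal per-process sketch, assuming each agent tracks failures for a single endpoint. The class names and thresholds are hypothetical, and a fleet-wide breaker would need shared state that this example does not provide.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and calls are being rejected."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("endpoint marked unhealthy, skipping call")
            # Cool-down elapsed: let one trial call through (half-open)
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()   # Open (or re-open) the circuit
            raise
        # A success closes the circuit and clears the failure count
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open the agent fails fast instead of adding load to an endpoint that is already struggling, which is the behavior that stops one bad endpoint from turning into a fleet-wide retry storm.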


Step 4: Persist Agent State Before Pod Death

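Agents die mid-task. When Kubernetes evicts a pod it sends SIGTERM, then SIGKILL once terminationGracePeriodSeconds (30 seconds by default) has elapsed. Use that window to checkpoint state. An OOM kill sends SIGKILL immediately, so also checkpoint periodically rather than relying on the signal handler alone.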

# agent/main.py
import signal
import sys
import json
import os

class Agent:
    def __init__(self):
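        self.checkpoint_path = f"/checkpoints/agent-{os.environ['AGENT_INDEX']}.json"
        # Load state before installing the handler so an early SIGTERM has something to save
        self.state = self.load_checkpoint()
        signal.signal(signal.SIGTERM, self.handle_sigterm)

    def handle_sigterm(self, signum, frame):
        # By default Kubernetes sends SIGKILL 30s after SIGTERM, so checkpoint right away
        print("SIGTERM received, checkpointing state", flush=True)
        self.save_checkpoint()
        # Exit non-zero (128 + SIGTERM) so the Job treats this pod as unfinished and
        # reschedules the work; exiting 0 would mark the index as complete
        sys.exit(143)

    def save_checkpoint(self):
        # Write to a temp file, then rename, so a kill mid-write cannot leave a torn file
        tmp_path = self.checkpoint_path + ".tmp"
        with open(tmp_path, 'w') as f:
            json.dump(self.state, f)
        os.replace(tmp_path, self.checkpoint_path)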

    def load_checkpoint(self):
        if os.path.exists(self.checkpoint_path):
            with open(self.checkpoint_path) as f:
                return json.load(f)  # Resume from where we left off
        return {"step": 0, "results": []}

Mount a PersistentVolumeClaim so checkpoints survive pod death:

# In your Job spec
volumes:
  - name: checkpoints
    persistentVolumeClaim:
      claimName: agent-checkpoints
containers:
  - name: agent
    volumeMounts:
      - name: checkpoints
        mountPath: /checkpoints

# agent-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: agent-checkpoints
  namespace: agents
spec:
  accessModes:
    - ReadWriteMany      # Multiple pods can write simultaneously
  resources:
    requests:
      storage: 10Gi
  storageClassName: efs  # AWS EFS or equivalent shared storage

Expected: Pod evicted → new pod starts → reads checkpoint → continues from step N instead of step 0.
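
The resume behavior can be exercised locally without a cluster. The sketch below is a simplified stand-in for the Agent class, assuming the work is a sequence of numbered steps; the run function and its stop_after parameter are hypothetical and exist only to simulate a pod dying partway through.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: temp file in the same directory, then rename."""
    tmp_path = f"{path}.tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "results": []}


def run(path, steps, do_step, stop_after=None):
    """Run steps from the checkpointed position, saving after each one."""
    state = load_checkpoint(path)
    for n in range(state["step"], steps):
        if stop_after is not None and n >= stop_after:
            break   # Simulates the pod dying mid-run
        state["results"].append(do_step(n))
        state["step"] = n + 1
        save_checkpoint(path, state)
    return state


path = os.path.join(tempfile.mkdtemp(), "agent-0.json")
first = run(path, steps=5, do_step=lambda n: n * n, stop_after=3)
second = run(path, steps=5, do_step=lambda n: n * n)
print(first["step"], second["step"], second["results"])   # 3 5 [0, 1, 4, 9, 16]
```

Saving after every step, rather than only on SIGTERM, is what covers the cases where no signal arrives, such as an OOM kill or a node failure.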


Step 5: Observe Everything with Structured Logging and Metrics

At a thousand agents, kubectl logs doesn't scale. Ship structured logs and expose Prometheus metrics from day one.

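# agent/logging.py
import os

import structlog
import prometheus_client as prom

# Prometheus metrics, exposed over HTTP for your cluster's Prometheus to scrape
tasks_completed = prom.Counter('agent_tasks_completed_total', 'Tasks finished', ['status'])
task_duration = prom.Histogram('agent_task_duration_seconds', 'Time per task')
llm_tokens_used = prom.Counter('agent_llm_tokens_total', 'LLM tokens consumed', ['model'])

log = structlog.get_logger()

# Serve /metrics on the port declared in the container spec; call once at startup
prom.start_http_server(8080)

def run_task(task):
    log.info("task_started", task_id=task.id, agent_index=os.environ['AGENT_INDEX'])

    try:
        with task_duration.time():
            result = execute(task)  # execute() is your agent's task runner, defined elsewhere
    except Exception:
        # Count failures too, otherwise the success rate cannot be computed
        tasks_completed.labels(status="failure").inc()
        log.exception("task_failed", task_id=task.id)
        raise

    tasks_completed.labels(status="success").inc()
    llm_tokens_used.labels(model="claude-sonnet-4").inc(result.tokens)
    log.info("task_completed", task_id=task.id, tokens=result.tokens)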

Expose metrics so Prometheus can scrape them:

# Add to your container spec
ports:
  - name: metrics
    containerPort: 8080

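# Add annotations to the pod template so Prometheus discovers the pod automatically.
# This relies on an annotation-based scrape config; the Prometheus Operator ignores
# these annotations and needs a PodMonitor or ServiceMonitor instead.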
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"

Expected: Grafana shows per-agent task rates, token consumption, and duration percentiles across your entire fleet.
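
For readers unfamiliar with what "structured" means in practice, the sketch below shows the same idea with only the standard library: each log call emits one JSON object per line, so a log pipeline can index fields such as task_id. It is an illustration of the output shape, not a replacement for structlog, and the JsonFormatter class and the "fields" convention are assumptions made for this example.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object so log pipelines can index fields."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "event": record.getMessage(),
            "logger": record.name,
        }
        # Fields passed via extra={"fields": {...}} are merged into the JSON object
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("agent")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("task_completed", extra={"fields": {"task_id": "t-42", "tokens": 1234}})
```

With one JSON object per line, a query like "all failed tasks for agent index 17" becomes a field filter in your log backend instead of a grep across a thousand pods.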

Figure: Grafana dashboard showing agent fleet metrics. p99 task duration and token burn rate across 500 agent pods.


Verification

Deploy the full stack and run a smoke test with 10 agents before scaling:

# Create the namespace, then apply all manifests
kubectl create namespace agents
kubectl apply -f agent-secrets.yaml
kubectl apply -f namespace-limits.yaml
kubectl apply -f agent-pvc.yaml
kubectl apply -f agent-pdb.yaml
kubectl apply -f agent-netpol.yaml
kubectl apply -f rabbitmq-auth.yaml
kubectl apply -f agent-scaledjob.yaml

# Push 10 test messages to the queue
python scripts/seed_queue.py --count 10 --queue agent-tasks

# Watch pods spin up
kubectl get pods -n agents -w

# Check job completion
kubectl get jobs -n agents

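# Verify metrics are exposed: forward a port to one running agent pod, then query it
POD=$(kubectl get pods -n agents -o name | head -n 1)
kubectl port-forward -n agents "$POD" 8080:8080 &
sleep 2 && curl -s localhost:8080/metrics | grep agent_tasks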

You should see: 2 pods start (10 messages ÷ 5 per pod), complete their tasks, and terminate cleanly. Metrics endpoint returns agent_tasks_completed_total counters.


What You Learned

  • Use ScaledJob with KEDA over CPU-based HPA — agents don't scale on CPU
  • completionMode: Indexed gives agents stable identity for checkpointing
  • SIGTERM handlers + shared PVCs let agents survive eviction without losing work
  • NetworkPolicy isolates blast radius when an agent misbehaves
  • Structured logs and Prometheus metrics are non-negotiable at fleet scale

Limitations to know:

  • ReadWriteMany PVCs require shared storage (EFS, NFS, CephFS) — not available on all cloud providers
  • KEDA's polling interval (15s) means queue depth can spike before new pods are ready — add a buffer to maxReplicaCount
  • This pattern assumes stateless LLM APIs; if you're running local model servers, add GPU node affinity and separate the model deployment from agent pods entirely

When NOT to use this: For fewer than 20 concurrent agents, a simple Deployment with a work queue is easier to operate. Kubernetes Job orchestration pays off at scale, not at small numbers.


Tested on Kubernetes 1.32, KEDA 2.16, RabbitMQ 3.13, Python 3.12 — Ubuntu node pools on AWS EKS