Problem: AI Workloads Are Eating Your Cluster
You've deployed a model inference service or training job on Kubernetes — and it's either crashing from OOM errors, stalling on pending GPU pods, or consuming resources that starve everything else.
You'll learn:
- How to configure GPU resource requests and limits correctly
- How to use node affinity and taints to isolate AI workloads
- How to autoscale inference pods with KEDA based on queue depth
Time: 30 min | Level: Advanced
Why This Happens
Kubernetes doesn't know an AI workload is special — it treats a PyTorch training job the same as a web server. Without explicit GPU scheduling, resource quotas, and isolation, the scheduler places pods wherever capacity exists, GPUs go unrequested, and a single runaway training job can starve the entire cluster.
Common symptoms:
- Inference pods stuck in `Pending` state despite available nodes
- `OOMKilled` restarts on model-loading containers
- CPU-only pods scheduled on expensive GPU nodes
- Training jobs blocking autoscaler from scaling down
Solution
Step 1: Install the NVIDIA Device Plugin
Kubernetes needs the NVIDIA device plugin to expose GPUs as schedulable resources.
```bash
# Apply the official DaemonSet
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.0/deployments/static/nvidia-device-plugin.yml

# Verify GPUs are visible on nodes
kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.'nvidia\.com/gpu'
```
Expected: Each GPU node should show a count under the GPU column (e.g., 4).
If it fails:
- Plugin pod in `CrashLoopBackOff`: check that NVIDIA drivers are installed on the node with `nvidia-smi`.
- GPU column shows `<none>`: the node labels may be missing; check with `kubectl describe node <node-name>`.
Step 2: Taint GPU Nodes and Add Labels
Prevent CPU-only workloads from landing on expensive GPU nodes using taints, and label nodes for targeted scheduling.
```bash
# Taint all GPU nodes so only AI workloads land there
kubectl taint nodes -l cloud.google.com/gke-accelerator=nvidia-tesla-a100 \
  workload=ai:NoSchedule

# Label them for affinity rules
kubectl label nodes -l cloud.google.com/gke-accelerator=nvidia-tesla-a100 \
  workload-type=gpu-inference
```
Now update your AI workload manifests to tolerate the taint and prefer those nodes:
```yaml
# inference-deployment.yaml
spec:
  template:
    spec:
      tolerations:
        - key: "workload"
          operator: "Equal"
          value: "ai"
          effect: "NoSchedule"
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: workload-type
                    operator: In
                    values:
                      - gpu-inference
      containers:
        - name: inference
          image: your-model-server:latest
          resources:
            requests:
              memory: "16Gi"
              cpu: "4"
              nvidia.com/gpu: "1"  # Request exactly what you need
            limits:
              memory: "24Gi"
              cpu: "8"
              nvidia.com/gpu: "1"  # Limits must equal requests for GPUs
```
Why limits equal requests for GPUs: GPU allocation in Kubernetes is all-or-nothing. Unlike CPU and memory, extended resources such as `nvidia.com/gpu` cannot be overcommitted or burst, so the API server rejects pod specs where the GPU limit differs from the request.
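Because the request must equal the limit, Kubernetes also accepts a shorthand: specify the GPU only under `limits`, and the request defaults to the same value. A minimal sketch of that equivalent form:

```yaml
# Equivalent shorthand for extended resources:
# the request defaults to the limit
resources:
  limits:
    nvidia.com/gpu: "1"  # implies requests.nvidia.com/gpu: "1"
```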
GPU nodes accept only AI workloads via taint/toleration. General nodes receive everything else.
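The filtering the scheduler applies here can be sketched in a few lines. This is a simplified Python model of `NoSchedule` taint matching (real matching also supports the `Exists` operator, empty keys, and the `NoExecute`/`PreferNoSchedule` effects):

```python
def tolerates(taints, tolerations):
    """A pod may land on a node only if every NoSchedule taint is tolerated.
    Simplified model: exact Equal-match on key, value, and effect."""
    for taint in taints:
        matched = any(
            t.get("key") == taint["key"]
            and t.get("operator", "Equal") == "Equal"
            and t.get("value") == taint["value"]
            and t.get("effect") == taint["effect"]
            for t in tolerations
        )
        if taint["effect"] == "NoSchedule" and not matched:
            return False
    return True

# The taint applied in Step 2, and two candidate pods
gpu_node_taints = [{"key": "workload", "value": "ai", "effect": "NoSchedule"}]
ai_pod = [{"key": "workload", "operator": "Equal", "value": "ai", "effect": "NoSchedule"}]
web_pod = []  # no tolerations

print(tolerates(gpu_node_taints, ai_pod))   # True: AI pod may schedule
print(tolerates(gpu_node_taints, web_pod))  # False: CPU-only pod filtered out
```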
Step 3: Set Namespace Resource Quotas
Prevent any single team or job from monopolizing GPU resources across the cluster.
```yaml
# ai-namespace-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ai-workload-quota
  namespace: ai-workloads
spec:
  hard:
    requests.nvidia.com/gpu: "8"  # Max GPUs requestable in this namespace
    # Note: extended resources only support the requests. prefix in quotas
    requests.memory: "128Gi"
    limits.memory: "256Gi"
    pods: "50"
```
```bash
kubectl apply -f ai-namespace-quota.yaml

# Verify it's enforced
kubectl describe resourcequota ai-workload-quota -n ai-workloads
```
Expected: You'll see Used vs Hard columns showing current consumption against limits.
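To see how much headroom remains before the quota starts blocking pods, you can compare Used against Hard programmatically. A minimal Python sketch over a hypothetical status snippet shaped like the output of `kubectl get resourcequota ai-workload-quota -o json` (the sample numbers are made up):

```python
# Hypothetical status snippet; in practice, parse the JSON from kubectl
quota_status = {
    "hard": {"requests.nvidia.com/gpu": "8", "requests.memory": "128Gi"},
    "used": {"requests.nvidia.com/gpu": "5", "requests.memory": "96Gi"},
}

def gpu_headroom(status: dict) -> int:
    """GPUs still requestable in the namespace before the quota blocks pods."""
    key = "requests.nvidia.com/gpu"
    return int(status["hard"][key]) - int(status["used"][key])

print(gpu_headroom(quota_status))  # 3
```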
Step 4: Autoscale Inference with KEDA
Static replicas waste GPU resources during off-peak hours. KEDA lets you scale inference pods based on actual queue depth (e.g., a Redis list or Kafka topic).
```bash
# Install KEDA
helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda --namespace keda --create-namespace
```
Create a ScaledObject that ties your inference deployment to a Redis queue:
```yaml
# inference-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-scaler
  namespace: ai-workloads
spec:
  scaleTargetRef:
    name: inference-deployment
  minReplicaCount: 1   # Keep 1 warm pod; cold starts on GPU are expensive
  maxReplicaCount: 8   # Capped by your GPU quota above
  cooldownPeriod: 120  # Seconds after the last trigger before scaling to zero
                       # (only takes effect when minReplicaCount is 0)
  triggers:
    - type: redis
      metadata:
        address: redis-service.ai-workloads.svc.cluster.local:6379
        listName: inference-queue
        listLength: "5"  # One pod per 5 queued items
```
```bash
kubectl apply -f inference-scaledobject.yaml

# Watch it scale
kubectl get hpa -n ai-workloads -w
```
Expected: When queue depth exceeds 5 items per replica, KEDA drives the HPA to scale up. When the queue drains, the HPA scales replicas back down toward the minimum of 1.
KEDA watches the Redis queue and drives the HPA. Pods scale out as requests accumulate.
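The scaling relation can be approximated as: desired replicas is the queue length divided by `listLength`, rounded up, clamped to the replica bounds. A Python sketch of that math (an approximation; the real HPA also applies tolerance and stabilization windows):

```python
import math

def desired_replicas(queue_length: int, list_length: int,
                     min_replicas: int = 1, max_replicas: int = 8) -> int:
    """Approximate the HPA formula KEDA drives:
    ceil(metric / target), clamped to the configured replica bounds."""
    raw = math.ceil(queue_length / list_length)
    return max(min_replicas, min(max_replicas, raw))

print(desired_replicas(23, 5))   # 23 queued items, 5 per pod -> 5 replicas
print(desired_replicas(0, 5))    # empty queue -> floor of minReplicaCount: 1
print(desired_replicas(100, 5))  # capped at maxReplicaCount: 8
```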
Step 5: Add GPU Utilization Monitoring
Without metrics, you can't tell if GPUs are sitting idle or being fully utilized. Deploy the NVIDIA DCGM Exporter to surface GPU metrics to Prometheus.
```bash
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace monitoring \
  --set serviceMonitor.enabled=true
```
Then add a Prometheus alert to catch idle GPUs:
```yaml
# gpu-idle-alert.yaml
groups:
  - name: gpu-utilization
    rules:
      - alert: GPUIdleForTooLong
        expr: DCGM_FI_DEV_GPU_UTIL < 10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} on {{ $labels.instance }} is under 10% utilization for 15 minutes"
```
Why this matters: An idle GPU node costs the same as a busy one. This alert surfaces opportunities to scale down or reschedule workloads.
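The `for: 15m` clause means the expression must hold at every evaluation across the whole window before the alert fires; a single busy sample resets it. A simplified Python model of that behavior (hypothetical utilization samples; real Prometheus evaluates at its configured interval):

```python
def alert_fires(samples, threshold=10, for_minutes=15, step_minutes=1):
    """Fire only if every sample in the trailing `for:` window
    stays below the threshold (simplified model of Prometheus `for:`)."""
    window = for_minutes // step_minutes
    if len(samples) < window:
        return False
    return all(s < threshold for s in samples[-window:])

idle = [3] * 20                          # 20 minutes at 3% utilization
busy_spike = [3] * 10 + [55] + [3] * 4   # one busy minute inside the window

print(alert_fires(idle))        # True: fires after sustained idleness
print(alert_fires(busy_spike))  # False: the spike resets the window
```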
Verification
Run this full health check after applying all changes:
```bash
# Check GPU allocation across nodes
kubectl get nodes -o=custom-columns=\
NAME:.metadata.name,\
ALLOC_GPU:.status.allocatable.'nvidia\.com/gpu',\
CAP_GPU:.status.capacity.'nvidia\.com/gpu'

# Confirm no CPU pods landed on GPU nodes
kubectl get pods --all-namespaces -o wide | grep <your-gpu-node-name>

# Check KEDA scaler status
kubectl describe scaledobject inference-scaler -n ai-workloads
```
You should see: GPU nodes with correct allocatable counts, only AI pods on GPU nodes, and the ScaledObject reporting IsActive: true when the queue has items.
What You Learned
- GPU resource requests and limits must always be equal in Kubernetes — no bursting
- Taints and tolerations are the cleanest way to isolate expensive GPU nodes from general workloads
- KEDA enables event-driven autoscaling tied to real queue depth, not CPU/memory proxies
- DCGM Exporter gives you GPU utilization visibility that Kubernetes doesn't provide out of the box
Limitation: This guide targets NVIDIA GPUs. AMD ROCm and Intel Gaudi require different device plugins; the KEDA patterns still apply, but driver setup differs significantly.
When NOT to use this: If you're running occasional batch training jobs rather than persistent inference, consider Kubernetes Jobs with completions and parallelism instead of Deployments — autoscaling a Job is handled differently.
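As a sketch of that pattern (hypothetical image and names, not a drop-in manifest), a Job uses `completions` and `parallelism` to bound how many training pods run at once:

```yaml
# batch-training-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: model-training
  namespace: ai-workloads
spec:
  completions: 4   # run 4 successful pods in total
  parallelism: 2   # at most 2 at a time, bounding GPU usage
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: your-training-image:latest
          resources:
            limits:
              nvidia.com/gpu: "1"
```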
Tested on Kubernetes 1.30, KEDA 2.14, NVIDIA Device Plugin 0.16.0, GKE and self-managed clusters.