Kubernetes Debugging Playbook: CrashLoopBackOff, OOMKilled, and ImagePullBackOff — Exact Fixes

A systematic debugging approach for the most common Kubernetes failure modes — with exact kubectl commands, what each error status means, and step-by-step resolution for each.

You run kubectl get pods and see a column of statuses you don't understand. This playbook gives you an exact command sequence to resolve each one. Your cluster isn't broken; it's just speaking a language of cryptic statuses and back-off timers. With Kubernetes now used by 66% of organizations running containers (CNCF Annual Survey 2025), you're not alone in staring down a CrashLoopBackOff. The average cluster runs 400+ pods, so statistically, a few are always misbehaving. Let's stop guessing and start fixing.

The kubectl Debugging Workflow: get → describe → logs → exec

Before we dive into specific errors, you need a consistent attack pattern. Random kubectl commands waste time. Follow this sequence like a reflex.

  1. kubectl get pods: Your dashboard. Use -o wide for node info, --watch for real-time updates. See the STATUS and READY columns first.

    kubectl get pods -n your-namespace -o wide
    
  2. kubectl describe pod <pod-name>: The detective's notebook. This is your most important command. It shows events, config errors, scheduling failures, and image pull issues. The Events: section at the bottom is pure gold.

    kubectl describe pod/my-app-7cbbf6b9f8-2zx5k -n your-namespace
    
  3. kubectl logs <pod-name>: The application's scream. If the pod is running (or was running), see what it output. For crashed pods, add --previous to see the logs from the instant before death.

    # For a currently running pod
    kubectl logs my-app-7cbbf6b9f8-2zx5k
    # For a pod that crashed
    kubectl logs my-app-7cbbf6b9f8-2zx5k --previous
    
  4. kubectl exec -it <pod-name> -- /bin/sh: The live autopsy. If logs aren't enough, get inside a running pod to inspect files, environment variables, and run commands.

    kubectl exec -it my-app-7cbbf6b9f8-2zx5k -- /bin/sh
    # Once inside, check: env, cat /etc/resolv.conf, ps aux, netstat -tulpn
    
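The first step of the workflow can be made mechanical. Here's a small triage sketch (the filter logic is ours, not a kubectl feature): it pipes `kubectl get pods` output through awk and prints only pods that aren't fully Ready or aren't Running. Sample output is inlined so the logic is visible; in practice you'd pipe the live command in.

```shell
# Stand-in for live `kubectl get pods` output:
sample_output='NAME                      READY   STATUS             RESTARTS   AGE
my-app-7cbbf6b9f8-2zx5k   0/1     CrashLoopBackOff   12         34m
my-api-5f6d8c9b4-xk2lp    1/1     Running            0          2d
my-db-0                   0/1     ImagePullBackOff   0          10m'

# Print only pods needing attention: READY counts differ, or STATUS
# is neither Running nor Completed.
echo "$sample_output" | awk 'NR > 1 {
  split($2, r, "/")
  if (r[1] != r[2] || ($3 != "Running" && $3 != "Completed")) print $1, $3
}'
```

Against a real cluster, replace the echo with `kubectl get pods -n your-namespace | awk '...'` and you have a one-liner that surfaces only the pods worth describing.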

Pro Tip in VS Code: Open the integrated terminal (Ctrl+`) and use the Kubernetes extension. Right-click pods to run these commands from a GUI. Pair it with Continue.dev to ask natural language questions about your error output.

CrashLoopBackOff: 5 Root Causes and How to Find Yours in 60 Seconds

The dreaded CrashLoopBackOff. The pod starts, crashes, waits, and repeats. It's not an error; it's a symptom. Here’s how to diagnose it in under a minute.

  1. Run kubectl logs --previous. This is your first move. The last output before the crash is here.

    • Fix: If you see Error: failed to load environment variables or a specific config error, check your ConfigMap/Secret references and env: definitions.
  2. Check the CMD or ENTRYPOINT. Your container might be starting a process that exits immediately. Use describe to see the container's command.

    • Fix: Test your image locally with docker run, or temporarily override the container's command (e.g., command: ["sleep", "3600"]) so the pod stays up long enough to exec in and investigate. Ensure the main process is a long-running one (e.g., a web server, not a bash script that finishes).
  3. Look for port conflicts. Two containers in the same pod trying to bind to port 8080.

    • Fix: Check your pod spec's containerPort definitions. Containers in a pod share a network namespace; ports must be unique.
  4. Missing dependencies or volumes. The app expects a file at /app/config.yaml that isn't mounted.

    • Fix: kubectl describe will show volume mount errors. Verify your PersistentVolumeClaim names and ConfigMap mount paths.
  5. Incorrect startup order (in multi-container pods). Your app container starts before the init container that sets up the database.

    • Fix: Use init containers (initContainers:) for setup tasks that must complete before the main app runs.
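One more 60-second shortcut: kubectl describe pod reports the container's Last State with an Exit Code, which narrows the cause fast. Codes above 128 mean the process died from a signal (code − 128). The helper function below is a sketch of ours, not a kubectl feature:

```shell
# Translate a container exit code into a likely cause.
# Codes > 128 mean the process was killed by signal (code - 128).
explain_exit_code() {
  case "$1" in
    0)   echo "clean exit: CMD/ENTRYPOINT finished (not a long-running process?)" ;;
    1)   echo "application error: check logs --previous for a stack trace" ;;
    126) echo "command found but not executable (permissions, wrong shebang)" ;;
    127) echo "command not found: typo in CMD/ENTRYPOINT or missing binary" ;;
    137) echo "SIGKILL (128+9): OOMKilled or force-terminated" ;;
    139) echo "SIGSEGV (128+11): segmentation fault" ;;
    143) echo "SIGTERM (128+15): graceful shutdown requested" ;;
    *)   echo "exit code $1: check application docs" ;;
  esac
}

explain_exit_code 137
explain_exit_code 127
```

Exit code 137 in a CrashLoopBackOff is the classic tell that your loop is actually an OOM problem, which the next section covers.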

Real Error & Exact Fix:

Error: CrashLoopBackOff
Fix: Run kubectl logs <pod> --previous to see the last crash output; common causes: missing env vars, wrong CMD, port conflicts.

OOMKilled: Reading Memory Metrics and Setting Correct Limits

OOMKilled means the kernel murdered your container for using too much memory. It's brutal and final. With 96% of organizations reporting increased deployment frequency with Kubernetes (Red Hat State of Kubernetes 2025), hasty resources: blocks are a common culprit.

First, confirm it's OOM:

kubectl describe pod my-memory-hog-pod

Look for Last State: Terminated with Reason: OOMKilled in the container status.

Diagnosing the Why:

  1. You set a limit that's too low. The classic. Your app needs 500Mi but you set limit: 256Mi.
  2. You didn't set a limit at all. This is worse. The pod can consume all node memory, causing other pods to be OOMKilled or the node to become unstable. Always set limits.
  3. A memory leak. The app slowly consumes memory until it hits the limit.

Setting Correct Limits: Stop guessing. Use metrics. If you have Prometheus and Grafana, look at the container's memory usage over time. No metrics? Start with a generous limit and monitor.


apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
      - name: app
        image: myapp:1.0
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"  # Start 2x request, adjust based on metrics
            cpu: "500m"
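A rough sizing rule — an assumption to refine against real metrics, not an official formula — is to set the request near observed steady-state usage and the limit at about twice the observed peak. Given a peak reading from kubectl top pod or your dashboards, the arithmetic is trivial:

```shell
# Sizing sketch: observed peak memory (Mi) -> suggested request/limit,
# using "request ~= peak, limit ~= 2x peak" as a starting point.
suggest_memory() {
  peak_mi="$1"
  request_mi="$peak_mi"
  limit_mi=$(( peak_mi * 2 ))
  echo "requests.memory: ${request_mi}Mi, limits.memory: ${limit_mi}Mi"
}

suggest_memory 256
```

Feeding in the 256Mi peak from the manifest above reproduces its 256Mi/512Mi pairing; rerun the sum whenever your metrics show a new peak.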

Real Error & Exact Fix:

Error: 0/3 nodes are available: insufficient memory
Fix: This is a scheduling failure driven by requests, not limits: check kubectl describe nodes for allocatable vs requested, then lower pod memory requests or add nodes.

Pro Tip: Use Vertical Pod Autoscaler (VPA) in recommendation mode to analyze your pod's historical usage and suggest requests and limits. For scaling nodes based on resource pressure, tools like Karpenter automatically provision the right-sized nodes.

ImagePullBackOff: Registry Auth, Tag Typos, and Private Registry Setup

ImagePullBackOff means Kubernetes can't fetch your container image. kubectl describe will give you the exact reason.

Run kubectl describe pod and look for Events:

  • ErrImagePull or ImagePullBackOff with details.

The Three Main Culprits:

  1. The image tag doesn't exist. You typed myapp:latesst instead of myapp:latest.

    • Fix: Verify the tag exists in your registry (e.g., docker pull myapp:latest locally, or list the repository's tags in your registry UI) and correct the typo in your manifest.
  2. Private registry authentication failure. "pull access denied".

    • Fix: Create a docker-registry Secret and reference it in your pod spec.
    kubectl create secret docker-registry my-registry-key \
      --docker-server=registry.mycompany.com \
      --docker-username=myuser \
      --docker-password=mypassword \
      --docker-email=myemail@company.com
    
    # In your Pod or Deployment spec
    spec:
      imagePullSecrets:
        - name: my-registry-key
      containers:
      - name: app
        image: registry.mycompany.com/myapp:latest
    
  3. Network issues or rate limiting. Especially with docker.io.

    • Fix: For Docker Hub, use authenticated pulls or mirror images to a private registry.
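Under the hood, kubectl create secret docker-registry just builds a kubernetes.io/dockerconfigjson payload whose auth field is base64("username:password"). Reproducing that encoding locally (with the placeholder credentials from above) is a quick way to sanity-check what a failing Secret actually contains:

```shell
# Recreate the `auth` field of a dockerconfigjson Secret by hand:
# it is simply base64("<username>:<password>").
username="myuser"
password="mypassword"
auth=$(printf '%s:%s' "$username" "$password" | base64)

# Simplified shape of the Secret's .dockerconfigjson key
# (kubectl also stores password and email alongside auth):
cat <<EOF
{"auths":{"registry.mycompany.com":{"username":"$username","auth":"$auth"}}}
EOF

# To inspect an existing secret, decode it the same way:
#   kubectl get secret my-registry-key \
#     -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
```

If the decoded Secret doesn't contain the credentials you expect, recreate it — a stale password here is one of the most common "pull access denied" causes.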

Real Error & Exact Fix:

Error: ImagePullBackOff
Fix: Verify the image tag exists, check imagePullSecrets for private registries, and run kubectl describe pod for the exact pull error.

Pending Pod: Node Selector Mismatch, Resource Pressure, and Taint/Toleration

A Pending pod is one the scheduler can't place. kubectl describe pod is your only tool here.

Check the Events: for these phrases:

  • node(s) didn't match node selector: Your pod has a nodeSelector label (e.g., disk: ssd) that no node has.

    • Fix: Label a node (kubectl label node node-1 disk=ssd) or remove the selector.
  • 0/X nodes are available: insufficient cpu/memory: The cluster is out of resources.

    • Fix: Check kubectl describe nodes to see Allocatable vs total Requests. You need to add nodes (scale your cluster), reduce pod requests, or evict other pods.
  • 0/X nodes are available: node(s) had taint {key:value}, that the pod didn't tolerate: Nodes can repel pods with taints. Pods need tolerations to land on them.

    • Fix: Add a toleration to your pod spec or remove the taint from the node (if you own it).
    # Example: Tolerating a node dedicated to "monitoring"
    tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "monitoring"
      effect: "NoSchedule"
    
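The three event phrases above can be triaged mechanically. Here's a hypothetical classifier over the scheduler's FailedScheduling message (sample messages inlined; in practice you'd feed it the event text from kubectl describe pod):

```shell
# Triage sketch: map a FailedScheduling event message to its fix.
classify_pending() {
  case "$1" in
    *"didn't match node selector"*|*"didn't match Pod's node affinity"*)
      echo "nodeSelector/affinity: label a node or relax the selector" ;;
    *"insufficient cpu"*|*"insufficient memory"*)
      echo "resource pressure: add nodes or reduce requests" ;;
    *"had taint"*|*"untolerated taint"*)
      echo "taint: add a toleration or remove the taint" ;;
    *)
      echo "unknown: read the full event in kubectl describe pod" ;;
  esac
}

classify_pending "0/3 nodes are available: 3 node(s) didn't match node selector"
classify_pending "0/3 nodes are available: insufficient memory"
```

The exact wording of scheduler messages varies slightly between Kubernetes versions, so treat the patterns as a starting point and extend them from the events your cluster actually emits.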

Benchmark Context: When a pod is finally scheduled, speed matters. A k3s cluster starts in ~30s, while a full kubeadm cluster can take ~4 minutes. For scaling, remember that HPA scale-up typically takes 15–30s from metric breach to a new pod being ready, due to default metric scrape intervals.

Service Not Reachable: Selector Labels, Port Config, and NetworkPolicy

Your pod is Running but you get Connection refused or a timeout. The Service abstraction is leaking.

Debugging Checklist:

  1. Selector Labels Mismatch: A Service finds pods via selector. If the labels on your pod don't match the Service's selector, the Service has an empty endpoint list.

    # Check endpoints. Empty? Selector mismatch.
    kubectl get endpoints my-service
    # Compare labels
    kubectl get pods --show-labels
    kubectl describe service my-service
    
  2. Port Misconfiguration: The Service's targetPort must match the container's containerPort.

    # Service
    apiVersion: v1
    kind: Service
    spec:
      ports:
      - port: 80        # Service port
        targetPort: 8080 # Must match container port
      selector:
        app: my-app
    
    # Pod (in Deployment template)
    containers:
    - name: app
      image: myapp:1.0
      ports:
      - containerPort: 8080 # Must match Service's targetPort
    
  3. NetworkPolicy Blocking Traffic: If your cluster uses NetworkPolicy (e.g., with Calico, Cilium), a default-deny policy might be blocking traffic.

    # Check for NetworkPolicies in the namespace
    kubectl get networkpolicy -n your-namespace
    

Quick Test: kubectl port-forward directly to a pod, bypassing the Service. If that works, the problem is with the Service or network policy.

kubectl port-forward pod/my-app-pod 8080:8080

Building a Debugging Runbook for Your Team with kubectl Aliases

You've solved your problem. Now, prevent the next 2 AM page. Build a runbook. Start by saving these kubectl shortcuts in a shared dotfiles repo that your team sources from ~/.bashrc or ~/.zshrc.

# Diagnostic Aliases
alias kdp='kubectl describe pod'
alias klp='kubectl logs --previous'
kdebug() { kubectl exec -it "$1" -- /bin/sh; }  # a function, not an alias: the pod name must come before --
alias ktail='kubectl logs -f'
alias knodes='kubectl describe nodes | grep -A 10 -B 5 Allocatable'

# Quick Context & Info
alias kctx='kubectl config get-contexts'
alias kres='kubectl get pods -o=jsonpath="{range .items[*]}{.metadata.name}{\"\t\"}{.spec.containers[*].resources.limits.memory}{\"\n\"}{end}"'

# Fast apply from common tools
alias kh='helm'
alias kk='kubectl kustomize'

Create a Shared Document with your most common errors and the exact 3-command sequence to solve them. Use the format:

  • Symptom: CrashLoopBackOff
  • 1-Minute Drill:
    1. kubectl logs <pod> --previous
    2. kubectl describe pod <pod>
    3. kubectl get events --sort-by=.lastTimestamp

Leverage Your IDE: In VS Code, use snippets to store these command sequences. With GitHub Copilot or Amazon Q Developer, you can now ask "how to debug a pending pod" and get context-aware commands from your runbook.

Performance Note: When your cluster grows, use server-side filtering. kubectl get pods on a 1000-pod cluster takes ~2s client-side but only ~200ms with server-side filtering.

# Slow: Client filters all 1000 pods
kubectl get pods --all-namespaces | grep Running
# Fast: Ask the API server to filter
kubectl get pods --all-namespaces --field-selector=status.phase=Running

Next Steps: From Debugging to Observability

You've moved from panic to diagnosis. The next evolution is to see problems before they cause outages. Shift from debugging to observability.

  1. Implement Structured Logging: Ensure your application outputs logs as JSON. Use a sidecar or a DaemonSet such as Fluentd or Fluent Bit to ship logs to a central system like Loki or Elasticsearch.

  2. Standardize on Health Checks: Define accurate livenessProbe and readinessProbe in every Deployment. Kubernetes can only restart unhealthy pods if it knows what "unhealthy" means.

    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 30
      periodSeconds: 10
    
  3. Deploy a Service Mesh for Advanced Traffic Insights: For complex microservices, a tool like Istio provides detailed metrics, retries, timeouts, and a visual graph of service dependencies, turning network mysteries into actionable data.

  4. Schedule Regular Cluster Hygiene: With 34% of clusters still using deprecated APIs (Fairwinds 2025), run kubectl-convert and audit tools regularly. Use Helm for managed updates—a chart install (~12s for 15 resources) is often faster and safer than raw kubectl apply (~25s).
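To round out the health-check step above: pair the livenessProbe with a readinessProbe, since they answer different questions — readiness gates traffic to the pod, liveness triggers restarts. A sketch (the /ready path is an assumption; use whatever endpoint your app exposes):

```yaml
readinessProbe:
  httpGet:
    path: /ready     # assumed endpoint; often distinct from /healthz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
```

Keeping the two probes on separate endpoints lets you mark a pod "not ready" (e.g., warming a cache) without Kubernetes killing it.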

Your cluster is a system. Systems fail. Your new skill isn't preventing failure—it's navigating it with precision, turning cryptic errors into clear commands, and building the playbook that keeps your team shipping code, even when a pod decides to take the day off.