You run kubectl get pods and see 7 errors you don't understand. This playbook gives you an exact command sequence to resolve each one. Your cluster isn't broken; it's just speaking a language of cryptic statuses and back-off timers. With Kubernetes now used by 66% of organizations running containers (CNCF Annual Survey 2025), you're not alone in staring down a CrashLoopBackOff. The average cluster runs 400+ pods, so statistically, a few are always misbehaving. Let's stop guessing and start fixing.
The kubectl Debugging Workflow: get → describe → logs → exec
Before we dive into specific errors, you need a consistent attack pattern. Random kubectl commands waste time. Follow this sequence like a reflex.
1. **`kubectl get pods`**: Your dashboard. Use `-o wide` for node info, `--watch` for real-time updates. Check the `STATUS` and `READY` columns first.

```shell
kubectl get pods -n your-namespace -o wide
```

2. **`kubectl describe pod <pod-name>`**: The detective's notebook. This is your most important command. It shows events, config errors, scheduling failures, and image pull issues. The `Events:` section at the bottom is pure gold.

```shell
kubectl describe pod/my-app-7cbbf6b9f8-2zx5k -n your-namespace
```

3. **`kubectl logs <pod-name>`**: The application's scream. If the pod is running (or was running), see what it output. For crashed pods, add `--previous` to see the logs from the instant before death.

```shell
# For a currently running pod
kubectl logs my-app-7cbbf6b9f8-2zx5k

# For a pod that crashed
kubectl logs my-app-7cbbf6b9f8-2zx5k --previous
```

4. **`kubectl exec -it <pod-name> -- /bin/sh`**: The live autopsy. If logs aren't enough, get inside a running pod to inspect files, environment variables, and run commands.

```shell
kubectl exec -it my-app-7cbbf6b9f8-2zx5k -- /bin/sh
# Once inside, check: env, cat /etc/resolv.conf, ps aux, netstat -tulpn
```
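The four steps chain naturally into a throwaway helper you can paste into your shell. This is a sketch, not an official tool: the `ktriage` name, the defaulted namespace, and the `--tail=50` cap are my own choices.

```shell
# Hypothetical helper: run the full get -> describe -> logs triage for one pod.
# Usage: ktriage <pod-name> [namespace]
ktriage() {
  pod="$1"; ns="${2:-default}"
  kubectl get pod "$pod" -n "$ns" -o wide
  # Only the Events: section of describe, where the answer usually lives
  kubectl describe pod "$pod" -n "$ns" | sed -n '/^Events:/,$p'
  # Prefer the pre-crash logs; fall back to the current container's logs
  kubectl logs "$pod" -n "$ns" --previous --tail=50 2>/dev/null \
    || kubectl logs "$pod" -n "$ns" --tail=50
}
```

Run it as `ktriage my-app-7cbbf6b9f8-2zx5k your-namespace` and you have steps 1 through 3 on one screen; drop into `exec` only if that isn't enough.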
Pro Tip in VS Code: Open the integrated terminal (``Ctrl+` ``) and use the Kubernetes extension. Right-click pods to run these commands from a GUI. Pair it with Continue.dev to ask natural language questions about your error output.
CrashLoopBackOff: 5 Root Causes and How to Find Yours in 60 Seconds
The dreaded CrashLoopBackOff. The pod starts, crashes, waits, and repeats. It's not an error; it's a symptom. Here’s how to diagnose it in under a minute.
1. **Run `kubectl logs <pod> --previous`.** This is your first move. The last output before the crash is here.
   - Fix: If you see `Error: failed to load environment variables` or a specific config error, check your `ConfigMap`/`Secret` references and `env:` definitions.
2. **Check the `CMD` or `ENTRYPOINT`.** Your container might be starting a process that exits immediately. Use `describe` to see the container's command.
   - Fix: `kubectl exec` into a previous instance if possible, or test your Docker image locally with `docker run`. Ensure the main process is a long-running one (e.g., a web server, not a `bash` script that finishes).
3. **Look for port conflicts.** Two containers in the same pod trying to bind to port `8080`.
   - Fix: Check your pod spec's `containerPort` definitions. Containers in a pod share a network namespace; ports must be unique.
4. **Missing dependencies or volumes.** The app expects a file at `/app/config.yaml` that isn't mounted.
   - Fix: `kubectl describe` will show volume mount errors. Verify your `PersistentVolumeClaim` names and `ConfigMap` mount paths.
5. **Incorrect startup order (in multi-container pods).** Your app container starts before the init container that sets up the database.
   - Fix: Use init containers (`initContainers:`) for setup tasks that must complete before the main app runs.
**Real Error & Exact Fix:**

- Error: `CrashLoopBackOff`
- Fix: Run `kubectl logs <pod> --previous` to see the last crash output; common causes: missing env vars, wrong `CMD`, port conflicts.
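The "BackOff" half of the name is literal: the kubelet doubles the restart delay after each crash, starting at 10 seconds and capping at five minutes (the timer resets once the container runs cleanly for a while). A quick sketch of the wait you're watching:

```shell
# CrashLoopBackOff restart delay: starts at 10s, doubles per crash, caps at 300s
delay=10
for restart in 1 2 3 4 5 6; do
  echo "restart ${restart}: wait ${delay}s"
  delay=$(( delay * 2 ))
  if [ "$delay" -gt 300 ]; then delay=300; fi
done
```

So by the fifth or sixth crash you're waiting the full five minutes between attempts, which is why a CrashLoopBackOff pod can look "stuck" when it's really just between retries.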
OOMKilled: Reading Memory Metrics and Setting Correct Limits
OOMKilled means the kernel murdered your container for using too much memory. It's brutal and final. With 96% of organizations reporting increased deployment frequency with Kubernetes (Red Hat State of Kubernetes 2025), hastily written `resources:` blocks are a common culprit.
First, confirm it's OOM:

```shell
kubectl describe pod my-memory-hog-pod
```

Look for `Last State: Terminated` with `Reason: OOMKilled` in the container status.
Diagnosing the Why:
- **You set a limit that's too low.** The classic. Your app needs 500Mi but you set `limit: 256Mi`.
- **You didn't set a limit at all.** This is worse. The pod can consume all node memory, causing other pods to be OOMKilled or the node to become unstable. Always set limits.
- **A memory leak.** The app slowly consumes memory until it hits the limit.
Setting Correct Limits: Stop guessing. Use metrics. If you have Prometheus and Grafana, look at the container's memory usage over time. No metrics? Start with a generous limit and monitor.
```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: app
          image: myapp:1.0
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"  # Start at 2x request, adjust based on metrics
              cpu: "500m"
```
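If you have even a handful of usage samples (from Prometheus, `kubectl top pod`, or a load test), you can turn them into starting numbers using the 2x rule of thumb above. A rough sketch; the sample values here are made up:

```shell
# Hypothetical memory samples in Mi (e.g., exported from Prometheus or `kubectl top pod`)
samples="180 210 195 240 205 230 250 220"

# request = peak observed usage; limit = 2x request (a starting point, not gospel)
request=$(printf '%s\n' $samples | sort -n | tail -1)
limit=$(( request * 2 ))
echo "memory request: ${request}Mi, limit: ${limit}Mi"
```

For this data that prints `memory request: 250Mi, limit: 500Mi`. Watch the pod for a week, then tighten.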
**Real Error & Exact Fix:**

- Error: `0/3 nodes are available: insufficient memory`
- Fix: Check `kubectl describe nodes` for allocatable vs requested; add resource requests and limits to pods.
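Note the scheduler's arithmetic behind that message: it compares the sum of pod *requests* (not actual usage) against each node's `Allocatable`. A toy sketch with made-up numbers:

```shell
# Hypothetical node: scheduling is requests vs allocatable, not live usage
allocatable=4096        # Mi the node can hand out
already_requested=3900  # Mi requested by pods already scheduled on the node
new_pod_request=512     # Mi the Pending pod asks for

if [ $(( already_requested + new_pod_request )) -gt "$allocatable" ]; then
  echo "does not fit: pod stays Pending"
else
  echo "fits: pod can schedule"
fi
```

This is why a node at 40% real memory usage can still refuse your pod: what counts is what's been promised, not what's being used.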
Pro Tip: Use Vertical Pod Autoscaler (VPA) in recommendation mode to analyze your pod's historical usage and suggest requests and limits. For scaling nodes based on resource pressure, tools like Karpenter automatically provision the right-sized nodes.
ImagePullBackOff: Registry Auth, Tag Typos, and Private Registry Setup
ImagePullBackOff means Kubernetes can't fetch your container image. kubectl describe will give you the exact reason.
Run `kubectl describe pod` and look for `Events:` showing `ErrImagePull` or `ImagePullBackOff` with details.
The Three Main Culprits:
1. **The image tag doesn't exist.** You typed `myapp:latesst` instead of `myapp:latest`.
   - Fix: Verify the tag exists in your registry (`docker pull myapp:latest` locally).
2. **Private registry authentication failure.** "pull access denied".
   - Fix: Create a `docker-registry` Secret and reference it in your pod spec.

```shell
kubectl create secret docker-registry my-registry-key \
  --docker-server=registry.mycompany.com \
  --docker-username=myuser \
  --docker-password=mypassword \
  --docker-email=myemail@company.com
```

```yaml
# In your Pod or Deployment spec
spec:
  imagePullSecrets:
    - name: my-registry-key
  containers:
    - name: app
      image: registry.mycompany.com/myapp:latest
```

3. **Network issues or rate limiting.** Especially with `docker.io`.
   - Fix: For Docker Hub, use authenticated pulls or mirror images to a private registry.
**Real Error & Exact Fix:**

- Error: `ImagePullBackOff`
- Fix: Verify the image tag exists, check `imagePullSecrets` for private registries, and run `kubectl describe pod` for the exact pull error.
Pending Pod: Node Selector Mismatch, Resource Pressure, and Taint/Toleration
A Pending pod is one the scheduler can't place. kubectl describe pod is your only tool here.
Check the `Events:` section for these phrases:

- `node(s) didn't match node selector`: Your pod has a `nodeSelector` label (e.g., `disk: ssd`) that no node has.
  - Fix: Label a node (`kubectl label node node-1 disk=ssd`) or remove the selector.
- `0/X nodes are available: insufficient cpu/memory`: The cluster is out of resources.
  - Fix: Check `kubectl describe nodes` to see `Allocatable` vs total `Requests`. You need to add nodes (scale your cluster), reduce pod requests, or evict other pods.
- `0/X nodes are available: node(s) had taint {key: value}, that the pod didn't tolerate`: Nodes can repel pods with taints. Pods need tolerations to land on them.
  - Fix: Add a toleration to your pod spec or remove the taint from the node (if you own it).

```yaml
# Example: tolerating a node dedicated to "monitoring"
tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "monitoring"
    effect: "NoSchedule"
```
Benchmark Context: When a pod is finally scheduled, speed matters. A k3s cluster starts in ~30s, while a full kubeadm cluster can take ~4 minutes. For scaling, remember that HPA scale-up typically takes 15–30s from metric breach to a new pod being ready, due to default metric scrape intervals.
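That 15–30s figure is mostly metric scrape latency; the HPA's scale-up math itself is instant. Per the Kubernetes docs, the controller computes desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue). A sketch with made-up numbers:

```shell
# HPA core formula: desired = ceil(current * metricValue / metricTarget)
current_replicas=3
current_cpu=180   # observed average per pod, in millicores
target_cpu=100    # target from the HPA spec

# Integer ceiling division
desired=$(( (current_replicas * current_cpu + target_cpu - 1) / target_cpu ))
echo "scale from ${current_replicas} to ${desired} replicas"
```

Here 3 replicas averaging 180m against a 100m target scale to ceil(5.4) = 6 replicas, and the new pods still have to schedule and pass readiness before they take traffic.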
Service Not Reachable: Selector Labels, Port Config, and NetworkPolicy
Your pod is Running but you get Connection refused or a timeout. The Service abstraction is leaking.
Debugging Checklist:
1. **Selector Labels Mismatch**: A Service finds pods via its `selector`. If the labels on your pod don't match the Service's `selector`, the Service has an empty endpoint list.

```shell
# Check endpoints. Empty? Selector mismatch.
kubectl get endpoints my-service

# Compare labels
kubectl get pods --show-labels
kubectl describe service my-service
```

2. **Port Misconfiguration**: The Service's `targetPort` must match the container's `containerPort`.

```yaml
# Service
apiVersion: v1
kind: Service
spec:
  ports:
    - port: 80          # Service port
      targetPort: 8080  # Must match container port
  selector:
    app: my-app
```

```yaml
# Pod (in Deployment template)
containers:
  - name: app
    image: myapp:1.0
    ports:
      - containerPort: 8080  # Must match Service's targetPort
```

3. **NetworkPolicy Blocking Traffic**: If your cluster uses NetworkPolicy (e.g., with Calico, Cilium), a default-deny policy might be blocking traffic.

```shell
# Check for NetworkPolicies in the namespace
kubectl get networkpolicy -n your-namespace
```
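The selector-match rule above is pure set logic: every `key=value` pair in the Service's selector must also appear in the pod's labels, or the endpoint list comes up empty. A self-contained sketch (both label sets are hypothetical; pull real ones with `kubectl get pods --show-labels`):

```shell
# Hypothetical pod labels and Service selector, as comma-separated key=value pairs
pod_labels="app=my-app,tier=backend,version=v2"
selector="app=my-app,tier=backend"

match=true
for kv in ${selector//,/ }; do
  case ",${pod_labels}," in
    *",${kv},"*) ;;        # this selector pair exists on the pod
    *) match=false ;;      # any missing pair means empty endpoints
  esac
done
echo "selector matches pod labels: $match"
```

Extra labels on the pod (like `version=v2` here) are fine; only the selector's pairs must be present.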
Quick Test: `kubectl port-forward` directly to a pod, bypassing the Service. If that works, the problem is with the Service or a network policy.

```shell
kubectl port-forward pod/my-app-pod 8080:8080
```
Building a Debugging Runbook for Your Team with kubectl Aliases
You've solved your problem. Now, prevent the next 2 AM page. Build a runbook. Start by saving these kubectl aliases in your team's shared ~/.bashrc or ~/.zshrc.
```shell
# Diagnostic Aliases
alias kdp='kubectl describe pod'
alias klp='kubectl logs --previous'
alias ktail='kubectl logs -f'
alias knodes='kubectl describe nodes | grep -A 10 -B 5 Allocatable'

# A function, not an alias: the pod name must go before the `--`
kdebug() { kubectl exec -it "$1" -- /bin/sh; }

# Quick Context & Info
alias kctx='kubectl config get-contexts'
alias kres='kubectl get pods -o=jsonpath="{range .items[*]}{.metadata.name}{\"\t\"}{.spec.containers[*].resources.limits.memory}{\"\n\"}{end}"'

# Fast apply from common tools
alias kh='helm'
alias kk='kubectl kustomize'
```
Create a Shared Document with your most common errors and the exact 3-command sequence to solve them. Use the format:

- Symptom: `CrashLoopBackOff`
- 1-Minute Drill:
  1. `kubectl logs <pod> --previous`
  2. `kubectl describe pod <pod>`
  3. `kubectl get events --sort-by=.lastTimestamp`
Leverage Your IDE: In VS Code, use snippets to store these command sequences. With GitHub Copilot or Amazon Q Developer, you can now ask "how to debug a pending pod" and get context-aware commands from your runbook.
Performance Note: When your cluster grows, use server-side filtering. kubectl get pods on a 1000-pod cluster takes ~2s client-side but only ~200ms with server-side filtering.
```shell
# Slow: Client filters all 1000 pods
kubectl get pods --all-namespaces | grep Running

# Fast: Ask the API server to filter
kubectl get pods --all-namespaces --field-selector=status.phase=Running
```
Next Steps: From Debugging to Observability
You've moved from panic to diagnosis. The next evolution is to see problems before they cause outages. Shift from debugging to observability.
1. **Implement Structured Logging**: Ensure your application outputs logs as JSON. Use a sidecar or DaemonSet like Fluentd or Fluent Bit to ship logs to a central system like Loki or Elasticsearch.
2. **Standardize on Health Checks**: Define accurate `livenessProbe` and `readinessProbe` checks in every Deployment. Kubernetes can only restart unhealthy pods if it knows what "unhealthy" means.

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
```

3. **Deploy a Service Mesh for Advanced Traffic Insights**: For complex microservices, a tool like Istio provides detailed metrics, retries, timeouts, and a visual graph of service dependencies, turning network mysteries into actionable data.
4. **Schedule Regular Cluster Hygiene**: With 34% of clusters still using deprecated APIs (Fairwinds 2025), run `kubectl-convert` and audit tools regularly. Use Helm for managed updates: a chart install (~12s for 15 resources) is often faster and safer than raw `kubectl apply` (~25s).
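It's worth internalizing how the kubelet acts on those probes: only `failureThreshold` *consecutive* failures (default 3) trigger a liveness restart; any success resets the count. A sketch of that logic with a made-up probe history:

```shell
# Sketch of kubelet liveness handling: N consecutive failures -> restart
failure_threshold=3
consecutive_failures=0
for probe_result in ok ok fail fail fail ok; do
  if [ "$probe_result" = "fail" ]; then
    consecutive_failures=$(( consecutive_failures + 1 ))
  else
    consecutive_failures=0   # any success resets the streak
  fi
  if [ "$consecutive_failures" -ge "$failure_threshold" ]; then
    echo "container restarted"
    consecutive_failures=0
  fi
done
```

With `periodSeconds: 10` and the default threshold, a truly dead container is restarted roughly 30 seconds after it stops answering, so a single slow response won't bounce your pod.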
Your cluster is a system. Systems fail. Your new skill isn't preventing failure—it's navigating it with precision, turning cryptic errors into clear commands, and building the playbook that keeps your team shipping code, even when a pod decides to take the day off.