Your vLLM pod serves 10 req/s fine. At 50 req/s it OOMs, restarts, and drops every in-flight request. This guide fixes all three problems. You’re not just deploying a container; you’re orchestrating a stateful, GPU-hungry, memory-intensive beast in a system where 96% of organizations report Kubernetes increased their deployment frequency (Red Hat State of Kubernetes 2025). The average cluster now runs 400+ pods (CNCF 2025), and your vLLM server needs to be a resilient citizen, not a resource-hogging liability.
We’ll move from a brittle kubectl apply to a production-grade deployment with GPU scheduling, intelligent autoscaling, and zero-downtime updates. This is for when you’ve outgrown running ollama serve in a terminal.
GPU Nodes: Making Your Cluster See the Silicon
First, your cluster needs to know a GPU from a hole in the ground. If you run kubectl describe nodes and don’t see nvidia.com/gpu in the allocatable resources, you’re dead in the water.
On your GPU-equipped node (this applies to k3s, EKS, GKE with NVIDIA drivers, or AKS), you need the NVIDIA Device Plugin. It’s not magic, it’s a DaemonSet that advertises GPU capacity to the kubelet.
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm install nvidia-device-plugin nvdp/nvidia-device-plugin \
  --namespace kube-system \
  --version 0.15.0 \
  --set runtimeClassName=nvidia
After a minute, verify:
kubectl describe node <your-gpu-node> | grep -A5 -B5 Capacity
You should see:
Capacity:
  nvidia.com/gpu: 1
Allocatable:
  nvidia.com/gpu: 1
Now, we don’t want your vLLM pod landing on a puny CPU node. Use node affinity to guide it home. Label your GPU node:
kubectl label nodes <node-name> hardware-type=nvidia-gpu
This label will be your beacon. Without it, you’ll face the classic error: 0/3 nodes are available: insufficient nvidia.com/gpu. The fix is always twofold: 1) Ensure the device plugin pods are running (kubectl get pods -n kube-system | grep nvidia), and 2) Verify your node has the correct label and the resource appears allocatable.
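Before deploying anything heavy, it can help to prove the wiring end to end with a throwaway pod that requests one GPU and runs nvidia-smi. A minimal sketch; the pod name and CUDA image tag are illustrative choices, not requirements:

```yaml
# gpu-smoke-test.yaml -- throwaway pod to verify GPU scheduling works.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  nodeSelector:
    hardware-type: nvidia-gpu   # The label applied above.
  containers:
  - name: cuda-check
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1       # Forces scheduling through the device plugin.
```

If kubectl logs gpu-smoke-test prints the driver table, then the label, the device plugin, and the scheduler are all cooperating. Delete the pod afterward so it doesn't hold the GPU.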
The vLLM Deployment: Where Resource Limits Are a Promise, Not a Suggestion
Here’s where most deployments fail. A vLLM server loading a 70B model needs a precise cocktail of memory and GPU. Guessing leads to OOMKilled or CrashLoopBackOff.
Let’s build a robust Deployment. We’ll use an initContainer to pull the model weights before the main container starts—this separates the pull logic and allows for shared volume caching.
# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
  namespace: llm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-inference
  template:
    metadata:
      labels:
        app: vllm-inference
    spec:
      # Affinity: pin this pod to a GPU node.
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: hardware-type
                operator: In
                values:
                - nvidia-gpu
      containers:
      - name: vllm-server
        image: vllm/vllm-openai:latest   # Pin a specific tag in production.
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - "--model"
        - "meta-llama/Meta-Llama-3.1-70B-Instruct"
        - "--port"
        - "8000"
        - "--tensor-parallel-size"
        - "1"
        # Critical: with the HF cache mounted from the PVC below, weights
        # load from the volume on startup, not the web.
        - "--load-format"
        - "safetensors"
        - "--disable-log-requests"
        # Resource limits are NON-NEGOTIABLE.
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "140Gi"
            cpu: "4"
          requests:
            nvidia.com/gpu: 1
            memory: "140Gi"
            cpu: "2"
        ports:
        - containerPort: 8000
        # The readiness probe that gates traffic until the model is loaded.
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30   # Give model loading a head start.
          periodSeconds: 5
          failureThreshold: 3
        # A startup probe is better for long-loading apps (K8s 1.20+).
        startupProbe:
          httpGet:
            path: /health
            port: 8000
          failureThreshold: 30   # Allow up to 150 seconds (30 * 5s) to load.
          periodSeconds: 5
        volumeMounts:
        - name: model-storage
          mountPath: /root/.cache/huggingface/hub
      # Init container to pull weights once, share across pods.
      initContainers:
      - name: model-downloader
        image: busybox:latest
        command: ['sh', '-c']
        args:
        - |
          # Simple check - if the model file exists, skip.
          if [ -f /models/llama-3.1-70b/model.safetensors ]; then
            echo "Model already present, skipping download."
            exit 0
          fi
          echo "Model not found. Would use git lfs or aria2 here."
          exit 1
        volumeMounts:
        - name: model-storage
          mountPath: /models
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: vllm-model-pvc
Key tactics here:
- startupProbe over initContainer for model loading: An initContainer that runs the vLLM server would work, but it breaks the Kubernetes model. The startupProbe is the correct primitive for "this container takes a long time to be ready." It prevents the kubelet from killing the container while it's loading 70B parameters.
- Memory limits == requests: For memory-intensive workloads, set your limit equal to your request. This guarantees the pod gets its RAM and stops the node from over-committing memory, which leads to noisy-neighbor OOM kills.
- The PVC is your lifeline: Without it, every pod restart and every scale event re-downloads 140GB. That's a recipe for ImagePullBackOff-style timeouts and colossal cloud egress bills. The first debugging step for a failing pod is always kubectl logs <pod> --previous; if you see Hugging Face timeout errors, your download logic is broken.
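One piece this Deployment still needs is a Service in front of it, so clients have a stable address and traffic only reaches ready pods. A minimal sketch; the Service name and cluster port are assumptions, pick your own:

```yaml
# vllm-service.yaml -- stable virtual IP in front of the vLLM pods.
apiVersion: v1
kind: Service
metadata:
  name: vllm-inference
  namespace: llm
spec:
  selector:
    app: vllm-inference   # Matches the Deployment's pod labels.
  ports:
  - name: http
    port: 80              # Cluster-internal port clients call.
    targetPort: 8000      # vLLM's container port.
```

Because a Service only routes to pods whose readiness probe passes, a replica that is still loading weights receives no traffic.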
Attach a Persistent Volume: Your 140GB Model is Not a Temporary File
A hostPath volume is fine for a k3s lab. For a real cloud cluster (EKS, GKE, AKS), you need a network-attached persistent volume.
# model-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-model-pvc
  namespace: llm
spec:
  accessModes:
  - ReadOnlyMany   # Multiple pods can read the same model. Needs a backend
                   # that supports it (EFS, Filestore, Azure Files); block
                   # storage like EBS gp3 is single-node only.
  storageClassName: standard   # e.g. 'gp3' on AWS, 'managed-csi' on AKS, 'premium-rwo' on GKE.
  resources:
    requests:
      storage: 200Gi   # Give yourself buffer space.
Mount this PVC as shown in the deployment. Now, when a new vLLM pod spins up, it reads the weights from fast network storage, not the internet. This is the difference between a 2-minute scale-out and a 2-hour one.
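One way to seed that PVC before the first deployment is a one-off Job. This is a sketch under assumptions: the huggingface-cli invocation, the hf-token Secret, and the /models path are illustrative, and since a ReadOnlyMany claim can't be written, you'd run this while the volume is still mounted read-write (or via an RWX-capable class):

```yaml
# model-seed-job.yaml -- run once to populate the PVC with weights.
apiVersion: batch/v1
kind: Job
metadata:
  name: model-seed
  namespace: llm
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: downloader
        image: python:3.11-slim
        command: ["sh", "-c"]
        args:
        - |
          pip install --quiet "huggingface_hub[cli]" && \
          huggingface-cli download meta-llama/Meta-Llama-3.1-70B-Instruct \
            --local-dir /models/llama-3.1-70b
        env:
        - name: HF_TOKEN                 # Gated model: token from a Secret.
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: token
        volumeMounts:
        - name: model-storage
          mountPath: /models
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: vllm-model-pvc
```

Once the Job completes, every pod that mounts the claim reads the same weights from network storage.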
Autoscaling on What Matters: Queue Depth, Not CPU
CPU usage for an LLM inference server is meaningless. It’s GPU-bound and memory-bound. Scaling based on CPU will always be wrong. You need custom metrics.
Enter KEDA (Kubernetes Event-Driven Autoscaling). It's the simplest way to scale on Prometheus metrics, like request queue depth. First, install KEDA with Helm (notice the pattern? Helm installs a whole multi-resource bundle in one command instead of a pile of raw kubectl applies).
helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda --namespace keda --create-namespace
Now, define a ScaledObject that tells KEDA how to scale your deployment based on a metric. Let’s assume you’ve instrumented your vLLM server to expose a pending_requests metric.
# keda-hpa.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-request-scaler
  namespace: llm
spec:
  scaleTargetRef:
    kind: Deployment
    name: vllm-inference
  minReplicaCount: 1
  maxReplicaCount: 10
  cooldownPeriod: 300   # Wait 5 minutes after load drops before scaling in.
  pollingInterval: 30   # Check metrics every 30s.
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-server.monitoring.svc.cluster.local:9090
      metricName: pending_requests
      # Pending requests is a gauge: average the current backlog across pods,
      # rather than taking a rate().
      query: |
        avg(vllm_pending_requests{job="vllm-inference"})
      threshold: "5"   # Scale out if avg pending requests > 5.
This HPA reacts to actual application backlog, not node utilization. The default HPA response time is 15–30s from metric breach to pod ready; with a pre-pulled model on a PVC, your new pod can be serving in under a minute.
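For that Prometheus query to return anything, something has to scrape the metrics endpoint. If you run the Prometheus Operator, a PodMonitor is the usual glue. A sketch under assumptions: the Operator is installed and watching the llm namespace, and the container port in the pod spec has been given the name api (it's unnamed in the Deployment above):

```yaml
# vllm-podmonitor.yaml -- tells the Prometheus Operator to scrape vLLM pods.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: vllm-inference
  namespace: llm
spec:
  selector:
    matchLabels:
      app: vllm-inference
  podMetricsEndpoints:
  - port: api        # Named container port to scrape.
    path: /metrics   # vLLM's OpenAI-compatible server exposes Prometheus metrics here.
    interval: 15s
```

Scraping the pods directly (rather than through a Service) keeps per-replica queue depth visible, which is exactly what the ScaledObject averages over.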
Rolling Updates Without Dropping Requests
You’ve fixed scaling. Now, how do you update the vLLM version without causing the exact outage described in the opening? PodDisruptionBudget (PDB) and strategic Deployment parameters.
First, create a PDB that says "never break more than one of my pods at a time."
apiVersion: policy/v1   # policy/v1beta1 was removed in K8s 1.25, yet 34% of clusters still use deprecated APIs (Fairwinds 2025).
kind: PodDisruptionBudget
metadata:
  name: vllm-pdb
  namespace: llm
spec:
  minAvailable: 1   # Always keep at least 1 pod available.
  selector:
    matchLabels:
      app: vllm-inference
Then, in your Deployment, configure the update strategy:
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # Launch one new pod before killing an old one.
      maxUnavailable: 0    # Never drop below the desired count during an update.
This combination ensures Kubernetes launches the new pod, waits for its startupProbe and readinessProbe to pass, then terminates an old pod. Zero dropped connections.
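Rolling updates also depend on old pods draining cleanly, not just new ones becoming ready. Two pod-spec knobs are worth setting; the 10-second sleep is an illustrative buffer for endpoint propagation, not a vLLM requirement, and the grace period should cover your longest expected generation:

```yaml
# In the vLLM pod spec: give in-flight requests time to finish.
spec:
  terminationGracePeriodSeconds: 120   # Long generations need time to drain.
  containers:
  - name: vllm-server
    lifecycle:
      preStop:
        exec:
          # Keep serving briefly while the endpoints controller removes this
          # pod from Service backends, so no request hits a dying container.
          command: ["sh", "-c", "sleep 10"]
```

Without the grace period, the kubelet sends SIGKILL after the default 30 seconds, cutting off any generation still streaming tokens.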
Cost Optimization: Spot Nodes and Graceful Preemption
GPUs are expensive. Spot/Preemptible instances can cut costs by 60-90%. The catch: they can vanish with 30 seconds warning. Your stateless inference service can handle this if designed correctly.
Use Karpenter (on AWS) or cluster autoscaler with priority classes. The key is to set a lower priorityClassName for your vLLM pods and a toleration for spot node taints.
# In your vLLM Deployment pod spec:
spec:
  tolerations:
  - key: "node.kubernetes.io/spot"   # Taint keys vary by cloud/provisioner; match yours.
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
  - key: "node.kubernetes.io/preemptible"
    operator: "Exists"
    effect: "NoSchedule"
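The paragraph above mentions a lower priorityClassName, but the spec doesn't show one. A minimal sketch; the class name and value are arbitrary choices for illustration:

```yaml
# spot-inference-priority.yaml -- a below-default priority tier.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: spot-inference
value: -10            # Below the default of 0, so these pods are reclaimed first.
globalDefault: false
description: "Inference replicas that tolerate preemption on spot capacity."
```

Then set priorityClassName: spot-inference in the Deployment's pod spec, and the scheduler will sacrifice these replicas before anything more important under resource pressure.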
When the node is preempted, Kubernetes will gracefully evict the pod. With multiple replicas (from your HPA) and a PDB, the workload will reschedule on another node. The new pod will attach to the existing PVC and load the model, restoring capacity. The user might see a slightly higher latency during this reshuffle, but no hard errors.
Performance & Tooling: Seeing is Believing
You can’t debug what you can’t see. Use k9s or Lens to watch pods, logs, and GPU utilization in real time. For large clusters, server-side filtering is a must.
| Operation | Method | Time (Approx.) | When It Matters |
|---|---|---|---|
| Get pods (1000-pod cluster) | kubectl get pods \| grep vllm (client-side filter) | ~2s | When you're impatient |
| Get pods (1000-pod cluster) | kubectl get pods -l app=vllm (server-side label selector) | ~200ms | Every. Single. Time. |
| Helm install (cert-manager) | helm install | ~12s | Standard practice |
| Apply raw manifests | kubectl apply -f ./ | ~25s | When you hate tooling |
| HPA scale-up | From metric breach to pod ready | 15-30s | During a traffic spike |
Filter with label and field selectors (-l, --field-selector) so the API server does the work instead of your terminal. The difference is night and day.
Next Steps: From Working to Robust
You now have a vLLM deployment that scales, heals, and updates without dropping requests. Where next?
- Service Mesh & Canary Releases: Use Istio to route a percentage of traffic to a new model version (e.g., Llama-3.2) before a full rollout. This is your next layer of reliability.
- GitOps Pipeline: Convert your manifests to a Kustomize overlay or Helm chart and let ArgoCD or Flux sync them automatically from Git. This enforces change tracking and rollback capability.
- Advanced Monitoring: Hook your Prometheus pending_requests metric into Grafana alerts. Track GPU memory utilization (nvidia-smi-style metrics exposed via the DCGM Exporter) to predict when you need to scale vertically (bigger GPU instance) vs. horizontally (more pods).
- Multi-Model Serving: Adapt the PVC strategy to host multiple models. Use a model registry pattern and potentially a scheduler like KServe or TGI's Kubernetes backend to dynamically load/unload models based on demand.
The goal isn't just to run vLLM on Kubernetes. It's to make your Kubernetes LLM serving layer as boring and reliable as the rest of your infrastructure. The 66% of organizations using Kubernetes (CNCF Annual Survey 2025) aren't just running stateless web apps; they're managing complex, stateful workloads like this. Your GPU is a precious resource—treat it with the automation and respect it deserves. Now go delete that manually managed pod.