Deploying a vLLM Inference Server on Kubernetes with GPU Scheduling and Auto-Scaling

Production guide to running vLLM on Kubernetes — GPU node configuration, resource limits for LLM inference, HPA with custom metrics, readiness probes that actually work, and zero-downtime rolling updates.

Your vLLM pod serves 10 req/s fine. At 50 req/s it OOMs, restarts, and drops every in-flight request. This guide fixes all three problems. You’re not just deploying a container; you’re orchestrating a stateful, GPU-hungry, memory-intensive beast in a system where 96% of organizations report Kubernetes increased their deployment frequency (Red Hat State of Kubernetes 2025). The average cluster now runs 400+ pods (CNCF 2025), and your vLLM server needs to be a resilient citizen, not a resource-hogging liability.

We’ll move from a brittle kubectl apply to a production-grade deployment with GPU scheduling, intelligent autoscaling, and zero-downtime updates. This is for when you’ve outgrown running ollama serve in a terminal.

GPU Nodes: Making Your Cluster See the Silicon

First, your cluster needs to know a GPU from a hole in the ground. If you run kubectl describe nodes and don’t see nvidia.com/gpu in the allocatable resources, you’re dead in the water.

On your GPU-equipped node (this applies to k3s, EKS, GKE with NVIDIA drivers, or AKS), you need the NVIDIA Device Plugin. It's not magic: it's a DaemonSet that advertises GPU capacity to the kubelet.


helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm install nvidia-device-plugin nvdp/nvidia-device-plugin \
  --namespace kube-system \
  --version 0.15.0 \
  --set runtimeClassName=nvidia

After a minute, verify:

kubectl describe node <your-gpu-node> | grep -A5 -B5 Capacity

You should see:

Capacity:
  nvidia.com/gpu:  1
Allocatable:
  nvidia.com/gpu:  1

Now, we don’t want your vLLM pod landing on a puny CPU node. Use node affinity to guide it home. Label your GPU node:

kubectl label nodes <node-name> hardware-type=nvidia-gpu

This label will be your beacon. Without it, you'll face the classic error: 0/3 nodes are available: 3 Insufficient nvidia.com/gpu. The fix is always twofold: 1) Ensure the device plugin pods are running (kubectl get pods -n kube-system | grep nvidia), and 2) Verify the node carries the correct label and that nvidia.com/gpu appears under Allocatable.
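The fastest end-to-end check is a throwaway pod that requests one GPU and runs nvidia-smi. A minimal sketch (the pod name and CUDA image tag are illustrative; any CUDA base image works):

```yaml
# gpu-smoke-test.yaml -- one-shot pod to verify GPU scheduling actually works
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  nodeSelector:
    hardware-type: nvidia-gpu   # the label we just applied
  containers:
  - name: nvidia-smi
    image: nvidia/cuda:12.4.1-base-ubuntu22.04   # tag is illustrative
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
```

Apply it, then run kubectl logs gpu-smoke-test. If you see the familiar GPU table, scheduling works; delete the pod afterwards.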

The vLLM Deployment: Where Resource Limits Are a Promise, Not a Suggestion

Here’s where most deployments fail. A vLLM server loading a 70B model needs a precise cocktail of memory and GPU. Guessing leads to OOMKilled or CrashLoopBackOff.

Let’s build a robust Deployment. We’ll use an initContainer to pull the model weights before the main container starts—this separates the pull logic and allows for shared volume caching.

# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
  namespace: llm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-inference
  template:
    metadata:
      labels:
        app: vllm-inference
    spec:
      # Affinity: Pin me to a GPU node.
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: hardware-type
                operator: In
                values:
                - nvidia-gpu
      containers:
      - name: vllm-server
        image: vllm/vllm-openai:latest  # pin a specific release tag in production
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - "--model"
        - "meta-llama/Meta-Llama-3.1-70B-Instruct"
        - "--port"
        - "8000"
        # Must match the nvidia.com/gpu count below. A 70B model in fp16
        # needs ~140GB of VRAM, i.e. several GPUs (--tensor-parallel-size 4
        # on 4x 80GB cards) or a quantized variant to fit on one.
        - "--tensor-parallel-size"
        - "1"
        # The HF cache volumeMount below is what avoids re-downloading;
        # --load-format only selects the weight format.
        - "--load-format"
        - "safetensors"
        - "--disable-log-requests"
        # Resource limits are NON-NEGOTIABLE
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "140Gi"
            cpu: "4"
          requests:
            nvidia.com/gpu: 1
            memory: "140Gi"
            cpu: "2"
        ports:
        - containerPort: 8000
        # Readiness gates traffic. With a startupProbe defined, readiness
        # checks only begin after startup succeeds, so no initialDelaySeconds.
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          periodSeconds: 5
          failureThreshold: 3
        # A startup probe is better for long-loading apps (K8s 1.20+)
        startupProbe:
          httpGet:
            path: /health
            port: 8000
          failureThreshold: 60  # Up to 5 minutes (60 * 5s); size to your real load time
          periodSeconds: 5
        volumeMounts:
        - name: model-storage
          mountPath: /root/.cache/huggingface/hub
      # Init container to pull weights once, share across pods.
      initContainers:
      - name: model-downloader
        image: busybox:latest
        command: ['sh', '-c']
        args:
        - |
          # Sketch only: a real HF cache uses a models--<org>--<name>/snapshots/
          # layout, not this flat path. Swap in your actual downloader.
          if [ -f /models/llama-3.1-70b/model.safetensors ]; then
            echo "Model already present, skipping download."
            exit 0
          fi
          echo "Model not found. Would use git lfs or aria2 here."
          exit 1
        volumeMounts:
        - name: model-storage
          mountPath: /models
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: vllm-model-pvc

Key tactics here:

  1. startupProbe over initContainer for model loading: An initContainer can't hand off a warmed-up server, because init containers must exit before the main container starts. The startupProbe is the correct primitive for "this container takes a long time to become ready." It prevents the kubelet from killing the container while it's loading 70B parameters.
  2. Memory Limits == Requests: For memory-intensive workloads, set your limit equal to your request. The scheduler then reserves the full amount and the node can't overcommit memory, which is the usual source of noisy-neighbor OOM kills. (For full Guaranteed QoS you would also set CPU requests equal to limits.)
  3. The PVC is your lifeline: Without it, every pod restart and every scale event re-downloads 140GB. That means multi-hour cold starts and colossal cloud egress bills. When a pod is crash-looping, start with kubectl logs <pod> --previous; HuggingFace timeout errors there mean your download path, not your server, is the problem.
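Tactic 1's probe budget is worth sanity-checking against your own numbers. A minimal sketch of the arithmetic (model size and storage throughput here are illustrative assumptions; measure yours):

```shell
# Back-of-envelope startupProbe budget (all numbers are illustrative
# assumptions; substitute your model size and measured storage throughput).
MODEL_GB=140        # weights on the PVC
READ_MBPS=1000      # sustained read throughput, MB/s
PERIOD_S=5          # startupProbe periodSeconds

LOAD_S=$(( MODEL_GB * 1000 / READ_MBPS ))            # seconds just to read weights
THRESHOLD=$(( (LOAD_S + PERIOD_S - 1) / PERIOD_S ))  # ceil(LOAD_S / PERIOD_S)
echo "load ~${LOAD_S}s -> failureThreshold >= ${THRESHOLD}"
```

At ~140 seconds of pure I/O, plus CUDA initialization and graph capture on top, budget roughly double the raw read time before the probe gives up.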

Attach a Persistent Volume: Your 140GB Model is Not a Temporary File

A hostPath volume is fine for a k3s lab. For a real cloud cluster (EKS, GKE, AKS), you need a network-attached persistent volume.

# model-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-model-pvc
  namespace: llm
spec:
  accessModes:
    - ReadOnlyMany  # Multiple pods can read the same model.
  # Note: block storage (EBS gp3, GCE PD, Azure Disk) only supports
  # ReadWriteOnce. For ReadOnlyMany/ReadWriteMany across pods you need a
  # file store: EFS on AWS, Filestore on GKE, Azure Files on AKS.
  storageClassName: standard  # Substitute your cluster's ROX/RWX-capable class.
  resources:
    requests:
      storage: 200Gi  # Give yourself buffer space.

Mount this PVC as shown in the deployment. Now, when a new vLLM pod spins up, it reads the weights from fast network storage, not the internet. This is the difference between a 2-minute scale-out and a 2-hour one.
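Populating that PVC is a one-time job, not something your serving pods should do inline. A sketch of a pre-warm Job, assuming a `hf-token` Secret holds your HuggingFace access token (the image choice and Secret name are assumptions; substitute your own downloader and auth):

```yaml
# model-prewarm-job.yaml -- hypothetical one-shot download into the PVC
apiVersion: batch/v1
kind: Job
metadata:
  name: model-prewarm
  namespace: llm
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: downloader
        image: python:3.11-slim   # any image where huggingface_hub installs cleanly
        command: ["sh", "-c"]
        args:
        - |
          pip install -q huggingface_hub &&
          huggingface-cli download meta-llama/Meta-Llama-3.1-70B-Instruct \
            --cache-dir /models
        env:
        - name: HF_TOKEN            # assumed Secret with your HF access token
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: token
        volumeMounts:
        - name: model-storage
          mountPath: /models
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: vllm-model-pvc
```

Because --cache-dir writes the standard HF hub layout to the volume root, the Deployment's mount at /root/.cache/huggingface/hub finds the weights without any extra wiring.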

Autoscaling on What Matters: Queue Depth, Not CPU

CPU usage for an LLM inference server is meaningless. It’s GPU-bound and memory-bound. Scaling based on CPU will always be wrong. You need custom metrics.

Enter KEDA (Kubernetes Event-Driven Autoscaling). It's the simplest way to scale on Prometheus metrics, like request queue depth. First, install KEDA with Helm.

helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda --namespace keda --create-namespace

Now, define a ScaledObject that tells KEDA how to scale your deployment based on a metric. Recent vLLM builds already expose queue depth on /metrics (e.g. vllm:num_requests_waiting; the exact name varies by version). Here we'll assume you scrape it into Prometheus as a pending_requests gauge.

# keda-hpa.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-request-scaler
  namespace: llm
spec:
  scaleTargetRef:
    kind: Deployment
    name: vllm-inference
  minReplicaCount: 1
  maxReplicaCount: 10
  cooldownPeriod: 300  # Only applies when scaling to zero; normal scale-in follows HPA behavior
  pollingInterval: 30   # Check metrics every 30s
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-server.monitoring.svc.cluster.local:9090
      metricName: pending_requests
      # Queue depth is a gauge; rate() would give requests/sec, not backlog.
      query: |
        avg(vllm_pending_requests{job="vllm-inference"})
      threshold: "5"  # Scale up if avg pending requests per pod > 5

This scaler reacts to actual application backlog, not node utilization (KEDA creates and manages the underlying HPA for you). Typical response time is 15-30s from metric breach to new pod scheduled; with a pre-pulled model on a PVC, that pod can be serving in under a minute.
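Before trusting the scaler, confirm the metric actually exists on your build. A quick port-forward check (metric names here are assumptions; grep for whatever your version exposes):

```shell
# Forward the vLLM port locally and inspect its queue-depth gauges
kubectl -n llm port-forward deploy/vllm-inference 8000:8000 &
sleep 2
curl -s localhost:8000/metrics | grep -E 'num_requests_(waiting|running)'
kill $!
```

If nothing matches, your Prometheus query will silently return no data and KEDA will never scale: fix the instrumentation first.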

Rolling Updates Without Dropping Requests

You’ve fixed scaling. Now, how do you update the vLLM version without causing the exact outage described in the opening? PodDisruptionBudget (PDB) and strategic Deployment parameters.

First, create a PDB that says "never break more than one of my pods at a time."

apiVersion: policy/v1  # Note: policy/v1beta1 was removed in K8s 1.25, yet 34% of clusters still ship deprecated APIs (Fairwinds 2025)
kind: PodDisruptionBudget
metadata:
  name: vllm-pdb
  namespace: llm
spec:
  minAvailable: 1  # Always have at least 1 pod available.
  selector:
    matchLabels:
      app: vllm-inference

Then, in your Deployment, configure the update strategy:

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Can launch one new pod before killing an old one.
      maxUnavailable: 0  # Never allow fewer pods than desired during update.

This combination ensures Kubernetes launches the new pod, waits for its startupProbe and readinessProbe to pass, and only then terminates an old pod. As long as your server drains in-flight requests on SIGTERM, no connections are dropped.
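To trigger, watch, and if necessary reverse an update, the standard rollout commands apply. A sketch assuming the Deployment above (the image tag is illustrative; pin your own):

```shell
# Roll out a new image; maxSurge brings up the replacement pod first
kubectl -n llm set image deployment/vllm-inference \
  vllm-server=vllm/vllm-openai:v0.6.3   # tag is illustrative

# Blocks until the new pod passes its startup and readiness probes
kubectl -n llm rollout status deployment/vllm-inference

# If the new version misbehaves, one command back to the previous ReplicaSet
kubectl -n llm rollout undo deployment/vllm-inference
```

Because rollout status only returns once probes pass, it doubles as a gate in CI pipelines.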

Cost Optimization: Spot Nodes and Graceful Preemption

GPUs are expensive. Spot/Preemptible instances can cut costs by 60-90%. The catch: they can vanish with as little as 30 seconds' warning (GCP preemptible; AWS Spot gives two minutes). Your stateless inference service can handle this if designed correctly.

Use Karpenter (on AWS) or cluster autoscaler with priority classes. The key is to set a lower priorityClassName for your vLLM pods and a toleration for spot node taints.

# In your vLLM Deployment pod spec:
spec:
  tolerations:
  # Taint keys vary by platform: GKE spot pools taint with
  # cloud.google.com/gke-spot, and Karpenter applies whatever you configure.
  # Match the taints your node pool actually sets.
  - key: "node.kubernetes.io/spot"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
  - key: "node.kubernetes.io/preemptible"
    operator: "Exists"
    effect: "NoSchedule"

When the node is preempted, Kubernetes will gracefully evict the pod. With multiple replicas (from your HPA) and a PDB, the workload will reschedule on another node. The new pod will attach to the existing PVC and load the model, restoring capacity. The user might see a slightly higher latency during this reshuffle, but no hard errors.
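Preemption is only graceful if the pod gets time to drain. A preStop sleep plus a matching grace period is the usual belt-and-braces sketch (the 20s figure is an assumption; size it to your p99 request latency):

```yaml
# In your vLLM Deployment pod spec:
spec:
  terminationGracePeriodSeconds: 30   # must exceed the preStop sleep plus drain time
  containers:
  - name: vllm-server
    lifecycle:
      preStop:
        exec:
          # Keep serving while the endpoint is removed from the Service,
          # then let SIGTERM finish in-flight generations.
          command: ["sh", "-c", "sleep 20"]
```

The sleep covers the window between the pod being marked terminating and kube-proxy actually removing it from Service endpoints, so no new requests land on a dying pod.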

Performance & Tooling: Seeing is Believing

You can’t debug what you can’t see. Use k9s or Lens to watch pods, logs, and GPU utilization in real time. For large clusters, server-side filtering is a must.

| Operation | Method | Time (approx.) | When it matters |
| --- | --- | --- | --- |
| Get pods (1000-pod cluster) | kubectl get pods piped through grep (client-side filter) | ~2s | When you're impatient |
| Get pods (1000-pod cluster) | kubectl get pods -l app=vllm (server-side selector) | ~200ms | Every. Single. Time. |
| Helm install (cert-manager) | helm install | ~12s | Standard practice |
| Apply raw manifests | kubectl apply -f ./ | ~25s | When you hate tooling |
| HPA scale-up | From metric breach to pod ready | 15-30s | During a traffic spike |

Filter with label and field selectors (-l, --field-selector) so the API server does the work instead of your terminal. The difference is night and day.

Next Steps: From Working to Robust

You now have a vLLM deployment that scales, heals, and updates without dropping requests. Where next?

  1. Service Mesh & Canary Releases: Use Istio to route a percentage of traffic to a new model version (e.g., Llama-3.2) before a full rollout. This is your next layer of reliability.
  2. GitOps Pipeline: Convert your manifests to a Kustomize overlay or Helm chart and let ArgoCD or Flux sync them automatically from Git. This enforces change tracking and rollback capability.
  3. Advanced Monitoring: Hook your Prometheus pending_requests metric into Grafana alerts. Track GPU memory utilization (nvidia_smi metrics exposed via DCGM Exporter) to predict when you need to scale vertically (bigger GPU instance) vs. horizontally (more pods).
  4. Multi-Model Serving: Adapt the PVC strategy to host multiple models. Use a model registry pattern and a serving layer like KServe to load and unload models dynamically based on demand.

The goal isn't just to run vLLM on Kubernetes. It's to make your Kubernetes LLM serving layer as boring and reliable as the rest of your infrastructure. The 66% of organizations using Kubernetes (CNCF Annual Survey 2025) aren't just running stateless web apps; they're managing complex, stateful workloads like this. Your GPU is a precious resource—treat it with the automation and respect it deserves. Now go delete that manually managed pod.