Deploy Ollama on Kubernetes: GPU Scheduling, Persistent Storage & High Availability

Run Ollama on Kubernetes with GPU node affinity, PersistentVolumeClaims, rolling updates, and pod anti-affinity for production HA. Step-by-step 2026 guide.

Problem: Ollama Loses Models on Pod Restart and Won't Schedule on GPU Nodes

You've containerized Ollama and pushed it to Kubernetes. Then the pod restarts — and every 40GB model you pulled is gone. Or the pod lands on a CPU-only node and runs at 2 tokens/sec. Or you have no fallback when the node goes down.

You'll learn:

  • How to pin Ollama pods to GPU nodes using node affinity and resource limits
  • How to persist model storage across restarts with a PersistentVolumeClaim
  • How to run multiple replicas with pod anti-affinity for real high availability
  • How to expose Ollama internally with a ClusterIP Service and externally with an Ingress

Time: 45 min | Difficulty: Advanced


Why the Default Deployment Breaks in Production

Three things go wrong when you run Ollama on Kubernetes without careful configuration.

Models disappear on restart. Ollama stores models under ~/.ollama/models inside the container. Without a persistent volume, every pod restart triggers a fresh pull — 20–70GB depending on the model. In a cluster where pods reschedule frequently, this is unusable.

Pods land on wrong nodes. Without GPU resource requests, the Kubernetes scheduler treats all nodes as equivalent. Your Ollama pod ends up on a CPU-only m5.2xlarge instead of the g4dn.xlarge next to it.

Single pod = single point of failure. One pod means one node failure takes down your inference endpoint. For anything serving real traffic, you need replicas spread across availability zones or physical hosts.


Prerequisites

  • A running Kubernetes cluster (1.28+) with at least one GPU node
  • kubectl configured and pointing at the cluster
  • NVIDIA GPU Operator installed, or nvidia-device-plugin DaemonSet running — verify with:
kubectl get nodes -o=custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'

Expected output:

NAME              GPU
gpu-node-1        1
gpu-node-2        1
cpu-node-1        <none>

If the GPU column shows <none> on all nodes, the device plugin isn't running yet. Install it before continuing.


Solution

Step 1: Create the Namespace and Storage Class

Keep all Ollama resources isolated in their own namespace.

kubectl create namespace ollama

If your cloud provider's default StorageClass supports dynamic provisioning, skip this step. Check what's available:

kubectl get storageclass

For clusters without a default StorageClass (bare metal, on-prem), create one using the local-path provisioner:

kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/v0.0.26/deploy/local-path-storage.yaml

Then annotate it as default:

kubectl patch storageclass local-path \
  -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'

Step 2: Create the PersistentVolumeClaim

This claim reserves 100GB of storage for model files. Ollama mounts it at /root/.ollama.

# ollama-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models
  namespace: ollama
spec:
  accessModes:
    - ReadWriteOnce       # One pod reads/writes at a time — fine for single-node model storage
  # storageClassName: local-path   # Uncomment to pin a class; omitting the field uses the cluster default ("" would disable dynamic provisioning)
  resources:
    requests:
      storage: 100Gi      # Size for 2–3 large quantized models; increase for more

kubectl apply -f ollama-pvc.yaml

Verify it's bound before proceeding:

kubectl get pvc -n ollama

Expected: STATUS column shows Bound. If it shows Pending, either no default StorageClass exists or its provisioner isn't running — kubectl describe pvc ollama-models -n ollama shows the reason under Events.


Step 3: Deploy Ollama with GPU Scheduling

This Deployment does four things: requests a GPU resource, uses node affinity to target GPU nodes, mounts the PVC, and sets CPU and memory limits so the pod can't starve other workloads on a shared node. (The nvidia.com/gpu request allocates a whole GPU; Kubernetes doesn't partition VRAM within it.)

# ollama-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
spec:
  replicas: 1           # Start with 1; we'll cover HA replicas in Step 5
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      # ── GPU Node Affinity ─────────────────────────────────────────
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: nvidia.com/gpu.present   # Label set by GPU Operator
                    operator: In
                    values:
                      - "true"
      # ── Tolerations for GPU node taints ──────────────────────────
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
      containers:
        - name: ollama
          image: ollama/ollama:0.6.2    # Pin to a specific version — never use :latest in prod
          ports:
            - containerPort: 11434
          env:
            - name: OLLAMA_HOST
              value: "0.0.0.0"          # Listen on all interfaces, not just localhost
            - name: OLLAMA_KEEP_ALIVE
              value: "24h"              # Keep models loaded in VRAM between requests
            - name: OLLAMA_NUM_PARALLEL
              value: "2"               # Allow 2 concurrent inference requests per pod
          resources:
            requests:
              memory: "8Gi"
              cpu: "2"
              nvidia.com/gpu: "1"       # Request exactly 1 GPU — scheduler won't place pod without it
            limits:
              memory: "16Gi"
              cpu: "4"
              nvidia.com/gpu: "1"       # Must match the request — GPUs can't be overcommitted (no fractions without MIG/time-slicing)
          volumeMounts:
            - name: ollama-models
              mountPath: /root/.ollama  # Default Ollama model directory
          livenessProbe:
            httpGet:
              path: /api/tags           # Returns 200 when Ollama is ready
              port: 11434
            initialDelaySeconds: 30     # Give Ollama time to load before health checks start
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /api/tags
              port: 11434
            initialDelaySeconds: 10
            periodSeconds: 10
      volumes:
        - name: ollama-models
          persistentVolumeClaim:
            claimName: ollama-models

kubectl apply -f ollama-deployment.yaml

Watch the pod come up — it may take 60–90 seconds the first time:

kubectl rollout status deployment/ollama -n ollama

Expected: deployment "ollama" successfully rolled out

If it fails:

  • 0/1 nodes are available: insufficient nvidia.com/gpu → No GPU node has free GPU capacity. Check with kubectl describe nodes | grep -A5 "Allocatable"
  • No node matches the affinity rule → The nvidia.com/gpu.present label is set by GPU Feature Discovery (bundled with the GPU Operator). If you run only the bare device plugin, label GPU nodes yourself: kubectl label node <node-name> nvidia.com/gpu.present=true
  • Pod stuck in Pending → Run kubectl describe pod -n ollama <pod-name> and check the Events section

Step 4: Expose Ollama with a Service

Create a ClusterIP Service for internal cluster access (other pods call http://ollama.ollama.svc.cluster.local:11434).

# ollama-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ollama
spec:
  type: ClusterIP
  selector:
    app: ollama
  ports:
    - port: 11434
      targetPort: 11434
      protocol: TCP

kubectl apply -f ollama-service.yaml

To test from within the cluster, run a temporary pod:

kubectl run curl-test --image=curlimages/curl --rm -it --restart=Never \
  -- curl -s http://ollama.ollama.svc.cluster.local:11434/api/tags

Expected: JSON response with {"models":[]} (empty until you pull a model).
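
The intro also promised external access. Below is a sketch of an Ingress in front of the Service, assuming the NGINX Ingress Controller is installed and ollama.example.com is a placeholder hostname. Ollama has no built-in authentication, so add TLS and auth at the ingress layer before exposing it publicly.

```yaml
# ollama-ingress.yaml — sketch; hostname and annotations assume NGINX Ingress Controller
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ollama
  namespace: ollama
  annotations:
    # Generation requests can run long — raise the proxy read timeout
    nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
spec:
  ingressClassName: nginx
  rules:
    - host: ollama.example.com          # Placeholder — use your own DNS name
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ollama
                port:
                  number: 11434
```

Apply it with kubectl apply -f ollama-ingress.yaml and point DNS for the hostname at your ingress controller's load balancer.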


Step 5: Pull a Model into the Persistent Volume

Run ollama pull inside the running pod once. Since models go to the PVC, they survive any future pod restarts.

# Get the pod name
POD=$(kubectl get pod -n ollama -l app=ollama -o jsonpath='{.items[0].metadata.name}')

# Pull a model — this downloads to the PVC
kubectl exec -n ollama $POD -- ollama pull llama3.2:3b

Verify the model is stored:

kubectl exec -n ollama $POD -- ollama list

Expected:

NAME               ID              SIZE    MODIFIED
llama3.2:3b        ...             2.0 GB  Just now

Now restart the pod deliberately to confirm persistence:

kubectl rollout restart deployment/ollama -n ollama
kubectl rollout status deployment/ollama -n ollama

POD=$(kubectl get pod -n ollama -l app=ollama -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n ollama $POD -- ollama list

The model should still appear after restart — it's on the PVC, not the container layer.
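
With a model pulled, other workloads in the cluster can call the Service directly. A minimal client sketch in Python using only the standard library — the Service DNS name is the one from Step 4, and the model name is the one pulled above:

```python
# ollama_client.py — sketch of calling Ollama through the ClusterIP Service
# from another pod; uses urllib to stay dependency-free.
import json
import urllib.request

OLLAMA_URL = "http://ollama.ollama.svc.cluster.local:11434"

def build_generate_request(model: str, prompt: str) -> dict:
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """POST to /api/generate and return the generated text."""
    body = json.dumps(build_generate_request(model, prompt)).encode()
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]

# Usage (from inside the cluster):
#   text = generate("llama3.2:3b", "What is Kubernetes?")
```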


Step 6: Configure High Availability with Pod Anti-Affinity

A single replica means a single node failure takes down inference. To spread replicas across nodes, use pod anti-affinity. Note that this requires ReadWriteMany storage (like NFS or EFS) — ReadWriteOnce PVCs can only attach to one node at a time.

For HA with RWX storage:

# ollama-ha-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
spec:
  replicas: 2             # Two replicas on two separate GPU nodes
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: nvidia.com/gpu.present
                    operator: In
                    values:
                      - "true"
        podAntiAffinity:
          # Hard rule: two Ollama pods must NEVER land on the same node
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - ollama
              topologyKey: kubernetes.io/hostname   # One pod per unique hostname
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
      containers:
        - name: ollama
          image: ollama/ollama:0.6.2
          ports:
            - containerPort: 11434
          env:
            - name: OLLAMA_HOST
              value: "0.0.0.0"
            - name: OLLAMA_KEEP_ALIVE
              value: "24h"
          resources:
            requests:
              nvidia.com/gpu: "1"
            limits:
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: ollama-models
              mountPath: /root/.ollama
          livenessProbe:
            httpGet:
              path: /api/tags
              port: 11434
            initialDelaySeconds: 30
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /api/tags
              port: 11434
            initialDelaySeconds: 10
            periodSeconds: 10
      volumes:
        - name: ollama-models
          persistentVolumeClaim:
            claimName: ollama-models-rwx   # Must be a ReadWriteMany PVC

For AWS, provision an EFS-backed PVC:

# ollama-pvc-rwx.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models-rwx
  namespace: ollama
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc    # Requires AWS EFS CSI driver
  resources:
    requests:
      storage: 100Gi
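
The efs-sc StorageClass referenced above doesn't exist by default. A sketch of one, assuming the AWS EFS CSI driver is installed — the fileSystemId is a placeholder for your own EFS filesystem:

```yaml
# efs-storageclass.yaml — sketch; requires the AWS EFS CSI driver
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap             # Dynamic provisioning via EFS access points
  directoryPerms: "700"
  fileSystemId: fs-0123456789abcdef0   # Placeholder — use your EFS filesystem ID
```

Apply the StorageClass and PVC before the HA Deployment so the claim can bind: kubectl apply -f efs-storageclass.yaml -f ollama-pvc-rwx.yaml -f ollama-ha-deployment.yaml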

If you only have one GPU node, use replicas: 1 and accept that node-level HA isn't possible. Focus instead on fast recovery: the liveness probe restarts an unhealthy container automatically, and because models live on the PVC, a restart doesn't trigger a re-pull.
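
If your GPU nodes span availability zones, you can also prefer spreading replicas per zone, not just per host. A sketch swapping the topology key, using preferred (soft) scheduling so a single-zone cluster can still place both pods:

```yaml
# Drop-in alternative for the podAntiAffinity block above
podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: ollama
        topologyKey: topology.kubernetes.io/zone   # Prefer one replica per zone
```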


Step 7: Add a PodDisruptionBudget

A PodDisruptionBudget (PDB) prevents Kubernetes from evicting too many Ollama pods at once during node drains or cluster upgrades.

# ollama-pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ollama-pdb
  namespace: ollama
spec:
  minAvailable: 1       # At least 1 Ollama pod must be running during disruptions
  selector:
    matchLabels:
      app: ollama

kubectl apply -f ollama-pdb.yaml

This means kubectl drain on a GPU node will wait until another Ollama pod is Running before proceeding — giving you zero-downtime maintenance.


Verification

Run a full end-to-end inference test from within the cluster:

kubectl run inference-test --image=curlimages/curl --rm -it --restart=Never -- \
  curl -s -X POST http://ollama.ollama.svc.cluster.local:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.2:3b","prompt":"What is Kubernetes?","stream":false}'

Expected: a JSON response with a response field containing generated text, with a round trip under 5 seconds on a GPU node.

Check GPU utilization during inference:

# SSH to the GPU node, then:
nvidia-smi dmon -s u -d 1

Expected: GPU utilization spikes to 80–100% during generation, drops to near 0% between requests (model stays loaded due to OLLAMA_KEEP_ALIVE=24h).

Verify PVC is surviving restarts:

kubectl rollout restart deployment/ollama -n ollama
kubectl rollout status deployment/ollama -n ollama
kubectl exec -n ollama \
  $(kubectl get pod -n ollama -l app=ollama -o jsonpath='{.items[0].metadata.name}') \
  -- ollama list

Models should still be listed after the restart.


Production Considerations

Storage cost vs. latency tradeoff. NFS/EFS gives you ReadWriteMany for HA but adds 10–50ms of I/O latency on model load. If pods restart rarely and model load time matters more than strict HA, stick with ReadWriteOnce on fast NVMe-backed block storage.

Model pre-pull as an init container. Instead of manually exec-ing ollama pull, automate it with an init container. Note that ollama pull talks to a running server, so the init container has to start ollama serve in the background before pulling:

initContainers:
  - name: model-puller
    image: ollama/ollama:0.6.2
    command: ["/bin/sh", "-c"]
    args:
      - |
        ollama serve &
        sleep 5
        ollama pull llama3.2:3b
    volumeMounts:
      - name: ollama-models
        mountPath: /root/.ollama

Resource quotas per namespace. If multiple teams share the cluster, add a ResourceQuota to the ollama namespace to prevent it from consuming all GPU capacity:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: ollama-gpu-quota
  namespace: ollama
spec:
  hard:
    requests.nvidia.com/gpu: "2"
    limits.nvidia.com/gpu: "2"

Horizontal Pod Autoscaler doesn't fit GPU workloads out of the box. By default, HPA scales on CPU and memory metrics; GPU utilization isn't built in. For GPU-based autoscaling, use KEDA with a Prometheus scaler that queries a GPU utilization metric from your exporter (for example, DCGM_FI_DEV_GPU_UTIL from the NVIDIA DCGM exporter).
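
A sketch of such a KEDA ScaledObject, assuming KEDA and a Prometheus server scraping GPU metrics are installed — the server address, query, metric name, and threshold are illustrative and depend on your monitoring stack:

```yaml
# ollama-scaledobject.yaml — sketch; requires KEDA and Prometheus
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ollama-gpu-scaler
  namespace: ollama
spec:
  scaleTargetRef:
    name: ollama                       # Scales the ollama Deployment
  minReplicaCount: 1
  maxReplicaCount: 2                   # Bounded by available GPU nodes
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        query: avg(DCGM_FI_DEV_GPU_UTIL{pod=~"ollama.*"})   # Metric name depends on your GPU exporter
        threshold: "80"                # Scale out above 80% average GPU utilization
```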


What You Learned

  • GPU scheduling requires both nvidia.com/gpu resource requests and node affinity rules pointing to labeled GPU nodes — one without the other leads to wrong placement or scheduling failures
  • PersistentVolumeClaims survive pod restarts; ReadWriteOnce is enough for single-replica deployments, ReadWriteMany is required for true multi-replica HA
  • Pod anti-affinity with topologyKey: kubernetes.io/hostname guarantees replicas land on separate physical nodes
  • PodDisruptionBudgets are cheap to add and prevent cluster maintenance from silently taking down your inference endpoint

Limitation: Ollama doesn't support model sharding across multiple GPUs natively. If you need a model that exceeds a single GPU's VRAM, look at vLLM with tensor parallelism instead — it's designed for multi-GPU inference on Kubernetes.

Tested on Kubernetes 1.30, Ollama 0.6.2, NVIDIA GPU Operator 24.9, EKS and bare-metal k3s with RTX 3090