Problem: Ollama Loses Models on Pod Restart and Won't Schedule on GPU Nodes
You've containerized Ollama and deployed it to Kubernetes. Then the pod restarts — and every 40GB model you pulled is gone. Or the pod lands on a CPU-only node and runs at 2 tokens/sec. Or a node goes down and you have no fallback.
You'll learn:
- How to pin Ollama pods to GPU nodes using node affinity and resource limits
- How to persist model storage across restarts with a PersistentVolumeClaim
- How to run multiple replicas with pod anti-affinity for real high availability
- How to expose Ollama to the rest of the cluster with a ClusterIP Service
Time: 45 min | Difficulty: Advanced
Why the Default Deployment Breaks in Production
Three things go wrong when you run Ollama on Kubernetes without careful configuration.
Models disappear on restart. Ollama stores models under ~/.ollama/models inside the container. Without a persistent volume, every pod restart triggers a fresh pull — 20–70GB depending on the model. In a cluster where pods reschedule frequently, this is unusable.
Pods land on wrong nodes. Without GPU resource requests, the Kubernetes scheduler treats all nodes as equivalent. Your Ollama pod ends up on a CPU-only m5.2xlarge instead of the g4dn.xlarge next to it.
Single pod = single point of failure. One pod means one node failure takes down your inference endpoint. For anything serving real traffic, you need replicas spread across availability zones or physical hosts.
Prerequisites
- A running Kubernetes cluster (1.28+) with at least one GPU node
- `kubectl` configured and pointing at the cluster
- NVIDIA GPU Operator installed, or the `nvidia-device-plugin` DaemonSet running — verify with:
kubectl get nodes -o=custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
Expected output:
NAME         GPU
gpu-node-1   1
gpu-node-2   1
cpu-node-1   <none>
If the GPU column shows <none> on all nodes, the device plugin isn't running yet. Install it before continuing.
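One way to install the standalone device plugin is to apply the DaemonSet manifest from the NVIDIA `k8s-device-plugin` repository. The pinned version below is an assumption — check the project's releases page for the current tag (if you use the GPU Operator instead, it bundles this plugin and you can skip this):

```shell
# Pinned device-plugin version — an assumption; check the k8s-device-plugin
# releases page for the current tag before applying.
PLUGIN_VERSION="v0.16.2"
PLUGIN_URL="https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/${PLUGIN_VERSION}/deployments/static/nvidia-device-plugin.yml"

# Only apply when a cluster is actually reachable
if kubectl cluster-info >/dev/null 2>&1; then
  kubectl create -f "$PLUGIN_URL"
fi
```

After the DaemonSet is up, re-run the `kubectl get nodes` check above; the GPU column should now report a count on the GPU nodes.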
Solution
Step 1: Create the Namespace and Storage Class
Keep all Ollama resources isolated in their own namespace.
kubectl create namespace ollama
If your cloud provider's default StorageClass supports dynamic provisioning, skip this step. Check what's available:
kubectl get storageclass
For clusters without a default StorageClass (bare metal, on-prem), create one using the local-path provisioner:
kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/v0.0.26/deploy/local-path-storage.yaml
Then annotate it as default:
kubectl patch storageclass local-path \
-p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
Step 2: Create the PersistentVolumeClaim
This claim reserves 100 GiB of storage for model files. Ollama mounts it at /root/.ollama.
# ollama-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models
  namespace: ollama
spec:
  accessModes:
    - ReadWriteOnce  # One pod reads/writes at a time — fine for single-node model storage
  # storageClassName omitted = use the cluster's default StorageClass.
  # Note: an empty string ("") does NOT mean "default" — it disables dynamic provisioning.
  resources:
    requests:
      storage: 100Gi  # Fits 2–3 large quantized models; increase for more
kubectl apply -f ollama-pvc.yaml
Verify it's bound before proceeding:
kubectl get pvc -n ollama
Expected: the STATUS column shows Bound. If it shows Pending, run kubectl describe pvc ollama-models -n ollama and check the Events section — usually the StorageClass provisioner isn't running.
Step 3: Deploy Ollama with GPU Scheduling
This Deployment does four things: requests a GPU resource, uses node affinity to prefer GPU nodes, mounts the PVC, and sets resource limits that prevent the pod from consuming all VRAM on a shared node.
# ollama-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
spec:
  replicas: 1  # Start with 1; we'll cover HA replicas in Step 6
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      # ── GPU Node Affinity ─────────────────────────────────────────
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: nvidia.com/gpu.present  # Label set by the GPU Operator
                    operator: In
                    values:
                      - "true"
      # ── Tolerations for GPU node taints ──────────────────────────
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
      containers:
        - name: ollama
          image: ollama/ollama:0.6.2  # Pin to a specific version — never use :latest in prod
          ports:
            - containerPort: 11434
          env:
            - name: OLLAMA_HOST
              value: "0.0.0.0"  # Listen on all interfaces, not just localhost
            - name: OLLAMA_KEEP_ALIVE
              value: "24h"  # Keep models loaded in VRAM between requests
            - name: OLLAMA_NUM_PARALLEL
              value: "2"  # Allow 2 concurrent inference requests per pod
          resources:
            requests:
              memory: "8Gi"
              cpu: "2"
              nvidia.com/gpu: "1"  # Request exactly 1 GPU — scheduler won't place the pod without it
            limits:
              memory: "16Gi"
              cpu: "4"
              nvidia.com/gpu: "1"  # Limit matches request — fractional GPUs aren't supported without MIG or time-slicing
          volumeMounts:
            - name: ollama-models
              mountPath: /root/.ollama  # Default Ollama model directory
          livenessProbe:
            httpGet:
              path: /api/tags  # Returns 200 when Ollama is up
              port: 11434
            initialDelaySeconds: 30  # Give Ollama time to start before health checks begin
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /api/tags
              port: 11434
            initialDelaySeconds: 10
            periodSeconds: 10
      volumes:
        - name: ollama-models
          persistentVolumeClaim:
            claimName: ollama-models
kubectl apply -f ollama-deployment.yaml
Watch the pod come up — it may take 60–90 seconds the first time:
kubectl rollout status deployment/ollama -n ollama
Expected: deployment "ollama" successfully rolled out
If it fails:
- `0/1 nodes are available: insufficient nvidia.com/gpu` → no GPU node has capacity. Check with `kubectl describe nodes | grep -A5 "Allocatable"`
- Pod stuck in `Pending` → run `kubectl describe pod -n ollama <pod-name>` and check the Events section
Step 4: Expose Ollama with a Service
Create a ClusterIP Service for internal cluster access (other pods call http://ollama.ollama.svc.cluster.local:11434).
# ollama-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ollama
spec:
  type: ClusterIP
  selector:
    app: ollama
  ports:
    - port: 11434
      targetPort: 11434
      protocol: TCP
kubectl apply -f ollama-service.yaml
To test from within the cluster, run a temporary pod:
kubectl run curl-test --image=curlimages/curl --rm -it --restart=Never \
-- curl -s http://ollama.ollama.svc.cluster.local:11434/api/tags
Expected: JSON response with {"models":[]} (empty until you pull a model).
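From application code, the same endpoint can be called directly. Here is a minimal Python sketch using only the standard library; the URL assumes the Service created above and the `llama3.2:3b` model pulled in the next step — adjust both to your setup:

```python
# Minimal in-cluster Ollama client — a sketch, not a production client
# (no retries or timeouts). OLLAMA_URL assumes the Service from Step 4.
import json
import urllib.request

OLLAMA_URL = "http://ollama.ollama.svc.cluster.local:11434"

def generate(prompt: str, model: str = "llama3.2:3b", base_url: str = OLLAMA_URL) -> str:
    """POST to Ollama's /api/generate endpoint and return the completed text."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Because the Service name resolves cluster-wide, any pod in any namespace can use this client without knowing which node Ollama is running on.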
Step 5: Pull a Model into the Persistent Volume
Run ollama pull inside the running pod once. Since models go to the PVC, they survive any future pod restarts.
# Get the pod name
POD=$(kubectl get pod -n ollama -l app=ollama -o jsonpath='{.items[0].metadata.name}')
# Pull a model — this downloads to the PVC
kubectl exec -n ollama $POD -- ollama pull llama3.2:3b
Verify the model is stored:
kubectl exec -n ollama $POD -- ollama list
Expected:
NAME           ID      SIZE      MODIFIED
llama3.2:3b    ...     2.0 GB    Just now
Now restart the pod deliberately to confirm persistence:
kubectl rollout restart deployment/ollama -n ollama
kubectl rollout status deployment/ollama -n ollama
POD=$(kubectl get pod -n ollama -l app=ollama -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n ollama $POD -- ollama list
The model should still appear after restart — it's on the PVC, not the container layer.
Step 6: Configure High Availability with Pod Anti-Affinity
A single replica means a single node failure takes down inference. To spread replicas across nodes, use pod anti-affinity. Note that this requires ReadWriteMany storage (like NFS or EFS) — ReadWriteOnce PVCs can only attach to one node at a time.
For HA with RWX storage:
# ollama-ha-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
spec:
  replicas: 2  # Two replicas on two separate GPU nodes
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: nvidia.com/gpu.present
                    operator: In
                    values:
                      - "true"
        podAntiAffinity:
          # Hard rule: two Ollama pods must NEVER land on the same node
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - ollama
              topologyKey: kubernetes.io/hostname  # One pod per unique hostname
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
      containers:
        - name: ollama
          image: ollama/ollama:0.6.2
          ports:
            - containerPort: 11434
          env:
            - name: OLLAMA_HOST
              value: "0.0.0.0"
            - name: OLLAMA_KEEP_ALIVE
              value: "24h"
          resources:
            requests:
              nvidia.com/gpu: "1"
            limits:
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: ollama-models
              mountPath: /root/.ollama
          livenessProbe:
            httpGet:
              path: /api/tags
              port: 11434
            initialDelaySeconds: 30
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /api/tags
              port: 11434
            initialDelaySeconds: 10
            periodSeconds: 10
      volumes:
        - name: ollama-models
          persistentVolumeClaim:
            claimName: ollama-models-rwx  # Must be a ReadWriteMany PVC
For AWS, provision an EFS-backed PVC:
# ollama-pvc-rwx.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models-rwx
  namespace: ollama
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc  # Requires the AWS EFS CSI driver
  resources:
    requests:
      storage: 100Gi
If you only have one GPU node, use replicas: 1 and accept that pod-level HA isn't possible. Focus instead on fast recovery: the liveness probe restarts an unhealthy container automatically, and because models live on the PVC, a restart doesn't trigger a re-pull.
Step 7: Add a PodDisruptionBudget
A PodDisruptionBudget (PDB) prevents Kubernetes from evicting too many Ollama pods at once during node drains or cluster upgrades.
# ollama-pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ollama-pdb
  namespace: ollama
spec:
  minAvailable: 1  # At least 1 Ollama pod must stay running during voluntary disruptions
  selector:
    matchLabels:
      app: ollama
kubectl apply -f ollama-pdb.yaml
This means kubectl drain on a GPU node will wait until another Ollama pod is Running before proceeding — giving you zero-downtime maintenance.
Verification
Run a full end-to-end inference test from within the cluster:
kubectl run inference-test --image=curlimages/curl --rm -it --restart=Never -- \
curl -s -X POST http://ollama.ollama.svc.cluster.local:11434/api/generate \
-H "Content-Type: application/json" \
-d '{"model":"llama3.2:3b","prompt":"What is Kubernetes?","stream":false}'
You should see a JSON response with a response field containing generated text, with a round-trip time under 5 seconds on a GPU node.
Check GPU utilization during inference:
# SSH to the GPU node, then:
nvidia-smi dmon -s u -d 1
Expected: GPU utilization spikes to 80–100% during generation, drops to near 0% between requests (model stays loaded due to OLLAMA_KEEP_ALIVE=24h).
Verify PVC is surviving restarts:
kubectl rollout restart deployment/ollama -n ollama
kubectl rollout status deployment/ollama -n ollama
kubectl exec -n ollama \
$(kubectl get pod -n ollama -l app=ollama -o jsonpath='{.items[0].metadata.name}') \
-- ollama list
Models should still be listed after the restart.
Production Considerations
Storage cost vs. latency tradeoff. NFS/EFS gives you ReadWriteMany for HA but adds 10–50ms of I/O latency on model load. If pods restart rarely and model load time matters more than strict HA, stick with ReadWriteOnce on fast NVMe-backed block storage.
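To make the tradeoff concrete, here is a rough back-of-envelope comparison; the throughput figures are illustrative assumptions, not benchmarks:

```python
# Rough cold-start math — throughput numbers are illustrative assumptions,
# not measurements of any particular storage class.
def load_seconds(model_gb: float, throughput_mb_s: float) -> float:
    """Seconds to read a model of model_gb GB at throughput_mb_s MB/s."""
    return model_gb * 1024 / throughput_mb_s

fresh_pull = load_seconds(40, 125)    # re-pulling 40 GB over a ~1 Gbps link: ~5.5 min
nvme_load = load_seconds(40, 2000)    # loading from local NVMe: ~20 s
efs_load = load_seconds(40, 500)      # loading from EFS: ~80 s (varies with provisioned throughput)
```

Even slow shared storage beats a fresh pull by a wide margin, which is why persisting models is Step 2 rather than an optimization.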
Model pre-pull as an init container. Instead of manually exec-ing ollama pull, automate it with an init container that runs the pull on first boot:
initContainers:
  - name: model-puller
    image: ollama/ollama:0.6.2
    # ollama pull is a client command that talks to a running server,
    # so start one temporarily inside the init container
    command: ["/bin/sh", "-c"]
    args:
      - |
        ollama serve &
        sleep 5
        ollama pull llama3.2:3b
    volumeMounts:
      - name: ollama-models
        mountPath: /root/.ollama
Resource quotas per namespace. If multiple teams share the cluster, add a ResourceQuota to the ollama namespace to prevent it from consuming all GPU capacity:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ollama-gpu-quota
  namespace: ollama
spec:
  hard:
    requests.nvidia.com/gpu: "2"
    limits.nvidia.com/gpu: "2"
Horizontal Pod Autoscaler doesn't work for GPU workloads out of the box. HPA natively scales on CPU and memory metrics; GPU utilization isn't one of them. For GPU-based autoscaling, use KEDA with a Prometheus trigger that queries a GPU utilization metric — for example DCGM_FI_DEV_GPU_UTIL from the NVIDIA DCGM exporter.
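As a sketch of what that looks like, a KEDA ScaledObject driven by a Prometheus query might be written as follows. This assumes KEDA is installed and Prometheus scrapes the NVIDIA DCGM exporter; the server address, metric name, and query are assumptions to adjust for your cluster:

```yaml
# keda-scaledobject.yaml — a sketch, not a drop-in config.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ollama-gpu-scaler
  namespace: ollama
spec:
  scaleTargetRef:
    name: ollama               # The Deployment from Step 3
  minReplicaCount: 1
  maxReplicaCount: 2           # Bounded by how many GPU nodes you have
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090  # Assumed Prometheus location
        query: avg(DCGM_FI_DEV_GPU_UTIL{namespace="ollama"})                # Assumed exporter metric name
        threshold: "80"        # Scale out above 80% average GPU utilization
```

Remember that scaling beyond one replica still requires the ReadWriteMany storage setup from Step 6.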
What You Learned
- GPU scheduling requires both `nvidia.com/gpu` resource requests and node affinity rules pointing to labeled GPU nodes — one without the other leads to wrong placement or scheduling failures
- PersistentVolumeClaims survive pod restarts; `ReadWriteOnce` is enough for single-replica deployments, `ReadWriteMany` is required for true multi-replica HA
- Pod anti-affinity with `topologyKey: kubernetes.io/hostname` guarantees replicas land on separate physical nodes
- PodDisruptionBudgets are cheap to add and prevent cluster maintenance from silently taking down your inference endpoint
Limitation: Ollama doesn't support model sharding across multiple GPUs natively. If you need a model that exceeds a single GPU's VRAM, look at vLLM with tensor parallelism instead — it's designed for multi-GPU inference on Kubernetes.
Tested on Kubernetes 1.30, Ollama 0.6.2, NVIDIA GPU Operator 24.9, EKS and bare-metal k3s with RTX 3090