Remember trying to run AI models on your laptop and hearing it sound like a jet engine taking off? Those days are over. Kubernetes Ollama deployment transforms your AI model hosting from a desktop-melting nightmare into a scalable, professional operation.
This tutorial shows you how to deploy Ollama on Kubernetes clusters. You'll learn container orchestration fundamentals, configure pods for AI workloads, and create production-ready deployments. By the end, you'll run multiple AI models across your cluster without breaking a sweat.
Why Deploy Ollama on Kubernetes?
Traditional Ollama installations limit you to a single machine. Your AI models compete for resources, have no automatic recovery when they fail, and scale poorly. Kubernetes Ollama deployment solves these problems with distributed computing power.
Key Benefits of Container Orchestration for AI Models
- Resource isolation: Each model runs in separate pods
- Automatic scaling: Kubernetes adds pods based on demand
- High availability: Failed pods restart automatically
- Load distribution: Requests spread across multiple instances
- Configuration management: Updates deploy without downtime
Prerequisites for Kubernetes Ollama Deployment
Before starting this container orchestration tutorial, ensure you have:
- Kubernetes cluster (v1.24+) with kubectl access
- Docker installed and configured
- 8GB+ RAM available per Ollama pod
- NVIDIA GPU support (optional but recommended)
- Basic understanding of YAML configurations
Understanding Ollama Container Architecture
Ollama containers package AI models with their runtime environment. This containerization approach simplifies deployment across different Kubernetes nodes. Each pod contains:
- Ollama binary and dependencies
- Model files (downloaded or pre-loaded)
- Configuration files for API endpoints
- Health check endpoints for cluster management
Container Resource Requirements
Different AI models need varying resources:
- Small models (7B parameters): 4GB RAM, 2 CPU cores
- Medium models (13B parameters): 8GB RAM, 4 CPU cores
- Large models (70B parameters): 48GB RAM, 8 CPU cores
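These figures can be sanity-checked with a back-of-the-envelope calculation. The sketch below is a rule of thumb, not a guarantee: the 0.5 bytes-per-parameter figure assumes 4-bit quantization (Ollama's default for most models), and real usage grows further with context length and KV cache.

```python
# Rough RAM estimate for a quantized model. Assumptions (not from the
# Ollama docs): ~0.5 bytes per parameter for 4-bit weights, plus a flat
# 1 GiB of runtime overhead.

def estimate_ram_gib(params_billion: float, bytes_per_param: float = 0.5,
                     overhead_gib: float = 1.0) -> float:
    """Estimate resident RAM in GiB: quantized weights plus runtime overhead."""
    weights_gib = params_billion * 1e9 * bytes_per_param / (1024 ** 3)
    return round(weights_gib + overhead_gib, 1)

for size in (7, 13, 70):
    print(f"{size}B params -> ~{estimate_ram_gib(size)} GiB")
```

The 70B estimate comes out below the 48GB in the list above because the list also budgets headroom for KV cache and concurrent requests; treat the calculation as a floor, not a target.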
Creating Ollama Deployment Configuration
Start by creating a Kubernetes deployment manifest. This YAML file defines how Kubernetes should run your Ollama containers.
```yaml
# ollama-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-deployment
  namespace: ai-models
  labels:
    app: ollama
    version: v1.0
spec:
  replicas: 3  # Start with 3 pods for load distribution
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434  # Default Ollama API port
              name: api
          env:
            - name: OLLAMA_HOST
              value: "0.0.0.0"  # Listen on all interfaces
            - name: OLLAMA_ORIGINS
              value: "*"  # Allow cross-origin requests
          resources:
            requests:
              memory: "4Gi"  # Minimum memory requirement
              cpu: "1000m"   # 1 CPU core minimum
            limits:
              memory: "8Gi"  # Maximum memory usage
              cpu: "2000m"   # 2 CPU cores maximum
          volumeMounts:
            - name: ollama-data
              mountPath: /root/.ollama  # Model storage location
          livenessProbe:
            httpGet:
              path: /api/tags  # Health check endpoint
              port: 11434
            initialDelaySeconds: 60
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /api/tags
              port: 11434
            initialDelaySeconds: 30
            periodSeconds: 10
      volumes:
        - name: ollama-data
          persistentVolumeClaim:
            claimName: ollama-pvc
```
This deployment configuration creates three Ollama pods. Each pod runs independently and handles API requests. The persistent volume ensures model data survives pod restarts.
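One thing worth checking before `kubectl apply`: a Deployment's `spec.selector.matchLabels` must match the pod template's labels, or the API server rejects the manifest. A minimal sketch of that rule, with the labels from the manifest above inlined as plain dicts:

```python
# A Deployment's selector must select the pods its template creates:
# every key/value in matchLabels must appear in the template labels.
# Labels below mirror ollama-deployment.yaml rather than parsing it.

def selector_matches(match_labels: dict, template_labels: dict) -> bool:
    return all(template_labels.get(k) == v for k, v in match_labels.items())

match_labels = {"app": "ollama"}
template_labels = {"app": "ollama"}
print(selector_matches(match_labels, template_labels))  # True
```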
Configuring Persistent Storage for AI Models
AI models require persistent storage for downloaded files. Create a PersistentVolumeClaim to store model data across pod lifecycles.
```yaml
# ollama-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-pvc
  namespace: ai-models
spec:
  accessModes:
    - ReadWriteMany  # Multiple pods can access simultaneously
  resources:
    requests:
      storage: 100Gi  # Adjust based on model sizes
  storageClassName: fast-ssd  # Use SSD-backed storage for better performance
```
ReadWriteMany access mode allows multiple Ollama pods to share model files, which reduces storage requirements and improves startup times. Note that not every storage class supports ReadWriteMany; NFS- or CephFS-backed provisioners typically do, while most block-storage classes do not.
Storage Considerations for Production
- Model size planning: Calculate total storage needs before deployment
- Performance requirements: Use SSD storage for faster model loading
- Backup strategy: Implement regular snapshots of model data
- Access patterns: Consider separate volumes for frequently used models
Setting Up Service Discovery and Load Balancing
Kubernetes services provide stable endpoints for your Ollama pods. Create a LoadBalancer service to distribute traffic across all available instances.
```yaml
# ollama-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
  namespace: ai-models
  labels:
    app: ollama
spec:
  type: LoadBalancer
  selector:
    app: ollama
  ports:
    - port: 80
      targetPort: 11434
      protocol: TCP
      name: api
  sessionAffinity: None  # Distribute requests evenly
```
This service gives client applications a stable IP address. Kubernetes routes each request to a healthy Ollama pod; with the default iptables proxy mode, backend selection is effectively random per connection, which evens out across pods under load.
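A toy simulation of how that spreading behaves. This is illustrative only: the pod names are made up, and real kube-proxy selects backends via iptables rules rather than a user-space random number generator — but the statistical effect is the same.

```python
import random

# Simulate per-connection backend selection across three pods and show
# that the split approaches even distribution over many requests.

def route_requests(endpoints, n_requests, seed=42):
    rng = random.Random(seed)  # seeded for reproducibility
    counts = {ep: 0 for ep in endpoints}
    for _ in range(n_requests):
        counts[rng.choice(endpoints)] += 1
    return counts

pods = ["ollama-0", "ollama-1", "ollama-2"]
print(route_requests(pods, 9000))  # each pod ends up near 3000
```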
Internal Service Configuration
For cluster-internal access, create a ClusterIP service:
```yaml
# ollama-internal-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: ollama-internal
  namespace: ai-models
spec:
  type: ClusterIP
  selector:
    app: ollama
  ports:
    - port: 11434
      targetPort: 11434
      name: internal-api
```
Deploying Ollama to Your Kubernetes Cluster
Apply your configuration files to create the Ollama deployment. Execute these commands in order:
```bash
# Create namespace for AI workloads
kubectl create namespace ai-models

# Apply persistent volume claim
kubectl apply -f ollama-pvc.yaml

# Deploy Ollama pods
kubectl apply -f ollama-deployment.yaml

# Create load balancer service
kubectl apply -f ollama-service.yaml

# Verify deployment status
kubectl get pods -n ai-models -l app=ollama
```
Monitor the deployment progress using kubectl commands:
```bash
# Check pod status and events
kubectl describe pods -n ai-models -l app=ollama

# View service endpoints
kubectl get svc -n ai-models

# Monitor resource usage
kubectl top pods -n ai-models
```
Configuring Horizontal Pod Autoscaling
Kubernetes can automatically scale your Ollama deployment based on resource usage. Create an HPA (Horizontal Pod Autoscaler) to handle traffic spikes.
```yaml
# ollama-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ollama-hpa
  namespace: ai-models
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ollama-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # Scale up at 70% CPU usage
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80  # Scale up at 80% memory usage
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 300  # Wait 5 minutes before scaling up
      policies:
        - type: Percent
          value: 50  # Increase by up to 50% of current replicas
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 600  # Wait 10 minutes before scaling down
      policies:
        - type: Percent
          value: 25  # Decrease by up to 25% of current replicas
          periodSeconds: 60
```
Apply the HPA configuration:
```bash
kubectl apply -f ollama-hpa.yaml

# Monitor autoscaling decisions
kubectl get hpa -n ai-models -w
```
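Under the hood, the autoscaler applies a simple formula (documented in the Kubernetes HPA docs): desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the min/max bounds. A sketch using the CPU target from the manifest above:

```python
import math

# The HPA core scaling formula, clamped to the configured replica bounds.
# Defaults mirror the manifest above (minReplicas: 2, maxReplicas: 10).

def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float,
                     min_replicas: int = 2, max_replicas: int = 10) -> int:
    desired = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, desired))

print(desired_replicas(3, 90, 70))  # 3 pods at 90% CPU vs 70% target -> 4
print(desired_replicas(6, 20, 70))  # well under target -> shrink, floored at minReplicas
```

This is the raw recommendation; the `behavior` policies then rate-limit how fast the deployment actually moves toward it.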
Installing and Managing AI Models
Connect to your Ollama service to download and manage AI models. Use port-forwarding for initial setup:
```bash
# Forward a local port to the Ollama service
kubectl port-forward svc/ollama-service 11434:80 -n ai-models

# Pull popular models (run in a separate terminal)
curl -X POST http://localhost:11434/api/pull -d '{"name": "llama2:7b"}'
curl -X POST http://localhost:11434/api/pull -d '{"name": "codellama:13b"}'
curl -X POST http://localhost:11434/api/pull -d '{"name": "mistral:7b"}'

# List installed models
curl http://localhost:11434/api/tags
```
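If you script against the API rather than using curl, the request and response shapes are small enough to build by hand. The sketch below mirrors the bodies used above; the sample `/api/tags` response is illustrative, not captured from a live server.

```python
import json

def pull_body(model: str) -> str:
    """Body for POST /api/pull."""
    return json.dumps({"name": model})

def generate_body(model: str, prompt: str, stream: bool = False) -> str:
    """Body for POST /api/generate."""
    return json.dumps({"model": model, "prompt": prompt, "stream": stream})

def installed_models(tags_response: str) -> list:
    """Extract model names from a GET /api/tags response body."""
    return [m["name"] for m in json.loads(tags_response).get("models", [])]

sample_tags = '{"models": [{"name": "llama2:7b"}, {"name": "mistral:7b"}]}'
print(generate_body("llama2:7b", "Hello"))
print(installed_models(sample_tags))
```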
Automated Model Installation
Create an init container to pre-load models during deployment:
```yaml
# Add to ollama-deployment.yaml under spec.template.spec
initContainers:
  - name: model-loader
    image: ollama/ollama:latest
    command: ["/bin/sh"]
    args:
      - -c
      - |
        ollama serve &
        sleep 10
        ollama pull llama2:7b
        ollama pull mistral:7b
        pkill ollama
    volumeMounts:
      - name: ollama-data
        mountPath: /root/.ollama
```
Monitoring and Observability Setup
Implement comprehensive monitoring for your Kubernetes Ollama deployment. Use Prometheus and Grafana for metrics collection and visualization.
Prometheus ServiceMonitor Configuration
```yaml
# ollama-servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ollama-metrics
  namespace: ai-models
  labels:
    app: ollama
spec:
  selector:
    matchLabels:
      app: ollama
  endpoints:
    - port: api
      interval: 30s
      path: /metrics  # Only useful if Ollama exposes Prometheus metrics here
```
Custom Metrics Collection
Add a sidecar container to collect host-level metrics alongside your pods. Note that node-exporter reports node and filesystem metrics, not Ollama-specific ones; it complements, rather than replaces, application metrics.
```yaml
# Add to ollama-deployment.yaml containers section
- name: metrics-exporter
  image: prom/node-exporter:latest
  ports:
    - containerPort: 9100
      name: metrics
  args:
    - '--path.procfs=/host/proc'
    - '--path.sysfs=/host/sys'
    - '--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|host|etc|rootfs/var/lib/docker/containers|rootfs/var/lib/docker/overlay2|rootfs/run/docker/netns|rootfs/var/lib/docker/aufs)($$|/)'
  volumeMounts:
    # Requires matching hostPath volumes named proc and sys under spec.volumes
    - name: proc
      mountPath: /host/proc
      readOnly: true
    - name: sys
      mountPath: /host/sys
      readOnly: true
```
Security Best Practices for AI Model Deployment
Secure your Kubernetes Ollama deployment with proper RBAC, network policies, and container security measures.
Role-Based Access Control (RBAC)
```yaml
# ollama-rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ollama-sa
  namespace: ai-models
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ollama-role
  namespace: ai-models
rules:
  - apiGroups: [""]
    resources: ["pods", "services"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ollama-binding
  namespace: ai-models
subjects:
  - kind: ServiceAccount
    name: ollama-sa
    namespace: ai-models
roleRef:
  kind: Role
  name: ollama-role
  apiGroup: rbac.authorization.k8s.io
```
Network Security Policies
Restrict network access to Ollama pods:
```yaml
# ollama-networkpolicy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ollama-netpol
  namespace: ai-models
spec:
  podSelector:
    matchLabels:
      app: ollama
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: frontend  # Only allow traffic from the frontend namespace
      ports:
        - protocol: TCP
          port: 11434
  egress:
    - to: []
      ports:
        - protocol: UDP
          port: 53   # DNS, so registry hostnames can be resolved
        - protocol: TCP
          port: 443  # HTTPS for model downloads
        - protocol: TCP
          port: 80   # HTTP for model downloads
```
Troubleshooting Common Deployment Issues
Pod Startup Problems
Check pod logs for startup issues:
```bash
# View detailed pod information
kubectl describe pod -n ai-models -l app=ollama

# Check container logs
kubectl logs -n ai-models -l app=ollama -c ollama

# Follow logs in real-time
kubectl logs -n ai-models -l app=ollama -f
```
Resource Constraints
Monitor resource usage and adjust limits:
```bash
# Check resource usage
kubectl top pods -n ai-models

# View cluster resource availability
kubectl describe nodes

# Check for resource quotas
kubectl get resourcequota -n ai-models
```
Storage Issues
Verify persistent volume configuration:
```bash
# Check PVC status
kubectl get pvc -n ai-models

# Verify volume mounts
kubectl describe pod -n ai-models -l app=ollama | grep -A 5 "Mounts:"

# Check storage class availability
kubectl get storageclass
```
Performance Optimization Strategies
GPU Acceleration Setup
Enable GPU support for faster model inference:
```yaml
# Add to ollama-deployment.yaml: resources belongs on the container,
# while nodeSelector and tolerations belong on the pod spec
resources:
  limits:
    nvidia.com/gpu: 1  # Request 1 GPU per pod
nodeSelector:
  accelerator: nvidia-tesla-v100  # Target GPU nodes
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```
Memory and CPU Tuning
Optimize resource allocation based on model requirements:
```yaml
# Resource configurations for different model sizes

# Small models (7B parameters)
resources:
  requests:
    memory: "4Gi"
    cpu: "1000m"
  limits:
    memory: "6Gi"
    cpu: "2000m"

# Large models (70B parameters)
resources:
  requests:
    memory: "32Gi"
    cpu: "4000m"
  limits:
    memory: "48Gi"
    cpu: "8000m"
```
Model Caching Strategies
Implement model sharing across pods:
```yaml
# Shared model cache volume
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-model-cache
  namespace: ai-models
spec:
  accessModes:
    - ReadOnlyMany  # Read-only sharing across pods
  resources:
    requests:
      storage: 500Gi
  storageClassName: fast-ssd
```
Testing Your Ollama Kubernetes Deployment
Verify your deployment works correctly with comprehensive testing:
API Connectivity Tests
```bash
# Get service external IP
OLLAMA_URL=$(kubectl get svc ollama-service -n ai-models -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

# Test API endpoint
curl -X GET http://$OLLAMA_URL/api/tags

# Test model inference
curl -X POST http://$OLLAMA_URL/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2:7b",
    "prompt": "Explain Kubernetes in simple terms",
    "stream": false
  }'
```
Load Testing
Use tools like Apache Bench to test scaling behavior:
```bash
# Install Apache Bench (Debian/Ubuntu)
apt-get update && apt-get install -y apache2-utils

# Run load test (request.json holds the JSON body from the inference test above)
ab -n 100 -c 10 -p request.json -T application/json http://$OLLAMA_URL/api/generate

# Monitor HPA scaling during the test
kubectl get hpa ollama-hpa -n ai-models -w
```
Health Check Validation
Verify health checks work properly:
```bash
# Check readiness probe status
kubectl get pods -n ai-models -l app=ollama -o wide

# Test health endpoint directly
curl http://$OLLAMA_URL/api/tags
```
Advanced Configuration Options
Multi-Model Deployment Strategy
Deploy different models with separate configurations:
```yaml
# mistral-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mistral-deployment
  namespace: ai-models
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ollama
      model: mistral
  template:
    metadata:
      labels:
        app: ollama
        model: mistral
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          env:
            # Illustrative variable; Ollama itself does not read DEFAULT_MODEL.
            # Preload the model via an init container or the /api/pull endpoint.
            - name: DEFAULT_MODEL
              value: "mistral:7b"
          resources:
            requests:
              memory: "6Gi"
              cpu: "1500m"
            limits:
              memory: "8Gi"
              cpu: "2500m"
```
Blue-Green Deployment Strategy
Implement zero-downtime updates:
```yaml
# blue-green-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: ollama-active
  namespace: ai-models
spec:
  selector:
    app: ollama
    version: blue  # Switch between blue/green
  ports:
    - port: 80
      targetPort: 11434
```
Configuration Management with ConfigMaps
Store Ollama configuration in ConfigMaps:
```yaml
# ollama-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ollama-config
  namespace: ai-models
data:
  ollama.conf: |
    OLLAMA_HOST=0.0.0.0
    OLLAMA_ORIGINS=*
    OLLAMA_MAX_LOADED_MODELS=3
    OLLAMA_MAX_QUEUE=512
  models.json: |
    {
      "models": [
        {"name": "llama2:7b", "auto_load": true},
        {"name": "mistral:7b", "auto_load": false},
        {"name": "codellama:13b", "auto_load": false}
      ]
    }
```
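A startup script (for example in an init container) could read the mounted `models.json` key and decide which models to pull. The sketch below operates on the JSON inline; in a real pod you would read it from wherever you chose to mount the ConfigMap — the path is yours to define, not one Ollama prescribes.

```python
import json

# Select models flagged auto_load from the models.json ConfigMap key.
# The `config` string below duplicates the ConfigMap data for illustration.

def models_to_autoload(models_json: str) -> list:
    spec = json.loads(models_json)
    return [m["name"] for m in spec.get("models", []) if m.get("auto_load")]

config = """
{
  "models": [
    {"name": "llama2:7b", "auto_load": true},
    {"name": "mistral:7b", "auto_load": false},
    {"name": "codellama:13b", "auto_load": false}
  ]
}
"""
print(models_to_autoload(config))  # ['llama2:7b']
```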
Production Readiness Checklist
Before deploying to production, verify these essential components:
Infrastructure Requirements
- Kubernetes cluster version 1.24 or higher
- Sufficient node resources (CPU, memory, storage)
- GPU nodes available (if using GPU acceleration)
- High-performance storage class configured
- Network connectivity between nodes verified
Security Configuration
- RBAC policies implemented and tested
- Network policies restrict unauthorized access
- Container images scanned for vulnerabilities
- Secrets management configured for sensitive data
- Pod security standards enforced
Monitoring and Observability
- Prometheus metrics collection enabled
- Grafana dashboards created for visualization
- Alerting rules configured for critical events
- Log aggregation system connected
- Health checks respond correctly
Backup and Disaster Recovery
- Model data backup strategy implemented
- Configuration files version controlled
- Disaster recovery procedures documented
- Recovery time objectives defined
- Backup restoration tested
Conclusion
Kubernetes Ollama deployment transforms AI model hosting from single-machine limitations to enterprise-scale container orchestration. This tutorial covered essential deployment patterns, security configurations, and production best practices.
You now have the knowledge to deploy Ollama on Kubernetes clusters with proper resource management, automatic scaling, and high availability. Your AI models run efficiently across distributed infrastructure while maintaining security and observability standards.
Start with the basic deployment configuration, then gradually add advanced features like GPU acceleration, multi-model strategies, and comprehensive monitoring. This approach ensures stable, scalable AI model hosting that grows with your requirements.
Ready to deploy your first Kubernetes Ollama cluster? Begin with the deployment manifests provided in this tutorial and customize them for your specific use case.