Kubernetes Ollama Deployment: Container Orchestration Tutorial

Deploy Ollama on Kubernetes clusters with this step-by-step guide. Learn container orchestration, pod configuration, and AI model hosting. Start now!

Remember trying to run AI models on your laptop and hearing it sound like a jet engine taking off? Those days are over. Kubernetes Ollama deployment transforms your AI model hosting from a desktop-melting nightmare into a scalable, professional operation.

This tutorial shows you how to deploy Ollama on Kubernetes clusters. You'll learn container orchestration fundamentals, configure pods for AI workloads, and create production-ready deployments. By the end, you'll run multiple AI models across your cluster without breaking a sweat.

Why Deploy Ollama on Kubernetes?

Traditional Ollama installations limit you to a single machine. Your AI models compete for local resources, have no automatic recovery when they fail, and scale poorly. Kubernetes Ollama deployment solves these problems with distributed computing power.

Key Benefits of Container Orchestration for AI Models

  • Resource isolation: Each model runs in separate pods
  • Automatic scaling: Kubernetes adds pods based on demand
  • High availability: Failed pods restart automatically
  • Load distribution: Requests spread across multiple instances
  • Configuration management: Updates deploy without downtime

Prerequisites for Kubernetes Ollama Deployment

Before starting this container orchestration tutorial, ensure you have:

  • Kubernetes cluster (v1.24+) with kubectl access
  • Docker installed and configured
  • 8GB+ RAM available per Ollama pod
  • NVIDIA GPU support (optional but recommended)
  • Basic understanding of YAML configurations
[Screenshot: Kubernetes dashboard showing cluster resources]

Understanding Ollama Container Architecture

Ollama containers package AI models with their runtime environment. This containerization approach simplifies deployment across different Kubernetes nodes. Each pod contains:

  • Ollama binary and dependencies
  • Model files (downloaded or pre-loaded)
  • Configuration files for API endpoints
  • Health check endpoints for cluster management

Container Resource Requirements

Different AI models need varying resources. The figures below are rough guidelines for quantized models; actual usage depends on quantization level and context length:

  • Small models (7B parameters): 4GB RAM, 2 CPU cores
  • Medium models (13B parameters): 8GB RAM, 4 CPU cores
  • Large models (70B parameters): 48GB RAM, 8 CPU cores

Creating Ollama Deployment Configuration

Start by creating a Kubernetes deployment manifest. This YAML file defines how Kubernetes should run your Ollama containers.

# ollama-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-deployment
  namespace: ai-models
  labels:
    app: ollama
    version: v1.0
spec:
  replicas: 3  # Start with 3 pods for load distribution
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434  # Default Ollama API port
          name: api
        env:
        - name: OLLAMA_HOST
          value: "0.0.0.0"  # Listen on all interfaces
        - name: OLLAMA_ORIGINS
          value: "*"  # Allow cross-origin requests; restrict this in production
        resources:
          requests:
            memory: "4Gi"  # Minimum memory requirement
            cpu: "1000m"   # 1 CPU core minimum
          limits:
            memory: "8Gi"  # Maximum memory usage
            cpu: "2000m"   # 2 CPU cores maximum
        volumeMounts:
        - name: ollama-data
          mountPath: /root/.ollama  # Model storage location
        livenessProbe:
          httpGet:
            path: /api/tags  # Health check endpoint
            port: 11434
          initialDelaySeconds: 60
          periodSeconds: 30
        readinessProbe:
          httpGet:
            path: /api/tags
            port: 11434
          initialDelaySeconds: 30
          periodSeconds: 10
      volumes:
      - name: ollama-data
        persistentVolumeClaim:
          claimName: ollama-pvc

This deployment configuration creates three Ollama pods. Each pod runs independently and handles API requests. The persistent volume ensures model data survives pod restarts.

Configuring Persistent Storage for AI Models

AI models require persistent storage for downloaded files. Create a PersistentVolumeClaim to store model data across pod lifecycles.

# ollama-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-pvc
  namespace: ai-models
spec:
  accessModes:
    - ReadWriteMany  # Multiple pods can access simultaneously
  resources:
    requests:
      storage: 100Gi  # Adjust based on model sizes
  storageClassName: fast-ssd  # Use SSD for better performance

ReadWriteMany access mode allows multiple Ollama pods to share model files, which reduces storage requirements and improves startup times. Note that not every storage class supports ReadWriteMany; NFS- or CephFS-style provisioners typically do, while most block-storage classes only offer ReadWriteOnce.

Storage Considerations for Production

  • Model size planning: Calculate total storage needs before deployment
  • Performance requirements: Use SSD storage for faster model loading
  • Backup strategy: Implement regular snapshots of model data
  • Access patterns: Consider separate volumes for frequently used models
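
The backup bullet above can be sketched with the CSI snapshot API, assuming your cluster has the snapshot CRDs and a VolumeSnapshotClass installed (the class name below is a placeholder):

```yaml
# ollama-snapshot.yaml -- point-in-time snapshot of the model volume
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: ollama-models-snapshot
  namespace: ai-models
spec:
  volumeSnapshotClassName: csi-snapclass  # Placeholder; use your CSI driver's snapshot class
  source:
    persistentVolumeClaimName: ollama-pvc
```

For regular backups, a CronJob can apply a timestamped copy of this manifest on a schedule.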

Setting Up Service Discovery and Load Balancing

Kubernetes services provide stable endpoints for your Ollama pods. Create a LoadBalancer service to distribute traffic across all available instances.

# ollama-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
  namespace: ai-models
  labels:
    app: ollama
spec:
  type: LoadBalancer
  selector:
    app: ollama
  ports:
  - port: 80
    targetPort: 11434
    protocol: TCP
    name: api
  sessionAffinity: None  # Distribute requests evenly

This service creates a stable IP address for client applications. Kubernetes automatically routes requests to healthy Ollama pods using round-robin load balancing.

Internal Service Configuration

For cluster-internal access, create a ClusterIP service:

# ollama-internal-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: ollama-internal
  namespace: ai-models
spec:
  type: ClusterIP
  selector:
    app: ollama
  ports:
  - port: 11434
    targetPort: 11434
    name: internal-api

Deploying Ollama to Your Kubernetes Cluster

Apply your configuration files to create the Ollama deployment. Execute these commands in order:

# Create namespace for AI workloads
kubectl create namespace ai-models

# Apply persistent volume claim
kubectl apply -f ollama-pvc.yaml

# Deploy Ollama pods
kubectl apply -f ollama-deployment.yaml

# Create load balancer service
kubectl apply -f ollama-service.yaml

# Verify deployment status
kubectl get pods -n ai-models -l app=ollama

Monitor the deployment progress using kubectl commands:

# Check pod status and events
kubectl describe pods -n ai-models -l app=ollama

# View service endpoints
kubectl get svc -n ai-models

# Monitor resource usage
kubectl top pods -n ai-models
[Screenshot: kubectl output showing running Ollama pods]

Configuring Horizontal Pod Autoscaling

Kubernetes can automatically scale your Ollama deployment based on resource usage. Create an HPA (Horizontal Pod Autoscaler) to handle traffic spikes.

# ollama-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ollama-hpa
  namespace: ai-models
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ollama-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70  # Scale up at 70% CPU usage
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80  # Scale up at 80% memory usage
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 300  # Wait 5 minutes before scaling up
      policies:
      - type: Percent
        value: 50  # Increase by 50% of current replicas
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 600  # Wait 10 minutes before scaling down
      policies:
      - type: Percent
        value: 25  # Decrease by 25% of current replicas
        periodSeconds: 60

Apply the HPA configuration. Note that the autoscaler depends on the metrics-server add-on for CPU and memory utilization data, so make sure it is installed in your cluster:

kubectl apply -f ollama-hpa.yaml

# Monitor autoscaling decisions
kubectl get hpa -n ai-models -w

Installing and Managing AI Models

Connect to your Ollama service to download and manage AI models. Use port-forwarding for initial setup:

# Forward local port to Ollama service
kubectl port-forward svc/ollama-service 11434:80 -n ai-models

# Pull popular models (run in a separate terminal)
curl -X POST http://localhost:11434/api/pull -d '{"name": "llama2:7b"}'
curl -X POST http://localhost:11434/api/pull -d '{"name": "codellama:13b"}'
curl -X POST http://localhost:11434/api/pull -d '{"name": "mistral:7b"}'

# List installed models
curl http://localhost:11434/api/tags

Automated Model Installation

Create an init container to pre-load models during deployment. Keep in mind that every replica runs this init container, so pulls against the shared volume can overlap; Ollama skips layers that already exist, but first startup may be slow:

# Add to ollama-deployment.yaml under spec.template.spec
initContainers:
- name: model-loader
  image: ollama/ollama:latest
  command: ["/bin/sh"]
  args:
    - -c
    - |
      ollama serve &
      # Wait until the server responds instead of sleeping a fixed interval
      until ollama list >/dev/null 2>&1; do sleep 1; done
      ollama pull llama2:7b
      ollama pull mistral:7b
      pkill ollama
  volumeMounts:
  - name: ollama-data
    mountPath: /root/.ollama
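
An alternative to the init container, especially with a shared ReadWriteMany volume, is a one-time Job that populates the model cache once rather than in every replica. A sketch:

```yaml
# ollama-model-pull-job.yaml -- one-off model download into the shared PVC
apiVersion: batch/v1
kind: Job
metadata:
  name: ollama-model-pull
  namespace: ai-models
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: puller
        image: ollama/ollama:latest
        command: ["/bin/sh", "-c"]
        args:
          - |
            ollama serve &
            until ollama list >/dev/null 2>&1; do sleep 1; done
            ollama pull llama2:7b
            ollama pull mistral:7b
        volumeMounts:
        - name: ollama-data
          mountPath: /root/.ollama
      volumes:
      - name: ollama-data
        persistentVolumeClaim:
          claimName: ollama-pvc
```

Run the Job before scaling up the deployment so pods start with models already on disk.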
[Screenshot: ollama list output showing installed models]

Monitoring and Observability Setup

Implement comprehensive monitoring for your Kubernetes Ollama deployment. Use Prometheus and Grafana for metrics collection and visualization.

Prometheus ServiceMonitor Configuration

# ollama-servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ollama-metrics
  namespace: ai-models
  labels:
    app: ollama
spec:
  selector:
    matchLabels:
      app: ollama
  endpoints:
  - port: api
    interval: 30s
    path: /metrics  # If Ollama exposes Prometheus metrics

Custom Metrics Collection

Create a sidecar container that exports node-level metrics alongside each Ollama pod (node-exporter reports host CPU, memory, and filesystem statistics rather than Ollama-specific data):

# Add to ollama-deployment.yaml containers section
- name: metrics-exporter
  image: prom/node-exporter:latest
  ports:
  - containerPort: 9100
    name: metrics
  args:
    - '--path.procfs=/host/proc'
    - '--path.sysfs=/host/sys'
    - '--collector.filesystem.ignored-mount-points'
    - '^/(sys|proc|dev|host|etc|rootfs/var/lib/docker/containers|rootfs/var/lib/docker/overlay2|rootfs/run/docker/netns|rootfs/var/lib/docker/aufs)($$|/)'
  volumeMounts:
  - name: proc
    mountPath: /host/proc
    readOnly: true
  - name: sys
    mountPath: /host/sys
    readOnly: true
# The proc and sys volumes referenced above must also be defined under
# spec.template.spec.volumes:
# - name: proc
#   hostPath:
#     path: /proc
# - name: sys
#   hostPath:
#     path: /sys

Security Best Practices for AI Model Deployment

Secure your Kubernetes Ollama deployment with proper RBAC, network policies, and container security measures.

Role-Based Access Control (RBAC)

# ollama-rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ollama-sa
  namespace: ai-models
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ollama-role
  namespace: ai-models
rules:
- apiGroups: [""]
  resources: ["pods", "services"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ollama-binding
  namespace: ai-models
subjects:
- kind: ServiceAccount
  name: ollama-sa
  namespace: ai-models
roleRef:
  kind: Role
  name: ollama-role
  apiGroup: rbac.authorization.k8s.io
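
The service account above only takes effect if the deployment actually references it; add this line to the pod spec in ollama-deployment.yaml:

```yaml
# Add to ollama-deployment.yaml under spec.template.spec
serviceAccountName: ollama-sa
```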

Network Security Policies

Restrict network access to Ollama pods:

# ollama-networkpolicy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ollama-netpol
  namespace: ai-models
spec:
  podSelector:
    matchLabels:
      app: ollama
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: frontend  # Only namespaces carrying the label name=frontend
    ports:
    - protocol: TCP
      port: 11434
  egress:
  - to: []
    ports:
    - protocol: TCP
      port: 443  # HTTPS for model downloads
    - protocol: TCP
      port: 80   # HTTP for model downloads
    - protocol: UDP
      port: 53   # DNS lookups; without this, pods cannot resolve registry hostnames
    - protocol: TCP
      port: 53   # DNS over TCP

Troubleshooting Common Deployment Issues

Pod Startup Problems

Check pod logs for startup issues:

# View detailed pod information
kubectl describe pod -n ai-models -l app=ollama

# Check container logs
kubectl logs -n ai-models -l app=ollama -c ollama

# Follow logs in real-time
kubectl logs -n ai-models -l app=ollama -f

Resource Constraints

Monitor resource usage and adjust limits:

# Check resource usage
kubectl top pods -n ai-models

# View cluster resource availability
kubectl describe nodes

# Check for resource quotas
kubectl get resourcequota -n ai-models

Storage Issues

Verify persistent volume configuration:

# Check PVC status
kubectl get pvc -n ai-models

# Verify volume mounts
kubectl describe pod -n ai-models -l app=ollama | grep -A 5 "Mounts:"

# Check storage class availability
kubectl get storageclass
[Screenshot: kubectl troubleshooting command output]

Performance Optimization Strategies

GPU Acceleration Setup

Enable GPU support for faster model inference. This assumes GPU nodes with the NVIDIA device plugin installed, which is what makes the nvidia.com/gpu resource schedulable:

# Add to ollama-deployment.yaml container spec
resources:
  limits:
    nvidia.com/gpu: 1  # Request 1 GPU per pod
nodeSelector:
  accelerator: nvidia-tesla-v100  # Example label; match your cluster's actual GPU node labels
tolerations:
- key: nvidia.com/gpu
  operator: Exists
  effect: NoSchedule

Memory and CPU Tuning

Optimize resource allocation based on model requirements:

# Resource configurations for different model sizes
# Small models (7B parameters)
resources:
  requests:
    memory: "4Gi"
    cpu: "1000m"
  limits:
    memory: "6Gi"
    cpu: "2000m"

# Large models (70B parameters)  
resources:
  requests:
    memory: "32Gi"
    cpu: "4000m"
  limits:
    memory: "48Gi"
    cpu: "8000m"

Model Caching Strategies

Implement model sharing across pods. A ReadOnlyMany volume must be populated first, for example by a one-off Job that mounts it read-write, before the Ollama pods mount it read-only:

# Shared model cache volume
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-model-cache
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 500Gi
  storageClassName: fast-ssd

Testing Your Ollama Kubernetes Deployment

Verify your deployment works correctly with comprehensive testing:

API Connectivity Tests

# Get service external IP (some cloud providers report a hostname instead; check .hostname if this is empty)
OLLAMA_URL=$(kubectl get svc ollama-service -n ai-models -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

# Test API endpoint
curl -X GET http://$OLLAMA_URL/api/tags

# Test model inference
curl -X POST http://$OLLAMA_URL/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2:7b",
    "prompt": "Explain Kubernetes in simple terms",
    "stream": false
  }'

Load Testing

Use tools like Apache Bench to test scaling behavior:

# Install Apache Bench
apt-get update && apt-get install -y apache2-utils

# Run load test
ab -n 100 -c 10 -p request.json -T application/json http://$OLLAMA_URL/api/generate

# Monitor HPA scaling during test
kubectl get hpa ollama-hpa -n ai-models -w
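
The -p flag expects a payload file; a minimal request.json matching the inference test above might look like this:

```json
{
  "model": "llama2:7b",
  "prompt": "Explain Kubernetes in simple terms",
  "stream": false
}
```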

Health Check Validation

Verify health checks work properly:

# Check readiness probe
kubectl get pods -n ai-models -l app=ollama -o wide

# Test health endpoint directly
curl http://$OLLAMA_URL/api/tags
[Screenshot: load-testing results and HPA scaling dashboard]

Advanced Configuration Options

Multi-Model Deployment Strategy

Deploy different models with separate configurations:

# mistral-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mistral-deployment
  namespace: ai-models
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ollama
      model: mistral
  template:
    metadata:
      labels:
        app: ollama
        model: mistral
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        env:
        - name: DEFAULT_MODEL  # Illustrative placeholder; Ollama itself does not read this variable
          value: "mistral:7b"  # Consume it from your own entrypoint or init logic
        resources:
          requests:
            memory: "6Gi"
            cpu: "1500m"
          limits:
            memory: "8Gi"
            cpu: "2500m"
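
To route traffic to a specific model's pods, pair each deployment with a service that selects on the model label; otherwise the shared app: ollama selector of ollama-service spreads requests across every model. A sketch:

```yaml
# mistral-service.yaml -- targets only the mistral-labeled pods
apiVersion: v1
kind: Service
metadata:
  name: mistral-service
  namespace: ai-models
spec:
  type: ClusterIP
  selector:
    app: ollama
    model: mistral
  ports:
  - port: 11434
    targetPort: 11434
```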

Blue-Green Deployment Strategy

Implement zero-downtime updates:

# blue-green-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: ollama-active
  namespace: ai-models
spec:
  selector:
    app: ollama
    version: blue  # Switch between blue/green
  ports:
  - port: 80
    targetPort: 11434
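
For this selector to match anything, the pod templates must carry a version label (the base deployment above labels pods with app: ollama only). Add the label to each deployment's template, then repoint the service to cut over:

```yaml
# In each deployment's spec.template.metadata.labels (and spec.selector.matchLabels)
labels:
  app: ollama
  version: blue  # "green" in the replacement deployment

# Cut over with no downtime by patching the active service's selector, e.g.:
# kubectl patch svc ollama-active -n ai-models \
#   -p '{"spec":{"selector":{"app":"ollama","version":"green"}}}'
```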

Configuration Management with ConfigMaps

Store Ollama configuration in ConfigMaps. Ollama reads its settings from environment variables rather than a config file, so these values must be injected into the container environment (or sourced by your own startup script):

# ollama-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ollama-config
  namespace: ai-models
data:
  ollama.conf: |
    OLLAMA_HOST=0.0.0.0
    OLLAMA_ORIGINS=*
    OLLAMA_MAX_LOADED_MODELS=3
    OLLAMA_MAX_QUEUE=512
  models.json: |
    {
      "models": [
        {"name": "llama2:7b", "auto_load": true},
        {"name": "mistral:7b", "auto_load": false},
        {"name": "codellama:13b", "auto_load": false}
      ]
    }
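
The file-style keys above cannot be consumed directly with envFrom. One option is a ConfigMap with one key per variable, injected into the container environment (a sketch; key names follow Ollama's documented environment variables):

```yaml
# ollama-env-config.yaml -- one key per environment variable
apiVersion: v1
kind: ConfigMap
metadata:
  name: ollama-env
  namespace: ai-models
data:
  OLLAMA_HOST: "0.0.0.0"
  OLLAMA_ORIGINS: "*"
  OLLAMA_MAX_LOADED_MODELS: "3"
  OLLAMA_MAX_QUEUE: "512"

# Then, in ollama-deployment.yaml, under the ollama container:
# envFrom:
# - configMapRef:
#     name: ollama-env
```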

Production Readiness Checklist

Before deploying to production, verify these essential components:

Infrastructure Requirements

  • Kubernetes cluster version 1.24 or higher
  • Sufficient node resources (CPU, memory, storage)
  • GPU nodes available (if using GPU acceleration)
  • High-performance storage class configured
  • Network connectivity between nodes verified

Security Configuration

  • RBAC policies implemented and tested
  • Network policies restrict unauthorized access
  • Container images scanned for vulnerabilities
  • Secrets management configured for sensitive data
  • Pod security standards enforced

Monitoring and Observability

  • Prometheus metrics collection enabled
  • Grafana dashboards created for visualization
  • Alerting rules configured for critical events
  • Log aggregation system connected
  • Health checks respond correctly

Backup and Disaster Recovery

  • Model data backup strategy implemented
  • Configuration files version controlled
  • Disaster recovery procedures documented
  • Recovery time objectives defined
  • Backup restoration tested

Conclusion

Kubernetes Ollama deployment transforms AI model hosting from single-machine limitations to enterprise-scale container orchestration. This tutorial covered essential deployment patterns, security configurations, and production best practices.

You now have the knowledge to deploy Ollama on Kubernetes clusters with proper resource management, automatic scaling, and high availability. Your AI models run efficiently across distributed infrastructure while maintaining security and observability standards.

Start with the basic deployment configuration, then gradually add advanced features like GPU acceleration, multi-model strategies, and comprehensive monitoring. This approach ensures stable, scalable AI model hosting that grows with your requirements.

Ready to deploy your first Kubernetes Ollama cluster? Begin with the deployment manifests provided in this tutorial and customize them for your specific use case.