Picture this: Your Ollama deployment is humming along perfectly at 2 AM with minimal load, then suddenly gets slammed with 500 concurrent requests at 9 AM. Without auto-scaling, you're either wasting resources during quiet hours or watching your system crash during peak times. It's like having a restaurant with only two tables that suddenly needs to serve a wedding party.
This guide shows you how to build dynamic Ollama resource management that automatically adjusts computing resources based on real-time demand. You'll learn to implement horizontal pod autoscaling, vertical scaling strategies, and monitoring systems that keep your AI models running efficiently 24/7.
Why Auto-Scaling Matters for Ollama Deployments
Traditional static deployments waste up to 70% of allocated resources during off-peak hours. Meanwhile, traffic spikes can crash your AI services faster than you can say "model inference." Auto-scaling solves both problems by:
- Reducing costs by scaling down during low demand
- Maintaining performance during traffic surges
- Improving reliability through automated resource adjustment
- Optimizing model serving efficiency
The key challenge with Ollama specifically is that language models require significant memory and have longer startup times compared to typical web applications. This means your scaling strategy needs to be more sophisticated than standard web service auto-scaling.
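A useful first step in that more sophisticated strategy is sizing memory before you scale. As a rough back-of-the-envelope check (the constants below are assumptions for ~4-bit quantized models, not measured values — always validate against `kubectl top pods`), you can estimate a model's resident footprint from its parameter count:

```python
# Rough sizing sketch: estimate the memory an Ollama model needs so you can
# set sensible Kubernetes resource requests. Rules of thumb, not measurements.

def estimate_model_memory_gb(params_billions: float,
                             bytes_per_weight: float = 0.5,
                             overhead_gb: float = 1.0) -> float:
    """Approximate resident memory for a quantized model.

    bytes_per_weight=0.5 assumes ~4-bit quantization (Ollama's default
    q4 variants); use 2.0 for fp16 weights. overhead_gb covers the KV
    cache and runtime, and grows with context length in practice.
    """
    weights_gb = params_billions * bytes_per_weight
    return round(weights_gb + overhead_gb, 1)

if __name__ == "__main__":
    for name, size in [("llama2:7b", 7), ("llama2:13b", 13), ("llama2:70b", 70)]:
        print(f"{name}: ~{estimate_model_memory_gb(size)} GiB")
```

The 7B estimate (~4.5 GiB) lines up with the 4Gi requests / 8Gi limits used in the deployments later in this guide.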
Prerequisites and Environment Setup
Before diving into auto-scaling implementation, ensure you have:
Required Tools
- Docker 24.0+ with buildx support
- Kubernetes cluster (1.28+) with metrics-server
- kubectl configured for your cluster
- Helm 3.0+ for package management
Hardware Requirements
- Minimum: 8GB RAM, 4 CPU cores per node
- Recommended: 16GB RAM, 8 CPU cores per node
- GPU support: NVIDIA drivers with CUDA 12.0+
Test Environment Validation
# Verify cluster resources
kubectl get nodes -o wide
# Check metrics-server deployment
kubectl get deployment metrics-server -n kube-system
# Confirm available resources
kubectl describe nodes | grep -A 5 "Allocated resources"
Expected output should show available CPU and memory resources across your cluster nodes.
Kubernetes-Based Auto-Scaling Architecture
Horizontal Pod Autoscaler (HPA) Configuration
The Horizontal Pod Autoscaler automatically scales the number of Ollama pods based on CPU utilization, memory usage, or custom metrics. Here's the complete configuration:
# ollama-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ollama-hpa
  namespace: ollama
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ollama-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30
Ollama Deployment with Resource Specifications
# ollama-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-deployment
  namespace: ollama
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
        resources:
          requests:
            cpu: "1000m"   # 1 CPU core minimum
            memory: "4Gi"  # 4GB RAM minimum
          limits:
            cpu: "2000m"   # 2 CPU cores maximum
            memory: "8Gi"  # 8GB RAM maximum
        readinessProbe:
          httpGet:
            path: /api/tags
            port: 11434
          initialDelaySeconds: 30
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /api/tags
            port: 11434
          initialDelaySeconds: 60
          periodSeconds: 20
        env:
        - name: OLLAMA_HOST
          value: "0.0.0.0"
        - name: OLLAMA_ORIGINS
          value: "*"
        volumeMounts:
        - name: model-storage
          mountPath: /root/.ollama
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: ollama-models-pvc
Persistent Volume for Model Storage
# ollama-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models-pvc
  namespace: ollama
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 50Gi
  storageClassName: fast-ssd
Docker-Based Auto-Scaling with Docker Swarm
For environments where Kubernetes isn't available, Docker Swarm is a workable alternative. Note that Swarm has no built-in autoscaler, so the pattern here pairs a Swarm stack with a custom monitoring script that calls `docker service scale`:
Docker Compose Configuration
# docker-compose.yml
version: '3.8'

services:
  ollama:
    image: ollama/ollama:latest
    deploy:
      replicas: 2
      update_config:
        parallelism: 1
        delay: 10s
      restart_policy:
        condition: on-failure
      resources:
        limits:
          cpus: '2.0'
          memory: 8G
        reservations:
          cpus: '1.0'
          memory: 4G
    ports:
      - "11434:11434"
    volumes:
      - ollama_models:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0
      - OLLAMA_ORIGINS=*
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
      interval: 30s
      timeout: 10s
      retries: 3

  monitor:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus

volumes:
  ollama_models:
  prometheus_data:
Custom Auto-Scaling Script
#!/bin/bash
# auto-scale-ollama.sh
# Monitors the average CPU usage of this node's Ollama containers and
# scales the Docker Swarm service up or down accordingly.

SERVICE_NAME="ollama_ollama"
MIN_REPLICAS=2
MAX_REPLICAS=10
CPU_THRESHOLD=70
SCALE_DOWN_THRESHOLD=30

get_avg_cpu() {
  # Average CPU% across local containers belonging to the service.
  local ids
  ids=$(docker ps -q --filter "label=com.docker.swarm.service.name=$SERVICE_NAME")
  if [ -z "$ids" ]; then
    echo 0
    return
  fi
  docker stats --no-stream --format "{{.CPUPerc}}" $ids |
    tr -d '%' | awk '{sum+=$1; n++} END {if (n) print sum/n; else print 0}'
}

scale_service() {
  local new_replicas=$1
  echo "Scaling $SERVICE_NAME to $new_replicas replicas"
  docker service scale "$SERVICE_NAME=$new_replicas"
}

monitor_and_scale() {
  local current_replicas avg_cpu
  current_replicas=$(docker service inspect "$SERVICE_NAME" --format='{{.Spec.Mode.Replicated.Replicas}}')
  avg_cpu=$(get_avg_cpu)

  if (( $(echo "$avg_cpu > $CPU_THRESHOLD" | bc -l) )); then
    if [ "$current_replicas" -lt "$MAX_REPLICAS" ]; then
      scale_service $((current_replicas + 1))
    fi
  elif (( $(echo "$avg_cpu < $SCALE_DOWN_THRESHOLD" | bc -l) )); then
    if [ "$current_replicas" -gt "$MIN_REPLICAS" ]; then
      scale_service $((current_replicas - 1))
    fi
  fi
}

# Run monitoring loop
while true; do
  monitor_and_scale
  sleep 60
done
Vertical Pod Autoscaler (VPA) Implementation
Vertical Pod Autoscaler adjusts CPU and memory limits for individual pods based on historical usage patterns. This is particularly useful for Ollama deployments where different models have varying resource requirements.
VPA Configuration
# ollama-vpa.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: ollama-vpa
  namespace: ollama
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ollama-deployment
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: ollama
      minAllowed:
        cpu: 500m
        memory: 2Gi
      maxAllowed:
        cpu: 4000m
        memory: 16Gi
      controlledResources: ["cpu", "memory"]
Model-Specific Resource Profiles
# resource-profiles-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ollama-resource-profiles
  namespace: ollama
data:
  profiles.yaml: |
    models:
      # Keys containing ':' must be quoted to parse as valid YAML
      "llama2:7b":
        cpu: "1000m"
        memory: "4Gi"
      "llama2:13b":
        cpu: "2000m"
        memory: "8Gi"
      "llama2:70b":
        cpu: "4000m"
        memory: "32Gi"
      "codellama:7b":
        cpu: "1500m"
        memory: "6Gi"
      default:
        cpu: "1000m"
        memory: "4Gi"
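This ConfigMap is just data; something in your deployment tooling has to read it and apply the matching profile. A minimal Python sketch of the lookup-with-fallback logic (the `PROFILES` dict mirrors the ConfigMap above; `resources_for` is a hypothetical helper, not part of Ollama or Kubernetes):

```python
# Map a requested model to its resource profile, falling back to the
# default profile for models without an explicit entry.

PROFILES = {
    "llama2:7b":    {"cpu": "1000m", "memory": "4Gi"},
    "llama2:13b":   {"cpu": "2000m", "memory": "8Gi"},
    "llama2:70b":   {"cpu": "4000m", "memory": "32Gi"},
    "codellama:7b": {"cpu": "1500m", "memory": "6Gi"},
}
DEFAULT = {"cpu": "1000m", "memory": "4Gi"}

def resources_for(model: str) -> dict:
    """Return the resource requests to inject into a pod spec."""
    return PROFILES.get(model, DEFAULT)
```

In a real pipeline the same lookup would feed a templating step (Helm values, Kustomize patches) rather than being called at runtime.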
Custom Metrics and Advanced Scaling
Prometheus Metrics Collection
# prometheus-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
    - job_name: 'ollama'
      static_configs:
      - targets: ['ollama-service:11434']
      metrics_path: '/metrics'
      scrape_interval: 30s
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
Custom Metrics API Integration
# custom-metrics-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ollama-custom-hpa
  namespace: ollama
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ollama-deployment
  minReplicas: 2
  maxReplicas: 15
  metrics:
  - type: Pods
    pods:
      metric:
        name: ollama_active_requests
      target:
        type: AverageValue
        averageValue: "5"
  - type: Pods
    pods:
      metric:
        name: ollama_inference_latency
      target:
        type: AverageValue
        averageValue: "2000m"  # Kubernetes milli-units: 2000m = 2.0
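Under the hood, the HPA computes its target with the standard formula desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue), then clamps the result to the min/max bounds. A small Python sketch of that decision (the function name is illustrative, not the controller's actual code):

```python
import math

def desired_replicas(current_replicas: int,
                     current_value: float,
                     target_value: float,
                     min_replicas: int = 2,
                     max_replicas: int = 15) -> int:
    # Core HPA formula from the Kubernetes documentation:
    # desired = ceil(current * currentMetric / targetMetric)
    desired = math.ceil(current_replicas * current_value / target_value)
    # Clamp to the configured replica bounds.
    return max(min_replicas, min(max_replicas, desired))

# 4 pods averaging 8 active requests each, target 5 per pod:
print(desired_replicas(4, 8, 5))  # → 7
```

So with the `ollama_active_requests` target of 5 above, a burst that pushes the per-pod average to 8 roughly doubles capacity within one reconciliation cycle, subject to the scale-up policies.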
Monitoring and Alerting Setup
Grafana Dashboard Configuration
{
  "dashboard": {
    "title": "Ollama Auto-Scaling Dashboard",
    "panels": [
      {
        "title": "Pod Replicas",
        "type": "stat",
        "targets": [
          {
            "expr": "kube_deployment_status_replicas{deployment=\"ollama-deployment\"}",
            "legendFormat": "Current Replicas"
          }
        ]
      },
      {
        "title": "CPU Utilization",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(container_cpu_usage_seconds_total{pod=~\"ollama-deployment-.*\"}[5m]) * 100",
            "legendFormat": "CPU Usage %"
          }
        ]
      },
      {
        "title": "Memory Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "container_memory_usage_bytes{pod=~\"ollama-deployment-.*\"} / 1024 / 1024 / 1024",
            "legendFormat": "Memory Usage GB"
          }
        ]
      }
    ]
  }
}
Alert Rules
# ollama-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ollama-alerts
  namespace: ollama
spec:
  groups:
  - name: ollama.rules
    rules:
    - alert: OllamaHighCPUUsage
      expr: rate(container_cpu_usage_seconds_total{pod=~"ollama-deployment-.*"}[5m]) > 0.8
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Ollama pod CPU usage is high"
        description: "CPU usage for {{ $labels.pod }} has been above 80% for 5 minutes"
    - alert: OllamaHighMemoryUsage
      expr: container_memory_usage_bytes{pod=~"ollama-deployment-.*"} / container_spec_memory_limit_bytes > 0.9
      for: 3m
      labels:
        severity: critical
      annotations:
        summary: "Ollama pod memory usage is critical"
        description: "Memory usage for {{ $labels.pod }} is above 90%"
Implementation Steps and Best Practices
Step 1: Deploy Basic Infrastructure
# Create namespace
kubectl create namespace ollama
# Deploy persistent volume claim
kubectl apply -f ollama-pvc.yaml
# Deploy Ollama application
kubectl apply -f ollama-deployment.yaml
# Create service
kubectl apply -f ollama-service.yaml
Step 2: Configure Auto-Scaling
# Deploy HPA configuration
kubectl apply -f ollama-hpa.yaml
# Verify HPA is working
kubectl get hpa -n ollama
# Check HPA status
kubectl describe hpa ollama-hpa -n ollama
Step 3: Load Testing and Validation
# Install load testing tools
kubectl apply -f https://raw.githubusercontent.com/fortio/fortio/master/deployment/fortio-deploy.yaml
# Run load test
kubectl exec -it fortio-deploy-xxx -- fortio load -c 50 -t 30s http://ollama-service:11434/api/generate
Step 4: Monitor Scaling Behavior
# Watch pod scaling in real-time
kubectl get pods -n ollama -w
# Monitor HPA decisions
kubectl get hpa -n ollama -w
# Check resource utilization
kubectl top pods -n ollama
Performance Optimization and Troubleshooting
Common Scaling Issues
Problem: Pods take too long to become ready.
Solution: Optimize readiness probes and model loading:
readinessProbe:
  httpGet:
    path: /api/tags
    port: 11434
  initialDelaySeconds: 10  # Reduced from 30
  periodSeconds: 5         # More frequent checks
  timeoutSeconds: 3
  failureThreshold: 3
Problem: Frequent scaling oscillations.
Solution: Implement stabilization windows:
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300  # Wait 5 minutes before scaling down
  scaleUp:
    stabilizationWindowSeconds: 60   # Wait 1 minute before scaling up
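During scale-down, the HPA acts on the highest replica recommendation recorded over the stabilization window, which is why brief dips in load don't trigger a scale-down. A small Python simulation of that rule (a sketch of the documented behavior, not the controller's actual code):

```python
from collections import deque

def stabilized_scale_down(recommendations: list[int], window: int) -> list[int]:
    """Apply a scale-down stabilization window: at each step the
    controller uses the maximum recommendation seen in the last
    `window` steps, so transient dips are ignored."""
    history: deque[int] = deque(maxlen=window)
    applied = []
    for rec in recommendations:
        history.append(rec)
        applied.append(max(history))
    return applied

# A one-step dip to 3 replicas inside a 3-step window never takes effect:
print(stabilized_scale_down([5, 5, 3, 5, 5], window=3))  # → [5, 5, 5, 5, 5]
```

Only when the lower recommendation persists for the full window does the replica count actually drop, mirroring the 300-second `scaleDown` window configured above.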
Resource Optimization Strategies
- Use init containers to pre-load models
- Implement model caching with shared persistent volumes
- Configure resource quotas to prevent resource exhaustion
- Use node affinity to optimize hardware utilization
# Model pre-loading init container. Note: `ollama pull` talks to a running
# Ollama server, so the server is started in the background first.
initContainers:
- name: model-loader
  image: ollama/ollama:latest
  command: ['sh', '-c', 'ollama serve & sleep 5; ollama pull llama2:7b && ollama pull codellama:7b']
  volumeMounts:
  - name: model-storage
    mountPath: /root/.ollama
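The init-container example covers model pre-loading; node affinity, another item on the list, could look like the following pod-spec fragment (the `workload-type` label is a hypothetical convention for your cluster, not a standard label):

```yaml
# Hypothetical node-affinity snippet: prefer nodes labeled for
# inference workloads, without hard-failing scheduling elsewhere.
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      preference:
        matchExpressions:
        - key: workload-type
          operator: In
          values: ["ollama-inference"]
```

Using `preferred` rather than `required` keeps pods schedulable during surges even when the labeled nodes are full.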
Advanced Configuration and Customization
Multi-Model Deployment Strategy
# multi-model-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-llama2-7b
  namespace: ollama
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ollama
      model: llama2-7b
  template:
    metadata:
      labels:
        app: ollama
        model: llama2-7b
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        resources:
          requests:
            cpu: "1000m"
            memory: "4Gi"
          limits:
            cpu: "2000m"
            memory: "8Gi"
        env:
        # Read by custom startup tooling; not a variable Ollama itself consumes
        - name: OLLAMA_MODEL
          value: "llama2:7b"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-codellama-7b
  namespace: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
      model: codellama-7b
  template:
    metadata:
      labels:
        app: ollama
        model: codellama-7b
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        resources:
          requests:
            cpu: "1500m"
            memory: "6Gi"
          limits:
            cpu: "3000m"
            memory: "12Gi"
        env:
        - name: OLLAMA_MODEL
          value: "codellama:7b"
GPU-Enabled Auto-Scaling
# gpu-enabled-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-gpu-deployment
  namespace: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama-gpu
  template:
    metadata:
      labels:
        app: ollama-gpu
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        resources:
          requests:
            cpu: "2000m"
            memory: "8Gi"
            nvidia.com/gpu: 1
          limits:
            cpu: "4000m"
            memory: "16Gi"
            nvidia.com/gpu: 1
        env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "all"
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: "compute,utility"
      nodeSelector:
        nvidia.com/gpu: "true"
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
Conclusion
Implementing dynamic Ollama resource management transforms your AI deployment from a resource-wasting static setup into an efficient, cost-effective auto-scaling system. The strategies covered in this guide—from Kubernetes HPA configuration to custom metrics monitoring—provide a robust foundation for handling varying workloads while maintaining optimal performance.
Key benefits can include 40-70% cost reduction during off-peak hours, improved response times during traffic spikes, and automated infrastructure management that requires minimal manual intervention. The combination of horizontal scaling, vertical optimization, and intelligent monitoring creates a resilient Ollama auto-scaling solution that adapts to your specific use case.
Start with the basic HPA configuration, then gradually implement advanced features like custom metrics and multi-model deployments as your requirements evolve. Your future self will thank you when your Ollama deployment handles the next unexpected traffic surge without breaking a sweat.