Running multiple AI models simultaneously feels like juggling flaming torches while riding a unicycle. One model crashes, another consumes all your RAM, and your development workflow becomes a chaotic mess. Docker Compose transforms this circus act into a well-orchestrated symphony.
This guide shows you how to create a robust multi-model Ollama environment using Docker Compose. You'll run multiple AI models concurrently, manage resources efficiently, and maintain consistent development environments across your team.
## Why Docker Compose for Ollama Multi-Model Setups
Traditional Ollama installations create several pain points for developers working with multiple models:
**Resource conflicts** occur when models compete for GPU memory and CPU. Docker Compose isolates each model in its own container with defined resource limits.

**Port management** becomes complex with multiple Ollama instances. Container networking handles port allocation automatically.

**Environment consistency** breaks when team members use different Ollama versions or configurations. Docker Compose ensures identical environments across all machines.

**Scaling limitations** prevent you from running specialized models for different tasks. Container orchestration enables horizontal scaling based on demand.
## Prerequisites and System Requirements
Before building your multi-model environment, verify your system meets these requirements:
- Docker Engine 20.10+ with Docker Compose v2
- NVIDIA GPU with CUDA support (optional but recommended)
- 16GB RAM minimum for multiple concurrent models
- 50GB storage for model files and container images
Install Docker and Docker Compose:
```bash
# Ubuntu/Debian
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo usermod -aG docker $USER

# Install the Docker Compose plugin
sudo apt-get update
sudo apt-get install docker-compose-plugin
```
For GPU support, install NVIDIA Container Toolkit:
```bash
# Add the NVIDIA Container Toolkit repository (the legacy nvidia-docker2
# package and apt-key flow are deprecated)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```
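To confirm GPU passthrough works before wiring it into Compose, run a throwaway container with GPU access; you should see the same `nvidia-smi` table as on the host:

```bash
# Verify that containers can see the GPU
docker run --rm --gpus all ubuntu nvidia-smi
```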
## Core Docker Compose Configuration
Create a project directory and establish the foundation configuration:
```bash
mkdir ollama-multi-model
cd ollama-multi-model
touch docker-compose.yml
```
Here's the base docker-compose.yml structure:
```yaml
version: '3.8'

services:
  # Primary Ollama instance for general-purpose models
  ollama-primary:
    image: ollama/ollama:latest
    container_name: ollama-primary
    ports:
      - "11434:11434"
    volumes:
      - ollama-primary-data:/root/.ollama
      - ./models:/models
    environment:
      - OLLAMA_HOST=0.0.0.0
      - OLLAMA_ORIGINS=*
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
        limits:
          memory: 8G
          cpus: '4'

  # Specialized instance for code generation models
  ollama-code:
    image: ollama/ollama:latest
    container_name: ollama-code
    ports:
      - "11435:11434"
    volumes:
      - ollama-code-data:/root/.ollama
      - ./models:/models
    environment:
      - OLLAMA_HOST=0.0.0.0
      - OLLAMA_ORIGINS=*
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 6G
          cpus: '2'

  # Lightweight instance for embeddings and small models
  ollama-embeddings:
    image: ollama/ollama:latest
    container_name: ollama-embeddings
    ports:
      - "11436:11434"
    volumes:
      - ollama-embeddings-data:/root/.ollama
      - ./models:/models
    environment:
      - OLLAMA_HOST=0.0.0.0
      - OLLAMA_ORIGINS=*
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 4G
          cpus: '2'

  # Web UI for model management
  ollama-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: ollama-webui
    ports:
      - "8080:8080"
    volumes:
      - ollama-webui-data:/app/backend/data
    environment:
      # Open WebUI separates multiple backend URLs with semicolons
      - OLLAMA_BASE_URLS=http://ollama-primary:11434;http://ollama-code:11434;http://ollama-embeddings:11434
      - WEBUI_NAME=Multi-Model Ollama
    depends_on:
      - ollama-primary
      - ollama-code
      - ollama-embeddings
    restart: unless-stopped

volumes:
  ollama-primary-data:
  ollama-code-data:
  ollama-embeddings-data:
  ollama-webui-data:

networks:
  default:
    driver: bridge
```
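Before starting anything, lint the file. `docker compose config` parses and prints the resolved configuration, so indentation or schema mistakes fail fast instead of at container startup:

```bash
# Validate the compose file without starting containers
docker compose config --quiet && echo "compose file OK"
```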
## GPU Resource Management and Allocation
GPU memory becomes the bottleneck in multi-model setups. Configure GPU allocation strategically:
```yaml
# Advanced GPU configuration
services:
  ollama-primary:
    # ... other config
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']  # Use a specific GPU
              capabilities: [gpu]
        limits:
          memory: 8G
    environment:
      - CUDA_VISIBLE_DEVICES=0
      - OLLAMA_GPU_MEMORY_FRACTION=0.6  # Use 60% of GPU memory

  ollama-code:
    # ... other config
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']  # Share the same GPU
              capabilities: [gpu]
        limits:
          memory: 6G
    environment:
      - CUDA_VISIBLE_DEVICES=0
      - OLLAMA_GPU_MEMORY_FRACTION=0.3  # Use 30% of GPU memory
```
For multiple GPUs, distribute models across different devices:
```yaml
# Multi-GPU distribution
services:
  ollama-primary:
    environment:
      - CUDA_VISIBLE_DEVICES=0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']
              capabilities: [gpu]

  ollama-code:
    environment:
      - CUDA_VISIBLE_DEVICES=1
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['1']
              capabilities: [gpu]
```
## Model Installation and Management Scripts
Create automation scripts to install and manage models across instances:
```bash
# Create scripts directory
mkdir scripts
```
`scripts/install-models.sh`:

```bash
#!/bin/bash
# Install models on the primary instance
echo "Installing models on primary instance..."
docker exec ollama-primary ollama pull llama3.1:8b
docker exec ollama-primary ollama pull mistral:7b
# llama3.1:70b needs roughly 40GB of disk and far more memory than the
# 8G limit set above -- pull it only on hosts sized for it:
# docker exec ollama-primary ollama pull llama3.1:70b

# Install code-specific models
echo "Installing code models..."
docker exec ollama-code ollama pull codellama:13b
docker exec ollama-code ollama pull starcoder2:15b
docker exec ollama-code ollama pull deepseek-coder:6.7b

# Install embedding models
echo "Installing embedding models..."
docker exec ollama-embeddings ollama pull nomic-embed-text
docker exec ollama-embeddings ollama pull mxbai-embed-large

echo "Model installation complete!"
```
`scripts/health-check.sh`:

```bash
#!/bin/bash
# Check health of all Ollama instances
services=("ollama-primary:11434" "ollama-code:11435" "ollama-embeddings:11436")

for service in "${services[@]}"; do
  IFS=':' read -r name port <<< "$service"
  echo "Checking $name on port $port..."
  if curl -s "http://localhost:$port/api/tags" > /dev/null; then
    echo "✅ $name is healthy"
  else
    echo "❌ $name is not responding"
  fi
done
```
Make the scripts executable:

```bash
chmod +x scripts/*.sh
```
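With three instances on three ports, client code needs to know which port serves which task. A minimal routing helper keeps that mapping in one place (a sketch: the `pick_port` function and its task labels are our own convention, not part of Ollama; the ports follow the docker-compose.yml layout above):

```bash
#!/bin/bash
# Map a task type to the Ollama instance that serves it:
# primary=11434, code=11435, embeddings=11436
pick_port() {
  case "$1" in
    code)  echo 11435 ;;
    embed) echo 11436 ;;
    *)     echo 11434 ;;
  esac
}

# Usage: route a code prompt to the code instance, but only if it is up
port=$(pick_port code)
if curl -fsS "http://localhost:$port/api/tags" > /dev/null 2>&1; then
  curl -s -X POST "http://localhost:$port/api/generate" \
    -H "Content-Type: application/json" \
    -d '{"model": "codellama:13b", "prompt": "def add(a, b):", "stream": false}'
fi
```

Centralizing the mapping means that when you move a model to another instance, you change one function instead of every curl call.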
## Network Configuration and Service Discovery
Configure internal networking for service-to-service communication:
```yaml
# Enhanced networking configuration
version: '3.8'

services:
  # ... existing services

  # API gateway for load balancing
  nginx-proxy:
    image: nginx:alpine
    container_name: ollama-proxy
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - ollama-primary
      - ollama-code
      - ollama-embeddings
    restart: unless-stopped

networks:
  ollama-network:
    driver: bridge
    ipam:
      config:
        - subnet: 172.20.0.0/16
```
nginx.conf for load balancing. Note that weighted round-robin on `/api/generate` only works if every instance hosts the requested model; when instances serve disjoint model sets, route by port or path instead:
```nginx
events {
    worker_connections 1024;
}

http {
    upstream ollama_backend {
        server ollama-primary:11434 weight=3;
        server ollama-code:11434 weight=2;
        server ollama-embeddings:11434 weight=1;
    }

    server {
        listen 80;

        location /api/generate {
            proxy_pass http://ollama_backend;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_connect_timeout 60s;
            proxy_read_timeout 300s;
        }

        location / {
            proxy_pass http://ollama-webui:8080;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
        }
    }
}
```
## Performance Optimization and Resource Limits
Fine-tune resource allocation for optimal performance:
```yaml
# Production-ready resource configuration
services:
  ollama-primary:
    # ... other config
    deploy:
      resources:
        reservations:
          memory: 4G
          cpus: '2'
        limits:
          memory: 8G
          cpus: '4'
    ulimits:
      memlock:
        soft: -1
        hard: -1
      stack:
        soft: 67108864
        hard: 67108864
    sysctls:
      - net.core.somaxconn=65535
```
**Environment-specific configurations.** Create separate compose files for different environments:
```yaml
# docker-compose.dev.yml
version: '3.8'
services:
  ollama-primary:
    deploy:
      resources:
        limits:
          memory: 4G
          cpus: '2'
```

```yaml
# docker-compose.prod.yml
version: '3.8'
services:
  ollama-primary:
    deploy:
      resources:
        limits:
          memory: 16G
          cpus: '8'
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
```
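Compose merges `-f` files left to right, with later files overriding earlier ones, so you select an environment at startup time:

```bash
# Development: base file plus dev overrides
docker compose -f docker-compose.yml -f docker-compose.dev.yml up -d

# Production: base file plus prod overrides
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d
```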
## Deployment and Startup Process
Deploy your multi-model environment with these commands:
```bash
# Start all services
docker compose up -d

# Monitor startup logs
docker compose logs -f

# Install models after services are ready
./scripts/install-models.sh

# Verify deployment
./scripts/health-check.sh
```
Startup verification checklist:

- **Service health**: all containers running and healthy
- **Port accessibility**: APIs responding on the configured ports
- **GPU allocation**: `nvidia-smi` shows the expected GPU usage
- **Model availability**: models loaded and accessible via the API
- **Web UI access**: management interface available at `localhost:8080`
## Testing and Validation
Validate your setup with comprehensive tests:
```bash
# Test the primary instance
curl -X POST http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "prompt": "Write a hello world program",
    "stream": false
  }'

# Test the code instance
curl -X POST http://localhost:11435/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "codellama:13b",
    "prompt": "def fibonacci(n):",
    "stream": false
  }'

# Test the embeddings instance
curl -X POST http://localhost:11436/api/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nomic-embed-text",
    "prompt": "Hello world"
  }'
```
## Monitoring and Maintenance
Implement monitoring for production environments:
```yaml
# Add monitoring services
services:
  # ... existing services

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin

# Remember to declare prometheus-data and grafana-data under the
# top-level volumes: key as well
```
Regular maintenance tasks:

```bash
# Clean up unused images (removes all unused images, not just dangling ones)
docker system prune -a

# Update all services
docker compose pull && docker compose up -d

# Back up model data
docker run --rm -v ollama-primary-data:/data -v $(pwd):/backup alpine \
  tar czf /backup/ollama-backup.tar.gz /data
```
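The backup can be restored into a fresh (or emptied) named volume the same way, assuming the archive was created with the command above so paths are prefixed with `/data`:

```bash
# Restore model data into the named volume
docker run --rm -v ollama-primary-data:/data -v $(pwd):/backup alpine \
  tar xzf /backup/ollama-backup.tar.gz -C /
```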
## Production Deployment Considerations
For production deployments, implement these additional configurations:
Security hardening:

```yaml
services:
  ollama-primary:
    ports:
      - "127.0.0.1:11434:11434"  # Publish the API only on the host loopback
    environment:
      - OLLAMA_ORIGINS=https://yourdomain.com
    cap_drop:
      - ALL
    cap_add:
      - CHOWN
      - DAC_OVERRIDE
    read_only: true
    tmpfs:
      - /tmp:noexec,nosuid,size=100m
```

Note that setting `OLLAMA_HOST=127.0.0.1` *inside* the container would make the published port unreachable; access restrictions belong on the host-side port binding instead.
Logging configuration:

```yaml
services:
  ollama-primary:
    labels:
      - "service=ollama-primary"
    logging:
      driver: json-file
      options:
        max-size: "100m"
        max-file: "3"
        labels: "service"  # Include this container label in each log entry
```
Health checks:

```yaml
services:
  ollama-primary:
    healthcheck:
      # The stock ollama/ollama image does not ship curl, so probe the
      # API with the bundled CLI instead
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
```
## Troubleshooting Common Issues
**GPU memory errors:**

```bash
# Check GPU usage
nvidia-smi

# Reduce model size or adjust memory fractions
docker compose down
# Edit the OLLAMA_GPU_MEMORY_FRACTION values, then restart
docker compose up -d
```
**Port conflicts:**

```bash
# Check port usage
netstat -tulpn | grep :11434

# Kill conflicting processes
sudo kill -9 $(sudo lsof -t -i:11434)
```
**Container startup failures:**

```bash
# Check container logs
docker compose logs ollama-primary

# Restart a specific service
docker compose restart ollama-primary
```
## Conclusion
Docker Compose transforms chaotic multi-model AI development into an organized, scalable system. You now have a production-ready environment that isolates models, manages resources efficiently, and scales with your needs.
This setup provides dedicated instances for different model types, automatic resource allocation, and comprehensive monitoring. Your team can develop with consistent environments while maintaining the flexibility to experiment with new models.
The containerized approach keeps your multi-model Ollama setup maintainable and reproducible across development, staging, and production environments.