Running multiple AI models simultaneously feels like juggling flaming torches while riding a unicycle. One model crashes, another consumes all your RAM, and your development workflow becomes a chaotic mess. Docker Compose transforms this circus act into a well-orchestrated symphony.
This guide shows you how to create a robust multi-model Ollama environment using Docker Compose. You'll run multiple AI models concurrently, manage resources efficiently, and maintain consistent development environments across your team.
## Why Docker Compose for Ollama Multi-Model Setups
Traditional Ollama installations create several pain points for developers working with multiple models:
**Resource conflicts** occur when models compete for GPU memory and CPU. Docker Compose isolates each model in its own container with defined resource limits.

**Port management** becomes complex with multiple Ollama instances. Container networking handles port allocation automatically.

**Environment consistency** breaks when team members use different Ollama versions or configurations. Docker Compose ensures identical environments across all machines.

**Scaling limitations** prevent you from running specialized models for different tasks. Container orchestration enables horizontal scaling based on demand.
## Prerequisites and System Requirements
Before building your multi-model environment, verify your system meets these requirements:
- Docker Engine 20.10+ with Docker Compose v2
- NVIDIA GPU with CUDA support (optional but recommended)
- 16GB RAM minimum for multiple concurrent models
- 50GB storage for model files and container images
Install Docker and Docker Compose:
```bash
# Ubuntu/Debian
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo usermod -aG docker $USER

# Install the Docker Compose plugin
sudo apt-get update
sudo apt-get install docker-compose-plugin
```
For GPU support, install NVIDIA Container Toolkit:
```bash
# Add the NVIDIA Container Toolkit repository (the legacy nvidia-docker2
# package and apt-key flow are deprecated)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```
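To confirm GPU passthrough works before wiring it into Compose, run a throwaway container with GPU access; you should see the same `nvidia-smi` table as on the host:

```bash
# Verify that containers can see the GPU
docker run --rm --gpus all ubuntu nvidia-smi
```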
## Core Docker Compose Configuration
Create a project directory and establish the foundation configuration:
```bash
mkdir ollama-multi-model
cd ollama-multi-model
touch docker-compose.yml
```
Here's the base docker-compose.yml structure:
```yaml
version: '3.8'

services:
  # Primary Ollama instance for general-purpose models
  ollama-primary:
    image: ollama/ollama:latest
    container_name: ollama-primary
    ports:
      - "11434:11434"
    volumes:
      - ollama-primary-data:/root/.ollama
      - ./models:/models
    environment:
      - OLLAMA_HOST=0.0.0.0
      - OLLAMA_ORIGINS=*
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
        limits:
          memory: 8G
          cpus: '4'

  # Specialized instance for code generation models
  ollama-code:
    image: ollama/ollama:latest
    container_name: ollama-code
    ports:
      - "11435:11434"
    volumes:
      - ollama-code-data:/root/.ollama
      - ./models:/models
    environment:
      - OLLAMA_HOST=0.0.0.0
      - OLLAMA_ORIGINS=*
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 6G
          cpus: '2'

  # Lightweight instance for embeddings and small models
  ollama-embeddings:
    image: ollama/ollama:latest
    container_name: ollama-embeddings
    ports:
      - "11436:11434"
    volumes:
      - ollama-embeddings-data:/root/.ollama
      - ./models:/models
    environment:
      - OLLAMA_HOST=0.0.0.0
      - OLLAMA_ORIGINS=*
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 4G
          cpus: '2'

  # Web UI for model management
  ollama-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: ollama-webui
    ports:
      - "8080:8080"
    volumes:
      - ollama-webui-data:/app/backend/data
    environment:
      # Open WebUI separates multiple backend URLs with semicolons
      - OLLAMA_BASE_URLS=http://ollama-primary:11434;http://ollama-code:11434;http://ollama-embeddings:11434
      - WEBUI_NAME=Multi-Model Ollama
    depends_on:
      - ollama-primary
      - ollama-code
      - ollama-embeddings
    restart: unless-stopped

volumes:
  ollama-primary-data:
  ollama-code-data:
  ollama-embeddings-data:
  ollama-webui-data:

networks:
  default:
    driver: bridge
```
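Before starting anything, lint the file. `docker compose config` parses and prints the resolved configuration, so indentation or schema mistakes fail fast instead of at container startup:

```bash
# Validate the compose file without starting containers
docker compose config --quiet && echo "compose file OK"
```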
## GPU Resource Management and Allocation
GPU memory becomes the bottleneck in multi-model setups. Configure GPU allocation strategically:
```yaml
# Advanced GPU configuration
services:
  ollama-primary:
    # ... other config
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']  # Use a specific GPU
              capabilities: [gpu]
        limits:
          memory: 8G
    environment:
      - CUDA_VISIBLE_DEVICES=0
      - OLLAMA_GPU_MEMORY_FRACTION=0.6  # Use 60% of GPU memory

  ollama-code:
    # ... other config
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']  # Share the same GPU
              capabilities: [gpu]
        limits:
          memory: 6G
    environment:
      - CUDA_VISIBLE_DEVICES=0
      - OLLAMA_GPU_MEMORY_FRACTION=0.3  # Use 30% of GPU memory
```
For multiple GPUs, distribute models across different devices:
```yaml
# Multi-GPU distribution
services:
  ollama-primary:
    environment:
      - CUDA_VISIBLE_DEVICES=0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']
              capabilities: [gpu]

  ollama-code:
    environment:
      - CUDA_VISIBLE_DEVICES=1
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['1']
              capabilities: [gpu]
```
## Model Installation and Management Scripts
Create automation scripts to install and manage models across instances:
```bash
# Create scripts directory
mkdir scripts
```
`scripts/install-models.sh`:

```bash
#!/bin/bash
# Install models on the primary instance
echo "Installing models on primary instance..."
docker exec ollama-primary ollama pull llama3.1:8b
docker exec ollama-primary ollama pull mistral:7b
# llama3.1:70b needs roughly 40GB of disk and far more memory than the
# 8G limit set above -- pull it only on hosts sized for it:
# docker exec ollama-primary ollama pull llama3.1:70b

# Install code-specific models
echo "Installing code models..."
docker exec ollama-code ollama pull codellama:13b
docker exec ollama-code ollama pull starcoder2:15b
docker exec ollama-code ollama pull deepseek-coder:6.7b

# Install embedding models
echo "Installing embedding models..."
docker exec ollama-embeddings ollama pull nomic-embed-text
docker exec ollama-embeddings ollama pull mxbai-embed-large

echo "Model installation complete!"
```
`scripts/health-check.sh`:

```bash
#!/bin/bash
# Check health of all Ollama instances
services=("ollama-primary:11434" "ollama-code:11435" "ollama-embeddings:11436")

for service in "${services[@]}"; do
  IFS=':' read -r name port <<< "$service"
  echo "Checking $name on port $port..."
  if curl -s "http://localhost:$port/api/tags" > /dev/null; then
    echo "✅ $name is healthy"
  else
    echo "❌ $name is not responding"
  fi
done
```
Make the scripts executable:

```bash
chmod +x scripts/*.sh
```
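With three instances on three ports, client code needs to know which port serves which task. A minimal routing helper keeps that mapping in one place (a sketch: the `pick_port` function and its task labels are our own convention, not part of Ollama; the ports follow the docker-compose.yml layout above):

```bash
#!/bin/bash
# Map a task type to the Ollama instance that serves it:
# primary=11434, code=11435, embeddings=11436
pick_port() {
  case "$1" in
    code)  echo 11435 ;;
    embed) echo 11436 ;;
    *)     echo 11434 ;;
  esac
}

# Usage: route a code prompt to the code instance, but only if it is up
port=$(pick_port code)
if curl -fsS "http://localhost:$port/api/tags" > /dev/null 2>&1; then
  curl -s -X POST "http://localhost:$port/api/generate" \
    -H "Content-Type: application/json" \
    -d '{"model": "codellama:13b", "prompt": "def add(a, b):", "stream": false}'
fi
```

Centralizing the mapping means that when you move a model to another instance, you change one function instead of every curl call.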
## Network Configuration and Service Discovery
Configure internal networking for service-to-service communication:
```yaml
# Enhanced networking configuration
version: '3.8'

services:
  # ... existing services

  # API gateway for load balancing
  nginx-proxy:
    image: nginx:alpine
    container_name: ollama-proxy
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - ollama-primary
      - ollama-code
      - ollama-embeddings
    restart: unless-stopped

networks:
  ollama-network:
    driver: bridge
    ipam:
      config:
        - subnet: 172.20.0.0/16
```
nginx.conf for load balancing. Note that weighted round-robin on `/api/generate` only works if every instance hosts the requested model; when instances serve disjoint model sets, route by port or path instead:
```nginx
events {
    worker_connections 1024;
}

http {
    upstream ollama_backend {
        server ollama-primary:11434 weight=3;
        server ollama-code:11434 weight=2;
        server ollama-embeddings:11434 weight=1;
    }

    server {
        listen 80;

        location /api/generate {
            proxy_pass http://ollama_backend;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_connect_timeout 60s;
            proxy_read_timeout 300s;
        }

        location / {
            proxy_pass http://ollama-webui:8080;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
        }
    }
}
```
## Performance Optimization and Resource Limits
Fine-tune resource allocation for optimal performance:
```yaml
# Production-ready resource configuration
services:
  ollama-primary:
    # ... other config
    deploy:
      resources:
        reservations:
          memory: 4G
          cpus: '2'
        limits:
          memory: 8G
          cpus: '4'
    ulimits:
      memlock:
        soft: -1
        hard: -1
      stack:
        soft: 67108864
        hard: 67108864
    sysctls:
      - net.core.somaxconn=65535
```
**Environment-specific configurations.** Create separate compose files for different environments:
```yaml
# docker-compose.dev.yml
version: '3.8'
services:
  ollama-primary:
    deploy:
      resources:
        limits:
          memory: 4G
          cpus: '2'
```

```yaml
# docker-compose.prod.yml
version: '3.8'
services:
  ollama-primary:
    deploy:
      resources:
        limits:
          memory: 16G
          cpus: '8'
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
```
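Compose merges `-f` files left to right, with later files overriding earlier ones, so you select an environment at startup time:

```bash
# Development: base file plus dev overrides
docker compose -f docker-compose.yml -f docker-compose.dev.yml up -d

# Production: base file plus prod overrides
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d
```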
## Deployment and Startup Process
Deploy your multi-model environment with these commands:
```bash
# Start all services
docker compose up -d

# Monitor startup logs
docker compose logs -f

# Install models after services are ready
./scripts/install-models.sh

# Verify deployment
./scripts/health-check.sh
```
Startup verification checklist:

- **Service health**: all containers running and healthy
- **Port accessibility**: APIs responding on the configured ports
- **GPU allocation**: `nvidia-smi` shows the expected GPU usage
- **Model availability**: models loaded and accessible via the API
- **Web UI access**: management interface available at `localhost:8080`
## Testing and Validation
Validate your setup with comprehensive tests:
```bash
# Test the primary instance
curl -X POST http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "prompt": "Write a hello world program",
    "stream": false
  }'

# Test the code instance
curl -X POST http://localhost:11435/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "codellama:13b",
    "prompt": "def fibonacci(n):",
    "stream": false
  }'

# Test the embeddings instance
curl -X POST http://localhost:11436/api/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nomic-embed-text",
    "prompt": "Hello world"
  }'
```
## Monitoring and Maintenance
Implement monitoring for production environments:
```yaml
# Add monitoring services
services:
  # ... existing services

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin

# Remember to declare prometheus-data and grafana-data under the
# top-level volumes: key as well
```
Regular maintenance tasks:

```bash
# Clean up unused images (removes all unused images, not just dangling ones)
docker system prune -a

# Update all services
docker compose pull && docker compose up -d

# Back up model data
docker run --rm -v ollama-primary-data:/data -v $(pwd):/backup alpine \
  tar czf /backup/ollama-backup.tar.gz /data
```
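The backup can be restored into a fresh (or emptied) named volume the same way, assuming the archive was created with the command above so paths are prefixed with `/data`:

```bash
# Restore model data into the named volume
docker run --rm -v ollama-primary-data:/data -v $(pwd):/backup alpine \
  tar xzf /backup/ollama-backup.tar.gz -C /
```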
## Production Deployment Considerations
For production deployments, implement these additional configurations:
Security hardening:

```yaml
services:
  ollama-primary:
    ports:
      - "127.0.0.1:11434:11434"  # Publish the API only on the host loopback
    environment:
      - OLLAMA_ORIGINS=https://yourdomain.com
    cap_drop:
      - ALL
    cap_add:
      - CHOWN
      - DAC_OVERRIDE
    read_only: true
    tmpfs:
      - /tmp:noexec,nosuid,size=100m
```

Note that setting `OLLAMA_HOST=127.0.0.1` *inside* the container would make the published port unreachable; access restrictions belong on the host-side port binding instead.
Logging configuration:

```yaml
services:
  ollama-primary:
    labels:
      - "service=ollama-primary"
    logging:
      driver: json-file
      options:
        max-size: "100m"
        max-file: "3"
        labels: "service"  # Include this container label in each log entry
```
Health checks:

```yaml
services:
  ollama-primary:
    healthcheck:
      # The stock ollama/ollama image does not ship curl, so probe the
      # API with the bundled CLI instead
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
```
## Troubleshooting Common Issues
**GPU memory errors:**

```bash
# Check GPU usage
nvidia-smi

# Reduce model size or adjust memory fractions
docker compose down
# Edit the OLLAMA_GPU_MEMORY_FRACTION values, then restart
docker compose up -d
```
**Port conflicts:**

```bash
# Check port usage
netstat -tulpn | grep :11434

# Kill conflicting processes
sudo kill -9 $(sudo lsof -t -i:11434)
```
**Container startup failures:**

```bash
# Check container logs
docker compose logs ollama-primary

# Restart a specific service
docker compose restart ollama-primary
```
## Conclusion
Docker Compose transforms chaotic multi-model AI development into an organized, scalable system. You now have a production-ready environment that isolates models, manages resources efficiently, and scales with your needs.
This setup provides dedicated instances for different model types, automatic resource allocation, and comprehensive monitoring. Your team can develop with consistent environments while maintaining the flexibility to experiment with new models.
The containerized approach keeps your multi-model Ollama setup maintainable and reproducible across development, staging, and production environments.