Deploying Ollama as a Production API Server: Docker, Load Balancing, and Monitoring

Move Ollama from a local dev tool to a production inference server — covering Docker Compose setup, nginx load balancing, Prometheus monitoring, and concurrency limits.

ollama serve works fine on your laptop. Put 50 concurrent users behind it, though, and it crashes, leaks memory, and takes your whole server down with it. Here's the production setup that doesn't.

Your shiny new RTX 4090 is crying tears of silicon: it's trying to run Llama 3.1 70B alone, and your ollama serve process just got OOM-killed because you forgot to quantize. You're not alone. As Ollama's adoption has exploded, a lot of us are moving from local tinkering to "oh crap, people actually depend on this." The promise is real: running Llama 3.1 8B locally costs nothing per token versus metered pricing on hosted models like GPT-4o, and data privacy consistently tops the list of reasons teams self-host. But the default setup is a toy. Let's build the real thing: a hardened, monitored, load-balanced Ollama API server that won't fold under pressure.

Why ollama serve Is a Development Command, Not a Server

The moment you run ollama serve, you get a REST API on localhost:11434. It's perfect for testing, for LangChain prototypes, and for feeling like a wizard. It's also a single point of catastrophic failure with no concurrency controls, no process supervision, and a habit of dying silently.

The core issue is that ollama serve runs as a foreground process. If it crashes, your API is dead. If your server reboots, it stays dead until you SSH back in and restart it by hand. It also loads models into memory/VRAM on demand, which produces the classic slow first response (30 seconds or more) while users wait for the model to load. The fix? Set OLLAMA_KEEP_ALIVE=24h in your environment to keep the model warm, but that's just the first band-aid.
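You can also warm the model explicitly at deploy time. A minimal sketch (llama3.1:8b and the localhost URL are placeholders for your own model and host): Ollama's documented behavior is that an empty prompt to /api/generate loads the model without generating anything, and keep_alive pins it in memory.

```python
import json

def warm_payload(model: str, keep_alive: str = "24h") -> bytes:
    """Payload for POST /api/generate: an empty prompt makes Ollama load the
    model into VRAM without generating, and keep_alive keeps it resident."""
    return json.dumps(
        {"model": model, "prompt": "", "keep_alive": keep_alive}
    ).encode()

# Against a live server (requires Ollama listening on 11434):
#   import urllib.request
#   req = urllib.request.Request(
#       "http://localhost:11434/api/generate", data=warm_payload("llama3.1:8b"))
#   print(urllib.request.urlopen(req, timeout=120).read().decode())
```

Run this from a startup hook or cron job and your first real user never pays the model-load tax.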

The real problem is scale. The process isn't designed for multiple concurrent requests fighting over the same GPU memory. You'll see cryptic CUDA errors or just timeouts. For production, we need isolation, redundancy, and a way to put a fence around the resource hog.

Docker Compose: Caging the Beast for Reliability

We don't run databases directly on the host OS anymore. We containerize them for isolation, reproducible environments, and easy lifecycle management. Your LLM inference server deserves the same respect.

Here’s a docker-compose.yml that defines a robust Ollama service. It uses the official image, pins a specific version for stability, and sets crucial environment variables.

version: '3.8'

services:
  ollama:
    image: ollama/ollama:0.5.3
    container_name: production-ollama
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      - OLLAMA_KEEP_ALIVE=24h
      - OLLAMA_NUM_PARALLEL=4
      - OLLAMA_HOST=0.0.0.0
    volumes:
      - ollama_data:/root/.ollama
    ports:
      - "11434:11434"
    healthcheck:
      # the ollama/ollama image ships without curl; the CLI hits the same API
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

volumes:
  ollama_data:

Key battle-tested configurations:

  • restart: unless-stopped: The guardian angel. If the container crashes (e.g., a transient GPU OOM), Docker automatically restarts it.
  • deploy.resources: This hands the host GPU(s) to the container via the NVIDIA container runtime. Note that it grants access rather than reserving VRAM exclusively, so keep other GPU workloads off this box.
  • OLLAMA_NUM_PARALLEL=4: This is your most critical tuning knob. It limits how many inference requests Ollama will process simultaneously against a loaded model. Exceeding your GPU's parallel capacity is the fastest path to OOM errors. For a 70B model, you might set this to 1 or 2.
  • Healthcheck: This pings the /api/tags endpoint. If Ollama is dead inside the container (maybe a model load failed), Docker knows and can restart it.

To deploy, save the file and run docker compose up -d. Then, pull your production model inside the container:

docker exec production-ollama ollama pull llama3.1:70b-instruct-q4_K_M

That q4_K_M quant is non-negotiable for production. It's the fix for "VRAM OOM with 70B model", bringing VRAM needs down to a manageable ~40GB instead of the 140GB+ required for FP16.
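The arithmetic behind those numbers is worth internalizing. A back-of-envelope helper (weights only; KV cache, activations, and runtime overhead come on top, which is why the article quotes ~40GB rather than the raw figure):

```python
def model_vram_gb(params_billions: float, bits_per_weight: float) -> float:
    """Weights-only VRAM estimate: parameters x bits per weight / 8 bits per byte.
    KV cache, activations, and runtime overhead are extra."""
    return params_billions * bits_per_weight / 8

# 70B at FP16 (16 bits/weight): impossible on any single consumer GPU
print(model_vram_gb(70, 16))   # 140.0
# 70B at q4_K_M (~4.5 bits/weight effective): fits on a 48GB card
print(model_vram_gb(70, 4.5))  # 39.375
```

Run the same math on any model card before you provision hardware; it's faster than finding out via a CUDA OOM at 2 a.m.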

Putting a Traffic Cop in Front: nginx Reverse Proxy and Rate Limiting

Exposing the Ollama port directly to the internet is asking for trouble. You need a reverse proxy. nginx will handle SSL termination, rate limiting, and basic load balancing when we add a second instance.

Create an nginx.conf file. This setup does three crucial things: it limits request bodies (no 100GB Modelfile uploads), rate limits API calls per IP, and routes traffic.

events {}

http {
    # Rate limiting zone: 10 requests per minute per IP
    limit_req_zone $binary_remote_addr zone=ollama_api:10m rate=10r/m;

    upstream ollama_servers {
        # Docker Compose service name
        server ollama:11434;
        # Add a second server later: server ollama2:11434;
        keepalive 10;
    }

    server {
        listen 80;
        server_name api.your-llm.com;

        # Redirect to HTTPS in production
        return 301 https://$server_name$request_uri;
    }

    server {
        listen 443 ssl http2;
        server_name api.your-llm.com;

        ssl_certificate /etc/letsencrypt/live/api.your-llm.com/fullchain.pem;
        ssl_certificate_key /etc/letsencrypt/live/api.your-llm.com/privkey.pem;

        client_max_body_size 100M; # Prevent huge uploads

        location /api/ {
            # Apply rate limiting
            limit_req zone=ollama_api burst=20 nodelay;
            limit_req_status 429;

            proxy_pass http://ollama_servers;
            proxy_http_version 1.1;
            proxy_set_header Connection ""; # required for upstream keepalive
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_buffering off;
            proxy_read_timeout 300s; # Long timeout for generation
            proxy_send_timeout 300s;
        }

        # Health check endpoint for load balancer
        location /health {
            access_log off;
            proxy_pass http://ollama_servers/api/tags;
        }
    }
}

The limit_req_zone is your first line of defense against a script-kiddie or a buggy client spamming your API and drowning out legitimate users. The 429 Too Many Requests response is polite but firm.
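If the interaction of rate=10r/m with burst=20 feels opaque, here is a simplified Python model of the leaky-bucket idea behind limit_req with nodelay. This is a sketch for intuition only; real nginx tracks excess in milliseconds and differs in edge cases.

```python
def limit_req(timestamps: list[float], rate_per_min: float = 10,
              burst: int = 20) -> list[bool]:
    """Approximate nginx limit_req + nodelay: an 'excess' counter drains at
    the configured rate; a request is rejected (429) once excess would
    exceed the burst allowance."""
    drain_per_sec = rate_per_min / 60.0
    excess, last, verdicts = 0.0, None, []
    for t in timestamps:
        if last is not None:
            excess = max(0.0, excess - (t - last) * drain_per_sec)
        if excess + 1 > burst:
            verdicts.append(False)  # over budget: 429
        else:
            excess += 1
            verdicts.append(True)
        last = t
    return verdicts

print(sum(limit_req([0.0] * 30)))                    # 20: a burst of simultaneous hits is capped
print(sum(limit_req([i * 6.0 for i in range(30)])))  # 30: one request per 6s never throttles
```

The takeaway: burst absorbs short spikes immediately (nodelay), while the steady-state drain rate enforces the long-run 10 requests/minute budget.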

Tuning Concurrency: The Art of OLLAMA_NUM_PARALLEL

This one environment variable is the difference between a stable server and a smoking crater. Set it too high and you get GPU OOM errors. Set it too low and you leave precious throughput on the table.

How do you find the sweet spot? You benchmark. It depends entirely on your model size, quantization level, and GPU memory.

Let's say you're running a Mistral 7B model. Here are your VRAM constraints:

  • 4-bit quant (q4_K_M): ~5GB VRAM
  • 8-bit quant (q8_0): ~7GB VRAM
  • FP16 (no quant): ~14GB VRAM

On an RTX 4090 with 24GB VRAM, the 5GB 4-bit model leaves plenty of headroom, but the weights are only loaded once: each parallel slot adds its own KV cache and activations on top, and those grow with context length. Start with OLLAMA_NUM_PARALLEL=2 and stress test.

Use a simple Python script with asyncio to simulate concurrent users and measure throughput (tokens/sec) and error rates. Gradually increase the concurrency until you see failed requests or a dramatic drop in token speed. That's your limit.
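A sketch of such a harness (the URL and model tag are placeholders; eval_count is the generated-token count Ollama returns in a non-streaming /api/generate response):

```python
import asyncio
import json
import time
import urllib.request

async def one_request(url: str, payload: dict) -> tuple[bool, float, int]:
    """One blocking /api/generate call pushed to a worker thread.
    Returns (ok, seconds elapsed, tokens generated)."""
    def call() -> int:
        req = urllib.request.Request(url, data=json.dumps(payload).encode())
        with urllib.request.urlopen(req, timeout=300) as resp:
            return json.loads(resp.read()).get("eval_count", 0)
    start = time.perf_counter()
    try:
        tokens = await asyncio.to_thread(call)
        return True, time.perf_counter() - start, tokens
    except Exception:
        return False, time.perf_counter() - start, 0

def summarize(results: list[tuple[bool, float, int]]) -> dict:
    """Error rate plus aggregate token throughput over the slowest request."""
    ok = [r for r in results if r[0]]
    wall = max((r[1] for r in results), default=0.0)
    return {
        "error_rate": 1 - len(ok) / len(results) if results else 0.0,
        "tokens_per_sec": sum(r[2] for r in ok) / wall if wall else 0.0,
    }

async def stress(url: str, payload: dict, concurrency: int) -> dict:
    results = await asyncio.gather(
        *(one_request(url, payload) for _ in range(concurrency)))
    return summarize(results)

# Against a live server:
#   asyncio.run(stress("http://localhost:11434/api/generate",
#                      {"model": "llama3.1:8b", "prompt": "Hello",
#                       "stream": False}, 8))
```

Run it at concurrency 2, 4, 8, 16 and plot tokens_per_sec against error_rate; the knee in that curve is your concurrency limit.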

Seeing the Fire: Prometheus + Grafana for Token Throughput

You can't manage what you can't measure. "The API feels slow" isn't a metric. You need to know: What's my average tokens/second? What's my 95th percentile response latency? How many requests are failing?

One caveat: as of recent releases, Ollama does not ship a native Prometheus metrics endpoint. The usual pattern is a small exporter sidecar (several open-source ones exist) that polls Ollama's API and the GPU and re-exposes the data for Prometheus: inference durations, token counts, and GPU stats.

Here's a docker-compose.monitoring.yml to stack Prometheus and Grafana on top of your Ollama service:

version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prom_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=200h'
      - '--web.enable-lifecycle'
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:latest
    volumes:
      - grafana_data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=your_secure_password
    ports:
      - "3000:3000"

And the corresponding prometheus.yml to scrape Ollama:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'ollama'
    metrics_path: '/metrics'
    static_configs:
      # Point this at wherever your metrics are actually exposed. Ollama has
      # no native Prometheus endpoint, so this is typically an exporter
      # sidecar; 'ollama-exporter:9100' is a placeholder name and port.
      - targets: ['ollama-exporter:9100']

In Grafana, you can build a dashboard tracking queries like these (the metric names are illustrative and depend on the exporter you chose):

  • Inference Rate: rate(ollama_inference_duration_seconds_count[5m])
  • Avg Token Throughput: rate(ollama_inference_tokens_total[5m]) / rate(ollama_inference_duration_seconds_count[5m])
  • Response Latency (p95): histogram_quantile(0.95, rate(ollama_inference_duration_seconds_bucket[5m]))

Now you can see the performance cliff when concurrency is too high.

Benchmark: Single Instance vs. Load-Balanced Dual Instance

Let's get concrete with numbers. Is adding a second Ollama instance worth it? It depends on your bottleneck.

Scenario: You're serving Llama 3.1 8B (q4_K_M) to a team of 20 developers using Continue.dev or Aider for code completion.

Single instance (one RTX 4090) vs. two load-balanced instances (2x RTX 4090):

  • Peak tokens/sec: ~120 tok/s vs. ~240 tok/s. Roughly linear scaling with GPU count.
  • Concurrent requests: OLLAMA_NUM_PARALLEL=3 vs. 3 per instance, for a total cluster capacity of 6.
  • Cost of failure: total outage vs. a 50% capacity loss with graceful degradation; nginx routes traffic to the healthy node.
  • First-token latency: ~300ms (model warm) in both cases. Unchanged as long as models stay loaded.
  • Complexity: low vs. high. Two instances mean managing 2x GPU memory and possibly a shared volume for models.

The table reveals the trade-off. For maximum raw throughput on a high-traffic public API, scale horizontally. But for most internal tools, a single, well-tuned instance is simpler and often sufficient. The ~12 tokens/sec you get from a Llama 3.1 70B on an M2 Max is perfectly usable for a low-concurrency research assistant.

The real win from two instances isn't just speed—it's resilience. With a health check in nginx, if one instance dies (e.g., Error: model 'llama3' not found because of a corrupted pull), traffic automatically fails over to the other while you docker compose logs the broken one and run docker exec ... ollama pull llama3.1 to fix it.

Health Checks, Auto-Restart, and the Path to Graceful Degradation

We've built in health checks at every layer:

  1. Docker-level: The healthcheck in docker-compose.yml restarts a broken container.
  2. Proxy-level: nginx's /health endpoint (routing to /api/tags) can mark a backend as "down."
  3. Application-level: Your client code (in LangChain, LlamaIndex, or a custom app) must have timeouts and retries with exponential backoff.
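For the application level, a minimal retry helper, sketched here with the standard library; the retryable status-code set and the timings are judgment calls, not Ollama requirements:

```python
import random
import time
import urllib.error
import urllib.request

RETRYABLE = {429, 500, 502, 503, 504}

def backoff_delays(attempts: int, base: float = 0.5) -> list[float]:
    """Exponential backoff schedule: base * 2^i for each retry slot."""
    return [base * 2 ** i for i in range(attempts)]

def post_with_retry(url: str, data: bytes, attempts: int = 4) -> bytes:
    """POST with exponential backoff plus jitter on 429/5xx and network errors."""
    for i, delay in enumerate(backoff_delays(attempts)):
        try:
            req = urllib.request.Request(url, data=data)
            with urllib.request.urlopen(req, timeout=300) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code not in RETRYABLE or i == attempts - 1:
                raise
        except urllib.error.URLError:
            if i == attempts - 1:
                raise
        time.sleep(delay + random.uniform(0, 0.1))  # jitter avoids a thundering herd
    raise AssertionError("unreachable")

print(backoff_delays(4))  # [0.5, 1.0, 2.0, 4.0]
```

The jitter matters: without it, every client that hit the same 503 retries at the same instant and re-creates the spike that caused the 503.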

Graceful degradation is the final piece. When your monitoring dashboard shows sustained 100% GPU utilization and rising latency, what happens? The naive system queues requests until it collapses.

The sophisticated system starts shedding load. With a thin middleware in front of nginx (or a bit of Lua inside it), you can return a 503 Service Unavailable for non-critical requests (like /api/generate for a chat feature) once latency on the /health endpoint crosses a threshold, while still allowing /api/tags (model management) to function. This lets you administratively pull a new model or restart services while telling users, "We're experiencing high load," instead of failing silently.
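The shedding decision itself is simple to state in code, wherever it ends up living. A sketch of the logic as an exponentially weighted moving average of latency; the threshold and smoothing factor here are arbitrary placeholders you would tune:

```python
class LoadShedder:
    """Track an exponentially weighted moving average (EWMA) of request
    latency and flag when non-critical traffic should get a 503."""

    def __init__(self, threshold_s: float, alpha: float = 0.2):
        self.threshold_s = threshold_s  # latency above this triggers shedding
        self.alpha = alpha              # higher alpha reacts faster to spikes
        self.ewma = 0.0

    def record(self, latency_s: float) -> None:
        self.ewma = (1 - self.alpha) * self.ewma + self.alpha * latency_s

    def should_shed(self) -> bool:
        return self.ewma > self.threshold_s

shedder = LoadShedder(threshold_s=2.0)
for _ in range(10):
    shedder.record(0.5)           # healthy latencies
print(shedder.should_shed())      # False
for _ in range(10):
    shedder.record(10.0)          # GPU saturated, latencies spike
print(shedder.should_shed())      # True
```

The EWMA gives you hysteresis for free: one slow request doesn't trip the breaker, a sustained slowdown does.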

Next Steps: From Hardened Server to LLM-Powered Workflow

You now have a production Ollama cluster. It's containerized, load-balanced, monitored, and resilient. This isn't the end—it's the foundation. The real power comes from integrating this stable API into your workflows.

  • Integrate with your IDE: Point the Continue.dev or Codeium extension in VS Code (Ctrl+Shift+P opens the command palette; search for the extension's settings) to your internal https://api.your-llm.com endpoint. Now your entire team has private, zero-cost code completion powered by CodeLlama 34B (which scores a respectable 53.7% on HumanEval) without touching the public internet.
  • Build retrieval pipelines: Use LlamaIndex with your local Ollama as the LLM backend to query your internal documentation. No data leaves your network.
  • Automate tasks: Hook this API into Aider for git-aware code refactors, or Shell-GPT for a secure, company-specific CLI assistant.

The mistake is thinking of Ollama as just a local toy. The opportunity is treating it as a critical, private, cost-effective infrastructure component. You've moved past ollama serve. Now go build something on the rock-solid foundation you just laid.