Ollama Production Health Checks: Complete Monitoring and Observability Guide

Master Ollama production health checks with monitoring tools, observability metrics, and automated alerts. Ensure 99.9% uptime for your LLM deployments.

Your Ollama instance just went down at 3 AM, and you found out from angry users instead of your monitoring system. Sound familiar? That awkward moment when your LLM stops responding, and you're frantically checking logs while your production application throws errors.

This guide shows you how to implement Ollama production health checks that catch issues before users notice them. You'll learn to build monitoring systems that track performance, detect failures, and alert you instantly when problems occur.

Why Ollama Production Health Checks Matter

Production Ollama deployments face unique challenges that traditional web application monitoring doesn't cover. LLM inference can fail silently, consume excessive memory, or respond with degraded quality without obvious error signals.

The consequences of poor Ollama monitoring include:

  • Silent failures that affect user experience without triggering alerts
  • Resource exhaustion leading to system crashes
  • Model loading delays causing application timeouts
  • Quality degradation that users notice before you do

Essential Ollama Health Check Components

Basic Connection Health Check

Start with a simple connectivity test that verifies Ollama responds to requests:

#!/bin/bash
# basic-ollama-health-check.sh

OLLAMA_URL="http://localhost:11434"
TIMEOUT=30

# Test basic connectivity
response=$(curl -s -w "%{http_code}" -o /dev/null --max-time $TIMEOUT "$OLLAMA_URL/api/tags")

if [ "$response" = "200" ]; then
    echo "✓ Ollama is responding"
    exit 0
else
    echo "✗ Ollama health check failed (HTTP: $response)"
    exit 1
fi

Model Loading Verification

Verify that your required models load correctly and respond within acceptable timeframes. Keep in mind that the first request after a restart may also need to load the model into memory, so allow a more generous timeout for cold starts:

# model_health_check.py
import requests
import time
import sys

def check_model_health(model_name, max_response_time=10):
    """Check if a specific model responds within time limits"""
    
    url = "http://localhost:11434/api/generate"
    payload = {
        "model": model_name,
        "prompt": "Hello",
        "stream": False
    }
    
    start_time = time.time()
    
    try:
        response = requests.post(url, json=payload, timeout=max_response_time)
        response_time = time.time() - start_time
        
        if response.status_code == 200:
            print(f"✓ Model {model_name}: {response_time:.2f}s response time")
            return True
        else:
            print(f"✗ Model {model_name}: HTTP {response.status_code}")
            return False
            
    except requests.exceptions.Timeout:
        print(f"✗ Model {model_name}: Timeout after {max_response_time}s")
        return False
    except Exception as e:
        print(f"✗ Model {model_name}: Error - {str(e)}")
        return False

# Check multiple models
models_to_check = ["llama2", "codellama", "mistral"]
all_healthy = True

for model in models_to_check:
    if not check_model_health(model):
        all_healthy = False

sys.exit(0 if all_healthy else 1)
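One failure mode the latency check above can't distinguish is a model that was never pulled: /api/generate on a missing model fails, but for a different reason than a slow or broken one. A small companion sketch that verifies installed models via /api/tags first (the helper names are illustrative, not part of Ollama's API):

```python
# model_presence_check.py

def missing_models(tags_response, required):
    """Return entries from `required` absent from an /api/tags response body."""
    # /api/tags lists installed models as "name:tag" (e.g. "llama2:latest"),
    # so compare on the base name before the colon.
    installed = {m["name"].split(":")[0] for m in tags_response.get("models", [])}
    return [m for m in required if m.split(":")[0] not in installed]

def check_installed_models(required, base_url="http://localhost:11434"):
    """Fetch /api/tags and report any required models that are not installed."""
    import requests  # same third-party dependency the other checks in this guide use
    tags = requests.get(f"{base_url}/api/tags", timeout=10).json()
    return missing_models(tags, required)
```

Calling check_installed_models(["llama2", "codellama", "mistral"]) before the per-model latency loop lets a missing pull be reported distinctly from a slow model.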

Comprehensive Monitoring Setup

Prometheus Metrics Collection

Create custom metrics that track Ollama performance indicators:

# ollama_metrics_exporter.py
from prometheus_client import start_http_server, Gauge, Counter, Histogram  # pip install prometheus-client
import requests
import time
import threading

# Define metrics
ollama_up = Gauge('ollama_up', 'Ollama service availability')
ollama_response_time = Histogram('ollama_response_time_seconds', 'Response time for Ollama requests')
ollama_memory_usage = Gauge('ollama_memory_usage_bytes', 'Memory usage by Ollama')
ollama_requests_total = Counter('ollama_requests_total', 'Total requests to Ollama')
ollama_errors_total = Counter('ollama_errors_total', 'Total errors from Ollama')

def collect_ollama_metrics():
    """Continuously collect Ollama metrics"""
    
    while True:
        try:
            # Count every probe attempt so an error rate can be derived from the counters
            ollama_requests_total.inc()

            # Test basic connectivity
            start_time = time.time()
            response = requests.get("http://localhost:11434/api/tags", timeout=10)
            response_time = time.time() - start_time

            if response.status_code == 200:
                ollama_up.set(1)
                ollama_response_time.observe(response_time)
            else:
                ollama_up.set(0)
                ollama_errors_total.inc()

        except Exception as e:
            ollama_up.set(0)
            ollama_errors_total.inc()
            print(f"Metrics collection error: {e}")
        
        time.sleep(30)  # Collect metrics every 30 seconds

if __name__ == '__main__':
    # Start Prometheus metrics server
    start_http_server(8000)
    print("Metrics server started on port 8000")
    
    # Start metrics collection in background thread
    metrics_thread = threading.Thread(target=collect_ollama_metrics)
    metrics_thread.daemon = True
    metrics_thread.start()
    
    # Keep the main thread alive
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        print("Shutting down metrics exporter")

Docker Health Check Configuration

Add health checks directly to your Ollama Docker deployment. The curl-based probes below assume curl is available inside the container; if your image doesn't ship it, install it in the Dockerfile or use a probe that doesn't depend on it:

# Dockerfile with health check
FROM ollama/ollama:latest

# Add health check script
COPY health_check.sh /usr/local/bin/health_check.sh
RUN chmod +x /usr/local/bin/health_check.sh

# Configure Docker health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
  CMD /usr/local/bin/health_check.sh || exit 1

EXPOSE 11434

# docker-compose.yml with health checks
version: '3.8'

services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
    restart: unless-stopped

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    depends_on:
      - ollama

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    depends_on:
      - prometheus

volumes:
  ollama_data:
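The Compose file mounts ./prometheus.yml without showing its contents. A minimal sketch that scrapes the custom exporter from the Prometheus section (the host.docker.internal target is an assumption that holds on Docker Desktop; on Linux, point the target at the host's IP or run the exporter as its own compose service):

```yaml
# prometheus.yml — mounted into the prometheus container above
global:
  scrape_interval: 30s

scrape_configs:
  - job_name: 'ollama-exporter'
    static_configs:
      # Point this at wherever ollama_metrics_exporter.py actually runs.
      - targets: ['host.docker.internal:8000']
```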

Advanced Observability Patterns

Quality-Based Health Checks

Monitor response quality to detect model degradation:

# quality_health_check.py
import requests
from textstat import flesch_reading_ease  # pip install textstat

def check_response_quality(model_name, test_prompts):
    """Check if model responses meet quality thresholds"""
    
    quality_scores = []
    
    for prompt in test_prompts:
        try:
            response = requests.post(
                "http://localhost:11434/api/generate",
                json={
                    "model": model_name,
                    "prompt": prompt,
                    "stream": False
                },
                timeout=30
            )
            
            if response.status_code == 200:
                result = response.json()
                text = result.get('response', '')
                
                # Basic quality checks
                word_count = len(text.split())
                readability = flesch_reading_ease(text)
                has_repetition = check_repetition(text)
                
                quality_score = calculate_quality_score(word_count, readability, has_repetition)
                quality_scores.append(quality_score)
                
        except Exception as e:
            print(f"Quality check error for prompt '{prompt}': {e}")
            return False
    
    average_quality = sum(quality_scores) / len(quality_scores) if quality_scores else 0
    threshold = 70  # Minimum acceptable quality score
    
    if average_quality >= threshold:
        print(f"✓ Model quality check passed: {average_quality:.1f}")
        return True
    else:
        print(f"✗ Model quality degraded: {average_quality:.1f} < {threshold}")
        return False

def check_repetition(text):
    """Detect excessive repetition in model output"""
    # Ignore empty fragments so the trailing artifact from split('.') doesn't skew the ratio
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    if not sentences:
        return False
    repetition_ratio = 1 - (len(set(sentences)) / len(sentences))
    return repetition_ratio > 0.3  # 30% repetition threshold

def calculate_quality_score(word_count, readability, has_repetition):
    """Calculate composite quality score"""
    score = 50  # Base score
    
    # Word count contribution (prefer 50-200 words)
    if 50 <= word_count <= 200:
        score += 20
    elif word_count < 10:
        score -= 30
    
    # Readability contribution
    if readability > 60:
        score += 20
    elif readability < 30:
        score -= 20
    
    # Repetition penalty
    if has_repetition:
        score -= 25
    
    return max(0, min(100, score))

# Test prompts for quality assessment
test_prompts = [
    "Explain the concept of machine learning in simple terms.",
    "What are the benefits of using Docker containers?",
    "Describe the process of photosynthesis."
]

# Run quality check
model_healthy = check_response_quality("llama2", test_prompts)

Resource Monitoring Integration

Track system resources alongside Ollama health:

# resource_monitor.py
import psutil  # pip install psutil

def get_ollama_process_stats():
    """Get CPU and memory usage for the Ollama process"""

    for proc in psutil.process_iter(['pid', 'name', 'cpu_percent', 'memory_info']):
        try:
            name = proc.info['name'] or ''
            if 'ollama' in name.lower():
                return {
                    'pid': proc.info['pid'],
                    'cpu_percent': proc.info['cpu_percent'],
                    'memory_mb': proc.info['memory_info'].rss / 1024 / 1024,
                    'memory_percent': proc.memory_percent()
                }
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
    return None

def check_resource_health():
    """Monitor resource usage and alert on thresholds"""
    
    # System-wide checks
    cpu_usage = psutil.cpu_percent(interval=1)
    memory = psutil.virtual_memory()
    disk = psutil.disk_usage('/')
    
    # Ollama-specific checks
    ollama_stats = get_ollama_process_stats()
    
    # Define thresholds
    thresholds = {
        'cpu_warning': 80,
        'cpu_critical': 95,
        'memory_warning': 80,
        'memory_critical': 95,
        'disk_warning': 85,
        'disk_critical': 95
    }
    
    alerts = []
    
    # Check system CPU
    if cpu_usage > thresholds['cpu_critical']:
        alerts.append(f"CRITICAL: CPU usage at {cpu_usage:.1f}%")
    elif cpu_usage > thresholds['cpu_warning']:
        alerts.append(f"WARNING: CPU usage at {cpu_usage:.1f}%")
    
    # Check system memory
    if memory.percent > thresholds['memory_critical']:
        alerts.append(f"CRITICAL: Memory usage at {memory.percent:.1f}%")
    elif memory.percent > thresholds['memory_warning']:
        alerts.append(f"WARNING: Memory usage at {memory.percent:.1f}%")
    
    # Check Ollama process
    if ollama_stats:
        if ollama_stats['memory_mb'] > 8192:  # 8GB threshold
            alerts.append(f"WARNING: Ollama using {ollama_stats['memory_mb']:.0f}MB memory")
        
        if ollama_stats['cpu_percent'] > 90:
            alerts.append(f"WARNING: Ollama CPU usage at {ollama_stats['cpu_percent']:.1f}%")
    
    return alerts

# Example usage
alerts = check_resource_health()
if alerts:
    for alert in alerts:
        print(alert)
else:
    print("✓ All resource checks passed")

Automated Alert Configuration

Slack Integration for Critical Alerts

Set up automated notifications for production issues:

# slack_alerting.py
import requests
import os
from datetime import datetime

class SlackAlerter:
    def __init__(self, webhook_url):
        self.webhook_url = webhook_url
    
    def send_alert(self, severity, title, message, details=None):
        """Send formatted alert to Slack"""
        
        # Color coding for different severities
        colors = {
            'critical': '#FF0000',
            'warning': '#FFA500',
            'info': '#00FF00'
        }
        
        payload = {
            "attachments": [
                {
                    "color": colors.get(severity, '#808080'),
                    "title": f"{severity.upper()}: {title}",
                    "text": message,
                    "fields": [
                        {
                            "title": "Timestamp",
                            "value": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
                            "short": True
                        },
                        {
                            "title": "Service",
                            "value": "Ollama Production",
                            "short": True
                        }
                    ],
                    "footer": "Ollama Health Monitor"
                }
            ]
        }
        
        if details:
            payload["attachments"][0]["fields"].append({
                "title": "Details",
                "value": details,
                "short": False
            })
        
        try:
            response = requests.post(self.webhook_url, json=payload, timeout=10)
            response.raise_for_status()
            print(f"Alert sent successfully: {title}")
        except Exception as e:
            print(f"Failed to send alert: {e}")

# Example usage
def check_and_alert():
    webhook_url = os.getenv('SLACK_WEBHOOK_URL')
    if not webhook_url:
        raise RuntimeError("SLACK_WEBHOOK_URL environment variable is not set")
    slack = SlackAlerter(webhook_url)
    
    # Run health checks
    try:
        response = requests.get("http://localhost:11434/api/tags", timeout=10)
        if response.status_code != 200:
            slack.send_alert(
                severity='critical',
                title='Ollama Service Down',
                message='Ollama is not responding to health checks',
                details=f'HTTP Status: {response.status_code}'
            )
    except requests.exceptions.Timeout:
        slack.send_alert(
            severity='critical',
            title='Ollama Timeout',
            message='Ollama health check timed out after 10 seconds'
        )
    except Exception as e:
        slack.send_alert(
            severity='critical',
            title='Ollama Health Check Failed',
            message='Unable to connect to Ollama service',
            details=str(e)
        )

Automated Recovery Actions

Implement self-healing capabilities for common issues:

#!/bin/bash
# auto_recovery.sh

OLLAMA_SERVICE="ollama"
MAX_RESTART_ATTEMPTS=3
RESTART_COUNT_FILE="/tmp/ollama_restart_count"

check_ollama_health() {
    curl -s -f http://localhost:11434/api/tags > /dev/null
    return $?
}

restart_ollama() {
    echo "$(date): Attempting to restart Ollama service"
    
    # Get current restart count
    if [ -f "$RESTART_COUNT_FILE" ]; then
        RESTART_COUNT=$(cat "$RESTART_COUNT_FILE")
    else
        RESTART_COUNT=0
    fi
    
    # Check if we've exceeded max attempts
    if [ "$RESTART_COUNT" -ge "$MAX_RESTART_ATTEMPTS" ]; then
        echo "$(date): Max restart attempts reached. Manual intervention required."
        # Send critical alert
        curl -X POST "$SLACK_WEBHOOK_URL" \
            -H 'Content-type: application/json' \
            --data '{"text":"🚨 CRITICAL: Ollama failed to restart after '$MAX_RESTART_ATTEMPTS' attempts"}'
        exit 1
    fi
    
    # Attempt restart
    systemctl restart "$OLLAMA_SERVICE"
    sleep 30  # Wait for service to start
    
    # Verify restart was successful
    if check_ollama_health; then
        echo "$(date): Ollama restarted successfully"
        rm -f "$RESTART_COUNT_FILE"  # Reset counter on success
        
        # Send recovery notification
        curl -X POST "$SLACK_WEBHOOK_URL" \
            -H 'Content-type: application/json' \
            --data '{"text":"✅ Ollama service recovered after restart"}'
    else
        # Increment restart counter
        RESTART_COUNT=$((RESTART_COUNT + 1))
        echo "$RESTART_COUNT" > "$RESTART_COUNT_FILE"
        echo "$(date): Restart attempt $RESTART_COUNT failed"
    fi
}

# Main health check logic
if ! check_ollama_health; then
    echo "$(date): Ollama health check failed"
    restart_ollama
else
    echo "$(date): Ollama health check passed"
    # Reset restart counter on successful health check
    rm -f "$RESTART_COUNT_FILE"
fi
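To run this script on a schedule, a systemd timer works well. The unit file paths and the webhook placeholder below are illustrative:

```ini
# /etc/systemd/system/ollama-recovery.service
[Unit]
Description=Ollama health check and auto-recovery

[Service]
Type=oneshot
ExecStart=/usr/local/bin/auto_recovery.sh
# The script reads SLACK_WEBHOOK_URL from its environment
Environment=SLACK_WEBHOOK_URL=https://hooks.slack.com/services/REPLACE_ME

# /etc/systemd/system/ollama-recovery.timer
[Unit]
Description=Run the Ollama recovery check every minute

[Timer]
OnCalendar=minutely
Persistent=true

[Install]
WantedBy=timers.target
```

Activate with systemctl daemon-reload followed by systemctl enable --now ollama-recovery.timer.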

Monitoring Dashboard Setup

Grafana Dashboard Configuration

Create visual dashboards for Ollama metrics:

{
  "dashboard": {
    "title": "Ollama Production Health",
    "panels": [
      {
        "title": "Service Availability",
        "type": "stat",
        "targets": [
          {
            "expr": "ollama_up",
            "legendFormat": "Ollama Status"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "color": {
              "mode": "thresholds"
            },
            "thresholds": {
              "steps": [
                {"color": "red", "value": 0},
                {"color": "green", "value": 1}
              ]
            }
          }
        }
      },
      {
        "title": "Response Time",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(ollama_response_time_seconds_bucket[5m]))",
            "legendFormat": "95th Percentile"
          },
          {
            "expr": "histogram_quantile(0.50, rate(ollama_response_time_seconds_bucket[5m]))",
            "legendFormat": "Median"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(ollama_errors_total[5m])",
            "legendFormat": "Errors per second"
          }
        ]
      }
    ]
  }
}
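Dashboards are for looking; alerts are for waking you up. A sketch of Prometheus alerting rules built on the exporter's metrics (the file name and thresholds are illustrative, and routing the alerts to Slack or email additionally requires an Alertmanager):

```yaml
# ollama_alerts.yml — load from prometheus.yml via a rule_files entry
groups:
  - name: ollama
    rules:
      - alert: OllamaDown
        expr: ollama_up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Ollama has been unreachable for 2 minutes"
      - alert: OllamaHighErrorRate
        expr: rate(ollama_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Ollama error rate exceeded 0.1/s over the last 5 minutes"
```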

Production Deployment Checklist

Before deploying Ollama health checks to production, verify these critical elements:

Pre-Deployment Validation

  1. Test health check endpoints in staging environment
  2. Verify alert thresholds don't trigger false positives
  3. Confirm monitoring data flows to your observability platform
  4. Test automated recovery scripts with controlled failures
  5. Validate notification channels receive alerts correctly

Deployment Steps

  1. Deploy monitoring components alongside Ollama service
  2. Configure Prometheus to scrape Ollama metrics
  3. Set up Grafana dashboards with appropriate alerts
  4. Enable automated health check scripts via cron or systemd timers
  5. Test end-to-end monitoring with intentional service disruption
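Step 4 mentions cron as an alternative to systemd timers; an illustrative cron drop-in (the script paths are assumptions matching the earlier sections):

```
# /etc/cron.d/ollama-health — schedule the health and recovery scripts
# (adjust paths to wherever you installed the scripts)
*/5 * * * * root /usr/local/bin/basic-ollama-health-check.sh >> /var/log/ollama-health.log 2>&1
* * * * *   root /usr/local/bin/auto_recovery.sh >> /var/log/ollama-recovery.log 2>&1
```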

Post-Deployment Verification

Monitor your monitoring system for the first 48 hours to ensure:

  • Health checks run consistently without false alerts
  • Metric collection captures expected data ranges
  • Alert notifications reach the correct teams
  • Automated recovery actions work as designed

Conclusion

Effective Ollama production health checks prevent service disruptions and maintain high availability for your LLM applications. This comprehensive monitoring approach combines basic connectivity tests, quality assessments, resource monitoring, and automated recovery actions.

The monitoring patterns shown here scale from simple curl-based checks to sophisticated observability platforms. Start with basic health checks and gradually add complexity as your production requirements grow.

Your users will notice the difference when Ollama issues get resolved before they cause application failures. Implement these health check strategies to build confidence in your production LLM deployment.

Remember: good monitoring pays for itself the first time it prevents a production outage. Your 3 AM self will thank you for setting up these automated health checks today.