Your Ollama instance just went down at 3 AM, and you found out from angry users instead of your monitoring system. Sound familiar? That awkward moment when your LLM stops responding, and you're frantically checking logs while your production application throws errors.
This guide shows you how to implement Ollama production health checks that catch issues before users notice them. You'll learn to build monitoring systems that track performance, detect failures, and alert you instantly when problems occur.
## Why Ollama Production Health Checks Matter
Production Ollama deployments face unique challenges that traditional web application monitoring doesn't cover. LLM inference can fail silently, consume excessive memory, or respond with degraded quality without obvious error signals.
The consequences of poor Ollama monitoring include:
- Silent failures that affect user experience without triggering alerts
- Resource exhaustion leading to system crashes
- Model loading delays causing application timeouts
- Quality degradation that users notice before you do
## Essential Ollama Health Check Components
### Basic Connection Health Check
Start with a simple connectivity test that verifies Ollama responds to requests:
```bash
#!/bin/bash
# basic-ollama-health-check.sh

OLLAMA_URL="http://localhost:11434"
TIMEOUT=30

# Test basic connectivity
response=$(curl -s -w "%{http_code}" -o /dev/null --max-time "$TIMEOUT" "$OLLAMA_URL/api/tags")

if [ "$response" = "200" ]; then
    echo "✓ Ollama is responding"
    exit 0
else
    echo "✗ Ollama health check failed (HTTP: $response)"
    exit 1
fi
```
### Model Loading Verification
Verify that your required models load correctly and respond within acceptable timeframes:
```python
# model_health_check.py
import sys
import time

import requests


def check_model_health(model_name, max_response_time=10):
    """Check if a specific model responds within time limits."""
    url = "http://localhost:11434/api/generate"
    payload = {
        "model": model_name,
        "prompt": "Hello",
        "stream": False
    }

    start_time = time.time()
    try:
        response = requests.post(url, json=payload, timeout=max_response_time)
        response_time = time.time() - start_time

        if response.status_code == 200:
            print(f"✓ Model {model_name}: {response_time:.2f}s response time")
            return True
        else:
            print(f"✗ Model {model_name}: HTTP {response.status_code}")
            return False
    except requests.exceptions.Timeout:
        print(f"✗ Model {model_name}: Timeout after {max_response_time}s")
        return False
    except Exception as e:
        print(f"✗ Model {model_name}: Error - {e}")
        return False


# Check multiple models
models_to_check = ["llama2", "codellama", "mistral"]
all_healthy = True

for model in models_to_check:
    if not check_model_health(model):
        all_healthy = False

sys.exit(0 if all_healthy else 1)
```
## Comprehensive Monitoring Setup
### Prometheus Metrics Collection
Create custom metrics that track Ollama performance indicators:
```python
# ollama_metrics_exporter.py
import threading
import time

import requests
from prometheus_client import start_http_server, Gauge, Counter, Histogram

# Define metrics
ollama_up = Gauge('ollama_up', 'Ollama service availability')
ollama_response_time = Histogram('ollama_response_time_seconds', 'Response time for Ollama requests')
ollama_memory_usage = Gauge('ollama_memory_usage_bytes', 'Memory usage by Ollama')
ollama_requests_total = Counter('ollama_requests_total', 'Total requests to Ollama')
ollama_errors_total = Counter('ollama_errors_total', 'Total errors from Ollama')


def collect_ollama_metrics():
    """Continuously collect Ollama metrics."""
    while True:
        try:
            # Test basic connectivity
            start_time = time.time()
            response = requests.get("http://localhost:11434/api/tags", timeout=10)
            response_time = time.time() - start_time
            ollama_requests_total.inc()  # Count every completed request

            if response.status_code == 200:
                ollama_up.set(1)
                ollama_response_time.observe(response_time)
            else:
                ollama_up.set(0)
                ollama_errors_total.inc()
        except Exception as e:
            ollama_up.set(0)
            ollama_errors_total.inc()
            print(f"Metrics collection error: {e}")

        time.sleep(30)  # Collect metrics every 30 seconds


if __name__ == '__main__':
    # Start Prometheus metrics server
    start_http_server(8000)
    print("Metrics server started on port 8000")

    # Start metrics collection in background thread
    metrics_thread = threading.Thread(target=collect_ollama_metrics, daemon=True)
    metrics_thread.start()

    # Keep the main thread alive
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        print("Shutting down metrics exporter")
```
### Docker Health Check Configuration
Add health checks directly to your Ollama Docker deployment:
```dockerfile
# Dockerfile with health check
FROM ollama/ollama:latest

# Add health check script
COPY health_check.sh /usr/local/bin/health_check.sh
RUN chmod +x /usr/local/bin/health_check.sh

# Configure Docker health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
    CMD /usr/local/bin/health_check.sh || exit 1

EXPOSE 11434
```
```yaml
# docker-compose.yml with health checks
version: '3.8'

services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    healthcheck:
      # Note: curl is not guaranteed to be present in the base image;
      # if this check always fails, build an image that installs it.
      test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
    restart: unless-stopped

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    depends_on:
      - ollama

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    depends_on:
      - prometheus

volumes:
  ollama_data:
```
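The compose file mounts a `./prometheus.yml` that isn't shown above. A minimal sketch, assuming the Python metrics exporter from the previous section runs on the Docker host at port 8000 (`host.docker.internal` resolves to the host on Docker Desktop; on Linux you may need an `extra_hosts` entry or the host's IP instead):

```yaml
# prometheus.yml — minimal scrape config for the custom Ollama exporter
global:
  scrape_interval: 30s

scrape_configs:
  - job_name: 'ollama'
    static_configs:
      # Assumes ollama_metrics_exporter.py is listening on host port 8000
      - targets: ['host.docker.internal:8000']
```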
## Advanced Observability Patterns
### Quality-Based Health Checks
Monitor response quality to detect model degradation:
```python
# quality_health_check.py
import requests
from textstat import flesch_reading_ease


def check_response_quality(model_name, test_prompts):
    """Check if model responses meet quality thresholds."""
    quality_scores = []

    for prompt in test_prompts:
        try:
            response = requests.post(
                "http://localhost:11434/api/generate",
                json={
                    "model": model_name,
                    "prompt": prompt,
                    "stream": False
                },
                timeout=30
            )

            if response.status_code == 200:
                result = response.json()
                text = result.get('response', '')

                # Basic quality checks
                word_count = len(text.split())
                readability = flesch_reading_ease(text)
                has_repetition = check_repetition(text)

                quality_score = calculate_quality_score(word_count, readability, has_repetition)
                quality_scores.append(quality_score)
        except Exception as e:
            print(f"Quality check error for prompt '{prompt}': {e}")
            return False

    average_quality = sum(quality_scores) / len(quality_scores) if quality_scores else 0
    threshold = 70  # Minimum acceptable quality score

    if average_quality >= threshold:
        print(f"✓ Model quality check passed: {average_quality:.1f}")
        return True
    else:
        print(f"✗ Model quality degraded: {average_quality:.1f} < {threshold}")
        return False


def check_repetition(text):
    """Detect excessive repetition in model output."""
    # Split on periods and drop empty fragments
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    if not sentences:
        return False
    repetition_ratio = 1 - (len(set(sentences)) / len(sentences))
    return repetition_ratio > 0.3  # 30% repetition threshold


def calculate_quality_score(word_count, readability, has_repetition):
    """Calculate composite quality score."""
    score = 50  # Base score

    # Word count contribution (prefer 50-200 words)
    if 50 <= word_count <= 200:
        score += 20
    elif word_count < 10:
        score -= 30

    # Readability contribution
    if readability > 60:
        score += 20
    elif readability < 30:
        score -= 20

    # Repetition penalty
    if has_repetition:
        score -= 25

    return max(0, min(100, score))


# Test prompts for quality assessment
test_prompts = [
    "Explain the concept of machine learning in simple terms.",
    "What are the benefits of using Docker containers?",
    "Describe the process of photosynthesis."
]

# Run quality check
model_healthy = check_response_quality("llama2", test_prompts)
```
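The repetition heuristic is worth sanity-checking in isolation before trusting it in production. A self-contained sketch of the same logic:

```python
def repetition_ratio(text: str) -> float:
    """Fraction of sentences that duplicate an earlier sentence."""
    # Split on periods and drop empty fragments
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    if not sentences:
        return 0.0
    return 1 - len(set(sentences)) / len(sentences)


# Three identical sentences: two of three are repeats, ratio ~0.67
print(repetition_ratio("I am a model. I am a model. I am a model."))
# Three distinct sentences: no repetition, ratio 0.0
print(repetition_ratio("One. Two. Three."))
```

With the 0.3 threshold used above, the first example would be flagged as degenerate output and the second would pass.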
### Resource Monitoring Integration
Track system resources alongside Ollama health:
```python
# resource_monitor.py
import psutil


def get_ollama_process_stats():
    """Get CPU and memory usage for the Ollama process."""
    for proc in psutil.process_iter(['pid', 'name', 'cpu_percent', 'memory_info']):
        if 'ollama' in (proc.info['name'] or '').lower():
            return {
                'pid': proc.info['pid'],
                'cpu_percent': proc.info['cpu_percent'],
                'memory_mb': proc.info['memory_info'].rss / 1024 / 1024,
                'memory_percent': proc.memory_percent()
            }
    return None


def check_resource_health():
    """Monitor resource usage and alert on thresholds."""
    # System-wide checks
    cpu_usage = psutil.cpu_percent(interval=1)
    memory = psutil.virtual_memory()
    disk = psutil.disk_usage('/')

    # Ollama-specific checks
    ollama_stats = get_ollama_process_stats()

    # Define thresholds
    thresholds = {
        'cpu_warning': 80,
        'cpu_critical': 95,
        'memory_warning': 80,
        'memory_critical': 95,
        'disk_warning': 85,
        'disk_critical': 95
    }

    alerts = []

    # Check system CPU
    if cpu_usage > thresholds['cpu_critical']:
        alerts.append(f"CRITICAL: CPU usage at {cpu_usage:.1f}%")
    elif cpu_usage > thresholds['cpu_warning']:
        alerts.append(f"WARNING: CPU usage at {cpu_usage:.1f}%")

    # Check system memory
    if memory.percent > thresholds['memory_critical']:
        alerts.append(f"CRITICAL: Memory usage at {memory.percent:.1f}%")
    elif memory.percent > thresholds['memory_warning']:
        alerts.append(f"WARNING: Memory usage at {memory.percent:.1f}%")

    # Check disk space (model files are large, so this fills up fast)
    if disk.percent > thresholds['disk_critical']:
        alerts.append(f"CRITICAL: Disk usage at {disk.percent:.1f}%")
    elif disk.percent > thresholds['disk_warning']:
        alerts.append(f"WARNING: Disk usage at {disk.percent:.1f}%")

    # Check Ollama process
    if ollama_stats:
        if ollama_stats['memory_mb'] > 8192:  # 8GB threshold
            alerts.append(f"WARNING: Ollama using {ollama_stats['memory_mb']:.0f}MB memory")
        if ollama_stats['cpu_percent'] > 90:
            alerts.append(f"WARNING: Ollama CPU usage at {ollama_stats['cpu_percent']:.1f}%")

    return alerts


# Example usage
alerts = check_resource_health()
if alerts:
    for alert in alerts:
        print(alert)
else:
    print("✓ All resource checks passed")
```
## Automated Alert Configuration
### Slack Integration for Critical Alerts
Set up automated notifications for production issues:
```python
# slack_alerting.py
import os
from datetime import datetime

import requests


class SlackAlerter:
    def __init__(self, webhook_url):
        self.webhook_url = webhook_url

    def send_alert(self, severity, title, message, details=None):
        """Send formatted alert to Slack."""
        # Color coding for different severities
        colors = {
            'critical': '#FF0000',
            'warning': '#FFA500',
            'info': '#00FF00'
        }

        payload = {
            "attachments": [
                {
                    "color": colors.get(severity, '#808080'),
                    "title": f"{severity.upper()}: {title}",
                    "text": message,
                    "fields": [
                        {
                            "title": "Timestamp",
                            "value": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
                            "short": True
                        },
                        {
                            "title": "Service",
                            "value": "Ollama Production",
                            "short": True
                        }
                    ],
                    "footer": "Ollama Health Monitor"
                }
            ]
        }

        if details:
            payload["attachments"][0]["fields"].append({
                "title": "Details",
                "value": details,
                "short": False
            })

        try:
            response = requests.post(self.webhook_url, json=payload, timeout=10)
            response.raise_for_status()
            print(f"Alert sent successfully: {title}")
        except Exception as e:
            print(f"Failed to send alert: {e}")


# Example usage
def check_and_alert():
    slack = SlackAlerter(os.getenv('SLACK_WEBHOOK_URL'))

    # Run health checks
    try:
        response = requests.get("http://localhost:11434/api/tags", timeout=10)
        if response.status_code != 200:
            slack.send_alert(
                severity='critical',
                title='Ollama Service Down',
                message='Ollama is not responding to health checks',
                details=f'HTTP Status: {response.status_code}'
            )
    except requests.exceptions.Timeout:
        slack.send_alert(
            severity='critical',
            title='Ollama Timeout',
            message='Ollama health check timed out after 10 seconds'
        )
    except Exception as e:
        slack.send_alert(
            severity='critical',
            title='Ollama Health Check Failed',
            message='Unable to connect to Ollama service',
            details=str(e)
        )
```
### Automated Recovery Actions
Implement self-healing capabilities for common issues:
```bash
#!/bin/bash
# auto_recovery.sh

OLLAMA_SERVICE="ollama"
MAX_RESTART_ATTEMPTS=3
RESTART_COUNT_FILE="/tmp/ollama_restart_count"

check_ollama_health() {
    curl -s -f --max-time 10 http://localhost:11434/api/tags > /dev/null
}

restart_ollama() {
    echo "$(date): Attempting to restart Ollama service"

    # Get current restart count
    if [ -f "$RESTART_COUNT_FILE" ]; then
        RESTART_COUNT=$(cat "$RESTART_COUNT_FILE")
    else
        RESTART_COUNT=0
    fi

    # Check if we've exceeded max attempts
    if [ "$RESTART_COUNT" -ge "$MAX_RESTART_ATTEMPTS" ]; then
        echo "$(date): Max restart attempts reached. Manual intervention required."
        # Send critical alert
        curl -X POST "$SLACK_WEBHOOK_URL" \
            -H 'Content-type: application/json' \
            --data "{\"text\":\"🚨 CRITICAL: Ollama failed to restart after $MAX_RESTART_ATTEMPTS attempts\"}"
        exit 1
    fi

    # Attempt restart
    systemctl restart "$OLLAMA_SERVICE"
    sleep 30  # Wait for service to start

    # Verify restart was successful
    if check_ollama_health; then
        echo "$(date): Ollama restarted successfully"
        rm -f "$RESTART_COUNT_FILE"  # Reset counter on success
        # Send recovery notification
        curl -X POST "$SLACK_WEBHOOK_URL" \
            -H 'Content-type: application/json' \
            --data '{"text":"✅ Ollama service recovered after restart"}'
    else
        # Increment restart counter
        RESTART_COUNT=$((RESTART_COUNT + 1))
        echo "$RESTART_COUNT" > "$RESTART_COUNT_FILE"
        echo "$(date): Restart attempt $RESTART_COUNT failed"
    fi
}

# Main health check logic
if ! check_ollama_health; then
    echo "$(date): Ollama health check failed"
    restart_ollama
else
    echo "$(date): Ollama health check passed"
    # Reset restart counter on successful health check
    rm -f "$RESTART_COUNT_FILE"
fi
```
## Monitoring Dashboard Setup
### Grafana Dashboard Configuration
Create visual dashboards for Ollama metrics:
```json
{
  "dashboard": {
    "title": "Ollama Production Health",
    "panels": [
      {
        "title": "Service Availability",
        "type": "stat",
        "targets": [
          {
            "expr": "ollama_up",
            "legendFormat": "Ollama Status"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "color": {
              "mode": "thresholds"
            },
            "thresholds": {
              "steps": [
                {"color": "red", "value": 0},
                {"color": "green", "value": 1}
              ]
            }
          }
        }
      },
      {
        "title": "Response Time",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(ollama_response_time_seconds_bucket[5m]))",
            "legendFormat": "95th Percentile"
          },
          {
            "expr": "histogram_quantile(0.50, rate(ollama_response_time_seconds_bucket[5m]))",
            "legendFormat": "Median"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(ollama_errors_total[5m])",
            "legendFormat": "Errors per second"
          }
        ]
      }
    ]
  }
}
```
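Dashboards visualize trends, but paging should come from alert rules rather than from someone watching graphs. A minimal Prometheus alerting-rule sketch against the same metrics (the rule names, the 2-minute hold, and the 0.1/s error threshold are illustrative choices, not part of the setup above):

```yaml
# ollama_alerts.yml — load from prometheus.yml via the `rule_files` key
groups:
  - name: ollama
    rules:
      - alert: OllamaDown
        expr: ollama_up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Ollama has been unreachable for 2 minutes"
      - alert: OllamaHighErrorRate
        expr: rate(ollama_errors_total[5m]) > 0.1
        labels:
          severity: warning
        annotations:
          summary: "Ollama error rate above 0.1/s over the last 5 minutes"
```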
## Production Deployment Checklist
Before deploying Ollama health checks to production, verify these critical elements:
### Pre-Deployment Validation
- Test health check endpoints in staging environment
- Verify alert thresholds don't trigger false positives
- Confirm monitoring data flows to your observability platform
- Test automated recovery scripts with controlled failures
- Validate notification channels receive alerts correctly
### Deployment Steps
- Deploy monitoring components alongside Ollama service
- Configure Prometheus to scrape Ollama metrics
- Set up Grafana dashboards with appropriate alerts
- Enable automated health check scripts via cron or systemd timers
- Test end-to-end monitoring with intentional service disruption
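For the cron/systemd step, a hedged sketch of a systemd timer that runs the recovery script from earlier once a minute (the unit names and the `/usr/local/bin/auto_recovery.sh` path are assumptions; adjust to wherever you install the script):

```ini
# /etc/systemd/system/ollama-health.service
[Unit]
Description=Ollama health check and auto-recovery

[Service]
Type=oneshot
ExecStart=/usr/local/bin/auto_recovery.sh

# /etc/systemd/system/ollama-health.timer
[Unit]
Description=Run Ollama health check every minute

[Timer]
OnCalendar=*-*-* *:*:00
Persistent=true

[Install]
WantedBy=timers.target
```

Enable it with `systemctl enable --now ollama-health.timer`. The cron equivalent is a single line: `* * * * * /usr/local/bin/auto_recovery.sh`.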
### Post-Deployment Verification
Monitor your monitoring system for the first 48 hours to ensure:
- Health checks run consistently without false alerts
- Metric collection captures expected data ranges
- Alert notifications reach the correct teams
- Automated recovery actions work as designed
## Conclusion
Effective Ollama production health checks prevent service disruptions and maintain high availability for your LLM applications. This comprehensive monitoring approach combines basic connectivity tests, quality assessments, resource monitoring, and automated recovery actions.
The monitoring patterns shown here scale from simple curl-based checks to sophisticated observability platforms. Start with basic health checks and gradually add complexity as your production requirements grow.
Your users will notice the difference when Ollama issues get resolved before they cause application failures. Implement these health check strategies to build confidence in your production LLM deployment.
Remember: good monitoring pays for itself the first time it prevents a production outage. Your 3 AM self will thank you for setting up these automated health checks today.