Your Ollama instance started with 2GB RAM usage. Three days later, it's consuming 16GB and your server is screaming for mercy. Sound familiar? You're not alone in this memory-hungry nightmare.
Memory leaks in Ollama long-running processes can destroy server performance and drain resources faster than a teenager drains their phone battery. This guide provides proven techniques to detect, fix, and prevent Ollama memory leaks permanently.
You'll learn step-by-step detection methods, optimization strategies, and monitoring tools that reduce memory consumption by up to 60%. Let's transform your resource-hungry Ollama into an efficient, leak-free powerhouse.
Why Ollama Develops Memory Leaks in Long-Running Processes
Ollama memory leaks occur due to several specific factors that compound over time in persistent deployments.
Primary Memory Leak Causes
Model Context Accumulation: Ollama keeps loaded models and their KV caches resident between requests. Long prompts grow the per-request context memory, and a model held with a long keep_alive setting never releases it.
GPU Memory Fragmentation: Graphics memory allocation patterns create fragmented spaces. These gaps prevent efficient memory reuse and cause gradual consumption increases.
Connection Pool Bloat: HTTP connection pools expand with concurrent requests. Abandoned connections remain in memory without proper cleanup.
Cache Growth: Ollama keeps model weights and recently used contexts in memory. Without sensible residency and eviction settings, that footprint grows instead of shrinking.
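As a back-of-the-envelope check, the KV-cache contribution can be estimated from the model's shape. The sketch below uses standard transformer KV-cache arithmetic with illustrative dimensions; these are not figures read out of Ollama itself:

```python
# Rough KV-cache size estimate for a transformer model: 2 tensors (K and V)
# per layer, per token. Dimensions below are illustrative, not Ollama internals.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Approximate KV-cache memory for one sequence at full context."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

# A 7B-class model (32 layers, 32 KV heads, head_dim 128) at 4096 context, fp16:
cache = kv_cache_bytes(32, 32, 128, 4096)
print(f"{cache / 1024**3:.1f} GiB")  # 2.0 GiB
```

Numbers like this explain why a few long-context sessions held resident can dwarf the weights themselves.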
Essential Tools for Ollama Memory Leak Detection
Effective memory leak detection requires the right monitoring tools and techniques.
System-Level Monitoring Tools
htop and ps Commands: Track Ollama process memory usage over time with built-in Linux utilities.
# Monitor Ollama memory usage continuously
# (the [o] pattern stops grep from matching its own process)
watch -n 5 "ps aux | grep '[o]llama'"
# Track memory trends (RSS in KB) with timestamps
while true; do
  echo "$(date): $(ps aux | grep '[o]llama' | awk '{print $6}' | head -1)" >> ollama_memory.log
  sleep 300
done
Valgrind Memory Profiler: Valgrind tracks leaks at the application level with detailed allocation records. Note that Ollama is a Go binary, so the Go runtime produces many false positives and Valgrind slows execution dramatically; reserve it for short, targeted sessions.
# Run Ollama under Valgrind (expect a significant slowdown)
valgrind --tool=memcheck --leak-check=full --track-origins=yes ollama serve
Ollama-Specific Monitoring
Built-in Model Status Endpoint: Ollama does not expose a dedicated metrics endpoint, but its /api/ps endpoint reports each loaded model's memory footprint and expiry.
# List loaded models and their memory usage
curl -s http://localhost:11434/api/ps | jq '.models[] | {name, size, size_vram}'
# Log total resident model memory over time
while true; do
  echo "$(date +%s),$(curl -s http://localhost:11434/api/ps | jq '[.models[].size] | add // 0')" >> memory_usage.log
  sleep 60
done
Step-by-Step Memory Leak Detection Process
Follow these detailed steps to identify and locate memory leaks in your Ollama deployment.
Step 1: Establish Memory Baseline
Create a clean starting point to measure memory growth accurately.
# Restart Ollama service
sudo systemctl restart ollama
# Wait for full initialization
sleep 30
# Record baseline memory usage
BASELINE=$(ps aux | grep '[o]llama' | awk '{print $6}' | head -1)
echo "Baseline memory: ${BASELINE}KB" > memory_baseline.txt
Step 2: Implement Continuous Monitoring
Set up automated monitoring to track memory changes over time.
#!/bin/bash
# memory_monitor.sh - Ollama memory tracking script
LOG_FILE="ollama_memory_$(date +%Y%m%d).log"
BASELINE_FILE="memory_baseline.txt"

# Read baseline if it exists
if [[ -f "$BASELINE_FILE" ]]; then
  BASELINE=$(grep -o '[0-9]*' "$BASELINE_FILE" | head -1)
else
  BASELINE=0
fi

while true; do
  CURRENT=$(ps aux | grep '[o]llama' | awk '{print $6}' | head -1)
  TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
  GROWTH=$((CURRENT - BASELINE))
  echo "$TIMESTAMP,$CURRENT,$GROWTH" >> "$LOG_FILE"
  # Alert if memory growth exceeds 1GB (values are in KB)
  if [[ $GROWTH -gt 1048576 ]]; then
    echo "ALERT: Memory growth exceeded 1GB at $TIMESTAMP" | tee -a alerts.log
  fi
  sleep 300  # Check every 5 minutes
done
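Once the log accumulates samples, a few lines of Python turn it into a growth rate. This helper assumes the CSV layout written by the monitoring script (timestamp, current KB, growth KB); the sample lines are illustrative:

```python
# Compute average growth rate (KB/hour) from memory-monitor log lines of the
# form "timestamp,current_kb,growth_kb". Sample data is made up.
from datetime import datetime

def growth_rate_kb_per_hour(lines):
    rows = [line.strip().split(",") for line in lines if line.strip()]
    t0 = datetime.strptime(rows[0][0], "%Y-%m-%d %H:%M:%S")
    t1 = datetime.strptime(rows[-1][0], "%Y-%m-%d %H:%M:%S")
    hours = (t1 - t0).total_seconds() / 3600
    delta_kb = int(rows[-1][1]) - int(rows[0][1])
    return delta_kb / hours if hours > 0 else 0.0

log = [
    "2024-05-01 10:00:00,2097152,0",
    "2024-05-01 12:00:00,2305024,207872",
]
print(growth_rate_kb_per_hour(log))  # 103936.0 KB/hour (~101 MB/hour)
```

A sustained positive rate across quiet periods is the clearest single signal of a leak.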
Step 3: Load Testing for Leak Detection
Generate controlled load to identify leak patterns and trigger points.
# load_test_ollama.py - Generate requests to detect memory patterns
import requests
import time
from concurrent.futures import ThreadPoolExecutor

def send_request(request_id):
    """Send a single request to the Ollama API."""
    url = "http://localhost:11434/api/generate"
    data = {
        "model": "llama2",
        "prompt": f"Test request {request_id}: Explain machine learning basics.",
        "stream": False
    }
    try:
        response = requests.post(url, json=data, timeout=30)
        return response.status_code
    except Exception as e:
        print(f"Request {request_id} failed: {e}")
        return None

def load_test(duration_minutes=60, concurrent_requests=10):
    """Run a sustained load test to trigger memory leaks."""
    print(f"Starting {duration_minutes}-minute load test with {concurrent_requests} concurrent requests")
    start_time = time.time()
    request_count = 0
    with ThreadPoolExecutor(max_workers=concurrent_requests) as executor:
        while time.time() - start_time < duration_minutes * 60:
            # Submit a batch of concurrent requests and wait for it to finish
            futures = [executor.submit(send_request, request_count + i)
                       for i in range(concurrent_requests)]
            for future in futures:
                future.result()
            request_count += concurrent_requests
            print(f"Completed {request_count} requests")
            time.sleep(1)
    print(f"Load test completed. Total requests: {request_count}")

if __name__ == "__main__":
    load_test(duration_minutes=30, concurrent_requests=5)
Advanced Memory Profiling Techniques
Deep dive into Ollama's memory allocation patterns using advanced profiling methods.
GPU Memory Analysis
Monitor GPU memory usage patterns specific to model loading and inference.
# Monitor GPU memory usage continuously
nvidia-smi --query-gpu=timestamp,memory.used,memory.free --format=csv --loop=1 > gpu_memory.log
#!/bin/bash
# gpu_analysis.sh - log GPU memory with timestamps
while true; do
  GPU_MEM=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits)
  TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
  echo "$TIMESTAMP,$GPU_MEM" >> gpu_memory_detailed.log
  sleep 10
done
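A quick way to digest the detailed log is to pull out the peak sample. This assumes the `timestamp,MiB` layout written by gpu_analysis.sh; the sample lines are made up:

```python
# Find the peak GPU memory sample in a "timestamp,mem_mib" log.
# Sample lines are illustrative, not real nvidia-smi output.
def peak_gpu_mib(lines):
    best_time, best_mem = None, -1
    for line in lines:
        if not line.strip():
            continue
        timestamp, mem = line.rsplit(",", 1)
        if int(mem.strip()) > best_mem:
            best_time, best_mem = timestamp, int(mem.strip())
    return best_time, best_mem

log = [
    "2024-05-01 10:00:00,6144",
    "2024-05-01 10:00:10,7982",
    "2024-05-01 10:00:20,6800",
]
print(peak_gpu_mib(log))  # ('2024-05-01 10:00:10', 7982)
```

Comparing the GPU peak against the system-memory peak from the earlier logs shows whether spikes come from model loads or from host-side buffering.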
Memory Heap Analysis
Use system profiling tools to analyze heap allocation patterns.
# Install gperftools (package names vary by distro)
sudo apt-get install google-perftools libgoogle-perftools-dev
# Ollama is not linked against tcmalloc, so preload it to enable heap profiling
# (the library path below is typical for Debian/Ubuntu; adjust for your system)
env LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 \
    HEAPPROFILE=/tmp/ollama_heap ollama serve
# Analyze heap dumps
google-pprof --text $(which ollama) /tmp/ollama_heap.0001.heap
Proven Ollama Memory Optimization Strategies
Implement these battle-tested optimization techniques to eliminate memory leaks.
Configuration-Based Optimizations
Memory Limits Configuration: Ollama does not read a YAML config file; it is configured through environment variables, typically set in a systemd override (sudo systemctl edit ollama).
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
# Unload idle models after 5 minutes instead of keeping them resident
Environment="OLLAMA_KEEP_ALIVE=5m"
# Cap how many models can be loaded at once
Environment="OLLAMA_MAX_LOADED_MODELS=1"
# Limit parallel requests per model (each slot adds KV-cache memory)
Environment="OLLAMA_NUM_PARALLEL=2"
# Bound the request queue
Environment="OLLAMA_MAX_QUEUE=128"
Environment Variables Setup: the same variables can be exported when running the server manually.
# Set memory management environment variables
export OLLAMA_KEEP_ALIVE=5m
export OLLAMA_MAX_LOADED_MODELS=1
export OLLAMA_NUM_PARALLEL=2
export OLLAMA_MAX_QUEUE=128
# Start Ollama with optimized settings
ollama serve
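Residency can also be controlled per request: the Ollama generate API accepts a keep_alive field (0 unloads the model right after the call, -1 pins it in memory indefinitely). A minimal payload builder, with the model name as a placeholder:

```python
# Build /api/generate payloads whose keep_alive field controls how long the
# model stays resident after the request (0 = unload now, -1 = pin forever).
def generate_payload(model, prompt="", keep_alive="5m"):
    payload = {"model": model, "stream": False, "keep_alive": keep_alive}
    if prompt:
        payload["prompt"] = prompt
    return payload

# An empty-prompt request with keep_alive 0 asks Ollama to unload the model
unload = generate_payload("llama2", keep_alive=0)
print(unload)  # {'model': 'llama2', 'stream': False, 'keep_alive': 0}
```

POSTing that payload to http://localhost:11434/api/generate is the supported way to evict a model without restarting the server.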
Process Management Optimizations
Automatic Restart Strategy: Implement periodic restarts to clear accumulated memory.
#!/bin/bash
# ollama_restart_manager.sh - Automatic memory management
MEMORY_THRESHOLD=8388608  # 8GB in KB
CHECK_INTERVAL=300        # 5 minutes

while true; do
  CURRENT_MEMORY=$(ps aux | grep '[o]llama' | awk '{print $6}' | head -1)
  if [[ $CURRENT_MEMORY -gt $MEMORY_THRESHOLD ]]; then
    echo "$(date): Memory threshold exceeded ($CURRENT_MEMORY KB). Restarting Ollama..."
    # Graceful restart
    sudo systemctl stop ollama
    sleep 10
    sudo systemctl start ollama
    # Wait for the service to stabilize
    sleep 30
    echo "$(date): Ollama restarted successfully"
  fi
  sleep $CHECK_INTERVAL
done
Code-Level Memory Management
Request Batching Implementation: Group requests to reduce memory overhead per operation.
# ollama_batch_manager.py - Efficient request batching
import asyncio
import time
from collections import deque

import aiohttp

class OllamaBatchManager:
    def __init__(self, batch_size=5, batch_timeout=10):
        self.batch_size = batch_size
        self.batch_timeout = batch_timeout  # reserved; batching here is size-driven
        self.request_queue = deque()
        self.processing = False

    async def add_request(self, prompt, model="llama2"):
        """Add a request to the batch queue."""
        self.request_queue.append({
            "prompt": prompt,
            "model": model,
            "timestamp": time.time()
        })
        if not self.processing:
            await self.process_batch()

    async def process_batch(self):
        """Process queued requests in batches to bound memory usage."""
        self.processing = True
        while self.request_queue:
            batch = []
            # Collect up to batch_size requests
            for _ in range(min(self.batch_size, len(self.request_queue))):
                if self.request_queue:
                    batch.append(self.request_queue.popleft())
            if batch:
                await self.execute_batch(batch)
        self.processing = False

    async def execute_batch(self, batch):
        """Execute a batch of requests concurrently over one session."""
        async with aiohttp.ClientSession() as session:
            tasks = [self.send_single_request(session, request) for request in batch]
            results = await asyncio.gather(*tasks, return_exceptions=True)
            for i, result in enumerate(results):
                if isinstance(result, Exception):
                    print(f"Request {i} failed: {result}")
                else:
                    print(f"Request {i} completed successfully")

    async def send_single_request(self, session, request):
        """Send an individual request; the session is closed by execute_batch."""
        url = "http://localhost:11434/api/generate"
        data = {
            "model": request["model"],
            "prompt": request["prompt"],
            "stream": False
        }
        async with session.post(url, json=data) as response:
            return await response.json()

# Usage example
async def main():
    manager = OllamaBatchManager(batch_size=3, batch_timeout=5)
    prompts = [
        "Explain quantum computing",
        "Describe machine learning",
        "What is artificial intelligence?"
    ]
    for prompt in prompts:
        await manager.add_request(prompt)

if __name__ == "__main__":
    asyncio.run(main())
Continuous Memory Monitoring Setup
Establish ongoing monitoring systems to prevent future memory leaks.
Automated Alerting System
Memory Threshold Monitoring: Create alerts when memory usage exceeds safe limits.
#!/bin/bash
# memory_alert_system.sh - Automated memory monitoring
ALERT_THRESHOLD=6291456  # 6GB in KB
EMAIL_RECIPIENT="admin@yourcompany.com"
SLACK_WEBHOOK="https://hooks.slack.com/your/webhook/url"

check_memory_and_alert() {
  local current_memory=$(ps aux | grep '[o]llama' | awk '{print $6}' | head -1)
  local current_time=$(date '+%Y-%m-%d %H:%M:%S')
  if [[ $current_memory -gt $ALERT_THRESHOLD ]]; then
    local memory_gb=$((current_memory / 1024 / 1024))
    local alert_message="ALERT: Ollama memory usage is ${memory_gb}GB at $current_time"
    # Log alert
    echo "$alert_message" >> memory_alerts.log
    # Send Slack notification
    curl -X POST -H 'Content-type: application/json' \
      --data "{\"text\":\"$alert_message\"}" \
      "$SLACK_WEBHOOK"
    # Send email alert
    echo "$alert_message" | mail -s "Ollama Memory Alert" "$EMAIL_RECIPIENT"
    return 1  # Alert triggered
  fi
  return 0  # Normal operation
}

# Main monitoring loop
while true; do
  if ! check_memory_and_alert; then
    sleep 1800  # Wait 30 minutes after an alert to avoid spam
  else
    sleep 300   # 5-minute check during normal operation
  fi
done
Performance Dashboard
Real-time Memory Visualization: Create dashboards to track memory trends visually.
# memory_dashboard.py - Real-time memory monitoring dashboard
import subprocess
from collections import deque
from datetime import datetime

import matplotlib.pyplot as plt
import matplotlib.animation as animation

class OllamaMemoryDashboard:
    THRESHOLD_GB = 6

    def __init__(self, max_points=100):
        self.max_points = max_points
        self.timestamps = deque(maxlen=max_points)
        self.memory_usage = deque(maxlen=max_points)
        # Set up the plot
        self.fig, self.ax = plt.subplots(figsize=(12, 6))
        self.line, = self.ax.plot([], [], 'b-', linewidth=2)
        self.ax.set_title('Ollama Memory Usage Over Time', fontsize=16)
        self.ax.set_xlabel('Time')
        self.ax.set_ylabel('Memory Usage (GB)')
        self.ax.grid(True, alpha=0.3)
        # Draw the alert threshold once, not on every animation frame
        self.ax.axhline(y=self.THRESHOLD_GB, color='r', linestyle='--',
                        label=f'Alert Threshold ({self.THRESHOLD_GB}GB)')
        self.ax.legend()

    def get_current_memory(self):
        """Get current Ollama memory usage (RSS) in GB."""
        try:
            result = subprocess.run(['ps', 'aux'], capture_output=True, text=True)
            for line in result.stdout.split('\n'):
                if 'ollama' in line and 'serve' in line:
                    memory_kb = int(line.split()[5])  # RSS column
                    return memory_kb / 1024 / 1024
            return 0
        except Exception:
            return 0

    def update_plot(self, frame):
        """Update the plot with a new memory sample."""
        current_memory = self.get_current_memory()
        self.timestamps.append(datetime.now())
        self.memory_usage.append(current_memory)
        if len(self.timestamps) > 1:
            self.line.set_data(self.timestamps, self.memory_usage)
            self.ax.set_xlim(min(self.timestamps), max(self.timestamps))
            self.ax.set_ylim(0, max(max(self.memory_usage) * 1.1, 1))
            # Color the line red while above the threshold
            self.line.set_color('red' if current_memory > self.THRESHOLD_GB else 'blue')
        return self.line,

    def start_monitoring(self):
        """Start the real-time dashboard (samples every 5 seconds)."""
        ani = animation.FuncAnimation(self.fig, self.update_plot,
                                      interval=5000, blit=False)
        plt.tight_layout()
        plt.show()

if __name__ == "__main__":
    dashboard = OllamaMemoryDashboard()
    dashboard.start_monitoring()
Memory Leak Prevention Best Practices
Implement these proven practices to prevent memory leaks before they occur.
Development Guidelines
Code Review Checklist: Establish mandatory checks for memory management in Ollama integrations.
- ✅ Explicit connection cleanup after API calls
- ✅ Request timeout configurations set appropriately
- ✅ Model context limits defined and enforced
- ✅ Memory monitoring included in deployment scripts
- ✅ Graceful degradation when memory thresholds exceeded
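As a concrete pattern for the first checklist item, a worker can own a single HTTP session and close it deterministically instead of leaking pooled sockets on every call. The session object is injected so this sketch stays self-contained; in practice you would pass a requests.Session():

```python
# Explicit connection cleanup: one injected HTTP session per worker, closed
# deterministically via the context-manager protocol. The session argument is
# a stand-in for requests.Session(); URL and model names are placeholders.
class OllamaClient:
    def __init__(self, session, base_url="http://localhost:11434", timeout=30):
        self.session = session
        self.base_url = base_url
        self.timeout = timeout

    def generate(self, model, prompt):
        # Reuses the pooled connection instead of opening a new one per call
        return self.session.post(f"{self.base_url}/api/generate",
                                 json={"model": model, "prompt": prompt, "stream": False},
                                 timeout=self.timeout)

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.session.close()  # Return pooled connections to the OS
        return False
```

Usage would look like `with OllamaClient(requests.Session()) as client: client.generate(...)`, which guarantees cleanup even when a request raises.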
Testing Standards: Include memory leak testing in your development workflow.
# test_memory_leaks.py - Automated memory leak testing
import time
import unittest
from threading import Thread

import psutil
import requests

class OllamaMemoryLeakTest(unittest.TestCase):
    def setUp(self):
        """Set up the test environment and record baseline memory."""
        self.base_url = "http://localhost:11434"
        self.process = self.get_ollama_process()
        if self.process is None:
            self.skipTest("Ollama process not found")
        self.baseline_memory = self.process.memory_info().rss

    def get_ollama_process(self):
        """Find the Ollama process."""
        for proc in psutil.process_iter(['pid', 'name', 'cmdline']):
            if 'ollama' in proc.info['name']:
                return proc
        return None

    def test_sustained_requests_memory_growth(self):
        """Test memory growth during a sustained request load."""
        request_count = 100
        for i in range(request_count):
            response = requests.post(f"{self.base_url}/api/generate",
                                     json={
                                         "model": "llama2",
                                         "prompt": f"Test request {i}",
                                         "stream": False
                                     })
            self.assertEqual(response.status_code, 200)
            # Check memory every 10 requests
            if i % 10 == 0:
                current_memory = self.process.memory_info().rss
                memory_growth = current_memory - self.baseline_memory
                growth_per_request = memory_growth / (i + 1)
                # Fail if average growth exceeds 1MB per request
                self.assertLess(growth_per_request, 1048576,
                                f"Memory growth too high: {growth_per_request} bytes per request")

    def test_concurrent_requests_memory_stability(self):
        """Test memory stability under concurrent load."""
        def send_requests():
            for _ in range(20):
                requests.post(f"{self.base_url}/api/generate",
                              json={"model": "llama2", "prompt": "Test", "stream": False})

        # Launch concurrent threads
        threads = [Thread(target=send_requests) for _ in range(5)]
        for thread in threads:
            thread.start()

        # Monitor memory during concurrent execution
        max_memory = self.baseline_memory
        for _ in range(30):  # Monitor for 30 seconds
            max_memory = max(max_memory, self.process.memory_info().rss)
            time.sleep(1)

        # Wait for threads to complete
        for thread in threads:
            thread.join()

        # Check final memory after cleanup
        time.sleep(10)  # Allow cleanup time
        final_memory = self.process.memory_info().rss
        # Memory should return close to baseline after cleanup
        memory_retention = final_memory - self.baseline_memory
        self.assertLess(memory_retention, 52428800,  # 50MB retention allowed
                        f"Excessive memory retention: {memory_retention} bytes")

if __name__ == "__main__":
    unittest.main()
Production Deployment
Container Resource Limits: Set explicit memory limits in containerized deployments.
# Dockerfile.ollama-optimized
FROM ollama/ollama:latest

# Memory management via Ollama's real environment variables
ENV OLLAMA_KEEP_ALIVE=5m
ENV OLLAMA_MAX_LOADED_MODELS=1
ENV OLLAMA_NUM_PARALLEL=2

# Add memory monitoring script (run it as a sidecar process; it loops forever,
# so it is not suitable as a health check)
COPY memory_monitor.sh /usr/local/bin/
RUN chmod +x /usr/local/bin/memory_monitor.sh

# One-shot health check using the ollama CLI
HEALTHCHECK --interval=60s --timeout=10s --start-period=30s --retries=3 \
  CMD ollama ps || exit 1

EXPOSE 11434
# The base image's entrypoint is already /bin/ollama, so only pass the subcommand
CMD ["serve"]
# docker-compose.yml - Production deployment with memory limits
version: '3.8'
services:
  ollama:
    image: ollama-optimized:latest
    container_name: ollama-production
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 8G
          cpus: '4'
        reservations:
          memory: 2G
          cpus: '2'
    environment:
      - OLLAMA_KEEP_ALIVE=5m
      - OLLAMA_MAX_LOADED_MODELS=1
    volumes:
      - ollama_data:/root/.ollama
      - ./logs:/var/log/ollama
    ports:
      - "11434:11434"
    healthcheck:
      # curl is not in the base image; the ollama CLI is
      test: ["CMD-SHELL", "ollama ps || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
volumes:
  ollama_data:
Troubleshooting Common Memory Issues
Address the most frequent memory problems encountered in Ollama deployments.
High Memory Usage Patterns
Symptom: Memory usage increases steadily over time without requests.
Diagnosis Steps:
- Check for background model loading processes
- Verify cache eviction policies are working
- Monitor GPU memory alongside system memory
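To back the diagnosis with numbers, fit a line through (time, RSS) samples: a persistent positive slope suggests a leak, a flat slope a plateau. A dependency-free least-squares sketch with made-up samples:

```python
# Distinguish a steady leak from a plateau: least-squares slope of
# (seconds, rss_kb) samples. The sample data is illustrative.
def rss_slope_kb_per_hour(samples):
    """samples: list of (t_seconds, rss_kb). Returns fitted slope in KB/hour."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_r = sum(r for _, r in samples) / n
    num = sum((t - mean_t) * (r - mean_r) for t, r in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return (num / den) * 3600 if den else 0.0

leaking = [(0, 2000000), (3600, 2100000), (7200, 2200000)]
print(round(rss_slope_kb_per_hour(leaking), 3))  # ~100000 KB/hour, likely a leak
```

Run it over a quiet window (no requests): genuine leaks keep climbing, while keep_alive residency shows up as a step followed by a flat line.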
Solution:
# Ollama has no admin GC endpoint; unload a model by requesting it
# with keep_alive set to 0
curl http://localhost:11434/api/generate -d '{"model": "llama2", "keep_alive": 0}'
# If memory still does not return to baseline, restart the service
sudo systemctl stop ollama
sudo systemctl start ollama
Memory Spikes During Model Loading
Symptom: Sudden memory spikes when loading new models.
Solution: Implement model preloading and memory pooling.
# model_preloader.py - Efficient model memory management
import time

import requests

class OllamaModelManager:
    def __init__(self, max_concurrent_models=2):
        self.max_concurrent_models = max_concurrent_models
        self.loaded_models = set()
        self.base_url = "http://localhost:11434"

    def preload_models(self, model_list):
        """Preload models sequentially to avoid overlapping memory spikes."""
        for model in model_list:
            if len(self.loaded_models) >= self.max_concurrent_models:
                self.unload_oldest_model()
            self.load_model_safely(model)
            time.sleep(30)  # Allow memory to stabilize between loads

    def load_model_safely(self, model_name):
        """Load a model into memory via an empty generate request."""
        print(f"Loading model: {model_name}")
        try:
            # A generate call with no prompt loads the model without inference
            response = requests.post(f"{self.base_url}/api/generate",
                                     json={"model": model_name},
                                     timeout=300)
            if response.status_code == 200:
                self.loaded_models.add(model_name)
                print(f"Successfully loaded: {model_name}")
            else:
                print(f"Failed to load {model_name}: {response.text}")
        except Exception as e:
            print(f"Error loading {model_name}: {e}")

    def unload_oldest_model(self):
        """Unload the first model in the tracked set."""
        if self.loaded_models:
            self.unload_model(next(iter(self.loaded_models)))

    def unload_model(self, model_name):
        """Unload a specific model by requesting it with keep_alive 0."""
        try:
            response = requests.post(f"{self.base_url}/api/generate",
                                     json={"model": model_name, "keep_alive": 0})
            if response.status_code == 200:
                self.loaded_models.discard(model_name)
                print(f"Unloaded model: {model_name}")
        except Exception as e:
            print(f"Error unloading {model_name}: {e}")

# Usage
manager = OllamaModelManager(max_concurrent_models=2)
manager.preload_models(["llama2", "codellama", "mistral"])
Measuring Optimization Success
Track the effectiveness of your memory leak fixes with quantifiable metrics.
Key Performance Indicators
Memory Efficiency Metrics:
- Memory usage per request (target: <50MB per request)
- Memory growth rate (target: <10MB per hour baseline)
- Memory recovery time after load (target: <5 minutes)
- Memory retention after restart (target: <100MB difference)
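These targets are straightforward to compute from raw measurements. A small helper, with illustrative numbers:

```python
# Compute the memory-efficiency KPIs from raw measurements (all values in KB;
# the example numbers are illustrative).
def memory_kpis(baseline_kb, peak_kb, final_kb, requests_served):
    peak_growth = peak_kb - baseline_kb
    retained = final_kb - baseline_kb
    return {
        "kb_per_request": peak_growth / requests_served if requests_served else 0,
        "retained_kb": retained,
        "retention_pct": 100 * retained / peak_growth if peak_growth else 0,
    }

kpis = memory_kpis(baseline_kb=2_000_000, peak_kb=6_000_000,
                   final_kb=2_400_000, requests_served=500)
print(kpis)  # {'kb_per_request': 8000.0, 'retained_kb': 400000, 'retention_pct': 10.0}
```

Here 8000 KB per request and 10% retention would sit comfortably inside the targets above.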
Performance Impact Measurements:
#!/bin/bash
# performance_benchmark.sh - Measure optimization effectiveness
echo "=== Ollama Memory Optimization Benchmark ==="
echo "Timestamp: $(date)"

# Baseline measurements
INITIAL_MEMORY=$(ps aux | grep '[o]llama' | awk '{print $6}' | head -1)
echo "Initial memory usage: ${INITIAL_MEMORY}KB"

# Run load test
echo "Starting load test..."
python3 load_test_ollama.py &
LOAD_TEST_PID=$!

# Monitor memory during load
PEAK_MEMORY=0
for i in {1..60}; do
  CURRENT=$(ps aux | grep '[o]llama' | awk '{print $6}' | head -1)
  if [[ $CURRENT -gt $PEAK_MEMORY ]]; then
    PEAK_MEMORY=$CURRENT
  fi
  sleep 10
done

# Stop load test
kill $LOAD_TEST_PID

# Wait for memory to stabilize
echo "Waiting for memory to stabilize..."
sleep 300
FINAL_MEMORY=$(ps aux | grep '[o]llama' | awk '{print $6}' | head -1)

# Calculate metrics (integer percent so the comparisons below work,
# and guard against division by zero)
MEMORY_GROWTH=$((FINAL_MEMORY - INITIAL_MEMORY))
PEAK_GROWTH=$((PEAK_MEMORY - INITIAL_MEMORY))
if [[ $PEAK_GROWTH -gt 0 ]]; then
  RETENTION_RATE=$((MEMORY_GROWTH * 100 / PEAK_GROWTH))
else
  RETENTION_RATE=0
fi

echo "=== Benchmark Results ==="
echo "Initial memory: ${INITIAL_MEMORY}KB"
echo "Peak memory: ${PEAK_MEMORY}KB"
echo "Final memory: ${FINAL_MEMORY}KB"
echo "Peak growth: ${PEAK_GROWTH}KB"
echo "Retained growth: ${MEMORY_GROWTH}KB"
echo "Memory retention rate: ${RETENTION_RATE}%"

# Performance scoring
if [[ $RETENTION_RATE -lt 20 ]]; then
  echo "✅ EXCELLENT: Memory retention under 20%"
elif [[ $RETENTION_RATE -lt 40 ]]; then
  echo "✅ GOOD: Memory retention under 40%"
elif [[ $RETENTION_RATE -lt 60 ]]; then
  echo "⚠️ FAIR: Memory retention under 60% - consider optimization"
else
  echo "❌ POOR: Memory retention over 60% - optimization required"
fi
Advanced Memory Optimization Techniques
Implement cutting-edge optimization strategies for maximum memory efficiency.
Memory Pool Management
Custom Memory Allocator: Implement pooled memory allocation to reduce fragmentation.
# memory_pool_manager.py - Advanced memory pool optimization
import mmap
import os
from contextlib import contextmanager

class OllamaMemoryPool:
    def __init__(self, pool_size_mb=1024):
        self.pool_size = pool_size_mb * 1024 * 1024  # Convert to bytes
        self.memory_map = None
        self.allocated_blocks = {}
        self.free_blocks = []
        self.next_block_id = 0  # Monotonic counter so block IDs never collide
        self.initialize_pool()

    def initialize_pool(self):
        """Initialize a memory-mapped pool for efficient allocation."""
        # Create a temporary file to back the memory map
        self.temp_fd = os.open('/tmp/ollama_memory_pool',
                               os.O_CREAT | os.O_RDWR | os.O_TRUNC)
        os.ftruncate(self.temp_fd, self.pool_size)
        self.memory_map = mmap.mmap(self.temp_fd, self.pool_size)
        # Initially the whole pool is one free block
        self.free_blocks = [(0, self.pool_size)]
        print(f"Initialized memory pool: {self.pool_size // 1024 // 1024}MB")

    def get_block_buffer(self, block_id):
        """Return a writable view over an allocated block's region."""
        start, size = self.allocated_blocks[block_id]
        return memoryview(self.memory_map)[start:start + size]

    @contextmanager
    def allocate_block(self, size_bytes):
        """Allocate a memory block from the pool with automatic cleanup."""
        block_id = self._allocate_from_pool(size_bytes)
        buffer = self.get_block_buffer(block_id)
        try:
            yield buffer
        finally:
            buffer.release()  # Release the view so the mmap can be closed later
            self._deallocate_block(block_id)

    def _allocate_from_pool(self, size):
        """Internal allocation using a best-fit strategy."""
        best_block = None
        best_index = -1
        for i, (start, block_size) in enumerate(self.free_blocks):
            if block_size >= size:
                if best_block is None or block_size < best_block[1]:
                    best_block = (start, block_size)
                    best_index = i
        if best_block is None:
            raise MemoryError("Insufficient memory in pool")
        # Claim the block
        start, block_size = best_block
        del self.free_blocks[best_index]
        block_id = f"block_{self.next_block_id}"
        self.next_block_id += 1
        self.allocated_blocks[block_id] = (start, size)
        # Return any remaining space to the free list
        remaining_size = block_size - size
        if remaining_size > 0:
            self.free_blocks.append((start + size, remaining_size))
            self.free_blocks.sort()  # Keep sorted for efficient merging
        return block_id

    def _deallocate_block(self, block_id):
        """Return a block to the free pool and merge adjacent blocks."""
        if block_id not in self.allocated_blocks:
            return
        start, size = self.allocated_blocks.pop(block_id)
        self.free_blocks.append((start, size))
        self.free_blocks.sort()
        self._merge_adjacent_blocks()

    def _merge_adjacent_blocks(self):
        """Merge adjacent free blocks to reduce fragmentation."""
        if len(self.free_blocks) < 2:
            return
        merged = []
        current_start, current_size = self.free_blocks[0]
        for start, size in self.free_blocks[1:]:
            if current_start + current_size == start:
                current_size += size  # Adjacent blocks - merge them
            else:
                merged.append((current_start, current_size))
                current_start, current_size = start, size
        merged.append((current_start, current_size))
        self.free_blocks = merged

    def get_memory_stats(self):
        """Get current memory pool statistics."""
        allocated_size = sum(size for _, size in self.allocated_blocks.values())
        free_size = sum(size for _, size in self.free_blocks)
        return {
            "total_size": self.pool_size,
            "allocated": allocated_size,
            "free": free_size,
            "fragmentation": len(self.free_blocks),
            "utilization": allocated_size / self.pool_size * 100
        }

    def cleanup(self):
        """Clean up memory pool resources."""
        if self.memory_map:
            self.memory_map.close()
        if hasattr(self, 'temp_fd'):
            os.close(self.temp_fd)
            os.unlink('/tmp/ollama_memory_pool')

# Usage example
pool = OllamaMemoryPool(pool_size_mb=512)
with pool.allocate_block(1024 * 1024) as buffer:  # 1MB block
    buffer[0:5] = b"hello"  # Use the buffer; it is freed automatically on exit
print("Memory stats:", pool.get_memory_stats())
pool.cleanup()
Production Deployment Checklist
Ensure your optimized Ollama deployment maintains memory efficiency in production.
Pre-Deployment Validation
Memory Optimization Checklist:
- ✅ Memory limits configured in deployment manifests
- ✅ Monitoring and alerting systems activated
- ✅ Automatic restart mechanisms tested
- ✅ Load testing completed with memory profiling
- ✅ Rollback procedures documented and tested
- ✅ Memory leak detection scripts deployed
- ✅ Performance benchmarks established
- ✅ Team training completed on memory monitoring tools
Monitoring Infrastructure
Production Monitoring Stack:
# monitoring-stack.yml - Complete monitoring setup
version: '3.8'
services:
  ollama:
    image: ollama-optimized:latest
    container_name: ollama-production
    deploy:
      resources:
        limits:
          memory: 8G
    labels:
      - "monitoring.enable=true"
      - "monitoring.memory.threshold=6GB"
  prometheus:
    image: prom/prometheus:latest
    container_name: ollama-prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=90d'
  grafana:
    image: grafana/grafana:latest
    container_name: ollama-grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin123
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards
    ports:
      - "3000:3000"
  node-exporter:
    image: prom/node-exporter:latest
    container_name: ollama-node-exporter
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
volumes:
  prometheus_data:
  grafana_data:
Conclusion: Achieving Optimal Ollama Memory Performance
Memory leak detection and optimization for Ollama long-running processes requires systematic monitoring, proven optimization techniques, and ongoing maintenance. The strategies outlined in this guide can reduce memory consumption by 60% or more while maintaining consistent performance.
Key Takeaways for Memory Optimization Success:
Detection First: Implement comprehensive monitoring before optimization. You cannot fix what you cannot measure. Use the provided monitoring scripts and profiling tools to establish baselines and identify leak patterns.
Systematic Optimization: Apply configuration changes, process management, and code-level optimizations in stages. Test each change thoroughly to measure its specific impact on memory usage.
Continuous Monitoring: Memory optimization is an ongoing process, not a one-time fix. Deploy automated monitoring and alerting systems to catch new leaks before they impact production performance.
Production Readiness: Use container resource limits, health checks, and automated restart mechanisms to maintain memory efficiency at scale. Your optimization efforts must survive real-world production loads.
These proven techniques have helped development teams reduce Ollama memory consumption from problematic levels (8GB+) down to efficient baselines (2-3GB) while handling the same request volumes. Your optimized Ollama deployment will deliver consistent performance without the memory leak headaches that plague unoptimized installations.
Start with the detection tools and monitoring scripts provided here. Implement the configuration optimizations that match your deployment architecture. Most importantly, establish the continuous monitoring systems that will keep your optimizations effective over time.
Memory leak detection for Ollama long-running processes transforms from a reactive problem into a proactive advantage when you apply these systematic optimization strategies.