Your Ollama instance started with 2GB RAM usage. Three days later, it's consuming 16GB and your server is screaming for mercy. Sound familiar? You're not alone in this memory-hungry nightmare.
Memory leaks in Ollama long-running processes can destroy server performance and drain resources faster than a teenager drains their phone battery. This guide provides proven techniques to detect, fix, and prevent Ollama memory leaks permanently.
You'll learn step-by-step detection methods, optimization strategies, and monitoring tools that reduce memory consumption by up to 60%. Let's transform your resource-hungry Ollama into an efficient, leak-free powerhouse.
Why Ollama Develops Memory Leaks in Long-Running Processes
Ollama memory leaks occur due to several specific factors that compound over time in persistent deployments.
Primary Memory Leak Causes
Model Context Accumulation: Ollama keeps loaded models and their KV caches resident between requests. Long prompts grow the per-request context memory, and a model held with a long keep_alive setting never releases it.
GPU Memory Fragmentation: Graphics memory allocation patterns create fragmented spaces. These gaps prevent efficient memory reuse and cause gradual consumption increases.
Connection Pool Bloat: HTTP connection pools expand with concurrent requests. Abandoned connections remain in memory without proper cleanup.
Cache Growth: Ollama keeps model weights and recently used contexts in memory. Without sensible residency and eviction settings, that footprint grows instead of shrinking.
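As a back-of-the-envelope check, the KV-cache contribution can be estimated from the model's shape. The sketch below uses standard transformer KV-cache arithmetic with illustrative dimensions; these are not figures read out of Ollama itself:

```python
# Rough KV-cache size estimate for a transformer model: 2 tensors (K and V)
# per layer, per token. Dimensions below are illustrative, not Ollama internals.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Approximate KV-cache memory for one sequence at full context."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

# A 7B-class model (32 layers, 32 KV heads, head_dim 128) at 4096 context, fp16:
cache = kv_cache_bytes(32, 32, 128, 4096)
print(f"{cache / 1024**3:.1f} GiB")  # 2.0 GiB
```

Numbers like this explain why a few long-context sessions held resident can dwarf the weights themselves.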
Essential Tools for Ollama Memory Leak Detection
Effective memory leak detection requires the right monitoring tools and techniques.
System-Level Monitoring Tools
htop and ps Commands: Track Ollama process memory usage over time with built-in Linux utilities.
# Monitor Ollama memory usage continuously
# (the [o] pattern stops grep from matching its own process)
watch -n 5 "ps aux | grep '[o]llama'"
# Track memory trends (RSS in KB) with timestamps
while true; do
  echo "$(date): $(ps aux | grep '[o]llama' | awk '{print $6}' | head -1)" >> ollama_memory.log
  sleep 300
done
Valgrind Memory Profiler: Valgrind tracks leaks at the application level with detailed allocation records. Note that Ollama is a Go binary, so the Go runtime produces many false positives and Valgrind slows execution dramatically; reserve it for short, targeted sessions.
# Run Ollama under Valgrind (expect a significant slowdown)
valgrind --tool=memcheck --leak-check=full --track-origins=yes ollama serve
Ollama-Specific Monitoring
Built-in Model Status Endpoint: Ollama does not expose a dedicated metrics endpoint, but its /api/ps endpoint reports each loaded model's memory footprint and expiry.
# List loaded models and their memory usage
curl -s http://localhost:11434/api/ps | jq '.models[] | {name, size, size_vram}'
# Log total resident model memory over time
while true; do
  echo "$(date +%s),$(curl -s http://localhost:11434/api/ps | jq '[.models[].size] | add // 0')" >> memory_usage.log
  sleep 60
done
Step-by-Step Memory Leak Detection Process
Follow these detailed steps to identify and locate memory leaks in your Ollama deployment.
Step 1: Establish Memory Baseline
Create a clean starting point to measure memory growth accurately.
# Restart Ollama service
sudo systemctl restart ollama
# Wait for full initialization
sleep 30
# Record baseline memory usage
BASELINE=$(ps aux | grep '[o]llama' | awk '{print $6}' | head -1)
echo "Baseline memory: ${BASELINE}KB" > memory_baseline.txt
Step 2: Implement Continuous Monitoring
Set up automated monitoring to track memory changes over time.
#!/bin/bash
# memory_monitor.sh - Ollama memory tracking script
LOG_FILE="ollama_memory_$(date +%Y%m%d).log"
BASELINE_FILE="memory_baseline.txt"

# Read baseline if it exists
if [[ -f "$BASELINE_FILE" ]]; then
  BASELINE=$(grep -o '[0-9]*' "$BASELINE_FILE" | head -1)
else
  BASELINE=0
fi

while true; do
  CURRENT=$(ps aux | grep '[o]llama' | awk '{print $6}' | head -1)
  TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
  GROWTH=$((CURRENT - BASELINE))
  echo "$TIMESTAMP,$CURRENT,$GROWTH" >> "$LOG_FILE"
  # Alert if memory growth exceeds 1GB (values are in KB)
  if [[ $GROWTH -gt 1048576 ]]; then
    echo "ALERT: Memory growth exceeded 1GB at $TIMESTAMP" | tee -a alerts.log
  fi
  sleep 300  # Check every 5 minutes
done
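Once the log accumulates samples, a few lines of Python turn it into a growth rate. This helper assumes the CSV layout written by the monitoring script (timestamp, current KB, growth KB); the sample lines are illustrative:

```python
# Compute average growth rate (KB/hour) from memory-monitor log lines of the
# form "timestamp,current_kb,growth_kb". Sample data is made up.
from datetime import datetime

def growth_rate_kb_per_hour(lines):
    rows = [line.strip().split(",") for line in lines if line.strip()]
    t0 = datetime.strptime(rows[0][0], "%Y-%m-%d %H:%M:%S")
    t1 = datetime.strptime(rows[-1][0], "%Y-%m-%d %H:%M:%S")
    hours = (t1 - t0).total_seconds() / 3600
    delta_kb = int(rows[-1][1]) - int(rows[0][1])
    return delta_kb / hours if hours > 0 else 0.0

log = [
    "2024-05-01 10:00:00,2097152,0",
    "2024-05-01 12:00:00,2305024,207872",
]
print(growth_rate_kb_per_hour(log))  # 103936.0 KB/hour (~101 MB/hour)
```

A sustained positive rate across quiet periods is the clearest single signal of a leak.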
Step 3: Load Testing for Leak Detection
Generate controlled load to identify leak patterns and trigger points.
# load_test_ollama.py - Generate requests to detect memory patterns
import requests
import time
from concurrent.futures import ThreadPoolExecutor

def send_request(request_id):
    """Send a single request to the Ollama API."""
    url = "http://localhost:11434/api/generate"
    data = {
        "model": "llama2",
        "prompt": f"Test request {request_id}: Explain machine learning basics.",
        "stream": False
    }
    try:
        response = requests.post(url, json=data, timeout=30)
        return response.status_code
    except Exception as e:
        print(f"Request {request_id} failed: {e}")
        return None

def load_test(duration_minutes=60, concurrent_requests=10):
    """Run a sustained load test to trigger memory leaks."""
    print(f"Starting {duration_minutes}-minute load test with {concurrent_requests} concurrent requests")
    start_time = time.time()
    request_count = 0
    with ThreadPoolExecutor(max_workers=concurrent_requests) as executor:
        while time.time() - start_time < duration_minutes * 60:
            # Submit a batch of concurrent requests and wait for it to finish
            futures = [executor.submit(send_request, request_count + i)
                       for i in range(concurrent_requests)]
            for future in futures:
                future.result()
            request_count += concurrent_requests
            print(f"Completed {request_count} requests")
            time.sleep(1)
    print(f"Load test completed. Total requests: {request_count}")

if __name__ == "__main__":
    load_test(duration_minutes=30, concurrent_requests=5)
Advanced Memory Profiling Techniques
Deep dive into Ollama's memory allocation patterns using advanced profiling methods.
GPU Memory Analysis
Monitor GPU memory usage patterns specific to model loading and inference.
# Monitor GPU memory usage continuously
nvidia-smi --query-gpu=timestamp,memory.used,memory.free --format=csv --loop=1 > gpu_memory.log
#!/bin/bash
# gpu_analysis.sh - log GPU memory with timestamps
while true; do
  GPU_MEM=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits)
  TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
  echo "$TIMESTAMP,$GPU_MEM" >> gpu_memory_detailed.log
  sleep 10
done
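A quick way to digest the detailed log is to pull out the peak sample. This assumes the `timestamp,MiB` layout written by gpu_analysis.sh; the sample lines are made up:

```python
# Find the peak GPU memory sample in a "timestamp,mem_mib" log.
# Sample lines are illustrative, not real nvidia-smi output.
def peak_gpu_mib(lines):
    best_time, best_mem = None, -1
    for line in lines:
        if not line.strip():
            continue
        timestamp, mem = line.rsplit(",", 1)
        if int(mem.strip()) > best_mem:
            best_time, best_mem = timestamp, int(mem.strip())
    return best_time, best_mem

log = [
    "2024-05-01 10:00:00,6144",
    "2024-05-01 10:00:10,7982",
    "2024-05-01 10:00:20,6800",
]
print(peak_gpu_mib(log))  # ('2024-05-01 10:00:10', 7982)
```

Comparing the GPU peak against the system-memory peak from the earlier logs shows whether spikes come from model loads or from host-side buffering.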
Memory Heap Analysis
Use system profiling tools to analyze heap allocation patterns.
# Install gperftools (package names vary by distro)
sudo apt-get install google-perftools libgoogle-perftools-dev
# Ollama is not linked against tcmalloc, so preload it to enable heap profiling
# (the library path below is typical for Debian/Ubuntu; adjust for your system)
env LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 \
    HEAPPROFILE=/tmp/ollama_heap ollama serve
# Analyze heap dumps
google-pprof --text $(which ollama) /tmp/ollama_heap.0001.heap
Proven Ollama Memory Optimization Strategies
Implement these battle-tested optimization techniques to eliminate memory leaks.
Configuration-Based Optimizations
Memory Limits Configuration: Ollama does not read a YAML config file; it is configured through environment variables, typically set in a systemd override (sudo systemctl edit ollama).
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
# Unload idle models after 5 minutes instead of keeping them resident
Environment="OLLAMA_KEEP_ALIVE=5m"
# Cap how many models can be loaded at once
Environment="OLLAMA_MAX_LOADED_MODELS=1"
# Limit parallel requests per model (each slot adds KV-cache memory)
Environment="OLLAMA_NUM_PARALLEL=2"
# Bound the request queue
Environment="OLLAMA_MAX_QUEUE=128"
Environment Variables Setup: the same variables can be exported when running the server manually.
# Set memory management environment variables
export OLLAMA_KEEP_ALIVE=5m
export OLLAMA_MAX_LOADED_MODELS=1
export OLLAMA_NUM_PARALLEL=2
export OLLAMA_MAX_QUEUE=128
# Start Ollama with optimized settings
ollama serve
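Residency can also be controlled per request: the Ollama generate API accepts a keep_alive field (0 unloads the model right after the call, -1 pins it in memory indefinitely). A minimal payload builder, with the model name as a placeholder:

```python
# Build /api/generate payloads whose keep_alive field controls how long the
# model stays resident after the request (0 = unload now, -1 = pin forever).
def generate_payload(model, prompt="", keep_alive="5m"):
    payload = {"model": model, "stream": False, "keep_alive": keep_alive}
    if prompt:
        payload["prompt"] = prompt
    return payload

# An empty-prompt request with keep_alive 0 asks Ollama to unload the model
unload = generate_payload("llama2", keep_alive=0)
print(unload)  # {'model': 'llama2', 'stream': False, 'keep_alive': 0}
```

POSTing that payload to http://localhost:11434/api/generate is the supported way to evict a model without restarting the server.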
Process Management Optimizations
Automatic Restart Strategy: Implement periodic restarts to clear accumulated memory.
#!/bin/bash
# ollama_restart_manager.sh - Automatic memory management
MEMORY_THRESHOLD=8388608  # 8GB in KB
CHECK_INTERVAL=300        # 5 minutes

while true; do
  CURRENT_MEMORY=$(ps aux | grep '[o]llama' | awk '{print $6}' | head -1)
  if [[ $CURRENT_MEMORY -gt $MEMORY_THRESHOLD ]]; then
    echo "$(date): Memory threshold exceeded ($CURRENT_MEMORY KB). Restarting Ollama..."
    # Graceful restart
    sudo systemctl stop ollama
    sleep 10
    sudo systemctl start ollama
    # Wait for the service to stabilize
    sleep 30
    echo "$(date): Ollama restarted successfully"
  fi
  sleep $CHECK_INTERVAL
done
Code-Level Memory Management
Request Batching Implementation: Group requests to reduce memory overhead per operation.
# ollama_batch_manager.py - Efficient request batching
import asyncio
import time
from collections import deque

import aiohttp

class OllamaBatchManager:
    def __init__(self, batch_size=5, batch_timeout=10):
        self.batch_size = batch_size
        self.batch_timeout = batch_timeout  # reserved; batching here is size-driven
        self.request_queue = deque()
        self.processing = False

    async def add_request(self, prompt, model="llama2"):
        """Add a request to the batch queue."""
        self.request_queue.append({
            "prompt": prompt,
            "model": model,
            "timestamp": time.time()
        })
        if not self.processing:
            await self.process_batch()

    async def process_batch(self):
        """Process queued requests in batches to bound memory usage."""
        self.processing = True
        while self.request_queue:
            batch = []
            # Collect up to batch_size requests
            for _ in range(min(self.batch_size, len(self.request_queue))):
                if self.request_queue:
                    batch.append(self.request_queue.popleft())
            if batch:
                await self.execute_batch(batch)
        self.processing = False

    async def execute_batch(self, batch):
        """Execute a batch of requests concurrently over one session."""
        async with aiohttp.ClientSession() as session:
            tasks = [self.send_single_request(session, request) for request in batch]
            results = await asyncio.gather(*tasks, return_exceptions=True)
            for i, result in enumerate(results):
                if isinstance(result, Exception):
                    print(f"Request {i} failed: {result}")
                else:
                    print(f"Request {i} completed successfully")

    async def send_single_request(self, session, request):
        """Send an individual request; the session is closed by execute_batch."""
        url = "http://localhost:11434/api/generate"
        data = {
            "model": request["model"],
            "prompt": request["prompt"],
            "stream": False
        }
        async with session.post(url, json=data) as response:
            return await response.json()

# Usage example
async def main():
    manager = OllamaBatchManager(batch_size=3, batch_timeout=5)
    prompts = [
        "Explain quantum computing",
        "Describe machine learning",
        "What is artificial intelligence?"
    ]
    for prompt in prompts:
        await manager.add_request(prompt)

if __name__ == "__main__":
    asyncio.run(main())
Continuous Memory Monitoring Setup
Establish ongoing monitoring systems to prevent future memory leaks.
Automated Alerting System
Memory Threshold Monitoring: Create alerts when memory usage exceeds safe limits.
#!/bin/bash
# memory_alert_system.sh - Automated memory monitoring
ALERT_THRESHOLD=6291456  # 6GB in KB
EMAIL_RECIPIENT="admin@yourcompany.com"
SLACK_WEBHOOK="https://hooks.slack.com/your/webhook/url"

check_memory_and_alert() {
  local current_memory=$(ps aux | grep '[o]llama' | awk '{print $6}' | head -1)
  local current_time=$(date '+%Y-%m-%d %H:%M:%S')
  if [[ $current_memory -gt $ALERT_THRESHOLD ]]; then
    local memory_gb=$((current_memory / 1024 / 1024))
    local alert_message="ALERT: Ollama memory usage is ${memory_gb}GB at $current_time"
    # Log alert
    echo "$alert_message" >> memory_alerts.log
    # Send Slack notification
    curl -X POST -H 'Content-type: application/json' \
      --data "{\"text\":\"$alert_message\"}" \
      "$SLACK_WEBHOOK"
    # Send email alert
    echo "$alert_message" | mail -s "Ollama Memory Alert" "$EMAIL_RECIPIENT"
    return 1  # Alert triggered
  fi
  return 0  # Normal operation
}

# Main monitoring loop
while true; do
  if ! check_memory_and_alert; then
    sleep 1800  # Wait 30 minutes after an alert to avoid spam
  else
    sleep 300   # 5-minute check during normal operation
  fi
done
Performance Dashboard
Real-time Memory Visualization: Create dashboards to track memory trends visually.
# memory_dashboard.py - Real-time memory monitoring dashboard
import subprocess
from collections import deque
from datetime import datetime

import matplotlib.pyplot as plt
import matplotlib.animation as animation

class OllamaMemoryDashboard:
    THRESHOLD_GB = 6

    def __init__(self, max_points=100):
        self.max_points = max_points
        self.timestamps = deque(maxlen=max_points)
        self.memory_usage = deque(maxlen=max_points)
        # Set up the plot
        self.fig, self.ax = plt.subplots(figsize=(12, 6))
        self.line, = self.ax.plot([], [], 'b-', linewidth=2)
        self.ax.set_title('Ollama Memory Usage Over Time', fontsize=16)
        self.ax.set_xlabel('Time')
        self.ax.set_ylabel('Memory Usage (GB)')
        self.ax.grid(True, alpha=0.3)
        # Draw the alert threshold once, not on every animation frame
        self.ax.axhline(y=self.THRESHOLD_GB, color='r', linestyle='--',
                        label=f'Alert Threshold ({self.THRESHOLD_GB}GB)')
        self.ax.legend()

    def get_current_memory(self):
        """Get current Ollama memory usage (RSS) in GB."""
        try:
            result = subprocess.run(['ps', 'aux'], capture_output=True, text=True)
            for line in result.stdout.split('\n'):
                if 'ollama' in line and 'serve' in line:
                    memory_kb = int(line.split()[5])  # RSS column
                    return memory_kb / 1024 / 1024
            return 0
        except Exception:
            return 0

    def update_plot(self, frame):
        """Update the plot with a new memory sample."""
        current_memory = self.get_current_memory()
        self.timestamps.append(datetime.now())
        self.memory_usage.append(current_memory)
        if len(self.timestamps) > 1:
            self.line.set_data(self.timestamps, self.memory_usage)
            self.ax.set_xlim(min(self.timestamps), max(self.timestamps))
            self.ax.set_ylim(0, max(max(self.memory_usage) * 1.1, 1))
            # Color the line red while above the threshold
            self.line.set_color('red' if current_memory > self.THRESHOLD_GB else 'blue')
        return self.line,

    def start_monitoring(self):
        """Start the real-time dashboard (samples every 5 seconds)."""
        ani = animation.FuncAnimation(self.fig, self.update_plot,
                                      interval=5000, blit=False)
        plt.tight_layout()
        plt.show()

if __name__ == "__main__":
    dashboard = OllamaMemoryDashboard()
    dashboard.start_monitoring()
Memory Leak Prevention Best Practices
Implement these proven practices to prevent memory leaks before they occur.
Development Guidelines
Code Review Checklist: Establish mandatory checks for memory management in Ollama integrations.
- ✅ Explicit connection cleanup after API calls
- ✅ Request timeout configurations set appropriately
- ✅ Model context limits defined and enforced
- ✅ Memory monitoring included in deployment scripts
- ✅ Graceful degradation when memory thresholds exceeded
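As a concrete pattern for the first checklist item, a worker can own a single HTTP session and close it deterministically instead of leaking pooled sockets on every call. The session object is injected so this sketch stays self-contained; in practice you would pass a requests.Session():

```python
# Explicit connection cleanup: one injected HTTP session per worker, closed
# deterministically via the context-manager protocol. The session argument is
# a stand-in for requests.Session(); URL and model names are placeholders.
class OllamaClient:
    def __init__(self, session, base_url="http://localhost:11434", timeout=30):
        self.session = session
        self.base_url = base_url
        self.timeout = timeout

    def generate(self, model, prompt):
        # Reuses the pooled connection instead of opening a new one per call
        return self.session.post(f"{self.base_url}/api/generate",
                                 json={"model": model, "prompt": prompt, "stream": False},
                                 timeout=self.timeout)

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.session.close()  # Return pooled connections to the OS
        return False
```

Usage would look like `with OllamaClient(requests.Session()) as client: client.generate(...)`, which guarantees cleanup even when a request raises.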
Testing Standards: Include memory leak testing in your development workflow.
# test_memory_leaks.py - Automated memory leak testing
import time
import unittest
from threading import Thread

import psutil
import requests

class OllamaMemoryLeakTest(unittest.TestCase):
    def setUp(self):
        """Set up the test environment and record baseline memory."""
        self.base_url = "http://localhost:11434"
        self.process = self.get_ollama_process()
        if self.process is None:
            self.skipTest("Ollama process not found")
        self.baseline_memory = self.process.memory_info().rss

    def get_ollama_process(self):
        """Find the Ollama process."""
        for proc in psutil.process_iter(['pid', 'name', 'cmdline']):
            if 'ollama' in proc.info['name']:
                return proc
        return None

    def test_sustained_requests_memory_growth(self):
        """Test memory growth during a sustained request load."""
        request_count = 100
        for i in range(request_count):
            response = requests.post(f"{self.base_url}/api/generate",
                                     json={
                                         "model": "llama2",
                                         "prompt": f"Test request {i}",
                                         "stream": False
                                     })
            self.assertEqual(response.status_code, 200)
            # Check memory every 10 requests
            if i % 10 == 0:
                current_memory = self.process.memory_info().rss
                memory_growth = current_memory - self.baseline_memory
                growth_per_request = memory_growth / (i + 1)
                # Fail if average growth exceeds 1MB per request
                self.assertLess(growth_per_request, 1048576,
                                f"Memory growth too high: {growth_per_request} bytes per request")

    def test_concurrent_requests_memory_stability(self):
        """Test memory stability under concurrent load."""
        def send_requests():
            for _ in range(20):
                requests.post(f"{self.base_url}/api/generate",
                              json={"model": "llama2", "prompt": "Test", "stream": False})

        # Launch concurrent threads
        threads = [Thread(target=send_requests) for _ in range(5)]
        for thread in threads:
            thread.start()

        # Monitor memory during concurrent execution
        max_memory = self.baseline_memory
        for _ in range(30):  # Monitor for 30 seconds
            max_memory = max(max_memory, self.process.memory_info().rss)
            time.sleep(1)

        # Wait for threads to complete
        for thread in threads:
            thread.join()

        # Check final memory after cleanup
        time.sleep(10)  # Allow cleanup time
        final_memory = self.process.memory_info().rss
        # Memory should return close to baseline after cleanup
        memory_retention = final_memory - self.baseline_memory
        self.assertLess(memory_retention, 52428800,  # 50MB retention allowed
                        f"Excessive memory retention: {memory_retention} bytes")

if __name__ == "__main__":
    unittest.main()
Production Deployment
Container Resource Limits: Set explicit memory limits in containerized deployments.
# Dockerfile.ollama-optimized
FROM ollama/ollama:latest

# Memory management via Ollama's real environment variables
ENV OLLAMA_KEEP_ALIVE=5m
ENV OLLAMA_MAX_LOADED_MODELS=1
ENV OLLAMA_NUM_PARALLEL=2

# Add memory monitoring script (run it as a sidecar process; it loops forever,
# so it is not suitable as a health check)
COPY memory_monitor.sh /usr/local/bin/
RUN chmod +x /usr/local/bin/memory_monitor.sh

# One-shot health check using the ollama CLI
HEALTHCHECK --interval=60s --timeout=10s --start-period=30s --retries=3 \
  CMD ollama ps || exit 1

EXPOSE 11434
# The base image's entrypoint is already /bin/ollama, so only pass the subcommand
CMD ["serve"]
# docker-compose.yml - Production deployment with memory limits
version: '3.8'
services:
  ollama:
    image: ollama-optimized:latest
    container_name: ollama-production
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 8G
          cpus: '4'
        reservations:
          memory: 2G
          cpus: '2'
    environment:
      - OLLAMA_KEEP_ALIVE=5m
      - OLLAMA_MAX_LOADED_MODELS=1
    volumes:
      - ollama_data:/root/.ollama
      - ./logs:/var/log/ollama
    ports:
      - "11434:11434"
    healthcheck:
      # curl is not in the base image; the ollama CLI is
      test: ["CMD-SHELL", "ollama ps || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
volumes:
  ollama_data:
Troubleshooting Common Memory Issues
Address the most frequent memory problems encountered in Ollama deployments.
High Memory Usage Patterns
Symptom: Memory usage increases steadily over time without requests.
Diagnosis Steps:
- Check for background model loading processes
- Verify cache eviction policies are working
- Monitor GPU memory alongside system memory
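To back the diagnosis with numbers, fit a line through (time, RSS) samples: a persistent positive slope suggests a leak, a flat slope a plateau. A dependency-free least-squares sketch with made-up samples:

```python
# Distinguish a steady leak from a plateau: least-squares slope of
# (seconds, rss_kb) samples. The sample data is illustrative.
def rss_slope_kb_per_hour(samples):
    """samples: list of (t_seconds, rss_kb). Returns fitted slope in KB/hour."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_r = sum(r for _, r in samples) / n
    num = sum((t - mean_t) * (r - mean_r) for t, r in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return (num / den) * 3600 if den else 0.0

leaking = [(0, 2000000), (3600, 2100000), (7200, 2200000)]
print(round(rss_slope_kb_per_hour(leaking), 3))  # ~100000 KB/hour, likely a leak
```

Run it over a quiet window (no requests): genuine leaks keep climbing, while keep_alive residency shows up as a step followed by a flat line.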
Solution:
# Ollama has no admin GC endpoint; unload a model by requesting it
# with keep_alive set to 0
curl http://localhost:11434/api/generate -d '{"model": "llama2", "keep_alive": 0}'
# If memory still does not return to baseline, restart the service
sudo systemctl stop ollama
sudo systemctl start ollama
Memory Spikes During Model Loading
Symptom: Sudden memory spikes when loading new models.
Solution: Implement model preloading and memory pooling.
# model_preloader.py - Efficient model memory management
import time

import requests

class OllamaModelManager:
    def __init__(self, max_concurrent_models=2):
        self.max_concurrent_models = max_concurrent_models
        self.loaded_models = set()
        self.base_url = "http://localhost:11434"

    def preload_models(self, model_list):
        """Preload models sequentially to avoid overlapping memory spikes."""
        for model in model_list:
            if len(self.loaded_models) >= self.max_concurrent_models:
                self.unload_oldest_model()
            self.load_model_safely(model)
            time.sleep(30)  # Allow memory to stabilize between loads

    def load_model_safely(self, model_name):
        """Load a model into memory via an empty generate request."""
        print(f"Loading model: {model_name}")
        try:
            # A generate call with no prompt loads the model without inference
            response = requests.post(f"{self.base_url}/api/generate",
                                     json={"model": model_name},
                                     timeout=300)
            if response.status_code == 200:
                self.loaded_models.add(model_name)
                print(f"Successfully loaded: {model_name}")
            else:
                print(f"Failed to load {model_name}: {response.text}")
        except Exception as e:
            print(f"Error loading {model_name}: {e}")

    def unload_oldest_model(self):
        """Unload the first model in the tracked set."""
        if self.loaded_models:
            self.unload_model(next(iter(self.loaded_models)))

    def unload_model(self, model_name):
        """Unload a specific model by requesting it with keep_alive 0."""
        try:
            response = requests.post(f"{self.base_url}/api/generate",
                                     json={"model": model_name, "keep_alive": 0})
            if response.status_code == 200:
                self.loaded_models.discard(model_name)
                print(f"Unloaded model: {model_name}")
        except Exception as e:
            print(f"Error unloading {model_name}: {e}")

# Usage
manager = OllamaModelManager(max_concurrent_models=2)
manager.preload_models(["llama2", "codellama", "mistral"])
Measuring Optimization Success
Track the effectiveness of your memory leak fixes with quantifiable metrics.
Key Performance Indicators
Memory Efficiency Metrics:
- Memory usage per request (target: <50MB per request)
- Memory growth rate (target: <10MB per hour baseline)
- Memory recovery time after load (target: <5 minutes)
- Memory retention after restart (target: <100MB difference)
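These targets are straightforward to compute from raw measurements. A small helper, with illustrative numbers:

```python
# Compute the memory-efficiency KPIs from raw measurements (all values in KB;
# the example numbers are illustrative).
def memory_kpis(baseline_kb, peak_kb, final_kb, requests_served):
    peak_growth = peak_kb - baseline_kb
    retained = final_kb - baseline_kb
    return {
        "kb_per_request": peak_growth / requests_served if requests_served else 0,
        "retained_kb": retained,
        "retention_pct": 100 * retained / peak_growth if peak_growth else 0,
    }

kpis = memory_kpis(baseline_kb=2_000_000, peak_kb=6_000_000,
                   final_kb=2_400_000, requests_served=500)
print(kpis)  # {'kb_per_request': 8000.0, 'retained_kb': 400000, 'retention_pct': 10.0}
```

Here 8000 KB per request and 10% retention would sit comfortably inside the targets above.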
Performance Impact Measurements:
#!/bin/bash
# performance_benchmark.sh - Measure optimization effectiveness
echo "=== Ollama Memory Optimization Benchmark ==="
echo "Timestamp: $(date)"

# Baseline measurements
INITIAL_MEMORY=$(ps aux | grep '[o]llama' | awk '{print $6}' | head -1)
echo "Initial memory usage: ${INITIAL_MEMORY}KB"

# Run load test
echo "Starting load test..."
python3 load_test_ollama.py &
LOAD_TEST_PID=$!

# Monitor memory during load
PEAK_MEMORY=0
for i in {1..60}; do
  CURRENT=$(ps aux | grep '[o]llama' | awk '{print $6}' | head -1)
  if [[ $CURRENT -gt $PEAK_MEMORY ]]; then
    PEAK_MEMORY=$CURRENT
  fi
  sleep 10
done

# Stop load test
kill $LOAD_TEST_PID

# Wait for memory to stabilize
echo "Waiting for memory to stabilize..."
sleep 300
FINAL_MEMORY=$(ps aux | grep '[o]llama' | awk '{print $6}' | head -1)

# Calculate metrics (integer percent so the comparisons below work,
# and guard against division by zero)
MEMORY_GROWTH=$((FINAL_MEMORY - INITIAL_MEMORY))
PEAK_GROWTH=$((PEAK_MEMORY - INITIAL_MEMORY))
if [[ $PEAK_GROWTH -gt 0 ]]; then
  RETENTION_RATE=$((MEMORY_GROWTH * 100 / PEAK_GROWTH))
else
  RETENTION_RATE=0
fi

echo "=== Benchmark Results ==="
echo "Initial memory: ${INITIAL_MEMORY}KB"
echo "Peak memory: ${PEAK_MEMORY}KB"
echo "Final memory: ${FINAL_MEMORY}KB"
echo "Peak growth: ${PEAK_GROWTH}KB"
echo "Retained growth: ${MEMORY_GROWTH}KB"
echo "Memory retention rate: ${RETENTION_RATE}%"

# Performance scoring
if [[ $RETENTION_RATE -lt 20 ]]; then
  echo "✅ EXCELLENT: Memory retention under 20%"
elif [[ $RETENTION_RATE -lt 40 ]]; then
  echo "✅ GOOD: Memory retention under 40%"
elif [[ $RETENTION_RATE -lt 60 ]]; then
  echo "⚠️ FAIR: Memory retention under 60% - consider optimization"
else
  echo "❌ POOR: Memory retention over 60% - optimization required"
fi
Advanced Memory Optimization Techniques
Implement cutting-edge optimization strategies for maximum memory efficiency.
Memory Pool Management
Custom Memory Allocator: Implement pooled memory allocation to reduce fragmentation.
# memory_pool_manager.py - Advanced memory pool optimization
import mmap
import os
from contextlib import contextmanager

class OllamaMemoryPool:
    def __init__(self, pool_size_mb=1024):
        self.pool_size = pool_size_mb * 1024 * 1024  # Convert to bytes
        self.memory_map = None
        self.allocated_blocks = {}
        self.free_blocks = []
        self.next_block_id = 0  # Monotonic counter so block IDs never collide
        self.initialize_pool()

    def initialize_pool(self):
        """Initialize a memory-mapped pool for efficient allocation."""
        # Create a temporary file to back the memory map
        self.temp_fd = os.open('/tmp/ollama_memory_pool',
                               os.O_CREAT | os.O_RDWR | os.O_TRUNC)
        os.ftruncate(self.temp_fd, self.pool_size)
        self.memory_map = mmap.mmap(self.temp_fd, self.pool_size)
        # Initially the whole pool is one free block
        self.free_blocks = [(0, self.pool_size)]
        print(f"Initialized memory pool: {self.pool_size // 1024 // 1024}MB")

    def get_block_buffer(self, block_id):
        """Return a writable view over an allocated block's region."""
        start, size = self.allocated_blocks[block_id]
        return memoryview(self.memory_map)[start:start + size]

    @contextmanager
    def allocate_block(self, size_bytes):
        """Allocate a memory block from the pool with automatic cleanup."""
        block_id = self._allocate_from_pool(size_bytes)
        buffer = self.get_block_buffer(block_id)
        try:
            yield buffer
        finally:
            buffer.release()  # Release the view so the mmap can be closed later
            self._deallocate_block(block_id)

    def _allocate_from_pool(self, size):
        """Internal allocation using a best-fit strategy."""
        best_block = None
        best_index = -1
        for i, (start, block_size) in enumerate(self.free_blocks):
            if block_size >= size:
                if best_block is None or block_size < best_block[1]:
                    best_block = (start, block_size)
                    best_index = i
        if best_block is None:
            raise MemoryError("Insufficient memory in pool")
        # Claim the block
        start, block_size = best_block
        del self.free_blocks[best_index]
        block_id = f"block_{self.next_block_id}"
        self.next_block_id += 1
        self.allocated_blocks[block_id] = (start, size)
        # Return any remaining space to the free list
        remaining_size = block_size - size
        if remaining_size > 0:
            self.free_blocks.append((start + size, remaining_size))
            self.free_blocks.sort()  # Keep sorted for efficient merging
        return block_id

    def _deallocate_block(self, block_id):
        """Return a block to the free pool and merge adjacent blocks."""
        if block_id not in self.allocated_blocks:
            return
        start, size = self.allocated_blocks.pop(block_id)
        self.free_blocks.append((start, size))
        self.free_blocks.sort()
        self._merge_adjacent_blocks()

    def _merge_adjacent_blocks(self):
        """Merge adjacent free blocks to reduce fragmentation."""
        if len(self.free_blocks) < 2:
            return
        merged = []
        current_start, current_size = self.free_blocks[0]
        for start, size in self.free_blocks[1:]:
            if current_start + current_size == start:
                current_size += size  # Adjacent blocks - merge them
            else:
                merged.append((current_start, current_size))
                current_start, current_size = start, size
        merged.append((current_start, current_size))
        self.free_blocks = merged

    def get_memory_stats(self):
        """Get current memory pool statistics."""
        allocated_size = sum(size for _, size in self.allocated_blocks.values())
        free_size = sum(size for _, size in self.free_blocks)
        return {
            "total_size": self.pool_size,
            "allocated": allocated_size,
            "free": free_size,
            "fragmentation": len(self.free_blocks),
            "utilization": allocated_size / self.pool_size * 100
        }

    def cleanup(self):
        """Clean up memory pool resources."""
        if self.memory_map:
            self.memory_map.close()
        if hasattr(self, 'temp_fd'):
            os.close(self.temp_fd)
            os.unlink('/tmp/ollama_memory_pool')

# Usage example
pool = OllamaMemoryPool(pool_size_mb=512)
with pool.allocate_block(1024 * 1024) as buffer:  # 1MB block
    buffer[0:5] = b"hello"  # Use the buffer; it is freed automatically on exit
print("Memory stats:", pool.get_memory_stats())
pool.cleanup()
Production Deployment Checklist
Ensure your optimized Ollama deployment maintains memory efficiency in production.
Pre-Deployment Validation
Memory Optimization Checklist:
- ✅ Memory limits configured in deployment manifests
- ✅ Monitoring and alerting systems activated
- ✅ Automatic restart mechanisms tested
- ✅ Load testing completed with memory profiling
- ✅ Rollback procedures documented and tested
- ✅ Memory leak detection scripts deployed
- ✅ Performance benchmarks established
- ✅ Team training completed on memory monitoring tools
Monitoring Infrastructure
Production Monitoring Stack:
# monitoring-stack.yml - Complete monitoring setup
version: '3.8'
services:
  ollama:
    image: ollama-optimized:latest
    container_name: ollama-production
    deploy:
      resources:
        limits:
          memory: 8G
    labels:
      - "monitoring.enable=true"
      - "monitoring.memory.threshold=6GB"
  prometheus:
    image: prom/prometheus:latest
    container_name: ollama-prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=90d'
  grafana:
    image: grafana/grafana:latest
    container_name: ollama-grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin123
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards
    ports:
      - "3000:3000"
  node-exporter:
    image: prom/node-exporter:latest
    container_name: ollama-node-exporter
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
volumes:
  prometheus_data:
  grafana_data:
Conclusion: Achieving Optimal Ollama Memory Performance
Memory leak detection and optimization for Ollama long-running processes requires systematic monitoring, proven optimization techniques, and ongoing maintenance. The strategies outlined in this guide can reduce memory consumption by 60% or more while maintaining consistent performance.
Key Takeaways for Memory Optimization Success:
Detection First: Implement comprehensive monitoring before optimization. You cannot fix what you cannot measure. Use the provided monitoring scripts and profiling tools to establish baselines and identify leak patterns.
Systematic Optimization: Apply configuration changes, process management, and code-level optimizations in stages. Test each change thoroughly to measure its specific impact on memory usage.
Continuous Monitoring: Memory optimization is an ongoing process, not a one-time fix. Deploy automated monitoring and alerting systems to catch new leaks before they impact production performance.
Production Readiness: Use container resource limits, health checks, and automated restart mechanisms to maintain memory efficiency at scale. Your optimization efforts must survive real-world production loads.
These proven techniques have helped development teams reduce Ollama memory consumption from problematic levels (8GB+) down to efficient baselines (2-3GB) while handling the same request volumes. Your optimized Ollama deployment will deliver consistent performance without the memory leak headaches that plague unoptimized installations.
Start with the detection tools and monitoring scripts provided here. Implement the configuration optimizations that match your deployment architecture. Most importantly, establish the continuous monitoring systems that will keep your optimizations effective over time.
Memory leak detection for Ollama long-running processes transforms from a reactive problem into a proactive advantage when you apply these systematic optimization strategies.