Your AI chatbot just crashed during peak usage. Users are frustrated, your server is overwhelmed, and you're scrambling to understand why your "perfectly sized" infrastructure failed. Sound familiar? Welcome to the wild world of AI resource planning, where a 7B parameter model can devour your RAM faster than a teenager clears a pizza.
Ollama capacity planning helps you predict exact resource requirements before deployment disasters strike. This guide shows you how to calculate GPU memory, CPU usage, and storage needs for any AI model size.
What is Ollama Capacity Planning?
Capacity planning for Ollama involves predicting computational resources needed to run large language models efficiently. This process calculates GPU memory, CPU cores, RAM, and storage requirements based on model parameters, quantization levels, and expected concurrent users.
Unlike traditional software capacity planning, AI model resource prediction must account for:
- Model parameter count (7B, 13B, 70B parameters)
- Quantization format (Q4_0, Q5_1, Q8_0)
- Context window size (2K, 4K, 32K tokens)
- Concurrent inference requests
- Batch processing requirements
Understanding Ollama Resource Requirements
GPU Memory Calculation Formula
The primary resource constraint for Ollama deployments is GPU memory. Here's how to calculate exact requirements:
```
# Base GPU memory formula
GPU_Memory = (Parameters × Bits_per_Parameter) / 8 + Context_Memory + Overhead

# Example: Llama 2 7B, Q4_0
Parameters         = 7,000,000,000
Bits_per_Parameter = 4.5   (Q4_0 quantization)
Context_Memory     = 1GB   (4K context)
Overhead           = 2GB   (system overhead)

GPU_Memory = (7B × 4.5 bits) / 8 + 1GB + 2GB ≈ 3.9GB + 1GB + 2GB ≈ 6.9GB
```
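The same arithmetic can be wrapped in a small helper for experimenting with different model sizes and quantization levels. This is a sketch of the formula above, not an Ollama API; the bits-per-parameter values are the article's approximations.

```python
def estimate_gpu_memory_gb(params, bits_per_param, context_gb=1.0, overhead_gb=2.0):
    """Estimate GPU memory (GB) from the base formula:
    weights + context memory + system overhead.

    params: model parameter count (e.g. 7_000_000_000)
    bits_per_param: effective bits per weight (e.g. ~4.5 for Q4_0)
    """
    weights_gb = params * bits_per_param / 8 / 1_000_000_000
    return weights_gb + context_gb + overhead_gb

# Llama 2 7B at Q4_0: ~3.9GB of weights plus context and overhead
print(round(estimate_gpu_memory_gb(7_000_000_000, 4.5), 1))  # → 6.9
```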
Memory Requirements by Model Size
| Model Size | Q4_0 (~4.5 bits) | Q5_1 (~6 bits) | Q8_0 (~8.5 bits) | FP16 (16 bits) |
|---|---|---|---|---|
| 7B | 3.9GB | 5.3GB | 7.4GB | 14GB |
| 13B | 7.3GB | 9.8GB | 13.8GB | 26GB |
| 30B | 16.9GB | 22.5GB | 31.9GB | 60GB |
| 70B | 39.4GB | 52.5GB | 74.4GB | 140GB |
Note: figures cover model weights only; add 1-3GB for context memory and system overhead
CPU and RAM Planning for Ollama
CPU Requirements
CPU usage patterns differ significantly from GPU-accelerated inference:
```python
def calculate_cpu_requirements(model_params, target_tokens_per_second):
    """Estimate CPU cores needed for a target CPU-only inference speed."""
    # Rough heuristic: ~3 cores per billion parameters
    # for 1 token/second of generation
    base_cores = (model_params / 1_000_000_000) * 3
    # Scale by target speed
    required_cores = base_cores * target_tokens_per_second
    # Add 30% overhead for system processes
    return int(required_cores * 1.3)

# Example: 7B model targeting 5 tokens/second
cores_needed = calculate_cpu_requirements(7_000_000_000, 5)
print(f"CPU cores needed: {cores_needed}")  # Output: 136 cores
```

By this (deliberately pessimistic) heuristic, 5 tokens/second from a 7B model on CPU alone is impractical on commodity hardware, which is exactly why GPU offload matters.
RAM Requirements
System RAM needs vary based on deployment configuration:
- Model loading: 1.5x model size in RAM
- Context caching: 2-4GB per concurrent user
- System overhead: 4-8GB base requirement
- Batch processing: Additional 2-4GB per batch
```bash
#!/bin/bash
# RAM calculation script
MODEL_SIZE_GB=8
CONCURRENT_USERS=10
CONTEXT_CACHE_PER_USER=3
BASE_RAM=8

MODEL_RAM=$((MODEL_SIZE_GB * 3 / 2))  # 1.5x model size
CONTEXT_RAM=$((CONCURRENT_USERS * CONTEXT_CACHE_PER_USER))
TOTAL_RAM=$((BASE_RAM + MODEL_RAM + CONTEXT_RAM))

echo "Total RAM required: ${TOTAL_RAM}GB"  # 50GB for this configuration
```
Storage Planning and Optimization
Model Storage Requirements
Different quantization formats require varying storage space:
```yaml
# Storage requirements by quantization
model_storage:
  llama2_7b:
    original: 13.5GB
    q8_0: 7.16GB
    q5_1: 4.78GB
    q4_0: 3.83GB
    q2_k: 2.63GB
  llama2_13b:
    original: 26GB
    q8_0: 13.83GB
    q5_1: 9.23GB
    q4_0: 7.37GB
    q2_k: 5.06GB
```
Storage Performance Considerations
Model loading speed depends on storage type:
- NVMe SSD: 3-7GB/s (fastest loading)
- SATA SSD: 500MB/s (moderate loading)
- HDD: 100-200MB/s (slow loading)
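A quick way to sanity-check these numbers: cold-load time is roughly model size divided by sequential read throughput. The sketch below uses illustrative throughput figures and ignores OS page caching and Ollama's own startup work, so treat the results as lower bounds.

```python
def estimate_load_seconds(model_size_gb, read_speed_gb_per_s):
    """Rough cold-load time: model size / sequential read throughput."""
    return model_size_gb / read_speed_gb_per_s

# 7B Q4_0 (~3.8GB on disk) from different storage tiers
for name, speed_gb_s in [("NVMe SSD", 5.0), ("SATA SSD", 0.5), ("HDD", 0.15)]:
    print(f"{name}: {estimate_load_seconds(3.8, speed_gb_s):.1f}s")
```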
```bash
# Test model loading speed (cold start)
time ollama run llama2:7b-q4_0 "Hello"

# Preload the model with an empty prompt; it then stays resident
# in memory for Ollama's keep-alive period (5 minutes by default)
ollama run llama2:7b-q4_0 ""
```
Predicting Concurrent User Capacity
Batch Processing Calculations
Calculate maximum concurrent users based on available resources:
```python
def calculate_max_concurrent_users(gpu_memory_gb, model_memory_gb,
                                   context_memory_per_user_gb=2):
    """Calculate maximum concurrent users for given hardware."""
    available_memory = gpu_memory_gb - model_memory_gb
    max_users = available_memory // context_memory_per_user_gb
    # Conservative estimate (80% utilization)
    safe_max_users = int(max_users * 0.8)
    return max(1, safe_max_users)

# Example: RTX 4090 (24GB) with 7B model (8GB)
max_users = calculate_max_concurrent_users(24, 8, 2)
print(f"Maximum concurrent users: {max_users}")  # Output: 6 users
```
Request Queuing Strategy
Implement intelligent request queuing for capacity management:
```python
import asyncio
from collections import deque

class OllamaRequestQueue:
    def __init__(self, max_concurrent=4):
        self.max_concurrent = max_concurrent
        self.active_requests = 0
        self.queue = deque()

    async def call_ollama(self, prompt):
        """Placeholder -- replace with a real call to the Ollama API."""
        raise NotImplementedError

    async def process_request(self, prompt):
        """Process a request, queueing it if we're at capacity."""
        if self.active_requests >= self.max_concurrent:
            # Queue the request and wait for a slot to free up
            future = asyncio.get_running_loop().create_future()
            self.queue.append((prompt, future))
            return await future
        # Process immediately
        self.active_requests += 1
        try:
            return await self.call_ollama(prompt)
        finally:
            self.active_requests -= 1
            await self.process_queue()

    async def process_queue(self):
        """Hand the oldest queued request to a freed slot."""
        if self.queue and self.active_requests < self.max_concurrent:
            prompt, future = self.queue.popleft()
            future.set_result(await self.process_request(prompt))
```
Hardware Recommendations by Use Case
Development Environment
For testing and development:
```yaml
development_setup:
  cpu: 8-16 cores
  ram: 32GB
  gpu: RTX 4060 Ti 16GB or RTX 4070
  storage: 1TB NVMe SSD
  supported_models:
    - 7B models (all quantizations)
    - 13B models (Q4_0, Q5_1)
```
Production Environment
For production deployments:
```yaml
production_setup:
  cpu: 32-64 cores
  ram: 128-256GB
  gpu: RTX 4090 (24GB) or A6000 (48GB)
  storage: 2TB+ NVMe SSD
  supported_models:
    - Multiple 7B models
    - 13B models (concurrent users)
    - 30B models (limited concurrency)
```
Enterprise Environment
For large-scale deployments:
```yaml
enterprise_setup:
  cpu: 64+ cores
  ram: 512GB+
  gpu: A100 (80GB) or H100 (80GB)
  storage: 4TB+ NVMe SSD array
  supported_models:
    - 70B models
    - Multiple concurrent models
    - High-concurrency workloads
```
Monitoring and Scaling Strategies
Resource Monitoring Script
Track actual resource usage against predictions:
```bash
#!/bin/bash
# ollama_monitor.sh
monitor_ollama_resources() {
    while true; do
        echo "=== Ollama Resource Monitor ==="
        echo "GPU Memory Usage:"
        nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
        echo "CPU Usage:"
        top -bn1 | grep "Cpu(s)" | awk '{print $2}' | sed 's/%us,//'
        echo "RAM Usage:"
        free -h | grep "Mem:" | awk '{print $3 "/" $2}'
        echo "Active Ollama Processes:"
        ps aux | grep ollama | grep -v grep | wc -l
        echo "=========================="
        sleep 30
    done
}
monitor_ollama_resources
```
Auto-scaling Configuration
Implement auto-scaling based on resource utilization:
```python
import psutil

class OllamaAutoScaler:
    def __init__(self, cpu_threshold=80, memory_threshold=85):
        self.cpu_threshold = cpu_threshold
        self.memory_threshold = memory_threshold

    def check_system_load(self):
        """Check current system resource usage."""
        cpu_percent = psutil.cpu_percent(interval=1)
        memory_percent = psutil.virtual_memory().percent
        return cpu_percent, memory_percent

    def scale_decision(self):
        """Decide whether to scale up or down."""
        cpu_load, memory_load = self.check_system_load()
        if cpu_load > self.cpu_threshold or memory_load > self.memory_threshold:
            return "scale_up"
        elif cpu_load < 50 and memory_load < 50:
            return "scale_down"
        return "maintain"

    def execute_scaling(self, action):
        """Execute scaling action."""
        if action == "scale_up":
            # Add more Ollama instances or upgrade resources
            print("Scaling up: adding resources...")
        elif action == "scale_down":
            # Remove unnecessary instances
            print("Scaling down: removing resources...")
```
Troubleshooting Common Capacity Issues
Out of Memory Errors
When Ollama runs out of GPU memory:
```bash
# Check GPU memory usage
nvidia-smi

# Reduce the context window via the API's "options" field
# (ollama run has no context-size flag; num_ctx is set per request,
# in a Modelfile, or interactively with /set parameter num_ctx)
curl http://localhost:11434/api/generate -d '{
  "model": "llama2:7b-q4_0",
  "prompt": "Hello",
  "options": { "num_ctx": 2048 }
}'

# Use a smaller quantization
ollama pull llama2:7b-q2_k  # smaller but lower quality
```
Slow Response Times
Optimize for faster inference:
```python
# Ollama performance tuning (passed via the API "options" field)
optimization_settings = {
    "num_ctx": 2048,        # Smaller context window = less memory
    "num_batch": 512,       # Prompt-processing batch size
    "num_gqa": 8,           # Grouped-query attention (needed by some 70B models)
    "num_gpu": 1,           # Layers to offload to the GPU
    "num_thread": 8,        # Match your physical core count
    "repeat_penalty": 1.1,  # Discourage repetition
    "temperature": 0.7,     # Output randomness (quality, not speed)
    "top_k": 40,            # Limit candidate tokens
    "top_p": 0.9,           # Nucleus sampling
}
```
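To actually apply settings like these, pass them in the `options` field of a generate request. Here is a standard-library-only sketch; it assumes Ollama's default endpoint at `localhost:11434`, and the request itself is commented out so the snippet runs without a live server.

```python
import json
import urllib.request

# Tuning options to apply per request (illustrative values)
optimization_settings = {"num_ctx": 2048, "num_thread": 8, "temperature": 0.7}

payload = json.dumps({
    "model": "llama2:7b-q4_0",
    "prompt": "Explain capacity planning in one sentence.",
    "stream": False,
    "options": optimization_settings,
}).encode()

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
# With a running Ollama server, uncomment to send the request:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["response"])
```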
Performance Benchmarking
Throughput Testing
Measure actual performance against predictions:
```python
import asyncio
import time

import aiohttp

async def benchmark_throughput(model_name, num_requests=100):
    """Benchmark Ollama throughput with concurrent requests."""
    start_time = time.time()

    async def single_request():
        async with aiohttp.ClientSession() as session:
            async with session.post(
                'http://localhost:11434/api/generate',
                json={
                    'model': model_name,
                    'prompt': 'Explain quantum computing in one sentence.',
                    'stream': False,
                },
            ) as response:
                return await response.json()

    # Run concurrent requests
    tasks = [single_request() for _ in range(num_requests)]
    results = await asyncio.gather(*tasks)

    total_time = time.time() - start_time
    print(f"Processed {num_requests} requests in {total_time:.2f}s")
    print(f"Throughput: {num_requests / total_time:.2f} requests/second")
    return results

# Run benchmark
asyncio.run(benchmark_throughput('llama2:7b-q4_0'))
```
Cost Optimization Strategies
Resource Cost Analysis
Calculate infrastructure costs for different configurations:
```python
def calculate_monthly_cost(gpu_type, instance_hours_per_month=730):
    """Estimate monthly cloud cost (illustrative $/hour rates)."""
    gpu_costs = {
        'RTX_4090': 0.50,
        'A6000': 1.20,
        'A100_40GB': 2.50,
        'A100_80GB': 4.00,
        'H100': 8.00,
    }
    return gpu_costs.get(gpu_type, 0) * instance_hours_per_month

# Compare costs
for gpu in ['RTX_4090', 'A6000', 'A100_40GB', 'H100']:
    print(f"{gpu}: ${calculate_monthly_cost(gpu):.2f}/month")
```
Efficient Model Selection
Choose optimal models for your use case:
```python
model_efficiency = {
    'llama2:7b-q2_k':  {'quality': 6,  'speed': 9, 'memory': 2.6},
    'llama2:7b-q4_0':  {'quality': 8,  'speed': 7, 'memory': 3.8},
    'llama2:7b-q5_1':  {'quality': 9,  'speed': 6, 'memory': 4.8},
    'llama2:13b-q4_0': {'quality': 9,  'speed': 5, 'memory': 7.4},
    'llama2:70b-q4_0': {'quality': 10, 'speed': 2, 'memory': 45.5},
}

def recommend_model(priority='balanced'):
    """Recommend a model based on priorities."""
    if priority == 'speed':
        return max(model_efficiency.items(), key=lambda x: x[1]['speed'])
    elif priority == 'quality':
        return max(model_efficiency.items(), key=lambda x: x[1]['quality'])
    elif priority == 'memory':
        return min(model_efficiency.items(), key=lambda x: x[1]['memory'])
    else:  # balanced: memory is a cost, so score quality+speed per GB
        scores = {model: (m['quality'] + m['speed']) / m['memory']
                  for model, m in model_efficiency.items()}
        return max(scores.items(), key=lambda x: x[1])
```
Advanced Capacity Planning Techniques
Predictive Scaling
Use machine learning to predict resource needs:
```python
import numpy as np
from sklearn.linear_model import LinearRegression
from datetime import datetime

class ResourcePredictor:
    def __init__(self):
        self.model = LinearRegression()
        self.trained = False

    def prepare_features(self, timestamps):
        """Time-based features (hour, weekday, day of month)."""
        features = []
        for ts in timestamps:
            dt = datetime.fromtimestamp(ts)
            features.append([dt.hour, dt.weekday(), dt.day])
        return np.array(features)

    def train(self, historical_data):
        """Fit on a history of (timestamp, concurrent_users) records."""
        timestamps = [d['timestamp'] for d in historical_data]
        user_counts = [d['concurrent_users'] for d in historical_data]
        X = self.prepare_features(timestamps)
        y = np.array(user_counts)
        self.model.fit(X, y)
        self.trained = True

    def predict_load(self, future_timestamp):
        """Predict users at a future time, then map to resources."""
        if not self.trained:
            raise ValueError("Model not trained")
        X = self.prepare_features([future_timestamp])
        predicted_users = self.model.predict(X)[0]
        # Convert to resource requirements
        return {
            'gpu_memory': max(8, predicted_users * 2),  # 2GB per user
            'cpu_cores': max(4, predicted_users * 0.5),
            'ram_gb': max(16, predicted_users * 4),
        }
```
Multi-Model Deployment
Plan capacity for multiple concurrent models:
```python
def plan_multi_model_deployment(models, target_concurrency):
    """Plan resources for multiple concurrently loaded models."""
    total_gpu_memory = 0
    total_cpu_cores = 0
    total_ram = 0
    for model_name, model_config in models.items():
        # Calculate per-model resources
        model_memory = model_config['memory_gb']
        model_concurrency = target_concurrency.get(model_name, 1)
        # GPU memory: each resident model needs its own weights in VRAM
        total_gpu_memory += model_memory
        # CPU cores (additive for concurrent processing)
        total_cpu_cores += model_config['cpu_cores'] * model_concurrency
        # RAM (additive for model loading)
        total_ram += model_memory * 1.5  # 1.5x for loading overhead
    # Add context memory for concurrent users
    context_memory = sum(target_concurrency.values()) * 2  # 2GB per user
    total_gpu_memory += context_memory
    total_ram += context_memory
    return {
        'gpu_memory_gb': total_gpu_memory,
        'cpu_cores': total_cpu_cores,
        'ram_gb': total_ram,
        'recommended_gpu': recommend_gpu(total_gpu_memory),
        'estimated_cost': calculate_monthly_cost(recommend_gpu(total_gpu_memory)),
    }

def recommend_gpu(memory_needed):
    """Recommend the smallest GPU that fits with 20% headroom."""
    gpus = {
        'RTX_4070': 12,
        'RTX_4060_Ti': 16,
        'RTX_4090': 24,
        'A100_40GB': 40,
        'A6000': 48,
        'A100_80GB': 80,
        'H100': 80,
    }
    # Iterate smallest-first so we pick the cheapest fit
    for gpu, memory in sorted(gpus.items(), key=lambda x: x[1]):
        if memory >= memory_needed * 1.2:  # 20% headroom
            return gpu
    return 'H100'  # fallback to the largest option
```
Conclusion
Ollama capacity planning prevents costly deployment failures and ensures optimal AI model performance. By calculating GPU memory requirements, predicting concurrent user capacity, and implementing monitoring strategies, you can right-size your infrastructure from the start.
Key takeaways for successful Ollama capacity planning:
Start with the GPU memory formula: (Parameters × Bits_per_Parameter) / 8 + Context_Memory + Overhead. Add 20% headroom for unexpected usage spikes. Monitor actual usage against predictions and adjust accordingly.
Remember: proper capacity planning isn't about buying the biggest GPU available—it's about matching resources to requirements efficiently. Your users (and your budget) will thank you for the careful planning.