Problem: You Need AI Without Cloud Dependencies
You want GPT-quality AI for coding, writing, or analysis but don't want to pay per token, share sensitive data with APIs, or lose access when offline.
You'll learn:
- Install Ollama and download production-ready models
- Run local LLMs with OpenAI-compatible API
- Integrate with your existing tools (VS Code, scripts, apps)
Time: 15 min | Level: Beginner
Why This Matters
Cloud AI APIs cost $10-100/month and send your data to third parties. Local models like Llama 3.3 (70B) now rival GPT-4 on many benchmarks while running entirely on your hardware.
Common use cases:
- Code completion without sharing proprietary code
- Document analysis with sensitive data
- Offline development environments
- Prototyping without API costs
Solution
Step 1: Install Ollama
macOS:
# Install with Homebrew, or download the app from ollama.com/download
brew install ollama
# Verify installation
ollama --version
Linux:
curl -fsSL https://ollama.com/install.sh | sh
Windows: Download installer from ollama.com/download
Expected: ollama version 0.5.0 or newer
If it fails:
- "Command not found": Add /usr/local/bin to PATH: export PATH=$PATH:/usr/local/bin
- "Permission denied": Rerun the installer with sudo on Linux
Step 2: Download Your First Model
# Recommended: Llama 3.3 70B (best quality, needs 48GB RAM)
ollama pull llama3.3:70b
# OR lightweight option: Llama 3.2 3B (8GB RAM minimum)
ollama pull llama3.2:3b
# OR coding-focused: DeepSeek Coder v2 16B
ollama pull deepseek-coder-v2:16b
Why these models:
- llama3.3:70b — Matches GPT-4 on most tasks, best for production
- llama3.2:3b — Fast responses, good for laptops with limited RAM
- deepseek-coder-v2 — Specialized for code generation/debugging
Download time: 5-30 minutes depending on model size and connection.
Step 3: Test the Model
# Interactive chat mode
ollama run llama3.3:70b
# Try a test prompt
>>> Write a Python function to validate email addresses
# Exit chat mode
>>> /bye
You should see: Multi-line response with working code in ~5-10 seconds.
Performance check:
# List downloaded models and their sizes
ollama list
# Show currently loaded models and memory usage
ollama ps
Step 4: Use as an API
Ollama runs a local server at http://localhost:11434 with OpenAI-compatible endpoints.
Start the server (runs automatically on install):
# Check if running
curl http://localhost:11434/api/tags
# Manually start if needed
ollama serve
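If you script against the server, a quick health check before sending prompts avoids confusing timeouts. A minimal sketch using only the standard library (the function name server_up is our own, not part of Ollama):

```python
import urllib.error
import urllib.request

def server_up(base="http://localhost:11434", timeout=2):
    """Return True if the Ollama server answers its /api/tags endpoint."""
    try:
        with urllib.request.urlopen(f"{base}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

Call server_up() before your first request and fall back to `ollama serve` if it returns False.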
Test the API:
curl http://localhost:11434/api/generate -d '{
"model": "llama3.3:70b",
"prompt": "Explain Docker in one sentence",
"stream": false
}'
Expected: JSON response with generated text in response field.
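With "stream": true (the default), the API instead returns newline-delimited JSON chunks, each carrying a response fragment and a done flag. A sketch of reassembling the full text (assemble_stream is our own helper name):

```python
import json

def assemble_stream(ndjson_lines):
    """Join the 'response' fragments from Ollama's streaming NDJSON output."""
    parts = []
    for line in ndjson_lines:
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):  # final chunk also carries timing metadata
            break
    return "".join(parts)
```

In practice you would feed it the lines of a streaming HTTP response, e.g. requests.post(..., stream=True).iter_lines().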
Step 5: Integrate with Your Tools
VS Code (Continue.dev)
# Install Continue extension
code --install-extension continue.continue
# Add to Continue config (~/.continue/config.json)
{
"models": [{
"title": "Llama 3.3 Local",
"provider": "ollama",
"model": "llama3.3:70b",
"apiBase": "http://localhost:11434"
}]
}
Python Script
import requests
def ask_ollama(prompt: str) -> str:
response = requests.post('http://localhost:11434/api/generate', json={
'model': 'llama3.3:70b',
'prompt': prompt,
'stream': False
})
return response.json()['response']
# Use it
result = ask_ollama("Write a SQL query to find duplicate emails")
print(result)
OpenAI SDK (drop-in replacement)
from openai import OpenAI
# Point to Ollama instead of OpenAI
client = OpenAI(
base_url='http://localhost:11434/v1',
api_key='ollama' # Required but unused
)
response = client.chat.completions.create(
model='llama3.3:70b',
messages=[{'role': 'user', 'content': 'Debug this error: TypeError...'}]
)
print(response.choices[0].message.content)
Why this works: Ollama implements OpenAI's API spec, so existing code works without changes.
Verification
Test end-to-end:
# 1. Model is downloaded
ollama list | grep llama3.3
# 2. Server responds
curl http://localhost:11434/api/tags | grep llama3.3
# 3. Generation works
ollama run llama3.3:70b "Say 'working' if you can read this"
You should see:
- Model listed with size and modified date
- JSON response containing model info
- Text response "working" in Terminal
Benchmark (optional):
# Test speed on your hardware
time ollama run llama3.3:70b "Count to 10" --verbose
Good performance: <2s to first token once the model is loaded; 20-40 tokens/sec for small models on an M1/M2 Mac or RTX 3080+. Expect 70B models to run considerably slower.
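--verbose prints timing stats directly; if you benchmark through the API instead, the final JSON includes eval_count (tokens generated) and eval_duration (in nanoseconds), which give tokens/sec. A small helper (our own naming):

```python
def tokens_per_second(resp: dict) -> float:
    """Compute generation speed from Ollama's response metadata.

    eval_count is the number of generated tokens; eval_duration is in
    nanoseconds, so convert to seconds before dividing.
    """
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)
```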
What You Learned
- Ollama runs production LLMs locally with zero configuration
- Models use OpenAI-compatible API for easy integration
- 70B models rival GPT-4 quality without cloud costs or data sharing
Limitations:
- Large models need 48GB+ RAM (use smaller models or quantized versions)
- First response slower than cloud (loads model into VRAM)
- No built-in conversation memory (implement in your app)
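The last limitation is easy to work around: keep the message list yourself and resend it to the /api/chat endpoint each turn. A stdlib-only sketch; the function names and the trimming policy are our own choices, not part of Ollama:

```python
import json
import urllib.request

def trimmed(history, max_turns=10):
    # Keep an optional system prompt plus the most recent messages,
    # so the prompt doesn't grow without bound.
    system = [m for m in history if m["role"] == "system"][:1]
    rest = [m for m in history if m["role"] != "system"]
    return system + rest[-max_turns:]

def chat(history, user_msg, model="llama3.2:3b",
         url="http://localhost:11434/api/chat"):
    """Send the running conversation to Ollama and record the reply."""
    history.append({"role": "user", "content": user_msg})
    payload = json.dumps({"model": model,
                          "messages": trimmed(history),
                          "stream": False}).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)["message"]["content"]
    history.append({"role": "assistant", "content": reply})
    return reply
```

Each call appends both turns to history, so the model sees the conversation so far while trimmed() caps the context size.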
When NOT to use this:
- Need absolute latest training data (cloud models update frequently)
- Running on severely limited hardware (<8GB RAM)
- Building for non-technical users who can't install software
Alternative models to try:
ollama pull mistral:7b # Balanced performance/speed
ollama pull codellama:34b # Code-specific tasks
ollama pull llama3.2-vision # Image understanding
Common workflows:
- Code review: ollama run deepseek-coder-v2 "Review this PR: $(cat changes.diff)"
- Document summary: ollama run llama3.3:70b "Summarize: $(cat report.md)"
- Data analysis: Pipe CSV contents into a prompt for insights
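For the data-analysis workflow, most of the work is condensing the CSV into a compact prompt so you don't overflow the context window. A minimal sketch (csv_summary_prompt is our own helper name, not an Ollama API):

```python
import csv
import io

def csv_summary_prompt(csv_text, question, sample_rows=5):
    """Build a compact prompt from a CSV: header, row count, and a sample."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    sample = "\n".join(",".join(r) for r in data[:sample_rows])
    return (f"Columns: {', '.join(header)}\n"
            f"{len(data)} rows, first {min(sample_rows, len(data))} shown:\n"
            f"{sample}\n\nQuestion: {question}")
```

Pass the result as the prompt in any of the API examples above; for large files, send summary statistics rather than raw rows.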
Troubleshooting
"Model too large for available memory"
# Check available RAM
free -h # Linux
vm_stat # macOS
# Default tags are already 4-bit quantized; if the model still doesn't fit,
# switch to a smaller model (or a lower-bit tag from the model's library page)
ollama pull llama3.2:3b
"Slow generation speed"
- Ensure the model is fully loaded: ollama ps shows how much runs on GPU vs CPU
- Close other apps to free RAM
- GPU users: verify CUDA/Metal acceleration with ollama run llama3.3:70b --verbose
"Cannot connect to Ollama server"
# Restart service
pkill ollama
ollama serve
# Check logs
journalctl -u ollama # Linux with systemd
~/Library/Logs/Ollama/server.log # macOS
"Model keeps unloading"
# Keep the model in memory: set OLLAMA_KEEP_ALIVE before starting the server
export OLLAMA_KEEP_ALIVE=24h   # or -1 to keep it loaded indefinitely
ollama serve
Resource Requirements
Minimum specs:
- 8GB RAM (for 3B models)
- 10GB free disk space
- macOS 11+, Ubuntu 20.04+, Windows 10+
Recommended specs:
- 64GB RAM (for 70B models)
- NVIDIA RTX 3080+ or M2 Max/Ultra
- NVMe SSD for model storage
Model sizes:
- 3B models: ~2GB download, 4GB RAM
- 13B models: ~7GB download, 16GB RAM
- 70B models: ~40GB download, 48GB RAM
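The RAM figures above follow a rough rule of thumb: parameters × bytes per parameter (0.5 at 4-bit quantization) plus roughly 20% overhead for the KV cache and runtime. A back-of-envelope helper; the overhead factor is our assumption, not an Ollama number, and it estimates model memory only (OS headroom dominates for small models):

```python
def approx_model_ram_gb(params_billion, bits=4, overhead=1.2):
    """Rough RAM estimate: params * bytes/param * overhead factor."""
    return params_billion * (bits / 8) * overhead
```

approx_model_ram_gb(70) gives about 42 GB, consistent with the 48GB recommendation once OS headroom is included.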
Tested on Ollama 0.5.0, macOS 14.6 (M2 Max), Ubuntu 24.04 (RTX 4090), Windows 11 (RTX 3090)