Problem: You Need AI Without Cloud Dependencies
You want GPT-quality AI for coding, writing, or analysis but don't want to pay per token, share sensitive data with APIs, or lose access when offline.
You'll learn:
- Install Ollama and download production-ready models
- Run local LLMs with OpenAI-compatible API
- Integrate with your existing tools (VS Code, scripts, apps)
Time: 15 min | Level: Beginner
Why This Matters
Cloud AI APIs cost $10-100/month and send your data to third parties. Local models like Llama 3.3 (70B) now rival GPT-4 on many benchmarks while running entirely on your hardware.
Common use cases:
- Code completion without sharing proprietary code
- Document analysis with sensitive data
- Offline development environments
- Prototyping without API costs
Solution
Step 1: Install Ollama
macOS:
# Install with Homebrew, or download the app from ollama.com/download
brew install ollama
# Verify installation
ollama --version
Linux:
curl -fsSL https://ollama.com/install.sh | sh
Windows: Download installer from ollama.com/download
Expected: ollama version 0.5.0 or newer
If it fails:
- "Command not found": Add /usr/local/bin to PATH: export PATH=$PATH:/usr/local/bin
- "Permission denied": Rerun the installer with sudo on Linux
Step 2: Download Your First Model
# Recommended: Llama 3.3 70B (best quality, needs 48GB RAM)
ollama pull llama3.3:70b
# OR lightweight option: Llama 3.2 3B (8GB RAM minimum)
ollama pull llama3.2:3b
# OR coding-focused: DeepSeek Coder v2 16B
ollama pull deepseek-coder-v2:16b
Why these models:
- llama3.3:70b — Matches GPT-4 on most tasks, best for production
- llama3.2:3b — Fast responses, good for laptops with limited RAM
- deepseek-coder-v2 — Specialized for code generation/debugging
Download time: 5-30 minutes depending on model size and connection.
Step 3: Test the Model
# Interactive chat mode
ollama run llama3.3:70b
# Try a test prompt
>>> Write a Python function to validate email addresses
# Exit chat mode
>>> /bye
You should see: Multi-line response with working code in ~5-10 seconds.
Performance check:
# List downloaded models and their sizes
ollama list
# Show currently loaded models and memory usage
ollama ps
Step 4: Use as an API
Ollama runs a local server at http://localhost:11434 with OpenAI-compatible endpoints.
Start the server (runs automatically on install):
# Check if running
curl http://localhost:11434/api/tags
# Manually start if needed
ollama serve
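If you script against the server, a quick health check before sending prompts avoids confusing timeouts. A minimal sketch using only the standard library (the function name server_up is our own, not part of Ollama):

```python
import urllib.error
import urllib.request

def server_up(base="http://localhost:11434", timeout=2):
    """Return True if the Ollama server answers its /api/tags endpoint."""
    try:
        with urllib.request.urlopen(f"{base}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

Call server_up() before your first request and fall back to `ollama serve` if it returns False.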
Test the API:
curl http://localhost:11434/api/generate -d '{
"model": "llama3.3:70b",
"prompt": "Explain Docker in one sentence",
"stream": false
}'
Expected: JSON response with generated text in response field.
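With "stream": true (the default), the API instead returns newline-delimited JSON chunks, each carrying a response fragment and a done flag. A sketch of reassembling the full text (assemble_stream is our own helper name):

```python
import json

def assemble_stream(ndjson_lines):
    """Join the 'response' fragments from Ollama's streaming NDJSON output."""
    parts = []
    for line in ndjson_lines:
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):  # final chunk also carries timing metadata
            break
    return "".join(parts)
```

In practice you would feed it the lines of a streaming HTTP response, e.g. requests.post(..., stream=True).iter_lines().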
Step 5: Integrate with Your Tools
VS Code (Continue.dev)
# Install Continue extension
code --install-extension continue.continue
# Add to Continue config (~/.continue/config.json)
{
"models": [{
"title": "Llama 3.3 Local",
"provider": "ollama",
"model": "llama3.3:70b",
"apiBase": "http://localhost:11434"
}]
}
Python Script
import requests
def ask_ollama(prompt: str) -> str:
response = requests.post('http://localhost:11434/api/generate', json={
'model': 'llama3.3:70b',
'prompt': prompt,
'stream': False
})
return response.json()['response']
# Use it
result = ask_ollama("Write a SQL query to find duplicate emails")
print(result)
OpenAI SDK (drop-in replacement)
from openai import OpenAI
# Point to Ollama instead of OpenAI
client = OpenAI(
base_url='http://localhost:11434/v1',
api_key='ollama' # Required but unused
)
response = client.chat.completions.create(
model='llama3.3:70b',
messages=[{'role': 'user', 'content': 'Debug this error: TypeError...'}]
)
print(response.choices[0].message.content)
Why this works: Ollama implements OpenAI's API spec, so existing code works without changes.
Verification
Test end-to-end:
# 1. Model is downloaded
ollama list | grep llama3.3
# 2. Server responds
curl http://localhost:11434/api/tags | grep llama3.3
# 3. Generation works
ollama run llama3.3:70b "Say 'working' if you can read this"
You should see:
- Model listed with size and modified date
- JSON response containing model info
- Text response "working" in Terminal
Benchmark (optional):
# Test speed on your hardware
time ollama run llama3.3:70b "Count to 10" --verbose
Good performance: <2s to first token once the model is loaded; 20-40 tokens/sec for small models on an M1/M2 Mac or RTX 3080+. Expect 70B models to run considerably slower.
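--verbose prints timing stats directly; if you benchmark through the API instead, the final JSON includes eval_count (tokens generated) and eval_duration (in nanoseconds), which give tokens/sec. A small helper (our own naming):

```python
def tokens_per_second(resp: dict) -> float:
    """Compute generation speed from Ollama's response metadata.

    eval_count is the number of generated tokens; eval_duration is in
    nanoseconds, so convert to seconds before dividing.
    """
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)
```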
What You Learned
- Ollama runs production LLMs locally with zero configuration
- Models use OpenAI-compatible API for easy integration
- 70B models rival GPT-4 quality without cloud costs or data sharing
Limitations:
- Large models need 48GB+ RAM (use smaller models or quantized versions)
- First response slower than cloud (loads model into VRAM)
- No built-in conversation memory (implement in your app)
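The last limitation is easy to work around: keep the message list yourself and resend it to the /api/chat endpoint each turn. A stdlib-only sketch; the function names and the trimming policy are our own choices, not part of Ollama:

```python
import json
import urllib.request

def trimmed(history, max_turns=10):
    # Keep an optional system prompt plus the most recent messages,
    # so the prompt doesn't grow without bound.
    system = [m for m in history if m["role"] == "system"][:1]
    rest = [m for m in history if m["role"] != "system"]
    return system + rest[-max_turns:]

def chat(history, user_msg, model="llama3.2:3b",
         url="http://localhost:11434/api/chat"):
    """Send the running conversation to Ollama and record the reply."""
    history.append({"role": "user", "content": user_msg})
    payload = json.dumps({"model": model,
                          "messages": trimmed(history),
                          "stream": False}).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)["message"]["content"]
    history.append({"role": "assistant", "content": reply})
    return reply
```

Each call appends both turns to history, so the model sees the conversation so far while trimmed() caps the context size.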
When NOT to use this:
- Need absolute latest training data (cloud models update frequently)
- Running on severely limited hardware (<8GB RAM)
- Building for non-technical users who can't install software
Alternative models to try:
ollama pull mistral:7b # Balanced performance/speed
ollama pull codellama:34b # Code-specific tasks
ollama pull llama3.2-vision # Image understanding
Common workflows:
- Code review: ollama run deepseek-coder-v2 "Review this PR: $(cat changes.diff)"
- Document summary: ollama run llama3.3:70b "Summarize: $(cat report.md)"
- Data analysis: Pipe CSV contents into a prompt for insights
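For the data-analysis workflow, most of the work is condensing the CSV into a compact prompt so you don't overflow the context window. A minimal sketch (csv_summary_prompt is our own helper name, not an Ollama API):

```python
import csv
import io

def csv_summary_prompt(csv_text, question, sample_rows=5):
    """Build a compact prompt from a CSV: header, row count, and a sample."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    sample = "\n".join(",".join(r) for r in data[:sample_rows])
    return (f"Columns: {', '.join(header)}\n"
            f"{len(data)} rows, first {min(sample_rows, len(data))} shown:\n"
            f"{sample}\n\nQuestion: {question}")
```

Pass the result as the prompt in any of the API examples above; for large files, send summary statistics rather than raw rows.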
Troubleshooting
"Model too large for available memory"
# Check available RAM
free -h # Linux
vm_stat # macOS
# Default tags are already 4-bit quantized; if the model still doesn't fit,
# switch to a smaller model (or a lower-bit tag from the model's library page)
ollama pull llama3.2:3b
"Slow generation speed"
- Ensure the model is fully loaded: ollama ps shows how much runs on GPU vs CPU
- Close other apps to free RAM
- GPU users: verify CUDA/Metal acceleration with ollama run llama3.3:70b --verbose
"Cannot connect to Ollama server"
# Restart service
pkill ollama
ollama serve
# Check logs
journalctl -u ollama # Linux with systemd
~/Library/Logs/Ollama/server.log # macOS
"Model keeps unloading"
# Keep the model in memory: set OLLAMA_KEEP_ALIVE before starting the server
export OLLAMA_KEEP_ALIVE=24h   # or -1 to keep it loaded indefinitely
ollama serve
Resource Requirements
Minimum specs:
- 8GB RAM (for 3B models)
- 10GB free disk space
- macOS 11+, Ubuntu 20.04+, Windows 10+
Recommended specs:
- 64GB RAM (for 70B models)
- NVIDIA RTX 3080+ or M2 Max/Ultra
- NVMe SSD for model storage
Model sizes:
- 3B models: ~2GB download, 4GB RAM
- 13B models: ~7GB download, 16GB RAM
- 70B models: ~40GB download, 48GB RAM
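The RAM figures above follow a rough rule of thumb: parameters × bytes per parameter (0.5 at 4-bit quantization) plus roughly 20% overhead for the KV cache and runtime. A back-of-envelope helper; the overhead factor is our assumption, not an Ollama number, and it estimates model memory only (OS headroom dominates for small models):

```python
def approx_model_ram_gb(params_billion, bits=4, overhead=1.2):
    """Rough RAM estimate: params * bytes/param * overhead factor."""
    return params_billion * (bits / 8) * overhead
```

approx_model_ram_gb(70) gives about 42 GB, consistent with the 48GB recommendation once OS headroom is included.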
Tested on Ollama 0.5.0, macOS 14.6 (M2 Max), Ubuntu 24.04 (RTX 4090), Windows 11 (RTX 3090)