Problem: Running a 70B Model Locally Without Melting Your Laptop
You want Llama 5 70B running on your MacBook Pro M5 — no API costs, no data leaving your machine, no rate limits. But the model is massive, and most guides assume you have a Linux server with NVIDIA GPUs.
You'll learn:
- How to install and configure Ollama on macOS
- Why the M5's unified memory makes 70B models viable on a laptop
- How to tune performance for your specific M5 config
Time: 15 min | Level: Beginner
Why This Happens
Most 70B-parameter models need roughly 140GB of memory in fp16, and 35–70GB even after 4- to 8-bit quantization. That ruled out consumer hardware until Apple Silicon changed the equation.
The M5 uses unified memory, meaning the CPU and GPU share the same memory pool. A MacBook Pro M5 Max with 128GB unified memory can load and run a quantized Llama 5 70B model entirely in memory, using the GPU cores for inference. No discrete GPU required.
What makes M5 different:
- Up to 128GB unified memory (vs. 24GB max on most consumer GPUs)
- Metal GPU acceleration works with Ollama out of the box
- Memory bandwidth on M5 Max: ~400 GB/s — fast enough for real-time inference
Minimum specs to run Llama 5 70B:
- MacBook Pro M5 Pro with 48GB unified memory (slow but usable)
- MacBook Pro M5 Max with 64GB+ (recommended)
- macOS Sequoia 15.x or later
M5 Max with 64GB: Llama 5 70B Q4 quantization uses ~38GB, leaving headroom
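The ~38GB figure is easy to sanity-check. Q4_K_M stores roughly 4.5 bits per parameter on average (an approximation: it mixes 4-bit and 6-bit quantized blocks), so the weights for a 70B model work out to:

```python
# Back-of-envelope weight memory for a Q4_K_M-quantized 70B model.
# 4.5 bits/param is an approximation: Q4_K_M mixes 4-bit and 6-bit
# blocks, so the effective rate sits a bit above 4 bits.
params = 70e9
bits_per_param = 4.5
weight_gb = params * bits_per_param / 8 / 1e9  # bits -> bytes -> GB
print(f"~{weight_gb:.0f} GB for weights alone")  # ~39 GB
```

That lands right at the ~38GB the guide cites; the remainder of your unified memory goes to the KV cache, macOS, and your other apps.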
Solution
Step 1: Install Ollama
Ollama is the fastest way to run local LLMs on macOS. It handles model downloads, quantization selection, and Metal GPU acceleration automatically.
# Install via Homebrew
brew install ollama
# Verify installation
ollama --version
Expected: ollama version 0.5.x or later.
If Homebrew isn't installed:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
If it fails:
- "command not found: brew": Install Homebrew first using the command above
- Permission error: Run sudo chown -R $(whoami) /opt/homebrew (Homebrew's prefix on Apple Silicon), then retry
Step 2: Start the Ollama Service
# Start Ollama as a background service
ollama serve &
Expected: You should see Listening on 127.0.0.1:11434
Alternatively, launch the Ollama app from /Applications if you prefer a menu bar icon. Both work identically.
Ollama running in the menu bar — green dot means the service is active
Step 3: Pull and Run Llama 5 70B
# Pull the Q4_K_M quantized version (best quality/size balance for M5)
ollama pull llama5:70b-q4_K_M
This downloads ~38GB. On a fast connection, expect 10–20 minutes. The model is cached locally — you only download once.
# Start an interactive chat session
ollama run llama5:70b-q4_K_M
Expected: A >>> prompt appears. Type your first message and hit Enter.
Download progress bar — the Q4_K_M quantization is the sweet spot for M5
If it fails:
- "model not found": Check the exact model tag with ollama list after pulling — tag names update with new releases
- Out of memory error: Switch to a more aggressive quantization: ollama pull llama5:70b-q2_K (~20GB, slightly lower quality)
Step 4: Tune Performance for Your M5
Ollama auto-detects Apple Silicon, but you can push performance further with environment variables:
# Tells Ollama to offload all layers to the Metal GPU
export OLLAMA_NUM_GPU=999
# Increase context window (uses more memory)
export OLLAMA_MAX_CONTEXT=8192
# Run with tuning applied
OLLAMA_NUM_GPU=999 ollama run llama5:70b-q4_K_M
Why 999? Ollama treats this as "offload everything possible." It doesn't mean 999 GPU layers — it just signals maximum GPU utilization.
For M5 Pro with 48GB, reduce context to stay within memory limits:
# Conservative settings for M5 Pro 48GB
export OLLAMA_MAX_CONTEXT=4096
export OLLAMA_NUM_GPU=999
ollama run llama5:70b-q4_K_M
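Why does context length matter for memory? The weights are fixed, but the KV cache grows linearly with context. Llama 5's internals aren't public here, so the dimensions below are assumptions borrowed from Llama-3-70B-class models (80 layers, 8 KV heads via grouped-query attention, head dim 128, fp16 cache) purely for illustration:

```python
# Back-of-envelope KV-cache size; this is the part that grows with context.
# Architecture numbers are ASSUMED (Llama-3-70B-like), not confirmed for
# Llama 5: 80 layers, 8 KV heads, head dim 128, 2 bytes/value (fp16).
layers, kv_heads, head_dim, bytes_per = 80, 8, 128, 2

def kv_cache_gb(ctx_tokens: int) -> float:
    # The leading 2x covers both keys and values.
    return 2 * layers * kv_heads * head_dim * ctx_tokens * bytes_per / 1e9

for ctx in (4096, 8192):
    print(f"{ctx} tokens: ~{kv_cache_gb(ctx):.1f} GB")
```

Under these assumptions, halving the context from 8192 to 4096 saves on the order of a gigabyte — modest next to the 38GB of weights, but meaningful when the 48GB M5 Pro is already close to its ceiling.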
Metal GPU hitting ~80% utilization during token generation — this is what you want
Step 5: Use the REST API (Optional)
Ollama exposes its own local REST API under /api, plus an OpenAI-compatible endpoint at /v1 on the same port. Swap the latter into any tool that supports OpenAI — VS Code extensions, Continue.dev, custom scripts.
# Test the API directly
curl http://localhost:11434/api/chat \
-d '{
"model": "llama5:70b-q4_K_M",
"messages": [
{ "role": "user", "content": "Explain memory bandwidth in one paragraph." }
],
"stream": false
}'
Expected: JSON response with the model's reply under message.content.
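With "stream": false, the reply arrives as one JSON object. A minimal sketch of pulling the text out in Python (the sample payload is illustrative, trimmed to the fields you'd actually read):

```python
import json

# Illustrative non-streaming /api/chat response; the reply text lives
# under message.content.
sample = '''{
  "model": "llama5:70b-q4_K_M",
  "message": {"role": "assistant", "content": "Memory bandwidth is..."},
  "done": true
}'''

reply = json.loads(sample)
print(reply["message"]["content"])  # Memory bandwidth is...
```

With "stream": true (the default), you instead get one JSON object per line, each carrying a fragment of the reply, and the final one has "done": true.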
Drop-in OpenAI SDK replacement:
# Works with any OpenAI SDK — just change the base_url
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama", # Required by SDK, value is ignored locally
)
response = client.chat.completions.create(
model="llama5:70b-q4_K_M",
messages=[{"role": "user", "content": "Hello, Llama."}]
)
print(response.choices[0].message.content)
If it fails:
- "Connection refused": Make sure ollama serve is running first
- Slow first response: The model loads into memory on the first request — subsequent responses are fast
Verification
# Confirm the model is loaded and responding
ollama run llama5:70b-q4_K_M "What is 17 * 23? Think step by step."
You should see: A step-by-step response streaming in chunks, ending with 391.
Check token generation speed:
# The --verbose flag prints a stats line with tokens/sec
ollama run llama5:70b-q4_K_M "Write a haiku about inference speed." --verbose
Expected on M5 Max 64GB: 15–25 tokens/second with Q4_K_M. Fast enough for real-time chat.
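To turn tokens/second into felt latency, divide the expected response length by the generation rate. For a ~150-token answer (a few short paragraphs):

```python
# What 15-25 tokens/second means for end-to-end response time.
def seconds_for(tokens: int, tokens_per_second: float) -> float:
    return tokens / tokens_per_second

for tps in (15, 25):
    print(f"{tps} t/s: {seconds_for(150, tps):.0f} s for a 150-token reply")
```

So even at the low end, a typical chat reply finishes streaming in about ten seconds, which is why this range qualifies as real-time.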
Stats line shows tokens/sec — M5 Max should hit 15–25 t/s for Llama 5 70B Q4
What You Learned
- Ollama handles the hard parts: quantization selection, Metal GPU offloading, and model caching
- Q4_K_M is the best quantization for 70B on M5 — good quality, fits in 64GB with room to spare
- Unified memory is why M5 can run models that normally need 40GB+ of VRAM
- The OpenAI-compatible local API drops into almost any existing toolchain without code changes
Limitations to know:
- M5 Pro with 48GB works but runs slower and needs a reduced context window
- Q2 quantization noticeably reduces quality — avoid unless memory is tight
- First load takes 10–20 seconds; after that, responses are fast
When NOT to use this:
- Serving multiple concurrent users — Ollama is a single-user local tool
- Workflows requiring >32K context windows — cloud APIs still win here
- Running multiple large models simultaneously — use a dedicated inference server instead
Tested on MacBook Pro M5 Max 64GB, macOS Sequoia 15.3, Ollama 0.5.x, Llama 5 70B Q4_K_M quantization