Problem: Running a Frontier LLM Without Cloud Costs or Privacy Tradeoffs
You want to run Llama 4 8B locally — fast, private, and free — but the setup feels like it should be complicated. It's not. Ollama 2.5 on Apple Silicon makes this a 3-command install.
You'll learn:
- How to install Ollama 2.5 and pull Llama 4 8B
- Why M3 Air handles this model better than you'd expect
- How to run the model via CLI and the local REST API
Time: 15 min | Level: Beginner
Why This Works So Well on M3 Air
The M3 Air's unified memory architecture means the GPU and CPU share the same memory pool. Llama 4 8B at 4-bit quantization sits comfortably in 8GB of RAM — leaving headroom for macOS. Ollama 2.5 ships with Metal acceleration enabled by default, so inference uses the GPU without any configuration.
What to expect:
- ~20–35 tokens/sec generation speed on 8GB M3 Air
- First token latency under 1 second after model load
- Model load time of 3–6 seconds cold start
Solution
Step 1: Install Ollama 2.5
# Install via the official script
curl -fsSL https://ollama.com/install.sh | sh
Verify it's running:
ollama --version
# Expected: ollama version 2.5.x
If it fails:
- "command not found" after install: Restart your Terminal or run
source ~/.zshrc - Permission error: The script needs sudo for
/usr/local/bin— it will prompt automatically
You should see version 2.5.x — anything older won't have Llama 4 support
Step 2: Pull Llama 4 8B
# Pull the 4-bit quantized version (~5GB download)
ollama pull llama4:8b
This downloads the model to ~/.ollama/models/. The quantized version (8b) is the right choice for M3 Air — it fits cleanly in unified memory without swapping.
Expected: A progress bar showing download and verification. Takes 5–10 minutes on a typical broadband connection.
The pull command handles download, verification, and caching automatically
Step 3: Run Your First Inference
# Start an interactive chat session
ollama run llama4:8b
Type a prompt and hit Enter. First run loads the model into memory (~3–6 seconds), then responses stream at ~20–35 tokens/sec.
>>> Explain how transformers work in 3 sentences.
To exit: Type /bye or press Ctrl+D.
Step 4: Use the REST API (Optional)
Ollama serves a local API at http://localhost:11434. This is useful for integrating with apps or scripts.
# Single completion via curl
curl http://localhost:11434/api/generate \
-d '{
"model": "llama4:8b",
"prompt": "What is the capital of Malaysia?",
"stream": false
}'
Or use it in Python:
import requests
response = requests.post(
"http://localhost:11434/api/generate",
json={
"model": "llama4:8b",
"prompt": "Summarize quantum entanglement in 2 sentences.",
"stream": False,
},
)
# Response text is in response.json()["response"]
print(response.json()["response"])
Why stream: false: Easier for scripting. Switch to true for chat UIs where you want token-by-token output.
Clean JSON response — the "response" field contains the model output
Verification
# List downloaded models
ollama list
# Run a quick benchmark prompt
ollama run llama4:8b "Say hello in 5 languages."
You should see: llama4:8b in the model list and a streamed response appearing within a second of submitting the prompt.
What You Learned
- Ollama 2.5 handles Metal acceleration automatically — no GPU configuration needed
- The 4-bit quantized 8B model is the sweet spot for M3 Air: good quality, fits in 8GB unified memory
- The local REST API at
localhost:11434makes it easy to integrate with any tool or language
Limitation: 8GB unified memory works, but if you're also running Chrome and VS Code heavily, consider closing other apps during long inference sessions. The 16GB M3 Air has more headroom.
When NOT to use this: If you need GPT-4 class reasoning for production workloads, a local 8B model won't cut it. For prototyping, coding assistance, and private document Q&A, it's excellent.
Tested on MacBook M3 Air 8GB, macOS Sequoia 15.3, Ollama 2.5.1, Llama 4 8B Q4_K_M