Problem: Running Qwen 2.5 72B Locally Without a $10,000 GPU
Qwen 2.5 72B local setup is achievable on consumer hardware — but only if you pick the right quantization level and offloading strategy. Most guides skip the VRAM math. This one doesn't.
You'll learn:
- Which Qwen 2.5 72B GGUF quant to use for your hardware tier
- How to run it with Ollama (CLI-first, OpenAI-compatible API)
- How to run it with LM Studio (GUI, easy model switching)
- How to tune `num_gpu` for partial GPU offloading on 16–24 GB VRAM
Time: 25 min | Difficulty: Intermediate
Why This Happens
Qwen 2.5 72B at full BF16 precision requires ~144 GB of VRAM — well beyond a single consumer GPU. GGUF quantization compresses weights to 4–6 bits, bringing the memory footprint down to 40–55 GB. With partial CPU offloading, you can run acceptable inference on a single RTX 4090 (24 GB) or an M2 Max (96 GB unified memory).
Minimum hardware tiers:
| Tier | VRAM / RAM | Recommended Quant | Tokens/sec (est.) |
|---|---|---|---|
| RTX 3090 / 4080 (16 GB) | 16 GB VRAM + 64 GB RAM | Q4_K_M | 3–6 t/s (heavy offload) |
| RTX 4090 (24 GB) | 24 GB VRAM + 64 GB RAM | Q4_K_M | 8–12 t/s |
| M2 Max / M3 Max (96 GB) | 96 GB unified | Q6_K | 14–20 t/s |
| Dual RTX 3090 (48 GB) | 48 GB VRAM | Q5_K_M | 18–25 t/s |
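The quant sizes behind this table follow from simple bits-per-weight arithmetic. A minimal sketch — the bits-per-weight figures are effective averages (an assumption), since K-quants mix precisions across tensors:

```python
def gguf_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough GGUF size estimate: parameters * effective bits per weight / 8."""
    return params_billions * bits_per_weight / 8

# Qwen 2.5 72B has ~72.7B parameters; Q4_K_M averages roughly 4.7 bits/weight
print(round(gguf_size_gb(72.7, 4.7), 1))   # ≈ 42.7 GB — near the 43 GB Q4_K_M file
print(round(gguf_size_gb(72.7, 16.0), 1))  # ≈ 145.4 GB — the full BF16 footprint
```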
Symptoms of wrong setup:
- `CUDA out of memory` on model load — quant is too large for VRAM
- Inference running at 0.5 t/s — all layers on CPU, no GPU offload configured
- `model not found` in Ollama — wrong model tag used
Option 1: Run Qwen 2.5 72B with Ollama
Ollama is the fastest path to a local OpenAI-compatible API. It handles model download, GGUF conversion, and GPU offload automatically via num_gpu.
Step 1: Install Ollama
# Linux / macOS — official install script
curl -fsSL https://ollama.com/install.sh | sh
# Verify install
ollama --version
# Expected: ollama version 0.6.x or later
On Windows: Download the installer from ollama.com — Ollama 0.5+ supports Windows natively.
Step 2: Pull the Qwen 2.5 72B Model
# Default pull — Ollama selects Q4_K_M automatically for 72B
ollama pull qwen2.5:72b
# Explicit quant tags (use if you want to override)
ollama pull qwen2.5:72b-instruct-q4_K_M # ~43 GB — recommended for 24 GB VRAM
ollama pull qwen2.5:72b-instruct-q5_K_M # ~52 GB — better quality, needs 48 GB VRAM
ollama pull qwen2.5:72b-instruct-q6_K # ~60 GB — for 64 GB+ unified memory (Apple Silicon)
Expected output:
pulling manifest
pulling 8edb4d1b7dac... 43.0 GB ████████████ 100%
verifying sha256 digest
writing manifest
success
This download takes 20–60 min depending on your connection. The model lands in ~/.ollama/models/.
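If you script your setup, the tier table above reduces to a small helper. This is a hypothetical convenience function, not part of Ollama:

```python
def quant_tag(mem_gb: float, unified: bool = False) -> str:
    """Suggest a qwen2.5:72b pull tag for a given VRAM (or Apple unified
    memory) budget in GB, following the hardware tiers above."""
    if unified and mem_gb >= 64:
        return "qwen2.5:72b-instruct-q6_K"
    if mem_gb >= 48:
        return "qwen2.5:72b-instruct-q5_K_M"
    return "qwen2.5:72b-instruct-q4_K_M"

print(quant_tag(24))                # qwen2.5:72b-instruct-q4_K_M
print(quant_tag(96, unified=True))  # qwen2.5:72b-instruct-q6_K
```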
Step 3: Configure GPU Offloading
By default, Ollama auto-detects VRAM and sets num_gpu accordingly. On 16–24 GB VRAM, it may not load enough layers onto the GPU.
# Run a quick inference and watch the layer log
OLLAMA_DEBUG=1 ollama run qwen2.5:72b "Hello" 2>&1 | grep -i "gpu\|layer"
If fewer than 40 layers are on GPU, override with a Modelfile:
cat > Modelfile << 'EOF'
FROM qwen2.5:72b
# Force 40 layers to GPU — tune down if you hit OOM
PARAMETER num_gpu 40
# Larger context = more VRAM; drop to 4096 if OOM
PARAMETER num_ctx 8192
EOF
ollama create qwen2.5-72b-tuned -f Modelfile
ollama run qwen2.5-72b-tuned
Tuning num_gpu for your tier:
| VRAM | Q4_K_M layers on GPU | Context |
|---|---|---|
| 16 GB | 20–28 | 4096 |
| 24 GB | 38–45 | 8192 |
| 48 GB | 80 (all) | 32768 |
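The layer counts in this table come from back-of-envelope division: a ~43 GB Q4_K_M model spread over Qwen 2.5 72B's 80 transformer layers is roughly 0.54 GB per layer. A sketch — the 2.5 GB reserve for KV cache and CUDA overhead is an assumption; raise it if you use a larger num_ctx:

```python
def layers_on_gpu(vram_gb: float, model_gb: float = 43.0,
                  n_layers: int = 80, reserve_gb: float = 2.5) -> int:
    """Estimate how many layers fit in VRAM after reserving space for
    KV cache and runtime overhead."""
    per_layer = model_gb / n_layers  # ~0.54 GB/layer for Q4_K_M
    return min(n_layers, int((vram_gb - reserve_gb) / per_layer))

print(layers_on_gpu(24))  # 40 — within the 38–45 row above
print(layers_on_gpu(16))  # 25 — within the 20–28 row
print(layers_on_gpu(48))  # 80 — the whole model fits
```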
Step 4: Call the OpenAI-Compatible API
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen2.5:72b",
"messages": [{"role": "user", "content": "Explain GGUF quantization in 3 sentences."}],
"stream": false
}'
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama", # Ollama ignores the key value; any string works
)
response = client.chat.completions.create(
model="qwen2.5:72b",
messages=[{"role": "user", "content": "Write a Python function to chunk text for RAG."}],
temperature=0.7,
max_tokens=1024,
)
print(response.choices[0].message.content)
Step 5: Verify the Ollama Setup
ollama ps
You should see:
NAME ID SIZE PROCESSOR UNTIL
qwen2.5:72b a08d2dcea6d4 26 GB 45%/55% GPU/CPU 4 minutes from now
The GPU/CPU split confirms partial offloading is working. If it shows 100% CPU, revisit the num_gpu step above.
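To automate that check, you can parse the PROCESSOR column. A sketch assuming the three formats `ollama ps` prints ('45%/55% GPU/CPU', '100% GPU', '100% CPU'), as shown in the output above:

```python
import re

def gpu_fraction(processor: str) -> float:
    """Parse an `ollama ps` PROCESSOR field into a GPU fraction (0.0–1.0)."""
    m = re.match(r"(\d+)%/(\d+)%\s+GPU/CPU", processor)
    if m:
        return int(m.group(1)) / 100
    m = re.match(r"(\d+)%\s+GPU$", processor)
    return int(m.group(1)) / 100 if m else 0.0

print(gpu_fraction("45%/55% GPU/CPU"))  # 0.45
print(gpu_fraction("100% CPU"))         # 0.0 — time to revisit num_gpu
```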
Option 2: Run Qwen 2.5 72B with LM Studio
LM Studio is the best GUI option for local LLMs — especially for model comparisons and non-developers who don't want a terminal.
Step 1: Install LM Studio
Download from lmstudio.ai — available for macOS (Apple Silicon + Intel), Windows, and Linux (AppImage). Requires LM Studio 0.3.5 or later.
# Linux AppImage install
chmod +x LM_Studio-0.3.x.AppImage
./LM_Studio-0.3.x.AppImage
Step 2: Download Qwen 2.5 72B GGUF
- Open LM Studio → click Search (top bar)
- Search: `qwen2.5 72b GGUF`
- Select bartowski/Qwen2.5-72B-Instruct-GGUF — highest download count, well-maintained quants
- Choose your quant:
  - Q4_K_M — best balance of speed and quality for 24 GB VRAM
  - Q6_K — for Apple Silicon 64 GB+ unified memory
  - IQ4_XS — smallest file (~40 GB), slight quality drop, good for 16 GB VRAM + offload
Click Download. Models land in ~/LM Studio/models/.
Step 3: Load the Model and Configure GPU Layers
- My Models tab → click the downloaded file → Load Model
- GPU Offload slider: drag to Max for 48 GB+ VRAM; set 40–45 layers for 24 GB VRAM
- Context Length: `8192` (reduce to `4096` if load fails)
- Click Load — takes 30–90 seconds on NVMe
Step 4: Enable the Local API Server
- Local Server tab (left sidebar `</>` icon) → Start Server
- Defaults to `http://localhost:1234/v1`
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:1234/v1",
api_key="lm-studio", # Value is ignored; LM Studio requires any non-empty string
)
response = client.chat.completions.create(
model="bartowski/Qwen2.5-72B-Instruct-GGUF",
messages=[{"role": "user", "content": "Summarize the Qwen 2.5 technical report."}],
temperature=0.3,
max_tokens=512,
)
print(response.choices[0].message.content)
Step 5: Verify LM Studio is Running
curl http://localhost:1234/v1/models
You should see:
{
"data": [
{
"id": "bartowski/Qwen2.5-72B-Instruct-GGUF/Qwen2.5-72B-Instruct-Q4_K_M.gguf",
"object": "model"
}
]
}
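LM Studio embeds the quant in the model id, so a script can confirm which file is actually loaded. A sketch assuming the id format shown above:

```python
def quant_from_id(model_id: str) -> str:
    """Pull the quant suffix (e.g. 'Q4_K_M') out of an LM Studio GGUF id."""
    filename = model_id.rsplit("/", 1)[-1]
    stem = filename.removesuffix(".gguf")
    return stem.rsplit("-", 1)[-1]

model_id = "bartowski/Qwen2.5-72B-Instruct-GGUF/Qwen2.5-72B-Instruct-Q4_K_M.gguf"
print(quant_from_id(model_id))  # Q4_K_M
```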
Ollama vs LM Studio: Which Should You Use?
| | Ollama | LM Studio |
|---|---|---|
| Interface | CLI + REST API | GUI + REST API |
| Auto GPU config | ✅ Auto-detect | Manual slider |
| OpenAI-compatible API | ✅ | ✅ |
| Model switching | CLI only | GUI drag-and-drop |
| Docker support | ✅ ollama/ollama image | ❌ |
| Multi-model server | ✅ | ❌ (one model at a time) |
| Best for | Developers, CI, Docker | Evaluation, non-devs |
Choose Ollama if: You're building an app, running in Docker, or need CLI scripting.
Choose LM Studio if: You want a GUI, are comparing models, or don't want to touch a terminal.
What You Learned
- Q4_K_M is the right quant for 24 GB VRAM — Q5/Q6 require 48 GB+ for full GPU offload
- `num_gpu` in Ollama controls how many layers are placed on the GPU — always verify the split with `ollama ps`
- Both Ollama and LM Studio expose OpenAI-compatible endpoints — no SDK changes needed
- Unified memory on Apple Silicon can run larger quants than discrete VRAM at equivalent size
Tested on Ollama 0.6.2, LM Studio 0.3.6, Qwen 2.5 72B Instruct GGUF, Ubuntu 24.04, RTX 4090, and M2 Max 96 GB
FAQ
Q: How much VRAM does Qwen 2.5 72B actually need?
A: Q4_K_M weighs ~43 GB total. With 24 GB VRAM, ~40 layers run on GPU and the rest offload to CPU RAM — you need at least 64 GB system RAM for smooth inference.
Q: What is the difference between Q4_K_M and IQ4_XS?
A: Q4_K_M uses mixed 4-bit quantization with better outlier handling; IQ4_XS is 10–15% smaller but loses some reasoning quality on complex tasks. Use Q4_K_M unless disk space is the constraint.
Q: Can I run Qwen 2.5 72B in Docker with Ollama?
A: Yes. Use the official ollama/ollama image with --gpus all. Mount /root/.ollama as a volume to persist downloaded models between container restarts.
Q: Does Qwen 2.5 72B support function calling locally?
A: Yes — Qwen 2.5 Instruct supports tool use natively. Ollama exposes this via the /v1/chat/completions tools parameter starting with Ollama 0.5.4.
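A minimal tool definition for that endpoint might look like the following — `get_weather` and its schema are illustrative, not part of Qwen or Ollama:

```python
# Illustrative tool schema in the OpenAI function-calling format that
# the /v1/chat/completions endpoint accepts via the `tools` parameter.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool name
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Passed alongside messages, e.g.:
# client.chat.completions.create(model="qwen2.5:72b", messages=..., tools=tools)
print(tools[0]["function"]["name"])  # get_weather
```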
Q: What is the minimum RAM for CPU-only inference?
A: At least 64 GB system RAM for Q4_K_M. Inference will run at 1–3 t/s on a modern CPU — usable for batch jobs, not interactive chat.
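At those speeds, the batch-job arithmetic is straightforward. A quick sketch (generation time only; prompt processing is ignored):

```python
def batch_minutes(n_prompts: int, tokens_per_answer: int,
                  tokens_per_sec: float) -> float:
    """Wall-clock estimate for output generation across a batch of prompts."""
    return n_prompts * tokens_per_answer / tokens_per_sec / 60

# 100 prompts x 500 output tokens at 2 t/s:
print(round(batch_minutes(100, 500, 2.0)))  # ≈ 417 minutes — overnight, not interactive
```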