Qwen2.5 Quantized GGUF on 8GB VRAM is possible — but only with the right quant level and GPU offload settings. Get this wrong and you'll either run out of VRAM mid-generation or leave 40% of your GPU idle.
This guide walks you through loading Qwen2.5 7B and 14B GGUF models on a single 8GB GPU (RTX 3070, RTX 4060, or equivalent) using llama.cpp directly and via Ollama. You'll also get a clear comparison of Q4_K_M vs Q5_K_M vs Q8_0 so you know which quant to pick for your use case.
You'll learn:
- Which Qwen2.5 GGUF quant fits in 8GB VRAM without spilling to RAM
- How to set --n-gpu-layers to max out your GPU offload
- How to run the same model through Ollama for a simpler API interface
- What generation speed (tokens/sec) to expect on consumer GPUs
Time: 20 min | Difficulty: Intermediate
Why 8GB VRAM Is Still the Baseline in 2026
Most consumer GPUs shipped between 2020 and 2024 top out at 8GB VRAM; the RTX 3070 and RTX 4060 both sit in this tier (the RX 6700 XT, a common AMD alternative, carries 12GB). Cloud GPU instances such as AWS's g4dn.xlarge (16GB T4) cost $0.526/hr in us-east-1, making local inference attractive for anyone running sustained workloads.
Qwen2.5 from Alibaba's Qwen team hits a sweet spot: strong benchmark scores, Apache 2.0 licensing for most model sizes, and community-maintained GGUF files on Hugging Face that fit consumer hardware.
The catch: model size and quant level directly control VRAM consumption. A misconfigured offload wastes VRAM on system memory reads and kills throughput.
GGUF Quant Levels: What Fits in 8GB
GGUF quantization compresses model weights from 16-bit (FP16) down to 4–8 bits. Lower bits = smaller file = fits in less VRAM. The tradeoff is quality loss at the extremes.
| Quant | Qwen2.5 7B size | Qwen2.5 14B size | 8GB VRAM fit | Quality vs FP16 |
|---|---|---|---|---|
| Q2_K | ~2.7 GB | ~5.2 GB | ✅ Plenty of headroom | Noticeable degradation |
| Q4_K_M | ~4.5 GB | ~8.7 GB | ✅ 7B fits fully; 14B partial | Good — recommended default |
| Q5_K_M | ~5.4 GB | ~10.3 GB | ✅ 7B fits; 14B won't fit fully | Very good |
| Q8_0 | ~7.7 GB | ~14.7 GB | ⚠️ 7B barely fits; 14B no | Near-lossless |
For 8GB VRAM the practical choices are:
- Qwen2.5 7B Q5_K_M — best quality that still fits fully. Recommended for chat and coding tasks.
- Qwen2.5 7B Q8_0 — use only if you have no other apps consuming VRAM (browser tabs with WebGL, etc.).
- Qwen2.5 14B Q4_K_M — fits roughly 28 of its layers on GPU; the rest runs from CPU RAM. Slower, but accessible.
Rule of thumb: leave ~1GB headroom for the KV cache and context window. A model file that's exactly 8.0 GB will OOM at runtime.
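To see why that headroom matters, you can estimate the KV cache yourself. A minimal sketch in Python — the Qwen2.5 7B figures (28 layers, 4 KV heads under GQA, head dimension 128, FP16 cache) are assumptions taken from its published config:

```python
def kv_cache_bytes(n_layers: int, ctx: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """KV cache size: 2 tensors (K and V) per layer, one slot per context token."""
    return 2 * n_layers * ctx * n_kv_heads * head_dim * bytes_per_elem

# Qwen2.5 7B at a 4096-token context (assumed config: 28 layers, 4 KV heads, head_dim 128)
size = kv_cache_bytes(n_layers=28, ctx=4096, n_kv_heads=4, head_dim=128)
print(f"{size / 2**20:.0f} MiB")  # prints "224 MiB" — on top of the model file itself
```

Double the context and the cache doubles, which is why a model file that nearly fills VRAM will OOM once generation starts.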
Option 1: llama.cpp (Direct Control)
llama.cpp gives you raw control over GPU layers, context size, and threading. Use this when you need performance tuning or are building a custom inference server.
Step 1: Install llama.cpp
Linux / macOS (CUDA build):
# Clone and build with CUDA support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)
Windows (pre-built binary — quickest path):
Download the latest llama-*-bin-win-cuda-cu12.2.0-x64.zip from the llama.cpp releases page. Extract and add to PATH.
Expected output after build:
[100%] Built target llama-cli
If you see GGML_CUDA=OFF in the cmake summary, CUDA was not detected. Verify your CUDA toolkit version with nvcc --version — llama.cpp requires CUDA 11.8 or higher.
Step 2: Download the GGUF Model File
# Install huggingface-cli if needed
pip install huggingface_hub --break-system-packages
# Download Qwen2.5 7B Q5_K_M (recommended for 8GB)
huggingface-cli download Qwen/Qwen2.5-7B-Instruct-GGUF \
qwen2.5-7b-instruct-q5_k_m.gguf \
--local-dir ./models/qwen25-7b
For the 14B at Q4_K_M:
huggingface-cli download Qwen/Qwen2.5-14B-Instruct-GGUF \
qwen2.5-14b-instruct-q4_k_m.gguf \
--local-dir ./models/qwen25-14b
Expected output: Progress bar showing download. File sizes: ~5.4 GB (7B Q5_K_M) and ~8.7 GB (14B Q4_K_M).
Step 3: Run with GPU Layer Offload
The --n-gpu-layers flag controls how many transformer layers are loaded onto the GPU. Qwen2.5 7B has 28 layers; Qwen2.5 14B has 48.
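Before launching, a back-of-the-envelope check can tell you whether full offload is plausible. This hypothetical helper assumes layers are uniformly sized and ignores KV cache and runtime overhead, so treat its result as an upper bound, not a guarantee:

```python
def layers_that_fit(file_gb: float, n_layers: int,
                    vram_gb: float, headroom_gb: float = 1.0) -> int:
    """Rough upper bound on --n-gpu-layers, assuming uniform per-layer size."""
    per_layer_gb = file_gb / n_layers
    return min(n_layers, int((vram_gb - headroom_gb) / per_layer_gb))

# Qwen2.5 7B Q5_K_M (~5.4 GB file, 28 layers) on an 8GB card
print(layers_that_fit(5.4, 28, 8.0))  # prints 28: full offload fits
```

Real runs need extra margin for the KV cache and CUDA buffers, which is why the partial-offload examples below stay conservative.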
Qwen2.5 7B Q5_K_M — full GPU offload (recommended):
# --n-gpu-layers 28: all 28 layers on GPU, no CPU fallback
# --ctx-size 4096: 4K context; increase to 8192 only if VRAM allows
# --threads 4: CPU threads for non-GPU ops (tokenizer, sampling)
./build/bin/llama-cli \
  -m ./models/qwen25-7b/qwen2.5-7b-instruct-q5_k_m.gguf \
  --n-gpu-layers 28 \
  --ctx-size 4096 \
  --threads 4 \
  --temp 0.7 \
  -p "You are a helpful assistant." \
  --interactive
Qwen2.5 14B Q4_K_M — partial offload for 8GB:
# --n-gpu-layers 28: offload 28 of 48 layers; the remaining 20 run from CPU RAM
# --ctx-size 2048: keep context small to preserve VRAM for the KV cache
# --threads 8: more CPU threads, since part of the model runs there
./build/bin/llama-cli \
  -m ./models/qwen25-14b/qwen2.5-14b-instruct-q4_k_m.gguf \
  --n-gpu-layers 28 \
  --ctx-size 2048 \
  --threads 8 \
  --temp 0.7 \
  -p "You are a helpful assistant." \
  --interactive
Expected output (7B full offload):
llm_load_tensors: offloaded 28/28 layers to GPU
llama_new_context_with_model: KV self size = 256.00 MiB
...
>
If you see: CUDA error: out of memory
- Drop --n-gpu-layers by 4 and retry
- Reduce --ctx-size to 2048
- Close GPU-heavy background apps (games, browsers with hardware acceleration)

If you see: offloaded 0/28 layers to GPU
- Your build does not have CUDA enabled; rebuild with -DGGML_CUDA=ON
Step 4: Run as an OpenAI-Compatible Server
For integration with tools like Open WebUI or your own Python app:
./build/bin/llama-server \
-m ./models/qwen25-7b/qwen2.5-7b-instruct-q5_k_m.gguf \
--n-gpu-layers 28 \
--ctx-size 4096 \
--host 0.0.0.0 \
--port 8080
Test the endpoint:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen2.5-7b",
"messages": [{"role": "user", "content": "Explain KV cache in one sentence."}]
}'
Expected output: JSON response with choices[0].message.content populated.
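The same endpoint can be called from Python with nothing beyond the standard library. A minimal sketch, assuming the server above is listening on port 8080 (the model field is largely informational when the server has a single model loaded):

```python
import json
import urllib.request

def build_chat_payload(prompt: str, model: str = "qwen2.5-7b") -> dict:
    """Assemble an OpenAI-style chat completion request body."""
    return {"model": model,
            "messages": [{"role": "user", "content": prompt}]}

def chat(prompt: str,
         url: str = "http://localhost:8080/v1/chat/completions") -> str:
    req = urllib.request.Request(
        url,
        data=json.dumps(build_chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# chat("Explain KV cache in one sentence.")  # requires llama-server running on :8080
```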
Option 2: Ollama (Simpler API Interface)
Ollama wraps llama.cpp with model management and a clean REST API. It's the better choice if you want fast setup, model versioning, or are running multiple models on the same machine.
Step 1: Install Ollama
# Linux one-liner
curl -fsSL https://ollama.com/install.sh | sh
# macOS
brew install ollama
# Windows: download installer from https://ollama.com
Verify: ollama --version should return 0.5.x or higher.
Step 2: Pull Qwen2.5 GGUF via Ollama
Ollama maintains official Qwen2.5 model tags with Q4_K_M quantization by default:
# 7B — default Q4_K_M (fastest, ~4.5GB download)
ollama pull qwen2.5:7b
# 7B — Q5_K_M for better quality
ollama pull qwen2.5:7b-instruct-q5_k_m
# 14B — Q4_K_M (partial GPU offload on 8GB)
ollama pull qwen2.5:14b
Ollama auto-detects your GPU and sets --n-gpu-layers to the maximum that fits. On 8GB VRAM it will fully offload Qwen2.5 7B and partially offload 14B without any manual config.
Step 3: Run and Test
# Interactive chat
ollama run qwen2.5:7b-instruct-q5_k_m
# Single prompt
ollama run qwen2.5:7b "Write a Python function to flatten a nested list."
REST API (Ollama runs on port 11434 by default):
curl http://localhost:11434/api/chat \
-d '{
"model": "qwen2.5:7b-instruct-q5_k_m",
"messages": [{"role": "user", "content": "What is the capital of France?"}],
"stream": false
}'
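The same call from Python, again stdlib-only. Note "stream": False — without it Ollama returns newline-delimited JSON chunks rather than a single object. A sketch assuming the default port 11434:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"

def build_payload(prompt: str,
                  model: str = "qwen2.5:7b-instruct-q5_k_m") -> dict:
    # stream=False: one complete JSON response instead of NDJSON chunks
    return {"model": model,
            "messages": [{"role": "user", "content": prompt}],
            "stream": False}

def ollama_chat(prompt: str) -> str:
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]

# ollama_chat("What is the capital of France?")  # requires the Ollama daemon running
```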
Verification: Check GPU Utilization
Run this while inference is active to confirm the GPU is doing the work:
# NVIDIA
watch -n 1 nvidia-smi
# AMD (ROCm)
watch -n 1 rocm-smi
What you should see:
- GPU utilization: 80–99% during generation
- VRAM used: 5–7.5GB for Qwen2.5 7B Q5_K_M
- GPU utilization dropping to ~5% between prompts (normal)
If GPU utilization stays at 0–5% during generation, layers are running on CPU. Increase --n-gpu-layers or verify your CUDA/ROCm build.
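If you want this check in a script rather than a terminal, nvidia-smi's CSV query mode is easy to parse. A sketch for NVIDIA cards (the query fields used are standard nvidia-smi options; AMD users would shell out to rocm-smi instead):

```python
import subprocess

def parse_gpu_line(line: str) -> tuple:
    """Parse one 'util, mem' CSV line, e.g. '87, 6132' -> (87, 6132)."""
    util, mem = line.strip().split(", ")
    return int(util), int(mem)

def gpu_stats() -> tuple:
    """Return (utilization %, VRAM used in MiB) for GPU 0."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_line(out.splitlines()[0])

# gpu_stats()  # expect high utilization and 5-7.5GB used while generating
```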
Expected Performance on 8GB GPUs
| GPU | Model | Quant | Layers offloaded | Tokens/sec |
|---|---|---|---|---|
| RTX 3070 8GB | Qwen2.5 7B | Q5_K_M | 28/28 | ~35–45 t/s |
| RTX 4060 8GB | Qwen2.5 7B | Q5_K_M | 28/28 | ~40–50 t/s |
| RTX 3070 8GB | Qwen2.5 14B | Q4_K_M | 28/48 | ~12–18 t/s |
| RX 6700 XT 12GB | Qwen2.5 7B | Q5_K_M | 28/28 | ~30–38 t/s |
Prompt processing (prefill) is faster than generation. A 512-token prompt processes in under 2 seconds on RTX 3070 for the 7B model.
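You can measure your own numbers instead of trusting the table. Ollama's non-streaming responses include eval_count (generated tokens) and eval_duration (nanoseconds), so throughput is one division:

```python
def tokens_per_sec(resp: dict) -> float:
    """Generation speed from an Ollama /api/chat or /api/generate response."""
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)

# Example response fragment: 400 tokens generated in 10 s
print(tokens_per_sec({"eval_count": 400, "eval_duration": 10_000_000_000}))  # prints 40.0
```

llama.cpp prints equivalent timings ("eval time ... tokens per second") to stderr at the end of each llama-cli run.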
What You Learned
- Q4_K_M is the 8GB sweet spot for 14B models; Q5_K_M is the sweet spot for 7B
- --n-gpu-layers must match the model's actual layer count for full offload, not just a high number
- Ollama handles layer selection automatically; llama.cpp gives you manual control for edge cases
- Leaving 1GB VRAM headroom prevents mid-generation OOM errors from KV cache growth
- Partial GPU offload (14B on 8GB) is viable but expect 2–3× slower generation than full offload
Tested on llama.cpp b3600, Ollama 0.5.4, CUDA 12.4, Ubuntu 24.04 and Windows 11 with RTX 3070.
FAQ
Q: Can I run Qwen2.5 14B fully on 8GB VRAM? A: No. The 14B Q4_K_M file is ~8.7 GB, larger than 8GB VRAM before accounting for KV cache. You can offload ~28 of its 48 layers to GPU and run the rest on CPU RAM, which works but reduces speed significantly.
Q: What is the difference between Q4_K_M and Q4_0? A: Q4_K_M uses mixed-precision quantization per block, preserving accuracy better than Q4_0 for the same file size. Always prefer Q4_K_M over Q4_0 if both are available.
Q: Does Qwen2.5 GGUF work on AMD GPUs?
A: Yes, with ROCm 6.x on Linux. Build llama.cpp with -DGGML_HIPBLAS=ON instead of -DGGML_CUDA=ON. Ollama on Linux with ROCm also works — set HSA_OVERRIDE_GFX_VERSION if your card isn't auto-detected.
Q: How much RAM (not VRAM) do I need for the 14B partial offload? A: At least 16GB system RAM. The ~20 CPU-side layers for 14B Q4_K_M consume roughly 3–4GB of RAM plus OS overhead.
Q: Can this run on a MacBook with 8GB unified memory? A: Yes. Apple Silicon uses unified memory, so 8GB M-series chips treat it as both RAM and VRAM. Ollama on macOS uses Metal automatically. Expect ~20–28 t/s for Qwen2.5 7B Q5_K_M on M2 8GB.