Problem: Your Local LLM is Too Slow or Won't Fit in VRAM
You downloaded a 70B model and it either won't load on your GPU or runs at 2 tokens/second. You see GGUF and EXL2 formats but don't know which one actually runs faster on your hardware.
You'll learn:
- When GGUF beats EXL2 (and vice versa)
- How to calculate if a model fits your setup
- Which format to download based on your GPU
Time: 12 min | Level: Intermediate
Why This Matters
Quantization reduces model size from 140GB (70B model in FP16) to 4-40GB by using fewer bits per parameter. GGUF and EXL2 use different compression strategies, making one faster than the other depending on your hardware.
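The arithmetic behind those numbers is simple (a quick sketch; these figures count weight storage only, excluding context and activation overhead):

```python
# Weight storage: params x bits_per_weight / 8 = bytes
params = 70e9  # 70B model

fp16_gb = params * 16 / 8 / 1e9
q4_gb = params * 4 / 8 / 1e9

print(f"FP16: {fp16_gb:.0f} GB")   # FP16: 140 GB
print(f"4-bit: {q4_gb:.0f} GB")    # 4-bit: 35 GB
```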
Common symptoms:
- Model loads but inference is painfully slow
- Out-of-memory errors during loading
- GPU shows 100% utilization but low token/s
The Difference
GGUF uses CPU-friendly quantization designed for llama.cpp. It splits work between CPU and GPU efficiently. EXL2 is GPU-only quantization optimized for ExLlamaV2, achieving higher throughput when fully GPU-accelerated.
Hardware requirement split:
| Format | Best For | Min VRAM | CPU RAM |
|---|---|---|---|
| GGUF Q4_K_M | Mixed CPU/GPU, Apple Silicon | 6GB+ | 16GB+ |
| GGUF Q8_0 | High accuracy, offloading | 12GB+ | 32GB+ |
| EXL2 4.0bpw | Pure GPU inference | 12GB+ | 8GB |
| EXL2 6.0bpw | Best quality, high VRAM | 24GB+ | 8GB |
Solution
Step 1: Check Your Hardware
# For NVIDIA GPUs
nvidia-smi --query-gpu=name,memory.total --format=csv
# For AMD
rocm-smi --showmeminfo vram
# For Apple Silicon
system_profiler SPHardwareDataType | grep Memory
Expected: the command prints your GPU name and total VRAM. Note your system RAM as well.
Rule of thumb:
- GGUF: the layers you offload plus ~2GB of overhead must fit in VRAM; the remaining layers live in CPU RAM
- EXL2: the whole model plus ~1GB of overhead must fit in VRAM; CPU RAM needs are minimal
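The rules of thumb above can be sketched as a small helper (illustrative only; `pick_format` and its exact thresholds are my assumptions, not a real tool):

```python
def pick_format(model_gb, vram_gb, ram_gb):
    """Rough heuristic applying the VRAM/RAM rules of thumb."""
    if model_gb + 1 <= vram_gb:
        return "EXL2"       # whole model + ~1GB overhead fits in VRAM
    if model_gb + 2 <= vram_gb + ram_gb:
        return "GGUF"       # split the layers across GPU and CPU
    return "won't fit"

print(pick_format(6.5, 12, 32))   # 13B @ 4-bit, 12GB card -> EXL2
print(pick_format(35, 24, 32))    # 70B @ 4-bit, 24GB card -> GGUF
```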
Step 2: Calculate Model Size
# Quick calculator for model fit
def can_it_fit(params_billions, bits_per_weight, vram_gb):
    # Model size in GB (rough estimate)
    model_size = (params_billions * bits_per_weight) / 8
    # Overhead for context and activations
    total_needed = model_size + 2
    return total_needed <= vram_gb

# Example: 13B model at 4-bit quantization
print(can_it_fit(13, 4, 12))  # True for 12GB GPU
print(can_it_fit(70, 4, 24))  # False - needs ~37GB
Why this works: Each parameter needs bits_per_weight storage. Context and activations add overhead.
Common model sizes:
7B @ 4-bit = ~3.5GB weights, ~5.5GB VRAM
13B @ 4-bit = ~6.5GB weights, ~8.5GB VRAM
34B @ 4-bit = ~17GB weights, ~19GB VRAM
70B @ 4-bit = ~35GB weights, ~37GB VRAM
Step 3: Choose Format Based on Setup
Scenario A: You have 24GB+ VRAM (RTX 4090, A5000)
Use EXL2 for pure speed. Match the bpw to your VRAM with the Step 2 formula: the 70B-at-6.0bpw example below weighs ~52GB, so it needs ~48GB+ of VRAM (e.g., two 24GB cards); on a single 24GB card, pick a lower bpw or a smaller model.
# Download EXL2 model
huggingface-cli download turboderp/Llama-3.1-70B-Instruct-exl2 \
--include "6.0bpw/*" \
--local-dir ./models/llama-70b-exl2
# Run with ExLlamaV2
python -m exllamav2.server \
--model ./models/llama-70b-exl2/6.0bpw \
--max-seq-len 4096 \
--port 5000
Expected: throughput varies with bpw, card, and context length; a fully GPU-resident EXL2 model typically runs 1.5-2x faster than the same model as GGUF (see Verification).
If it fails:
- Error: "CUDA out of memory": Drop to 4.0bpw version
- Slow performance: Check GPU utilization with nvidia-smi dmon
Scenario B: You have 8-16GB VRAM + 32GB+ RAM (RTX 3060, 4060 Ti)
Use GGUF with layer offloading.
# Download GGUF model
huggingface-cli download TheBloke/Llama-2-13B-chat-GGUF \
--include "*Q4_K_M.gguf" \
--local-dir ./models/
# Run with llama.cpp (offload 35 layers to GPU)
./llama-cli \
-m ./models/llama-2-13b-chat.Q4_K_M.gguf \
-n 512 \
-ngl 35 \
--temp 0.7
Why this works: -ngl 35 offloads 35 of the model's ~41 layers to the GPU; the remaining layers run on CPU. This balances speed vs. VRAM.
Expected: 15-25 tokens/s depending on layers offloaded.
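Rather than pure trial and error, you can estimate a starting -ngl analytically (a sketch; the ~7.9GB file size and 41-layer count are illustrative assumptions, check your model card):

```python
def max_gpu_layers(model_gb, n_layers, vram_gb, reserve_gb=2.0):
    """Estimate how many layers fit in VRAM, keeping some headroom."""
    per_layer_gb = model_gb / n_layers
    return min(n_layers, int((vram_gb - reserve_gb) / per_layer_gb))

# 13B Q4_K_M is ~7.9GB on disk with ~41 layers (illustrative numbers)
print(max_gpu_layers(7.9, 41, 8))    # 8GB card -> 31
print(max_gpu_layers(7.9, 41, 12))   # 12GB card -> 41 (all layers fit)
```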
Finding optimal -ngl value:
# Start low, increase until VRAM fills
for layers in 20 25 30 35 40; do
  echo "Testing $layers layers..."
  ./llama-cli -m model.gguf -p "test" -n 10 -ngl "$layers" 2>&1 | grep "tokens per second"
done
Scenario C: Apple Silicon (M1/M2/M3 with unified memory)
Use GGUF Metal acceleration.
# llama.cpp automatically uses Metal on macOS
./llama-cli \
-m ./models/llama-2-13b-chat.Q4_K_M.gguf \
-n 512 \
--n-gpu-layers 999 \
--temp 0.7
Expected: 20-40 tokens/s on M2 Max/Ultra with 64GB+ RAM.
Note: EXL2 doesn't support Metal. Stick with GGUF on Apple Silicon.
Step 4: Benchmark Your Choice
import time
import requests

def benchmark_inference(url, prompt, n_tokens=100):
    start = time.time()
    response = requests.post(f"{url}/v1/completions", json={
        "prompt": prompt,
        "max_tokens": n_tokens,
        "stream": False,
    })
    elapsed = time.time() - start
    # Assumes the server generated all n_tokens; if available, check
    # response.json()["usage"]["completion_tokens"] for the true count
    tokens_per_sec = n_tokens / elapsed
    print(f"Speed: {tokens_per_sec:.2f} tokens/s")
    print(f"Time: {elapsed:.2f}s for {n_tokens} tokens")
    return tokens_per_sec

# Test your server
benchmark_inference(
    "http://localhost:5000",
    "Explain quantum computing in simple terms:",
    n_tokens=200,
)
Expected output:
Speed: 45.23 tokens/s
Time: 4.42s for 200 tokens
Verification
Run both formats if possible and compare:
# GGUF test
./llama-cli -m model.Q4_K_M.gguf -n 100 -p "Test prompt" --log-disable
# EXL2 test
python test_inference.py -m model_exl2/ -p "Test prompt" -t 100  # script ships in the exllamav2 repo checkout
You should see: Token/s metrics printed. EXL2 typically 1.5-2x faster when fully GPU-resident.
Quality check: Both formats at same bit-width produce nearly identical outputs. Quality loss comes from bits-per-weight, not format.
What You Learned
GGUF excels when you need CPU offloading or run on Apple Silicon. EXL2 wins for pure GPU inference with 20GB+ VRAM. Quality depends on bits-per-weight, not format choice.
Key insight: The format doesn't change model intelligence. A 4-bit GGUF and 4.0bpw EXL2 have identical capability but different speed profiles.
When NOT to use this:
- Don't mix formats (GGUF models won't run on ExLlamaV2)
- Don't assume higher bits = better (8-bit vs 4-bit shows minimal quality gain for most tasks)
- Don't ignore your CPU - even with EXL2, CPU bottlenecks on prompt processing
Limitations:
- Benchmarks vary by model architecture (Llama vs Mistral vs Qwen)
- Context length impacts VRAM needs (4K vs 32K context)
- Batch processing changes optimal quantization
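The context-length point is worth quantifying: the KV cache grows linearly with context. A rough FP16 estimate, using Llama-2-13B's layer and head counts for illustration (its native context is 4K; the 32K figure shows an extended-context scenario):

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # 2 tensors (K and V) per layer; FP16 = 2 bytes per element
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

# Llama-2-13B: 40 layers, 40 KV heads (no GQA), head_dim 128
print(f"{kv_cache_gb(40, 40, 128, 4096):.2f} GB at 4K context")    # 3.36 GB
print(f"{kv_cache_gb(40, 40, 128, 32768):.2f} GB at 32K context")  # 26.84 GB
```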
Quick Reference
# GGUF formats explained
Q2_K # 2.5-3.0 bpw - smallest, quality loss visible
Q4_K_M # 4.5 bpw - best size/quality balance
Q5_K_M # 5.5 bpw - high quality
Q8_0 # 8.0 bpw - near-lossless, large
# EXL2 formats
2.0bpw # Extreme compression, quality loss
4.0bpw # Balanced
6.0bpw # High quality
8.0bpw # Maximum quality
Real-world recommendations:
| GPU | Model Size | Format | Speed |
|---|---|---|---|
| RTX 4090 (24GB) | 34B | EXL2 4.0bpw | 40-50 tok/s |
| RTX 4060 Ti (16GB) | 34B | GGUF Q4_K_M | 15-20 tok/s |
| RTX 3060 (12GB) | 13B | EXL2 6.0bpw | 35-45 tok/s |
| M2 Max (64GB) | 70B | GGUF Q4_K_M | 25-35 tok/s |
Tested on RTX 4090, RTX 4060 Ti, M2 Max with llama.cpp b2267, ExLlamaV2 0.0.18, Feb 2026