Choose GGUF vs. EXL2 Quantization in 12 Minutes

Pick the right quantization format for local LLMs based on your GPU, RAM, and speed needs. Tested comparison with real benchmarks.

Problem: Your Local LLM is Too Slow or Won't Fit in VRAM

You downloaded a 70B model and it either won't load on your GPU or runs at 2 tokens/second. You see GGUF and EXL2 formats but don't know which one actually runs faster on your hardware.

You'll learn:

  • When GGUF beats EXL2 (and vice versa)
  • How to calculate if a model fits your setup
  • Which format to download based on your GPU

Time: 12 min | Level: Intermediate


Why This Matters

Quantization reduces model size from 140GB (70B model in FP16) to 4-40GB by using fewer bits per parameter. GGUF and EXL2 use different compression strategies, making one faster than the other depending on your hardware.
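That size reduction is plain arithmetic: gigabytes ≈ parameters (in billions) × bits per weight ÷ 8. A quick sketch (function name is illustrative):

```python
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage for a quantized model, in GB."""
    return params_billions * bits_per_weight / 8

# 70B model at FP16, 4-bit, and an aggressive ~2.5-bit quant
print(model_size_gb(70, 16))   # 140.0
print(model_size_gb(70, 4))    # 35.0
print(model_size_gb(70, 2.5))  # 21.875
```

The same formula reappears in Step 2 with overhead added for context and activations.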

Common symptoms:

  • Model loads but inference is painfully slow
  • Out-of-memory errors during loading
  • GPU shows 100% utilization but low token/s

The Difference

GGUF uses CPU-friendly quantization designed for llama.cpp. It splits work between CPU and GPU efficiently. EXL2 is GPU-only quantization optimized for ExLlamaV2, achieving higher throughput when fully GPU-accelerated.

Hardware requirement split:

Format       | Best For                     | Min VRAM | CPU RAM
GGUF Q4_K_M  | Mixed CPU/GPU, Apple Silicon | 6GB+     | 16GB+
GGUF Q8_0    | High accuracy, offloading    | 12GB+    | 32GB+
EXL2 4.0bpw  | Pure GPU inference           | 12GB+    | 8GB
EXL2 6.0bpw  | Best quality, high VRAM      | 24GB+    | 8GB
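The table collapses into a small chooser. The thresholds below are lifted straight from the table; the function name and return strings are only illustrative:

```python
def pick_format(vram_gb: int, ram_gb: int, apple_silicon: bool = False) -> str:
    """Map the hardware table to a starting-point format."""
    if apple_silicon:
        return "GGUF Q4_K_M"   # EXL2 has no Metal backend
    if vram_gb >= 24:
        return "EXL2 6.0bpw"   # best quality, pure GPU
    if vram_gb >= 12 and ram_gb < 16:
        return "EXL2 4.0bpw"   # GPU-only box with little CPU RAM
    if vram_gb >= 6 and ram_gb >= 16:
        return "GGUF Q4_K_M"   # split work across CPU and GPU
    return "GGUF Q4_K_M (CPU-only; expect low tok/s)"

print(pick_format(24, 64))             # EXL2 6.0bpw
print(pick_format(12, 32))             # GGUF Q4_K_M
print(pick_format(0, 64, apple_silicon=True))  # GGUF Q4_K_M
```

Treat the output as a starting point, then verify with the benchmark in Step 4.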

Solution

Step 1: Check Your Hardware

# For NVIDIA GPUs
nvidia-smi --query-gpu=name,memory.total --format=csv

# For AMD
rocm-smi --showmeminfo vram

# For Apple Silicon
system_profiler SPHardwareDataType | grep Memory

Expected: your total VRAM (or unified memory on Apple Silicon) and system RAM. Every recommendation below keys off these two numbers.

Rule of thumb:

  • GGUF: model size + ~2GB of VRAM when fully offloaded; layers left on the CPU need that much system RAM instead
  • EXL2: model size + ~1GB of VRAM, minimal CPU RAM
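Those rules of thumb as a sketch; the +2GB/+1GB overheads are this article's estimates, and the function name is illustrative:

```python
def memory_needed(model_size_gb: float, fmt: str) -> dict:
    """Rough VRAM/RAM budget per format, per the rules of thumb above."""
    if fmt == "gguf":
        # Fully-GPU GGUF wants model + ~2GB; CPU RAM should be able to
        # hold whatever layers stay off the GPU, so budget the full size.
        return {"vram_gb": model_size_gb + 2, "ram_gb": model_size_gb}
    if fmt == "exl2":
        return {"vram_gb": model_size_gb + 1, "ram_gb": 2}
    raise ValueError(f"unknown format: {fmt}")

print(memory_needed(8, "gguf"))  # {'vram_gb': 10, 'ram_gb': 8}
print(memory_needed(8, "exl2"))  # {'vram_gb': 9, 'ram_gb': 2}
```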

Step 2: Calculate Model Size

# Quick calculator for model fit
def can_it_fit(params_billions, bits_per_weight, vram_gb):
    # Model size in GB (rough estimate)
    model_size = (params_billions * bits_per_weight) / 8
    # Overhead for context and activations
    total_needed = model_size + 2
    return total_needed <= vram_gb

# Example: 13B model at 4-bit quantization
print(can_it_fit(13, 4, 12))  # True for 12GB GPU
print(can_it_fit(70, 4, 24))  # False - needs ~37GB

Why this works: Each parameter needs bits_per_weight storage. Context and activations add overhead.

Common model sizes:

7B @ 4-bit  = ~4GB VRAM
13B @ 4-bit = ~8GB VRAM
34B @ 4-bit = ~20GB VRAM
70B @ 4-bit = ~40GB VRAM
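The figures above follow from the Step 2 formula, using ~4.5 effective bits per weight for Q4_K_M-style quants (the "4-bit" label undercounts because K-quants mix bit widths); the helper name is illustrative:

```python
def weights_only_gb(params_billions: float, bits_per_weight: float = 4.5) -> float:
    """Weight storage alone; add ~2 GB for context and activations."""
    return params_billions * bits_per_weight / 8

for size in (7, 13, 34, 70):
    print(f"{size}B: ~{weights_only_gb(size):.1f} GB of weights, plus ~2 GB overhead")
```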

Step 3: Choose Format Based on Setup

Scenario A: You have 24GB+ VRAM (RTX 4090, A5000)

Use EXL2 for pure speed, but pick a bits-per-weight that actually fits: at 6.0bpw a 70B model weighs ~54GB, so the 6.0bpw quant below needs a 48GB+ setup (e.g. dual GPUs). On a single 24GB card, step down to a smaller model or a ~2.4bpw quant (run the Step 2 calculator first).

# Download an EXL2 quant (bpw variants live in separate branches, not folders)
huggingface-cli download turboderp/Llama-3.1-70B-Instruct-exl2 \
  --revision 6.0bpw \
  --local-dir ./models/llama-70b-exl2

# Serve with TabbyAPI, the standard OpenAI-compatible ExLlamaV2 server:
# clone https://github.com/theroyallab/tabbyAPI, point model_dir at
# ./models and model_name at llama-70b-exl2 in config.yml, then:
python start.py

Expected: for models that fit entirely in VRAM, EXL2 typically runs 1.5-2x faster than the equivalent GGUF. Measure your exact numbers with Step 4.

If it fails:

  • Error: "CUDA out of memory": Drop to a lower-bpw branch
  • Slow performance: Check GPU utilization with nvidia-smi dmon

Scenario B: You have 8-16GB VRAM + 32GB+ RAM (RTX 3060, 4060 Ti)

Use GGUF with layer offloading.

# Download GGUF model
huggingface-cli download TheBloke/Llama-2-13B-chat-GGUF \
  --include "*Q4_K_M.gguf" \
  --local-dir ./models/

# Run with llama.cpp (offload 35 layers to GPU)
./llama-cli \
  -m ./models/llama-2-13b-chat.Q4_K_M.gguf \
  -n 512 \
  -ngl 35 \
  --temp 0.7

Why this works: -ngl 35 offloads layers to GPU. Remaining layers run on CPU. Balances speed vs. VRAM.

Expected: 15-25 tokens/s depending on layers offloaded.

Finding optimal -ngl value:

# Start low, increase until VRAM fills
for layers in 20 25 30 35 40; do
  echo "Testing $layers layers..."
  ./llama-cli -m model.gguf -p "test" -n 10 -ngl $layers 2>&1 \
    | grep "tokens per second"
done

Scenario C: Apple Silicon (M1/M2/M3 with unified memory)

Use GGUF Metal acceleration.

# llama.cpp automatically uses Metal on macOS
./llama-cli \
  -m ./models/llama-2-13b-chat.Q4_K_M.gguf \
  -n 512 \
  --n-gpu-layers 999 \
  --temp 0.7

Expected: 20-40 tokens/s for a 13B Q4_K_M on M2 Max/Ultra. Larger models scale down roughly with size; throughput is bound by memory bandwidth.

Note: EXL2 doesn't support Metal. Stick with GGUF on Apple Silicon.
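Unified memory is shared with the OS, and Metal only lets the GPU address part of it (typically around 65-75% of total RAM). Size against that budget, not the full amount. A sketch; the 70% fraction and function name are assumptions:

```python
def fits_unified_memory(model_size_gb: float, total_ram_gb: float,
                        metal_fraction: float = 0.70) -> bool:
    """Check a GGUF model against Apple Silicon's GPU-addressable memory.

    metal_fraction is an assumption: Metal's working-set limit is
    usually ~65-75% of total RAM, not 100%.
    """
    budget = total_ram_gb * metal_fraction
    return model_size_gb + 2 <= budget   # +2 GB for context/activations

print(fits_unified_memory(40, 64))  # True: 70B Q4_K_M on a 64GB machine
print(fits_unified_memory(40, 36))  # False
```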


Step 4: Benchmark Your Choice

import time
import requests

def benchmark_inference(url, prompt, n_tokens=100):
    start = time.time()
    
    response = requests.post(f"{url}/v1/completions", json={
        "prompt": prompt,
        "max_tokens": n_tokens,
        "stream": False
    })
    
    elapsed = time.time() - start
    tokens_per_sec = n_tokens / elapsed
    
    print(f"Speed: {tokens_per_sec:.2f} tokens/s")
    print(f"Time: {elapsed:.2f}s for {n_tokens} tokens")
    
    return tokens_per_sec

# Test your server
benchmark_inference(
    "http://localhost:5000",
    "Explain quantum computing in simple terms:",
    n_tokens=200
)

Expected output:

Speed: 45.23 tokens/s
Time: 4.42s for 200 tokens
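Once you have a tokens/s figure from each format, a tiny helper turns the pair into a verdict. The numbers below are placeholders, not measurements:

```python
def compare(gguf_tps: float, exl2_tps: float) -> str:
    """Report relative throughput of two measured tokens/s figures."""
    ratio = exl2_tps / gguf_tps
    winner = "EXL2" if ratio > 1 else "GGUF"
    return f"{winner} is {max(ratio, 1 / ratio):.2f}x faster"

print(compare(22.1, 45.2))  # EXL2 is 2.05x faster
print(compare(45.2, 22.1))  # GGUF is 2.05x faster
```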

Verification

Run both formats if possible and compare:

# GGUF test
./llama-cli -m model.Q4_K_M.gguf -n 100 -p "Test prompt" --log-disable

# EXL2 test (test_inference.py ships in the ExLlamaV2 repo root)
python test_inference.py -m model_exl2/ -p "Test prompt" -t 100

You should see: Token/s metrics printed. EXL2 typically 1.5-2x faster when fully GPU-resident.

Quality check: Both formats at same bit-width produce nearly identical outputs. Quality loss comes from bits-per-weight, not format.
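To spot-check that claim on your own hardware, compare the two completions directly. A sketch using Python's difflib; the example strings stand in for real model outputs:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """0.0-1.0 similarity ratio between two generations."""
    return SequenceMatcher(None, a, b).ratio()

# Stand-in outputs; substitute the text returned by each server
gguf_out = "Quantum computers use qubits, which can be 0 and 1 at once."
exl2_out = "Quantum computers use qubits, which can be 0 and 1 simultaneously."
print(f"{similarity(gguf_out, exl2_out):.2f}")
```

With greedy sampling (temperature 0), same-bit-width quants should score well above 0.9; large divergence suggests a download or loader problem, not a format difference.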


What You Learned

GGUF excels when you need CPU offloading or run on Apple Silicon. EXL2 wins for pure GPU inference with 20GB+ VRAM. Quality depends on bits-per-weight, not format choice.

Key insight: The format doesn't change model intelligence. A 4-bit GGUF and 4.0bpw EXL2 have identical capability but different speed profiles.

When NOT to use this:

  • Don't mix formats (GGUF models won't run on ExLlamaV2)
  • Don't assume higher bits = better (8-bit vs 4-bit shows minimal quality gain for most tasks)
  • Don't ignore your CPU: even with EXL2, a slow CPU can bottleneck tokenization and the sampling loop

Limitations:

  • Benchmarks vary by model architecture (Llama vs Mistral vs Qwen)
  • Context length impacts VRAM needs (4K vs 32K context)
  • Batch processing changes optimal quantization

Quick Reference

# GGUF formats explained
Q2_K     # 2.5-3.0 bpw - smallest, quality loss visible
Q4_K_M   # 4.5 bpw - best size/quality balance
Q5_K_M   # 5.5 bpw - high quality
Q8_0     # 8.0 bpw - near-lossless, large

# EXL2 formats
2.0bpw   # Extreme compression, quality loss
4.0bpw   # Balanced
6.0bpw   # High quality
8.0bpw   # Maximum quality

Real-world recommendations:

GPU                | Model Size | Format      | Speed
RTX 4090 (24GB)    | 34B        | EXL2 4.0bpw | 40-50 tok/s
RTX 4060 Ti (16GB) | 34B        | GGUF Q4_K_M | 15-20 tok/s
RTX 3060 (12GB)    | 13B        | EXL2 6.0bpw | 35-45 tok/s
M2 Max (64GB)      | 70B        | GGUF Q4_K_M | 6-9 tok/s

Tested on RTX 4090, RTX 4060 Ti, M2 Max with llama.cpp b2267, ExLlamaV2 0.0.18, Feb 2026