Problem: Your LLM Is Too Big to Run Locally
You want to run a 7B or 13B model on your own hardware, but unquantized FP16 weights demand 14–26GB of VRAM — more than most consumer GPUs have. Quantization shrinks models by 2–4x with minimal quality loss.
You'll learn:
- How GGUF and AWQ differ and when to use each
- How to quantize any HuggingFace model to GGUF with llama.cpp
- How to quantize to AWQ for GPU-accelerated inference with vLLM or AutoAWQ
Time: 20 min | Level: Intermediate
Why This Happens
Full-precision (FP32) or half-precision (FP16) weights store each parameter as 4 or 2 bytes. A 7B model at FP16 = ~14GB. Quantization maps those weights to lower bit-widths (4-bit, 8-bit), cutting memory by 50–75%.
Two formats dominate in 2026:
- GGUF — CPU-friendly, runs via llama.cpp or Ollama, works on any hardware with enough RAM
- AWQ — GPU-optimized, 4-bit with activation-aware calibration, faster than naive INT4
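The memory math is simple enough to sanity-check yourself. Here's a back-of-the-envelope calculator for weight storage at different bit-widths (a sketch: real usage adds KV cache, activations, and runtime overhead on top):

```python
# Approximate weight-only memory for a model at a given bit-width.
# Real-world usage is higher: KV cache, activations, and framework overhead.

def weight_memory_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Weight storage in GB (1e9 bytes): params * bits / 8 bits-per-byte."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

for label, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("4-bit", 4)]:
    print(f"7B at {label}: ~{weight_memory_gb(7, bits):.1f} GB")
# 7B at FP32: ~28.0 GB
# 7B at FP16: ~14.0 GB
# 7B at INT8: ~7.0 GB
# 7B at 4-bit: ~3.5 GB
```

This is why a 7B model that won't fit on a 12GB card at FP16 runs comfortably at 4-bit.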
Common symptoms that send you here:
- `RuntimeError: CUDA out of memory` when loading a model
- Inference taking 30+ seconds per token on CPU with full weights
- Can't fit a useful model in your GPU's VRAM
Solution
Step 1: Set Up Your Environment
You need Python 3.11+ and either a GPU (for AWQ) or CPU (for GGUF). Start with a clean venv.
python -m venv quant-env
source quant-env/bin/activate # Windows: quant-env\Scripts\activate
# Core deps
pip install huggingface_hub transformers torch
Download the base model you want to quantize. This example uses Mistral-7B-v0.3, but any HuggingFace causal LM works.
huggingface-cli download mistralai/Mistral-7B-v0.3 \
--local-dir ./models/mistral-7b-fp16 \
--local-dir-use-symlinks False
Expected: Model files (~14GB) download to ./models/mistral-7b-fp16/.
If it fails:
- `401 Unauthorized`: Run `huggingface-cli login` first — some models require accepting a license
- Slow download: Add `--exclude "*.bin"` to skip duplicate PyTorch checkpoints when `.safetensors` copies exist
Step 2: Quantize to GGUF (CPU-Friendly)
GGUF runs on any machine via llama.cpp. Clone and build it first.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Option A: CUDA build (optional, but faster and enables GPU offload)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

# Option B: CPU-only build (pick one, not both)
cmake -B build
cmake --build build --config Release -j$(nproc)
Now convert the FP16 model to GGUF, then quantize it. The conversion script needs llama.cpp's Python dependencies: run `pip install -r requirements.txt` from the llama.cpp directory first.
# Step A: Convert HuggingFace weights to GGUF (FP16 baseline)
python convert_hf_to_gguf.py ../models/mistral-7b-fp16 \
--outfile ../models/mistral-7b-f16.gguf \
--outtype f16
# Step B: Quantize to Q4_K_M (best quality/size tradeoff in 2026)
./build/bin/llama-quantize \
../models/mistral-7b-f16.gguf \
../models/mistral-7b-Q4_K_M.gguf \
Q4_K_M
GGUF quantization types — pick based on your use case:
| Type | Size (7B) | Quality | Use when |
|---|---|---|---|
| Q4_K_M | ~4.1GB | ★★★★☆ | Best default choice |
| Q5_K_M | ~4.8GB | ★★★★★ | More VRAM, better coherence |
| Q3_K_M | ~3.3GB | ★★★☆☆ | RAM-constrained (8GB system) |
| Q8_0 | ~7.7GB | ★★★★★ | Near-lossless; large, so best with GPU offload |
Expected: Quantization takes 2–5 minutes and produces a single .gguf file.
If it fails:
- `unknown model architecture`: Your model isn't supported yet — check llama.cpp issues for a conversion PR
- `Segmentation fault` on build: Use the CPU-only cmake build without `-DGGML_CUDA=ON`
Step 3: Test Your GGUF Model
Verify quality before moving on.
./build/bin/llama-cli \
-m ../models/mistral-7b-Q4_K_M.gguf \
-p "Explain quantization in one sentence:" \
-n 80 \
--temp 0.7
You should see: A coherent one-sentence response generated in under 5 seconds on a modern CPU.
llama-cli output — response quality should be indistinguishable from FP16 for most prompts
Step 4: Quantize to AWQ (GPU-Accelerated)
AWQ (Activation-aware Weight Quantization) calibrates using real data, making 4-bit GPU inference significantly better than naive INT4. Requires an NVIDIA GPU with 16GB+ VRAM for the calibration step.
cd .. # Back to project root
pip install autoawq autoawq-kernels
# quantize_awq.py
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = "./models/mistral-7b-fp16"
output_path = "./models/mistral-7b-awq"
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Load model for quantization (stays on GPU)
model = AutoAWQForCausalLM.from_pretrained(
    model_path,
    low_cpu_mem_usage=True,
    use_cache=False
)
# Quantize config — w_bit=4 is standard; zero_point=True improves accuracy
quant_config = {
    "zero_point": True,
    "q_group_size": 128,  # Smaller groups = better accuracy; larger = more compression
    "w_bit": 4,
    "version": "GEMM"  # GEMM for batch inference, GEMV for single-token decoding
}
# Calibration uses 128 samples from WikiText — takes ~10 minutes
model.quantize(tokenizer, quant_config=quant_config)
# Save
model.save_quantized(output_path)
tokenizer.save_pretrained(output_path)
print(f"AWQ model saved to {output_path}")
python quantize_awq.py
Expected: Calibration prints loss values per layer, then saves ~4GB of quantized weights.
If it fails:
- `CUDA out of memory` during calibration: Reduce the number of calibration samples by passing your own `calib_data` — e.g. `model.quantize(tokenizer, quant_config=quant_config, calib_data=your_dataset)` with 64 samples
- `version: GEMM not supported`: Use `"version": "GEMV"` — some older GPUs don't support GEMM kernels
Step 5: Run AWQ Inference
# infer_awq.py
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "./models/mistral-7b-awq"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoAWQForCausalLM.from_quantized(
    model_path,
    fuse_layers=True,  # Fuses attention layers for a 20-30% speedup
    trust_remote_code=True,
    safetensors=True
)
# The AWQ wrapper is not a transformers PreTrainedModel, so skip pipeline()
# and call generate() on it directly
tokens = tokenizer("Explain quantization in one sentence:", return_tensors="pt").input_ids.cuda()
output = model.generate(tokens, max_new_tokens=80)
print(tokenizer.decode(output[0], skip_special_tokens=True))
python infer_awq.py
You should see: Output in 1–2 seconds on an RTX 4090, or 3–5 seconds on an RTX 3080.
AWQ inference keeps GPU utilization high — you want to see 80%+ sustained utilization
Verification
Run a quick benchmark to confirm your quantized models are performing correctly.
# GGUF benchmark (llama.cpp)
./llama.cpp/build/bin/llama-bench \
-m ./models/mistral-7b-Q4_K_M.gguf \
-n 128 -ngl 0 # -ngl 0 = CPU only; set to 35 for GPU offload
# AWQ: check memory usage
python -c "
from awq import AutoAWQForCausalLM
import torch
model = AutoAWQForCausalLM.from_quantized('./models/mistral-7b-awq', fuse_layers=True)
print(f'GPU memory: {torch.cuda.memory_allocated()/1e9:.1f}GB')
"
You should see:
- GGUF: 15–25 tokens/sec on a modern CPU (M3 Pro, Ryzen 9 9900X)
- AWQ GPU memory: ~4.2GB for a 7B model (vs 14GB at FP16)
What You Learned
- GGUF is for portability — runs everywhere, great for local dev and offline use
- AWQ is for GPU throughput — calibration takes time but inference is significantly faster than GGUF on GPU
- Q4_K_M is the default for GGUF — don't overthink the quantization type unless you're optimizing for a specific constraint
- Calibration data matters for AWQ — the default WikiText calibration works well for general models; use domain-specific data for coding or medical models
- Limitation: AWQ doesn't yet support all architectures (check the `autoawq` GitHub repo for compatibility); Mamba and SSM-based models need different tooling
When NOT to use quantization: Full fine-tuning still requires FP16/BF16 weights (QLoRA-style adapter training is the exception). Quantize after training, not before.
Tested on Python 3.12, llama.cpp (Feb 2026 build), AutoAWQ 0.2.x, CUDA 12.4, Ubuntu 24.04 and macOS 15 (Sequoia)