Convert Fine-Tuned Models to GGUF: llama.cpp Workflow 2026

GGUF quantization after fine-tuning with llama.cpp: convert, quantize to Q4_K_M or Q8_0, and run locally. Tested on Python 3.12, CUDA 12.4, Ubuntu 24.04.

GGUF quantization after fine-tuning is the step most tutorials skip — they show you how to train, then assume you know how to ship. This guide covers the exact llama.cpp workflow to convert a Hugging Face fine-tuned model to GGUF and quantize it to a size that runs on consumer hardware.

You'll learn:

  • How to merge LoRA adapters into a base model before conversion
  • How to convert the merged model to GGUF using convert_hf_to_gguf.py
  • How to quantize to Q4_K_M, Q5_K_M, or Q8_0 using llama-quantize
  • Which quantization level to pick for your RAM/quality target

Time: 25 min | Difficulty: Intermediate


Why GGUF Conversion Breaks After Fine-Tuning

Most fine-tuning runs produce a LoRA adapter, not a standalone model. You end up with two directories: the original base weights and a small adapter_model.safetensors. Feed that adapter directly into llama.cpp's converter and it will error immediately — the converter expects full model weights, not a delta.

Even when you do have full weights (a merged or full-param fine-tune), subtle config differences — a renamed tokenizer_config.json key, a missing rope_scaling field, a non-standard chat template — silently produce malformed GGUF files that crash at inference with no obvious error.

Symptoms:

  • KeyError: 'rope_scaling' during conversion
  • llama_model_load: error loading model at runtime
  • Garbled output or repetition loop — quantization ran but config wasn't embedded correctly
  • ValueError: Unrecognized model when convert_hf_to_gguf.py hits a custom architecture
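Most of these failure modes can be caught before conversion with a quick pre-flight check on the model directory. A minimal sketch (the checked fields are illustrative, not the converter's full requirements):

```python
# preflight_check.py — catch common GGUF-conversion pitfalls before running the converter
import json
from pathlib import Path

def preflight(model_dir: str) -> list[str]:
    """Return warnings for the failure modes listed above."""
    d = Path(model_dir)
    warnings = []

    # An adapter directory means the LoRA merge step was skipped
    if (d / "adapter_config.json").exists():
        warnings.append("PEFT adapter detected -- merge into base weights first (Step 1)")

    cfg_path = d / "config.json"
    if not cfg_path.exists():
        warnings.append("config.json missing")
    else:
        cfg = json.loads(cfg_path.read_text())
        if "rope_scaling" not in cfg:
            warnings.append('no "rope_scaling" key -- add "rope_scaling": null')

    if not (d / "tokenizer_config.json").exists():
        warnings.append("tokenizer_config.json missing")
    return warnings
```

Run preflight("./merged-model") before converting and resolve every warning first.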

Solution

Step 1: Merge LoRA Adapters into the Base Model

Skip this step only if you ran a full parameter fine-tune (no PEFT/LoRA). Otherwise, merge first.

# merge_lora.py
# Merges adapter weights into base — required before GGUF conversion
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

BASE_MODEL = "meta-llama/Llama-3.1-8B-Instruct"   # or your local path
ADAPTER_PATH = "./checkpoints/lora-adapter"
OUTPUT_PATH = "./merged-model"

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
base = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.float16,   # float16 keeps disk size manageable before quantization
    device_map="cpu",            # CPU merge avoids VRAM limits on large models
)

model = PeftModel.from_pretrained(base, ADAPTER_PATH)
model = model.merge_and_unload()   # collapses LoRA A/B matrices into base weights

model.save_pretrained(OUTPUT_PATH)
tokenizer.save_pretrained(OUTPUT_PATH)
print("Merged model saved to", OUTPUT_PATH)

Run it:

python merge_lora.py

Expected output: Merged model saved to ./merged-model

If it fails:

  • ValueError: adapter_config.json not found → check ADAPTER_PATH points to the folder, not a file
  • OOM on GPU → set device_map="cpu" (already set above — confirm it wasn't overridden)
  • AttributeError: 'LlamaForCausalLM' has no attribute 'merge_and_unload' → run pip install -U peft --break-system-packages
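Before moving on, confirm the output directory actually holds full merged weights rather than a leftover adapter. A small sanity check (the ~15-16 GB figure assumes an 8B model saved in float16):

```python
# verify_merge.py — sanity-check the merged output before GGUF conversion
from pathlib import Path

def merged_weights_gb(model_dir: str) -> float:
    """Total size of .safetensors shards in GB; a float16 8B merge is roughly 15-16 GB."""
    d = Path(model_dir)
    if (d / "adapter_config.json").exists():
        raise ValueError("adapter_config.json present -- this is an adapter, not a merged model")
    return sum(f.stat().st_size for f in d.glob("*.safetensors")) / 1e9
```

If merged_weights_gb("./merged-model") comes back at a fraction of a GB for an 8B model, the save step wrote adapter deltas instead of merged weights.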

Step 2: Clone llama.cpp and Build the Tools

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build with CUDA support (remove -DGGML_CUDA=ON for CPU-only)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

Expected output: [100%] Built target llama-quantize and llama-cli in ./build/bin/

If it fails:

  • nvcc not found → install cuda-toolkit-12 or drop -DGGML_CUDA=ON for CPU build
  • cmake: command not found → sudo apt install cmake (Ubuntu 24.04)

Install Python deps for the converter:

pip install -r requirements.txt --break-system-packages
# Key packages: transformers, sentencepiece, numpy, gguf

Step 3: Convert the Merged Model to GGUF (F16)

Always convert to F16 first, then quantize in the next step: convert_hf_to_gguf.py cannot emit K-quant formats like Q4_K_M directly, and llama-quantize expects a high-precision GGUF as its input.

python convert_hf_to_gguf.py \
  ../merged-model \
  --outfile ../merged-model/model-f16.gguf \
  --outtype f16
# --outtype f16 preserves full weight precision before lossy quantization

Expected output:

Model successfully exported to ../merged-model/model-f16.gguf

If it fails:

  • KeyError: 'rope_scaling' → open config.json and add "rope_scaling": null — some fine-tuning configs drop this field
  • Unrecognized model type → your base model architecture may need a custom conversion script; check llama.cpp/examples/convert_legacy_llama.py
  • Converter hangs on large model → add --vocab-only first to validate tokenizer, then rerun without it
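The rope_scaling fix can be scripted instead of hand-editing config.json. A minimal sketch:

```python
# patch_config.py — restore a "rope_scaling" key dropped by the fine-tuning framework
import json
from pathlib import Path

def ensure_rope_scaling(model_dir: str) -> bool:
    """Add "rope_scaling": null to config.json if absent; returns True when patched."""
    cfg_path = Path(model_dir) / "config.json"
    cfg = json.loads(cfg_path.read_text())
    if "rope_scaling" in cfg:
        return False  # already present, leave untouched
    cfg["rope_scaling"] = None  # Python None serializes as JSON null
    cfg_path.write_text(json.dumps(cfg, indent=2))
    return True
```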

Step 4: Quantize to Your Target Format

[Figure: end-to-end pipeline — LoRA adapter merge → F16 GGUF conversion → lossy quantization → local llama.cpp inference]

# Q4_K_M — best default: 4-bit with mixed precision on attention layers
./build/bin/llama-quantize \
  ../merged-model/model-f16.gguf \
  ../merged-model/model-q4_k_m.gguf \
  Q4_K_M

# Q5_K_M — +15% size, noticeably better on reasoning and code tasks
./build/bin/llama-quantize \
  ../merged-model/model-f16.gguf \
  ../merged-model/model-q5_k_m.gguf \
  Q5_K_M

# Q8_0 — near-lossless, ~2× size of Q4_K_M, use when VRAM allows
./build/bin/llama-quantize \
  ../merged-model/model-f16.gguf \
  ../merged-model/model-q8_0.gguf \
  Q8_0

Expected output per run:

[  0/ 32] blk.0.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
...
llama_model_quantize_internal: model size  = 15246.07 MB
llama_model_quantize_internal: quant size  =  4685.32 MB

If it fails:

  • invalid magic → the F16 GGUF is corrupt; rerun Step 3
  • quantize: failed to load model → llama.cpp version mismatch between convert and quantize steps; rebuild after pulling latest
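The invalid magic case is easy to rule out up front: every valid GGUF file starts with the four ASCII bytes GGUF. A quick check:

```python
# check_gguf.py — verify the GGUF magic bytes before handing a file to llama-quantize
def has_gguf_magic(path: str) -> bool:
    """Every valid GGUF file begins with the bytes b"GGUF"."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"
```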

Step 5: Test Inference with the Quantized Model

./build/bin/llama-cli \
  -m ../merged-model/model-q4_k_m.gguf \
  -p "Explain gradient checkpointing in one paragraph." \
  -n 200 \
  --temp 0.7 \
  -ngl 32   # 32 layers offloaded to GPU — adjust to your VRAM

Expected output: Coherent generation in under 5 seconds on an RTX 3080.


Quantization Format Comparison

Format    Size (8B model)   Quality loss   RAM needed   Best for
F16       ~16 GB            None           18 GB+       Intermediate only
Q8_0      ~8.5 GB           Minimal        10 GB        High accuracy, 12 GB VRAM
Q5_K_M    ~5.7 GB           Low            8 GB         Balanced quality/size
Q4_K_M    ~4.7 GB           Moderate       6 GB         Best default
Q3_K_M    ~3.7 GB           High           5 GB         Edge / low RAM only
Q2_K      ~2.9 GB           Very high      4 GB         Last resort

The _K_M suffix marks the "K-quant" method, which quantizes most layers to 4-bit but keeps attention and output layers at higher precision. For most fine-tuned chat and instruction models, quality versus Q8_0 is indistinguishable in casual use and measurably close on standard benchmarks.
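The sizes in the table follow from effective bits per weight. A back-of-the-envelope estimate (the bpw figures below are inferred from typical 8B file sizes, not official llama.cpp constants):

```python
# quant_size.py — rough GGUF size estimate from parameter count and bits per weight
APPROX_BPW = {  # effective bits per weight, inferred from typical 8B file sizes
    "F16": 16.0, "Q8_0": 8.5, "Q5_K_M": 5.7,
    "Q4_K_M": 4.7, "Q3_K_M": 3.7, "Q2_K": 2.9,
}

def estimate_gb(n_params: float, fmt: str) -> float:
    """Approximate file size in GB: parameters x bits-per-weight / 8 bits per byte."""
    return n_params * APPROX_BPW[fmt] / 8 / 1e9

# estimate_gb(8e9, "Q4_K_M") -> ~4.7 GB, matching the table above
```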


Verification

Run a quick perplexity check against a reference text to catch catastrophic quantization failure:

./build/bin/llama-perplexity \
  -m ../merged-model/model-q4_k_m.gguf \
  -f /path/to/wikitext-2-raw/wiki.test.raw \
  --chunks 10

You should see: Perplexity within ~0.5–1.5 points of the F16 baseline. A jump of 3+ points means the conversion or quantization step has an error worth investigating.
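The thresholds above are this guide's rule of thumb, and wiring them into a helper makes the check repeatable:

```python
# ppl_check.py — flag suspicious perplexity regressions after quantization
def quantization_verdict(ppl_f16: float, ppl_quant: float) -> str:
    """Compare quantized perplexity against the F16 baseline."""
    delta = ppl_quant - ppl_f16
    if delta < 1.5:
        return "ok"          # normal quantization loss
    if delta < 3.0:
        return "borderline"  # rerun with more --chunks before trusting it
    return "broken"          # investigate the conversion or quantization step
```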


What You Learned

  • LoRA adapters must be merged into base weights before GGUF conversion — the converter requires full model weights
  • Always produce F16 GGUF first, then quantize in a second step using llama-quantize
  • Q4_K_M is the right default for 8B models on consumer hardware — 6 GB RAM, minimal quality loss
  • Config fields dropped by fine-tuning frameworks (like rope_scaling) silently break conversion; patch config.json before converting

Tested on llama.cpp commit b4896, Python 3.12, CUDA 12.4, Ubuntu 24.04 and macOS Sequoia (M3 Max)


FAQ

Q: Can I skip the merge step if I just want to test the adapter quickly? A: No — convert_hf_to_gguf.py does not support PEFT adapters. You must merge first. The merge takes 2–5 minutes on CPU for an 8B model and uses ~16 GB RAM.

Q: What is the difference between Q4_K_M and Q4_K_S? A: The _M (medium) variant uses higher precision on more sensitive layers than _S (small). Q4_K_M is ~3% larger than Q4_K_S but scores better on reasoning benchmarks. Use _M unless you are tightly constrained on disk space.

Q: Does GGUF quantization work on Mistral, Qwen, and Gemma fine-tunes? A: Yes. llama.cpp supports all major architectures. If conversion fails with Unrecognized model, check that your llama.cpp checkout is recent — architecture support is updated frequently.

Q: Minimum VRAM to run a quantized 8B model at Q4_K_M? A: 6 GB VRAM with -ngl 32 (full GPU offload). On 4 GB VRAM, reduce to -ngl 20 and accept partial CPU offload — generation will be ~2–3× slower.
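For other VRAM budgets, a rough rule is to divide the quantized file size by the layer count (32 for an 8B Llama) and reserve about 1 GB of headroom for the KV cache and activations. A sketch under those assumptions:

```python
# pick_ngl.py — rough -ngl estimate: how many layers fit in a given VRAM budget
def pick_ngl(model_size_gb: float, n_layers: int, vram_gb: float,
             headroom_gb: float = 1.0) -> int:
    """Offload as many layers as fit after reserving headroom for KV cache/activations."""
    per_layer_gb = model_size_gb / n_layers
    budget_gb = max(vram_gb - headroom_gb, 0.0)
    return min(n_layers, int(budget_gb / per_layer_gb))

# pick_ngl(4.7, 32, 6.0) -> 32 (full offload); pick_ngl(4.7, 32, 4.0) -> 20
```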

Q: Can I upload the GGUF to Hugging Face directly? A: Yes. Create a repo, then huggingface-cli upload your-org/model-name model-q4_k_m.gguf. Users with Ollama can pull it via a custom Modelfile pointing to the uploaded GGUF.