Problem: You Fine-Tuned a Model — Now How Do You Know It's Better?
Fine-tuned LLM evaluation is the step most developers skip — and it's exactly why fine-tuned models quietly regress on tasks you didn't test. Running MMLU tells you if general knowledge held. MT-Bench tells you if instruction-following improved. Custom evals tell you if your specific task actually got better.
You'll learn:
- Run MMLU and MT-Bench against your fine-tuned checkpoint
- Build a custom eval harness for domain-specific tasks
- Interpret scores to catch regression and measure real gains
Time: 25 min | Difficulty: Intermediate
Why Eval Is Non-Trivial After Fine-Tuning
Fine-tuning optimizes for one distribution. The risk is catastrophic forgetting — your model gets better at your task and quietly breaks everything else. A model fine-tuned on customer support data might score 10 points lower on MMLU's STEM subset after just two epochs.
Three evaluation layers cover the failure modes:
- MMLU — 57-subject multiple-choice. Detects knowledge regression fast.
- MT-Bench — 80 multi-turn questions judged by GPT-4. Detects instruction-following drift.
- Custom evals — Your actual task, your actual data, your actual success metric.
Running all three takes under 30 minutes. Skipping them costs days of debugging in production.
Common symptoms of a broken fine-tune:
- MMLU drops more than 3 points vs base model
- MT-Bench score decreases despite lower training loss
- Custom eval passes but production users report worse answers
Three-layer eval pipeline: MMLU catches knowledge regression, MT-Bench catches instruction drift, custom evals validate your actual task.
Setup
Step 1: Install lm-evaluation-harness
lm-evaluation-harness is the standard tool for running MMLU and hundreds of other benchmarks against any Hugging Face model.
# Python 3.12 + CUDA 12 assumed
pip install lm-eval --break-system-packages
# Verify install
lm_eval --version
Expected output: lm-eval, version 0.4.x
Also install FastChat for MT-Bench:
pip install fschat --break-system-packages
pip install openai anthropic --break-system-packages # MT-Bench judge calls GPT-4
If install fails:
- ERROR: pip's dependency resolver → Run inside a venv: python -m venv .venv && source .venv/bin/activate
- CUDA mismatch → Pin torch:
pip install torch==2.3.0+cu121 --index-url https://download.pytorch.org/whl/cu121
Step 2: Run MMLU Against Your Fine-Tuned Checkpoint
Point lm_eval at your local checkpoint. The --num_fewshot 5 flag matches the original MMLU paper setup — always use 5-shot for comparable scores.
lm_eval \
--model hf \
--model_args pretrained=/path/to/your/checkpoint,dtype=bfloat16 \
--tasks mmlu \
--num_fewshot 5 \
--batch_size auto \
--output_path ./results/mmlu_finetuned.json
Run the same command against the base model so you have a delta:
lm_eval \
--model hf \
--model_args pretrained=meta-llama/Llama-3.1-8B,dtype=bfloat16 \
--tasks mmlu \
--num_fewshot 5 \
--batch_size auto \
--output_path ./results/mmlu_base.json
Expected output:
mmlu (acc) ↑ 0.6312 (fine-tuned)
mmlu (acc) ↑ 0.6387 (base)
A drop under 2 points is acceptable noise. Over 3 points means your fine-tuning data or learning rate is causing catastrophic forgetting: reduce the number of epochs, lower the learning rate, or increase the diversity of your training data.
Inspect per-subject breakdown to find where regression happened:
import json

with open("results/mmlu_finetuned.json") as f:
    ft = json.load(f)
with open("results/mmlu_base.json") as f:
    base = json.load(f)

for task, result in ft["results"].items():
    base_acc = base["results"].get(task, {}).get("acc,none", 0)
    ft_acc = result.get("acc,none", 0)
    delta = ft_acc - base_acc
    if abs(delta) > 0.03:  # Flag subjects that moved more than 3 points
        print(f"{task}: {delta:+.3f} (base: {base_acc:.3f} → ft: {ft_acc:.3f})")
Expected output: Only subjects related to your fine-tuning domain should move upward. Everything else should be flat.
Step 3: Run MT-Bench
MT-Bench uses GPT-4 as a judge to score model responses on a scale of 1–10 across 8 categories: writing, roleplay, extraction, math, coding, reasoning, STEM, and humanities.
Clone the FastChat repo and navigate to the MT-Bench directory:
git clone https://github.com/lm-sys/FastChat.git
cd FastChat/fastchat/llm_judge
Generate model answers for your fine-tuned checkpoint:
python gen_model_answer.py \
--model-path /path/to/your/checkpoint \
--model-id my-finetuned-model \
--bench-name mt_bench
Generate answers for the base model too:
python gen_model_answer.py \
--model-path meta-llama/Llama-3.1-8B \
--model-id llama-3.2-8b-base \
--bench-name mt_bench
Run GPT-4 judgment (requires OPENAI_API_KEY):
export OPENAI_API_KEY=sk-...
python gen_judgment.py \
--model-list my-finetuned-model llama-3.2-8b-base \
--judge-model gpt-4o \
--bench-name mt_bench
Show the final scores:
python show_result.py --bench-name mt_bench
Expected output:
Model Score
my-finetuned-model 7.42
llama-3.2-8b-base 6.95
MT-Bench cost with GPT-4o runs about $2–4 USD per model evaluated. Use --judge-model gpt-4o-mini to cut cost to under $0.50 at the expense of slightly noisier scores.
If scores drop on your fine-tuned model:
- Check your training data for instruction format mismatches — the model may have learned to ignore system prompts
- Review the per-category breakdown:
python show_result.py --bench-name mt_bench --mode single
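If show_result.py's output doesn't break scores down the way you need, you can aggregate the judgment file directly. A minimal sketch, assuming each JSONL record carries "model", "category", and "score" fields; field names vary between FastChat versions, so inspect your file first and adapt the keys:

```python
import json
from collections import defaultdict

def category_averages(judgment_path):
    """Average judge scores per (model, category) pair.

    Assumes each JSONL record has "model", "category", and "score" keys;
    adjust the key names to match what your FastChat version emits.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    with open(judgment_path) as f:
        for line in f:
            rec = json.loads(line)
            key = (rec["model"], rec["category"])
            sums[key] += rec["score"]
            counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}
```

A category where the fine-tuned model trails the base model by a full point is usually where the instruction drift lives.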
Step 4: Build a Custom Eval
Standard benchmarks don't measure your task. A custom eval runs your model against held-out examples from your domain and scores them with a deterministic metric or an LLM judge.
Create a directory structure:
mkdir -p custom_eval/{data,results}
Prepare your eval dataset as JSONL. Each line is one example:
# custom_eval/data/my_task.jsonl
{"input": "Summarize this support ticket: ...", "expected": "Billing issue, priority high"}
{"input": "Classify this review as positive/negative: ...", "expected": "positive"}
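A malformed line fails silently partway through a run, so it's worth validating the file up front. A small sketch (the input/expected keys match the example above; change required_keys if your schema differs):

```python
import json

def validate_jsonl(path, required_keys=("input", "expected")):
    """Check that every line parses as JSON and carries the required keys.

    Returns a list of human-readable error strings; empty means the file
    is safe to feed to the eval runner.
    """
    errors = []
    with open(path) as f:
        for i, line in enumerate(f, start=1):
            if not line.strip():
                continue  # Tolerate blank lines
            try:
                ex = json.loads(line)
            except json.JSONDecodeError as e:
                errors.append(f"line {i}: invalid JSON ({e})")
                continue
            missing = [k for k in required_keys if k not in ex]
            if missing:
                errors.append(f"line {i}: missing keys {missing}")
    return errors
```

Run it once before every eval; a bad line caught here costs seconds instead of a wasted GPU run.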
Write the eval runner:
# custom_eval/run_eval.py
import json

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_PATH = "/path/to/your/checkpoint"
DATA_PATH = "custom_eval/data/my_task.jsonl"
OUTPUT_PATH = "custom_eval/results/output.jsonl"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

with open(DATA_PATH) as f:
    examples = [json.loads(line) for line in f]

results = []
for ex in examples:
    messages = [{"role": "user", "content": ex["input"]}]
    input_ids = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,  # Required for instruction-tuned models
        return_tensors="pt",
    ).to(model.device)
    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            max_new_tokens=256,
            do_sample=False,  # Greedy decoding for a deterministic eval
        )
    response = tokenizer.decode(
        output_ids[0][input_ids.shape[1]:],
        skip_special_tokens=True,
    ).strip()
    results.append({
        "input": ex["input"],
        "expected": ex["expected"],
        "predicted": response,
    })

with open(OUTPUT_PATH, "w") as f:
    for r in results:
        f.write(json.dumps(r) + "\n")
print(f"Saved {len(results)} results to {OUTPUT_PATH}")
Run it:
python custom_eval/run_eval.py
Score the results — use exact match for classification, ROUGE for summarization, or an LLM judge for open-ended:
# custom_eval/score.py
import json

with open("custom_eval/results/output.jsonl") as f:
    results = [json.loads(line) for line in f]

# Lenient exact match: counts a hit when the normalized expected answer
# appears as a substring of the prediction (robust to extra wording
# around a correct label). Works for classification and extraction.
correct = sum(
    1 for r in results
    if r["expected"].lower().strip() in r["predicted"].lower()
)
print(f"Exact match accuracy: {correct}/{len(results)} = {correct/len(results):.1%}")
Expected output: Exact match accuracy: 47/50 = 94.0%
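Exact match breaks down for summarization, where a correct answer rarely repeats the reference verbatim. If you'd rather not pull in an extra dependency, a unigram-overlap F1 in the spirit of ROUGE-1 is a reasonable first cut; note this is an illustrative approximation, not the official ROUGE implementation:

```python
from collections import Counter

def unigram_f1(expected: str, predicted: str) -> float:
    """Token-overlap F1, a rough stand-in for ROUGE-1."""
    ref = Counter(expected.lower().split())
    hyp = Counter(predicted.lower().split())
    overlap = sum((ref & hyp).values())  # Count shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

For a production gate, swap this for the rouge_score package or an LLM judge; the approximation is mainly useful for fast iteration during training.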
Run the same eval against the base model and record both scores. Your custom eval delta is the number that justifies shipping the fine-tune.
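One way to make that comparison concrete is a per-example diff of the two runs. A sketch, assuming you saved the base model's run to a second file with the same schema (the output_base.jsonl path here is hypothetical); it reuses the same lenient substring match as the scoring script above:

```python
import json

def compare_runs(ft_path, base_path):
    """Per-example win/loss/tie tally between fine-tuned and base runs.

    Both files must contain the same examples in the same order.
    """
    def load_hits(path):
        with open(path) as f:
            rows = [json.loads(line) for line in f]
        # Lenient match: expected answer appears inside the prediction
        return [
            r["expected"].lower().strip() in r["predicted"].lower()
            for r in rows
        ]
    ft, base = load_hits(ft_path), load_hits(base_path)
    wins = sum(1 for a, b in zip(ft, base) if a and not b)
    losses = sum(1 for a, b in zip(ft, base) if b and not a)
    return {"wins": wins, "losses": losses, "ties": len(ft) - wins - losses}
```

The win/loss split is more informative than two aggregate accuracies: a fine-tune that wins 10 examples but loses 8 different ones is a very different risk than one that wins 2 and loses none.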
Verification
Run a quick sanity check that all three eval outputs exist and are non-empty:
ls -lh results/mmlu_finetuned.json \
FastChat/fastchat/llm_judge/data/mt_bench/model_judgment/gpt-4o_single.jsonl \
custom_eval/results/output.jsonl
Parse the MMLU result to confirm it loaded correctly:
python -c "
import json
r = json.load(open('results/mmlu_finetuned.json'))
print('MMLU acc:', round(r['results']['mmlu']['acc,none'], 4))
"
You should see: MMLU acc: 0.6312 (or your model's actual score)
Reading the Results Together
| Metric | Acceptable | Investigate | Block ship |
|---|---|---|---|
| MMLU delta vs base | ≥ −2 pts | −2 to −4 pts | < −4 pts |
| MT-Bench delta vs base | ≥ 0 pts | −0.3 to −0.5 | < −0.5 |
| Custom eval accuracy | Task-dependent target | — | Below base model |
A fine-tune that passes all three is ready to ship. A fine-tune that passes only custom eval but fails MMLU and MT-Bench has overfit — it will degrade in production on queries slightly outside your training distribution.
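The table above is easy to encode as an automated gate in a training pipeline. A minimal sketch using the same thresholds (MMLU and MT-Bench deltas vs base, MMLU in percentage points; the unassigned bands in the table are treated as "investigate" here):

```python
def ship_decision(mmlu_delta_pts, mt_bench_delta, custom_beats_base):
    """Map the three eval results onto ship / investigate / block."""
    # Any hard failure blocks the release outright
    if mmlu_delta_pts < -4 or mt_bench_delta < -0.5 or not custom_beats_base:
        return "block"
    # Softer regressions warrant a human look before shipping
    if mmlu_delta_pts < -2 or mt_bench_delta < 0:
        return "investigate"
    return "ship"
```

Wiring this into CI after the eval runs means a regressed checkpoint can never be promoted by accident.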
What You Learned
- MMLU 5-shot is the fastest regression check — run it first, takes ~10 minutes on an 8B model with a single A100
- MT-Bench catches instruction-following collapse that MMLU misses, but costs ~$3 USD and 20 minutes
- Custom evals are the only metric that directly validates your business goal — build them before you fine-tune, not after
- Always compare fine-tuned vs base on the same hardware and same batch size — different batch sizes change generation order and can shift scores by ±0.5 points
Tested on Python 3.12, lm-eval 0.4.3, FastChat 0.2.36, CUDA 12.1, Ubuntu 22.04 + RTX 4090
FAQ
Q: Do I need a GPU to run MMLU evaluation?
A: You can run on CPU with --device cpu, but an 8B model will take 6–8 hours. On a single A100 (available on Lambda Labs at $1.10/hr USD), MMLU finishes in under 15 minutes.
Q: What's the difference between 0-shot and 5-shot MMLU?
A: 5-shot provides five examples in the prompt before the question. Always use --num_fewshot 5 to match published benchmarks. 0-shot scores are typically 3–8 points lower and not comparable to leaderboard numbers.
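Mechanically, 5-shot just means the prompt is prefixed with five solved examples before the real question. An illustrative sketch of how such a prompt is assembled (the exact formatting lm-evaluation-harness uses differs per task, so treat this as a schematic, not its internal template):

```python
def build_few_shot_prompt(shots, question, choices):
    """Assemble an MMLU-style few-shot multiple-choice prompt.

    Each shot is a dict with "question", "choices" (4 strings),
    and "answer" (a letter A-D).
    """
    letters = "ABCD"
    blocks = []
    for shot in shots:
        lines = [shot["question"]]
        lines += [f"{l}. {c}" for l, c in zip(letters, shot["choices"])]
        lines.append(f"Answer: {shot['answer']}")  # Solved example
        blocks.append("\n".join(lines))
    # The real question ends with a bare "Answer:" for the model to complete
    lines = [question] + [f"{l}. {c}" for l, c in zip(letters, choices)]
    lines.append("Answer:")
    blocks.append("\n".join(lines))
    return "\n\n".join(blocks)
```

The solved examples anchor the answer format, which is why 0-shot scores run lower: some of the loss is the model guessing the format, not the answer.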
Q: Can I use a cheaper judge than GPT-4o for MT-Bench?
A: Yes. --judge-model gpt-4o-mini costs under $0.50 USD per eval run. The scores correlate well with GPT-4o for most categories, but math and coding judgments are noisier. For a final production gate, use full GPT-4o.
Q: My custom eval accuracy is high but MT-Bench dropped — should I ship?
A: Investigate first. An MT-Bench drop usually means the model learned to ignore multi-turn context or system prompts. Check that your training data uses the same chat template as the base model's original instruction format.
Q: How many custom eval examples do I need for a reliable score?
A: 200–500 held-out examples give a standard error under ±2 percentage points. Under 100 examples, a single batch of bad generations can swing the score by ±5 points and mislead your decision.
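The ±2-point figure follows from the binomial standard error of an accuracy estimate, sqrt(p(1-p)/n). A quick check of the claim for a few sample sizes:

```python
import math

def accuracy_standard_error(p: float, n: int) -> float:
    """Standard error of an accuracy estimate p measured on n examples."""
    return math.sqrt(p * (1 - p) / n)

# At p = 0.85, only the n = 400 run lands under the ±2-point claim
for n in (50, 100, 400):
    se = accuracy_standard_error(0.85, n)
    print(f"n={n}: ±{se * 100:.1f} pts")
```

The error shrinks with the square root of n, so doubling your eval set only tightens the score by about 30%; going from 50 to 400 examples is what actually buys you a decision-grade number.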