Evaluate Fine-Tuned LLMs: MMLU, MT-Bench, and Custom Evals 2026

Run MMLU, MT-Bench, and custom eval suites on fine-tuned models using lm-evaluation-harness and FastChat. Tested on Python 3.12 + CUDA 12 + Hugging Face.

Problem: You Fine-Tuned a Model — Now How Do You Know It's Better?

Fine-tuned LLM evaluation is the step most developers skip — and it's exactly why fine-tuned models quietly regress on tasks you didn't test. Running MMLU tells you if general knowledge held. MT-Bench tells you if instruction-following improved. Custom evals tell you if your specific task actually got better.

You'll learn:

  • Run MMLU and MT-Bench against your fine-tuned checkpoint
  • Build a custom eval harness for domain-specific tasks
  • Interpret scores to catch regression and measure real gains

Time: 25 min | Difficulty: Intermediate


Why Eval Is Non-Trivial After Fine-Tuning

Fine-tuning optimizes for one distribution. The risk is catastrophic forgetting — your model gets better at your task and quietly breaks everything else. A model fine-tuned on customer support data might score 10 points lower on MMLU's STEM subset after just two epochs.

Three evaluation layers cover the failure modes:

  • MMLU — 57-subject multiple-choice. Detects knowledge regression fast.
  • MT-Bench — 80 multi-turn questions judged by GPT-4. Detects instruction-following drift.
  • Custom evals — Your actual task, your actual data, your actual success metric.

Running all three takes under 30 minutes. Skipping them costs days of debugging in production.

Common symptoms of a broken fine-tune:

  • MMLU drops more than 3 points vs base model
  • MT-Bench score decreases despite lower training loss
  • Custom eval passes but production users report worse answers

Three-layer eval pipeline: MMLU catches knowledge regression, MT-Bench catches instruction drift, custom evals validate your actual task.


Setup

Step 1: Install lm-evaluation-harness

lm-evaluation-harness is the standard tool for running MMLU and hundreds of other benchmarks against any Hugging Face model.

# Python 3.12 + CUDA 12 assumed
pip install lm-eval --break-system-packages

# Verify install
lm_eval --version

Expected output: lm-eval, version 0.4.x

Also install FastChat for MT-Bench:

pip install fschat --break-system-packages
pip install openai anthropic --break-system-packages  # MT-Bench judge calls GPT-4

If install fails:

  • ERROR: pip's dependency resolver → Run inside a venv: python -m venv .venv && source .venv/bin/activate
  • CUDA mismatch → Pin torch: pip install torch==2.3.0+cu121 --index-url https://download.pytorch.org/whl/cu121

Step 2: Run MMLU Against Your Fine-Tuned Checkpoint

Point lm_eval at your local checkpoint. The --num_fewshot 5 flag matches the original MMLU paper setup — always use 5-shot for comparable scores.

lm_eval \
  --model hf \
  --model_args pretrained=/path/to/your/checkpoint,dtype=bfloat16 \
  --tasks mmlu \
  --num_fewshot 5 \
  --batch_size auto \
  --output_path ./results/mmlu_finetuned.json

Run the same command against the base model so you have a delta:

lm_eval \
  --model hf \
  --model_args pretrained=meta-llama/Llama-3.1-8B,dtype=bfloat16 \
  --tasks mmlu \
  --num_fewshot 5 \
  --batch_size auto \
  --output_path ./results/mmlu_base.json

Expected output:

mmlu (acc)    ↑    0.6312   (fine-tuned)
mmlu (acc)    ↑    0.6387   (base)

A drop under 2 points is acceptable. Over 3 points means your fine-tuning data or learning rate is causing catastrophic forgetting; reduce the number of epochs, lower the learning rate, or mix in more diverse data.
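
To script that threshold check, a small helper can compute the overall delta. This is a sketch assuming the two result files written by the commands above and lm-eval 0.4's `acc,none` result key:

```python
import json

def mmlu_delta_points(ft_path: str, base_path: str) -> float:
    # Overall MMLU accuracy difference in percentage points (positive = fine-tune improved)
    with open(ft_path) as f:
        ft_acc = json.load(f)["results"]["mmlu"]["acc,none"]
    with open(base_path) as f:
        base_acc = json.load(f)["results"]["mmlu"]["acc,none"]
    return (ft_acc - base_acc) * 100

# Example: mmlu_delta_points("results/mmlu_finetuned.json", "results/mmlu_base.json")
```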

Inspect per-subject breakdown to find where regression happened:

import json

with open("results/mmlu_finetuned.json") as f:
    ft = json.load(f)
with open("results/mmlu_base.json") as f:
    base = json.load(f)

for task, result in ft["results"].items():
    base_acc = base["results"].get(task, {}).get("acc,none", 0)
    ft_acc = result.get("acc,none", 0)
    delta = ft_acc - base_acc
    if abs(delta) > 0.03:  # Flag subjects that moved more than 3 points
        print(f"{task}: {delta:+.3f}  (base: {base_acc:.3f} → ft: {ft_acc:.3f})")

Expected output: Only subjects related to your fine-tuning domain should move upward. Everything else should be flat.


Step 3: Run MT-Bench

MT-Bench uses GPT-4 as a judge to score model responses on a scale of 1–10 across 8 categories: writing, roleplay, extraction, math, coding, reasoning, STEM, and humanities.

Clone the FastChat repo and navigate to the MT-Bench directory:

git clone https://github.com/lm-sys/FastChat.git
cd FastChat/fastchat/llm_judge

Generate model answers for your fine-tuned checkpoint:

python gen_model_answer.py \
  --model-path /path/to/your/checkpoint \
  --model-id my-finetuned-model \
  --bench-name mt_bench

Generate answers for the base model too:

python gen_model_answer.py \
  --model-path meta-llama/Llama-3.1-8B \
  --model-id llama-3.1-8b-base \
  --bench-name mt_bench

Run GPT-4 judgment (requires OPENAI_API_KEY):

export OPENAI_API_KEY=sk-...

python gen_judgment.py \
  --model-list my-finetuned-model llama-3.1-8b-base \
  --judge-model gpt-4o \
  --bench-name mt_bench

Show the final scores:

python show_result.py --bench-name mt_bench

Expected output:

Model                   Score
my-finetuned-model      7.42
llama-3.1-8b-base       6.95

MT-Bench cost with GPT-4o runs about $2–4 USD per model evaluated. Use --judge-model gpt-4o-mini to cut cost to under $0.50 at the expense of slightly noisier scores.

If scores drop on your fine-tuned model:

  • Check your training data for instruction format mismatches — the model may have learned to ignore system prompts
  • Review the per-category breakdown: python show_result.py --bench-name mt_bench --mode single
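
If you want the per-category averages programmatically, the judgment file can be aggregated directly. This is a sketch that assumes the single-mode judgment JSONL records carry `model`, `question_id`, and `score` fields (FastChat marks failed judgments with a score of −1) and that categories come from the benchmark's `question.jsonl`:

```python
import json
from collections import defaultdict

def category_means(question_path: str, judgment_path: str, model_id: str) -> dict:
    # question.jsonl maps question_id -> category (writing, math, coding, ...)
    categories = {}
    with open(question_path) as f:
        for line in f:
            q = json.loads(line)
            categories[q["question_id"]] = q["category"]

    # Collect judge scores per category for one model, skipping failed (-1) judgments
    per_cat = defaultdict(list)
    with open(judgment_path) as f:
        for line in f:
            j = json.loads(line)
            if j.get("model") == model_id and j.get("score", -1) >= 0:
                per_cat[categories.get(j["question_id"], "unknown")].append(j["score"])

    return {cat: sum(s) / len(s) for cat, s in per_cat.items()}
```

Call it with the paths under `data/mt_bench/`, e.g. `category_means("data/mt_bench/question.jsonl", "data/mt_bench/model_judgment/gpt-4o_single.jsonl", "my-finetuned-model")`.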

Step 4: Build a Custom Eval

Standard benchmarks don't measure your task. A custom eval runs your model against held-out examples from your domain and scores them with a deterministic metric or an LLM judge.

Create a directory structure:

mkdir -p custom_eval/{data,results}

Prepare your eval dataset as JSONL. Each line is one example:

# custom_eval/data/my_task.jsonl
{"input": "Summarize this support ticket: ...", "expected": "Billing issue, priority high"}
{"input": "Classify this review as positive/negative: ...", "expected": "positive"}
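
Malformed JSONL lines fail silently at generation time, so a quick validation pass is worth running first. This is a hypothetical helper, not part of any library:

```python
import json

def validate_jsonl(path: str) -> list:
    # Collect one message per malformed line; an empty list means the file is clean
    errors = []
    with open(path) as f:
        for i, line in enumerate(f, 1):
            if not line.strip():
                continue  # Allow trailing blank lines
            try:
                ex = json.loads(line)
            except json.JSONDecodeError:
                errors.append(f"line {i}: invalid JSON")
                continue
            for key in ("input", "expected"):
                if not isinstance(ex.get(key), str) or not ex[key].strip():
                    errors.append(f"line {i}: missing or empty '{key}'")
    return errors
```

`validate_jsonl("custom_eval/data/my_task.jsonl")` should return an empty list before you run the eval.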

Write the eval runner:

# custom_eval/run_eval.py
import json
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

MODEL_PATH = "/path/to/your/checkpoint"
DATA_PATH = "custom_eval/data/my_task.jsonl"
OUTPUT_PATH = "custom_eval/results/output.jsonl"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

results = []
with open(DATA_PATH) as f:
    examples = [json.loads(line) for line in f]

for ex in examples:
    messages = [{"role": "user", "content": ex["input"]}]
    input_ids = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,  # Required for instruction-tuned models
        return_tensors="pt"
    ).to(model.device)

    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            max_new_tokens=256,
            do_sample=False  # Greedy decoding keeps the eval deterministic
        )

    response = tokenizer.decode(
        output_ids[0][input_ids.shape[1]:],
        skip_special_tokens=True
    ).strip()

    results.append({
        "input": ex["input"],
        "expected": ex["expected"],
        "predicted": response
    })

with open(OUTPUT_PATH, "w") as f:
    for r in results:
        f.write(json.dumps(r) + "\n")

print(f"Saved {len(results)} results to {OUTPUT_PATH}")

Run it:

python custom_eval/run_eval.py

Score the results — use exact match for classification, ROUGE for summarization, or an LLM judge for open-ended:

# custom_eval/score.py
import json

with open("custom_eval/results/output.jsonl") as f:
    results = [json.loads(line) for line in f]

# Substring match: expected answer contained in the prediction (works for classification, extraction)
correct = sum(
    1 for r in results
    if r["expected"].lower().strip() in r["predicted"].lower()
)
print(f"Match accuracy: {correct}/{len(results)} = {correct/len(results):.1%}")

Expected output: Match accuracy: 47/50 = 94.0%
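
For summarization, ROUGE-L is the common choice. Here is a minimal dependency-free sketch based on longest-common-subsequence F1 over whitespace tokens; libraries like `rouge-score` implement the full metric with stemming:

```python
def lcs_len(a, b):
    # Dynamic-programming longest common subsequence length over two token lists
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(expected: str, predicted: str) -> float:
    # F1 over the LCS of lowercased whitespace tokens; 1.0 means an exact token match
    ref, hyp = expected.lower().split(), predicted.lower().split()
    lcs = lcs_len(ref, hyp)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(hyp), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```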

Run the same eval against the base model and record both scores. Your custom eval delta is the number that justifies shipping the fine-tune.


Verification

Run a quick sanity check that all three eval outputs exist and are non-empty:

ls -lh results/mmlu_finetuned.json \
        FastChat/fastchat/llm_judge/data/mt_bench/model_judgment/gpt-4o_single.jsonl \
        custom_eval/results/output.jsonl

Parse the MMLU result to confirm it loaded correctly:

python -c "
import json
r = json.load(open('results/mmlu_finetuned.json'))
print('MMLU acc:', round(r['results']['mmlu']['acc,none'], 4))
"

You should see: MMLU acc: 0.6312 (or your model's actual score)


Reading the Results Together

Metric                  | Acceptable             | Investigate     | Block ship
MMLU delta vs base      | ≥ −2 pts               | −2 to −4 pts    | < −4 pts
MT-Bench delta vs base  | ≥ 0 pts                | −0.3 to −0.5    | < −0.5
Custom eval accuracy    | Task-dependent target  | n/a             | Below base model

A fine-tune that passes all three is ready to ship. A fine-tune that passes only custom eval but fails MMLU and MT-Bench has overfit — it will degrade in production on queries slightly outside your training distribution.
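
The decision table can be encoded as a gate for CI. This sketch takes MMLU deltas in percentage points and MT-Bench deltas on the raw 1-10 score, and treats any MT-Bench drop milder than −0.5 as "investigate" (the table leaves the 0 to −0.3 band unspecified):

```python
def ship_gate(mmlu_delta_pts: float, mtbench_delta: float, custom_beats_base: bool) -> str:
    # mmlu_delta_pts: fine-tuned minus base MMLU accuracy, in percentage points.
    # mtbench_delta: fine-tuned minus base MT-Bench score (1-10 scale).
    if mmlu_delta_pts < -4 or mtbench_delta < -0.5 or not custom_beats_base:
        return "block"
    if mmlu_delta_pts < -2 or mtbench_delta < 0:
        return "investigate"
    return "ship"

# Using the expected outputs above: MMLU -0.75 pts, MT-Bench +0.47, custom eval improved
print(ship_gate(-0.75, 0.47, True))  # ship
```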


What You Learned

  • MMLU 5-shot is the fastest regression check — run it first, takes ~10 minutes on an 8B model with a single A100
  • MT-Bench catches instruction-following collapse that MMLU misses, but costs ~$3 USD and 20 minutes
  • Custom evals are the only metric that directly validates your business goal — build them before you fine-tune, not after
  • Always compare fine-tuned vs base on the same hardware and same batch size — different batch sizes change generation order and can shift scores by ±0.5 points

Tested on Python 3.12, lm-eval 0.4.3, FastChat 0.2.36, CUDA 12.1, Ubuntu 22.04 + RTX 4090


FAQ

Q: Do I need a GPU to run MMLU evaluation? A: You can run on CPU with --device cpu, but an 8B model will take 6–8 hours. On a single A100 (available on Lambda Labs at $1.10/hr USD), MMLU finishes in under 15 minutes.

Q: What's the difference between 0-shot and 5-shot MMLU? A: 5-shot provides five examples in the prompt before the question. Always use --num_fewshot 5 to match published benchmarks. 0-shot scores are typically 3–8 points lower and not comparable to leaderboard numbers.

Q: Can I use a cheaper judge than GPT-4o for MT-Bench? A: Yes. --judge-model gpt-4o-mini costs under $0.50 USD per eval run. The scores correlate well with GPT-4o for most categories, but math and coding judgments are noisier. For a final production gate, use full GPT-4o.

Q: My custom eval accuracy is high but MT-Bench dropped — should I ship? A: Investigate first. MT-Bench drop usually means the model learned to ignore multi-turn context or system prompts. Check that your training data uses the same chat template as the base model's original instruction format.

Q: How many custom eval examples do I need for a reliable score? A: 200–500 held-out examples keep the standard error around ±2 percentage points. Under 100 examples, a single batch of bad generations can swing the score by ±5 points and mislead your decision.
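
That rule of thumb follows from the binomial standard error; you can check it for your own accuracy level:

```python
import math

def binomial_se(accuracy: float, n_examples: int) -> float:
    # Standard error of a proportion: sqrt(p * (1 - p) / n)
    return math.sqrt(accuracy * (1 - accuracy) / n_examples)

# Sampling noise at different eval-set sizes, assuming ~90% accuracy
for n in (50, 100, 200, 500):
    print(f"n={n}: ±{binomial_se(0.9, n):.1%}")
```

At 90% accuracy this prints roughly ±4.2% at n=50, shrinking to ±1.3% at n=500.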