Fine-Tune Mistral for Legal Tasks in Under 60 Minutes

Train a specialized Mistral 7B model for contract analysis, legal Q&A, and document classification using LoRA and your own data.

Mistral 7B is a strong base model, but it hallucinates case citations, misreads contract clauses, and doesn't know your firm's document taxonomy. Fine-tuning fixes this — without the $100k+ cost of GPT-4 fine-tuning or managed APIs.

You'll learn:

  • How to prepare a legal dataset for instruction fine-tuning
  • How to fine-tune Mistral 7B with LoRA (runs on a single A100 or rented GPU)
  • How to evaluate and serve the model locally or via API

Time: 60 min | Level: Advanced


Why This Happens

Out-of-the-box instruction-tuned models are trained on general web data. Legal language is dense, domain-specific, and unforgiving — a model that paraphrases a liability clause incorrectly creates real risk.

Common symptoms:

  • Model invents case law citations (hallucination)
  • Misclassifies contract clauses (e.g., confuses indemnity with limitation of liability)
  • Ignores jurisdiction-specific nuance in Q&A
  • Poor formatting of legal outputs (missing numbered clauses, wrong citation style)

Fine-tuning on even 500–1,000 high-quality legal examples dramatically reduces these failure modes.


Solution

Step 1: Set Up Your Environment

You'll need a GPU with at least 24GB VRAM (A100 40GB recommended) or a cloud instance. This setup works on RunPod, Lambda Labs, or Google Colab A100.

# Create a clean environment
python -m venv legal-finetune
source legal-finetune/bin/activate

# Install dependencies
pip install transformers==4.40.0 \
            peft==0.10.0 \
            trl==0.8.6 \
            datasets==2.19.0 \
            bitsandbytes==0.43.1 \
            accelerate==0.29.3 \
            wandb  # optional, for tracking

Expected: All packages install without CUDA conflicts. If you see CUDA version errors, pin torch to a CUDA 12.1 build: pip install torch==2.2.2 --index-url https://download.pytorch.org/whl/cu121.

If it fails:

  • bitsandbytes CUDA error: Run pip install bitsandbytes --upgrade and ensure CUDA 12.1+ is installed
  • peft version conflict: pip install peft==0.10.0 --force-reinstall

Step 2: Prepare Your Dataset

The dataset format for instruction fine-tuning is a JSONL file with instruction, input, and output fields. Quality beats quantity here — 500 clean examples outperform 5,000 noisy ones.

# dataset_prep.py
import json

# Each example follows this structure
example = {
    "instruction": "Identify and explain the indemnification clause in the following contract excerpt.",
    "input": "Section 12. Indemnification. The Vendor shall defend, indemnify, and hold harmless the Client from any claims arising out of Vendor's breach of this Agreement...",
    "output": "This is a one-sided indemnification clause. The Vendor bears full indemnification obligations toward the Client. Key elements: (1) Vendor must actively defend claims, not just reimburse; (2) Coverage includes breach of agreement claims only, not tort claims; (3) No cap on indemnification amount is stated — this is a significant risk for the Vendor. Recommendation: negotiate a mutual indemnification clause or add a liability cap."
}

# Format for Mistral's chat template
def format_instruction(sample):
    # Mistral uses [INST] tags for chat-style fine-tuning
    return f"[INST] {sample['instruction']}\n\n{sample['input']} [/INST] {sample['output']}"

# Load your JSONL dataset
with open("legal_data.jsonl", "r") as f:
    data = [json.loads(line) for line in f]

formatted = [format_instruction(d) for d in data]
print(f"Dataset ready: {len(formatted)} examples")
print(f"Sample:\n{formatted[0][:300]}...")

Dataset sources to consider:

  • Annotated clauses from your firm's historical contracts
  • Public legal datasets: CUAD (510 contracts, 13,000+ annotations)
  • Synthetic data generated from GPT-4 with lawyer review (common in practice)

Screenshot: sample JSONL dataset structure in VS Code. Each row is one instruction-response pair; aim for at least 500 examples.
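Before training, it's worth screening the JSONL for missing fields and exact duplicates, since even a handful of malformed rows can quietly degrade a small fine-tuning run. A minimal sketch (the validate_examples helper and sample rows are illustrative, not part of the tutorial scripts):

```python
REQUIRED_NONEMPTY = ("instruction", "output")  # input may legitimately be empty

def validate_examples(rows):
    """Drop rows with missing/empty required fields and duplicate prompts."""
    seen, clean = set(), []
    for row in rows:
        if not all(isinstance(row.get(k), str) and row[k].strip() for k in REQUIRED_NONEMPTY):
            continue  # instruction and output must be non-empty strings
        key = (row["instruction"], row.get("input", ""))
        if key in seen:
            continue  # skip duplicate instruction/input pairs
        seen.add(key)
        clean.append(row)
    return clean

rows = [
    {"instruction": "Classify this clause.", "input": "Section 1...", "output": "Indemnity."},
    {"instruction": "Classify this clause.", "input": "Section 1...", "output": "Indemnity."},
    {"instruction": "Classify this clause.", "input": "Section 2...", "output": ""},
]
print(len(validate_examples(rows)))  # 1: one duplicate and one empty-output row dropped
```

Run this over your real dataset before Step 3; the deduplication key is the instruction/input pair, so the same contract excerpt can still appear under different instructions.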


Step 3: Load Mistral with 4-bit Quantization

4-bit quantization via bitsandbytes lets you fine-tune a 7B model on a single 24GB GPU. You lose less than 1% performance on most benchmarks.
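The memory figure reported below is consistent with back-of-envelope arithmetic: the roughly 7B linear-layer weights are stored at about half a byte each, while the embeddings and lm_head stay in 16-bit. A sketch with approximate parameter counts (the exact split varies by config):

```python
# Rough memory estimate for 4-bit Mistral 7B (parameter counts approximate)
quantized_params = 7.0e9   # linear-layer weights, ~4 bits (0.5 bytes) each under NF4
fp16_params = 0.26e9       # embeddings + lm_head, kept in 16-bit (2 bytes) by default
mem_gb = (quantized_params * 0.5 + fp16_params * 2) / 1e9
print(f"~{mem_gb:.1f} GB")  # in the same ballpark as the footprint printed below
```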

# train.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.3"

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",       # NormalFloat4 — best for LLMs
    bnb_4bit_compute_dtype=torch.bfloat16,  # Faster than float16 on Ampere GPUs
    bnb_4bit_use_double_quant=True,  # Reduces memory another ~0.4 bits/param
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token  # Mistral has no dedicated pad token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
)

print(f"Model loaded. Memory: {model.get_memory_footprint() / 1e9:.1f} GB")

Expected output:

Model loaded. Memory: 4.1 GB

If it fails:

  • OOM error: Reduce max_seq_length in Step 4 from 2048 to 1024
  • "Token indices sequence length > model max": Your dataset has examples too long — filter them out before training

Step 4: Configure LoRA and Train

LoRA (Low-Rank Adaptation) freezes the base model and trains small adapter matrices. For legal tasks, target the attention layers — they're where domain-specific reasoning happens.
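The trainable-parameter fraction reported by print_trainable_parameters below follows directly from the adapter shapes: each targeted weight matrix of shape (d_out, d_in) gets two low-rank factors, A of shape (r, d_in) and B of shape (d_out, r). A back-of-envelope count using Mistral 7B's published dimensions (hidden 4096, grouped-query key/value dim 1024, MLP intermediate 14336, 32 layers):

```python
# LoRA trainable-parameter count for Mistral 7B with r=16 on all seven projections
r, layers = 16, 32
hidden, kv_dim, inter = 4096, 1024, 14336
projections = [            # (d_out, d_in) for each target module
    (hidden, hidden),      # q_proj
    (kv_dim, hidden),      # k_proj (grouped-query attention: smaller kv dim)
    (kv_dim, hidden),      # v_proj
    (hidden, hidden),      # o_proj
    (inter, hidden),       # gate_proj
    (inter, hidden),       # up_proj
    (hidden, inter),       # down_proj
]
per_layer = sum(r * (d_in + d_out) for d_out, d_in in projections)
total = per_layer * layers
print(f"{total / 1e6:.1f}M trainable, ~{100 * total / 7.24e9:.2f}% of the base model")
```

About 42M trainable parameters against a 7.24B base, which is where the ~0.6% figure comes from.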

# train.py (continued)
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import Dataset

# LoRA config — rank 16 is a good default for domain adaptation
lora_config = LoraConfig(
    r=16,                    # Rank: higher = more capacity, more VRAM
    lora_alpha=32,           # Scaling factor (usually 2x rank)
    target_modules=[         # Which layers to adapt
        "q_proj", "k_proj",
        "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Expected: ~0.6% of params are trainable — keeps memory low

# Load the formatted dataset (format_instruction is defined in dataset_prep.py)
import json
from dataset_prep import format_instruction

with open("legal_data.jsonl") as f:
    raw = [json.loads(line) for line in f]

dataset = Dataset.from_list([{"text": format_instruction(d)} for d in raw])
dataset = dataset.train_test_split(test_size=0.1)

# Training arguments
args = TrainingArguments(
    output_dir="./mistral-legal-lora",
    num_train_epochs=3,              # 3 epochs is usually enough for 500-1k examples
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # Effective batch size = 16
    learning_rate=2e-4,              # Standard for LoRA fine-tuning
    fp16=False,
    bf16=True,                       # Use bfloat16 on Ampere (A100, RTX 3090+)
    logging_steps=10,
    save_strategy="epoch",
    evaluation_strategy="epoch",
    load_best_model_at_end=True,
    report_to="none",                # Set to "wandb" if you want tracking
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    dataset_text_field="text",
    max_seq_length=2048,
    args=args,
)

trainer.train()
trainer.save_model("./mistral-legal-final")
print("Training complete.")

Expected training time:

  • 500 examples, 3 epochs → ~25 minutes on A100 40GB
  • 1,000 examples, 3 epochs → ~45 minutes on A100 40GB
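Those estimates follow from the optimizer step count: with the effective batch size of 16 configured above and the 10% eval split, 500 examples work out to about 29 steps per epoch. A back-of-envelope sketch:

```python
import math

# Step-count arithmetic behind the training-time estimates above
examples, eval_frac, epochs = 500, 0.10, 3
effective_batch = 4 * 4   # per_device_train_batch_size * gradient_accumulation_steps
train_examples = round(examples * (1 - eval_frac))
steps_per_epoch = math.ceil(train_examples / effective_batch)
print(steps_per_epoch, "steps/epoch,", steps_per_epoch * epochs, "steps total")
```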

Screenshot: training loss curve in the terminal. Loss should decrease steadily; if it spikes after epoch 1, reduce the learning rate to 1e-4.

If it fails:

  • CUDA OOM mid-training: Reduce per_device_train_batch_size to 2, increase gradient_accumulation_steps to 8
  • Loss not decreasing: Check your dataset format — Mistral is sensitive to the [INST] tag placement
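A cheap guard against that second failure mode is to assert the template shape over every formatted example before training starts. A minimal sketch (check_template is a hypothetical helper, not part of the tutorial scripts):

```python
# Sanity-check the Mistral chat template: each example must open with [INST]
# and contain exactly one [/INST] after it.
def check_template(text):
    return (
        text.startswith("[INST] ")
        and text.count("[INST]") == 1
        and text.count("[/INST]") == 1
        and text.index("[INST]") < text.index("[/INST]")
    )

good = "[INST] Classify this clause.\n\nSection 1... [/INST] Indemnity clause."
bad = "Classify this clause. [/INST] Indemnity clause."   # missing opening tag
print(check_template(good), check_template(bad))  # True False
```

Running this over the full formatted list before calling trainer.train() catches template bugs in seconds rather than after a wasted epoch.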

Step 5: Merge and Export the Model

LoRA adapters are separate from the base model. Merge them for a self-contained model you can deploy anywhere.
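What merge_and_unload does under the hood is simple linear algebra: each adapted weight becomes W' = W + (lora_alpha / r) * B @ A. A toy illustration with 2x2 matrices (rank 1 here for readability; the training run above used r=16 and alpha=32, giving the same scale of 2):

```python
# Toy LoRA merge: W' = W + (alpha / r) * (B @ A), shown with plain lists
alpha, r = 2.0, 1          # scale alpha/r = 2, matching lora_alpha=32 / r=16 above
W = [[1.0, 0.0],
     [0.0, 1.0]]
B = [[0.5], [0.0]]         # shape (d_out, rank)
A = [[0.0, 1.0]]           # shape (rank, d_in)
scale = alpha / r
merged = [
    [W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(r)) for j in range(2)]
    for i in range(2)
]
print(merged)  # base weights plus the scaled low-rank update
```

Because the update is baked into W, the merged model needs no peft dependency at inference time.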

# merge_and_export.py
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

BASE_MODEL = "mistralai/Mistral-7B-Instruct-v0.3"
LORA_WEIGHTS = "./mistral-legal-final"

# Load base in float16 for merging (no quantization)
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.float16,
    device_map="cpu",  # Merge on CPU to avoid OOM
)

model = PeftModel.from_pretrained(base_model, LORA_WEIGHTS)
merged = model.merge_and_unload()  # Bakes LoRA weights into base model

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
merged.save_pretrained("./mistral-legal-merged")
tokenizer.save_pretrained("./mistral-legal-merged")

print("Merged model saved to ./mistral-legal-merged")
print("Size: ~14GB in float16")

If it fails:

  • OOM during merge: Use device_map="cpu" as shown above — merging on GPU isn't necessary

Verification

Run a quick inference test with a real legal prompt before deploying:

# test_inference.py
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="./mistral-legal-merged",
    torch_dtype="auto",
    device_map="auto",
)

prompt = "[INST] Does the following clause create a unilateral or mutual obligation? Explain.\n\nThe Client shall reimburse Vendor for all reasonable out-of-pocket expenses incurred in connection with the Services. [/INST]"

result = pipe(prompt, max_new_tokens=300, do_sample=False)
print(result[0]["generated_text"])

You should see: A structured legal analysis identifying the clause as a unilateral obligation on the Client, with no reciprocal obligation on the Vendor. If the output is vague or generic, your dataset may need more clause-type diversity.

Evaluation checklist:

  • Model correctly identifies clause types (indemnity, limitation of liability, termination, etc.)
  • No hallucinated case citations
  • Output follows your expected format (numbered points, formal register)
  • Jurisdiction references match your training data
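The first checklist item can be spot-checked automatically with a crude string-match scorer over a labeled eval set. A hypothetical sketch (clause_accuracy and the sample data are illustrative, not part of the tutorial code):

```python
# Crude clause-type accuracy: does the gold label appear in the model's analysis?
def clause_accuracy(predictions, gold_labels):
    hits = sum(gold in pred.lower() for pred, gold in zip(predictions, gold_labels))
    return hits / len(gold_labels)

preds = [
    "This is an indemnity clause: the Vendor must defend and hold harmless...",
    "The clause caps damages at fees paid, a classic limitation of liability...",
    "This reads as an assignment clause restricting transfer of rights...",
]
gold = ["indemnity", "limitation of liability", "termination"]
print(f"{clause_accuracy(preds, gold):.2f}")  # 2 of 3 labels matched
```

Substring matching is deliberately crude; it catches gross misclassification cheaply, but borderline cases still need lawyer review.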

Screenshot: model output in the terminal showing clause analysis. Clean, structured output; compare against a baseline Mistral-Instruct response to measure improvement.


What You Learned

  • LoRA fine-tuning lets you adapt a 7B model for legal tasks without full retraining — only ~0.6% of parameters are updated
  • Dataset quality matters more than size; 500 clean, annotated examples beat 5,000 scraped ones
  • The [INST] tag format is mandatory for Mistral instruction models — wrong formatting causes silent training failure
  • Merging LoRA weights gives you a portable model that runs anywhere Mistral runs

Limitations:

  • This approach doesn't give the model new factual knowledge — it learns style, format, and task execution, not new case law
  • Fine-tuned models still hallucinate; always implement output validation in production legal applications
  • Not a substitute for lawyer review — use this to triage and assist, not decide

When NOT to use this:

  • If your task is purely retrieval (use RAG instead — cheaper, more up-to-date)
  • If you have fewer than 200 high-quality examples (few-shot prompting will likely outperform fine-tuning)
  • If your jurisdiction changes frequently (fine-tuning bakes in knowledge; RAG stays current)

Tested on Mistral-7B-Instruct-v0.3, Python 3.11, CUDA 12.2, A100 40GB. Dataset: CUAD + synthetic annotations.