# Problem: Generic LLMs Fail at Specialized Legal Work
Mistral 7B is a strong base model, but it hallucinates case citations, misreads contract clauses, and doesn't know your firm's document taxonomy. Fine-tuning fixes this at a fraction of the cost of GPT-4 fine-tuning or managed APIs.
You'll learn:
- How to prepare a legal dataset for instruction fine-tuning
- How to fine-tune Mistral 7B with LoRA (runs on a single A100 or rented GPU)
- How to evaluate the result and serve it locally or via API
Time: 60 min | Level: Advanced
## Why This Happens
Out-of-the-box instruction-tuned models are trained on general web data. Legal language is dense, domain-specific, and unforgiving — a model that paraphrases a liability clause incorrectly creates real risk.
Common symptoms:
- Model invents case law citations (hallucination)
- Misclassifies contract clauses (e.g., confuses indemnity with limitation of liability)
- Ignores jurisdiction-specific nuance in Q&A
- Poor formatting of legal outputs (missing numbered clauses, wrong citation style)
Fine-tuning on even 500–1,000 high-quality legal examples dramatically reduces these failure modes.
## Solution

### Step 1: Set Up Your Environment
You'll need a GPU with at least 24GB VRAM (A100 40GB recommended) or a cloud instance. This setup works on RunPod, Lambda Labs, or Google Colab A100.
```bash
# Create a clean environment
python -m venv legal-finetune
source legal-finetune/bin/activate

# Install dependencies
pip install transformers==4.40.0 \
    peft==0.10.0 \
    trl==0.8.6 \
    datasets==2.19.0 \
    bitsandbytes==0.43.1 \
    accelerate==0.29.3 \
    wandb  # optional, for tracking
```
Expected: all packages install without CUDA conflicts. If you see CUDA version errors, pin `torch==2.2.2` built for CUDA 12.1 (install from the PyTorch cu121 wheel index).
If it fails:
- bitsandbytes CUDA error: run `pip install bitsandbytes --upgrade` and ensure CUDA 12.1+ is installed
- peft version conflict: `pip install peft==0.10.0 --force-reinstall`
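Version drift is the most common cause of the conflicts above. A stdlib-only sketch that compares installed versions against the pins from the pip command; `check_pins` is an illustrative helper, not a real tool:

```python
from importlib import metadata

# Mirror the pins from the pip install command above
PINS = {
    "transformers": "4.40.0",
    "peft": "0.10.0",
    "trl": "0.8.6",
    "datasets": "2.19.0",
    "bitsandbytes": "0.43.1",
    "accelerate": "0.29.3",
}

def check_pins(pins):
    """Return (package, wanted, found) for every mismatch; found is None if missing."""
    mismatches = []
    for pkg, wanted in pins.items():
        try:
            found = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches.append((pkg, wanted, found))
    return mismatches

# Usage: for pkg, wanted, found in check_pins(PINS): print(pkg, wanted, found)
```

Running this before training turns a cryptic mid-run CUDA error into an explicit version report.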
### Step 2: Prepare Your Legal Dataset
The dataset format for instruction fine-tuning is a JSONL file with `instruction`, `input`, and `output` fields. Quality beats quantity here — 500 clean examples outperform 5,000 noisy ones.
```python
# dataset_prep.py
import json

# Each example follows this structure
example = {
    "instruction": "Identify and explain the indemnification clause in the following contract excerpt.",
    "input": "Section 12. Indemnification. The Vendor shall defend, indemnify, and hold harmless the Client from any claims arising out of Vendor's breach of this Agreement...",
    "output": "This is a one-sided indemnification clause. The Vendor bears full indemnification obligations toward the Client. Key elements: (1) Vendor must actively defend claims, not just reimburse; (2) Coverage includes breach of agreement claims only, not tort claims; (3) No cap on indemnification amount is stated — this is a significant risk for the Vendor. Recommendation: negotiate a mutual indemnification clause or add a liability cap."
}

# Format for Mistral's chat template
def format_instruction(sample):
    # Mistral uses [INST] tags for chat-style fine-tuning
    return f"[INST] {sample['instruction']}\n\n{sample['input']} [/INST] {sample['output']}"

# Load your JSONL dataset
with open("legal_data.jsonl", "r") as f:
    data = [json.loads(line) for line in f]

formatted = [format_instruction(d) for d in data]
print(f"Dataset ready: {len(formatted)} examples")
print(f"Sample:\n{formatted[0][:300]}...")
```
Dataset sources to consider:
- Annotated clauses from your firm's historical contracts
- Public legal datasets: CUAD (510 contracts, 13,000+ annotations)
- Synthetic data generated from GPT-4 with lawyer review (common in practice)
Each row is one instruction-response pair — aim for 500+ examples minimum
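Before training, it pays to screen the JSONL for the problems that most often poison a fine-tune: malformed rows, empty fields, and exact duplicates. A minimal sketch assuming the three-field schema above; `validate_jsonl` is a hypothetical helper, not part of any library:

```python
import json

REQUIRED = ("instruction", "output")  # "input" may legitimately be empty

def validate_jsonl(lines):
    """Return (clean_rows, errors) for a list of JSONL strings.

    Drops malformed JSON, rows with an empty instruction or output,
    and exact duplicates on (instruction, input)."""
    seen, clean, errors = set(), [], []
    for lineno, line in enumerate(lines, 1):
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            errors.append((lineno, "invalid JSON"))
            continue
        if any(not str(row.get(key, "")).strip() for key in REQUIRED):
            errors.append((lineno, "empty required field"))
            continue
        key = (row["instruction"], row.get("input", ""))
        if key in seen:
            errors.append((lineno, "duplicate"))
            continue
        seen.add(key)
        clean.append(row)
    return clean, errors

# Usage:
#   with open("legal_data.jsonl") as f:
#       clean, errors = validate_jsonl(f.readlines())
```

Logging the `errors` list (line number plus reason) makes it easy to hand bad rows back to annotators instead of silently training on them.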
### Step 3: Load Mistral with 4-bit Quantization
4-bit quantization via bitsandbytes lets you fine-tune a 7B model on a single 24GB GPU, typically at the cost of less than 1% accuracy on common benchmarks.
```python
# train.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.3"

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 — best for LLMs
    bnb_4bit_compute_dtype=torch.bfloat16,  # Faster than float16 on Ampere GPUs
    bnb_4bit_use_double_quant=True,         # Reduces memory another ~0.4 bits/param
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token  # Mistral has no dedicated pad token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
)

print(f"Model loaded. Memory: {model.get_memory_footprint() / 1e9:.1f} GB")
```
Expected output:

```
Model loaded. Memory: 4.1 GB
```
If it fails:
- OOM error: reduce `max_seq_length` in Step 4 from 2048 to 1024
- "Token indices sequence length > model max": your dataset has examples that are too long — filter them out before training
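The "filter them out" fix can be sketched as a small helper. `drop_overlong` is hypothetical; `count_tokens` is any callable returning a token count, in practice `lambda t: len(tokenizer(t).input_ids)` with the tokenizer loaded in this step:

```python
def drop_overlong(texts, max_tokens, count_tokens):
    """Split texts into (kept, dropped) by a token budget.

    count_tokens is supplied by the caller so this works with any tokenizer."""
    kept = [t for t in texts if count_tokens(t) <= max_tokens]
    dropped = [t for t in texts if count_tokens(t) > max_tokens]
    return kept, dropped

# Usage with the real tokenizer (assumed, matching Step 4's max_seq_length):
#   kept, dropped = drop_overlong(
#       formatted, 2048, lambda t: len(tokenizer(t).input_ids))
#   print(f"Dropped {len(dropped)} overlong examples")
```

Dropping a handful of outlier contracts is usually cheaper than truncating them mid-clause, which can teach the model to produce incomplete analyses.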
### Step 4: Configure LoRA and Train
LoRA (Low-Rank Adaptation) freezes the base model and trains small adapter matrices. For legal tasks, target the attention layers — they're where domain-specific reasoning happens.
```python
# train.py (continued)
import json

from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import Dataset

from dataset_prep import format_instruction  # reuse the formatter from Step 2

# LoRA config — rank 16 is a good default for domain adaptation
lora_config = LoraConfig(
    r=16,                  # Rank: higher = more capacity, more VRAM
    lora_alpha=32,         # Scaling factor (usually 2x rank)
    target_modules=[       # Which layers to adapt
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Expected: ~0.6% of params are trainable — keeps memory low

# Load formatted dataset
with open("legal_data.jsonl") as f:
    raw = [json.loads(line) for line in f]

dataset = Dataset.from_list([{"text": format_instruction(d)} for d in raw])
dataset = dataset.train_test_split(test_size=0.1)

# Training arguments
args = TrainingArguments(
    output_dir="./mistral-legal-lora",
    num_train_epochs=3,              # 3 epochs is usually enough for 500-1k examples
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # Effective batch size = 16
    learning_rate=2e-4,              # Standard for LoRA fine-tuning
    fp16=False,
    bf16=True,                       # Use bfloat16 on Ampere (A100, RTX 3090+)
    logging_steps=10,
    save_strategy="epoch",
    evaluation_strategy="epoch",
    load_best_model_at_end=True,
    report_to="none",                # Set to "wandb" if you want tracking
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    dataset_text_field="text",
    max_seq_length=2048,
    args=args,
)

trainer.train()
trainer.save_model("./mistral-legal-final")
print("Training complete.")
```
Expected training time:
- 500 examples, 3 epochs → ~25 minutes on A100 40GB
- 1,000 examples, 3 epochs → ~45 minutes on A100 40GB
Loss should decrease steadily — if it spikes after epoch 1, reduce learning rate to 1e-4
If it fails:
- CUDA OOM mid-training: reduce `per_device_train_batch_size` to 2 and increase `gradient_accumulation_steps` to 8
- Loss not decreasing: check your dataset format — Mistral is sensitive to `[INST]` tag placement
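Bad `[INST]` placement fails silently, so it is worth checking every formatted example before training starts. A quick sketch; `check_inst_format` is a hypothetical helper, not part of trl:

```python
def check_inst_format(text):
    """Return a list of formatting problems for one formatted training example."""
    problems = []
    stripped = text.strip()
    if not stripped.startswith("[INST]"):
        problems.append("example must start with [INST]")
    if stripped.count("[INST]") != stripped.count("[/INST]"):
        problems.append("unbalanced [INST]/[/INST] tags")
    elif stripped.endswith("[/INST]"):
        problems.append("no target response after [/INST]")
    return problems

# Usage:
#   bad = [(i, p) for i, t in enumerate(formatted) if (p := check_inst_format(t))]
#   assert not bad, f"fix these examples before training: {bad}"
```

Catching an example that ends at `[/INST]` is especially important: it trains the model that an empty response is acceptable.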
### Step 5: Merge and Export the Model
LoRA adapters are separate from the base model. Merge them for a self-contained model you can deploy anywhere.
```python
# merge_and_export.py
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL = "mistralai/Mistral-7B-Instruct-v0.3"
LORA_WEIGHTS = "./mistral-legal-final"

# Load base in float16 for merging (no quantization)
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.float16,
    device_map="cpu",  # Merge on CPU to avoid OOM
)

model = PeftModel.from_pretrained(base_model, LORA_WEIGHTS)
merged = model.merge_and_unload()  # Bakes LoRA weights into base model

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
merged.save_pretrained("./mistral-legal-merged")
tokenizer.save_pretrained("./mistral-legal-merged")

print("Merged model saved to ./mistral-legal-merged")
print("Size: ~14GB in float16")
```
If it fails:
- OOM during merge: use `device_map="cpu"` as shown above — merging on GPU isn't necessary
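For the "serve via API" goal, one option is a minimal stdlib HTTP endpoint around the merged model. A sketch only: the handler wraps any `generate_fn`, the route and port are assumptions, and wiring in the real transformers pipeline is shown in comments:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def build_response(body: bytes, generate_fn) -> bytes:
    """Wrap a JSON request {"prompt": ...} in Mistral's [INST] format,
    call the supplied generator, and return the completion as JSON bytes."""
    request = json.loads(body)
    prompt = f"[INST] {request['prompt']} [/INST]"
    return json.dumps({"completion": generate_fn(prompt)}).encode()

class LegalModelHandler(BaseHTTPRequestHandler):
    # Placeholder; in production, replace with the real pipeline, e.g.:
    #   pipe = pipeline("text-generation", model="./mistral-legal-merged",
    #                   device_map="auto")
    #   LegalModelHandler.generate_fn = staticmethod(
    #       lambda p: pipe(p, max_new_tokens=300)[0]["generated_text"])
    generate_fn = staticmethod(lambda prompt: "(model output here)")

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = build_response(self.rfile.read(length), self.generate_fn)
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

# To serve: HTTPServer(("0.0.0.0", 8000), LegalModelHandler).serve_forever()
```

For production use you would add authentication, request size limits, and output validation, but this is enough to smoke-test the merged model from another machine.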
## Verification
Run a quick inference test with a real legal prompt before deploying:
```python
# test_inference.py
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="./mistral-legal-merged",
    torch_dtype="auto",
    device_map="auto",
)

prompt = "[INST] Does the following clause create a unilateral or mutual obligation? Explain.\n\nThe Client shall reimburse Vendor for all reasonable out-of-pocket expenses incurred in connection with the Services. [/INST]"

result = pipe(prompt, max_new_tokens=300, do_sample=False)
print(result[0]["generated_text"])
```
You should see: A structured legal analysis identifying the clause as a unilateral obligation on the Client, with no reciprocal obligation on the Vendor. If the output is vague or generic, your dataset may need more clause-type diversity.
Evaluation checklist:
- Model correctly identifies clause types (indemnity, limitation of liability, termination, etc.)
- No hallucinated case citations
- Output follows your expected format (numbered points, formal register)
- Jurisdiction references match your training data
Clean, structured output — compare against a baseline Mistral-Instruct response to measure improvement
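One item on the checklist, "no hallucinated case citations", can be partly automated: screen outputs for case-style citations that are not on a lawyer-reviewed allowlist. The regex is a crude heuristic for "Party v. Party" strings, not a real citation parser:

```python
import re

# Crude pattern for "Party v. Party" style case citations
CITATION_RE = re.compile(r"\b[A-Z][A-Za-z]+ v\.? [A-Z][A-Za-z]+\b")

def flag_citations(model_output, known_citations):
    """Return citations found in the output that are absent from the allowlist."""
    found = set(CITATION_RE.findall(model_output))
    return sorted(c for c in found if c not in known_citations)

# Usage: anything returned here goes to a human reviewer, not to the client
#   suspicious = flag_citations(result[0]["generated_text"], approved_citations)
```

A flagged citation is not proof of hallucination, only a prompt for human verification; multi-party names and reporter references will need a proper citation parser.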
## What You Learned
- LoRA fine-tuning lets you adapt a 7B model for legal tasks without full retraining — only ~0.6% of parameters are updated
- Dataset quality matters more than size; 500 clean, annotated examples beat 5,000 scraped ones
- The `[INST]` tag format is mandatory for Mistral instruction models — wrong formatting causes silent training failure
- Merging LoRA weights gives you a portable model that runs anywhere Mistral runs
Limitations:
- This approach doesn't give the model new factual knowledge — it learns style, format, and task execution, not new case law
- Fine-tuned models still hallucinate; always implement output validation in production legal applications
- Not a substitute for lawyer review — use this to triage and assist, not decide
When NOT to use this:
- If your task is purely retrieval (use RAG instead — cheaper, more up-to-date)
- If you have fewer than 200 high-quality examples (few-shot prompting will outperform)
- If your jurisdiction changes frequently (fine-tuning bakes in knowledge; RAG stays current)
Tested on Mistral-7B-Instruct-v0.3, Python 3.11, CUDA 12.2, A100 40GB. Dataset: CUAD + synthetic annotations.