# Problem: Generic LLMs Fail at Specialized Legal Work
Mistral 7B is a strong base model, but it hallucinates case citations, misreads contract clauses, and doesn't know your firm's document taxonomy. Fine-tuning fixes this at a fraction of the cost of GPT-4 fine-tuning or managed APIs.
You'll learn:
- How to prepare a legal dataset for instruction fine-tuning
- How to fine-tune Mistral 7B with LoRA (runs on a single A100 or rented GPU)
- How to evaluate the result and serve it locally or via API
Time: 60 min | Level: Advanced
## Why This Happens
Out-of-the-box instruction-tuned models are trained on general web data. Legal language is dense, domain-specific, and unforgiving — a model that paraphrases a liability clause incorrectly creates real risk.
Common symptoms:
- Model invents case law citations (hallucination)
- Misclassifies contract clauses (e.g., confuses indemnity with limitation of liability)
- Ignores jurisdiction-specific nuance in Q&A
- Poor formatting of legal outputs (missing numbered clauses, wrong citation style)
Fine-tuning on even 500–1,000 high-quality legal examples dramatically reduces these failure modes.
## Solution

### Step 1: Set Up Your Environment
You'll need a GPU with at least 24GB VRAM (A100 40GB recommended) or a cloud instance. This setup works on RunPod, Lambda Labs, or Google Colab A100.
```bash
# Create a clean environment
python -m venv legal-finetune
source legal-finetune/bin/activate

# Install dependencies
pip install transformers==4.40.0 \
    peft==0.10.0 \
    trl==0.8.6 \
    datasets==2.19.0 \
    bitsandbytes==0.43.1 \
    accelerate==0.29.3 \
    wandb  # optional, for tracking
```
Expected: all packages install without CUDA conflicts. If you see CUDA version errors, pin `torch==2.2.2` built for CUDA 12.1 (install from the PyTorch cu121 wheel index).
If it fails:
- bitsandbytes CUDA error: run `pip install bitsandbytes --upgrade` and ensure CUDA 12.1+ is installed
- peft version conflict: `pip install peft==0.10.0 --force-reinstall`
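Version drift is the most common cause of the conflicts above. A stdlib-only sketch that compares installed versions against the pins from the pip command; `check_pins` is an illustrative helper, not a real tool:

```python
from importlib import metadata

# Mirror the pins from the pip install command above
PINS = {
    "transformers": "4.40.0",
    "peft": "0.10.0",
    "trl": "0.8.6",
    "datasets": "2.19.0",
    "bitsandbytes": "0.43.1",
    "accelerate": "0.29.3",
}

def check_pins(pins):
    """Return (package, wanted, found) for every mismatch; found is None if missing."""
    mismatches = []
    for pkg, wanted in pins.items():
        try:
            found = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches.append((pkg, wanted, found))
    return mismatches

# Usage: for pkg, wanted, found in check_pins(PINS): print(pkg, wanted, found)
```

Running this before training turns a cryptic mid-run CUDA error into an explicit version report.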
### Step 2: Prepare Your Legal Dataset
The dataset format for instruction fine-tuning is a JSONL file with `instruction`, `input`, and `output` fields. Quality beats quantity here — 500 clean examples outperform 5,000 noisy ones.
```python
# dataset_prep.py
import json

# Each example follows this structure
example = {
    "instruction": "Identify and explain the indemnification clause in the following contract excerpt.",
    "input": "Section 12. Indemnification. The Vendor shall defend, indemnify, and hold harmless the Client from any claims arising out of Vendor's breach of this Agreement...",
    "output": "This is a one-sided indemnification clause. The Vendor bears full indemnification obligations toward the Client. Key elements: (1) Vendor must actively defend claims, not just reimburse; (2) Coverage includes breach of agreement claims only, not tort claims; (3) No cap on indemnification amount is stated — this is a significant risk for the Vendor. Recommendation: negotiate a mutual indemnification clause or add a liability cap."
}

# Format for Mistral's chat template
def format_instruction(sample):
    # Mistral uses [INST] tags for chat-style fine-tuning
    return f"[INST] {sample['instruction']}\n\n{sample['input']} [/INST] {sample['output']}"

# Load your JSONL dataset
with open("legal_data.jsonl", "r") as f:
    data = [json.loads(line) for line in f]

formatted = [format_instruction(d) for d in data]
print(f"Dataset ready: {len(formatted)} examples")
print(f"Sample:\n{formatted[0][:300]}...")
```
Dataset sources to consider:
- Annotated clauses from your firm's historical contracts
- Public legal datasets: CUAD (510 contracts, 13,000+ annotations)
- Synthetic data generated from GPT-4 with lawyer review (common in practice)
Each row is one instruction-response pair — aim for 500+ examples minimum
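Before training, it pays to screen the JSONL for the problems that most often poison a fine-tune: malformed rows, empty fields, and exact duplicates. A minimal sketch assuming the three-field schema above; `validate_jsonl` is a hypothetical helper, not part of any library:

```python
import json

REQUIRED = ("instruction", "output")  # "input" may legitimately be empty

def validate_jsonl(lines):
    """Return (clean_rows, errors) for a list of JSONL strings.

    Drops malformed JSON, rows with an empty instruction or output,
    and exact duplicates on (instruction, input)."""
    seen, clean, errors = set(), [], []
    for lineno, line in enumerate(lines, 1):
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            errors.append((lineno, "invalid JSON"))
            continue
        if any(not str(row.get(key, "")).strip() for key in REQUIRED):
            errors.append((lineno, "empty required field"))
            continue
        key = (row["instruction"], row.get("input", ""))
        if key in seen:
            errors.append((lineno, "duplicate"))
            continue
        seen.add(key)
        clean.append(row)
    return clean, errors

# Usage:
#   with open("legal_data.jsonl") as f:
#       clean, errors = validate_jsonl(f.readlines())
```

Logging the `errors` list (line number plus reason) makes it easy to hand bad rows back to annotators instead of silently training on them.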
### Step 3: Load Mistral with 4-bit Quantization
4-bit quantization via bitsandbytes lets you fine-tune a 7B model on a single 24GB GPU, typically at the cost of less than 1% accuracy on common benchmarks.
```python
# train.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.3"

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 — best for LLMs
    bnb_4bit_compute_dtype=torch.bfloat16,  # Faster than float16 on Ampere GPUs
    bnb_4bit_use_double_quant=True,         # Reduces memory another ~0.4 bits/param
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token  # Mistral has no dedicated pad token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
)

print(f"Model loaded. Memory: {model.get_memory_footprint() / 1e9:.1f} GB")
```
Expected output:

```
Model loaded. Memory: 4.1 GB
```
If it fails:
- OOM error: reduce `max_seq_length` in Step 4 from 2048 to 1024
- "Token indices sequence length > model max": your dataset has examples that are too long — filter them out before training
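The "filter them out" fix can be sketched as a small helper. `drop_overlong` is hypothetical; `count_tokens` is any callable returning a token count, in practice `lambda t: len(tokenizer(t).input_ids)` with the tokenizer loaded in this step:

```python
def drop_overlong(texts, max_tokens, count_tokens):
    """Split texts into (kept, dropped) by a token budget.

    count_tokens is supplied by the caller so this works with any tokenizer."""
    kept = [t for t in texts if count_tokens(t) <= max_tokens]
    dropped = [t for t in texts if count_tokens(t) > max_tokens]
    return kept, dropped

# Usage with the real tokenizer (assumed, matching Step 4's max_seq_length):
#   kept, dropped = drop_overlong(
#       formatted, 2048, lambda t: len(tokenizer(t).input_ids))
#   print(f"Dropped {len(dropped)} overlong examples")
```

Dropping a handful of outlier contracts is usually cheaper than truncating them mid-clause, which can teach the model to produce incomplete analyses.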
### Step 4: Configure LoRA and Train
LoRA (Low-Rank Adaptation) freezes the base model and trains small adapter matrices. For legal tasks, target the attention layers — they're where domain-specific reasoning happens.
```python
# train.py (continued)
import json

from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import Dataset

from dataset_prep import format_instruction  # reuse the formatter from Step 2

# LoRA config — rank 16 is a good default for domain adaptation
lora_config = LoraConfig(
    r=16,                  # Rank: higher = more capacity, more VRAM
    lora_alpha=32,         # Scaling factor (usually 2x rank)
    target_modules=[       # Which layers to adapt
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Expected: ~0.6% of params are trainable — keeps memory low

# Load formatted dataset
with open("legal_data.jsonl") as f:
    raw = [json.loads(line) for line in f]

dataset = Dataset.from_list([{"text": format_instruction(d)} for d in raw])
dataset = dataset.train_test_split(test_size=0.1)

# Training arguments
args = TrainingArguments(
    output_dir="./mistral-legal-lora",
    num_train_epochs=3,              # 3 epochs is usually enough for 500-1k examples
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # Effective batch size = 16
    learning_rate=2e-4,              # Standard for LoRA fine-tuning
    fp16=False,
    bf16=True,                       # Use bfloat16 on Ampere (A100, RTX 3090+)
    logging_steps=10,
    save_strategy="epoch",
    evaluation_strategy="epoch",
    load_best_model_at_end=True,
    report_to="none",                # Set to "wandb" if you want tracking
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    dataset_text_field="text",
    max_seq_length=2048,
    args=args,
)

trainer.train()
trainer.save_model("./mistral-legal-final")
print("Training complete.")
```
Expected training time:
- 500 examples, 3 epochs → ~25 minutes on A100 40GB
- 1,000 examples, 3 epochs → ~45 minutes on A100 40GB
Loss should decrease steadily — if it spikes after epoch 1, reduce learning rate to 1e-4
If it fails:
- CUDA OOM mid-training: reduce `per_device_train_batch_size` to 2 and increase `gradient_accumulation_steps` to 8
- Loss not decreasing: check your dataset format — Mistral is sensitive to `[INST]` tag placement
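Bad `[INST]` placement fails silently, so it is worth checking every formatted example before training starts. A quick sketch; `check_inst_format` is a hypothetical helper, not part of trl:

```python
def check_inst_format(text):
    """Return a list of formatting problems for one formatted training example."""
    problems = []
    stripped = text.strip()
    if not stripped.startswith("[INST]"):
        problems.append("example must start with [INST]")
    if stripped.count("[INST]") != stripped.count("[/INST]"):
        problems.append("unbalanced [INST]/[/INST] tags")
    elif stripped.endswith("[/INST]"):
        problems.append("no target response after [/INST]")
    return problems

# Usage:
#   bad = [(i, p) for i, t in enumerate(formatted) if (p := check_inst_format(t))]
#   assert not bad, f"fix these examples before training: {bad}"
```

Catching an example that ends at `[/INST]` is especially important: it trains the model that an empty response is acceptable.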
### Step 5: Merge and Export the Model
LoRA adapters are separate from the base model. Merge them for a self-contained model you can deploy anywhere.
```python
# merge_and_export.py
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL = "mistralai/Mistral-7B-Instruct-v0.3"
LORA_WEIGHTS = "./mistral-legal-final"

# Load base in float16 for merging (no quantization)
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.float16,
    device_map="cpu",  # Merge on CPU to avoid OOM
)

model = PeftModel.from_pretrained(base_model, LORA_WEIGHTS)
merged = model.merge_and_unload()  # Bakes LoRA weights into base model

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
merged.save_pretrained("./mistral-legal-merged")
tokenizer.save_pretrained("./mistral-legal-merged")

print("Merged model saved to ./mistral-legal-merged")
print("Size: ~14GB in float16")
```
If it fails:
- OOM during merge: use `device_map="cpu"` as shown above — merging on GPU isn't necessary
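For the "serve via API" goal, one option is a minimal stdlib HTTP endpoint around the merged model. A sketch only: the handler wraps any `generate_fn`, the route and port are assumptions, and wiring in the real transformers pipeline is shown in comments:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def build_response(body: bytes, generate_fn) -> bytes:
    """Wrap a JSON request {"prompt": ...} in Mistral's [INST] format,
    call the supplied generator, and return the completion as JSON bytes."""
    request = json.loads(body)
    prompt = f"[INST] {request['prompt']} [/INST]"
    return json.dumps({"completion": generate_fn(prompt)}).encode()

class LegalModelHandler(BaseHTTPRequestHandler):
    # Placeholder; in production, replace with the real pipeline, e.g.:
    #   pipe = pipeline("text-generation", model="./mistral-legal-merged",
    #                   device_map="auto")
    #   LegalModelHandler.generate_fn = staticmethod(
    #       lambda p: pipe(p, max_new_tokens=300)[0]["generated_text"])
    generate_fn = staticmethod(lambda prompt: "(model output here)")

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = build_response(self.rfile.read(length), self.generate_fn)
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

# To serve: HTTPServer(("0.0.0.0", 8000), LegalModelHandler).serve_forever()
```

For production use you would add authentication, request size limits, and output validation, but this is enough to smoke-test the merged model from another machine.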
## Verification
Run a quick inference test with a real legal prompt before deploying:
```python
# test_inference.py
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="./mistral-legal-merged",
    torch_dtype="auto",
    device_map="auto",
)

prompt = "[INST] Does the following clause create a unilateral or mutual obligation? Explain.\n\nThe Client shall reimburse Vendor for all reasonable out-of-pocket expenses incurred in connection with the Services. [/INST]"

result = pipe(prompt, max_new_tokens=300, do_sample=False)
print(result[0]["generated_text"])
```
You should see: A structured legal analysis identifying the clause as a unilateral obligation on the Client, with no reciprocal obligation on the Vendor. If the output is vague or generic, your dataset may need more clause-type diversity.
Evaluation checklist:
- Model correctly identifies clause types (indemnity, limitation of liability, termination, etc.)
- No hallucinated case citations
- Output follows your expected format (numbered points, formal register)
- Jurisdiction references match your training data
Clean, structured output — compare against a baseline Mistral-Instruct response to measure improvement
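One item on the checklist, "no hallucinated case citations", can be partly automated: screen outputs for case-style citations that are not on a lawyer-reviewed allowlist. The regex is a crude heuristic for "Party v. Party" strings, not a real citation parser:

```python
import re

# Crude pattern for "Party v. Party" style case citations
CITATION_RE = re.compile(r"\b[A-Z][A-Za-z]+ v\.? [A-Z][A-Za-z]+\b")

def flag_citations(model_output, known_citations):
    """Return citations found in the output that are absent from the allowlist."""
    found = set(CITATION_RE.findall(model_output))
    return sorted(c for c in found if c not in known_citations)

# Usage: anything returned here goes to a human reviewer, not to the client
#   suspicious = flag_citations(result[0]["generated_text"], approved_citations)
```

A flagged citation is not proof of hallucination, only a prompt for human verification; multi-party names and reporter references will need a proper citation parser.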
## What You Learned
- LoRA fine-tuning lets you adapt a 7B model for legal tasks without full retraining — only ~0.6% of parameters are updated
- Dataset quality matters more than size; 500 clean, annotated examples beat 5,000 scraped ones
- The `[INST]` tag format is mandatory for Mistral instruction models — wrong formatting causes silent training failure
- Merging LoRA weights gives you a portable model that runs anywhere Mistral runs
Limitations:
- This approach doesn't give the model new factual knowledge — it learns style, format, and task execution, not new case law
- Fine-tuned models still hallucinate; always implement output validation in production legal applications
- Not a substitute for lawyer review — use this to triage and assist, not decide
When NOT to use this:
- If your task is purely retrieval (use RAG instead — cheaper, more up-to-date)
- If you have fewer than 200 high-quality examples (few-shot prompting will outperform)
- If your jurisdiction changes frequently (fine-tuning bakes in knowledge; RAG stays current)
Tested on Mistral-7B-Instruct-v0.3, Python 3.11, CUDA 12.2, A100 40GB. Dataset: CUAD + synthetic annotations.