The base Llama 3 model knows everything about Python in general. It knows nothing about your internal API conventions, your error patterns, or the way your team names variables. Fine-tuning fixes that — on your hardware, for free. Forget the $0.06 per thousand tokens for GPT-4o and the privacy hand-wringing. With Ollama hitting 5M downloads, the tooling is here to make a model truly yours. This guide is for when you’ve outgrown ollama run and need to inject domain knowledge directly into the model’s weights.
When a Bigger Context Window Isn't the Answer
You’ve tried stuffing your 4,000-line internal SDK documentation into the system prompt. You’ve crafted meticulous few-shot examples. Yet, the model still hallucinates your proprietary createWidget() method’s signature. Prompt engineering hits a wall when the knowledge is too deep, too nuanced, or too structural.
Fine-tuning beats prompt engineering when:
- Pattern Internalization is Required: The model needs to learn a new style (e.g., your code review comments, support ticket responses) or a rigid output format (JSON with specific keys).
- Domain Jargon is Dense: Your field has terminology not found on the public web. A base model will approximate; a fine-tuned model will know.
- Latency and Cost of Context are High: Why burn 16K tokens of context reminding the model of your rules every time? Bake them in once.
- Privacy is Non-Negotiable: That 70% of self-hosted users citing data privacy? They’re your audience. Your data never leaves your machine.
The eval is simple. Create a benchmark of 50-100 tasks specific to your domain. If prompt engineering with a base model scores below 70% accuracy, it’s time for fine-tuning.
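That pass-rate check is easy to automate. Here's a minimal harness sketch; ask_model is a placeholder for however you call your model, and the two task checks are hypothetical examples, not part of any library:

```python
# Minimal benchmark harness sketch. `ask_model` is a placeholder you
# would wire to your model (Ollama API, transformers pipeline, etc.).
def ask_model(prompt: str) -> str:
    raise NotImplementedError("wire this to your model")

def pass_rate(tasks, ask=ask_model):
    """tasks: list of (prompt, check) pairs, where check(output) -> bool."""
    passed = sum(1 for prompt, check in tasks if check(ask(prompt)))
    return passed / len(tasks)

# Example with a stubbed model so the harness itself can be exercised:
tasks = [
    ("Call the telemetry logger", lambda out: "log_event(" in out),
    ("Deploy to staging",         lambda out: "--env staging" in out),
]
stub = lambda prompt: "internal_sdk.telemetry.log_event(...)"
rate = pass_rate(tasks, ask=stub)  # 1 of 2 checks pass -> 0.5
```

Swap the lambdas for real checks (regexes, AST parsing, or a judge LLM) as your tasks demand.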
Curating Your Dataset: More Than Just a JSONL Dump
Garbage in, garbage out. This is the most critical step. You need hundreds, not millions, of high-quality examples. Format matters: the open fine-tuning ecosystem (including the Unsloth workflow we use below) has largely standardized on the Alpaca format.
Structure:
{
"instruction": "Write a Python function using our internal SDK's data_fetcher module to get user by ID, with retry logic.",
"input": "user_id = 42, max_retries = 3",
"output": "import time\nimport internal_sdk\n\ndef get_user_with_retry(user_id, max_retries):\n    for i in range(max_retries):\n        try:\n            return internal_sdk.data_fetcher.get('user', user_id)\n        except internal_sdk.APITimeoutError:\n            if i == max_retries - 1:\n                raise\n            time.sleep(2 ** i)"
}
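Before training, each record has to be rendered into a single prompt string. A minimal sketch, assuming the classic Alpaca prompt layout (adjust the template to match whichever base model you fine-tune):

```python
def alpaca_to_text(example: dict) -> str:
    """Render one Alpaca-format record into a single training string.
    This is the classic Alpaca layout; swap in your base model's own
    chat template if it differs."""
    if example.get("input"):
        return (
            "### Instruction:\n" + example["instruction"] + "\n\n"
            "### Input:\n" + example["input"] + "\n\n"
            "### Response:\n" + example["output"]
        )
    return (
        "### Instruction:\n" + example["instruction"] + "\n\n"
        "### Response:\n" + example["output"]
    )
```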
Sources & Sizing:
- Code: Pull Git commits for specific patterns (e.g., "fix: null pointer in auth module"). Use aider or Shell-GPT to help format.
- Documentation: Convert .mdx files to Q&A pairs.
- Chat Logs: Anonymize and structure Slack/Teams discussions about solving domain problems.
- Size Guideline: Start with 500-1000 examples. For a 7B model, this is sufficient for strong domain adaptation. Quality trumps quantity every time.
Common Dataset Error:
- Symptom: Training runs but model quality degrades (catastrophic forgetting).
- Fix: Ensure 20-30% of your dataset contains general knowledge examples (e.g., "Write a hello world function in Python") to preserve the model's base capabilities.
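The 20-30% mix is easy to script. A sketch, assuming your domain and general examples are already loaded as lists of Alpaca dicts (mix_datasets is an illustrative helper, not a library function):

```python
import random

def mix_datasets(domain, general, general_fraction=0.25, seed=42):
    """Blend general-knowledge examples into a domain dataset.
    general_fraction is the share of *general* examples in the final mix."""
    rng = random.Random(seed)
    # Solve n_general / (len(domain) + n_general) = general_fraction
    n_general = int(len(domain) * general_fraction / (1 - general_fraction))
    mixed = domain + rng.sample(general, min(n_general, len(general)))
    rng.shuffle(mixed)
    return mixed

# 750 domain examples + 25% general padding -> 1000 total
domain = [{"instruction": f"domain task {i}", "output": "..."} for i in range(750)]
general = [{"instruction": f"general task {i}", "output": "..."} for i in range(500)]
mixed = mix_datasets(domain, general)
```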
LoRA vs. QLoRA: Picking Your Weapon Based on VRAM
This is the hardware decision. LoRA (Low-Rank Adaptation) is efficient. QLoRA (Quantized LoRA) is brutally efficient.
LoRA freezes the base model and injects small trainable rank-decomposition matrices into the attention (and, as in our script below, the MLP) projection layers. It’s fast and memory-efficient compared to full fine-tuning.
- Use it if: You have ample VRAM. For a 7B model at FP16 (14GB), LoRA adds ~100-200MB.
- Requires: VRAM for base model + LoRA params + optimizer states + gradients.
QLoRA goes further by quantizing the base model to 4-bit (NF4), then applying LoRA. It’s a game-changer for consumer hardware.
- Use it if: You're VRAM-constrained. This is how you fine-tune a 7B model on a GPU with 8GB.
- The Trade-off: There's a slight theoretical performance drop from quantization, but in practice, for domain adaptation, the signal from your data dominates the noise from quantization.
Hardware Reality Check (Using Mistral 7B):
| Quantization | Model VRAM | Training VRAM (QLoRA) | Best For |
|---|---|---|---|
| FP16 | ~14 GB | ~16-18 GB | RTX 4090 (24GB), multi-GPU |
| 8-bit | ~7 GB | ~10 GB | RTX 4070 Ti Super (16GB) |
| 4-bit (QLoRA) | ~5 GB | ~6-8 GB | RTX 4060 Ti (8GB), M3 Max (48GB Unified) |
For most developers, QLoRA is the default answer. It lets you fine-tune a usefully large model on a single consumer GPU.
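You can sanity-check the "~100-200MB adapter" claim yourself: for each adapted weight matrix of shape (d_out, d_in), LoRA's two low-rank factors add r * (d_in + d_out) trainable parameters. A back-of-envelope sketch using approximate Mistral-7B dimensions (illustrative, not exact):

```python
def lora_param_count(layers, shapes, r=16):
    """Trainable params LoRA adds: for each adapted matrix of shape
    (d_out, d_in), the A and B factors contribute r * (d_in + d_out)."""
    per_layer = sum(r * (d_in + d_out) for d_out, d_in in shapes)
    return layers * per_layer

# Rough Mistral-7B-style projection shapes (hidden 4096, GQA k/v 1024, MLP 14336):
shapes = [
    (4096, 4096),   # q_proj
    (1024, 4096),   # k_proj
    (1024, 4096),   # v_proj
    (4096, 4096),   # o_proj
    (14336, 4096),  # gate_proj
    (14336, 4096),  # up_proj
    (4096, 14336),  # down_proj
]
params = lora_param_count(layers=32, shapes=shapes, r=16)  # roughly 42M for these dims
```

At FP16 that's on the order of 84MB of adapter weights, before optimizer state, which is why the saved adapter directory stays tiny compared to the 14GB base model.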
The Training Run: Unsloth on Your Local Machine
We use unsloth, a library that dramatically speeds up LoRA/QLoRA training and reduces memory usage. It plugs into the standard Hugging Face stack (transformers, trl, peft), so an existing training script needs only minimal changes.
Step 1: Environment Setup
python -m venv .ft
source .ft/bin/activate # or .ft\Scripts\activate on Windows
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 # Adjust CUDA version
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install trl accelerate datasets huggingface-hub
Step 2: The Training Script (train.py)
from unsloth import FastLanguageModel
import torch
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments
# 1. Load and prep model with QLoRA
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "unsloth/Mistral-7B-v0.3-bnb-4bit", # Hugging Face model
max_seq_length = 2048,
dtype = torch.float16,
load_in_4bit = True, # Enable QLoRA
)
# 2. Add LoRA adapters
model = FastLanguageModel.get_peft_model(
model,
r = 16, # LoRA rank
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
lora_alpha = 16,
lora_dropout = 0,
bias = "none",
use_gradient_checkpointing = "unsloth",
random_state = 42,
)
# 3. Load your dataset and render each Alpaca record into the "text" field the trainer expects
dataset = load_dataset('json', data_files='my_fine_tune_data.jsonl', split='train')
def to_text(ex):
    prompt = ex["instruction"] + (("\n" + ex["input"]) if ex.get("input") else "")
    return {"text": "[INST] " + prompt + " [/INST] " + ex["output"] + tokenizer.eos_token}
dataset = dataset.map(to_text)
# 4. Configure the Trainer
trainer = SFTTrainer(
model = model,
tokenizer = tokenizer,
train_dataset = dataset,
dataset_text_field = "text", # Your formatted text column
args = TrainingArguments(
per_device_train_batch_size = 2,
gradient_accumulation_steps = 4,
warmup_steps = 10,
max_steps = 150, # Start small! 150-300 steps often enough.
learning_rate = 2e-4,
fp16 = not torch.cuda.is_bf16_supported(),
bf16 = torch.cuda.is_bf16_supported(),
logging_steps = 1,
output_dir = "outputs",
optim = "adamw_8bit",
seed = 42,
),
)
# 5. Train
trainer.train()
# 6. Save the LoRA adapters
model.save_pretrained("my_fine_tuned_lora") # Saves only ~50MB
tokenizer.save_pretrained("my_fine_tuned_lora")
Run it: python train.py. On an RTX 4090, 150 steps might take 20 minutes. Watch for the dreaded VRAM OOM.
Training Error & Fix:
- Error: OutOfMemoryError: CUDA out of memory.
- Fix: Drastically reduce per_device_train_batch_size (try 1). Increase gradient_accumulation_steps to compensate (e.g., batch=1, accumulation=8). This maintains the effective batch size while lowering peak VRAM.
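The "maintains effective batch size" claim is just arithmetic, and it's worth checking whenever you retune these two knobs:

```python
# Effective batch size = per-device batch size x gradient accumulation steps.
# Both configs below see the same number of examples per optimizer step;
# the second simply holds fewer activations in VRAM at once.
configs = [
    {"per_device_train_batch_size": 2, "gradient_accumulation_steps": 4},  # original
    {"per_device_train_batch_size": 1, "gradient_accumulation_steps": 8},  # OOM fix
]
effective = [
    c["per_device_train_batch_size"] * c["gradient_accumulation_steps"]
    for c in configs
]  # both are 8
```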
Benchmark: Before and After on Your Turf
Don't trust abstract benchmarks. Test on your data. Here’s a hypothetical for a Python internal SDK:
| Task | Base Mistral 7B | Fine-Tuned Mistral 7B (Our QLoRA) |
|---|---|---|
| "Write a call to internal_sdk.telemetry.log_event" | Uses generic print() or hallucinates non-existent params. | Correctly uses log_event(event_name, payload, severity="INFO") with proper import. |
| "Fix this error: 'ValidationError: field 'userId' must be a string'" | Suggests generic type check. | Suggests data['userId'] = str(data['userId']) per our API convention. |
| "Generate a CLI command for our tool to deploy to staging" | Hallucinates flags. | Correctly outputs ./deploy --env staging --region us-east-2 --rollback-on-fail. |
| HumanEval Score | 30.1% | Not measured (different domain) |
| Domain-Specific Accuracy | ~45% | ~92% |
The fine-tuned model isn’t smarter; it’s specialized. It trades general knowledge for specific competence—exactly what you want.
From PyTorch Adapters to an Ollama Modelfile
You have a my_fine_tuned_lora directory. Ollama can't use it directly. You need to merge the LoRA adapters with the base model and convert to GGUF, Ollama's native format.
Step 1: Merge and Convert (Using unsloth and llama.cpp)
# merge_and_convert.py
from unsloth import FastLanguageModel
import torch
# Load your adapters; unsloth resolves the base model from the adapter config
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "./my_fine_tuned_lora", # Your adapter directory
max_seq_length = 2048,
)
# Merge model and save to Hugging Face format
merged_model_path = "./merged_model"
model.save_pretrained_merged(merged_model_path, tokenizer, save_method = "merged_16bit",)
# Now convert to GGUF using llama.cpp
Then, use llama.cpp's conversion script (you'll need to clone and build the repo) to convert the merged model to GGUF. This is the fiddliest part. Note that recent llama.cpp checkouts name the script convert_hf_to_gguf.py and the quantize binary llama-quantize; older checkouts call them convert.py and quantize.
# Assuming llama.cpp is cloned and built
python llama.cpp/convert_hf_to_gguf.py ./merged_model --outtype f16 --outfile ./my-model.f16.gguf
# Quantize it for efficiency (recommended)
./llama.cpp/llama-quantize ./my-model.f16.gguf ./my-model.q4_K_M.gguf q4_K_M
Step 2: Create an Ollama Modelfile
Create a file named Modelfile:
FROM ./my-model.q4_K_M.gguf
TEMPLATE """[INST] {{ .System }} {{ .Prompt }} [/INST] """
PARAMETER stop "[INST]"
PARAMETER stop "[/INST]"
SYSTEM """You are an expert in our internal software development kit and APIs. You provide accurate, concise code and answers that follow our exact conventions."""
Step 3: Create and Run Your Model
ollama create my-internal-model -f ./Modelfile
ollama run my-internal-model
Import Error & Fix:
- Error: Error: model 'my-internal-model' not found after create.
- Fix: The FROM path in the Modelfile is relative. Use an absolute path, or run ollama create from the directory containing the GGUF file. Double-check that the GGUF file was generated successfully.
Evaluating What Actually Matters: ROUGE, Humans, and Tasks
Forget MMLU. You need domain-specific evaluation.
Automated Metrics (The First Pass):
- ROUGE-L / BLEU: Good for checking if outputs are structurally similar to your training data (e.g., code syntax). Use libraries like evaluate.
- Task-Specific Pass Rate: Run your 50-100 benchmark tasks. An output "passes" if it meets all criteria (correct function, correct API call, no hallucinations). Automate this with simple rule checks or by using a judge LLM (such as a larger base model).
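If you want intuition for what ROUGE-L actually measures before reaching for the evaluate library, it's an F1 score over the longest common subsequence of tokens. A from-scratch sketch (whitespace tokenization only; real implementations handle tokenization and stemming more carefully):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 on whitespace tokens: harmonic mean of LCS-based
    precision (vs. candidate length) and recall (vs. reference length)."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    p, rec = lcs / len(c), lcs / len(r)
    return 2 * p * rec / (p + rec)
```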
Human Evaluation (The Final Word):
- Blindly present outputs from the base and fine-tuned model to 2-3 senior team members.
- Ask: "Which is correct? Which follows our standards? Which would you approve in a code review?"
- If the fine-tuned model wins >80% of head-to-head comparisons, you've succeeded.
Next Steps: Deployment and Iteration
Your model is now running locally via ollama run my-internal-model. Integrate it:
- Into your IDE: Use Continue.dev or a custom extension to call the Ollama API (http://localhost:11434/api/generate) for in-editor completions.
- Into your apps: Use the Ollama API directly, or via LangChain (Ollama(model="my-internal-model")) for internal tools.
- Into your workflow: Script common tasks with Shell-GPT pointed at your local model.
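Calling the local API needs nothing beyond the standard library. A sketch of a non-streaming request against Ollama's /api/generate endpoint (the model name is the one created above; with stream disabled, Ollama returns one JSON object whose completion lives under the "response" key):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(model: str, prompt: str) -> dict:
    """Request body for a non-streaming /api/generate call."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """POST the prompt to the local Ollama server and return the completion."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# With the server running:
# print(generate("my-internal-model", "How do I call telemetry.log_event?"))
```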
Set up a pipeline. As your internal APIs evolve, add new examples to your dataset. Retraining is cheap. Every quarter, regenerate your benchmark, fine-tune a new adapter, and swap the GGUF file. You’ve built a living system that improves alongside your codebase.
The goal wasn't to beat GPT-4's 67% on HumanEval. It was to get a 92% on your eval. Your RTX 4090 is no longer crying silicon tears; it's running the most knowledgeable intern about your codebase that you've ever had, one that works offline, for free, and never leaks a secret. That’s the point of local LLMs. Not just to run them, but to remake them in your own image.