The base Llama 3 model knows everything about Python in general. It knows nothing about your internal API conventions, your error patterns, or the way your team names variables. Fine-tuning fixes that — on your hardware, for free. Forget the $0.06 per thousand tokens for GPT-4o and the privacy hand-wringing. With Ollama hitting 5M downloads, the tooling is here to make a model truly yours. This guide is for when you’ve outgrown ollama run and need to inject domain knowledge directly into the model’s weights.
When a Bigger Context Window Isn't the Answer
You’ve tried stuffing your 4,000-line internal SDK documentation into the system prompt. You’ve crafted meticulous few-shot examples. Yet, the model still hallucinates your proprietary createWidget() method’s signature. Prompt engineering hits a wall when the knowledge is too deep, too nuanced, or too structural.
Fine-tuning beats prompt engineering when:
- Pattern Internalization is Required: The model needs to learn a new style (e.g., your code review comments, support ticket responses) or a rigid output format (JSON with specific keys).
- Domain Jargon is Dense: Your field has terminology not found on the public web. A base model will approximate; a fine-tuned model will know.
- Latency and Cost of Context are High: Why burn 16K tokens of context reminding the model of your rules every time? Bake them in once.
- Privacy is Non-Negotiable: That 70% of self-hosted users citing data privacy? They’re your audience. Your data never leaves your machine.
The eval is simple. Create a benchmark of 50-100 tasks specific to your domain. If prompt engineering with a base model scores below 70% accuracy, it’s time for fine-tuning.
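That pass-rate check is easy to automate. Here's a minimal harness sketch; ask_model is a placeholder for however you call your model, and the two task checks are hypothetical examples, not part of any library:

```python
# Minimal benchmark harness sketch. `ask_model` is a placeholder you
# would wire to your model (Ollama API, transformers pipeline, etc.).
def ask_model(prompt: str) -> str:
    raise NotImplementedError("wire this to your model")

def pass_rate(tasks, ask=ask_model):
    """tasks: list of (prompt, check) pairs, where check(output) -> bool."""
    passed = sum(1 for prompt, check in tasks if check(ask(prompt)))
    return passed / len(tasks)

# Example with a stubbed model so the harness itself can be exercised:
tasks = [
    ("Call the telemetry logger", lambda out: "log_event(" in out),
    ("Deploy to staging",         lambda out: "--env staging" in out),
]
stub = lambda prompt: "internal_sdk.telemetry.log_event(...)"
rate = pass_rate(tasks, ask=stub)  # 1 of 2 checks pass -> 0.5
```

Swap the lambdas for real checks (regexes, AST parsing, or a judge LLM) as your tasks demand.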
Curating Your Dataset: More Than Just a JSONL Dump
Garbage in, garbage out. This is the most critical step. You need hundreds, not millions, of high-quality examples. Format matters: the open fine-tuning ecosystem (including the Unsloth workflow we use below) has largely standardized on the Alpaca format.
Structure:
{
"instruction": "Write a Python function using our internal SDK's data_fetcher module to get user by ID, with retry logic.",
"input": "user_id = 42, max_retries = 3",
"output": "import time\nimport internal_sdk\n\ndef get_user_with_retry(user_id, max_retries):\n    for i in range(max_retries):\n        try:\n            return internal_sdk.data_fetcher.get('user', user_id)\n        except internal_sdk.APITimeoutError:\n            if i == max_retries - 1:\n                raise\n            time.sleep(2 ** i)"
}
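Before training, each record has to be rendered into a single prompt string. A minimal sketch, assuming the classic Alpaca prompt layout (adjust the template to match whichever base model you fine-tune):

```python
def alpaca_to_text(example: dict) -> str:
    """Render one Alpaca-format record into a single training string.
    This is the classic Alpaca layout; swap in your base model's own
    chat template if it differs."""
    if example.get("input"):
        return (
            "### Instruction:\n" + example["instruction"] + "\n\n"
            "### Input:\n" + example["input"] + "\n\n"
            "### Response:\n" + example["output"]
        )
    return (
        "### Instruction:\n" + example["instruction"] + "\n\n"
        "### Response:\n" + example["output"]
    )
```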
Sources & Sizing:
- Code: Pull Git commits for specific patterns (e.g., "fix: null pointer in auth module"). Use aider or Shell-GPT to help format.
- Documentation: Convert .mdx files to Q&A pairs.
- Chat Logs: Anonymize and structure Slack/Teams discussions about solving domain problems.
- Size Guideline: Start with 500-1000 examples. For a 7B model, this is sufficient for strong domain adaptation. Quality trumps quantity every time.
Common Dataset Error:
- Symptom: Training runs but model quality degrades (catastrophic forgetting).
- Fix: Ensure 20-30% of your dataset contains general knowledge examples (e.g., "Write a hello world function in Python") to preserve the model's base capabilities.
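The 20-30% mix is easy to script. A sketch, assuming your domain and general examples are already loaded as lists of Alpaca dicts (mix_datasets is an illustrative helper, not a library function):

```python
import random

def mix_datasets(domain, general, general_fraction=0.25, seed=42):
    """Blend general-knowledge examples into a domain dataset.
    general_fraction is the share of *general* examples in the final mix."""
    rng = random.Random(seed)
    # Solve n_general / (len(domain) + n_general) = general_fraction
    n_general = int(len(domain) * general_fraction / (1 - general_fraction))
    mixed = domain + rng.sample(general, min(n_general, len(general)))
    rng.shuffle(mixed)
    return mixed

# 750 domain examples + 25% general padding -> 1000 total
domain = [{"instruction": f"domain task {i}", "output": "..."} for i in range(750)]
general = [{"instruction": f"general task {i}", "output": "..."} for i in range(500)]
mixed = mix_datasets(domain, general)
```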
LoRA vs. QLoRA: Picking Your Weapon Based on VRAM
This is the hardware decision. LoRA (Low-Rank Adaptation) is efficient. QLoRA (Quantized LoRA) is brutally efficient.
LoRA freezes the base model and injects small trainable rank-decomposition matrices into the attention (and, as in our script below, the MLP) projection layers. It’s fast and memory-efficient compared to full fine-tuning.
- Use it if: You have ample VRAM. For a 7B model at FP16 (14GB), LoRA adds ~100-200MB.
- Requires: VRAM for base model + LoRA params + optimizer states + gradients.
QLoRA goes further by quantizing the base model to 4-bit (NF4), then applying LoRA. It’s a game-changer for consumer hardware.
- Use it if: You're VRAM-constrained. This is how you fine-tune a 7B model on a GPU with 8GB.
- The Trade-off: There's a slight theoretical performance drop from quantization, but in practice, for domain adaptation, the signal from your data dominates the noise from quantization.
Hardware Reality Check (Using Mistral 7B):
| Quantization | Model VRAM | Training VRAM (QLoRA) | Best For |
|---|---|---|---|
| FP16 | ~14 GB | ~16-18 GB | RTX 4090 (24GB), multi-GPU |
| 8-bit | ~7 GB | ~10 GB | RTX 4070 Ti Super (16GB) |
| 4-bit (QLoRA) | ~5 GB | ~6-8 GB | RTX 4060 Ti (8GB), M3 Max (48GB Unified) |
For most developers, QLoRA is the default answer. It lets you fine-tune a usefully large model on a single consumer GPU.
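You can sanity-check the "~100-200MB adapter" claim yourself: for each adapted weight matrix of shape (d_out, d_in), LoRA's two low-rank factors add r * (d_in + d_out) trainable parameters. A back-of-envelope sketch using approximate Mistral-7B dimensions (illustrative, not exact):

```python
def lora_param_count(layers, shapes, r=16):
    """Trainable params LoRA adds: for each adapted matrix of shape
    (d_out, d_in), the A and B factors contribute r * (d_in + d_out)."""
    per_layer = sum(r * (d_in + d_out) for d_out, d_in in shapes)
    return layers * per_layer

# Rough Mistral-7B-style projection shapes (hidden 4096, GQA k/v 1024, MLP 14336):
shapes = [
    (4096, 4096),   # q_proj
    (1024, 4096),   # k_proj
    (1024, 4096),   # v_proj
    (4096, 4096),   # o_proj
    (14336, 4096),  # gate_proj
    (14336, 4096),  # up_proj
    (4096, 14336),  # down_proj
]
params = lora_param_count(layers=32, shapes=shapes, r=16)  # roughly 42M for these dims
```

At FP16 that's on the order of 84MB of adapter weights, before optimizer state, which is why the saved adapter directory stays tiny compared to the 14GB base model.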
The Training Run: Unsloth on Your Local Machine
We use unsloth, a library that dramatically speeds up LoRA/QLoRA training and reduces memory usage. It plugs into the standard Hugging Face stack (transformers, trl, peft), so an existing training script needs only minimal changes.
Step 1: Environment Setup
python -m venv .ft
source .ft/bin/activate # or .ft\Scripts\activate on Windows
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 # Adjust CUDA version
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install trl accelerate datasets huggingface-hub
Step 2: The Training Script (train.py)
from unsloth import FastLanguageModel
import torch
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments
# 1. Load and prep model with QLoRA
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "unsloth/Mistral-7B-v0.3-bnb-4bit", # Hugging Face model
max_seq_length = 2048,
dtype = torch.float16,
load_in_4bit = True, # Enable QLoRA
)
# 2. Add LoRA adapters
model = FastLanguageModel.get_peft_model(
model,
r = 16, # LoRA rank
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
lora_alpha = 16,
lora_dropout = 0,
bias = "none",
use_gradient_checkpointing = "unsloth",
random_state = 42,
)
# 3. Load your dataset and render each Alpaca record into the "text" field the trainer expects
dataset = load_dataset('json', data_files='my_fine_tune_data.jsonl', split='train')
def to_text(ex):
    prompt = ex["instruction"] + (("\n" + ex["input"]) if ex.get("input") else "")
    return {"text": "[INST] " + prompt + " [/INST] " + ex["output"] + tokenizer.eos_token}
dataset = dataset.map(to_text)
# 4. Configure the Trainer
trainer = SFTTrainer(
model = model,
tokenizer = tokenizer,
train_dataset = dataset,
dataset_text_field = "text", # Your formatted text column
args = TrainingArguments(
per_device_train_batch_size = 2,
gradient_accumulation_steps = 4,
warmup_steps = 10,
max_steps = 150, # Start small! 150-300 steps often enough.
learning_rate = 2e-4,
fp16 = not torch.cuda.is_bf16_supported(),
bf16 = torch.cuda.is_bf16_supported(),
logging_steps = 1,
output_dir = "outputs",
optim = "adamw_8bit",
seed = 42,
),
)
# 5. Train
trainer.train()
# 6. Save the LoRA adapters
model.save_pretrained("my_fine_tuned_lora") # Saves only ~50MB
tokenizer.save_pretrained("my_fine_tuned_lora")
Run it: python train.py. On an RTX 4090, 150 steps might take 20 minutes. Watch for the dreaded VRAM OOM.
Training Error & Fix:
- Error: OutOfMemoryError: CUDA out of memory.
- Fix: Drastically reduce per_device_train_batch_size (try 1). Increase gradient_accumulation_steps to compensate (e.g., batch=1, accumulation=8). This maintains the effective batch size while lowering peak VRAM.
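The "maintains effective batch size" claim is just arithmetic, and it's worth checking whenever you retune these two knobs:

```python
# Effective batch size = per-device batch size x gradient accumulation steps.
# Both configs below see the same number of examples per optimizer step;
# the second simply holds fewer activations in VRAM at once.
configs = [
    {"per_device_train_batch_size": 2, "gradient_accumulation_steps": 4},  # original
    {"per_device_train_batch_size": 1, "gradient_accumulation_steps": 8},  # OOM fix
]
effective = [
    c["per_device_train_batch_size"] * c["gradient_accumulation_steps"]
    for c in configs
]  # both are 8
```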
Benchmark: Before and After on Your Turf
Don't trust abstract benchmarks. Test on your data. Here’s a hypothetical for a Python internal SDK:
| Task | Base Mistral 7B | Fine-Tuned Mistral 7B (Our QLoRA) |
|---|---|---|
| "Write a call to internal_sdk.telemetry.log_event" | Uses generic print() or hallucinates non-existent params. | Correctly uses log_event(event_name, payload, severity="INFO") with proper import. |
| "Fix this error: 'ValidationError: field 'userId' must be a string'" | Suggests generic type check. | Suggests data['userId'] = str(data['userId']) per our API convention. |
| "Generate a CLI command for our tool to deploy to staging" | Hallucinates flags. | Correctly outputs ./deploy --env staging --region us-east-2 --rollback-on-fail. |
| HumanEval Score | 30.1% | Not measured (different domain) |
| Domain-Specific Accuracy | ~45% | ~92% |
The fine-tuned model isn’t smarter; it’s specialized. It trades general knowledge for specific competence—exactly what you want.
From PyTorch Adapters to an Ollama Modelfile
You have a my_fine_tuned_lora directory. Ollama can't use it directly. You need to merge the LoRA adapters with the base model and convert to GGUF, Ollama's native format.
Step 1: Merge and Convert (Using unsloth and llama.cpp)
# merge_and_convert.py
from unsloth import FastLanguageModel
import torch
# Load your adapters; unsloth resolves the base model from the adapter config
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "./my_fine_tuned_lora", # Your adapter directory
max_seq_length = 2048,
)
# Merge model and save to Hugging Face format
merged_model_path = "./merged_model"
model.save_pretrained_merged(merged_model_path, tokenizer, save_method = "merged_16bit",)
# Now convert to GGUF using llama.cpp
Then, use llama.cpp's conversion script (you'll need to clone and build the repo) to convert the merged model to GGUF. This is the fiddliest part. Note that recent llama.cpp checkouts name the script convert_hf_to_gguf.py and the quantize binary llama-quantize; older checkouts call them convert.py and quantize.
# Assuming llama.cpp is cloned and built
python llama.cpp/convert_hf_to_gguf.py ./merged_model --outtype f16 --outfile ./my-model.f16.gguf
# Quantize it for efficiency (recommended)
./llama.cpp/llama-quantize ./my-model.f16.gguf ./my-model.q4_K_M.gguf q4_K_M
Step 2: Create an Ollama Modelfile
Create a file named Modelfile:
FROM ./my-model.q4_K_M.gguf
TEMPLATE """[INST] {{ .System }} {{ .Prompt }} [/INST] """
PARAMETER stop "[INST]"
PARAMETER stop "[/INST]"
SYSTEM """You are an expert in our internal software development kit and APIs. You provide accurate, concise code and answers that follow our exact conventions."""
Step 3: Create and Run Your Model
ollama create my-internal-model -f ./Modelfile
ollama run my-internal-model
Import Error & Fix:
- Error: Error: model 'my-internal-model' not found after create.
- Fix: The FROM path in the Modelfile is relative. Use an absolute path, or run ollama create from the directory containing the GGUF file. Double-check that the GGUF file was generated successfully.
Evaluating What Actually Matters: ROUGE, Humans, and Tasks
Forget MMLU. You need domain-specific evaluation.
Automated Metrics (The First Pass):
- ROUGE-L / BLEU: Good for checking if outputs are structurally similar to your training data (e.g., code syntax). Use libraries like evaluate.
- Task-Specific Pass Rate: Run your 50-100 benchmark tasks. An output "passes" if it meets all criteria (correct function, correct API call, no hallucinations). Automate this with simple rule checks or by using a judge LLM (such as a larger base model).
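If you want intuition for what ROUGE-L actually measures before reaching for the evaluate library, it's an F1 score over the longest common subsequence of tokens. A from-scratch sketch (whitespace tokenization only; real implementations handle tokenization and stemming more carefully):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 on whitespace tokens: harmonic mean of LCS-based
    precision (vs. candidate length) and recall (vs. reference length)."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    p, rec = lcs / len(c), lcs / len(r)
    return 2 * p * rec / (p + rec)
```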
Human Evaluation (The Final Word):
- Blindly present outputs from the base and fine-tuned model to 2-3 senior team members.
- Ask: "Which is correct? Which follows our standards? Which would you approve in a code review?"
- If the fine-tuned model wins >80% of head-to-head comparisons, you've succeeded.
Next Steps: Deployment and Iteration
Your model is now running locally via ollama run my-internal-model. Integrate it:
- Into your IDE: Use Continue.dev or a custom extension to call the Ollama API (http://localhost:11434/api/generate) for in-editor completions.
- Into your apps: Use the Ollama API directly, or via LangChain (Ollama(model="my-internal-model")) for internal tools.
- Into your workflow: Script common tasks with Shell-GPT pointed at your local model.
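Calling the local API needs nothing beyond the standard library. A sketch of a non-streaming request against Ollama's /api/generate endpoint (the model name is the one created above; with stream disabled, Ollama returns one JSON object whose completion lives under the "response" key):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(model: str, prompt: str) -> dict:
    """Request body for a non-streaming /api/generate call."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """POST the prompt to the local Ollama server and return the completion."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# With the server running:
# print(generate("my-internal-model", "How do I call telemetry.log_event?"))
```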
Set up a pipeline. As your internal APIs evolve, add new examples to your dataset. Retraining is cheap. Every quarter, regenerate your benchmark, fine-tune a new adapter, and swap the GGUF file. You’ve built a living system that improves alongside your codebase.
The goal wasn't to beat GPT-4's 67% on HumanEval. It was to get a 92% on your eval. Your RTX 4090 is no longer crying silicon tears; it's running the most knowledgeable intern about your codebase that you've ever had, one that works offline, for free, and never leaks a secret. That’s the point of local LLMs. Not just to run them, but to remake them in your own image.