Fine-Tune DeepSeek V3 on Custom Domain Data: Complete 2026 Guide

Fine-tune DeepSeek V3 on your own domain data using LoRA and Unsloth. Covers dataset prep, training config, and deployment in under 30 minutes.

Problem: DeepSeek V3 Doesn't Know Your Domain

DeepSeek V3 is a 671B-parameter MoE model trained on general internet data. It doesn't know your internal terminology, your product's edge cases, or how your team writes SQL. General prompting and RAG help, but neither matches what supervised fine-tuning can do for domain-specific tasks.

You'll learn:

  • How to prepare a domain-specific instruction dataset in the correct format
  • How to run QLoRA fine-tuning on DeepSeek V3 using Unsloth
  • How to evaluate and export your adapter for deployment

Time: 30 min | Difficulty: Advanced


Why Fine-Tuning DeepSeek V3 Is Different

DeepSeek V3 uses a Mixture-of-Experts (MoE) architecture — 671B total parameters, ~37B active per token. This matters for fine-tuning because:

  • Only the active expert layers process each token, so LoRA adapters on those layers have outsized impact
  • Full fine-tuning is infeasible outside of multi-node A100/H100 clusters
  • QLoRA targets the attention and MLP projection layers — the right place to inject domain knowledge

When fine-tuning beats RAG:

  • Consistent output format (structured JSON, SQL dialects, code style)
  • Domain jargon the base model keeps hallucinating
  • Tasks where retrieval latency is unacceptable

When RAG is still better:

  • Frequently updated knowledge (pricing, inventory, docs)
  • Needing source attribution
  • Data too large to compress into adapter weights
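The trade-offs above can be sketched as a tiny decision helper. This is purely illustrative; the criteria names are this guide's shorthand, not any library's API:

```python
def choose_adaptation(needs_fresh_knowledge: bool,
                      needs_source_attribution: bool,
                      needs_consistent_format: bool,
                      latency_sensitive: bool) -> str:
    """Rough heuristic distilled from the two lists above."""
    # Frequently updated or attributable knowledge is a hard requirement for RAG
    if needs_fresh_knowledge or needs_source_attribution:
        return "rag"
    # Format control and low latency both favor knowledge baked into weights
    if needs_consistent_format or latency_sensitive:
        return "fine-tune"
    return "prompting"

print(choose_adaptation(False, False, True, False))  # fine-tune
```

In practice many teams combine both: fine-tune for format and jargon, RAG for volatile facts.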

Prerequisites

  • Python 3.11+
  • CUDA 12.1+ with at least one A100 40GB (or two RTX A6000 48GB)
  • uv for package management (recommended) or pip
  • A dataset of at least 500 examples in your domain (more on this below)

# Verify GPU
nvidia-smi

# Expected: A100 40GB or equivalent
# Minimum: 40GB VRAM for QLoRA on DeepSeek V3

Solution

Step 1: Install Unsloth and Dependencies

# Create isolated environment
uv venv .venv --python 3.11
source .venv/bin/activate

# Install Unsloth with DeepSeek V3 support (requires Unsloth >= 2025.3)
uv pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
uv pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121
uv pip install xformers trl peft accelerate bitsandbytes datasets

# Verify Unsloth installed correctly
python -c "import unsloth; print(unsloth.__version__)"

Expected: 2025.x.x

If it fails:

  • CUDA version mismatch → Run nvcc --version and match the PyTorch CUDA index URL exactly
  • unsloth not found → The git install may have failed; try pip install unsloth directly as a fallback

Step 2: Prepare Your Domain Dataset

Unsloth expects data in ShareGPT or Alpaca format. ShareGPT is better for multi-turn conversations; Alpaca is simpler for instruction-response pairs.

Alpaca format (use for single-turn tasks like classification, extraction, SQL generation):

# dataset_prep.py
import json

# Each record: instruction + input (optional context) + output
examples = [
    {
        "instruction": "Convert this plain English query to SQL for our orders schema.",
        "input": "Show me all orders over $500 placed in Q1 2026 that haven't shipped.",
        "output": "SELECT * FROM orders WHERE total_amount > 500 AND order_date BETWEEN '2026-01-01' AND '2026-03-31' AND shipped_at IS NULL;"
    },
    # ... add your domain examples here
]

with open("domain_data.json", "w") as f:
    json.dump(examples, f, indent=2)

print(f"Saved {len(examples)} examples")

ShareGPT format (use for multi-turn chat, customer support, code review):

# Each record is a list of turns with role + content
examples = [
    {
        "conversations": [
            {"from": "human", "value": "What does error code E4021 mean in our system?"},
            {"from": "gpt", "value": "E4021 is a payment gateway timeout. It fires when our Stripe webhook doesn't receive a response within 30 seconds. Check the `payment_logs` table for the `gateway_request_id` to trace the specific transaction."}
        ]
    }
]
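If you start in Alpaca format and later need multi-turn data, the two formats map onto each other mechanically. A minimal converter, a sketch rather than anything Unsloth provides:

```python
def alpaca_to_sharegpt(record: dict) -> dict:
    """Fold one Alpaca instruction/input/output record into a single ShareGPT turn pair."""
    prompt = record["instruction"]
    if record.get("input"):
        # Keep the optional context on its own line below the instruction
        prompt += "\n\n" + record["input"]
    return {
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": record["output"]},
        ]
    }

converted = alpaca_to_sharegpt({
    "instruction": "Convert this plain English query to SQL.",
    "input": "Orders over $500.",
    "output": "SELECT * FROM orders WHERE total_amount > 500;",
})
print(converted["conversations"][0]["from"])  # human
```

The reverse direction is lossy for conversations longer than one turn, so prefer collecting ShareGPT data if multi-turn behavior matters to you.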

Dataset size guidelines:

Task type                   | Minimum examples | Sweet spot
Output format (SQL, JSON)   | 300              | 1,000–2,000
Domain terminology          | 500              | 2,000–5,000
Full behavior change        | 1,000            | 5,000+

# Validate your dataset before training
python -c "
import json
data = json.load(open('domain_data.json'))
print(f'Records: {len(data)}')
print(f'Sample keys: {list(data[0].keys())}')
print(f'Sample output length: {len(data[0][\"output\"])} chars')
"
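The one-liner above only checks shape. A slightly stricter pass that also catches empty fields and exact duplicates; the checks are illustrative, so adjust them to your own schema:

```python
def validate_alpaca(records: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the dataset looks clean."""
    problems = []
    seen = set()
    for i, rec in enumerate(records):
        for key in ("instruction", "output"):
            if not rec.get(key, "").strip():
                problems.append(f"record {i}: empty {key}")
        # Exact-duplicate prompts dilute the training signal
        sig = (rec.get("instruction", ""), rec.get("input", ""))
        if sig in seen:
            problems.append(f"record {i}: duplicate instruction+input")
        seen.add(sig)
    return problems

print(validate_alpaca([{"instruction": "x", "input": "", "output": "y"}]))  # []
```

Run this before every training job; a handful of empty outputs can noticeably flatten the loss curve.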

Step 3: Load DeepSeek V3 with 4-bit Quantization

# train.py
from unsloth import FastLanguageModel
import torch

# Load DeepSeek V3 in 4-bit — requires ~38GB VRAM
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="deepseek-ai/DeepSeek-V3",
    max_seq_length=4096,   # Increase to 8192 if your examples are long
    dtype=None,            # Auto-detect: bf16 on Ampere+, fp16 on older
    load_in_4bit=True,     # QLoRA: reduces memory from ~160GB to ~38GB
)

print(f"Model loaded. Parameters: {model.num_parameters():,}")

Expected: Model loaded. Parameters: 671,000,000,000 (prints total, not active)

If it fails:

  • OutOfMemoryError → Your GPU has < 40GB VRAM. Keep load_in_4bit=True and reduce max_seq_length to 2048; this can get you to ~28GB
  • Repository not found → You need to accept the DeepSeek V3 license at huggingface.co/deepseek-ai/DeepSeek-V3 and run huggingface-cli login

Step 4: Attach LoRA Adapters

# Attach LoRA to attention and MLP layers
# r=16 is a good default; increase to 32 for harder domain shifts
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                          # LoRA rank — higher = more capacity, more VRAM
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention projections
        "gate_proj", "up_proj", "down_proj",       # MLP projections
    ],
    lora_alpha=16,                 # Scale factor; keep equal to r as a starting point
    lora_dropout=0.05,             # Small dropout prevents overfitting on small datasets
    bias="none",
    use_gradient_checkpointing="unsloth",  # Unsloth's optimized checkpointing saves ~30% VRAM
    random_state=42,
    use_rslora=False,              # Rank-stabilized LoRA; consider True at higher ranks (r >= 32)
)

# Confirm trainable parameter count
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")

Expected: roughly 0.05%–0.1% of parameters are trainable — this is correct for LoRA.
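That fraction follows from LoRA's parameter count: each adapted weight matrix W of shape (d_out, d_in) gains two low-rank factors, A of shape (r, d_in) and B of shape (d_out, r), adding r * (d_in + d_out) trainable parameters per module. A back-of-envelope check with a hypothetical hidden width (the real DeepSeek V3 layer shapes differ):

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    # A contributes r * d_in, B contributes d_out * r
    return r * (d_in + d_out)

# Hypothetical square projection of width 7168 at r=16
per_module = lora_params(7168, 7168, 16)
print(per_module)  # 229376 trainable params for one adapted matrix
```

Summing this over every targeted projection in every layer, then dividing by 671B, lands in the sub-0.1% range the printout reports.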


Step 5: Format Dataset and Configure Trainer

from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

# Load and format dataset
dataset = load_dataset("json", data_files="domain_data.json", split="train")

# Alpaca prompt template — consistent formatting improves training signal
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token

def format_prompts(examples):
    instructions = examples["instruction"]
    inputs = examples.get("input", [""] * len(instructions))
    outputs = examples["output"]
    texts = []
    for instruction, inp, output in zip(instructions, inputs, outputs):
        # EOS token tells the model where the response ends — critical to include
        text = alpaca_prompt.format(instruction, inp, output) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}

dataset = dataset.map(format_prompts, batched=True)

# Training config
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=4096,
    dataset_num_proc=4,
    packing=False,   # Set True for short examples (< 512 tokens) to speed up training
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,   # Effective batch size = 2 * 4 = 8
        warmup_steps=10,
        num_train_epochs=3,              # 3 epochs is usually enough; watch for overfitting
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,
        optim="adamw_8bit",              # 8-bit Adam cuts optimizer memory by ~75%
        weight_decay=0.01,
        lr_scheduler_type="cosine",
        seed=42,
        output_dir="./outputs",
        save_strategy="epoch",
    ),
)
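With per_device_train_batch_size=2 and gradient_accumulation_steps=4, one optimizer update consumes 8 examples, so you can predict the total step count before launching, which is useful for sanity-checking warmup_steps. A sketch of the arithmetic (an approximation for a single GPU without packing):

```python
import math

def optimizer_steps(n_examples: int, batch_size: int,
                    grad_accum: int, epochs: int) -> int:
    """Approximate total optimizer updates for a plain (non-packed) dataset."""
    steps_per_epoch = math.ceil(n_examples / (batch_size * grad_accum))
    return steps_per_epoch * epochs

print(optimizer_steps(2000, 2, 4, 3))  # 750
```

With 750 total steps, warmup_steps=10 is a very short warmup (~1.3%); some practitioners raise it to 3–5% of total steps for noisier datasets.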

Step 6: Train and Monitor

# Start training
trainer_stats = trainer.train()

# Print training summary
print(f"Training time: {trainer_stats.metrics['train_runtime']:.0f}s")
print(f"Final loss: {trainer_stats.metrics['train_loss']:.4f}")

Expected training time on A100 40GB:

  • 500 examples, 3 epochs: ~12 minutes
  • 2,000 examples, 3 epochs: ~45 minutes
  • 5,000 examples, 3 epochs: ~2 hours

Watch for:

  • Loss should decrease steadily from ~2.0 to ~0.5–1.0 over 3 epochs
  • Loss below 0.2 → likely overfitting; reduce epochs or add more data
  • Loss stuck above 1.5 → learning rate too low or dataset formatting issue
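Those thresholds can be turned into an automatic check on the logged loss values. A tiny helper; the cutoffs are this guide's rules of thumb, not universal constants:

```python
def diagnose_loss(losses: list[float]) -> str:
    """Classify a training-loss trajectory against the rules of thumb above."""
    final = losses[-1]
    if final < 0.2:
        return "possible overfitting: reduce epochs or add data"
    if final > 1.5:
        return "underfitting: check learning rate and prompt formatting"
    return "healthy"

print(diagnose_loss([2.1, 1.4, 0.9, 0.7]))  # healthy
```

Feed it the `loss` values from trainer.state.log_history after (or during) a run.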

Monitor GPU utilization in a separate terminal:

# Watch GPU memory and utilization every 2 seconds
watch -n 2 nvidia-smi

Step 7: Save and Export the Adapter

# Save LoRA adapter only (~200MB, not the full 38GB model)
model.save_pretrained("deepseek-v3-domain-adapter")
tokenizer.save_pretrained("deepseek-v3-domain-adapter")

print("Adapter saved to ./deepseek-v3-domain-adapter")

# Optional: merge adapter into model weights for faster inference
# This requires full VRAM — only do this if deploying to a dedicated server
model.save_pretrained_merged(
    "deepseek-v3-domain-merged",
    tokenizer,
    save_method="merged_16bit",   # "merged_4bit" for smaller file
)

Adapter directory structure:

deepseek-v3-domain-adapter/
├── adapter_config.json    # LoRA hyperparameters
├── adapter_model.safetensors  # Trained weights (~200MB)
└── tokenizer files

Verification

Run a quick inference check before deploying:

from unsloth import FastLanguageModel

# Reload the base model and apply adapter
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="deepseek-ai/DeepSeek-V3",
    max_seq_length=4096,
    dtype=None,
    load_in_4bit=True,
)
model.load_adapter("deepseek-v3-domain-adapter")
FastLanguageModel.for_inference(model)  # Unsloth 2x faster inference mode

# Test with a domain example NOT in your training set (alpaca_prompt is the template from Step 5)
inputs = tokenizer(
    [alpaca_prompt.format(
        "Convert this to SQL for our orders schema.",
        "All customers who made more than 3 purchases last month",
        "",  # Leave output blank — the model fills this in
    )],
    return_tensors="pt"
).to("cuda")

outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.1)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

You should see: Your domain-specific SQL syntax, field names, and schema conventions — not generic SQL the base model would generate.

Regression check: Also run 3–5 general prompts (summarization, math) to confirm the base model's general capabilities haven't degraded. LoRA adapters rarely cause catastrophic forgetting, but it's worth confirming.
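A regression check can be a small loop over held-out general prompts, each paired with a pass/fail predicate. Here it is sketched with an injected generate function so the harness stays model-agnostic; the stub, prompts, and predicates are placeholders for your own:

```python
def run_regression(generate, cases):
    """cases: list of (prompt, predicate) pairs; returns the prompts that failed."""
    failures = []
    for prompt, ok in cases:
        reply = generate(prompt)
        if not ok(reply):
            failures.append(prompt)
    return failures

# Stub generate for illustration; swap in your tokenizer + model.generate pipeline
stub = lambda p: "4" if "2+2" in p else "a summary"
cases = [
    ("What is 2+2?", lambda r: "4" in r),
    ("Summarize: the cat sat.", lambda r: len(r) > 0),
]
print(run_regression(stub, cases))  # an empty list means no regressions
```

Keep the same cases file across training runs so you can compare adapters against each other, not just against the base model.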


Production Deployment Options

Option A — Ollama with Modelfile (easiest for local/team deployment):

# Create a Modelfile pointing to your merged weights
cat > Modelfile << 'EOF'
FROM ./deepseek-v3-domain-merged
SYSTEM "You are a SQL assistant for the Acme Corp orders database. Always use our schema conventions."
PARAMETER temperature 0.1
PARAMETER num_ctx 4096
EOF

ollama create deepseek-acme -f Modelfile
ollama run deepseek-acme

Option B — vLLM with adapter (for API serving at scale):

# Serve base model and load adapter at runtime — no merge needed
python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-V3 \
  --enable-lora \
  --lora-modules domain-adapter=./deepseek-v3-domain-adapter \
  --quantization awq \
  --tensor-parallel-size 4
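vLLM exposes an OpenAI-compatible API, and clients select the adapter by passing its registered name (domain-adapter above) as the model field. A request-body builder using only standard chat-completions fields; the endpoint path and port in the comment assume vLLM's defaults:

```python
import json

def chat_payload(prompt: str, lora_name: str = "domain-adapter") -> str:
    """Build an OpenAI-style chat completion request body targeting the LoRA module."""
    body = {
        "model": lora_name,   # vLLM routes this to the matching --lora-modules entry
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.1,
        "max_tokens": 256,
    }
    return json.dumps(body)

# POST this body to http://localhost:8000/v1/chat/completions
print(chat_payload("Show all unshipped orders over $500 as SQL."))
```

Because the API shape is OpenAI-compatible, existing OpenAI client libraries work unchanged once pointed at the vLLM base URL.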

What You Learned

  • QLoRA on DeepSeek V3's MoE architecture trains only ~0.1% of parameters — efficient and targeted
  • Dataset quality matters more than size: 500 well-formatted examples beat 5,000 noisy ones
  • lora_alpha = r is a reliable starting point; tune only if loss behavior is unusual
  • Save the adapter separately (200MB) rather than the merged model (38GB+) unless you need maximum inference speed
  • Always run a regression check — fine-tuned models can drift on tasks outside the training distribution

When to increase r: If your domain requires significant behavior change (not just vocabulary), try r=32 or r=64. Adapter memory scales roughly linearly with r (the base model stays the same size).

When NOT to fine-tune: If your domain data changes weekly, RAG will serve you better. Fine-tuning compresses knowledge into weights that can't be updated without retraining.

Tested on Unsloth 2025.3, DeepSeek V3 (deepseek-ai/DeepSeek-V3), CUDA 12.1, A100 40GB, Ubuntu 24.04