Continued Pre-Training vs Fine-Tuning: TL;DR
Continued pre-training vs fine-tuning is the decision that determines whether your custom LLM actually knows your domain — or just performs better on a narrow task.
| | Continued Pre-Training (CPT) | Fine-Tuning (SFT / LoRA) |
|---|---|---|
| Goal | Teach the model new knowledge | Teach the model new behavior |
| Data format | Raw text corpus | Instruction pairs (prompt/response) |
| Typical data size | 1 GB – 1 TB+ | 1,000 – 100,000 examples |
| Training cost (A100) | $200 – $5,000+ | $10 – $200 |
| Catastrophic forgetting risk | Low | Medium–High |
| Inference cost | Unchanged | Unchanged |
| Best for | Medical, legal, code, niche domains | Chat style, task format, tone, safety |
| Hugging Face support | ✅ Trainer + custom loop | ✅ Trainer, TRL, PEFT |
| USD entry cost | ~$50 on Lambda Labs | ~$5 on RunPod |
Choose continued pre-training if: your base model has no grounding in your domain and retrieval alone won't bridge the gap.
Choose fine-tuning if: the model already understands your domain and you need it to respond in a specific format, style, or persona.
What We're Comparing
LLM customization has three main levers: RAG, fine-tuning, and continued pre-training. Most teams reach for fine-tuning first. That's often wrong.
Here's why the decision matters. A base model like Llama 3.1 8B was trained on general web text. It knows a little about everything. If you're building a contract analysis tool, it has surface-level legal knowledge — but it has never deeply processed thousands of legal briefs, jurisdiction-specific clauses, or internal company templates.
Fine-tuning that model on 2,000 (instruction, answer) pairs teaches it how to respond. It does not fill the knowledge gap.
Continued pre-training on 50 GB of legal text teaches it what to know. Then fine-tuning shapes the output format.
The two methods solve different problems. Understanding the boundary is the whole game.
CPT injects domain knowledge into weights; SFT reshapes output behavior. Most production pipelines chain both.
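The split shows up immediately in the training data itself. A minimal sketch of the two formats (the strings and record shape are illustrative, not from a real pipeline):

```python
import json

# CPT sample: raw domain text, no structure beyond plain prose.
# The model trains on next-token prediction over text like this.
cpt_sample = "The Indemnitor shall hold harmless the Indemnitee from all third-party claims."

# SFT sample: one (prompt, response) pair, typically stored as one JSONL record per line.
sft_sample = {
    "messages": [
        {"role": "user", "content": "Explain the indemnification clause in plain English."},
        {"role": "assistant", "content": "One party agrees to cover the other's losses if a third party sues."},
    ]
}

jsonl_line = json.dumps(sft_sample)  # one line of an sft_data.jsonl-style file
```

If your data only exists in the first form, you have a CPT (or RAG) problem; if it naturally comes in the second form, you have an SFT problem.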
Continued Pre-Training — Overview
Continued pre-training (CPT) extends the base model's pre-training step on a new corpus. The model continues next-token prediction on your domain text, updating weights across all layers.
What it's good at:
- Injecting terminology the base model has never seen (e.g., proprietary drug names, internal codebases, legacy regulatory language)
- Improving perplexity on domain text — meaning the model becomes fluent in the domain, not just aware of it
- Providing a better starting point before fine-tuning
Pros:
- Durable knowledge — it's baked into weights, not retrieved at runtime
- Scales with data — more domain text usually means better domain fluency
- Combines cleanly with downstream fine-tuning
Cons:
- Data prep is expensive — you need clean, deduplicated, domain-coherent text at scale
- Compute cost is high relative to SFT — you're touching every layer on every token
- Requires careful learning rate scheduling to avoid degrading general capabilities (use ~10% of original pre-training LR)
- Not suitable when you have fewer than ~100 MB of domain text — noise overwhelms signal
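Deduplication is the part of data prep with the highest payoff. A minimal sketch using exact-match hashing (the helper name and normalization are illustrative; production pipelines usually layer fuzzy MinHash/SimHash dedup on top):

```python
import hashlib

def dedup_corpus(lines):
    """Drop exact duplicate lines, ignoring case and whitespace differences."""
    seen, kept = set(), []
    for line in lines:
        norm = " ".join(line.lower().split())
        if not norm:
            continue  # skip blank lines entirely
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(line)  # keep the first occurrence verbatim
    return kept
```

Exact dedup alone typically removes a surprising share of a scraped corpus; near-duplicate detection catches the paraphrased rest.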
Fine-Tuning (SFT + LoRA) — Overview
Supervised fine-tuning (SFT) trains the model on labeled (prompt, response) pairs. LoRA and QLoRA reduce VRAM requirements by training low-rank adapter matrices instead of full weights.
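The low-rank idea is easy to see with plain array math. A sketch in NumPy (the dimensions match a Llama-class attention projection; the numbers are illustrative):

```python
import numpy as np

d, r, alpha = 4096, 16, 32

# Frozen base weight (stand-in for one attention projection matrix)
W = np.zeros((d, d))

# LoRA trains two small matrices instead of the full d x d weight
A = np.random.randn(r, d) * 0.01  # initialized with small noise
B = np.zeros((d, r))              # initialized to zero, so the adapter starts as a no-op

delta = (alpha / r) * (B @ A)     # the update that gets merged into W after training

full_params = d * d               # ~16.8M parameters per projection
lora_params = A.size + B.size     # ~131K parameters, under 1% of full
```

This is why LoRA fits on consumer GPUs: only `A` and `B` receive gradients, and the merged model is byte-for-byte the same architecture as the base.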
What it's good at:
- Changing output behavior: instruction-following, chat format, JSON output, refusals
- Adapting tone, persona, and style
- Task-specific performance: classification, extraction, summarization in a defined format
Pros:
- Low data requirement — 1,000 high-quality examples often beats 100,000 mediocre ones
- Fast iteration — QLoRA 7B on 4-bit fits in 12 GB VRAM; full run under 2 hours on an RTX 4090
- Merges cleanly into base model weights or serves as a hot-swappable adapter
- TRL's `SFTTrainer` handles packing, masking, and dataset formatting out of the box
Cons:
- Cannot inject factual knowledge that isn't in the base model
- Catastrophic forgetting is real — aggressive fine-tuning on narrow data degrades general capability
- Hallucinations on domain facts don't decrease — the model just hallucinates more confidently
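A common mitigation for forgetting is rehearsal: blend a small fraction of general-purpose examples back into the fine-tuning mix. A minimal sketch (the function and the 5% default are illustrative conventions, not a TRL feature):

```python
import random

def mix_with_replay(domain_examples, general_examples, replay_frac=0.05, seed=0):
    """Add ~replay_frac of general data to the final mix to curb forgetting."""
    rng = random.Random(seed)
    # Solve n / (len(domain) + n) = replay_frac for the number of general examples
    n_general = int(len(domain_examples) * replay_frac / (1 - replay_frac))
    mixed = list(domain_examples) + rng.sample(
        general_examples, min(n_general, len(general_examples))
    )
    rng.shuffle(mixed)
    return mixed
```

Even a few percent of replay data measurably softens the capability drop on general benchmarks.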
Head-to-Head: When Each Method Wins
Knowledge gap is the deciding question
Before choosing, run this test:
```python
from transformers import pipeline

pipe = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")
result = pipe("Explain the indemnification clause in a SaaS MSA under California law.")
print(result[0]["generated_text"])
```
Read the output carefully. If the model:
- Confuses jurisdiction-specific rules → knowledge gap → CPT first
- Gets facts right but answers in the wrong format → behavior gap → fine-tuning only
- Hallucinates company-specific terms entirely → knowledge gap → CPT first
Data volume thresholds
| You have | Best approach |
|---|---|
| < 10 MB domain text | RAG only — not enough for CPT signal |
| 10 MB – 500 MB | RAG + fine-tuning |
| 500 MB – 10 GB | CPT on domain corpus → SFT on task format |
| > 10 GB | CPT required; SFT to shape output |
| < 500 labeled pairs | RAG or few-shot prompting — SFT will overfit |
| 500 – 5,000 pairs | LoRA / QLoRA fine-tuning |
| > 5,000 pairs | Full fine-tuning or QLoRA with longer training |
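The thresholds above can be encoded as a simple decision gate (the cutoffs mirror the tables; treat them as rules of thumb, not hard limits):

```python
def choose_approach(corpus_mb: float, labeled_pairs: int) -> str:
    """Route to a customization strategy from data volume alone."""
    # Knowledge axis: how much raw domain text you have
    if corpus_mb < 10:
        knowledge = "RAG only"
    elif corpus_mb < 500:
        knowledge = "RAG"
    else:
        knowledge = "CPT"

    # Behavior axis: how many labeled (prompt, response) pairs you have
    if labeled_pairs < 500:
        behavior = "few-shot prompting"
    elif labeled_pairs <= 5000:
        behavior = "LoRA/QLoRA SFT"
    else:
        behavior = "full SFT or longer QLoRA"

    return f"{knowledge} + {behavior}"
```

Note the two axes are independent: a huge corpus with few labeled pairs still means CPT plus few-shot prompting, not SFT.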
Budget thresholds (USD, 2026 Lambda Labs pricing)
| Method | Model size | Est. cost |
|---|---|---|
| QLoRA SFT | 7B, 2,000 examples | ~$5–15 |
| QLoRA SFT | 70B, 5,000 examples | ~$80–150 |
| CPT | 7B, 10 GB corpus | ~$150–300 |
| CPT | 70B, 50 GB corpus | ~$1,500–4,000 |
| CPT + SFT | 7B, full pipeline | ~$200–400 |
Setting Up Both: Minimal Code Examples
Continued Pre-Training with Hugging Face Trainer
```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

model_id = "meta-llama/Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

# Raw domain text — no instruction format
dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})

def tokenize(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=2048,
        return_special_tokens_mask=True,
    )

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Causal LM collator: copies input_ids into labels so Trainer computes a loss
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="./cpt-output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    num_train_epochs=1,    # 1–3 epochs for CPT; more causes forgetting
    learning_rate=1e-5,    # ~10% of original pre-training LR — critical for stability
    bf16=True,
    save_strategy="epoch",
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```
Key detail: `learning_rate=1e-5` — using a standard SFT LR of `2e-4` here will destabilize general capabilities within a few hundred steps.
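The 10% rule is easy to wire into a schedule. A sketch of cosine decay with linear warmup, peaking at one-tenth of an assumed base pre-training LR (pure Python for clarity; in practice `TrainingArguments`' built-in `lr_scheduler_type="cosine"` and `warmup_ratio` do this for you):

```python
import math

def cpt_lr(step, total_steps, base_pretrain_lr=1e-4, cpt_fraction=0.1, warmup_ratio=0.03):
    """Cosine LR with linear warmup, peaking at ~10% of the base pre-training LR."""
    peak = base_pretrain_lr * cpt_fraction  # 1e-5 with these defaults
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak * step / warmup_steps   # linear warmup from 0 to peak
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak * (1 + math.cos(math.pi * progress))  # cosine decay to 0
```

The `base_pretrain_lr=1e-4` default is an assumption standing in for whatever LR the base model was actually pre-trained with; substitute the published value for your model family.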
LoRA Fine-Tuning with TRL SFTTrainer
```python
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from datasets import load_dataset

model_id = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# load_in_4bit is passed via BitsAndBytesConfig; the bare kwarg is deprecated
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)

lora_config = LoraConfig(
    r=16,    # Rank 16 is the safe default for task adaptation
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Instruction pairs — must be in chat template format
dataset = load_dataset("json", data_files="sft_data.jsonl")

sft_config = SFTConfig(
    output_dir="./sft-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-4,
    bf16=True,
)

trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=dataset["train"],
    tokenizer=tokenizer,
)
trainer.train()
```
The Production Pattern: CPT → SFT
For domain-heavy production use cases (medical, legal, finance, internal enterprise data), the standard pipeline is:
- CPT on raw domain corpus — teach the model your domain's language and facts
- SFT / LoRA on task-specific instruction pairs — teach the model how to respond
This is how most specialized LLMs are actually built. Meta's Code Llama is the textbook case: Llama 2 continued pre-trained on code, then instruction-tuned. Domain models like BloombergGPT and Google's Med-PaLM 2 likewise combine domain-heavy pre-training with downstream tuning, even where the exact recipes differ.
You don't need Bloomberg's budget. A 7B model CPT'd on 5 GB of internal documents and fine-tuned on 3,000 labeled examples can outperform a 70B model with RAG on domain-specific tasks — run your own evaluation to confirm it does for yours.
What You Learned
- CPT fills knowledge gaps — it teaches the model what to know, not how to behave
- Fine-tuning shapes behavior — format, style, tone, task structure
- Use data volume as the first decision gate — under 500 MB of domain text, CPT signal-to-noise ratio is poor
- Learning rate is the most critical CPT hyperparameter — stay at ~10% of the original pre-training LR
- The production pattern is CPT → SFT, not one or the other
- Neither method eliminates hallucination — combine with RAG for retrieval-grounded outputs on facts that change over time
Tested on Llama 3.1 8B + Transformers 4.47, TRL 0.12, PEFT 0.13, Python 3.12, CUDA 12.4, Ubuntu 22.04
FAQ
Q: Can I skip CPT and just use RAG instead? A: Yes, if your domain facts are retrievable at inference time and latency allows it. RAG costs ~$0 to set up vs hundreds for CPT. Use CPT when facts are too dense to retrieve reliably, or when the model's base vocabulary doesn't cover your domain terms.
Q: How much domain text do I need for CPT to be worth it? A: A practical minimum is around 500 MB of clean, deduplicated text. Below that, signal-to-noise degrades and you risk paying compute costs for negligible domain lift. 1–10 GB is the sweet spot for 7B–13B models.
Q: Does LoRA fine-tuning update the model's factual knowledge? A: Not reliably. With the common `q_proj`/`v_proj` configuration, LoRA adapters only modify attention projections and leave untouched the feed-forward layers where much factual recall appears to live. For knowledge injection you need full fine-tuning or CPT.
Q: What's the risk of catastrophic forgetting with CPT? A: Low if you keep the learning rate at 10% of the original pre-training LR and limit to 1–3 epochs. Run MMLU or HellaSwag benchmarks before and after to verify general capability hasn't degraded.
Q: Which is cheaper for a US startup on a tight budget? A: QLoRA fine-tuning at ~$5–15 per run on Lambda Labs or RunPod. CPT on even a small corpus will cost 10–20x more. If you're under $100 budget, do RAG + QLoRA SFT and revisit CPT when you've validated the use case.