# Problem: Fine-Tuning Costs Too Much Memory for the Results You Get
Spectrum fine-tuning is a selective layer training method that freezes low signal-to-noise ratio (SNR) layers and trains only the ones that carry meaningful weight signal — cutting VRAM use by up to 50% without the accuracy tradeoff that comes with LoRA rank compression.
Full fine-tuning of a 7B model on a single A100 80GB is barely feasible. LoRA helps, but rank constraints mean you're approximating the update. Spectrum gives you a third path: train the real weights, just not all of them.
You'll learn:
- How SNR-based layer selection works and why it beats random freezing
- How to scan a model, generate a Spectrum config, and launch training
- How to tune the `spectrum_ratio` parameter to hit your VRAM budget
Time: 25 min | Difficulty: Intermediate
## Why This Happens: SNR and the Uneven Weight Landscape
Most layers in a pre-trained transformer don't need updating for a given task. They've converged to general representations — syntax, factual recall, basic reasoning — that transfer cleanly. Only a subset carries task-specific signal worth overwriting.
Spectrum quantifies this by computing the signal-to-noise ratio of each layer's weight matrix: signal = mean singular value, noise = standard deviation of the singular values. Layers with low SNR are already "settled." Updating them adds noise and wastes compute.
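To make the formula concrete, here is a toy worked example on hand-picked singular value vectors. This is a sketch of the metric only, not Spectrum's implementation; it uses the population standard deviation, and the actual library may normalize differently.

```python
import statistics

def snr(singular_values: list[float]) -> float:
    """SNR = mean(S) / std(S) over a layer's singular value vector S."""
    return statistics.mean(singular_values) / statistics.pstdev(singular_values)

# Tightly clustered spectrum -> high SNR -> ranked high, kept trainable
trained = [3.0, 2.8, 3.1, 2.9]
# Spread-out spectrum -> low SNR -> frozen as "settled"
settled = [9.0, 3.0, 0.5, 0.1]

print(f"high-SNR layer: {snr(trained):.2f}")
print(f"low-SNR layer:  {snr(settled):.2f}")
```

Only the relative ranking matters: Spectrum compares these scores across all layers of one checkpoint, so the absolute values never need calibrating.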
Symptoms that full fine-tuning is overkill:
- VRAM OOM on a 24GB card when training a 7B model
- LoRA results plateau below full fine-tune quality
- Training loss drops fast, then spikes: updating already-converged layers lets them drift and overfit
Spectrum scans each transformer block's SNR, ranks layers, then freezes the bottom N% before training begins.
## How Spectrum Selects Layers
The scan runs on the pre-trained checkpoint before any training. For each named parameter group (attention projections, MLP up/down/gate projections, layer norms):
- Compute the weight matrix's singular value decomposition (SVD), or approximate it with a cheaper Frobenius-norm proxy for large matrices.
- Calculate `SNR = mean(S) / std(S)`, where `S` is the singular value vector.
- Rank all layers by SNR descending.
- Freeze the bottom `(1 - spectrum_ratio) * 100%` of layers.
A `spectrum_ratio` of 0.5 trains the top 50% by SNR — typically the later MLP blocks and the final attention layers, which are most task-sensitive.
```python
# SNR computation — why Frobenius proxy is used for layers > 4096 dim
import torch

def compute_layer_snr(weight: torch.Tensor) -> float:
    # Full SVD is O(n³) — too slow for 4096×14336 MLP weights
    # Frobenius norm of rows approximates singular value spread cheaply
    row_norms = weight.float().norm(dim=1)
    return (row_norms.mean() / (row_norms.std() + 1e-8)).item()
```
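The rank-and-freeze steps can be sketched independently of any model. The function below is an illustration of the selection logic, not Spectrum's actual code, and the layer names are hypothetical: given per-layer SNR scores, it returns the set of layers to freeze for a given `spectrum_ratio`.

```python
def layers_to_freeze(snr_by_layer: dict[str, float], spectrum_ratio: float) -> set[str]:
    # Rank layers by SNR descending; the top spectrum_ratio fraction stays trainable
    ranked = sorted(snr_by_layer, key=snr_by_layer.get, reverse=True)
    n_train = round(len(ranked) * spectrum_ratio)
    # Everything below the cutoff — the low-SNR "settled" layers — gets frozen
    return set(ranked[n_train:])

scores = {
    "layers.0.mlp.up_proj": 0.9,      # low SNR -> settled -> frozen
    "layers.15.self_attn.q_proj": 2.1,
    "layers.30.mlp.gate_proj": 4.7,
    "layers.31.mlp.down_proj": 5.2,   # high SNR -> task-sensitive -> trained
}
print(layers_to_freeze(scores, spectrum_ratio=0.5))
```

With `spectrum_ratio=0.5` and four layers, the two lowest-SNR entries end up frozen; at `spectrum_ratio=1.0` nothing is frozen, which recovers full fine-tuning.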
## Solution

### Step 1: Install Dependencies
Spectrum ships as a standalone library. Use uv for fast, reproducible installs.
```bash
# uv resolves torch+cuda in one pass — avoids pip's conflicting extras problem
uv pip install spectrum-ft transformers accelerate datasets bitsandbytes
```
Verify CUDA is visible to PyTorch before continuing:
```bash
python -c "import torch; print(torch.cuda.get_device_name(0))"
```
Expected output: `NVIDIA RTX 4090` (or your device name)
If it fails with `AssertionError: Torch not compiled with CUDA`, reinstall with `uv pip install torch --index-url https://download.pytorch.org/whl/cu121`.
### Step 2: Scan the Model and Generate a Spectrum Config
The scan writes a YAML config listing which layers to freeze. Run it once per base model — the config is reusable across fine-tune runs on the same checkpoint.
```python
from spectrum import SpectrumScanner

scanner = SpectrumScanner(
    model_name="meta-llama/Llama-3.2-3B",  # base checkpoint, not instruct
    spectrum_ratio=0.5,                    # train top 50% SNR layers
    dtype="bfloat16",                      # match training dtype
)

# Writes spectrum_config.yaml to current directory
scanner.scan_and_save("spectrum_config.yaml")
print(scanner.summary())
```
Expected output:

```
Scanned 288 parameter groups
Training 144 / 288 (50.0%)
Estimated trainable params: 1.61B / 3.21B
Estimated VRAM saving vs full FT: ~47%
```
Open `spectrum_config.yaml` to inspect which layers were selected. MLP gate/up projections in layers 20–31 are almost always in the top 50% for instruction-tuned tasks.
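The config is essentially a list of parameter groups to leave trainable. The sketch below shows the general shape only; the field names here are assumptions, so trust the file the scanner actually wrote rather than hand-writing one:

```yaml
# Illustrative structure — field names are assumptions, not the exact schema
model_name: meta-llama/Llama-3.2-3B
spectrum_ratio: 0.5
unfrozen_parameters:
  - model.layers.31.mlp.gate_proj.weight
  - model.layers.31.mlp.up_proj.weight
  - model.layers.30.self_attn.q_proj.weight
  # ...remaining top-50%-SNR parameter groups
```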
### Step 3: Load the Model with Frozen Layers Applied
```python
from spectrum import load_model_with_spectrum

model, tokenizer = load_model_with_spectrum(
    model_name="meta-llama/Llama-3.2-3B",
    spectrum_config="spectrum_config.yaml",
    torch_dtype="bfloat16",
    device_map="auto",
)

# Confirm frozen vs trainable params
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable/1e9:.2f}B / {total/1e9:.2f}B")
```
Expected output: `Trainable: 1.61B / 3.21B`
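Since applying the config amounts to toggling `requires_grad`, the loader's freezing step can be approximated by hand. The helper below is a hedged sketch: the glob-pattern matching rule is an assumption about the config format, and the stand-in parameter class exists only so the example runs without a real model (with one, you would pass `model.named_parameters()` instead).

```python
from fnmatch import fnmatch

class FakeParam:
    """Stand-in for torch.nn.Parameter — only requires_grad matters here."""
    def __init__(self):
        self.requires_grad = True

def apply_freeze(named_params, unfrozen_patterns):
    # Freeze everything, then re-enable grads for params matching the config
    n_trainable = 0
    for name, param in named_params:
        param.requires_grad = any(fnmatch(name, pat) for pat in unfrozen_patterns)
        n_trainable += param.requires_grad
    return n_trainable

params = {name: FakeParam() for name in [
    "model.layers.0.mlp.up_proj.weight",
    "model.layers.31.mlp.gate_proj.weight",
    "model.layers.31.self_attn.q_proj.weight",
]}
# Keep only layer-31 parameters trainable, as a stand-in for the high-SNR set
kept = apply_freeze(params.items(), ["model.layers.31.*"])
print(kept)  # 2
```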
### Step 4: Configure and Launch Training
Use `transformers.Trainer` — Spectrum is compatible with any training loop that respects `requires_grad`.
```python
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling
from datasets import load_dataset

dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:5000]")

# Trainer expects token ids, not raw chat messages — flatten and tokenize first.
# Llama tokenizers ship without a pad token, which the collator needs for batching.
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token

def tokenize(example):
    text = "\n".join(turn["content"] for turn in example["messages"])
    return tokenizer(text, truncation=True, max_length=2048)

dataset = dataset.map(tokenize, remove_columns=dataset.column_names)

args = TrainingArguments(
    output_dir="./spectrum-llama3-output",
    num_train_epochs=2,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch = 16
    learning_rate=2e-5,             # lower than LoRA defaults — real weights update
    bf16=True,
    logging_steps=10,
    save_strategy="epoch",
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```
If you OOM at `spectrum_ratio=0.5`:

- Drop to `spectrum_ratio=0.35` and rescan, which trains only the top 35% of layers
- Add `gradient_checkpointing=True` to `TrainingArguments`
### Step 5: Save and Merge
Because Spectrum trains real weights (not adapters), saving is a standard checkpoint — no merge step needed.
```python
model.save_pretrained("./spectrum-llama3-final")
tokenizer.save_pretrained("./spectrum-llama3-final")
```
Load for inference identically to any HuggingFace checkpoint:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "./spectrum-llama3-final",
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("./spectrum-llama3-final")
```
## Tuning `spectrum_ratio` to Hit Your VRAM Budget
| `spectrum_ratio` | Trainable params (3B) | VRAM (bfloat16, batch 4) | Quality vs full FT |
|---|---|---|---|
| 1.0 (full FT) | 3.21B | ~28 GB | Baseline |
| 0.75 | 2.41B | ~22 GB | ≈ full FT |
| 0.50 | 1.61B | ~16 GB | −0.3–0.8 MT-Bench pts |
| 0.35 | 1.12B | ~12 GB | −1–2 MT-Bench pts |
| 0.25 | 0.80B | ~10 GB | Comparable to LoRA r=64 |
Start at 0.5 on 24GB cards. Drop to 0.35 for 16GB. Below 0.25, LoRA is usually a better tradeoff.
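As a back-of-envelope for picking a starting ratio: trainable-parameter count scales roughly linearly with `spectrum_ratio`, and each trainable parameter costs extra memory for its gradient and optimizer state on top of the weight itself. The sketch below assumes bf16 weights and gradients with fp32 Adam moments; it deliberately ignores activations, batch size, and CUDA overhead, so treat its numbers as a relative ordering, not as the table's measured figures.

```python
def rough_vram_gb(total_params: float, spectrum_ratio: float) -> float:
    """Weights in bf16 (2 B/param) for every param; gradients (bf16, 2 B/param)
    plus fp32 Adam moments (8 B/param) only for the trainable fraction."""
    trainable = total_params * spectrum_ratio
    weights = total_params * 2
    grads_and_optim = trainable * (2 + 8)
    return (weights + grads_and_optim) / 1e9

for ratio in (1.0, 0.5, 0.35, 0.25):
    print(f"ratio {ratio:.2f}: ~{rough_vram_gb(3.21e9, ratio):.1f} GB (ex-activations)")
```

The fixed weight term is why savings flatten out at low ratios: below roughly 0.25, the frozen weights dominate the budget and further freezing buys little.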
## Spectrum vs LoRA vs QLoRA: Which to Use
| | Spectrum | LoRA | QLoRA |
|---|---|---|---|
| Trains real weights | ✅ | ❌ (adapters) | ❌ (adapters) |
| Merge step required | ❌ | ✅ | ✅ |
| Min VRAM (7B model) | ~22 GB | ~10 GB | ~6 GB |
| Quality ceiling | Full FT quality | Rank-limited | Rank-limited + quant noise |
| Layer selection logic | SNR-based (principled) | Manual (all layers, fixed rank) | Manual |
| Best for | 24GB+ cards, quality-first | 12–24GB cards | < 12 GB, consumer GPU |
Choose Spectrum if you have a 24GB+ GPU and want full fine-tune quality without full fine-tune memory cost. Choose LoRA/QLoRA if you're on a 16GB or consumer card where Spectrum's minimum overhead is still too high.
## Verification
```bash
python -c "
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained('./spectrum-llama3-final', torch_dtype=torch.bfloat16, device_map='auto')
tok = AutoTokenizer.from_pretrained('./spectrum-llama3-final')
inp = tok('The capital of France is', return_tensors='pt').to('cuda')
out = model.generate(**inp, max_new_tokens=10)
print(tok.decode(out[0]))
"
```
You should see something like: `The capital of France is Paris, which is located`
## What You Learned
- SNR-based layer selection is principled — it identifies which weights have converged vs which are still task-sensitive
- `spectrum_ratio=0.5` gives roughly half the VRAM cost of full fine-tuning with minimal quality loss on instruction tasks
- Spectrum trains real weights, so inference is identical to a standard HuggingFace checkpoint — no adapter merging, no quant artifacts
- Below `spectrum_ratio=0.25`, the VRAM savings no longer justify the quality gap over LoRA
Tested on Llama 3.2 3B, Python 3.12, CUDA 12.4, RTX 4090 24GB and A100 40GB (AWS us-east-1 p4d.24xlarge)
## FAQ
Q: Does Spectrum work with Mistral, Qwen, or Gemma architectures?
A: Yes. The SNR scanner operates on named parameter groups via HuggingFace's `model.named_parameters()` — it's architecture-agnostic as long as the model loads via `transformers`.
Q: Can I combine Spectrum with QLoRA quantization?
A: Not directly. Spectrum trains real float weights; QLoRA quantizes them to 4-bit NF4. Mixing the two requires dequantizing before the gradient update, which defeats the VRAM savings. Use one or the other.
Q: How long does the SNR scan take?
A: About 2–4 minutes for a 7B model on a single GPU. The scan runs once and the config is reusable — rescan only if you switch base checkpoints.
Q: What `learning_rate` should I use with Spectrum?
A: Use 1e-5 to 3e-5 — lower than LoRA defaults (1e-4) because you're updating real weights directly. Higher LRs cause instability in the unfrozen layers.
Q: Does the frozen layer config transfer between datasets?
A: The SNR config is a property of the base model, not the dataset. You can reuse `spectrum_config.yaml` across different fine-tune tasks on the same base checkpoint without rescanning.