# Problem: Fine-Tuning Costs Too Much Memory for the Results You Get
Spectrum fine-tuning is a selective layer training method that freezes low signal-to-noise ratio (SNR) layers and trains only the ones that carry meaningful weight signal — cutting VRAM use by up to 50% without the accuracy tradeoff that comes with LoRA rank compression.
Full fine-tuning of a 7B model on a single A100 80GB is barely feasible. LoRA helps, but rank constraints mean you're approximating the update. Spectrum gives you a third path: train the real weights, just not all of them.
You'll learn:
- How SNR-based layer selection works and why it beats random freezing
- How to scan a model, generate a Spectrum config, and launch training
- How to tune the `spectrum_ratio` parameter to hit your VRAM budget
Time: 25 min | Difficulty: Intermediate
## Why This Happens: SNR and the Uneven Weight Landscape
Most layers in a pre-trained transformer don't need updating for a given task. They've converged to general representations — syntax, factual recall, basic reasoning — that transfer cleanly. Only a subset carries task-specific signal worth overwriting.
Spectrum quantifies this by computing the signal-to-noise ratio of each layer's weight matrix: signal = mean singular value, noise = standard deviation of the singular values. Layers with low SNR are already "settled." Updating them adds noise and wastes compute.
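To make the formula concrete, here is a toy worked example on hand-picked singular value vectors. This is a sketch of the metric only, not Spectrum's implementation; it uses the population standard deviation, and the actual library may normalize differently.

```python
import statistics

def snr(singular_values: list[float]) -> float:
    """SNR = mean(S) / std(S) over a layer's singular value vector S."""
    return statistics.mean(singular_values) / statistics.pstdev(singular_values)

# Tightly clustered spectrum -> high SNR -> ranked high, kept trainable
trained = [3.0, 2.8, 3.1, 2.9]
# Spread-out spectrum -> low SNR -> frozen as "settled"
settled = [9.0, 3.0, 0.5, 0.1]

print(f"high-SNR layer: {snr(trained):.2f}")
print(f"low-SNR layer:  {snr(settled):.2f}")
```

Only the relative ranking matters: Spectrum compares these scores across all layers of one checkpoint, so the absolute values never need calibrating.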
Symptoms that full fine-tuning is overkill:
- VRAM OOM on a 24GB card when training a 7B model
- LoRA results plateau below full fine-tune quality
- Training loss drops fast, then spikes: updating already-converged layers lets them drift and overfit
Spectrum scans each transformer block's SNR, ranks layers, then freezes the bottom N% before training begins.
## How Spectrum Selects Layers
The scan runs on the pre-trained checkpoint before any training. For each named parameter group (attention projections, MLP up/down/gate projections, layer norms):
- Compute the weight matrix's singular value decomposition (SVD), or approximate it with a cheaper Frobenius-norm proxy for large matrices.
- Calculate `SNR = mean(S) / std(S)`, where `S` is the singular value vector.
- Rank all layers by SNR descending.
- Freeze the bottom `(1 - spectrum_ratio) * 100%` of layers.
A `spectrum_ratio` of 0.5 trains the top 50% by SNR — typically the later MLP blocks and the final attention layers, which are most task-sensitive.
```python
# SNR computation — why Frobenius proxy is used for layers > 4096 dim
import torch

def compute_layer_snr(weight: torch.Tensor) -> float:
    # Full SVD is O(n³) — too slow for 4096×14336 MLP weights
    # Frobenius norm of rows approximates singular value spread cheaply
    row_norms = weight.float().norm(dim=1)
    return (row_norms.mean() / (row_norms.std() + 1e-8)).item()
```
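The rank-and-freeze steps can be sketched independently of any model. The function below is an illustration of the selection logic, not Spectrum's actual code, and the layer names are hypothetical: given per-layer SNR scores, it returns the set of layers to freeze for a given `spectrum_ratio`.

```python
def layers_to_freeze(snr_by_layer: dict[str, float], spectrum_ratio: float) -> set[str]:
    # Rank layers by SNR descending; the top spectrum_ratio fraction stays trainable
    ranked = sorted(snr_by_layer, key=snr_by_layer.get, reverse=True)
    n_train = round(len(ranked) * spectrum_ratio)
    # Everything below the cutoff — the low-SNR "settled" layers — gets frozen
    return set(ranked[n_train:])

scores = {
    "layers.0.mlp.up_proj": 0.9,      # low SNR -> settled -> frozen
    "layers.15.self_attn.q_proj": 2.1,
    "layers.30.mlp.gate_proj": 4.7,
    "layers.31.mlp.down_proj": 5.2,   # high SNR -> task-sensitive -> trained
}
print(layers_to_freeze(scores, spectrum_ratio=0.5))
```

With `spectrum_ratio=0.5` and four layers, the two lowest-SNR entries end up frozen; at `spectrum_ratio=1.0` nothing is frozen, which recovers full fine-tuning.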
## Solution

### Step 1: Install Dependencies
Spectrum ships as a standalone library. Use uv for fast, reproducible installs.
```bash
# uv resolves torch+cuda in one pass — avoids pip's conflicting extras problem
uv pip install spectrum-ft transformers accelerate datasets bitsandbytes
```
Verify CUDA is visible to PyTorch before continuing:
```bash
python -c "import torch; print(torch.cuda.get_device_name(0))"
```
Expected output: `NVIDIA RTX 4090` (or your device name)
If it fails with `AssertionError: Torch not compiled with CUDA`, reinstall with `uv pip install torch --index-url https://download.pytorch.org/whl/cu121`.
### Step 2: Scan the Model and Generate a Spectrum Config
The scan writes a YAML config listing which layers to freeze. Run it once per base model — the config is reusable across fine-tune runs on the same checkpoint.
```python
from spectrum import SpectrumScanner

scanner = SpectrumScanner(
    model_name="meta-llama/Llama-3.2-3B",  # base checkpoint, not instruct
    spectrum_ratio=0.5,                    # train top 50% SNR layers
    dtype="bfloat16",                      # match training dtype
)

# Writes spectrum_config.yaml to current directory
scanner.scan_and_save("spectrum_config.yaml")
print(scanner.summary())
```
Expected output:

```
Scanned 288 parameter groups
Training 144 / 288 (50.0%)
Estimated trainable params: 1.61B / 3.21B
Estimated VRAM saving vs full FT: ~47%
```
Open `spectrum_config.yaml` to inspect which layers were selected. MLP gate/up projections in layers 20–31 are almost always in the top 50% for instruction-tuned tasks.
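The config is essentially a list of parameter groups to leave trainable. The sketch below shows the general shape only; the field names here are assumptions, so trust the file the scanner actually wrote rather than hand-writing one:

```yaml
# Illustrative structure — field names are assumptions, not the exact schema
model_name: meta-llama/Llama-3.2-3B
spectrum_ratio: 0.5
unfrozen_parameters:
  - model.layers.31.mlp.gate_proj.weight
  - model.layers.31.mlp.up_proj.weight
  - model.layers.30.self_attn.q_proj.weight
  # ...remaining top-50%-SNR parameter groups
```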
### Step 3: Load the Model with Frozen Layers Applied
```python
from spectrum import load_model_with_spectrum

model, tokenizer = load_model_with_spectrum(
    model_name="meta-llama/Llama-3.2-3B",
    spectrum_config="spectrum_config.yaml",
    torch_dtype="bfloat16",
    device_map="auto",
)

# Confirm frozen vs trainable params
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable/1e9:.2f}B / {total/1e9:.2f}B")
```
Expected output: `Trainable: 1.61B / 3.21B`
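Since applying the config amounts to toggling `requires_grad`, the loader's freezing step can be approximated by hand. The helper below is a hedged sketch: the glob-pattern matching rule is an assumption about the config format, and the stand-in parameter class exists only so the example runs without a real model (with one, you would pass `model.named_parameters()` instead).

```python
from fnmatch import fnmatch

class FakeParam:
    """Stand-in for torch.nn.Parameter — only requires_grad matters here."""
    def __init__(self):
        self.requires_grad = True

def apply_freeze(named_params, unfrozen_patterns):
    # Freeze everything, then re-enable grads for params matching the config
    n_trainable = 0
    for name, param in named_params:
        param.requires_grad = any(fnmatch(name, pat) for pat in unfrozen_patterns)
        n_trainable += param.requires_grad
    return n_trainable

params = {name: FakeParam() for name in [
    "model.layers.0.mlp.up_proj.weight",
    "model.layers.31.mlp.gate_proj.weight",
    "model.layers.31.self_attn.q_proj.weight",
]}
# Keep only layer-31 parameters trainable, as a stand-in for the high-SNR set
kept = apply_freeze(params.items(), ["model.layers.31.*"])
print(kept)  # 2
```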
### Step 4: Configure and Launch Training
Use `transformers.Trainer` — Spectrum is compatible with any training loop that respects `requires_grad`.
```python
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling
from datasets import load_dataset

dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:5000]")

# Trainer expects token ids, not raw chat messages — flatten and tokenize first.
# Llama tokenizers ship without a pad token, which the collator needs for batching.
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token

def tokenize(example):
    text = "\n".join(turn["content"] for turn in example["messages"])
    return tokenizer(text, truncation=True, max_length=2048)

dataset = dataset.map(tokenize, remove_columns=dataset.column_names)

args = TrainingArguments(
    output_dir="./spectrum-llama3-output",
    num_train_epochs=2,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch = 16
    learning_rate=2e-5,             # lower than LoRA defaults — real weights update
    bf16=True,
    logging_steps=10,
    save_strategy="epoch",
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```
If you OOM at `spectrum_ratio=0.5`:

- Drop to `spectrum_ratio=0.35` and rescan, which trains only the top 35% of layers
- Add `gradient_checkpointing=True` to `TrainingArguments`
### Step 5: Save and Merge
Because Spectrum trains real weights (not adapters), saving is a standard checkpoint — no merge step needed.
```python
model.save_pretrained("./spectrum-llama3-final")
tokenizer.save_pretrained("./spectrum-llama3-final")
```
Load for inference identically to any HuggingFace checkpoint:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "./spectrum-llama3-final",
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("./spectrum-llama3-final")
```
## Tuning `spectrum_ratio` to Hit Your VRAM Budget
| `spectrum_ratio` | Trainable params (3B) | VRAM (bfloat16, batch 4) | Quality vs full FT |
|---|---|---|---|
| 1.0 (full FT) | 3.21B | ~28 GB | Baseline |
| 0.75 | 2.41B | ~22 GB | ≈ full FT |
| 0.50 | 1.61B | ~16 GB | −0.3–0.8 MT-Bench pts |
| 0.35 | 1.12B | ~12 GB | −1–2 MT-Bench pts |
| 0.25 | 0.80B | ~10 GB | Comparable to LoRA r=64 |
Start at 0.5 on 24GB cards. Drop to 0.35 for 16GB. Below 0.25, LoRA is usually a better tradeoff.
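As a back-of-envelope for picking a starting ratio: trainable-parameter count scales roughly linearly with `spectrum_ratio`, and each trainable parameter costs extra memory for its gradient and optimizer state on top of the weight itself. The sketch below assumes bf16 weights and gradients with fp32 Adam moments; it deliberately ignores activations, batch size, and CUDA overhead, so treat its numbers as a relative ordering, not as the table's measured figures.

```python
def rough_vram_gb(total_params: float, spectrum_ratio: float) -> float:
    """Weights in bf16 (2 B/param) for every param; gradients (bf16, 2 B/param)
    plus fp32 Adam moments (8 B/param) only for the trainable fraction."""
    trainable = total_params * spectrum_ratio
    weights = total_params * 2
    grads_and_optim = trainable * (2 + 8)
    return (weights + grads_and_optim) / 1e9

for ratio in (1.0, 0.5, 0.35, 0.25):
    print(f"ratio {ratio:.2f}: ~{rough_vram_gb(3.21e9, ratio):.1f} GB (ex-activations)")
```

The fixed weight term is why savings flatten out at low ratios: below roughly 0.25, the frozen weights dominate the budget and further freezing buys little.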
## Spectrum vs LoRA vs QLoRA: Which to Use
| | Spectrum | LoRA | QLoRA |
|---|---|---|---|
| Trains real weights | ✅ | ❌ (adapters) | ❌ (adapters) |
| Merge step required | ❌ | ✅ | ✅ |
| Min VRAM (7B model) | ~22 GB | ~10 GB | ~6 GB |
| Quality ceiling | Full FT quality | Rank-limited | Rank-limited + quant noise |
| Layer selection logic | SNR-based (principled) | Manual (all layers, fixed rank) | Manual |
| Best for | 24GB+ cards, quality-first | 12–24GB cards | < 12 GB, consumer GPU |
Choose Spectrum if you have a 24GB+ GPU and want full fine-tune quality without full fine-tune memory cost. Choose LoRA/QLoRA if you're on a 16GB or consumer card where Spectrum's minimum overhead is still too high.
## Verification
```bash
python -c "
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained('./spectrum-llama3-final', torch_dtype=torch.bfloat16, device_map='auto')
tok = AutoTokenizer.from_pretrained('./spectrum-llama3-final')
inp = tok('The capital of France is', return_tensors='pt').to('cuda')
out = model.generate(**inp, max_new_tokens=10)
print(tok.decode(out[0]))
"
```
You should see something like: `The capital of France is Paris, which is located`
## What You Learned
- SNR-based layer selection is principled — it identifies which weights have converged vs which are still task-sensitive
- `spectrum_ratio=0.5` gives roughly half the VRAM cost of full fine-tuning with minimal quality loss on instruction tasks
- Spectrum trains real weights, so inference is identical to a standard HuggingFace checkpoint — no adapter merging, no quant artifacts
- Below `spectrum_ratio=0.25`, the VRAM savings no longer justify the quality gap over LoRA
Tested on Llama 3.2 3B, Python 3.12, CUDA 12.4, RTX 4090 24GB and A100 40GB (AWS us-east-1 p4d.24xlarge)
## FAQ
Q: Does Spectrum work with Mistral, Qwen, or Gemma architectures?
A: Yes. The SNR scanner operates on named parameter groups via HuggingFace's `model.named_parameters()` — it's architecture-agnostic as long as the model loads via `transformers`.
Q: Can I combine Spectrum with QLoRA quantization?
A: Not directly. Spectrum trains real float weights; QLoRA quantizes them to 4-bit NF4. Mixing the two requires dequantizing before the gradient update, which defeats the VRAM savings. Use one or the other.
Q: How long does the SNR scan take?
A: About 2–4 minutes for a 7B model on a single GPU. The scan runs once and the config is reusable — rescan only if you switch base checkpoints.
Q: What `learning_rate` should I use with Spectrum?
A: Use 1e-5 to 3e-5 — lower than LoRA defaults (1e-4) because you're updating real weights directly. Higher LRs cause instability in the unfrozen layers.
Q: Does the frozen layer config transfer between datasets?
A: The SNR config is a property of the base model, not the dataset. You can reuse `spectrum_config.yaml` across different fine-tune tasks on the same base checkpoint without rescanning.