Problem: Full Fine-Tuning Is Too Expensive and LoRA Misses Critical Layers
LISA fine-tuning (Layerwise Importance Sampled AdamW) solves the GPU memory crisis that blocks most developers from training 7B+ models on consumer hardware. LoRA freezes most weights and trains low-rank adapters, but it distributes parameter updates across all layers uniformly, ignoring the fact that different layers contribute differently to task learning.
LISA fixes this. At each sampling interval, it activates only a small subset of layers for training and freezes the rest. The result: lower peak GPU memory than LoRA on the same model, with comparable or better downstream accuracy.
You'll learn:
- How LISA's importance sampling selects layers per training step
- How to run a LISA fine-tune on Llama 3.1 8B using LLaMA-Factory
- How to tune lisa_activated_layers and lisa_step_interval for your hardware
- When LISA outperforms LoRA, and when it doesn't
Time: 25 min | Difficulty: Intermediate | Tested on: Python 3.12, CUDA 12.4, RTX 4090 (24GB) and A100 40GB
Why This Happens: Layer Contribution Is Not Uniform
Full fine-tuning a 7B-parameter model with AdamW requires ~112GB of GPU VRAM in fp32 (weights, gradients, and two optimizer moments), or ~56GB in bf16. That rules out every consumer GPU and most cloud instances under $4/hr.
LoRA reduced this by introducing low-rank adapter matrices and freezing the base model weights. But LoRA has a structural weakness: it applies adapters uniformly across layers, with no mechanism to prioritize layers that are actually driving gradient updates for your task.
LISA, introduced in the 2024 paper LISA: Layerwise Importance Sampled AdamW, reframes fine-tuning as a stochastic layer selection problem:
- At each sampling interval, LISA draws a fresh random subset of transformer layers to train
- It activates only those K layers (controlled by lisa_activated_layers) for the interval
- All other layers stay frozen for the interval, cutting memory to a fraction of a full fine-tune
The embedding layer and the language model head are always trained. Only the intermediate transformer blocks are sampled.
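The rotation can be sketched in a few lines of Python. This is an illustrative sketch assuming uniform random sampling of the active set; the constants and function name are ours, not LLaMA-Factory's:

```python
import random

NUM_LAYERS = 32        # transformer blocks in the model
ACTIVATED_LAYERS = 2   # maps to lisa_activated_layers
STEP_INTERVAL = 20     # maps to lisa_step_interval

def simulate_lisa_schedule(total_steps, seed=0):
    """Return, for each optimizer step, which transformer blocks are
    unfrozen. Every STEP_INTERVAL steps a fresh subset is sampled;
    embeddings and the LM head would stay trainable throughout."""
    rng = random.Random(seed)
    active, schedule = [], []
    for step in range(total_steps):
        if step % STEP_INTERVAL == 0:
            # resample: frozen blocks hold no gradients or optimizer state
            active = sorted(rng.sample(range(NUM_LAYERS), ACTIVATED_LAYERS))
        schedule.append(active)
    return schedule

schedule = simulate_lisa_schedule(40)
# The active set is constant within an interval; it rotates at step 20
assert schedule[0] == schedule[19]
assert all(len(s) == ACTIVATED_LAYERS for s in schedule)
```

Only the blocks in the current active set carry gradients and AdamW state, which is where the memory savings come from.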
Symptoms that you need LISA instead of LoRA:
- OOM errors with 4-bit LoRA on 7B+ models
- LoRA task accuracy plateaus before convergence
- You need full-weight updates (not adapter merging) for production deployment
Architecture: How LISA Samples Layers
LISA training loop: sample K transformer blocks to unfreeze → AdamW steps update only those blocks (plus embeddings and the LM head) → resample every lisa_step_interval steps
LISA's memory footprint per step scales with lisa_activated_layers, not total layer count. For a 32-layer model with lisa_activated_layers=2, only 2 transformer blocks plus embeddings hold optimizer states at any moment.
The lisa_step_interval parameter controls how often the active layer set is resampled. Lower values = more frequent resampling = more even coverage but more overhead. Higher values = fewer switches = faster per-step throughput but less layer diversity.
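One way to reason about that trade-off: treat each resample as an independent uniform draw and estimate how many distinct layers get updated at least once over a run. This is back-of-envelope math, not an LLaMA-Factory API:

```python
def expected_covered_layers(num_layers, k, total_steps, step_interval):
    """Expected number of distinct blocks updated at least once,
    assuming each resample draws k of num_layers uniformly at random."""
    rounds = total_steps // step_interval
    p_never_picked = (1 - k / num_layers) ** rounds
    return num_layers * (1 - p_never_picked)

# 1,000 optimizer steps, k=2 of 32 layers:
print(expected_covered_layers(32, 2, 1000, 20))   # interval 20  -> ~30.7 layers
print(expected_covered_layers(32, 2, 1000, 100))  # interval 100 -> ~15.2 layers
```

With a short run and a long interval, a large fraction of the model never receives an update, which is why the interval matters more on small datasets.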
Solution
Step 1: Install LLaMA-Factory with LISA Support
LISA is implemented in LLaMA-Factory. Clone the repo and install in a clean environment.
```bash
# Use uv for fast, reproducible installs
pip install uv --break-system-packages
uv venv lisa-env --python 3.12
source lisa-env/bin/activate
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
uv pip install -e ".[torch,metrics]"
```
Verify CUDA is visible:
```bash
python -c "import torch; print(torch.cuda.get_device_name(0), torch.version.cuda)"
```
Expected output: NVIDIA RTX 4090 12.4 (or your device)
If it fails:
- CUDA not available → Run nvidia-smi to confirm the driver is loaded. Reinstall torch with uv pip install torch --index-url https://download.pytorch.org/whl/cu124
- No module named llamafactory → Confirm you ran uv pip install -e ".[torch,metrics]" from inside the cloned repo directory
Step 2: Prepare Your Dataset in ShareGPT Format
LISA trains on the same formats as standard LLaMA-Factory fine-tunes. ShareGPT is the most common.
```json
[
  {
    "conversations": [
      {
        "from": "human",
        "value": "Summarize the following earnings call transcript in three bullet points:\n\n[transcript text]"
      },
      {
        "from": "gpt",
        "value": "• Revenue grew 18% YoY to $2.4B, beating consensus by $120M...\n• Operating margin compressed 200bps due to infrastructure investment...\n• Management guided Q4 revenue to $2.6–2.7B, below the $2.75B street estimate..."
      }
    ]
  }
]
```
Save to data/my_dataset.json. Register it in data/dataset_info.json:
```json
{
  "my_dataset": {
    "file_name": "my_dataset.json",
    "formatting": "sharegpt",
    "columns": {
      "messages": "conversations"
    }
  }
}
```
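Before launching a run, you can sanity-check the dataset file with a few lines of Python. This is a helper for this guide, not a LLaMA-Factory API:

```python
import json

def check_sharegpt(path):
    """Lightweight sanity check for a ShareGPT-format dataset file:
    verifies the top-level array, known roles, and non-empty turns.
    Returns the number of examples."""
    with open(path) as f:
        data = json.load(f)
    assert isinstance(data, list) and data, "top level must be a non-empty JSON array"
    for i, example in enumerate(data):
        turns = example["conversations"]
        assert turns, f"example {i} has no turns"
        for turn in turns:
            assert turn["from"] in {"system", "human", "gpt"}, f"bad role in example {i}"
            assert turn["value"].strip(), f"empty turn value in example {i}"
    return len(data)
```

Call check_sharegpt("data/my_dataset.json") after saving; it raises on malformed roles or empty turns, which otherwise surface as confusing template errors at train time.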
Step 3: Configure the LISA Training YAML
Create configs/lisa_llama3_8b.yaml. The critical LISA parameters are finetuning_type: full with the lisa_* keys — without finetuning_type: full, LISA silently falls back to LoRA behavior.
```yaml
### Model
model_name_or_path: meta-llama/Meta-Llama-3.1-8B-Instruct
trust_remote_code: true

### Training method
finetuning_type: full  # REQUIRED — LISA runs on full fine-tune, not LoRA
stage: sft

### LISA-specific parameters
lisa_activated_layers: 2  # How many transformer layers to activate per step
lisa_step_interval: 20    # Resample active layers every N optimizer steps

### Dataset
dataset: my_dataset
template: llama3
cutoff_len: 2048
max_samples: 10000

### Output
output_dir: outputs/lisa-llama3-8b
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### Training hyperparameters
per_device_train_batch_size: 2
gradient_accumulation_steps: 8  # Effective batch = 16
num_train_epochs: 3
learning_rate: 2.0e-5
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true  # Required for LISA on Ampere+ GPUs
ddp_find_unused_parameters: false

### Memory optimization
gradient_checkpointing: true
flash_attn: fa2  # Flash Attention 2 — halves attention memory
```
Step 4: Launch the LISA Training Run
```bash
llamafactory-cli train configs/lisa_llama3_8b.yaml
```
Expected output in the first 30 seconds:
```
[INFO] LISA activated layers: 2 / 32
[INFO] LISA step interval: 20
[INFO] Trainable params: 394,592,256 (current step) of 8,030,261,248 total
```
The "trainable params" count will change every lisa_step_interval steps as the active layer set rotates — this is normal and expected.
If it fails:
- RuntimeError: Expected all tensors to be on the same device → Add ddp_find_unused_parameters: false to your config (already included above)
- CUDA out of memory → Reduce per_device_train_batch_size to 1, or lower lisa_activated_layers to 1
- Flash attention not available → Run pip install flash-attn --no-build-isolation or switch to flash_attn: sdpa
Step 5: Monitor VRAM Usage and Layer Sampling
In a second terminal, watch GPU memory across the training run:
```bash
# Log VRAM every 10 seconds to a file (nvidia-smi loops itself with -l)
nvidia-smi --query-gpu=memory.used,memory.free --format=csv -l 10 >> vram_log.csv
```
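To pull a single peak number out of vram_log.csv afterwards, a small parser is enough. This is our helper, assuming nvidia-smi CSV rows like "14512 MiB, 9680 MiB" and skipping any repeated header lines:

```python
import csv

def peak_vram_mib(log_path):
    """Scan an nvidia-smi CSV log and return peak memory.used in MiB.
    Header lines (which end in ']') are ignored automatically."""
    peak = 0
    with open(log_path) as f:
        for row in csv.reader(f):
            if not row:
                continue
            cell = row[0].strip()
            if cell.endswith("MiB"):  # data rows look like '14512 MiB'
                peak = max(peak, int(cell.split()[0]))
    return peak
```

Comparing the peak against the table below tells you whether you have headroom to raise lisa_activated_layers or the batch size.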
You'll see VRAM usage fluctuate slightly every lisa_step_interval steps as optimizer states shift between layer groups. Peak VRAM for Llama 3.1 8B with the config above:
| Config | VRAM (RTX 4090 24GB) |
|---|---|
| Full fine-tune (bf16) | OOM |
| LoRA r=16 + 4-bit | ~16GB |
| LISA activated_layers=2 | ~14GB |
| LISA activated_layers=1 | ~11GB |
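Most of the gap in that table is AdamW optimizer state. A rough calculation, ignoring the always-trained embeddings and assuming ~218M parameters per Llama 3.1 8B transformer block (our estimate) and fp32 moment tensors:

```python
def adamw_state_gib(trainable_params, moments=2, bytes_per_value=4):
    """AdamW keeps two fp32 moment tensors per trainable parameter."""
    return trainable_params * moments * bytes_per_value / 1024**3

PARAMS_PER_BLOCK = 218e6   # rough estimate for one Llama 3.1 8B block
TOTAL_PARAMS = 8.03e9

print(f"{adamw_state_gib(TOTAL_PARAMS):.1f} GiB")            # full fine-tune: 59.8 GiB
print(f"{adamw_state_gib(2 * PARAMS_PER_BLOCK):.1f} GiB")    # LISA, 2 blocks: 3.2 GiB
```

Freezing 30 of 32 blocks removes almost all of that state, which is why lisa_activated_layers dominates the memory budget.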
Step 6: Tune lisa_activated_layers for Your Hardware
lisa_activated_layers is the single most important parameter. More layers = better gradient coverage = higher accuracy potential, but more VRAM.
General tuning heuristics for a 32-layer model:
```yaml
# 8–16GB VRAM (RTX 3080, 4080)
lisa_activated_layers: 1
lisa_step_interval: 20

# 24GB VRAM (RTX 4090, A10G)
lisa_activated_layers: 2
lisa_step_interval: 20

# 40GB+ VRAM (A100, H100)
lisa_activated_layers: 4
lisa_step_interval: 10  # More frequent resampling — you have the memory budget
```
For models larger than 13B, the research paper recommends starting at lisa_activated_layers=2 and scaling to 4 only if convergence is slow after epoch 1.
Verification
After training completes, run inference on a held-out prompt:
```bash
llamafactory-cli chat \
  --model_name_or_path outputs/lisa-llama3-8b \
  --template llama3 \
  --finetuning_type full
```
For automated eval, use the built-in MMLU benchmark:
```bash
llamafactory-cli eval \
  --model_name_or_path outputs/lisa-llama3-8b \
  --task mmlu \
  --template llama3 \
  --finetuning_type full \
  --batch_size 4
```
You should see: MMLU accuracy within 1–2% of the base model on general tasks, with significant gains on your target domain.
LISA vs LoRA vs QLoRA: When to Use Each
| | LISA | LoRA | QLoRA |
|---|---|---|---|
| VRAM (8B model) | ~14GB | ~16GB | ~8GB |
| Output format | Full weights | Adapter files | Adapter files |
| Deployment | Drop-in replace | Merge or serve adapter | Merge or serve adapter |
| Convergence speed | Fast | Medium | Slower |
| Best for | Production full-weight replacement | Rapid iteration, multiple adapters | Extreme memory constraint |
| Min VRAM (8B) | 11GB | 12GB | 6GB |
Choose LISA if you need full-weight model outputs (no adapter merging step at deploy time), you have at least 12GB VRAM, and you want strong convergence without tuning rank and alpha hyperparameters.
Choose LoRA if you need to maintain a base model and swap multiple task adapters without retraining.
Choose QLoRA if you're on a GPU with less than 12GB VRAM and can accept a small accuracy trade-off.
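The three rules above condense to a small helper. The thresholds are the rough ones quoted in this section, for 8B-class models only:

```python
def pick_method(vram_gb: int, need_full_weights: bool, swap_adapters: bool) -> str:
    """Rule-of-thumb chooser for 8B-class models, per the guidance above."""
    if vram_gb < 12:
        return "QLoRA"  # extreme memory constraint
    if need_full_weights:
        return "LISA"   # full-weight output, no merge step at deploy time
    if swap_adapters:
        return "LoRA"   # keep one base model, swap task adapters
    return "LISA"       # default: strong convergence, no rank/alpha tuning

assert pick_method(10, True, False) == "QLoRA"
assert pick_method(24, True, False) == "LISA"
assert pick_method(24, False, True) == "LoRA"
```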
What You Learned
- LISA activates only K randomly sampled transformer layers per interval, reducing optimizer state memory dramatically
- lisa_activated_layers is your primary memory knob: start at 2 for 24GB GPUs
- lisa_step_interval: 20 is a reliable default; lower it (to 10) only when training on A100/H100 where the overhead is negligible
- LISA outputs a full fine-tuned model, not adapter files, which simplifies production deployment
- LISA underperforms LoRA when your fine-tune dataset is very small (<500 samples) because infrequent layer sampling doesn't cover all relevant weights
Tested on LLaMA-Factory v0.9.x, Python 3.12, CUDA 12.4, RTX 4090 24GB and A100 40GB
FAQ
Q: Does LISA work with Mistral, Qwen, and Gemma models?
A: Yes. LISA is a training algorithm, not a model-specific feature. Any model supported by LLaMA-Factory (Mistral, Qwen2.5, Gemma 2, DeepSeek) works with finetuning_type: full + lisa_activated_layers.
Q: Can I use LISA with 4-bit quantized models?
A: No. LISA requires full-precision (bf16 or fp16) weights because it updates actual layer parameters. For 4-bit base models, use QLoRA instead.
Q: What is the minimum VRAM to run LISA on Llama 3.1 8B?
A: 11GB with lisa_activated_layers: 1, gradient_checkpointing: true, flash_attn: fa2, and batch size 1. The RTX 3080 10GB cannot run it — step up to an RTX 3080 12GB or 3090.
Q: How does lisa_step_interval affect final model quality?
A: Lower intervals (more frequent resampling) generally improve task accuracy by 0.5–1.5% on domain benchmarks because more layers receive gradient updates over the full training run. The cost is ~5–8% more training time per epoch.
Q: Does LISA work with multi-GPU setups via DeepSpeed?
A: Yes. Add deepspeed: configs/ds_z2_config.json to your YAML and launch with torchrun. DeepSpeed ZeRO-2 is recommended — ZeRO-3 can conflict with LISA's per-step layer freezing logic.