Fine-Tune LLMs on RunPod: GPU Cloud Setup Guide 2026

Fine-tune Llama 3, Mistral, or Qwen2.5 on RunPod GPU cloud using Axolotl and QLoRA. Step-by-step setup for A100/H100 pods, pod config, and cost control. Tested on Python 3.12.

RunPod GPU cloud LLM fine-tuning gives you on-demand A100 and H100 access for $1.19–$3.99/hr — no reserved instance commitment, no idle billing between runs. This guide walks you from pod creation to a trained LoRA adapter in under 30 minutes.

You'll learn:

  • Provision a RunPod GPU pod with the right template for fine-tuning
  • Install Axolotl and configure a QLoRA run for Llama 3.1 8B
  • Monitor VRAM, save checkpoints to RunPod Volume, and terminate cleanly to stop billing

Time: 30 min | Difficulty: Intermediate


Why RunPod for LLM Fine-Tuning

Most fine-tuning jobs fail at the infrastructure layer, not the model layer. Local GPUs run out of VRAM. Cloud VMs charge you while they sit idle. RunPod solves both: you rent GPU-hours, not GPU-months.

The three failure modes RunPod eliminates:

  • OOM crashes mid-run — choose the exact VRAM tier (24GB / 40GB / 80GB) for your model size
  • Idle billing — pods bill per second; terminate when the job finishes
  • Data loss on crash — Network Volumes persist independently of pod lifecycle

RunPod's community cloud starts at $0.39/hr for an RTX 4090 (24GB). The secure cloud tier (SOC 2 Type II, US data centers in us-east-1 and us-west-2) starts at $1.19/hr for an A100 80GB — the standard for serious fine-tuning work.


GPU Selection: What VRAM You Actually Need

[Figure: RunPod fine-tuning workflow — end-to-end flow: RunPod pod provisioning → Axolotl QLoRA training → checkpoint on Network Volume → LoRA adapter export]

Match your model size to the right GPU before you provision. Undersizing wastes money on retries.

| Model | Precision | Technique | Min VRAM | RunPod GPU | Est. Cost/hr (USD) |
|---|---|---|---|---|---|
| Llama 3.1 8B | bfloat16 | QLoRA | 16GB | RTX 4080 | $0.74 |
| Llama 3.1 8B | bfloat16 | Full fine-tune | 40GB | A100 40GB | $1.19 |
| Llama 3.3 70B | 4-bit NF4 | QLoRA | 40GB | A100 40GB | $1.19 |
| Llama 3.3 70B | bfloat16 | QLoRA | 80GB | A100 80GB | $1.99 |
| Mixtral 8x7B | 4-bit NF4 | QLoRA | 48GB | A100 80GB | $1.99 |

Rule of thumb: QLoRA on a 7–8B model fits on 16GB VRAM. Go to 40GB when batch size or sequence length pushes you over. H100 ($3.99/hr) is only worth it for jobs over 6 hours — the speed gain rarely justifies the premium on shorter runs.
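The rule of thumb above can be turned into a rough back-of-envelope estimator. This is a heuristic sketch, not a guarantee: the 0.5 bytes/parameter figure is exact for 4-bit weights, but the fixed overhead allowance (adapters, optimizer state, activations, CUDA context) is an assumption that grows with sequence length and batch size.

```python
def qlora_vram_gb(params_b: float, overhead_gb: float = 6.0) -> float:
    """Rough QLoRA VRAM estimate: 4-bit base weights (0.5 bytes/param)
    plus a fixed allowance for LoRA adapters, optimizer state,
    activations, and CUDA overhead. Heuristic, not a guarantee."""
    weights_gb = params_b * 0.5  # 4-bit quantization = 0.5 bytes per parameter
    return weights_gb + overhead_gb

for size_b in (8, 70):
    print(f"{size_b}B model: ~{qlora_vram_gb(size_b):.0f} GB VRAM for QLoRA")
```

The estimates land below the table's minimums on purpose: the table adds headroom for longer sequences and retries, which is the margin you actually want when picking a tier.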


Step 1: Create a RunPod Account and Add Credits

RunPod uses prepaid credits. Add a minimum of $10 to start — a 30-minute A100 job costs under $1.00.

  1. Go to runpod.io and sign up
  2. Navigate to Billing → Add Credits → enter a USD amount
  3. Under Settings → SSH Public Key, paste your public key (~/.ssh/id_ed25519.pub)

SSH access is required for reliable file transfers and terminal access. Do not skip this step.


Step 2: Create a Network Volume

Network Volumes persist your datasets, checkpoints, and final adapters across pod restarts. Without one, a pod crash wipes everything.

  1. Go to Storage → + Network Volume
  2. Set Name: llm-finetune-vol
  3. Set Size: 50 GB (enough for a 7B model + dataset + checkpoints)
  4. Set Region: US-TX-3 or any US region matching your planned pod region
  5. Click Create

Cost: $0.07/GB/month — 50GB costs $3.50/month. Delete unused volumes when the project ends.


Step 3: Provision a GPU Pod

  1. Go to Pods → + Deploy
  2. Select Secure Cloud for US-based SOC 2 infrastructure, or Community Cloud to minimize cost
  3. Filter by GPU: select A100 80GB SXM for Llama 3.3 70B, or RTX 4090 for 8B models
  4. Under Template, search for and select: RunPod PyTorch 2.4 (CUDA 12.4, Python 3.12, pre-installed)
  5. Under Volume, attach llm-finetune-vol and set mount path to /workspace
  6. Set Container Disk: 20 GB (OS + dependencies only — data lives on the volume)
  7. Click Deploy On-Demand

The pod reaches Running state in 60–90 seconds. Click Connect → Start Web Terminal or copy the SSH command.


Step 4: Install Axolotl

Axolotl is the standard fine-tuning framework for QLoRA on consumer and cloud GPUs. It wraps Hugging Face Transformers, PEFT, and bitsandbytes into a single YAML-driven config.

SSH into your pod:

ssh root@<pod-ip> -p <pod-port> -i ~/.ssh/id_ed25519

Install Axolotl into a virtual environment on the persistent volume so you don't reinstall on pod restarts — a plain pip install lands on the container disk, which Terminate deletes:

cd /workspace

# Create the env on the volume so it survives pod restarts; reuse the
# template's preinstalled PyTorch via --system-site-packages
python -m venv --system-site-packages /workspace/venv
source /workspace/venv/bin/activate

pip install packaging ninja --quiet
pip install "axolotl[flash-attn,deepspeed]" --quiet

# Verify GPU is visible
python -c "import torch; print(torch.cuda.get_device_name(0), torch.cuda.get_device_properties(0).total_memory // 1e9, 'GB')"

Expected output: NVIDIA A100-SXM4-80GB 79.0 GB

If it fails:

  • ModuleNotFoundError: flash_attn → the RunPod PyTorch template ships CUDA 12.4; run pip install flash-attn --no-build-isolation then retry
  • CUDA error: no kernel image → you selected the wrong template; stop the pod, switch to RunPod PyTorch 2.4, and redeploy

Step 5: Prepare Your Dataset

Axolotl supports several dataset formats; the three you'll use most often are alpaca, sharegpt, and completion. The sharegpt format maps directly to chat-tuned models.

Create a minimal dataset file:

mkdir -p /workspace/data
cat > /workspace/data/train.jsonl << 'EOF'
{"conversations": [{"from": "human", "value": "Explain gradient checkpointing in one sentence."}, {"from": "gpt", "value": "Gradient checkpointing trades compute for memory by recomputing activations during the backward pass instead of storing them."}]}
{"conversations": [{"from": "human", "value": "What is LoRA?"}, {"from": "gpt", "value": "LoRA freezes base model weights and injects trainable low-rank matrices into attention layers, reducing trainable parameters by up to 10,000x."}]}
EOF

For real fine-tuning jobs, upload your dataset to the Network Volume via scp:

# Run locally — copies dataset from your machine to the pod volume
scp -P <pod-port> -i ~/.ssh/id_ed25519 ./my_dataset.jsonl root@<pod-ip>:/workspace/data/
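A malformed line in the JSONL surfaces as a cryptic dataloader error minutes into a paid run, so it's worth validating first. A minimal checker, assuming the sharegpt schema shown above (the sample file and its path here are illustrative; point it at /workspace/data/train.jsonl on the pod):

```python
import json

def validate_sharegpt_jsonl(path: str) -> int:
    """Check each line parses as JSON and matches the sharegpt shape:
    {"conversations": [{"from": ..., "value": ...}, ...]}.
    Returns the number of valid examples; raises on the first bad line."""
    count = 0
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)  # raises on malformed JSON
            for turn in record["conversations"]:
                assert turn["from"] in ("human", "gpt", "system"), \
                    f"line {lineno}: unexpected role {turn['from']!r}"
                assert isinstance(turn["value"], str), \
                    f"line {lineno}: 'value' must be a string"
            count += 1
    return count

# Demo with a tiny sample file (illustrative path)
sample = ('{"conversations": [{"from": "human", "value": "hi"}, '
          '{"from": "gpt", "value": "hello"}]}\n')
with open("train_sample.jsonl", "w") as f:
    f.write(sample)
print(validate_sharegpt_jsonl("train_sample.jsonl"), "valid examples")
```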

Step 6: Write the Axolotl Config

Create /workspace/config/llama3-qlora.yml:

mkdir -p /workspace/config
# /workspace/config/llama3-qlora.yml

base_model: meta-llama/Meta-Llama-3.1-8B-Instruct

# 4-bit NF4 quantization — loads the base model in 4-bit, trains LoRA adapters in bfloat16
load_in_4bit: true
bnb_4bit_compute_dtype: bfloat16    # compute in bfloat16 even though weights are 4-bit
bnb_4bit_quant_type: nf4            # NF4 outperforms FP4 on LLM weights empirically

# LoRA adapter config
lora_r: 64                 # rank — higher = more capacity, more VRAM. 64 is the sweet spot for 8B
lora_alpha: 128            # scale = alpha/r; keeping 2x rank is standard
lora_dropout: 0.05         # small dropout prevents adapter overfitting on small datasets
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj

# Dataset
datasets:
  - path: /workspace/data/train.jsonl
    type: sharegpt

# Training
sequence_len: 2048           # reduce to 1024 to halve VRAM usage on smaller GPUs
micro_batch_size: 2          # per-GPU batch size; increase if VRAM allows
gradient_accumulation_steps: 4   # effective batch = micro_batch * grad_accum = 8
num_epochs: 3
optimizer: adamw_bnb_8bit    # 8-bit Adam — saves ~1GB VRAM vs standard AdamW
lr_scheduler: cosine
learning_rate: 2e-4

# Flash Attention 2 — mandatory on A100/H100 for 2x throughput
flash_attention: true

# Output
output_dir: /workspace/output/llama3-qlora
save_steps: 50               # checkpoint every 50 steps — safe against spot instance preemption
logging_steps: 10

# Hugging Face token for gated model access
hf_use_auth_token: true
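The batch and epoch settings above determine the run length, and the arithmetic is simple enough to script before you launch. A small calculator using the config's values (the 10,000-sample dataset size is hypothetical):

```python
import math

def training_steps(samples: int, micro_batch: int,
                   grad_accum: int, epochs: int) -> tuple[int, int]:
    """Effective batch size and total optimizer steps implied by a config."""
    effective_batch = micro_batch * grad_accum
    steps_per_epoch = math.ceil(samples / effective_batch)
    return effective_batch, steps_per_epoch * epochs

# micro_batch_size: 2, gradient_accumulation_steps: 4, num_epochs: 3 (from the config);
# the sample count is an assumed dataset size for illustration
eff, steps = training_steps(samples=10_000, micro_batch=2, grad_accum=4, epochs=3)
print(f"effective batch {eff}, {steps} optimizer steps")
```

Multiply the step count by your observed seconds per step from the training log to estimate wall-clock time, and from there the pod cost.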

Set your Hugging Face token (required for Llama 3 gated access):

export HF_TOKEN=hf_your_token_here   # generate at huggingface.co/settings/tokens

Step 7: Launch Fine-Tuning

cd /workspace

# Start training — logs to stdout; Ctrl+C safely stops and saves the last checkpoint
accelerate launch -m axolotl.cli.train /workspace/config/llama3-qlora.yml

Expected output on first run:

[INFO] Loading model meta-llama/Meta-Llama-3.1-8B-Instruct in 4-bit...
[INFO] trainable params: 83,886,080 || all params: 8,114,774,016 || trainable%: 1.0338
[INFO] Step 10/300 | loss: 1.8432 | lr: 0.000198 | tokens/sec: 2841

The trainable%: 1.03 line confirms QLoRA is active — you're training 84M parameters, not 8B.
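The trainable count follows directly from the LoRA shapes: each adapted projection of shape (d_in, d_out) gets two low-rank factors adding r·(d_in + d_out) parameters. A sketch using Llama 3.1 8B's projection shapes (assumed from the public architecture; the exact total reported in the log depends on which modules PEFT actually wraps in your run):

```python
def lora_trainable_params(r: int, shapes: list[tuple[int, int]],
                          layers: int) -> int:
    """LoRA adds factors A (r x d_in) and B (d_out x r) per adapted
    projection, i.e. r * (d_in + d_out) trainable parameters each."""
    return layers * sum(r * (d_in + d_out) for d_in, d_out in shapes)

# Assumed Llama 3.1 8B shapes: hidden 4096, GQA kv dim 1024, MLP 14336, 32 layers
shapes = [
    (4096, 4096),   # q_proj
    (4096, 1024),   # k_proj (grouped-query attention)
    (4096, 1024),   # v_proj
    (4096, 4096),   # o_proj
    (4096, 14336),  # gate_proj
    (4096, 14336),  # up_proj
    (14336, 4096),  # down_proj
]
print(lora_trainable_params(r=64, shapes=shapes, layers=32))
```

Whatever the exact figure, it stays around 1–2% of the 8B base, which is why the optimizer state fits comfortably next to the 4-bit weights.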

Monitor VRAM in a second terminal:

watch -n 2 nvidia-smi --query-gpu=memory.used,memory.free,utilization.gpu --format=csv

You should see ~18–22GB used on an A100 80GB for this config. If usage exceeds 75GB, reduce sequence_len to 1024 or micro_batch_size to 1.
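If you'd rather alert automatically than eyeball the watch output, the CSV from that query is easy to parse. A sketch, assuming the column order of the query flags above (the sample row is illustrative):

```python
def parse_smi_csv(csv_text: str) -> list[dict]:
    """Parse `nvidia-smi --query-gpu=memory.used,memory.free,utilization.gpu
    --format=csv` output: a header row, then one row per GPU
    like '21504 MiB, 59596 MiB, 98 %'."""
    rows = []
    for line in csv_text.strip().splitlines()[1:]:  # skip the header row
        used, free, util = (field.strip() for field in line.split(","))
        rows.append({
            "used_mib": int(used.split()[0]),
            "free_mib": int(free.split()[0]),
            "util_pct": int(util.split()[0]),
        })
    return rows

sample = ("memory.used [MiB], memory.free [MiB], utilization.gpu [%]\n"
          "21504 MiB, 59596 MiB, 98 %")
print(parse_smi_csv(sample))
```

Wire it to `subprocess.run(["nvidia-smi", ...])` in a loop and you have a one-file OOM early-warning script that can page you before a run dies.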


Step 8: Export the Adapter

When training completes, merge the LoRA adapter into the base model for inference:

# Merge adapter weights into the base model
python -m axolotl.cli.merge_lora /workspace/config/llama3-qlora.yml \
  --lora_model_dir /workspace/output/llama3-qlora \
  --output_dir /workspace/output/llama3-merged

Expected output: Saving merged model to /workspace/output/llama3-merged

The merged model is ~16GB on disk. Download it to your local machine before terminating the pod:

# Run locally
rsync -avz -e "ssh -p <pod-port> -i ~/.ssh/id_ed25519" \
  root@<pod-ip>:/workspace/output/llama3-merged ./llama3-merged

Verification

Test the merged model with a quick inference check before downloading:

python - <<'EOF'
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_path = "/workspace/output/llama3-merged"
tok = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map="auto")

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "What is gradient checkpointing?"}],
    tokenize=False, add_generation_prompt=True
)
inputs = tok(prompt, return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=100)
print(tok.decode(out[0], skip_special_tokens=True))
EOF

You should see: A coherent response about gradient checkpointing from your fine-tuned model.


Step 9: Terminate the Pod

This step saves money. An A100 80GB left running overnight costs ~$48.

  1. Go to the RunPod dashboard → Pods
  2. Click Stop on your pod — billing pauses immediately
  3. Click Terminate only after confirming all outputs are on the Network Volume or downloaded locally

Your Network Volume (llm-finetune-vol) persists. The next run, attach it to a fresh pod and skip Steps 4–6.


Cost Reference (USD)

| Job | GPU | Duration | Cost |
|---|---|---|---|
| Llama 3.1 8B QLoRA, 3 epochs, 10k samples | RTX 4090 | ~45 min | ~$0.55 |
| Llama 3.1 8B QLoRA, 3 epochs, 10k samples | A100 40GB | ~25 min | ~$0.50 |
| Llama 3.3 70B QLoRA, 3 epochs, 10k samples | A100 80GB | ~3.5 hrs | ~$6.97 |
| Llama 3.3 70B full fine-tune, 1 epoch | H100 80GB × 2 | ~6 hrs | ~$47.88 |

For jobs over $20, use Spot pods (50–70% discount) with save_steps: 25 — preemption is rare but possible.
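The spot break-even math is worth scripting before you commit. A sketch where the 60% discount and the 10% re-run allowance for steps lost between checkpoints are illustrative assumptions, not RunPod's published figures:

```python
def spot_cost(on_demand_rate: float, hours: float,
              discount: float = 0.60, retry_overhead: float = 0.10) -> float:
    """Expected spot cost: discounted hourly rate, plus a small allowance
    for re-running work lost between checkpoints after a preemption."""
    return on_demand_rate * hours * (1 - discount) * (1 + retry_overhead)

# Hypothetical: the 70B QLoRA job from the table on a spot A100 80GB
on_demand = 1.99 * 3.5
print(f"spot ~${spot_cost(1.99, 3.5):.2f} vs ~${on_demand:.2f} on-demand")
```

Even with the retry allowance, the discount dominates on multi-hour jobs, which is why the spot recommendation kicks in above the $20 mark rather than on quick runs.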


What You Learned

  • RunPod bills per second; always terminate pods after the job, not just stop them
  • Network Volumes are the only persistent layer — attach one before every fine-tuning run
  • QLoRA at lora_r: 64 on an 8B model trains ~1% of parameters while preserving base model quality
  • flash_attention: true in Axolotl config is mandatory on A100/H100 — disabling it halves throughput

Tested on RunPod Secure Cloud, A100 80GB SXM, Axolotl 0.5.1, Python 3.12, CUDA 12.4, PyTorch 2.4


FAQ

Q: Does RunPod work without a Hugging Face token for non-gated models? A: Yes — set hf_use_auth_token: false in the Axolotl config. Only Llama 3, Gemma, and other gated models require a token.

Q: What is the difference between Stop and Terminate on RunPod? A: Stop pauses billing and preserves the container disk. Terminate deletes the container disk permanently. Always use Stop first, then Terminate only after your outputs are saved to the Network Volume.

Q: Minimum VRAM for Llama 3.3 70B QLoRA on RunPod? A: 40GB VRAM with sequence_len: 1024 and micro_batch_size: 1. For sequence_len: 2048, use an 80GB A100.

Q: Can I use RunPod with DeepSpeed for multi-GPU fine-tuning? A: Yes — set deepspeed: deepspeed_configs/zero3.json in the Axolotl config and deploy a pod with 2×A100 or 4×A100 from the RunPod multi-GPU pod type selector.
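Axolotl ships ready-made DeepSpeed configs, so `deepspeed_configs/zero3.json` works out of the box. If you need to customize one, a minimal ZeRO-3 sketch looks like the following; the field values here are illustrative defaults, not tuned settings for any particular pod:

```json
{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "bf16": { "enabled": true },
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
```

The `"auto"` values let Axolotl fill in the batch settings from your YAML config, so the two files can't drift apart.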