Fix CUDA Out of Memory Errors Running Local AI in 15 Minutes

Solve CUDA OOM errors when running local LLMs with practical memory management fixes for PyTorch and llama.cpp.

Problem: CUDA Out of Memory When Running Local AI Models

You're loading a local LLM or running inference with PyTorch and hit RuntimeError: CUDA out of memory. Tried to allocate X GiB. The model worked yesterday, or it works on someone else's machine.

You'll learn:

  • Why CUDA OOM errors happen even when VRAM looks "free"
  • How to actually clear GPU memory between runs
  • Which quantization and offloading settings to change first

Time: 15 min | Level: Intermediate


Why This Happens

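VRAM is managed differently from system RAM. PyTorch's caching allocator keeps freed blocks for reuse instead of returning them to the GPU driver, so nvidia-smi can report VRAM as used long after you delete the tensors. Fragmentation adds to this: 8 GB of free VRAM split into non-contiguous blocks cannot satisfy one large allocation, so 8 GB free doesn't mean you can load an 8 GB model.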

Common symptoms:

  • RuntimeError: CUDA out of memory mid-inference, not at load time
  • torch.cuda.memory_reserved() shows VRAM is full even after deleting tensors
  • Works on first run, crashes on second run in the same Python session
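The gap between what your tensors occupy and what PyTorch holds on to can be made concrete. This is a minimal sketch, assuming you pass in the values of torch.cuda.memory_allocated() and torch.cuda.memory_reserved() from the session you are debugging; the helper name cached_but_unused is hypothetical and the numbers are made up for illustration.

```python
def cached_but_unused(allocated_bytes: int, reserved_bytes: int) -> dict:
    """Split PyTorch's reserved VRAM into live tensors and idle cache.

    Hypothetical helper: feed it torch.cuda.memory_allocated() and
    torch.cuda.memory_reserved() from the running session.
    """
    gib = 1024 ** 3
    idle = max(reserved_bytes - allocated_bytes, 0)
    return {
        "live_tensors_gib": round(allocated_bytes / gib, 2),
        "idle_cache_gib": round(idle / gib, 2),
        "reserved_gib": round(reserved_bytes / gib, 2),
    }


# Made-up numbers: 2 GiB of live tensors, 6 GiB reserved by the allocator
report = cached_but_unused(2 * 1024**3, 6 * 1024**3)
print(report)
```

A large idle_cache_gib suggests the cache can be handed back to the driver; a large live_tensors_gib suggests something in the session still references the tensors.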

[Image: terminal showing CUDA OOM error] The full error includes how much it tried to allocate vs. what was free. Read both numbers.
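Pulling those two numbers out of the message can be scripted. The sketch below assumes the message contains "Tried to allocate <n> <unit>" and either "<n> <unit> free" or "<n> <unit> is free"; the exact wording differs between PyTorch versions, so the patterns may need adjusting. The function name parse_oom_message and the sample message are illustrative.

```python
import re

_UNITS = {"KiB": 1024, "MiB": 1024**2, "GiB": 1024**3}


def parse_oom_message(message: str) -> dict:
    """Extract requested and free sizes (in bytes) from a CUDA OOM message.

    Assumption: wording matches the two patterns below. Returns None for
    any value that cannot be found.
    """
    requested = re.search(r"Tried to allocate ([\d.]+) (KiB|MiB|GiB)", message)
    free = re.search(r"([\d.]+) (KiB|MiB|GiB) (?:is )?free", message)

    def to_bytes(match):
        if match is None:
            return None
        return int(float(match.group(1)) * _UNITS[match.group(2)])

    return {"requested": to_bytes(requested), "free": to_bytes(free)}


# Illustrative message, not captured from a real run
sample = (
    "CUDA out of memory. Tried to allocate 2.00 GiB. GPU 0 has a total "
    "capacity of 23.69 GiB of which 512.00 MiB is free."
)
sizes = parse_oom_message(sample)
print(sizes)
```

If the requested size is far larger than the free size, the model is simply too big for the card. If the two are close, clearing the cache or closing another process may be enough.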


Solution

Step 1: Check What's Actually Using Your VRAM

Before changing anything, see the real picture.

# Show all processes using GPU memory
nvidia-smi

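# PyTorch's view of its own memory. Run memory_summary() inside the session
# that holds the model; a fresh process like this one reports nothing allocated
python3 -c "import torch; print(torch.cuda.memory_summary())"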

Expected: You'll often find a zombie Python process or a previous model still loaded.

# Kill leftover GPU processes (replace PID from nvidia-smi output)
kill -9 <PID>

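[Image: nvidia-smi output showing GPU memory usage] Look at the "Memory-Usage" column and the process list at the bottom. A process you don't recognize that holds a large amount of memory is probably left over from an earlier run; the desktop environment normally accounts for a few hundred MiB.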
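The same check can be done from Python by asking nvidia-smi for machine-readable output. This is a sketch, assuming nvidia-smi is on your PATH; the function names and the sample text are illustrative, and the parser is separated out so it can be tried without a GPU.

```python
import subprocess


def parse_compute_apps(csv_text: str) -> list:
    """Parse `pid, process_name, used_memory` rows (MiB, no header, no units)."""
    rows = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3 or not parts[0].isdigit():
            continue  # skip blank or malformed rows
        pid, name, used = parts
        # Memory may be reported as unavailable on some systems; keep it as None
        mib = int(used) if used.isdigit() else None
        rows.append({"pid": int(pid), "name": name, "used_mib": mib})
    return rows


def gpu_processes() -> list:
    """Query nvidia-smi for compute processes; requires an NVIDIA driver."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-compute-apps=pid,process_name,used_memory",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_compute_apps(result.stdout)


# Illustrative output, not captured from a real machine
sample = "4121, python3, 10240\n5310, /usr/bin/python3, 512\n"
for proc in parse_compute_apps(sample):
    print(proc)
```

On a machine with a GPU, call gpu_processes() and compare the PIDs against the process you are currently running.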


Step 2: Clear PyTorch's Memory Cache

PyTorch holds a reserved memory cache. Deleting a tensor returns its memory to that cache, not to the GPU driver, so other processes still can't use it.

import torch
import gc

def clear_gpu_memory():
    # Delete your model references first
    # del model  ← do this before calling this function

    gc.collect()

    # Releases cached, unused blocks back to the GPU driver
    torch.cuda.empty_cache()

    # Confirm it worked
    print(f"Reserved: {torch.cuda.memory_reserved() / 1e9:.2f} GB")
    print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")

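Why this works: empty_cache() releases cached blocks that no tensor is using back to the GPU driver. It cannot free memory that live tensors still occupy, so delete every reference to the model first. gc.collect() then clears objects that are kept alive only by reference cycles, which del alone does not free.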
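Why del alone can leave an object alive is easy to show without a GPU. The sketch below uses a stand-in class (FakeModel, hypothetical) that references itself, and assumes CPython's reference-counting behaviour. With a real model, any tensors reachable from such a cycle would stay allocated in the same way.

```python
import gc
import weakref


class FakeModel:
    """Stand-in for a model object that is part of a reference cycle."""

    def __init__(self):
        self.self_ref = self  # cycle: refcount cannot reach zero on del


gc.disable()  # pause automatic collection so the demo is deterministic
try:
    model = FakeModel()
    probe = weakref.ref(model)  # observes the object without keeping it alive

    del model
    alive_after_del = probe() is not None  # the cycle still holds it

    gc.collect()
    alive_after_collect = probe() is not None  # cycle collector freed it
finally:
    gc.enable()

print(alive_after_del, alive_after_collect)  # prints: True False
```

This is why the order in clear_gpu_memory() matters: drop references, collect, then empty the cache.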

If it fails:

  • Still OOM after clearing: The model itself is too large. Move to Step 3.
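  • Reserved stays high: memory_reserved() only counts this process, so something in your session still references the tensors (a stray variable, an optimizer, a stored traceback, or notebook output history). Memory held by other processes shows up in nvidia-smi instead.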

Step 3: Load Model in Lower Precision

Full float32 models use 4 bytes per parameter, so a 7B model needs ~28 GB in fp32 for the weights alone. Switch to half precision or a quantized format.
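The arithmetic behind that figure is parameters times bits per parameter. A minimal sketch, assuming a nominal 7 billion parameters and decimal gigabytes; the function name is hypothetical, and the result covers weights only, not the KV cache or activations.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in decimal GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS_7B = 7e9  # nominal "7B"; real checkpoints differ slightly

for label, bits in [("fp32", 32), ("fp16 / bf16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{label:12s} ~{weight_memory_gb(N_PARAMS_7B, bits):.1f} GB")
```

GGUF quantization formats store scaling factors alongside the weights, so their real footprint sits somewhat above the raw 8-bit and 4-bit figures.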

For PyTorch / Hugging Face models:

from transformers import AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",

    # Use bfloat16 — cuts VRAM in half, minimal quality loss
    torch_dtype=torch.bfloat16,

    # Place weights on the GPU as they load; layers that don't fit spill to CPU
    device_map="auto",
)

For llama.cpp / Ollama:

# Ollama: set the number of GPU layers with the num_gpu parameter inside the
# session. Start conservative and raise it until just before OOM
ollama run mistral
>>> /set parameter num_gpu 20

# Or with llama.cpp directly
./llama-cli -m model.gguf -ngl 20  # -ngl = number of GPU layers
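A starting value for the GPU layer count can be estimated instead of guessed. This is a rough sketch under stated assumptions: layers are treated as equal in size, the GGUF file size stands in for total weight memory, and the KV cache is covered only by the headroom figure. The function name and the example numbers (a ~4.5 GB file with 32 layers) are illustrative.

```python
import math


def estimate_gpu_layers(model_file_gb, n_layers, free_vram_gb, headroom_gb=1.0):
    """Rough first guess for the number of layers to offload to the GPU.

    Assumes equally sized layers and that the model file size approximates
    weight memory. Treat the result as a starting point, not a guarantee.
    """
    per_layer_gb = model_file_gb / n_layers
    usable_gb = max(free_vram_gb - headroom_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# Example: ~4.5 GB file, 32 layers, 4 GB of free VRAM
print(estimate_gpu_layers(model_file_gb=4.5, n_layers=32, free_vram_gb=4.0))
```

If the run still hits OOM at the estimated value, lower it by a few layers; long contexts need more room than the default headroom allows.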

VRAM requirements by quantization level:

Format           7B Model VRAM   Quality Loss
fp32             ~28 GB          None
fp16 / bf16      ~14 GB          Minimal
Q8_0 (GGUF)      ~7 GB           Very low
Q4_K_M (GGUF)    ~4.5 GB         Low
Q2_K (GGUF)      ~2.7 GB         Noticeable
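The table can be turned into a simple lookup: pick the highest-quality format whose weights fit after reserving headroom. A sketch using the approximate 7B figures from the table; the function name and the 1 GB default headroom are assumptions.

```python
# (format, approx VRAM in GB for a 7B model), taken from the table,
# ordered from highest to lowest quality
FORMATS_7B = [
    ("fp32", 28.0),
    ("fp16 / bf16", 14.0),
    ("Q8_0 (GGUF)", 7.0),
    ("Q4_K_M (GGUF)", 4.5),
    ("Q2_K (GGUF)", 2.7),
]


def pick_format(vram_gb, headroom_gb=1.0):
    """Return the highest-quality format whose weights fit, or None."""
    budget = vram_gb - headroom_gb
    for name, needed_gb in FORMATS_7B:
        if needed_gb <= budget:
            return name
    return None


for card_gb in (24, 16, 8, 6, 3):
    print(card_gb, "GB ->", pick_format(card_gb))
```

A result of None means no format in the table fits entirely in VRAM, which is the case where offloading layers to RAM comes in.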

If it fails:

  • bfloat16 not supported: Your GPU is older than Ampere. Use torch.float16 instead.
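  • device_map="auto" still OOM: Install bitsandbytes and load in 4-bit by passing quantization_config=BitsAndBytesConfig(load_in_4bit=True). Older transformers releases also accept load_in_4bit=True directly.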

Step 4: Enable CPU Offloading (Last Resort)

If the model still won't fit, offload layers to RAM. It's slower but works.

from transformers import AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    torch_dtype=torch.bfloat16,

    # auto splits layers between GPU and CPU based on available VRAM
    device_map="auto",

    # Explicit VRAM cap: set it about 1 GB below the card's total to leave room for activations
    max_memory={0: "6GiB", "cpu": "24GiB"},
)

Why leave headroom: Activations during forward pass need scratch space. A 6 GB card shouldn't try to load 6 GB of weights — inference will OOM at runtime.
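The headroom rule can be applied mechanically when building the max_memory argument. A minimal sketch with a hypothetical helper, build_max_memory; the "<n>GiB" strings follow the format used in the example above, and the 1 GiB headroom and 24 GiB CPU budget are assumed defaults.

```python
def build_max_memory(total_vram_gib, headroom_gib=1, cpu_gib=24, gpu_index=0):
    """Build a max_memory dict that reserves headroom on the GPU.

    Hypothetical helper: values are rounded down to whole GiB so the
    strings stay in the "<n>GiB" form.
    """
    usable = int(total_vram_gib - headroom_gib)
    if usable <= 0:
        raise ValueError("headroom leaves no VRAM for weights")
    return {gpu_index: f"{usable}GiB", "cpu": f"{int(cpu_gib)}GiB"}


# An 8 GiB card keeps 1 GiB free for activations
print(build_max_memory(total_vram_gib=8))
```

The card's total can be read in bytes from torch.cuda.get_device_properties(0).total_memory. Long prompts or large batches may need more than 1 GiB of headroom.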

[Image: memory split between GPU and CPU during inference] The device_map splits the model across devices. Layers that don't fit move to RAM.


Verification

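import torch
from transformers import AutoTokenizer

# After loading your model
print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"GPU memory reserved: {torch.cuda.memory_reserved() / 1e9:.2f} GB")

# Tokenize a short prompt and move it to the model's device
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
inputs = tokenizer("Hello, world", return_tensors="pt").to(model.device)

# Run a test inference
output = model.generate(**inputs, max_new_tokens=50)
print("Inference succeeded:", output.shape)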

You should see: Allocated VRAM below your GPU's limit, and successful output shape printed without errors.


What You Learned

  • PyTorch caches VRAM aggressively — always call gc.collect() + torch.cuda.empty_cache() between runs, not just del model
  • Q4_K_M GGUF is the best quality-to-VRAM tradeoff for most consumer GPUs
  • device_map="auto" with max_memory is safer than letting the library guess — always leave ~1 GB headroom

Limitation: CPU offloading makes inference 5–20x slower depending on how many layers spill. For interactive use, it's often better to use a smaller quantized model that fits entirely in VRAM.

When NOT to use this: If you need maximum output quality for production (medical, legal, code generation), don't go below Q8_0 quantization — the degradation becomes measurable at Q4 and below for complex reasoning tasks.


Tested on PyTorch 2.2, CUDA 12.3, RTX 3090 and RTX 4060 Ti. llama.cpp build b2963.