Problem: PyTorch Won't Use Your GPU on Ubuntu
Deep learning with PyTorch on Ubuntu is the go-to setup for developers training neural networks locally — but a broken CUDA install silently falls back to CPU, where training can run 10–50× slower than on the GPU.
You'll learn:
- Install the correct NVIDIA driver, CUDA 12, and cuDNN on Ubuntu 22.04 / 24.04
- Create an isolated Python 3.12 environment with uv and install PyTorch 2.x
- Verify GPU access and run a real training loop on CIFAR-10
Time: 30 min | Difficulty: Intermediate
Why This Happens
The most common failure: torch.cuda.is_available() returns False even with an NVIDIA GPU installed. This happens because the PyTorch CUDA build and the system CUDA toolkit version don't match. Ubuntu ships with open-source Nouveau drivers by default — PyTorch needs the proprietary NVIDIA driver and a matching CUDA runtime.
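The compatibility rule itself is simple and can be sketched in a few lines of Python (the function name and version-tag parsing below are illustrative, not part of any PyTorch API):

```python
def cuda_versions_compatible(driver_max: str, wheel_tag: str) -> bool:
    """Can a PyTorch wheel built for `wheel_tag` (e.g. 'cu124') run under
    a driver whose nvidia-smi banner reports `driver_max` (e.g. '12.8')?
    Drivers are backward compatible: the wheel's CUDA version must not
    exceed the driver's maximum supported version."""
    digits = wheel_tag.removeprefix("cu")          # 'cu124' -> '124'
    wheel = (int(digits[:-1]), int(digits[-1]))    # -> (12, 4)
    driver = tuple(int(p) for p in driver_max.split("."))
    return wheel <= driver

print(cuda_versions_compatible("12.8", "cu124"))  # True: cu124 wheel, 12.8 driver
print(cuda_versions_compatible("11.8", "cu124"))  # False: driver too old
```

This is why upgrading the driver (Step 2) before picking a wheel (Step 5) matters: the driver sets the ceiling.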
Symptoms:
- torch.cuda.is_available() returns False
- nvidia-smi: command not found after reboot
- RuntimeError: CUDA error: no kernel image is available for execution on the device (driver/CUDA version mismatch)
- Training runs but nvidia-smi shows 0% GPU utilization
Deep Learning with PyTorch on Ubuntu: Full Stack Diagram
The four layers that must version-match: NVIDIA driver → CUDA runtime → cuDNN → PyTorch CUDA build
Solution
Step 1: Check Your GPU and Ubuntu Version
One command confirms the GPU is visible to the kernel before you install anything.
# Confirm the GPU is detected by the kernel
lspci | grep -i nvidia
# Confirm Ubuntu release
lsb_release -a
Expected output:
01:00.0 VGA compatible controller: NVIDIA Corporation AD102 [GeForce RTX 4090] (rev a1)
...
Description: Ubuntu 24.04.2 LTS
If lspci shows nothing NVIDIA, the problem is hardware or BIOS — not drivers.
Step 2: Remove Nouveau and Install the NVIDIA Driver
Ubuntu's open-source Nouveau driver conflicts with the proprietary NVIDIA driver. Block it first.
# Block the Nouveau driver
sudo bash -c "echo 'blacklist nouveau
options nouveau modeset=0' > /etc/modprobe.d/blacklist-nouveau.conf"
sudo update-initramfs -u
# Install the recommended proprietary driver
sudo ubuntu-drivers autoinstall
# Reboot — required for the kernel module to load
sudo reboot
After reboot, verify the driver loaded:
nvidia-smi
Expected output:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.86.15 Driver Version: 570.86.15 CUDA Version: 12.8 |
+-----------------------------------------------------------------------------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| 0 NVIDIA GeForce RTX 4090 Off | 00000000:01:00.0 On | N/A |
| 0% 42C P8 22W / 450W | 312MiB / 24564MiB | 1% Default |
+-----------------------------------------------------------------------------------------+
If it fails:
- nvidia-smi: command not found → the driver didn't install. Run sudo apt install nvidia-driver-570 explicitly and reboot again.
- Failed to initialize NVML: Driver/library version mismatch → a reboot is still pending; a previous driver kernel module is loaded. Rebooting fixes this.
Step 3: Install CUDA 12 Toolkit (Network Installer)
The CUDA version shown in nvidia-smi is the maximum version your driver supports. Install CUDA 12.4 — it's compatible with PyTorch 2.3+ and stable across RTX 30/40 series and datacenter A100/H100 cards.
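That maximum can also be read programmatically out of the nvidia-smi banner; a minimal parsing sketch (the banner string below is a sample, and the commented live call assumes nvidia-smi is on PATH):

```python
import re

def driver_max_cuda(smi_output: str) -> str:
    """Extract the 'CUDA Version: X.Y' field from nvidia-smi output."""
    match = re.search(r"CUDA Version:\s*([\d.]+)", smi_output)
    if match is None:
        raise ValueError("no CUDA version found in nvidia-smi output")
    return match.group(1)

banner = "| NVIDIA-SMI 570.86.15  Driver Version: 570.86.15  CUDA Version: 12.8 |"
print(driver_max_cuda(banner))  # 12.8

# Against a live system:
# import subprocess
# print(driver_max_cuda(subprocess.check_output(["nvidia-smi"], text=True)))
```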
# Add NVIDIA's official CUDA repo (Ubuntu 24.04 / x86_64)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
# Install CUDA 12.4 toolkit — skip the driver package, it's already installed
sudo apt-get -y install cuda-toolkit-12-4
Add the toolkit to PATH and LD_LIBRARY_PATH so every shell finds it:
# Append to ~/.bashrc — these two lines are required
echo 'export PATH=/usr/local/cuda-12.4/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
Verify:
nvcc --version
Expected output:
nvcc: NVIDIA (R) Cuda compiler driver
...
Cuda compilation tools, release 12.4, V12.4.131
If it fails:
- nvcc: command not found → PATH isn't set. Double-check your .bashrc export and run source ~/.bashrc again.
Step 4: Set Up a Python 3.12 Environment with uv
Never install PyTorch into the system Python. Use uv — it's 10–100× faster than pip for resolving large ML dependency trees (NumPy, SciPy, torchvision all in one shot).
# Install uv (single binary, no sudo needed)
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env   # recent uv versions install here; older installers wrote ~/.cargo/env instead
# Create and activate a Python 3.12 virtual environment
uv venv --python 3.12 ~/envs/dl-pytorch
source ~/envs/dl-pytorch/bin/activate
Step 5: Install PyTorch 2.x with CUDA 12.4 Support
PyTorch publishes separate wheels for each CUDA version. Installing the wrong one (e.g., the CPU-only build from pip install torch) is the single most common cause of cuda.is_available() returning False.
# Install PyTorch 2.3 + torchvision + torchaudio — CUDA 12.4 build
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
This downloads ~2.5 GB. uv parallelizes downloads — expect 3–5 min on a 100 Mbps connection.
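You can confirm which build you got by inspecting the version string's local build tag; a small helper sketch (the function name is mine, not a torch API):

```python
def wheel_build(torch_version: str) -> str:
    """Return the build tag of a PyTorch version string:
    '2.3.1+cu124' -> 'cu124' (CUDA build),
    '2.3.1+cpu'   -> 'cpu'   (wrong index for GPU work),
    no tag        -> 'unknown'."""
    _, _, tag = torch_version.partition("+")
    return tag or "unknown"

print(wheel_build("2.3.1+cu124"))  # cu124
print(wheel_build("2.3.1+cpu"))    # cpu

# With torch installed, check the real thing:
# import torch; print(wheel_build(torch.__version__))
```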
Step 6: Verify GPU Access in PyTorch
# save as verify_gpu.py
import torch
print(f"PyTorch version : {torch.__version__}")
print(f"CUDA available : {torch.cuda.is_available()}")
print(f"CUDA version : {torch.version.cuda}")
# Guard the GPU-only calls so the script reports cleanly instead of
# raising a RuntimeError when no CUDA device is visible
if torch.cuda.is_available():
    print(f"GPU count : {torch.cuda.device_count()}")
    print(f"GPU name : {torch.cuda.get_device_name(0)}")
    # Quick tensor operation on the GPU — confirms compute works
    x = torch.randn(1000, 1000, device="cuda")
    y = torch.matmul(x, x.T)
    print(f"Matrix result shape: {y.shape} (device: {y.device})")
python verify_gpu.py
Expected output:
PyTorch version : 2.3.1+cu124
CUDA available : True
CUDA version : 12.4
GPU count : 1
GPU name : NVIDIA GeForce RTX 4090
Matrix result shape: torch.Size([1000, 1000]) (device: cuda:0)
If CUDA available: False:
- Run pip show torch and confirm the version ends in +cu124, not +cpu. If it says +cpu, you pulled from the wrong index; reinstall using the --index-url from Step 5.
- Run nvidia-smi — if that also fails, the driver unloaded. Reboot.
Step 7: Train a CNN on CIFAR-10 (End-to-End Verification)
This script trains a small CNN for 5 epochs. It's the real-world proof that your GPU pipeline is end-to-end functional — data loading, forward pass, backprop, and optimizer step all on the GPU.
# save as train_cifar10.py
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Training on: {DEVICE}")
# Normalize with the standard CIFAR-10 per-channel mean/std
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.4914, 0.4822, 0.4465), (0.247, 0.243, 0.261)),
])
train_set = torchvision.datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
# pin_memory=True + non_blocking transfers avoid CPU↔GPU stalls during data loading
train_loader = DataLoader(train_set, batch_size=128, shuffle=True, num_workers=4, pin_memory=True)
class SmallCNN(nn.Module):
def __init__(self):
super().__init__()
self.features = nn.Sequential(
nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
nn.MaxPool2d(2),
nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
nn.MaxPool2d(2),
)
self.classifier = nn.Sequential(
nn.Flatten(),
nn.Linear(128 * 8 * 8, 256), nn.ReLU(),
nn.Dropout(0.5), # 0.5 drop rate is the standard starting point for fully-connected layers
nn.Linear(256, 10),
)
def forward(self, x):
return self.classifier(self.features(x))
model = SmallCNN().to(DEVICE)
criterion = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
for epoch in range(5):
running_loss = 0.0
for i, (inputs, labels) in enumerate(train_loader):
# non_blocking=True overlaps H2D transfer with CPU work
inputs, labels = inputs.to(DEVICE, non_blocking=True), labels.to(DEVICE, non_blocking=True)
optimizer.zero_grad()
outputs = model(inputs)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
running_loss += loss.item()
if i % 100 == 99:
print(f"Epoch {epoch+1}, step {i+1:4d} | loss: {running_loss/100:.4f}")
running_loss = 0.0
print("Training complete.")
python train_cifar10.py
Expected output (RTX 4090, ~90 sec for 5 epochs):
Training on: cuda
Epoch 1, step  100 | loss: 1.8423
Epoch 1, step  200 | loss: 1.5671
...
Epoch 5, step  300 | loss: 0.9812
Training complete.
Watch GPU utilization during training — it should sit at 70–95%:
# Run in a second terminal while training
watch -n 1 nvidia-smi
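If you'd rather log utilization from Python than eyeball watch, nvidia-smi's CSV query mode is easy to parse; a sketch (the sample string mirrors real query output, and the commented live call assumes nvidia-smi is on PATH):

```python
import csv
import io

def parse_gpu_stats(csv_text: str) -> list[tuple[int, ...]]:
    """Parse `nvidia-smi --query-gpu=index,utilization.gpu,memory.used
    --format=csv,noheader,nounits` output into (index, util %, mem MiB)
    tuples, one per GPU."""
    rows = csv.reader(io.StringIO(csv_text))
    return [tuple(int(field) for field in row) for row in rows if row]

sample = "0, 87, 9216\n"
print(parse_gpu_stats(sample))  # [(0, 87, 9216)]

# Live polling:
# import subprocess
# out = subprocess.check_output(
#     ["nvidia-smi", "--query-gpu=index,utilization.gpu,memory.used",
#      "--format=csv,noheader,nounits"], text=True)
# print(parse_gpu_stats(out))
```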
PyTorch vs TensorFlow on Ubuntu: Which Should You Use?
| | PyTorch 2.3 | TensorFlow 2.16 |
|---|---|---|
| Ecosystem | Dominant in research, HuggingFace, fast adoption | Production-focused, TFX, TensorFlow Serving |
| Ubuntu CUDA setup | Single pip index flag | Separate tensorflow[and-cuda] package |
| Dynamic graphs | Native (define-by-run) | Eager mode default since 2.0 |
| Model export | TorchScript, ONNX, ExecuTorch | SavedModel, TFLite, TensorFlow.js |
| 2026 momentum | ✅ Dominant for LLMs, diffusion models | ✅ Strong in mobile/edge and enterprise |
| Starting price | Free, open-source | Free, open-source |
Choose PyTorch if: you're training LLMs, diffusion models, or any model where HuggingFace Transformers is in your stack. Choose TensorFlow if: you need TFX pipelines, TFLite deployment, or your team already has a production TF codebase.
Verification
python verify_gpu.py
You should see CUDA available: True, your GPU's name, and a final line confirming the matrix result lives on cuda:0.
What You Learned
- CUDA and PyTorch are separately versioned — the --index-url flag in Step 5 is what locks them together. Installing the wrong wheel is the most common root cause of "CUDA not available" issues.
- uv creates isolated environments and resolves large ML dependency graphs far faster than pip, which struggles at this scale.
- pin_memory=True + non_blocking=True in your DataLoader can yield a 5–15% throughput improvement on NVIDIA GPUs at no extra cost.
Tested on PyTorch 2.3.1, CUDA 12.4, Ubuntu 22.04 LTS and Ubuntu 24.04 LTS, RTX 4090 and RTX 3080
FAQ
Q: Can torch.cuda.is_available() return False even after a seemingly successful install?
A: Yes — if the PyTorch wheel doesn't end in +cu124 (or your CUDA version), it's the CPU build. Run pip show torch to confirm the version string, then reinstall from https://download.pytorch.org/whl/cu124.
Q: What is the minimum VRAM needed for deep learning with PyTorch?
A: 8 GB VRAM handles most CIFAR/ImageNet-scale CNNs and LoRA fine-tuning of 7B models with 4-bit quantization. 24 GB (RTX 4090) comfortably runs 7B models in half precision and supports parameter-efficient fine-tuning; full-precision training of a 7B model exceeds 24 GB, since the fp32 weights alone take roughly 26 GB.
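Those VRAM figures follow from simple arithmetic on parameter counts; a back-of-envelope sketch (4 bytes per float32 parameter is exact, the Adam-overhead note in the docstring is a rough rule of thumb):

```python
def param_memory_gb(n_params: int, bytes_per_param: int = 4) -> float:
    """GiB needed just to hold the weights (float32 = 4 bytes/param).
    Full training with Adam roughly quadruples this: gradients plus two
    optimizer moment buffers, before counting activations."""
    return n_params * bytes_per_param / 1024**3

print(round(param_memory_gb(7_000_000_000), 1))     # 26.1  (fp32 7B weights)
print(round(param_memory_gb(7_000_000_000, 2), 1))  # 13.0  (fp16/bf16 7B weights)
```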
Q: Can I use PyTorch on Ubuntu with an AMD GPU?
A: Yes — install the ROCm build: pip install torch --index-url https://download.pytorch.org/whl/rocm6.0. Most training APIs are identical; ROCm 6.0 targets RX 7000 and Instinct MI300 cards.
Q: How do I run this setup inside Docker on Ubuntu?
A: Use the official image nvcr.io/nvidia/pytorch:24.03-py3 — it ships with CUDA 12.3, cuDNN 9, and PyTorch 2.3 pre-installed. Mount your dataset with -v /data:/workspace/data and pass --gpus all to docker run.
Q: Does PyTorch 2.x torch.compile() work on Ubuntu with CUDA 12?
A: Yes — torch.compile(model) works out of the box on CUDA 12.x with the default inductor backend. It typically gives 20–40% faster training on Ampere and Ada Lovelace GPUs with no code changes beyond the one-liner.