Deep Learning with PyTorch on Ubuntu: Full Setup Guide 2026

Set up deep learning with PyTorch on Ubuntu 22.04 or 24.04. Install CUDA 12, configure GPU training, and run your first neural network in under 30 min.

Problem: PyTorch Won't Use Your GPU on Ubuntu

Deep learning with PyTorch on Ubuntu is the go-to setup for developers training neural networks locally — but a bad CUDA install silently drops you back to CPU training, at up to a 50× speed penalty.

You'll learn:

  • Install the correct NVIDIA driver, CUDA 12, and cuDNN on Ubuntu 22.04 / 24.04
  • Create an isolated Python 3.12 environment with uv and install PyTorch 2.x
  • Verify GPU access and run a real training loop on CIFAR-10

Time: 30 min | Difficulty: Intermediate


Why This Happens

The most common failure: torch.cuda.is_available() returns False even with an NVIDIA GPU installed. This happens because the PyTorch CUDA build and the system CUDA toolkit version don't match. Ubuntu ships with open-source Nouveau drivers by default — PyTorch needs the proprietary NVIDIA driver and a matching CUDA runtime.

Symptoms:

  • torch.cuda.is_available() returns False
  • nvidia-smi command not found after reboot
  • RuntimeError: CUDA error: no kernel image is available for execution on the device (driver/CUDA version mismatch)
  • Training runs but nvidia-smi shows 0% GPU utilization

Deep Learning with PyTorch on Ubuntu: Full Stack Diagram

Full stack diagram: the four layers that must version-match, from bottom to top: NVIDIA driver → CUDA runtime → cuDNN → PyTorch CUDA build.
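
The only hard constraint in that chain is that the wheel's CUDA version cannot exceed what the driver supports. A minimal sketch of that rule — simplified, since real compatibility also involves CUDA 11+ minor-version compatibility guarantees:

```python
# Sketch: a PyTorch CUDA wheel generally works when its CUDA version
# is <= the maximum CUDA version the installed driver reports.
# (Simplified — ignores minor-version compatibility edge cases.)

def parse_version(v: str) -> tuple:
    """Turn '12.4' into (12, 4) so versions compare as tuples, not strings."""
    return tuple(int(part) for part in v.split("."))

def wheel_compatible(driver_max_cuda: str, wheel_cuda: str) -> bool:
    """True if the wheel's CUDA runtime is within the driver's support."""
    return parse_version(wheel_cuda) <= parse_version(driver_max_cuda)

# Driver reports CUDA 12.8 (from nvidia-smi); wheel built for 12.4 -> fine.
print(wheel_compatible("12.8", "12.4"))  # True
# A CUDA 12.8 wheel on a driver that only supports 12.2 -> mismatch.
print(wheel_compatible("12.2", "12.8"))  # False
```

This is why the symptoms above appear: the kernel image error in particular means the wheel's CUDA build is ahead of the driver.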


Solution

Step 1: Check Your GPU and Ubuntu Version

One command confirms the GPU is visible to the kernel before you install anything.

# Confirm the GPU is detected by the kernel
lspci | grep -i nvidia

# Confirm Ubuntu release
lsb_release -a

Expected output:

01:00.0 VGA compatible controller: NVIDIA Corporation AD102 [GeForce RTX 4090] (rev a1)
...
Description: Ubuntu 24.04.2 LTS

If lspci shows no NVIDIA device, the problem is hardware or BIOS — not drivers.
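
If you want to script this check (say, across several machines), the same detection works from Python. A small sketch — it assumes lspci output in the format shown above:

```python
import shutil
import subprocess

def has_nvidia_gpu(lspci_output: str) -> bool:
    """True if any PCI device line mentions an NVIDIA controller."""
    return any("nvidia" in line.lower() for line in lspci_output.splitlines())

# Run the real lspci when available; otherwise skip silently.
if shutil.which("lspci"):
    out = subprocess.run(["lspci"], capture_output=True, text=True).stdout
    print("NVIDIA GPU visible:", has_nvidia_gpu(out))

# The helper also works on captured output, e.g. the sample from this guide:
sample = "01:00.0 VGA compatible controller: NVIDIA Corporation AD102 [GeForce RTX 4090] (rev a1)"
print(has_nvidia_gpu(sample))  # True
```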


Step 2: Remove Nouveau and Install the NVIDIA Driver

Ubuntu's open-source Nouveau driver conflicts with the proprietary NVIDIA driver. Block it first.

# Block the Nouveau driver
sudo bash -c "echo 'blacklist nouveau
options nouveau modeset=0' > /etc/modprobe.d/blacklist-nouveau.conf"

sudo update-initramfs -u

# Install the recommended proprietary driver
sudo ubuntu-drivers autoinstall

# Reboot — required for the kernel module to load
sudo reboot

After reboot, verify the driver loaded:

nvidia-smi

Expected output:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.86.15   Driver Version: 570.86.15   CUDA Version: 12.8               |
+-----------------------------------------------------------------------------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:01:00.0  On |                  N/A |
|  0%   42C    P8             22W / 450W |     312MiB /  24564MiB |      1%      Default |
+-----------------------------------------------------------------------------------------+

If it fails:

  • nvidia-smi: command not found → the driver didn't install. Run sudo apt install nvidia-driver-570 explicitly and reboot again.
  • Failed to initialize NVML: Driver/library version mismatch → reboot is still pending; a previous driver kernel module is loaded. Reboot fixes this.
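
For scripted health checks, the driver and maximum CUDA versions can be pulled out of the nvidia-smi header with a regex. A sketch that assumes the header format shown above:

```python
import re
import shutil
import subprocess

# Matches the first line of the nvidia-smi table, e.g.
# "| NVIDIA-SMI 570.86.15   Driver Version: 570.86.15   CUDA Version: 12.8 |"
HEADER_RE = re.compile(r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)")

def extract_versions(smi_output: str):
    """Return (driver_version, max_cuda_version), or (None, None) if absent."""
    m = HEADER_RE.search(smi_output)
    return m.groups() if m else (None, None)

# Query the real tool when present; otherwise skip.
if shutil.which("nvidia-smi"):
    out = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
    print(extract_versions(out))

# Parsing the sample header from this guide:
sample = "| NVIDIA-SMI 570.86.15   Driver Version: 570.86.15   CUDA Version: 12.8 |"
print(extract_versions(sample))  # ('570.86.15', '12.8')
```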

Step 3: Install CUDA 12 Toolkit (Network Installer)

The CUDA version shown in nvidia-smi is the maximum version your driver supports. Install CUDA 12.4 — it's compatible with PyTorch 2.3+ and stable across RTX 30/40 series and datacenter A100/H100 cards.

# Add NVIDIA's official CUDA repo (Ubuntu 24.04 / x86_64)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update

# Install CUDA 12.4 toolkit — skip the driver package, it's already installed
sudo apt-get -y install cuda-toolkit-12-4

Add the toolkit to PATH and LD_LIBRARY_PATH so every shell finds it:

# Append to ~/.bashrc — these two lines are required
echo 'export PATH=/usr/local/cuda-12.4/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc

Verify:

nvcc --version

Expected output:

nvcc: NVIDIA (R) Cuda compiler driver
...
Cuda compilation tools, release 12.4, V12.4.131

If it fails:

  • nvcc: command not found → PATH isn't set. Double-check your .bashrc export and run source ~/.bashrc again.
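
The nvcc-not-found case can be diagnosed programmatically too. A small stdlib sketch that checks whether nvcc resolves and whether any CUDA bin directory is actually on PATH:

```python
import os
import shutil

def cuda_dirs_on_path(path_env: str) -> list:
    """Return the PATH entries that look like a CUDA toolkit bin dir."""
    return [p for p in path_env.split(os.pathsep) if "cuda" in p.lower()]

print("nvcc found        :", shutil.which("nvcc") is not None)
print("CUDA PATH entries :", cuda_dirs_on_path(os.environ.get("PATH", "")))

# On a correctly configured shell you'd expect something like:
print(cuda_dirs_on_path("/usr/local/cuda-12.4/bin:/usr/bin"))  # ['/usr/local/cuda-12.4/bin']
```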

Step 4: Set Up a Python 3.12 Environment with uv

Never install PyTorch into the system Python. Use uv — it's 10–100× faster than pip for resolving large ML dependency trees (NumPy, SciPy, torchvision all in one shot).

# Install uv (single binary, no sudo needed)
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env

# Create and activate a Python 3.12 virtual environment
uv venv --python 3.12 ~/envs/dl-pytorch
source ~/envs/dl-pytorch/bin/activate
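
Before installing anything into it, it's worth confirming you're actually inside the environment. A quick stdlib check — inside a venv, sys.prefix differs from sys.base_prefix:

```python
import sys

def in_virtualenv() -> bool:
    """True when the interpreter is running inside a venv/virtualenv."""
    return sys.prefix != sys.base_prefix

print("Inside a virtualenv:", in_virtualenv())
print("Python executable  :", sys.executable)
```

If this prints False, your shell didn't pick up the activate script — re-run the source command above.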

Step 5: Install PyTorch 2.x with CUDA 12.4 Support

PyTorch publishes separate wheels for each CUDA version. Installing the wrong one (e.g., the CPU-only build from pip install torch) is the single most common cause of cuda.is_available() returning False.

# Install PyTorch 2.3 + torchvision + torchaudio — CUDA 12.4 build
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

This downloads ~2.5 GB. uv parallelizes downloads — expect 3–5 min on a 100 Mbps connection.


Step 6: Verify GPU Access in PyTorch

# save as verify_gpu.py
import torch

print(f"PyTorch version : {torch.__version__}")
print(f"CUDA available  : {torch.cuda.is_available()}")
print(f"CUDA version    : {torch.version.cuda}")
print(f"GPU count       : {torch.cuda.device_count()}")
print(f"GPU name        : {torch.cuda.get_device_name(0)}")

# Quick tensor operation on the GPU — confirms compute works
x = torch.randn(1000, 1000, device="cuda")
y = torch.matmul(x, x.T)
print(f"Matrix result shape: {y.shape}  (device: {y.device})")

Run it:

python verify_gpu.py

Expected output:

PyTorch version : 2.3.1+cu124
CUDA available  : True
CUDA version    : 12.4
GPU count       : 1
GPU name        : NVIDIA GeForce RTX 4090
Matrix result shape: torch.Size([1000, 1000])  (device: cuda:0)

If CUDA available: False:

  • Run uv pip show torch and confirm the version ends in +cu124, not +cpu. If it says +cpu, you pulled from the wrong index. Reinstall using the --index-url from Step 5.
  • Run nvidia-smi — if that also fails, the driver unloaded. Reboot.
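
The +cu124 vs +cpu distinction can also be checked in code. A minimal sketch — the helper just inspects the local-version suffix of the version string, and the guarded import keeps it runnable even where torch isn't installed:

```python
def wheel_flavor(version: str) -> str:
    """Classify a torch version string by its local-version suffix:
    '2.3.1+cu124' -> 'cu124', '2.3.1+cpu' -> 'cpu'."""
    return version.split("+", 1)[1] if "+" in version else "unknown"

print(wheel_flavor("2.3.1+cu124"))  # cu124
print(wheel_flavor("2.3.1+cpu"))    # cpu

try:
    import torch
    print(f"Installed torch is a {wheel_flavor(torch.__version__)} build")
except ImportError:
    print("torch not installed in this environment")
```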

Step 7: Train a CNN on CIFAR-10 (End-to-End Verification)

This script trains a small CNN for 5 epochs. It's the real-world proof that your GPU pipeline is end-to-end functional — data loading, forward pass, backprop, and optimizer step all on the GPU.

# save as train_cifar10.py
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Training on: {DEVICE}")

# Normalize with the CIFAR-10 per-channel mean/std
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.247, 0.243, 0.261)),
])

train_set = torchvision.datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
# pin_memory=True + non_blocking transfers avoid CPU↔GPU stalls during data loading
train_loader = DataLoader(train_set, batch_size=128, shuffle=True, num_workers=4, pin_memory=True)


class SmallCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, 256), nn.ReLU(),
            nn.Dropout(0.5),  # 0.5 drop rate is the standard starting point for fully-connected layers
            nn.Linear(256, 10),
        )

    def forward(self, x):
        return self.classifier(self.features(x))


model = SmallCNN().to(DEVICE)
criterion = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)

for epoch in range(5):
    running_loss = 0.0
    for i, (inputs, labels) in enumerate(train_loader):
        # non_blocking=True overlaps H2D transfer with CPU work
        inputs, labels = inputs.to(DEVICE, non_blocking=True), labels.to(DEVICE, non_blocking=True)

        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        if i % 100 == 99:
            print(f"Epoch {epoch+1}, step {i+1:4d} | loss: {running_loss/100:.4f}")
            running_loss = 0.0

print("Training complete.")

Run it:

python train_cifar10.py

Expected output (RTX 4090, ~90 sec for 5 epochs):

Training on: cuda
Epoch 1, step  100 | loss: 1.8423
Epoch 1, step  200 | loss: 1.5671
...
Epoch 5, step  300 | loss: 0.9812
Training complete.

Watch GPU utilization during training — it should sit at 70–95%:

# Run in a second terminal while training
watch -n 1 nvidia-smi
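
If you'd rather log utilization from Python (for example, alongside your training metrics), nvidia-smi's query mode emits parseable CSV. A sketch using the --query-gpu interface:

```python
import shutil
import subprocess

def parse_utilization(csv_output: str) -> list:
    """Parse nounits CSV output ('87\n12\n', one value per GPU) into [87, 12]."""
    return [int(line.strip()) for line in csv_output.splitlines() if line.strip()]

# Query the real tool when present; otherwise skip.
if shutil.which("nvidia-smi"):
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True,
    ).stdout
    print("GPU utilization (%):", parse_utilization(out))

# The parser also works on captured output:
print(parse_utilization("87\n12\n"))  # [87, 12]
```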

PyTorch vs TensorFlow on Ubuntu: Which Should You Use?

                    | PyTorch 2.3                                      | TensorFlow 2.16
Ecosystem           | Dominant in research, HuggingFace, fast adoption | Production-focused, TFX, TensorFlow Serving
Ubuntu CUDA setup   | Single pip index flag                            | Separate tensorflow[and-cuda] package
Dynamic graphs      | Native (define-by-run)                           | Eager mode default since 2.0
Model export        | TorchScript, ONNX, ExecuTorch                    | SavedModel, TFLite, TensorFlow.js
2026 momentum       | ✅ Dominant for LLMs, diffusion models           | ✅ Strong in mobile/edge and enterprise
Cost                | Free, open-source                                | Free, open-source

Choose PyTorch if: you're training LLMs, diffusion models, or any model where HuggingFace Transformers is in your stack. Choose TensorFlow if: you need TFX pipelines, TFLite deployment, or your team already has a production TF codebase.


Verification

python verify_gpu.py

You should see CUDA available  : True, your GPU's name, and a matrix result computed on device cuda:0.


What You Learned

  • CUDA and PyTorch are separately versioned — the --index-url flag in Step 5 is what locks them together. Installing the wrong wheel is the most common root cause of "CUDA not available" issues.
  • uv has become a popular environment manager for Python ML projects — pip's resolver is far slower on dependency graphs this large.
  • pin_memory=True + non_blocking=True in your DataLoader is a free 5–15% throughput improvement on any NVIDIA GPU.

Tested on PyTorch 2.3.1, CUDA 12.4, Ubuntu 22.04 LTS and Ubuntu 24.04 LTS, RTX 4090 and RTX 3080


FAQ

Q: Why does torch.cuda.is_available() return False even after installing PyTorch?
A: The usual cause is the CPU-only wheel — if the version doesn't end in +cu124 (or your CUDA version), it's the CPU build. Run pip show torch to confirm the version string, then reinstall from https://download.pytorch.org/whl/cu124.

Q: What is the minimum VRAM needed for deep learning with PyTorch?
A: 8 GB VRAM handles most CIFAR/ImageNet-scale CNNs and fine-tuning LoRA adapters on 7B models with 4-bit quantization. 24 GB (RTX 4090) lets you fine-tune 7B models in half precision with memory-efficient optimizers, or run batch inference at scale.
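
These VRAM figures follow from simple arithmetic on parameter counts. A back-of-envelope sketch — weights only; gradients, optimizer state, and activations add more on top:

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory for model weights alone, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9

# A 7B-parameter model at different precisions:
print(weight_memory_gb(7e9, 2))    # 14.0 -> fp16/bf16 weights
print(weight_memory_gb(7e9, 0.5))  # 3.5  -> 4-bit quantized weights
print(weight_memory_gb(7e9, 4))    # 28.0 -> fp32 weights (exceeds a 24 GB card)
```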

Q: Can I use PyTorch on Ubuntu with an AMD GPU?
A: Yes — install the ROCm build: pip install torch --index-url https://download.pytorch.org/whl/rocm6.0. Most training APIs are identical; ROCm 6.0 targets RX 7000 and Instinct MI300 cards.

Q: How do I run this setup inside Docker on Ubuntu?
A: Use the official image nvcr.io/nvidia/pytorch:24.03-py3 — it ships with CUDA 12.3, cuDNN 9, and PyTorch 2.3 pre-installed. Mount your dataset with -v /data:/workspace/data and pass --gpus all to docker run.

Q: Does PyTorch 2.x torch.compile() work on Ubuntu with CUDA 12?
A: Yes — torch.compile(model) works out of the box on CUDA 12.x with the default inductor backend. It typically gives 20–40% faster training on Ampere and Ada Lovelace GPUs with no code changes beyond the one-liner.