Reproducible PyTorch Training with Docker: GPU Access, CUDA Versions, and Checkpoint Volumes

Set up a Docker container for GPU-accelerated PyTorch training that anyone can reproduce — correct CUDA base image selection, GPU passthrough, checkpoint persistence with named volumes, and multi-run experiment isolation.

Your training runs on your machine. It fails on your colleague's because their CUDA version is different. Docker fixes this permanently. The 57% of developers using Docker (Stack Overflow 2025) aren't just following a trend; they're avoiding the specific hell of "works on my machine" when that machine has a $2,000 GPU. But slapping FROM pytorch/pytorch into a Dockerfile and hoping for the best is how you end up with a multi-gigabyte image that takes forever to build and still can't see your GPU. Let's build a reproducible PyTorch training environment that's fast, portable, and doesn't waste your GPU's silicon tears.

The CUDA Base Image Dilemma: Official PyTorch vs. NVIDIA CUDA

Your first and most consequential choice is the FROM line. Get it wrong, and you're either bloating your image or missing critical libraries. You have two primary contenders: pytorch/pytorch and nvidia/cuda.

Use pytorch/pytorch when you want the simplest path to a working PyTorch environment. The PyTorch team pre-builds these images with compatible versions of PyTorch, CUDA, and cuDNN. It's a batteries-included experience. The tag syntax is your bible: pytorch/pytorch:2.4.0-cuda12.4-cudnn9-runtime. This tells you PyTorch 2.4.0, CUDA 12.4, cuDNN 9, and the -runtime variant (smaller, no build tools).

Use nvidia/cuda when you need fine-grained control over the OS and CUDA version, or you're building a custom stack beyond PyTorch. You start with a bare-bones CUDA installation and add layers yourself. For example, nvidia/cuda:12.4.1-cudnn-runtime-ubuntu22.04 (note that NVIDIA dropped the cuDNN version number from these tags starting with cuDNN 9).

Here’s the rule of thumb: For 90% of PyTorch training workloads, start with pytorch/pytorch. It’s the officially supported path and eliminates version-matching guesswork. Only descend into nvidia/cuda if you have specific system library requirements or are building a multi-framework container.
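If you do take the nvidia/cuda route, here's a sketch of what "adding the layers yourself" looks like. The cu124 wheel index is PyTorch's documented mechanism for pinning the CUDA build; treat the exact image tag and wheel index as assumptions to verify against your driver version:

```dockerfile
# Bare CUDA runtime base — you own the Python stack from here
FROM nvidia/cuda:12.4.1-cudnn-runtime-ubuntu22.04

# The base image ships no Python; install it yourself
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*

# Pull the PyTorch wheel built against this CUDA version explicitly
RUN pip3 install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu124

WORKDIR /app
COPY . .
CMD ["python3", "train.py"]
```

The payoff is full control over the OS layer; the cost is that you, not the PyTorch team, are now responsible for keeping the CUDA tag and the wheel index in sync.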

Actually Getting Your Container to See the GPU

You've built your image. You run docker run my-pytorch-app python train.py. It runs, but is it using the GPU? Without the --gpus flag—and the NVIDIA Container Toolkit backing it—it's silently falling back to CPU, wasting your RTX 4090.

First, verify your host can do GPU passthrough. Install the NVIDIA Container Toolkit. On Ubuntu, it's:


distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Now, run your container with the --gpus all flag:

docker run --gpus all -it my-pytorch-app nvidia-smi

You should see the same GPU table as on your host. Inside your training script, add this verification:

import torch
print(f"PyTorch CUDA Available: {torch.cuda.is_available()}")
print(f"PyTorch CUDA Device Count: {torch.cuda.device_count()}")
print(f"PyTorch Current Device: {torch.cuda.current_device()}")
print(f"PyTorch Device Name: {torch.cuda.get_device_name()}")

No GPU access? If docker run --gpus all fails with could not select device driver "" with capabilities: [[gpu]], the NVIDIA Container Toolkit isn't installed, or the Docker daemon wasn't restarted after configuring it. A different classic, OCI runtime exec failed: exec format error, is an architecture mismatch—typically an Apple Silicon Mac running an AMD64 image. The fix there is explicit platform targeting: docker build --platform linux/amd64 -t my-app . (though note that Docker on macOS cannot pass an NVIDIA GPU through regardless).
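And because a container that quietly trains on CPU is worse than one that crashes, here's a small fail-fast sketch worth adding to train.py. REQUIRE_GPU is an assumed variable name, and cuda_available would be torch.cuda.is_available() in a real script—it's passed in here so the check stays framework-agnostic:

```python
import os
import sys

def require_gpu(cuda_available: bool) -> None:
    """Abort instead of silently falling back to CPU.

    REQUIRE_GPU is a hypothetical env var: set it to 0 to allow
    CPU-only smoke tests; it defaults to requiring a GPU.
    """
    if os.getenv("REQUIRE_GPU", "1") == "1" and not cuda_available:
        sys.exit("REQUIRE_GPU=1 but CUDA is unavailable - did you forget "
                 "--gpus all, or is the NVIDIA Container Toolkit missing?")
```

Call require_gpu(torch.cuda.is_available()) right after your imports; a run that was going to waste 48 hours on CPU now dies in the first second with a message pointing at the fix.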

Where to Put Your Precious Model Checkpoints (Hint: Not in the Container)

Containers are ephemeral. Your 48-hour training run's checkpoint is not. Storing checkpoints inside the container's writable layer is a recipe for disaster. The solution is a named volume.

A named volume is managed by Docker and persists independently of any container lifecycle. It's the preferred method for production data persistence. Here's how you integrate it into a training workflow using docker-compose.yml:

version: '3.8'

services:
  trainer:
    build: .
    runtime: nvidia # Use the NVIDIA runtime
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      - BATCH_SIZE=32
      - LEARNING_RATE=1e-4
    volumes:
      # Mount the named volume to the container's checkpoint directory
      - checkpoint_volume:/app/checkpoints
    command: python train.py

volumes:
  checkpoint_volume: # This declares the named volume

Your training script saves to /app/checkpoints/model_epoch_10.pt. Destroy the container, rebuild it, bring it back up—the checkpoint volume remains. One gotcha: Compose prefixes volume names with the project name, so run docker volume ls to find the real name (e.g., myproject_checkpoint_volume). To inspect its contents: docker run --rm -v myproject_checkpoint_volume:/checkpoints alpine ls -la /checkpoints.
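A subtlety with long runs writing into that volume: a container killed mid-torch.save leaves a truncated checkpoint. A sketch of the write-then-rename pattern guards against this—shown here with raw bytes to stay framework-agnostic; with PyTorch you'd torch.save to the temp path instead:

```python
import os
import tempfile

def atomic_save(data: bytes, path: str) -> None:
    """Write to a temp file in the same directory, fsync, then rename.

    os.replace is atomic on POSIX filesystems, so readers see either
    the old checkpoint or the new one - never a half-written file,
    even if the container is killed mid-write.
    """
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise
```

The temp file must live in the same directory as the target: rename is only atomic within one filesystem, and a named volume is exactly one filesystem.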

What about the Error: no space left on device that grinds everything to a halt? Docker's build cache and unused volumes are the usual culprits. The fix: run docker system prune -a --volumes to reclaim disk (warning: this removes all unused data). For a more sustainable solution, configure the Docker daemon's log rotation and storage limits in /etc/docker/daemon.json.
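What that daemon.json might look like—a minimal sketch, and the max-size/max-file values are arbitrary starting points to tune, not recommendations:

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "3"
  }
}
```

Restart the daemon (sudo systemctl restart docker) after editing; the settings apply only to containers created afterward.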

Tuning Hyperparameters Without Rebuilding the World

Rebuilding a Docker image because you changed the learning rate is insanity. Hyperparameters are runtime configuration, not build-time configuration. Pass them via environment variables.

In your Dockerfile, use ENV to set sensible defaults:

FROM pytorch/pytorch:2.4.0-cuda12.4-cudnn9-runtime

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .

# Set default hyperparameters
ENV BATCH_SIZE=16
ENV LEARNING_RATE=1e-3
ENV NUM_EPOCHS=50

CMD ["python", "train.py"]

In your training script (train.py), read them:

import os
batch_size = int(os.getenv('BATCH_SIZE', 16))
learning_rate = float(os.getenv('LEARNING_RATE', 1e-3))
num_epochs = int(os.getenv('NUM_EPOCHS', 50))
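Those bare int()/float() casts crash with an unhelpful traceback when someone typos -e BATCH_SIZE=3two. A sketch that fails with a readable message instead—the variable names match the ones above, but hparams_from_env is a hypothetical helper, not part of any library:

```python
import os

def hparams_from_env() -> dict:
    """Read hyperparameters from the environment, failing loudly
    with the offending variable name when a value won't parse."""
    spec = {  # name -> (cast, default)
        "BATCH_SIZE": (int, 16),
        "LEARNING_RATE": (float, 1e-3),
        "NUM_EPOCHS": (int, 50),
    }
    hparams = {}
    for name, (cast, default) in spec.items():
        raw = os.getenv(name)
        if raw is None:
            hparams[name] = default
            continue
        try:
            hparams[name] = cast(raw)
        except ValueError:
            raise SystemExit(f"{name}={raw!r} is not a valid {cast.__name__}")
    return hparams
```

Logging the returned dict at startup also gives every experiment a self-documenting record of the exact configuration it ran with.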

Now, override them at runtime:

docker run --gpus all \
  -e BATCH_SIZE=32 \
  -e LEARNING_RATE=2e-4 \
  my-pytorch-app

Or, in your docker-compose.yml, under the environment: key for the service. This pattern is crucial for running parallel experiments, which we'll cover next.

The Multi-Stage Build: One Dockerfile, Two Purposes

Your training environment needs CUDA, compilers, and development tools. Your inference API needs a lean, fast, secure image. Multi-stage builds solve this, letting you drop 60–80% of the image size (Docker Hub analysis, 2025). BuildKit makes this 2–5x faster on cache-heavy projects (Docker, 2025).

Here’s the blueprint:

# Stage 1: The Builder/Trainer (Fat)
FROM pytorch/pytorch:2.4.0-cuda12.4-cudnn9-devel AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# Train your model, outputting artifacts to /app/model_artifacts.
# Caveat: docker build has no GPU access by default, so this RUN step
# trains on CPU. For real workloads, train in a docker run --gpus all
# container and COPY the resulting checkpoint into the image instead.
RUN python train.py --checkpoint-dir /app/model_artifacts

# Stage 2: The Inference Server (Slim)
FROM python:3.12-slim AS inference
WORKDIR /app
# Copy ONLY the model artifacts and serving code. Don't copy
# site-packages across images: the pytorch/pytorch base is conda-based,
# so its packages live under /opt/conda and won't match this interpreter.
COPY --from=builder /app/model_artifacts /app/model_artifacts
COPY --from=builder /app/inference_api.py /app/
# Reinstall only the runtime dependencies (CPU wheel shown; use
# --index-url https://download.pytorch.org/whl/cu124 for GPU inference)
RUN pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cpu \
    && pip install --no-cache-dir fastapi uvicorn
EXPOSE 8000
CMD ["uvicorn", "inference_api:app", "--host", "0.0.0.0"]

Build it with BuildKit enabled (default in Docker Desktop and recent Docker CE):

DOCKER_BUILDKIT=1 docker build --target inference -t my-model-inference:latest .

Compare the sizes. The devel image might be ~3GB. Your inference stage, based on python:3.12-slim (130MB), will be a fraction of that. For extreme minimalism, consider python:3.12-alpine (45MB), but beware of potential glibc/musl libc compatibility issues with some Python wheels.

Orchestrating Parallel Training Experiments

You need to test three different learning rates across two datasets. Spinning up VMs is overkill. With Docker Compose, you can define an isolated environment per experiment.

Create a docker-compose.parallel.yml:

version: '3.8'

services:
  experiment_lr1e3:
    build: .
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1 # Request one GPU per experiment
              capabilities: [gpu]
    environment:
      - LEARNING_RATE=1e-3
      - EXPERIMENT_NAME=lr1e3
    volumes:
      - experiment_lr1e3_checkpoints:/app/checkpoints
    networks:
      - experiment-net

  experiment_lr1e4:
    build: .
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - LEARNING_RATE=1e-4
      - EXPERIMENT_NAME=lr1e4
    volumes:
      - experiment_lr1e4_checkpoints:/app/checkpoints
    networks:
      - experiment-net

networks:
  experiment-net: # Isolated network for these services

volumes:
  experiment_lr1e3_checkpoints:
  experiment_lr1e4_checkpoints:

Run it: docker-compose -f docker-compose.parallel.yml up -d. Each service gets its own container, its own named volume for checkpoints, and shares the GPU resources (ensure you have enough GPU memory). Use docker stats in one terminal and nvidia-smi on the host to monitor resource usage. To tear it all down cleanly and avoid Error response from daemon: network not found errors later, use docker-compose -f docker-compose.parallel.yml down --volumes.
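Hand-writing a service block per experiment stops scaling around the fourth learning rate. A sketch of generating the compose file from a grid instead—generate_compose is a hypothetical helper, DATASET is an assumed env var for the two-dataset scenario, and the GPU reservation block is omitted from the template for brevity:

```python
from itertools import product

SERVICE_TEMPLATE = """\
  experiment_{name}:
    build: .
    runtime: nvidia
    environment:
      - LEARNING_RATE={lr}
      - DATASET={dataset}
      - EXPERIMENT_NAME={name}
    volumes:
      - experiment_{name}_checkpoints:/app/checkpoints
"""

def generate_compose(learning_rates, datasets) -> str:
    """Emit a docker-compose file with one service and one named
    volume per (learning rate, dataset) combination."""
    services, volumes = [], []
    for lr, ds in product(learning_rates, datasets):
        # Sanitize floats like 0.001 into compose-friendly names
        name = f"lr{lr}_{ds}".replace(".", "p").replace("-", "m")
        services.append(SERVICE_TEMPLATE.format(name=name, lr=lr, dataset=ds))
        volumes.append(f"  experiment_{name}_checkpoints:")
    return "services:\n" + "".join(services) + "\nvolumes:\n" + "\n".join(volumes) + "\n"

if __name__ == "__main__":
    print(generate_compose([1e-3, 1e-4], ["cifar10", "svhn"]))
```

Pipe the output to docker-compose.parallel.yml, review it, then bring the grid up as before; six experiments cost one script run instead of six copy-pasted YAML blocks.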

Pushing to the Cloud: Your Image is Your Artifact

Reproducibility extends to the cloud. You can't ship your local Docker daemon to AWS; you ship an image. Docker Hub is the default for public images, but for private model artifacts, use AWS ECR or Google Artifact Registry.

The workflow is standardized:

  1. Build with a cloud-friendly tag.
    docker build -t 123456789.dkr.ecr.us-east-1.amazonaws.com/my-org/training:v1.0 .
    
  2. Authenticate your Docker client to the registry.
    # AWS ECR
    aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 123456789.dkr.ecr.us-east-1.amazonaws.com
    
  3. Push the image.
    docker push 123456789.dkr.ecr.us-east-1.amazonaws.com/my-org/training:v1.0
    
  4. Pull and run on your cloud instance (with the NVIDIA container runtime installed).
    docker run --gpus all 123456789.dkr.ecr.us-east-1.amazonaws.com/my-org/training:v1.0
    

Push stalling or retrying? First, note that Layer already exists lines are normal status output, not errors. Second, verify authentication with docker login—ECR tokens expire after 12 hours, so re-run the get-login-password step. Third, check your network stability; the push retries on its own, but a poor connection can surface as cryptic failures.

The Toolbox for the Serious Practitioner

Your Docker CLI is just the start. Integrate these into your workflow:

  • Hadolint: Lint your Dockerfile for best practices and common mistakes. Use the VS Code extension.
  • Dive: (dive my-image:tag) Analyze your image layer-by-layer to see where the bloat is coming from.
  • Trivy: Scan your images for CVEs (trivy image my-image:tag). Do this before pushing to production.
  • Lazydocker: A terminal UI for managing everything: containers, images, volumes, and logs.

In VS Code, the Docker extension is non-negotiable. Use Ctrl+Shift+P to open the command palette and run Docker: Build Image. Use the integrated terminal (Ctrl+`) to run commands. Tools like Continue.dev or GitHub Copilot can help draft Dockerfiles and compose files based on your comments.

Next Steps: From Reproducible to Optimal

You now have a reproducible GPU training environment. The next evolution is optimization and robustness.

  1. Implement a Proper CI/CD Pipeline: Don't build and push from your local machine. Use GitHub Actions, GitLab CI, or Jenkins to build, scan (with Trivy), and push your images on every git tag. The pipeline should run a quick sanity check—perhaps a single training batch—inside the built container to verify GPU access.
  2. Version Your Data and Models Alongside Your Code: Use DVC (Data Version Control) or LakeFS to version your datasets. Your Docker image (code + environment) should be pinned to a specific data version via an environment variable or config file mounted at runtime. True reproducibility means being able to re-create the exact data→model pipeline.
  3. Graduate to a Kubernetes Cluster for Hyperparameter Sweeps: When docker-compose for parallel experiments feels limiting, move to Kubernetes. A single Job manifest can spawn 50 pods (experiment-1 through experiment-50), each with a different hyperparameter set injected via environment variables, all pulling your pre-built image from ECR and writing results to a cloud storage bucket (not container volumes). Tools like kubectl and Lens become your new dashboard.
  4. Adopt a Metadata Store: As you run hundreds of experiments, you need to track what you did. Integrate your training script with MLflow or Weights & Biases. Log the hyperparameters (from those environment variables), metrics, and the path to the checkpoint (in your named volume or cloud bucket). This turns a reproducible mess into reproducible science.

Reproducibility isn't a nice-to-have; it's the foundation of reliable machine learning. Docker is the hammer that nails down the environment. Your job is to wield it precisely—specifying exact base images, mounting persistent volumes, injecting configuration, and building lean artifacts. Stop debugging CUDA version mismatches and start debugging your actual models. Your GPU will thank you.