DVC for ML Reproducibility: Dataset Versioning, Pipeline Stages, and S3 Remote Storage

Use DVC to make your ML projects fully reproducible — versioning large datasets in S3, defining pipeline stages with dvc.yaml, caching intermediate results, and reproducing any historical experiment from a Git commit.

You can reproduce your code with Git. You cannot reproduce your data or your trained model. DVC adds the missing half of ML version control.

Your model's performance just tanked in production. You scramble to git checkout the exact commit from last month, re-run train.py, and get… completely different results. The culprit? The 10GB training CSV you downloaded last month has silently changed. A colleague filtered out some "bad" rows. A vendor updated the schema. The file now lives on a different S3 path. Git only sees a changed pointer file. Your experiment is dead, your model is a mystery, and your RTX 4090 just burned cycles for nothing.

This is the ML reproducibility gap. Code is the easy part. The real chaos lives in your datasets, your model binaries, and the labyrinthine pipeline that connects them. Many ML teams adopt DVC for dataset versioning only after being burned by this exact ghost-in-the-machine problem. Let's fix it.

Why Your Git Workflow is a Data Liability

You wouldn't version a 100GB video file in Git. So why pretend that git add-ing a data.csv symlink counts as versioning the data? Git tracks content, not context. It sees a changed hash for a 10KB data.csv symlink and shrugs. It has no idea the actual 10GB file behind that symlink is now fundamentally different.

Worse, your pipeline is a shell-script spaghetti of python preprocess.py && python train.py --lr 0.001. Did the model degrade because of new data, a different hyperparameter, or a library version shift? You're left grep-ping terminal history. This brittleness is a big part of why, per a widely cited Gartner estimate, roughly 85% of ML projects never reach production: the system is too fragile to trust. MLOps practices, starting with proper versioning, meaningfully cut that failure rate.

DVC (Data Version Control) solves this by making data and models first-class citizens in your Git repo. It replaces symlinks with tracked metadata files (.dvc files), handles large-file storage in S3/GCS/Azure, and defines your pipeline as a declarative DAG. It's git for everything that isn't code.

Tracking Your First 10GB Dataset in 3 Commands

Let's version a dataset. You have a data/raw/train.csv file. With Git, you'd sweat. With DVC, you initialize and add.

First, install DVC with the S3 backend: pip install 'dvc[s3]'. In your Git-initialized project root:


# 1. Initialize DVC in the root of your Git repo
dvc init

# 2. Start tracking your data file with DVC, not Git
dvc add data/raw/train.csv

# 3. Commit the DVC metadata; dvc add also writes a .gitignore for the raw file
git add data/raw/train.csv.dvc data/raw/.gitignore
git commit -m "Track raw training dataset with DVC"

What happened? DVC didn't add train.csv to Git. It created a train.csv.dvc file: a small text file containing a content hash of the dataset. It also moved the actual train.csv into .dvc/cache. The file left in your workspace is a link back into that cache (reflink, hardlink, or symlink, depending on your filesystem and the cache.type setting) or a plain copy. The .dvc file is what you commit to Git. This is your lightweight, version-controlled pointer.
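The pointer file itself is tiny, human-readable YAML. A representative example (the hash and size below are made up for illustration; the field names follow the DVC 3.x format):

```yaml
# data/raw/train.csv.dvc -- committed to Git in place of the 10GB file
outs:
- md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d
  size: 10737418240
  hash: md5
  path: train.csv
```

Git diffs and merges this five-line file; DVC resolves the md5 to the real bytes in the cache or remote.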

Now, anyone cloning your repo gets the .dvc file. To download the actual 10GB data, they run dvc pull. DVC reads the pointer, fetches the real file from the cache (or a remote storage, which we'll set up next), and reconstitutes it. Your Git history stays clean and fast.

Setting Up S3 as Your Single Source of Truth

Your local .dvc/cache is fine until your laptop dies. You need remote storage. DVC calls these "remotes," and they work like git remote but for data. S3 is the classic choice.

First, ensure your AWS CLI is configured (aws configure). Then, from your project root:

# Add an S3 bucket as your DVC remote
dvc remote add -d myremote s3://your-bucket-name/dvc-storage

# Push your cached data to S3
dvc push

The -d flag sets this as the default remote. Now, dvc push and dvc pull sync with S3. The efficiency comes from content addressing: every tracked file is stored under its hash, so DVC deduplicates at file granularity and only transfers objects the other side doesn't already have. Track a dataset as a directory of many files, and a 1% change means pushing only the handful of changed files. (A single monolithic 10GB CSV, by contrast, is re-uploaded whole when it changes, which is a good argument for splitting big datasets into parts.) This file-level dedup is why a dvc pull on a 10GB, many-file dataset with a 1% change can finish in ~45 seconds, versus ~180 seconds for a workflow that re-transfers whole objects.
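A content-addressed store is simple enough to sketch in a few lines. This toy version mirrors the DVC 3.x files/md5/ cache layout and shows why pushing identical content twice transfers nothing; it is illustrative, not DVC's actual code:

```python
import hashlib
from pathlib import Path

def cache_path(cache_dir: Path, data: bytes) -> Path:
    """Where a content-addressed store would place this blob.

    Mirrors DVC 3.x's layout: .dvc/cache/files/md5/<first 2 hex chars>/<rest>.
    """
    digest = hashlib.md5(data).hexdigest()
    return cache_dir / "files" / "md5" / digest[:2] / digest[2:]

def push_blob(remote: dict, data: bytes) -> bool:
    """Upload a blob only if the remote doesn't already hold its hash."""
    key = hashlib.md5(data).hexdigest()
    if key in remote:
        return False  # dedup: identical content, nothing transferred
    remote[key] = data
    return True

remote: dict = {}
print(push_blob(remote, b"day-1 rows"))  # True: new content, uploaded
print(push_blob(remote, b"day-1 rows"))  # False: already present, skipped
print(push_blob(remote, b"day-2 rows"))  # True: only the changed file moves
```

Renaming or moving a file changes nothing here: the hash, and therefore the stored object, stays the same, so no bytes are re-transferred.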

Real Error & Fix:

ERROR: failed to push data to the cloud - 403 Forbidden on S3

This is an IAM permissions wall. DVC uses your AWS credentials. The fix:

  1. Verify your credentials are active: aws sts get-caller-identity.
  2. Check you can list the bucket: aws s3 ls s3://your-bucket-name.
  3. Ensure your IAM policy has s3:PutObject and s3:GetObject permissions on the bucket and its prefix (dvc-storage/*). A missing wildcard in the resource path is a common culprit.
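For step 3, a minimal IAM policy that covers DVC's needs might look like the following (the bucket name and dvc-storage prefix are the placeholders from above; note that s3:ListBucket attaches to the bucket ARN itself, not the object prefix):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::your-bucket-name"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::your-bucket-name/dvc-storage/*"
    }
  ]
}
```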

Building Reproducible Pipelines with dvc.yaml

Ad-hoc scripts are the enemy. DVC lets you define your ML pipeline as a declarative Directed Acyclic Graph (DAG) in dvc.yaml. This file defines stages, their dependencies (code + data), and their outputs (data + models).

Here's a pipeline for a simple text classifier:

# dvc.yaml
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - src/prepare.py
      - data/raw/train.csv
    params:
      - prepare.split_ratio
    outs:
      - data/prepared/train.pkl
      - data/prepared/test.pkl

  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/prepared/train.pkl
    params:
      - train.lr
      - train.batch_size
    outs:
      - models/model.onnx
    metrics:
      - metrics/accuracy.json:
          cache: false  # Metrics are small, don't cache in .dvc/cache

  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - data/prepared/test.pkl
      - models/model.onnx
    metrics:
      - metrics/final_report.json:
          cache: false

Key concepts:

  • cmd: The command to run.
  • deps: Files this stage depends on. If any dependency's hash changes, DVC knows this stage is invalidated.
  • params: Reads from a separate params.yaml file. Changing a hyperparameter here will invalidate the train stage.
  • outs: Outputs DVC should track and version. DVC will not reproduce a stage if its outputs already exist and are valid.
  • metrics: Small output files (JSON, YAML) that DVC can compare across experiments.
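Under the hood, DVC links stages into a DAG by matching one stage's deps against another's outs, then runs them in topological order. A small stdlib sketch of that wiring for the pipeline above (code and params deps omitted for brevity):

```python
from graphlib import TopologicalSorter

# The data deps/outs from the dvc.yaml above
stages = {
    "prepare":  {"deps": ["data/raw/train.csv"],
                 "outs": ["data/prepared/train.pkl", "data/prepared/test.pkl"]},
    "train":    {"deps": ["data/prepared/train.pkl"],
                 "outs": ["models/model.onnx"]},
    "evaluate": {"deps": ["data/prepared/test.pkl", "models/model.onnx"],
                 "outs": []},
}

# Map each output file to the stage that produces it
producer = {out: name for name, s in stages.items() for out in s["outs"]}

# A stage depends on whichever stages produce its input files
graph = {name: {producer[d] for d in s["deps"] if d in producer}
         for name, s in stages.items()}

order = list(TopologicalSorter(graph).static_order())
print(order)  # ['prepare', 'train', 'evaluate']
```

This is why you never tell DVC what order to run things in; the deps/outs declarations are the whole schedule.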

Your params.yaml might look like:

prepare:
  split_ratio: 0.8
train:
  lr: 0.001
  batch_size: 32

dvc repro: The "Make" Command for ML

With your pipeline defined, you don't run scripts. You run:

dvc repro

DVC loads dvc.yaml, computes a hash for every dependency and parameter, and compares them against the hashes recorded in dvc.lock from the last run. If nothing changed, it skips every stage and reports that the pipeline is up to date. If you changed train.lr in params.yaml, DVC invalidates the train and evaluate stages and runs only those. This is incremental execution. It saves hours.

To run the full pipeline from scratch? dvc repro --force.
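Mechanically, the up-to-date check is the same trick make uses: compare current dependency hashes against the ones recorded after the last successful run. A stdlib sketch with an invented lock format (the real state lives in dvc.lock):

```python
import hashlib
import json
import tempfile
from pathlib import Path

def file_md5(path: Path) -> str:
    return hashlib.md5(path.read_bytes()).hexdigest()

def stage_is_stale(deps: list, lock_file: Path) -> bool:
    """Stale if any dependency's hash differs from what was recorded."""
    if not lock_file.exists():
        return True  # never run before
    recorded = json.loads(lock_file.read_text())
    return {str(p): file_md5(p) for p in deps} != recorded

def record_run(deps: list, lock_file: Path) -> None:
    lock_file.write_text(json.dumps({str(p): file_md5(p) for p in deps}))

with tempfile.TemporaryDirectory() as d:
    dep = Path(d) / "train.csv"
    lock = Path(d) / "lock.json"
    dep.write_text("a,b\n1,2\n")
    print(stage_is_stale([dep], lock))  # True: no recorded run yet
    record_run([dep], lock)
    print(stage_is_stale([dep], lock))  # False: hashes match, stage is skipped
    dep.write_text("a,b\n1,3\n")        # the data changed
    print(stage_is_stale([dep], lock))  # True: stage must rerun
```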

Tracking Experiments and Hyperparameter Sweeps

Now, let's integrate experiment tracking. DVC can manage parallel experiment runs and plug into MLflow or Weights & Biases. Here's how you launch a hyperparameter sweep using DVC experiments and log it to MLflow (which sees 17M+ monthly downloads as the leading open-source tracker).

First, ensure your train.py uses MLflow's context manager:

# src/train.py
import json
import os

import mlflow
import mlflow.onnx

def train_model(lr, batch_size, train_data_path):
    # ... your training logic produces `onnx_model` and an accuracy score ...
    accuracy = 0.92  # dummy result

    # Log to MLflow
    with mlflow.start_run():  # Context manager prevents 'Run not found' errors
        mlflow.log_param("learning_rate", lr)
        mlflow.log_param("batch_size", batch_size)
        mlflow.log_metric("accuracy", accuracy)
        # Log the model artifact (onnx_model comes from the training logic above)
        mlflow.onnx.log_model(onnx_model, "model")

    # Save the metric where dvc.yaml expects it
    os.makedirs("metrics", exist_ok=True)
    with open("metrics/accuracy.json", "w") as f:
        json.dump({"accuracy": accuracy}, f)

Real Error & Fix:

mlflow.exceptions.MlflowException: Run not found or already finished

This happens when you try to log to a run that's been manually closed. The fix is to always use with mlflow.start_run(): as a context manager. It automatically starts and ends the run, even on exceptions. Only call mlflow.end_run() yourself when you've deliberately opened a run with a bare mlflow.start_run() call outside a with block.
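The guarantee here is Python's with protocol, not anything MLflow-specific. A stdlib toy showing why a context-managed run always closes, even when training raises (start_run below is a stand-in, not MLflow's API):

```python
from contextlib import contextmanager

run_state = {"active": False, "times_closed": 0}

@contextmanager
def start_run():
    """Toy stand-in for mlflow.start_run(): open a run, guarantee it closes."""
    run_state["active"] = True
    try:
        yield run_state
    finally:  # executes on success AND on exception
        run_state["active"] = False
        run_state["times_closed"] += 1

try:
    with start_run():
        raise RuntimeError("training blew up mid-run")
except RuntimeError:
    pass

print(run_state["active"])        # False: the run was closed despite the crash
print(run_state["times_closed"])  # 1
```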

Now, queue experiments with DVC:

# Run experiments with different hyperparameters
dvc exp run --set-param train.lr=0.01
dvc exp run --set-param train.lr=0.0001

# Compare results in a table
dvc exp show

This creates a clean table in your terminal comparing parameters and metrics across runs. For a richer UI, DVC can stream metrics to MLflow or W&B. Continuous training pipelines built this way can shrink model redeployment from a manual, weeks-long process to an automated cycle measured in hours.

CI/CD for ML: Automating dvc repro on Data Changes

The final step is automation. You want your GitHub Actions CI to retrain the model automatically when new data is pushed to S3. Here's a .github/workflows/train-on-data-change.yml skeleton:

name: Retrain on Data Change
on:
  push:
    branches: [ main ]
    paths:
      - 'data/raw/train.csv.dvc' # Trigger only when DVC data pointer changes

jobs:
  retrain:
    runs-on: ubuntu-latest
    permissions:
      contents: write # let the job push the retrain commit back to the repo
    env:
      AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
      AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0 # Fetch all Git history for DVC

      - name: Setup Python & DVC
        run: |
          pip install 'dvc[s3]' mlflow onnx

      - name: Pull Data from S3
        run: dvc pull

      - name: Reproduce Pipeline
        run: dvc repro

      - name: Push Updated Model & Metrics
        run: |
          dvc push
          git config user.name "GitHub Actions Bot"
          git config user.email "actions@github.com"
          git add dvc.lock metrics/accuracy.json metrics/final_report.json
          git diff --staged --quiet || git commit -m "CI: Retrained model on new data"
          git push

This workflow triggers only when the .dvc file for your training data changes (i.e., when dvc add was run on a new dataset version and the change was committed). It pulls the latest data, runs the pipeline, and commits the results back to Git. Note that because models/model.onnx is a pipeline output declared in dvc.yaml, its new hash is recorded in dvc.lock rather than in a standalone .dvc file, so dvc.lock is the file to commit. Your Git history becomes an immutable ledger of code, data, and model versions.
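One common hardening of that commit step: accept the retrained model only when it actually improves. A stdlib sketch of such a gate; the file paths and the idea of keeping a previous baseline JSON around are assumptions, not part of the workflow above:

```python
import json
import tempfile
from pathlib import Path

def should_commit(new_metrics: Path, old_metrics: Path,
                  min_gain: float = 0.0) -> bool:
    """Accept the retrained model only if accuracy improved by min_gain."""
    new_acc = json.loads(new_metrics.read_text())["accuracy"]
    if not old_metrics.exists():
        return True  # first model ever: nothing to compare against
    old_acc = json.loads(old_metrics.read_text())["accuracy"]
    return new_acc > old_acc + min_gain

with tempfile.TemporaryDirectory() as d:
    new, old = Path(d) / "new.json", Path(d) / "old.json"
    new.write_text(json.dumps({"accuracy": 0.93}))
    print(should_commit(new, old))  # True: no baseline yet
    old.write_text(json.dumps({"accuracy": 0.95}))
    print(should_commit(new, old))  # False: regression, block the commit
```

In CI, exit non-zero when the gate fails (e.g., raise SystemExit(1)) so the workflow stops before the git commit step runs.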

Performance & Tool Comparison: Choosing the Right Engine

DVC isn't the only tool, but it's focused. Here’s how it fits within the MLOps stack, with real numbers:

| Task / Tool | Performance Benchmark | When to Use |
| --- | --- | --- |
| Dataset Versioning (DVC vs Git-LFS) | DVC pull (10GB, 1% change): ~45s; Git-LFS: ~180s | Use DVC. Its file-level dedup avoids re-transferring unchanged files in iterative ML data. |
| Experiment Tracking (logging overhead) | MLflow: 2–5ms per log_metric() | Negligible overhead. Log liberally, in batches rather than per-sample. |
| Model Serving (throughput) | Ray Serve: ~12,000 req/s; TorchServe: ~8,000 req/s (ResNet-50, 8-core CPU) | Ray Serve for complex DAGs, TorchServe for pure PyTorch optimization. |
| Drift Detection (speed on 1M samples) | Evidently: ~8s (full report); manual NumPy: ~45s | Evidently for production dashboards; custom code for one-offs. |

DVC excels at the orchestration and versioning layer, sitting between your Git repo and your storage. It plays nicely with the rest of the stack: log metrics to MLflow, package the final model with BentoML, serve it with Ray Serve, and monitor for drift with Evidently (critical, since unmonitored drift can quietly erode a model's accuracy within months).

Next Steps: From Versioned Pipeline to Production Pipeline

You've now versioned your data, defined a reproducible pipeline, and automated retraining. This is the foundation. Your next layers are:

  1. Model Registry: Promote your DVC-tracked .onnx file to an MLflow Model Registry stage (Staging, Production, Archived). Use the registry's API to deploy the Production model version automatically.
  2. Feature Store Integration: Replace your raw data/raw/train.csv dependency with a Feast feature retrieval call. This decouples data engineering from ML, providing consistent, point-in-time correct features for training and serving.
  3. Advanced CI/CD: Expand your GitHub Actions to run evaluation against a holdout set and only deploy if accuracy improves by a threshold. Add a Slack notification step with dvc exp show --md output.
  4. Production Monitoring: After deployment with Ray Serve or Seldon, schedule a daily job that runs Evidently on incoming data, logs drift metrics to MLflow, and triggers a dvc exp run if significant drift is detected.

The goal isn't just reproducibility—it's continuous, reliable reproducibility. DVC gives you the leverage to stop guessing why last month's model was better and start systematically building the next one that will be better still. Stop letting your data be a ghost in the machine. Version it, pipeline it, and ship it.