Problem: Git Can't Handle Your 50GB Dataset
You're working on an autonomous driving or computer vision project. Your LiDAR point clouds and video clips are massive — 10GB, 50GB, 100GB per dataset version. Committing them to Git bloats your repo, slows clones to a crawl, and breaks CI/CD pipelines.
DVC (Data Version Control) solves this by storing large files in remote storage (S3, GCS, Azure, SSH) while tracking them in Git as lightweight pointer files.
You'll learn:
- How to set up DVC with a remote for multi-GB dataset files
- How to version LiDAR (.pcd, .bin) and video (.mp4, .bag) files across branches
- How to share datasets with your team without downloading everything
Time: 20 min | Level: Intermediate
Why This Happens
Git stores every version of every file in its object database. A 2GB .bag file committed three times = 6GB in .git/objects. Your repo becomes unusable within weeks.
Common symptoms:
- git clone takes 20+ minutes
- git push times out or hits GitHub's 100MB file limit
- Teammates skip pulling because it costs too much bandwidth
- CI runners run out of disk space
DVC works like Git's .gitignore but for data: it replaces large files with small .dvc pointer files that do get committed to Git, while the actual bytes live in a separate remote store.
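To make the pointer-file idea concrete, here is a minimal Python sketch of what a pointer records. The field names are modeled on DVC 3.x .dvc files but reproduced from memory; real pointer files are written by dvc add, not by hand:

```python
# Illustrative sketch of a DVC-style pointer record: a content hash and
# a size stand in for the big file in Git. Field names are assumptions
# based on the DVC 3.x .dvc format.
import hashlib
import os

def md5_of_file(path, chunk_size=1 << 20):
    """Stream the file in 1MB chunks so multi-GB inputs never load into RAM."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def make_pointer(path):
    """Build the tiny record that gets committed to Git instead of the data."""
    return {
        "md5": md5_of_file(path),
        "size": os.path.getsize(path),
        "path": os.path.basename(path),
    }

# Demo with a small stand-in for a multi-GB LiDAR scan
with open("session_001.pcd", "wb") as f:
    f.write(b"\x00" * 4096)
print(make_pointer("session_001.pcd"))
```

The pointer is a few dozen bytes regardless of how big the data file is, which is why committing it to Git costs nothing.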
Solution
Step 1: Install DVC with Your Storage Backend
# Core install (Python 3.8+)
pip install dvc
# Install with your storage backend
pip install "dvc[s3]" # Amazon S3 or S3-compatible (MinIO, Backblaze)
pip install "dvc[gs]" # Google Cloud Storage
pip install "dvc[azure]" # Azure Blob Storage
pip install "dvc[ssh]" # SSH server (NAS, HPC cluster)
# Verify
dvc version
Expected: DVC version 3.x with your storage plugin listed under "Supports".
If it fails:
- ModuleNotFoundError: boto3 — you forgot the [s3] extra; reinstall with the bracket suffix
- Python 3.7 or older: DVC 3.x requires Python 3.8+; upgrade Python, or pin an older DVC with pip install "dvc[s3]<3"
Step 2: Initialize DVC in Your Git Repo
# Already inside your git repo
dvc init
# Commit the DVC config files Git needs to track
git add .dvc .dvcignore
git commit -m "chore: initialize DVC"
DVC creates a .dvc/ directory (similar to .git/) and a .dvcignore file (like .gitignore for DVC). Neither is large — commit both.
Expected: .dvc/config and .dvc/.gitignore exist, clean git status.
Step 3: Configure Your Remote Storage
# S3 example (replace with your bucket)
dvc remote add -d myremote s3://my-lidar-datasets/project-alpha
# GCS example
dvc remote add -d myremote gs://my-lidar-datasets/project-alpha
# SSH/NAS example
dvc remote add -d myremote ssh://nas.internal/data/project-alpha
# For S3-compatible storage (MinIO, Backblaze B2)
dvc remote add -d myremote s3://my-bucket/datasets
dvc remote modify myremote endpointurl https://s3.us-west-002.backblazeb2.com
# Save remote config to Git
git add .dvc/config
git commit -m "chore: configure DVC remote storage"
Credentials are read from environment variables (AWS_ACCESS_KEY_ID, GOOGLE_APPLICATION_CREDENTIALS, etc.) — never store secrets in .dvc/config.
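Because a missing credential only surfaces at push/pull time, a small preflight check can fail fast with a clear message. This is a sketch, not part of DVC; the environment variable names are the standard SDK ones for each backend, so adjust them for your setup:

```python
# Preflight check before `dvc push`/`dvc pull`: report which required
# cloud credential variables are unset. The variable lists are the
# conventional SDK names, not something DVC itself enforces.
import os

REQUIRED_VARS = {
    "s3": ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"],
    "gs": ["GOOGLE_APPLICATION_CREDENTIALS"],
    "azure": ["AZURE_STORAGE_CONNECTION_STRING"],
}

def missing_credentials(remote_type):
    """Return the env vars that are required for this backend but unset."""
    return [v for v in REQUIRED_VARS.get(remote_type, []) if not os.environ.get(v)]

missing = missing_credentials("s3")
if missing:
    print("Set these before running dvc push/pull:", ", ".join(missing))
```

Drop a check like this into your CI bootstrap script and a misconfigured runner fails in seconds instead of mid-upload.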
If it fails:
- AuthorizationError: credentials not set — export AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in your shell
- BucketNotFound: create the bucket first; DVC won't create it for you
Step 4: Add Your Large Dataset Files
# Track an entire directory of LiDAR scans
dvc add data/lidar_scans/
# Track specific large files
dvc add data/video/route_42.mp4
dvc add data/lidar_scans/session_001.pcd
# Track a ROS bag file
dvc add data/rosbag/2026-01-15-highway.bag
DVC does two things here: it stores the file contents in .dvc/cache/ (content-addressed storage) and links them back into your workspace, and it creates a small pointer file (e.g., data/lidar_scans.dvc) that contains the MD5 hash and size.
# Commit the pointer files — these are small and go into Git.
# dvc add also writes .gitignore entries next to the data so Git skips the real files.
git add data/lidar_scans.dvc data/video/route_42.mp4.dvc data/.gitignore data/video/.gitignore
git commit -m "data: add LiDAR session 001 and route 42 video"
Expected: .dvc pointer files are a few hundred bytes each. The actual data is NOT in Git.
A .dvc file is just a YAML with an MD5 hash — tiny enough for Git
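The cache sitting behind those pointers can be sketched in a few lines. This models DVC's content-addressed layout (first two hex characters of the hash become a directory, the rest the filename); the directory name and helpers here are illustrative, not DVC's actual code:

```python
# Toy content-addressed cache: files are stored under their own hash,
# so identical content dedupes automatically and re-adding is a no-op.
# Layout modeled on DVC's scheme; CACHE_DIR is a stand-in for .dvc/cache.
import hashlib
import os
import shutil

CACHE_DIR = ".dvc_cache_demo"

def cache_path(md5):
    """First two hex chars as a directory, the rest as the filename."""
    return os.path.join(CACHE_DIR, md5[:2], md5[2:])

def add_to_cache(path):
    """Store a file under its content hash and return the hash."""
    md5 = hashlib.md5(open(path, "rb").read()).hexdigest()
    dest = cache_path(md5)
    if not os.path.exists(dest):  # already cached: nothing to copy
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        shutil.copy(path, dest)
    return md5

with open("scan.bin", "wb") as f:
    f.write(b"point cloud bytes")
h = add_to_cache("scan.bin")
print(cache_path(h))
```

The same property is what makes repeated pushes cheap: a blob whose hash already exists on the remote is simply skipped.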
Step 5: Push Data to Remote Storage
# Upload all tracked files to remote
dvc push
# Push only specific files
dvc push data/lidar_scans.dvc
Expected: Progress bar uploading to your S3/GCS bucket. First push is slow; subsequent pushes only transfer changed files (content-addressed, so unchanged files are skipped automatically).
Step 6: Pull Data on Another Machine (or CI)
# Clone the Git repo (fast — no big files)
git clone https://github.com/your-org/perception-project
cd perception-project
# Download the data tracked by DVC
dvc pull
# Pull only what you need
dvc pull data/lidar_scans.dvc
This is the workflow payoff: your teammate clones in seconds, then downloads only the data they actually need.
If it fails:
- NoCredentialsError: set cloud credentials in the environment before running dvc pull
- FileNotFoundError in cache: someone ran dvc add but forgot dvc push — ask them to push first
Step 7: Versioning Across Dataset Iterations
# Create a new dataset version
git checkout -b dataset/v2-night-scenes
# Update your data
dvc add data/lidar_scans/ # Re-add after adding new files to the directory
dvc push
git add data/lidar_scans.dvc
git commit -m "data: add 200 night scene LiDAR frames"
git push
Switching dataset versions is now just git checkout + dvc checkout:
# Go back to v1
git checkout main
dvc checkout # Swaps local files to match the pointer hashes in this branch
# Switch to v2
git checkout dataset/v2-night-scenes
dvc checkout
dvc checkout is fast because it uses your local cache — it hard-links files instead of copying them when possible.
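The hard-link trick is simple enough to sketch. This is not DVC's implementation (DVC also supports reflinks, symlinks, and copies, configurable per repo), just a minimal illustration of why linking out of a local cache beats copying:

```python
# Minimal checkout sketch: try a hard link from the cache (instant, no
# extra disk), fall back to a copy when linking fails (e.g. the cache
# and workspace are on different filesystems).
import os
import shutil

def checkout(cache_file, workspace_file):
    if os.path.exists(workspace_file):
        os.remove(workspace_file)
    try:
        os.link(cache_file, workspace_file)      # O(1) regardless of file size
        return "hardlink"
    except OSError:
        shutil.copy(cache_file, workspace_file)  # cross-device fallback
        return "copy"

with open("cache_blob", "wb") as f:
    f.write(b"lidar frame")
print(checkout("cache_blob", "data_frame.bin"))
```

Note that hard links mean the workspace file and the cache entry share bytes, which is why DVC protects cached files from in-place edits.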
Step 8: Reproduce Pipelines (Optional but Powerful)
For ML workflows where data → preprocessing → training are all versioned together:
# Define a pipeline stage (DVC 3.x uses `dvc stage add`; the older `dvc run` was removed)
dvc stage add -n preprocess \
-d data/lidar_scans/ \
-d src/preprocess.py \
-o data/preprocessed/ \
python src/preprocess.py
# Run the pipeline
dvc repro
# DVC only reruns stages where inputs changed
This locks your preprocessing script version to your data version — reproducibility without Docker overhead for simple pipelines.
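The core idea of dvc repro — rerun a stage only when its dependencies' hashes changed since the last run — fits in a toy sketch. This is an assumption-laden simplification, not DVC's actual algorithm (which tracks stages in dvc.lock):

```python
# Toy `repro`: fingerprint the dependencies, compare against the last
# recorded fingerprint, and skip the stage when nothing changed.
# LOCK_FILE is a stand-in for DVC's dvc.lock.
import hashlib
import json
import os

LOCK_FILE = "stage.lock"

def deps_fingerprint(paths):
    """One hash over all dependency contents, in a stable order."""
    h = hashlib.md5()
    for p in sorted(paths):
        h.update(open(p, "rb").read())
    return h.hexdigest()

def repro(deps, run):
    """Call run() only if the dependency fingerprint changed."""
    fp = deps_fingerprint(deps)
    last = json.load(open(LOCK_FILE))["fp"] if os.path.exists(LOCK_FILE) else None
    if fp == last:
        return "skipped"
    run()
    json.dump({"fp": fp}, open(LOCK_FILE, "w"))
    return "ran"

with open("preprocess.py", "w") as f:
    f.write("print('preprocess')\n")
print(repro(["preprocess.py"], lambda: None))  # first call: stage runs
print(repro(["preprocess.py"], lambda: None))  # unchanged deps: skipped
```

Edit preprocess.py (or the data) and the next call runs the stage again, which is exactly the behavior that pins your script version to your data version.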
Verification
# Check what DVC is tracking
dvc status
# List DVC-tracked files in the repo (dvc ls reads pointers, not the remote)
dvc ls --dvc-only --recursive .
# Confirm data matches remote
dvc status --cloud
You should see: Data and pipelines are up to date. with no diff between local and remote.
# Full round-trip test
dvc push && dvc pull --force
You should see: No files re-downloaded (everything already cached locally).
Clean status means local files match the committed .dvc pointer hashes
What You Learned
- DVC stores large files in remote object storage and commits only tiny pointer files to Git — your repo stays fast
- dvc push / dvc pull work like git push / git pull but for data
- dvc checkout switches your local data to match the current Git branch's pointer files
- Content-addressed caching means unchanged files are never re-uploaded or re-downloaded
Limitations to know:
- DVC doesn't lock files — two people can push conflicting versions of the same .dvc file; treat .dvc pointer files like code (review PRs)
- Remote storage costs real money at scale — 1TB of LiDAR data at S3 standard rates is ~$23/month just for storage, not counting transfer
- dvc repro pipelines are great for research but can feel heavyweight for production; consider Airflow or Prefect for complex DAGs
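The storage figure above is simple arithmetic worth keeping handy when sizing a remote. The rate here assumes S3 Standard at roughly $0.023/GB-month; regional pricing varies, so check current rates:

```python
# Back-of-envelope monthly storage cost. The default rate is an assumed
# S3 Standard price (~$0.023 per GB-month); substitute your provider's.
def monthly_storage_cost(gb, rate_per_gb=0.023):
    return gb * rate_per_gb

print(f"1 TB of LiDAR data: ~${monthly_storage_cost(1024):.2f}/month")
```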
When NOT to use this:
- Datasets under ~100MB — Git LFS is simpler and most hosts support it natively
- When your team already has a data lake with its own versioning (Delta Lake, Iceberg) — DVC would duplicate versioning logic
Tested on DVC 3.55, Python 3.12, macOS Sequoia & Ubuntu 24.04. Remote storage tested against AWS S3 and MinIO 7.x.