Version Control Huge Datasets (LiDAR/Video) with DVC in 20 Minutes

Stop committing 50GB LiDAR and video files to Git. Use DVC to version huge datasets without breaking your repo or your team.

Problem: Git Can't Handle Your 50GB Dataset

You're working on an autonomous driving or computer vision project. Your LiDAR point clouds and video clips are massive — 10GB, 50GB, 100GB per dataset version. Committing them to Git bloats your repo, slows clones to a crawl, and breaks CI/CD pipelines.

DVC (Data Version Control) solves this by storing large files in remote storage (S3, GCS, Azure, SSH) while tracking them in Git as lightweight pointer files.

You'll learn:

  • How to set up DVC with a remote for multi-GB dataset files
  • How to version LiDAR (.pcd, .bin) and video (.mp4, .bag) files across branches
  • How to share datasets with your team without downloading everything

Time: 20 min | Level: Intermediate


Why This Happens

Git stores every version of every file in its object database. Three revisions of a 2GB .bag file mean roughly 6GB in .git/objects — binary formats like bags and video barely delta-compress. Your repo becomes unusable within weeks.
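Before migrating, it helps to measure the damage. Both commands below are stock Git (no DVC needed) — run them inside the bloated repo:

```shell
# Total on-disk size of Git's object database, human-readable
git count-objects -vH

# Five largest blobs anywhere in history
git rev-list --objects --all \
  | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
  | awk '$1 == "blob" {print $3, $4}' \
  | sort -rn \
  | head -5
```

If the top blobs are point clouds, bags, or videos, DVC is the right fix.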

Common symptoms:

  • git clone takes 20+ minutes
  • git push times out or hits GitHub's 100MB file limit
  • Teammates skip pulling because it costs too much bandwidth
  • CI runners run out of disk space

DVC splits the difference: it replaces large files with small .dvc pointer files that do get committed to Git (and automatically adds the real files to .gitignore), while the actual bytes live in a separate remote store.


Solution

Step 1: Install DVC with Your Storage Backend

# Core install (Python 3.8+)
pip install dvc

# Install with your storage backend
pip install "dvc[s3]"      # Amazon S3 or S3-compatible (MinIO, Backblaze)
pip install "dvc[gs]"      # Google Cloud Storage
pip install "dvc[azure]"   # Azure Blob Storage
pip install "dvc[ssh]"     # SSH server (NAS, HPC cluster)

# Verify
dvc version

Expected: DVC version 3.x, with your storage plugin listed on the Supports line of the output.

If it fails:

  • ModuleNotFoundError: boto3: You forgot the [s3] extra — reinstall with the bracket suffix
  • Python 3.7 or older: DVC 3.x requires Python 3.8+; upgrade or use pip install "dvc[s3]<3"
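A quick preflight for the Python requirement (assumes python3 is on your PATH):

```shell
# Fails loudly if the interpreter is older than 3.8
python3 -c 'import sys; assert sys.version_info >= (3, 8), sys.version'
```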

Step 2: Initialize DVC in Your Git Repo

# Already inside your git repo
dvc init

# Commit the DVC config files Git needs to track
git add .dvc .dvcignore
git commit -m "chore: initialize DVC"

DVC creates a .dvc/ directory (similar to .git/) and a .dvcignore file (like .gitignore for DVC). Neither is large — commit both.

Expected: .dvc/config and .dvc/.gitignore exist, clean git status.


Step 3: Configure Your Remote Storage

# S3 example (replace with your bucket)
dvc remote add -d myremote s3://my-lidar-datasets/project-alpha

# GCS example
dvc remote add -d myremote gs://my-lidar-datasets/project-alpha

# SSH/NAS example
dvc remote add -d myremote ssh://nas.internal/data/project-alpha

# For S3-compatible storage (MinIO, Backblaze B2)
dvc remote add -d myremote s3://my-bucket/datasets
dvc remote modify myremote endpointurl https://s3.us-west-002.backblazeb2.com

# Save remote config to Git
git add .dvc/config
git commit -m "chore: configure DVC remote storage"

Credentials are read from environment variables (AWS_ACCESS_KEY_ID, GOOGLE_APPLICATION_CREDENTIALS, etc.) — never store secrets in .dvc/config.
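For example, with S3 the standard AWS variables are all DVC needs (the values and key-file path below are placeholders, not real credentials):

```shell
# Placeholders — inject real values from your secrets manager or CI settings
export AWS_ACCESS_KEY_ID="AKIAEXAMPLEKEY"
export AWS_SECRET_ACCESS_KEY="examplesecret"

# GCS instead reads a service-account key file path (hypothetical location)
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/keys/dvc-sa.json"
```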

If it fails:

  • AuthorizationError: Credentials not set — export AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in your shell
  • BucketNotFound: Create the bucket first; DVC won't create it for you

Step 4: Add Your Large Dataset Files

# Track an entire directory of LiDAR scans (one pointer for the whole directory)
dvc add data/lidar_scans/

# Track individual large files
dvc add data/video/route_42.mp4

# Track a ROS bag file
dvc add data/rosbag/2026-01-15-highway.bag

# Note: track a directory OR the files inside it, never both — DVC rejects
# targets that overlap an already-tracked directory.

DVC does two things here: it moves the data into .dvc/cache/ (content-addressed storage) and links or copies it back into your workspace, and it creates a small pointer file (e.g., data/lidar_scans.dvc) recording the MD5 hash and size.

# Commit the pointer files — these are small and go into Git.
# (dvc add also writes a .gitignore next to each tracked path so Git skips the data.)
git add data/lidar_scans.dvc data/video/route_42.mp4.dvc data/.gitignore data/video/.gitignore
git commit -m "data: add LiDAR session 001 and route 42 video"

Expected: .dvc pointer files are a few hundred bytes each. The actual data is NOT in Git.

A .dvc pointer file is just YAML with an MD5 hash — tiny enough for Git.
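For reference, a directory pointer looks roughly like this (hash, sizes, and file count are illustrative):

```yaml
# data/lidar_scans.dvc — committed to Git; the data itself is gitignored
outs:
- md5: 3f1a9c0de45b7a2c8e91d6f04b5a7e12.dir   # directory hashes end in .dir
  size: 52428800000
  nfiles: 1042
  hash: md5
  path: lidar_scans
```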


Step 5: Push Data to Remote Storage

# Upload all tracked files to remote
dvc push

# Push only specific files
dvc push data/lidar_scans.dvc

Expected: Progress bar uploading to your S3/GCS bucket. First push is slow; subsequent pushes only transfer changed files (content-addressed, so unchanged files are skipped automatically).
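Content addressing is why pushes are incremental: a file's identity is the hash of its bytes, so identical bytes map to the same cache entry no matter the filename. A minimal sketch of the idea (not DVC's actual code; sample byte strings are made up):

```python
import hashlib

def content_address(data: bytes) -> str:
    """Return an MD5 hex digest, the kind of content address DVC keys its cache on."""
    return hashlib.md5(data).hexdigest()

scan_v1 = b"point cloud bytes ..."
scan_v1_copy = bytes(scan_v1)      # identical content, different Python object
scan_v2 = scan_v1 + b"new frames"  # changed content

print(content_address(scan_v1) == content_address(scan_v1_copy))  # True: skipped on push
print(content_address(scan_v1) == content_address(scan_v2))       # False: re-uploaded
```

Renaming a file therefore costs nothing on the next push; appending one frame re-uploads that file.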


Step 6: Pull Data on Another Machine (or CI)

# Clone the Git repo (fast — no big files)
git clone https://github.com/your-org/perception-project
cd perception-project

# Download the data tracked by DVC
dvc pull

# Pull only what you need
dvc pull data/lidar_scans.dvc

This is the workflow payoff: your teammate clones in seconds, then downloads only the data they actually need.

If it fails:

  • NoCredentialsError: Set cloud credentials in the environment before running dvc pull
  • FileNotFoundError in cache: Someone ran dvc add but forgot dvc push — ask them to push first

Step 7: Versioning Across Dataset Iterations

# Create a new dataset version
git checkout -b dataset/v2-night-scenes

# Update your data
dvc add data/lidar_scans/   # Re-add after adding new files to the directory
dvc push

git add data/lidar_scans.dvc
git commit -m "data: add 200 night scene LiDAR frames"
git push

Switching dataset versions is now just git checkout + dvc checkout:

# Go back to v1
git checkout main
dvc checkout   # Swaps local files to match the pointer hashes in this branch

# Switch to v2
git checkout dataset/v2-night-scenes
dvc checkout

dvc checkout is fast because it uses your local cache — it links or copies files out of the cache instead of re-downloading them, and with cache.type configured it can use reflinks or hard links to avoid copying bytes at all.
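With hard links, a checkout is a metadata operation: two paths point at the same bytes on disk. A quick standard-library illustration of why that is cheap (filenames are hypothetical stand-ins for the cache and workspace):

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
cache_copy = os.path.join(workdir, "cache_entry.bin")      # stands in for .dvc/cache/...
workspace_copy = os.path.join(workdir, "session_001.pcd")  # stands in for the checkout

with open(cache_copy, "wb") as f:
    f.write(b"\x00" * 1024 * 1024)  # pretend this is a 1 MB scan

os.link(cache_copy, workspace_copy)  # hard link: no bytes copied

cache_stat = os.stat(cache_copy)
ws_stat = os.stat(workspace_copy)
print(cache_stat.st_ino == ws_stat.st_ino)  # True: same inode, same bytes on disk
print(cache_stat.st_nlink)                  # 2: two names, one payload
```

The flip side is that hard-linked files must not be edited in place, which is why DVC protects them as read-only when you enable that link type.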


Step 8: Reproduce Pipelines (Optional but Powerful)

For ML workflows where data → preprocessing → training are all versioned together:

# Define a pipeline stage (DVC 3.x: dvc stage add replaced the old dvc run)
dvc stage add -n preprocess \
  -d data/lidar_scans/ \
  -d src/preprocess.py \
  -o data/preprocessed/ \
  python src/preprocess.py

# Run the pipeline
dvc repro

# DVC only reruns stages where inputs changed

This locks your preprocessing script version to your data version — reproducibility without Docker overhead for simple pipelines.
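The stage definition lands in dvc.yaml, a plain YAML file you commit alongside your code — roughly this, given the dependencies and output in the example above:

```yaml
# dvc.yaml — generated from the stage definition; commit it to Git
stages:
  preprocess:
    cmd: python src/preprocess.py
    deps:
    - data/lidar_scans
    - src/preprocess.py
    outs:
    - data/preprocessed
```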


Verification

# Check what DVC is tracking
dvc status

# List files DVC tracks in this repo (note: the repo, not the remote)
dvc ls --recursive .

# Confirm data matches remote
dvc status --cloud

You should see: Data and pipelines are up to date. with no diff between local and remote.

# Full round-trip test
dvc push && dvc pull --force

You should see: No files re-downloaded (everything already cached locally).

A clean status means your local files match the committed .dvc pointer hashes.


What You Learned

  • DVC stores large files in remote object storage and commits only tiny pointer files to Git — your repo stays fast
  • dvc push / dvc pull work like git push / git pull but for data
  • dvc checkout switches your local data to match the current Git branch's pointer files
  • Content-addressed caching means unchanged files are never re-uploaded or re-downloaded

Limitations to know:

  • DVC doesn't lock files — two people can push conflicting versions of the same .dvc file; treat .dvc pointer files like code (review PRs)
  • Remote storage costs real money at scale — 1TB of LiDAR data at S3 standard rates is ~$23/month just for storage, not counting transfer
  • dvc repro pipelines are great for research but can feel heavyweight for production; consider Airflow or Prefect for complex DAGs

When NOT to use this:

  • Datasets under ~100MB — Git LFS is simpler and most hosts support it natively
  • When your team already has a data lake with its own versioning (Delta Lake, Iceberg) — DVC would duplicate versioning logic

Tested on DVC 3.55, Python 3.12, macOS Sequoia & Ubuntu 24.04. Remote storage tested against AWS S3 and MinIO 7.x.