Problem: Git Can't Handle Your 50GB Dataset
You're working on an autonomous driving or computer vision project. Your LiDAR point clouds and video clips are massive — 10GB, 50GB, 100GB per dataset version. Committing them to Git bloats your repo, slows clones to a crawl, and breaks CI/CD pipelines.
DVC (Data Version Control) solves this by storing large files in remote storage (S3, GCS, Azure, SSH) while tracking them in Git as lightweight pointer files.
You'll learn:
- How to set up DVC with a remote for multi-GB dataset files
- How to version LiDAR (.pcd, .bin) and video (.mp4, .bag) files across branches
- How to share datasets with your team without downloading everything
Time: 20 min | Level: Intermediate
Why This Happens
Git stores every version of every file in its object database. A 2GB .bag file committed three times = 6GB in .git/objects. Your repo becomes unusable within weeks.
Common symptoms:
- git clone takes 20+ minutes
- git push times out or hits GitHub's 100MB file limit
- Teammates skip pulling because it costs too much bandwidth
- CI runners run out of disk space
DVC works like Git's .gitignore but for data: it replaces large files with small .dvc pointer files that do get committed to Git, while the actual bytes live in a separate remote store.
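To make the pointer-file idea concrete, here is a minimal Python sketch of what a pointer records. The field names are modeled on DVC 3.x .dvc files but reproduced from memory; real pointer files are written by dvc add, not by hand:

```python
# Illustrative sketch of a DVC-style pointer record: a content hash and
# a size stand in for the big file in Git. Field names are assumptions
# based on the DVC 3.x .dvc format.
import hashlib
import os

def md5_of_file(path, chunk_size=1 << 20):
    """Stream the file in 1MB chunks so multi-GB inputs never load into RAM."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def make_pointer(path):
    """Build the tiny record that gets committed to Git instead of the data."""
    return {
        "md5": md5_of_file(path),
        "size": os.path.getsize(path),
        "path": os.path.basename(path),
    }

# Demo with a small stand-in for a multi-GB LiDAR scan
with open("session_001.pcd", "wb") as f:
    f.write(b"\x00" * 4096)
print(make_pointer("session_001.pcd"))
```

The pointer is a few dozen bytes regardless of how big the data file is, which is why committing it to Git costs nothing.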
Solution
Step 1: Install DVC with Your Storage Backend
# Core install (Python 3.8+)
pip install dvc
# Install with your storage backend
pip install "dvc[s3]" # Amazon S3 or S3-compatible (MinIO, Backblaze)
pip install "dvc[gs]" # Google Cloud Storage
pip install "dvc[azure]" # Azure Blob Storage
pip install "dvc[ssh]" # SSH server (NAS, HPC cluster)
# Verify
dvc version
Expected: DVC version 3.x with your storage plugin listed under "Supports".
If it fails:
- ModuleNotFoundError: boto3 — you forgot the [s3] extra; reinstall with the bracket suffix
- Python 3.7 or older: DVC 3.x requires Python 3.8+; upgrade Python, or pin an older DVC with pip install "dvc[s3]<3"
Step 2: Initialize DVC in Your Git Repo
# Already inside your git repo
dvc init
# Commit the DVC config files Git needs to track
git add .dvc .dvcignore
git commit -m "chore: initialize DVC"
DVC creates a .dvc/ directory (similar to .git/) and a .dvcignore file (like .gitignore for DVC). Neither is large — commit both.
Expected: .dvc/config and .dvc/.gitignore exist, clean git status.
Step 3: Configure Your Remote Storage
# S3 example (replace with your bucket)
dvc remote add -d myremote s3://my-lidar-datasets/project-alpha
# GCS example
dvc remote add -d myremote gs://my-lidar-datasets/project-alpha
# SSH/NAS example
dvc remote add -d myremote ssh://nas.internal/data/project-alpha
# For S3-compatible storage (MinIO, Backblaze B2)
dvc remote add -d myremote s3://my-bucket/datasets
dvc remote modify myremote endpointurl https://s3.us-west-002.backblazeb2.com
# Save remote config to Git
git add .dvc/config
git commit -m "chore: configure DVC remote storage"
Credentials are read from environment variables (AWS_ACCESS_KEY_ID, GOOGLE_APPLICATION_CREDENTIALS, etc.) — never store secrets in .dvc/config.
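Because a missing credential only surfaces at push/pull time, a small preflight check can fail fast with a clear message. This is a sketch, not part of DVC; the environment variable names are the standard SDK ones for each backend, so adjust them for your setup:

```python
# Preflight check before `dvc push`/`dvc pull`: report which required
# cloud credential variables are unset. The variable lists are the
# conventional SDK names, not something DVC itself enforces.
import os

REQUIRED_VARS = {
    "s3": ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"],
    "gs": ["GOOGLE_APPLICATION_CREDENTIALS"],
    "azure": ["AZURE_STORAGE_CONNECTION_STRING"],
}

def missing_credentials(remote_type):
    """Return the env vars that are required for this backend but unset."""
    return [v for v in REQUIRED_VARS.get(remote_type, []) if not os.environ.get(v)]

missing = missing_credentials("s3")
if missing:
    print("Set these before running dvc push/pull:", ", ".join(missing))
```

Drop a check like this into your CI bootstrap script and a misconfigured runner fails in seconds instead of mid-upload.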
If it fails:
- AuthorizationError: credentials not set — export AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in your shell
- BucketNotFound: create the bucket first; DVC won't create it for you
Step 4: Add Your Large Dataset Files
# Track an entire directory of LiDAR scans
dvc add data/lidar_scans/
# Track specific large files
dvc add data/video/route_42.mp4
dvc add data/lidar_scans/session_001.pcd
# Track a ROS bag file
dvc add data/rosbag/2026-01-15-highway.bag
DVC does two things here: it stores the file contents in .dvc/cache/ (content-addressed storage) and links them back into your workspace, and it creates a small pointer file (e.g., data/lidar_scans.dvc) that contains the MD5 hash and size.
# Commit the pointer files — these are small and go into Git.
# dvc add also writes .gitignore entries next to the data so Git skips the real files.
git add data/lidar_scans.dvc data/video/route_42.mp4.dvc data/.gitignore data/video/.gitignore
git commit -m "data: add LiDAR session 001 and route 42 video"
Expected: .dvc pointer files are a few hundred bytes each. The actual data is NOT in Git.
A .dvc file is just a YAML with an MD5 hash — tiny enough for Git
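The cache sitting behind those pointers can be sketched in a few lines. This models DVC's content-addressed layout (first two hex characters of the hash become a directory, the rest the filename); the directory name and helpers here are illustrative, not DVC's actual code:

```python
# Toy content-addressed cache: files are stored under their own hash,
# so identical content dedupes automatically and re-adding is a no-op.
# Layout modeled on DVC's scheme; CACHE_DIR is a stand-in for .dvc/cache.
import hashlib
import os
import shutil

CACHE_DIR = ".dvc_cache_demo"

def cache_path(md5):
    """First two hex chars as a directory, the rest as the filename."""
    return os.path.join(CACHE_DIR, md5[:2], md5[2:])

def add_to_cache(path):
    """Store a file under its content hash and return the hash."""
    md5 = hashlib.md5(open(path, "rb").read()).hexdigest()
    dest = cache_path(md5)
    if not os.path.exists(dest):  # already cached: nothing to copy
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        shutil.copy(path, dest)
    return md5

with open("scan.bin", "wb") as f:
    f.write(b"point cloud bytes")
h = add_to_cache("scan.bin")
print(cache_path(h))
```

The same property is what makes repeated pushes cheap: a blob whose hash already exists on the remote is simply skipped.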
Step 5: Push Data to Remote Storage
# Upload all tracked files to remote
dvc push
# Push only specific files
dvc push data/lidar_scans.dvc
Expected: Progress bar uploading to your S3/GCS bucket. First push is slow; subsequent pushes only transfer changed files (content-addressed, so unchanged files are skipped automatically).
Step 6: Pull Data on Another Machine (or CI)
# Clone the Git repo (fast — no big files)
git clone https://github.com/your-org/perception-project
cd perception-project
# Download the data tracked by DVC
dvc pull
# Pull only what you need
dvc pull data/lidar_scans.dvc
This is the workflow payoff: your teammate clones in seconds, then downloads only the data they actually need.
If it fails:
- NoCredentialsError: set cloud credentials in the environment before running dvc pull
- FileNotFoundError in cache: someone ran dvc add but forgot dvc push — ask them to push first
Step 7: Versioning Across Dataset Iterations
# Create a new dataset version
git checkout -b dataset/v2-night-scenes
# Update your data
dvc add data/lidar_scans/ # Re-add after adding new files to the directory
dvc push
git add data/lidar_scans.dvc
git commit -m "data: add 200 night scene LiDAR frames"
git push
Switching dataset versions is now just git checkout + dvc checkout:
# Go back to v1
git checkout main
dvc checkout # Swaps local files to match the pointer hashes in this branch
# Switch to v2
git checkout dataset/v2-night-scenes
dvc checkout
dvc checkout is fast because it uses your local cache — it hard-links files instead of copying them when possible.
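The hard-link trick is simple enough to sketch. This is not DVC's implementation (DVC also supports reflinks, symlinks, and copies, configurable per repo), just a minimal illustration of why linking out of a local cache beats copying:

```python
# Minimal checkout sketch: try a hard link from the cache (instant, no
# extra disk), fall back to a copy when linking fails (e.g. the cache
# and workspace are on different filesystems).
import os
import shutil

def checkout(cache_file, workspace_file):
    if os.path.exists(workspace_file):
        os.remove(workspace_file)
    try:
        os.link(cache_file, workspace_file)      # O(1) regardless of file size
        return "hardlink"
    except OSError:
        shutil.copy(cache_file, workspace_file)  # cross-device fallback
        return "copy"

with open("cache_blob", "wb") as f:
    f.write(b"lidar frame")
print(checkout("cache_blob", "data_frame.bin"))
```

Note that hard links mean the workspace file and the cache entry share bytes, which is why DVC protects cached files from in-place edits.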
Step 8: Reproduce Pipelines (Optional but Powerful)
For ML workflows where data → preprocessing → training are all versioned together:
# Define a pipeline stage (DVC 3.x uses `dvc stage add`; the older `dvc run` was removed)
dvc stage add -n preprocess \
-d data/lidar_scans/ \
-d src/preprocess.py \
-o data/preprocessed/ \
python src/preprocess.py
# Run the pipeline
dvc repro
# DVC only reruns stages where inputs changed
This locks your preprocessing script version to your data version — reproducibility without Docker overhead for simple pipelines.
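The core idea of dvc repro — rerun a stage only when its dependencies' hashes changed since the last run — fits in a toy sketch. This is an assumption-laden simplification, not DVC's actual algorithm (which tracks stages in dvc.lock):

```python
# Toy `repro`: fingerprint the dependencies, compare against the last
# recorded fingerprint, and skip the stage when nothing changed.
# LOCK_FILE is a stand-in for DVC's dvc.lock.
import hashlib
import json
import os

LOCK_FILE = "stage.lock"

def deps_fingerprint(paths):
    """One hash over all dependency contents, in a stable order."""
    h = hashlib.md5()
    for p in sorted(paths):
        h.update(open(p, "rb").read())
    return h.hexdigest()

def repro(deps, run):
    """Call run() only if the dependency fingerprint changed."""
    fp = deps_fingerprint(deps)
    last = json.load(open(LOCK_FILE))["fp"] if os.path.exists(LOCK_FILE) else None
    if fp == last:
        return "skipped"
    run()
    json.dump({"fp": fp}, open(LOCK_FILE, "w"))
    return "ran"

with open("preprocess.py", "w") as f:
    f.write("print('preprocess')\n")
print(repro(["preprocess.py"], lambda: None))  # first call: stage runs
print(repro(["preprocess.py"], lambda: None))  # unchanged deps: skipped
```

Edit preprocess.py (or the data) and the next call runs the stage again, which is exactly the behavior that pins your script version to your data version.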
Verification
# Check what DVC is tracking
dvc status
# List DVC-tracked files in the repo (dvc ls reads pointers, not the remote)
dvc ls --dvc-only --recursive .
# Confirm data matches remote
dvc status --cloud
You should see: Data and pipelines are up to date. with no diff between local and remote.
# Full round-trip test
dvc push && dvc pull --force
You should see: No files re-downloaded (everything already cached locally).
Clean status means local files match the committed .dvc pointer hashes
What You Learned
- DVC stores large files in remote object storage and commits only tiny pointer files to Git — your repo stays fast
- dvc push / dvc pull work like git push / git pull but for data
- dvc checkout switches your local data to match the current Git branch's pointer files
- Content-addressed caching means unchanged files are never re-uploaded or re-downloaded
Limitations to know:
- DVC doesn't lock files — two people can push conflicting versions of the same .dvc file; treat .dvc pointer files like code (review PRs)
- Remote storage costs real money at scale — 1TB of LiDAR data at S3 standard rates is ~$23/month just for storage, not counting transfer
- dvc repro pipelines are great for research but can feel heavyweight for production; consider Airflow or Prefect for complex DAGs
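The storage figure above is simple arithmetic worth keeping handy when sizing a remote. The rate here assumes S3 Standard at roughly $0.023/GB-month; regional pricing varies, so check current rates:

```python
# Back-of-envelope monthly storage cost. The default rate is an assumed
# S3 Standard price (~$0.023 per GB-month); substitute your provider's.
def monthly_storage_cost(gb, rate_per_gb=0.023):
    return gb * rate_per_gb

print(f"1 TB of LiDAR data: ~${monthly_storage_cost(1024):.2f}/month")
```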
When NOT to use this:
- Datasets under ~100MB — Git LFS is simpler and most hosts support it natively
- When your team already has a data lake with its own versioning (Delta Lake, Iceberg) — DVC would duplicate versioning logic
Tested on DVC 3.55, Python 3.12, macOS Sequoia & Ubuntu 24.04. Remote storage tested against AWS S3 and MinIO 7.x.