Day 3 of 30-Day MLOps Challenge: Mastering Data Versioning with DVC

📚 Key Learnings

Why versioning datasets is as important as versioning code in ML workflows
How DVC (Data Version Control) integrates with Git for full pipeline reproducibility
How to use DVC to track datasets, models, and pipelines
Basics of setting up a DVC project, connecting remote storage, and managing large files
How DVC enables collaboration across ML teams by standardizing data + code versioning

🧠 Learn Here — What is Data Versioning?

Data Versioning in ML is the practice of tracking, managing, and controlling changes to datasets used throughout the machine learning lifecycle — similar to how Git tracks code versions.

🧰 Tools Used for Data Versioning

| Tool | Description |
| --- | --- |
| DVC (Data Version Control) | Git-like version control for data and models |
| LakeFS | Git-style versioning for object stores (e.g., S3) |
| Pachyderm | Data lineage and versioning built into pipelines |
| Weights & Biases / MLflow | Can log and track dataset artifacts and metadata |

⚖️ Why Versioning Datasets is as Important as Versioning Code

1. 🧬 Reproducibility

Just like code, the training dataset determines the behavior of the machine learning model.
Without dataset versioning, reproducing results becomes very difficult, because even slight changes in data can lead to different model performance.
Reproducibility is critical for debugging, research validation, audits, and regulated environments.
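
To make the "slight changes" point concrete, here is a minimal sketch using plain checksums (no DVC involved). It assumes GNU coreutils (`md5sum`), and the file names are made up for the demo. Changing a single label produces a completely different fingerprint, which is what lets a versioning tool notice that the dataset is no longer the same.

```shell
# Illustration only: plain checksums, no DVC involved.
# Assumes GNU coreutils (md5sum); the file names are made up for this demo.
workdir=$(mktemp -d)

printf 'id,label\n1,cat\n2,dog\n' > "$workdir/train_v1.csv"
# Same data with a single label changed.
printf 'id,label\n1,cat\n2,cat\n' > "$workdir/train_v2.csv"

hash_v1=$(md5sum "$workdir/train_v1.csv" | cut -d' ' -f1)
hash_v2=$(md5sum "$workdir/train_v2.csv" | cut -d' ' -f1)

echo "v1: $hash_v1"
echo "v2: $hash_v2"

if [ "$hash_v1" != "$hash_v2" ]; then
  echo "Datasets differ: record which version each experiment used."
fi
```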

2. 📈 Experiment Tracking

Tracking which dataset version was used in each experiment is crucial for evaluating model performance over time.
Allows comparison of results across different dataset iterations.
Tools like MLflow, DVC, and Weights & Biases can record which dataset version produced each set of metrics, so comparisons between experiments stay meaningful.

3. 🤝 Collaboration

In teams, multiple members may work with the same project. Dataset versioning ensures everyone is working on the same, consistent data.
Prevents the confusion that arises from ad-hoc data modifications.
Enables parallel experimentation on different branches of data.

4. 📊 Model Performance Monitoring

Data changes (like new features, additional rows, or re-labeling) can impact model performance.
Versioning allows tracking of what data changes led to performance improvement or degradation.
Supports rollback to previous versions in case of anomalies.
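
With DVC, a rollback could look roughly like the sketch below. `restore_dataset` is a hypothetical helper name (not a DVC command), and it assumes the data file is already tracked through a `.dvc` pointer committed to Git.

```shell
# restore_dataset is a hypothetical helper name, not a DVC command.
# Usage: restore_dataset <git-revision> <path-to-.dvc-file>
restore_dataset() {
  rev=$1
  pointer=$2
  # Bring back the pointer file as it was at that revision...
  git checkout "$rev" -- "$pointer" || return 1
  # ...then ask DVC to sync the workspace data to match the pointer.
  dvc checkout "$pointer"
}

# Example (run inside a repository that already tracks the file):
#   restore_dataset HEAD~1 data/raw_data.csv.dvc
```

If the older version is no longer in the local cache, a `dvc pull` may be needed first so that `dvc checkout` has something to restore.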

5. 🚀 Production Consistency

Ensures that a model in production can be traced back to the exact dataset version it was trained and tested on.
Guards against silent mismatches caused by unnoticed changes to training data after deployment.

6. 🛡️ Compliance and Auditing

Regulated industries require traceability of datasets used for decision-making models.
Dataset versioning supports audit trails and compliance reports.

💡 What is DVC?

DVC (Data Version Control) is an open-source tool that helps track, version, and manage data, models, and experiments in machine learning (ML) workflows — similar to how Git tracks code.

⚙️ Why DVC?

ML projects involve:

Large datasets and model files (often not suitable to store directly in Git)
Reproducibility issues due to dynamic data, changing experiments, etc.
Need for collaboration on both code and data

🔍 What Does DVC Do?

| Feature | Description |
| --- | --- |
| 🔄 Data Versioning | Track large files (datasets, models) via lightweight metadata in Git |
| ⚙️ Pipelines | Define data processing and model training workflows (like Makefiles for ML) |
| 💾 Remote Storage | Sync data/models to S3, GCS, Azure, SSH, etc. |
| 🔬 Experiment Tracking | Track hyperparameters, code, data, and results for each experiment |
| 🔗 Git Integration | Works alongside Git for full project versioning |

🧰 Installing DVC

DVC can be installed on macOS, Linux, and Windows using various methods depending on your environment.

🖥 macOS

Using Homebrew (Recommended)
```
 brew install dvc
```
Using pip (Python Package Manager)
```
 pip install dvc
```
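✅ Make sure a recent Python 3 is installed (current DVC releases no longer support Python 3.6; check the DVC install docs for the minimum version)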
Install specific extras (e.g., S3 support)
```
 pip install "dvc[s3]"
```

🐧 Linux

Using pip (Recommended)
```
 pip install dvc
```
With specific remote support (e.g., GDrive, SSH, etc.)
```
 pip install "dvc[gdrive,ssh]"
```
Using Snap
```
 sudo snap install dvc --classic
```
Using Conda
```
 conda install -c conda-forge dvc
```

🪟 Windows

Using pip (Recommended)
```
 pip install dvc
```
Using Chocolatey
```
 choco install dvc
```
✅ Run PowerShell as Administrator
Using Conda
```
 conda install -c conda-forge dvc
```

✅ Verify Installation

dvc --version
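
If the command is not found, the install location is probably not on your PATH. A small sketch of a check you could drop into a setup script follows; `check_dvc` is a hypothetical helper name, and it only relies on `dvc --version` from above.

```shell
# check_dvc is a hypothetical helper name.
# It reports the installed version, or fails with a hint if dvc is missing.
check_dvc() {
  if command -v dvc >/dev/null 2>&1; then
    echo "dvc found: $(dvc --version)"
  else
    echo "dvc not found on PATH; re-open your shell or re-check the install step" >&2
    return 1
  fi
}

# Example:
#   check_dvc
```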

🔧 How DVC Works

🧠 Core Concept

Git tracks code and small metadata files.
DVC manages large data files, model artifacts, and pipeline stages.
DVC creates lightweight metafiles (.dvc, dvc.yaml, dvc.lock) that Git can version.
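
To get a feel for what a "lightweight metafile" means, here is a rough imitation of the idea in plain shell. This is not DVC's actual implementation: real `.dvc` files carry more fields and the real cache layout may differ. It assumes GNU coreutils (`md5sum`), and all paths are demo paths in a temp directory.

```shell
# Rough imitation of the idea behind `dvc add`; not DVC's actual code.
# Assumes GNU coreutils (md5sum); all paths are hypothetical demo paths.
demo=$(mktemp -d)
mkdir -p "$demo/data" "$demo/cache"
printf 'id,label\n1,cat\n2,dog\n' > "$demo/data/raw_data.csv"

# 1. Identify the file by the hash of its content.
md5=$(md5sum "$demo/data/raw_data.csv" | cut -d' ' -f1)

# 2. Store a copy in a cache keyed by that hash.
cache_path="$demo/cache/$(printf '%s' "$md5" | cut -c1-2)/$(printf '%s' "$md5" | cut -c3-)"
mkdir -p "$(dirname "$cache_path")"
cp "$demo/data/raw_data.csv" "$cache_path"

# 3. Write a small text pointer that Git can version instead of the data.
printf 'outs:\n- md5: %s\n  path: raw_data.csv\n' "$md5" > "$demo/data/raw_data.csv.dvc"

cat "$demo/data/raw_data.csv.dvc"
```

The pointer is a few lines of text no matter how large the dataset is, which is why it is comfortable to keep in Git while the data itself lives in the cache and in remote storage.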

⚙️ Workflow Integration

🗂️ Version Control Everything

Git stores the pipeline definition (dvc.yaml, dvc.lock) and pointers to data.
DVC stores large files remotely (e.g., S3, GCS, SSH, etc.).

👥 Collaborate

Team members pull the repo via Git.
Run dvc pull to fetch required datasets/models.
Run dvc repro to reproduce the entire pipeline.

🚀 Step-by-Step Workflow

1️⃣ Initialize Git & DVC

git init
dvc init
git commit -m "Initialize Git and DVC"

2️⃣ Track Data and Models

dvc add data/raw_data.csv
git add data/raw_data.csv.dvc data/.gitignore
git commit -m "Track raw data with DVC"

3️⃣ Remote Storage Configuration

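# Set up remote storage
dvc remote add -d myremote s3://mybucket/dvcstore

# Commit the remote configuration so teammates get it too
git add .dvc/config
git commit -m "Configure DVC remote storage"

# Push data to remote
dvc push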
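
To try the same flow without cloud credentials, a directory on disk can serve as the remote. A minimal sketch follows, assuming you are inside an initialized DVC project; `setup_local_remote` is a hypothetical helper and `localremote` is an arbitrary remote name.

```shell
# setup_local_remote is a hypothetical helper; "localremote" is an arbitrary name.
# Usage: setup_local_remote <absolute-directory>
setup_local_remote() {
  store=$1
  mkdir -p "$store" || return 1
  # -d marks this remote as the default target for dvc push / dvc pull.
  dvc remote add -d localremote "$store" || return 1
  # Show the configured remotes to confirm the change.
  dvc remote list
}

# Example (run inside an initialized DVC project):
#   setup_local_remote /tmp/dvcstore
#   dvc push
```

An absolute path keeps the remote location unambiguous regardless of which directory the command is run from.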

4️⃣ Track Machine Learning Models

Save and track model files after training.

mkdir -p models
mv model.pkl models/model.pkl
dvc add models/model.pkl
git add models/model.pkl.dvc models/.gitignore
git commit -m "Track ML model with DVC"
dvc push

5️⃣ Define & Track ML Pipeline

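# Note: `dvc run` was removed in DVC 3.0. `dvc stage add` defines a stage; `dvc repro` runs it.
dvc stage add -n preprocess \
  -d data/raw_data.csv -o data/processed \
  python scripts/preprocess.py

# models/model.pkl was tracked with `dvc add` in step 4. Drop that pointer first,
# otherwise DVC rejects the stage because two entries claim the same output.
dvc remove models/model.pkl.dvc

dvc stage add -n train_model \
  -d src/train.py -d data/raw_data.csv \
  -o models/model.pkl \
  python src/train.py data/raw_data.csv models/model.pkl

# Run both stages; this also writes dvc.lock
dvc repro

git add dvc.yaml dvc.lock data/.gitignore models
git commit -m "Add preprocess and train_model stages to pipeline"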
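
Stage definitions end up in `dvc.yaml`, which can also be written or edited by hand. As a sketch, the `preprocess` stage above would correspond to something like the file below. It is written into a scratch directory here so nothing in a real project gets overwritten; the `scratch` variable is just for the demo.

```shell
# Writes the file into a scratch directory so a real project is left untouched.
scratch=$(mktemp -d)

cat > "$scratch/dvc.yaml" <<'EOF'
stages:
  preprocess:
    cmd: python scripts/preprocess.py
    deps:
      - data/raw_data.csv
    outs:
      - data/processed
EOF

cat "$scratch/dvc.yaml"
```

Keeping the pipeline as a plain YAML file is what makes it reviewable in pull requests like any other code change.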

6️⃣ Reproduce the Pipeline

To rerun pipeline stages when dependencies change:

dvc repro
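
Before rerunning, it can help to see what DVC thinks has changed. The sketch below wraps that into one step; `rerun_stage` is a hypothetical helper name, and the stage name you pass should exist in your `dvc.yaml`.

```shell
# rerun_stage is a hypothetical helper name; pass a stage name from dvc.yaml.
rerun_stage() {
  stage=$1
  # Report which dependencies or outputs of the stage have changed.
  dvc status "$stage" || return 1
  # Print the commands that would run, without executing them.
  dvc repro --dry "$stage" || return 1
  # Actually rerun the stage (and any upstream stages that are out of date).
  dvc repro "$stage"
}

# Example:
#   rerun_stage train_model
```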

7️⃣ Visualize the Pipeline

To visualize dependencies and outputs:

dvc dag
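
The default output is ASCII art in the terminal. If you would rather keep the graph as a file, one option is the DOT format, sketched below; `export_dag` is a hypothetical helper name and the output file name is up to you.

```shell
# export_dag is a hypothetical helper; the output file name is up to you.
# Usage: export_dag <output-file>
export_dag() {
  # --dot prints the graph in Graphviz DOT format instead of ASCII art.
  dvc dag --dot > "$1"
}

# Example:
#   export_dag pipeline.dot
# With Graphviz installed, the file can then be rendered:
#   dot -Tpng pipeline.dot -o pipeline.png
```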

8️⃣ Collaborate with DVC Remotes

To sync datasets/models among team members:

git pull
dvc pull
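
Both directions of the sync can be sketched as a pair of helpers. The function names are hypothetical, and both assume the Git remote and the DVC remote are already configured.

```shell
# Hypothetical helper names; both assume Git and DVC remotes are already configured.

# Author side: upload data first, then publish the pointers that reference it.
publish_changes() {
  dvc push || return 1
  git push
}

# Teammate side: fetch the latest pointers, then the data they reference.
sync_workspace() {
  git pull || return 1
  dvc pull
}
```

Pushing data before pushing to Git is a deliberate ordering: teammates never receive a pointer to data that has not been uploaded yet.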

🎯 Benefits for Pipeline Reproducibility

Data + Code Coupling: Git handles the code; DVC aligns data versions with code versions.
Reproducibility: dvc.lock captures exact inputs, outputs, and commands.
Collaboration: Teams reproduce results reliably by syncing Git + DVC.
Modularity: Pipelines built from multiple stages (like preprocess, train, evaluate).

📖 Learning Resources

📘 Official DVC Documentation
📘 DVC Get Started Guide
📘 Why Use DVC?
📘 DVC + Git Workflow Explained

🔥 Challenges

💡 Set up DVC in a new or existing Git-based ML project
💡 Add and track a dataset (data.csv) using dvc add
💡 Commit and push the changes to GitHub, and push the data to a local DVC remote
💡 Clone the project in a new folder and use dvc pull to reproduce the dataset
💡 Write a README section on “How to use DVC for data versioning in this project”
💡 Set up S3 or GCS as a remote and push/pull data to/from the cloud

🤷🏻 How to Participate?

✅ Complete the tasks and challenges
✅ Document your progress and key takeaways in a GitHub README, or on Medium or Hashnode

Follow me on LinkedIn

Follow me on GitHub

Keep Learning……
