Day 3 of 30-Day MLOps Challenge: Mastering Data Versioning with DVC

I am Bittu Sharma, a DevOps & AI Engineer with a keen interest in building intelligent, automated systems. My goal is to bridge the gap between software engineering and data science, ensuring scalable deployments and efficient model operations in production. Let's connect! I would love the opportunity to connect and contribute. Feel free to DM me on LinkedIn or reach out to me at bittush9534@gmail.com. I look forward to connecting and networking with people in this exciting tech world.
Key Learnings
Why versioning datasets is as important as versioning code in ML workflows
How DVC (Data Version Control) integrates with Git for full pipeline reproducibility
How to use DVC to track datasets, models, and pipelines
Basics of setting up a DVC project, connecting remote storage, and managing large files
How DVC enables collaboration across ML teams by standardizing data + code versioning
Learn Here: What is Data Versioning?
Data versioning in ML is the practice of tracking, managing, and controlling changes to the datasets used throughout the machine learning lifecycle, similar to how Git tracks code versions.

Tools Used for Data Versioning
| Tool | Description |
| --- | --- |
| DVC (Data Version Control) | Git-like version control for data and models |
| LakeFS | Git-style versioning for object stores (e.g., S3) |
| Pachyderm | Data lineage and versioning built into pipelines |
| Weights & Biases / MLflow | Can log and track dataset artifacts and metadata |
Why Versioning Datasets is as Important as Versioning Code
1. Reproducibility
Just like code, the training dataset determines the behavior of the machine learning model.
Without dataset versioning, results are hard to reproduce, because even slight changes in the data can lead to different model performance.
Reproducibility is critical for debugging, research validation, audits, and regulated environments.
2. Experiment Tracking
Tracking which dataset version was used in each experiment is crucial for evaluating model performance over time.
Allows comparison of results across different dataset iterations.
Tools like MLflow, DVC, and Weights & Biases can record the dataset version alongside each run, so metrics are always compared against known data.
3. Collaboration
In teams, multiple members may work with the same project. Dataset versioning ensures everyone is working on the same, consistent data.
Prevents the confusion that arises from ad-hoc data modifications.
Enables parallel experimentation on different branches of data.
4. Model Performance Monitoring
Data changes (like new features, additional rows, or re-labeling) can impact model performance.
Versioning allows tracking of which data changes led to a performance improvement or degradation.
Supports rollback to previous versions in case of anomalies.
5. Production Consistency
Ensures that the model in production can be traced back to the exact dataset version it was trained and tested on.
Guards against unnoticed changes to the training data after deployment, so a retrained model is not silently built on different data.
6. Compliance and Auditing
Regulated industries require traceability of datasets used for decision-making models.
Dataset versioning supports audit trails and compliance reports.
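The reproducibility and auditing points above both rest on being able to tell, cheaply, whether a dataset has changed. Here is a minimal sketch of that idea using a content hash. The function names and the tiny CSV are made up for illustration, and SHA-256 is used for convenience (DVC itself records MD5 hashes by default).

```python
import hashlib
import tempfile
from pathlib import Path


def file_fingerprint(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def demo() -> tuple[str, str]:
    """Fingerprint a tiny, made-up CSV before and after relabeling one row."""
    with tempfile.TemporaryDirectory() as tmp:
        dataset = Path(tmp) / "data.csv"
        dataset.write_bytes(b"id,label\n1,cat\n2,dog\n")
        before = file_fingerprint(dataset)
        dataset.write_bytes(b"id,label\n1,cat\n2,cat\n")  # a single label changed
        after = file_fingerprint(dataset)
    return before, after


if __name__ == "__main__":
    before, after = demo()
    print("before:", before)
    print("after: ", after)
    print("changed:", before != after)
```

Storing a fingerprint like this next to each experiment or release is enough to answer "was this the same data?" later, which is the question a data versioning tool automates.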
What is DVC?
DVC (Data Version Control) is an open-source tool that helps track, version, and manage data, models, and experiments in machine learning (ML) workflows, similar to how Git tracks code.

Why DVC?
ML projects involve:
Large datasets and model files (often not suitable to store directly in Git)
Reproducibility issues due to dynamic data, changing experiments, etc.
Need for collaboration on both code and data
What Does DVC Do?
| Feature | Description |
| --- | --- |
| Data Versioning | Track large files (datasets, models) via lightweight metadata in Git |
| Pipelines | Define data processing and model training workflows (like Makefiles for ML) |
| Remote Storage | Sync data/models to S3, GCS, Azure, SSH, etc. |
| Experiment Tracking | Track hyperparameters, code, data, and results for each experiment |
| Git Integration | Works alongside Git for full project versioning |
Installing DVC
DVC can be installed on macOS, Linux, and Windows using various methods depending on your environment.
macOS
Using Homebrew (Recommended)
brew install dvc
Using pip (Python Package Manager)
pip install dvc
Make sure Python 3 is installed first (DVC 3.x requires Python 3.8 or newer).
Install specific extras (e.g., S3 support)
pip install "dvc[s3]"
Linux
Using pip (Recommended)
pip install dvc
With specific remote support (e.g., GDrive, SSH, etc.)
pip install "dvc[gdrive,ssh]"
Using Snap
sudo snap install dvc --classic
Using Conda
conda install -c conda-forge dvc
Windows
Using pip (Recommended)
pip install dvc
Using Chocolatey
choco install dvc
Run PowerShell as Administrator for this step.
Using Conda
conda install -c conda-forge dvc
Verify Installation
dvc --version
How DVC Works

Core Concept
Git tracks code and small metadata files.
DVC manages large data files, model artifacts, and pipeline stages.
DVC creates lightweight metafiles (.dvc, dvc.yaml, dvc.lock) that Git can version.
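To make the "lightweight metafile" idea concrete, here is a rough sketch of a content-addressed cache with a small pointer record. It is an illustration of the concept only: the function names, the JSON pointer, and the cache layout are assumptions for this example, while real `.dvc` files are YAML and DVC manages its own cache format.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def add_to_cache(data_file: Path, cache_dir: Path) -> dict:
    """Copy data_file into a content-addressed cache and return a small pointer record."""
    digest = hashlib.sha256(data_file.read_bytes()).hexdigest()
    # The blob's location is derived from its content hash (illustrative layout)
    blob = cache_dir / digest[:2] / digest[2:]
    blob.parent.mkdir(parents=True, exist_ok=True)
    if not blob.exists():
        shutil.copyfile(data_file, blob)
    return {"path": data_file.name, "hash": digest, "size": data_file.stat().st_size}


def checkout(pointer: dict, cache_dir: Path, target: Path) -> None:
    """Restore the file described by pointer from the cache to target."""
    digest = pointer["hash"]
    shutil.copyfile(cache_dir / digest[:2] / digest[2:], target)


def demo() -> tuple[dict, bytes]:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        cache = root / "cache"
        data = root / "raw_data.csv"
        data.write_bytes(b"id,label\n1,cat\n2,dog\n")
        pointer = add_to_cache(data, cache)
        # Only this small pointer would be committed to Git, not the data itself
        (root / "raw_data.csv.json").write_text(json.dumps(pointer))
        data.unlink()  # simulate a fresh clone that has the pointer but no data
        checkout(pointer, cache, data)
        return pointer, data.read_bytes()


if __name__ == "__main__":
    pointer, restored = demo()
    print(pointer)
    print(restored.decode())
```

The design choice worth noticing is the split: the pointer is tiny and lives in Git history, while the bytes live in a store keyed by hash, so the same Git commit always resolves to the same data.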
Workflow Integration
Version Control Everything
Git stores the pipeline definition (dvc.yaml, dvc.lock) and pointers to data.
DVC stores large files remotely (e.g., S3, GCS, SSH, etc.).
Collaborate
Team members pull the repo via Git.
Run dvc pull to fetch required datasets/models.
Run dvc repro to reproduce the entire pipeline.
Step-by-Step Workflow
1. Initialize Git & DVC
git init
dvc init
git commit -m "Initialize Git and DVC"
2. Track Data and Models
dvc add data/raw_data.csv
# dvc add writes the ignore rule to data/.gitignore, next to the tracked file
git add data/raw_data.csv.dvc data/.gitignore
git commit -m "Track raw data with DVC"
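3. Remote Storage Configuration
# Set up remote storage (the bucket name is a placeholder)
dvc remote add -d myremote s3://mybucket/dvcstore
# The remote is recorded in .dvc/config, so commit it for teammates
git add .dvc/config
git commit -m "Configure DVC remote storage"
# Push data to remote
dvc push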
4. Track Machine Learning Models
Save and track model files after training.
mkdir -p models
mv model.pkl models/model.pkl
dvc add models/model.pkl
git add models/model.pkl.dvc models/.gitignore
git commit -m "Track ML model with DVC"
dvc push
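5. Define & Track ML Pipeline
# dvc run was removed in DVC 3.0: dvc stage add defines a stage, dvc repro runs it and writes dvc.lock
dvc stage add -n preprocess \
-d data/raw_data.csv -o data/processed \
python scripts/preprocess.py
dvc repro
git add dvc.yaml dvc.lock data/.gitignore
git commit -m "Add preprocess stage to pipeline"
# If models/model.pkl is still tracked by models/model.pkl.dvc from step 4,
# run dvc remove models/model.pkl.dvc first, since an output can have only one owner
dvc stage add -n train_model \
-d src/train.py -d data/raw_data.csv \
-o models/model.pkl \
python src/train.py data/raw_data.csv models/model.pkl
dvc repro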
6. Reproduce the Pipeline
To rerun pipeline stages when dependencies change:
dvc repro
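Conceptually, rerunning "when dependencies change" means comparing the current hash of each dependency with the one recorded after the last run. The sketch below shows only that comparison. The names `repro` and `lock` are made up for this example, and the real `dvc repro` also takes the stage command and outputs into account.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


def file_hash(path: Path) -> str:
    """Content hash of a dependency file."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def repro(deps: list[Path], lock: dict[str, str], run_stage: Callable[[], None]) -> bool:
    """Run the stage only if a dependency changed since the lock was written.

    Returns True if the stage ran, False if it was skipped as up to date.
    """
    current = {str(dep): file_hash(dep) for dep in deps}
    if current == lock:
        return False
    run_stage()
    # Record the hashes this run was based on, like a lock file would
    lock.clear()
    lock.update(current)
    return True


def demo() -> list[bool]:
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp) / "raw_data.csv"
        raw.write_bytes(b"x,y\n1,2\n")
        lock: dict[str, str] = {}

        def stage() -> None:
            pass  # stand-in for the real preprocessing or training command

        ran = [repro([raw], lock, stage)]      # nothing recorded yet, so it runs
        ran.append(repro([raw], lock, stage))  # data unchanged, so it is skipped
        raw.write_bytes(b"x,y\n1,3\n")
        ran.append(repro([raw], lock, stage))  # data changed, so it runs again
    return ran


if __name__ == "__main__":
    print(demo())
```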
7. Visualize the Pipeline
To visualize dependencies and outputs:
dvc dag
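A pipeline is a directed acyclic graph of stages, which is why it can be drawn and also why the stages have a well-defined run order. As a small sketch of that structure, here is a hypothetical three-stage pipeline ordered with Python's standard `graphlib` module (Python 3.9+). The stage names and their dependencies are assumptions for illustration, not the output of `dvc dag`.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each stage maps to the set of stages it depends on
PIPELINE: dict[str, set[str]] = {
    "preprocess": set(),
    "train_model": {"preprocess"},
    "evaluate": {"train_model"},
}


def execution_order(pipeline: dict[str, set[str]]) -> list[str]:
    """Return the stages in an order where every stage follows its dependencies."""
    return list(TopologicalSorter(pipeline).static_order())


if __name__ == "__main__":
    print(" -> ".join(execution_order(PIPELINE)))
```

If the graph contained a cycle, `static_order()` would raise `graphlib.CycleError`, which mirrors why a pipeline's stages cannot depend on each other in a loop.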
8. Collaborate with DVC Remotes
To sync datasets/models among team members:
git pull
dvc pull
Benefits for Pipeline Reproducibility
Data + Code Coupling: Git handles the code; DVC aligns data versions with code versions.
Reproducibility: dvc.lock captures exact inputs, outputs, and commands.
Collaboration: Teams reproduce results reliably by syncing Git + DVC.
Modularity: Pipelines are built from multiple stages (like preprocess, train, evaluate).
Learning Resources
Why Use DVC?
DVC + Git Workflow Explained
Challenges
Set up DVC in a new or existing Git-based ML project
Add and track a dataset (data.csv) using dvc add
Commit and push the changes to GitHub and to a local DVC remote
Clone the project in a new folder and use dvc pull to reproduce the dataset
Write a README section on "How to use DVC for data versioning in this project"
Set up S3 or GCS as a remote and push/pull data to/from the cloud
How to Participate?
Complete the tasks and challenges
Document your progress and key takeaways in a GitHub README, on Medium, or on Hashnode
Follow me on LinkedIn
Follow me on GitHub
Keep Learning...




