Day 3 of 30-Day MLOps Challenge: Mastering Data Versioning with DVC


I am Bittu Sharma, a DevOps & AI Engineer with a keen interest in building intelligent, automated systems. My goal is to bridge the gap between software engineering and data science, ensuring scalable deployments and efficient model operations in production. Let's connect! I would love the opportunity to connect and contribute. Feel free to DM me on LinkedIn or reach out to me at bittush9534@gmail.com. I look forward to networking with people in this exciting tech world.

📚 Key Learnings

  • Why versioning datasets is as important as versioning code in ML workflows

  • How DVC (Data Version Control) integrates with Git for full pipeline reproducibility

  • How to use DVC to track datasets, models, and pipelines

  • Basics of setting up a DVC project, connecting remote storage, and managing large files

  • How DVC enables collaboration across ML teams by standardizing data + code versioning


🧠 Learn Here: What is Data Versioning?

Data Versioning in ML is the practice of tracking, managing, and controlling changes to datasets used throughout the machine learning lifecycle, similar to how Git tracks code versions.


🧰 Tools Used for Data Versioning

| Tool | Description |
| --- | --- |
| DVC (Data Version Control) | Git-like version control for data and models |
| lakeFS | Git-style versioning for object stores (e.g., S3) |
| Pachyderm | Data lineage and versioning built into pipelines |
| Weights & Biases / MLflow | Can log and track dataset artifacts and metadata |

⚖️ Why Versioning Datasets is as Important as Versioning Code

1. 🧬 Reproducibility

  • Just like code, the training dataset determines the behavior of the machine learning model.

  • Without dataset versioning, it becomes impossible to reproduce results because even slight changes in data can lead to different model performance.

  • Reproducibility is critical for debugging, research validation, audits, and regulated environments.


2. 📈 Experiment Tracking

  • Tracking which dataset version was used in each experiment is crucial for evaluating model performance over time.

  • Allows comparison of results across different dataset iterations.

  • Tools like MLflow, DVC, and Weights & Biases can record the dataset version alongside each run, so metrics are always compared against a known data baseline.


3. 🤝 Collaboration

  • In teams, multiple members may work with the same project. Dataset versioning ensures everyone is working on the same, consistent data.

  • Prevents the confusion that arises from ad-hoc data modifications.

  • Enables parallel experimentation on different branches of data.


4. 📊 Model Performance Monitoring

  • Data changes (like new features, additional rows, or re-labeling) can impact model performance.

  • Versioning lets you trace which data changes led to a performance improvement or degradation.

  • Supports rollback to previous versions in case of anomalies.


5. 🚀 Production Consistency

  • Ensures that the model deployed to production can be traced back to the exact dataset version it was trained and tested on.

  • Guards against unnoticed changes to the training data after deployment, which would otherwise break the link between the deployed model and its data.


6. 🛡️ Compliance and Auditing

  • Regulated industries require traceability of datasets used for decision-making models.

  • Dataset versioning supports audit trails and compliance reports.


💡 What is DVC?

DVC (Data Version Control) is an open-source tool that helps track, version, and manage data, models, and experiments in machine learning (ML) workflows, similar to how Git tracks code.


⚙️ Why DVC?

ML projects involve:

  • Large datasets and model files (often not suitable to store directly in Git)

  • Reproducibility issues due to dynamic data, changing experiments, etc.

  • Need for collaboration on both code and data


🔍 What Does DVC Do?

| Feature | Description |
| --- | --- |
| 🔄 Data Versioning | Track large files (datasets, models) via lightweight metadata in Git |
| ⚙️ Pipelines | Define data processing and model training workflows (like Makefiles for ML) |
| 💾 Remote Storage | Sync data/models to S3, GCS, Azure, SSH, etc. |
| 🔬 Experiment Tracking | Track hyperparameters, code, data, and results for each experiment |
| 🔗 Git Integration | Works alongside Git for full project versioning |

🧰 Installing DVC

DVC can be installed on macOS, Linux, and Windows using various methods depending on your environment.


🖥 macOS

  1. Using Homebrew (Recommended)

     brew install dvc
    
  2. Using pip (Python Package Manager)

     pip install dvc
    

    ✅ Make sure a recent Python 3 is installed (current DVC releases no longer support Python 3.6)

  3. Install specific extras (e.g., S3 support)

     pip install "dvc[s3]"
    

🐧 Linux

  1. Using pip (Recommended)

     pip install dvc
    
  2. With specific remote support (e.g., GDrive or SSH)

     pip install "dvc[gdrive,ssh]"
    
  3. Using Snap

     sudo snap install dvc --classic
    
  4. Using Conda

     conda install -c conda-forge dvc
    

🪟 Windows

  1. Using pip (Recommended)

     pip install dvc
    
  2. Using Chocolatey

     choco install dvc
    

    ✅ Run PowerShell as Administrator

  3. Using Conda

     conda install -c conda-forge dvc
    

✅ Verify Installation

dvc --version

🔧 How DVC Works

🧠 Core Concept

  • Git tracks code and small metadata files.

  • DVC manages large data files, model artifacts, and pipeline stages.

  • DVC creates lightweight metafiles (.dvc, dvc.yaml, dvc.lock) that Git can version.
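
To make the idea of a lightweight metafile concrete, here is a minimal sketch that does by hand roughly what tracking a file involves: hash the data, store the content under that hash, and write a tiny pointer file that Git can version. The scratch directory, the cache folder, and the .pointer extension are made up for illustration. Real .dvc files carry similar fields (md5, size, path), but DVC's actual cache layout and file format differ in the details.

```shell
set -eu
workdir="$(mktemp -d)"   # scratch directory; nothing outside it is touched
cd "$workdir"
mkdir -p data cache
printf 'id,label\n1,cat\n2,dog\n' > data/raw_data.csv

# Hash and size identify this exact version of the file
md5="$(md5sum data/raw_data.csv | cut -d' ' -f1)"
size="$(wc -c < data/raw_data.csv | tr -d ' ')"

# Content-addressed storage: the storage path is derived from the hash
prefix="$(printf '%s' "$md5" | cut -c1-2)"
rest="$(printf '%s' "$md5" | cut -c3-)"
mkdir -p "cache/$prefix"
cp data/raw_data.csv "cache/$prefix/$rest"

# The pointer file is the only thing Git needs to version (illustrative format)
cat > data/raw_data.csv.pointer <<EOF
outs:
- md5: $md5
  size: $size
  path: raw_data.csv
EOF
cat data/raw_data.csv.pointer
```

The pointer stays a few lines long no matter how large the dataset grows, which is why Git can handle it comfortably while the heavy content lives elsewhere.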


⚙️ Workflow Integration

🗂️ Version Control Everything

  • Git stores the pipeline definition (dvc.yaml, dvc.lock) and pointers to data.

  • DVC stores large files remotely (e.g., S3, GCS, or SSH).

👥 Collaborate

  • Team members clone or pull the repo via Git.

  • Run dvc pull to fetch required datasets/models.

  • Run dvc repro to reproduce the entire pipeline.


🚀 Step-by-Step Workflow

1️⃣ Initialize Git & DVC

git init
dvc init
git commit -m "Initialize Git and DVC"

2️⃣ Track Data and Models

dvc add data/raw_data.csv
# dvc add writes the pointer file and lists the data file in data/.gitignore
git add data/raw_data.csv.dvc data/.gitignore
git commit -m "Track raw data with DVC"
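
Once the pointer file is in Git, going back to an older dataset is a two-step move: restore the old pointer from Git, then restore the matching content. With DVC that is typically git checkout <commit> -- data/raw_data.csv.dvc followed by dvc checkout. The sketch below imitates those two steps with plain Git and a hand-made cache so it runs without DVC. The track_file and commit_pointer helpers, the .pointer extension, and the flat cache folder are all hypothetical stand-ins, not DVC internals.

```shell
set -eu
repo="$(mktemp -d)"   # scratch repository
cd "$repo"
git init -q
mkdir -p cache

# Hypothetical helper: store the file under its MD5 and write a pointer
track_file() {
  file_md5="$(md5sum "$1" | cut -d' ' -f1)"
  cp "$1" "cache/$file_md5"
  printf 'md5: %s\n' "$file_md5" > "$1.pointer"
}

# Hypothetical helper: commit only the pointer, never the data itself
commit_pointer() {
  git add -- data.csv.pointer
  git -c user.name=demo -c user.email=demo@example.com commit -q -m "$1"
}

# Version 1 of the dataset
printf 'id,label\n1,cat\n' > data.csv
track_file data.csv
commit_pointer "data v1"
v1_md5="$(cut -d' ' -f2 data.csv.pointer)"

# Version 2 adds a row
printf 'id,label\n1,cat\n2,dog\n' > data.csv
track_file data.csv
commit_pointer "data v2"

# Roll back: old pointer from Git, then the matching content from the cache
git checkout -q HEAD~1 -- data.csv.pointer
old_md5="$(cut -d' ' -f2 data.csv.pointer)"
cp "cache/$old_md5" data.csv
cat data.csv
```

Git only ever sees the one-line pointer, yet checking out an old commit is enough to know exactly which bytes belong to that version.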

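3️⃣ Remote Storage Configuration

# Set up remote storage (-d makes it the default remote)
dvc remote add -d myremote s3://mybucket/dvcstore

# Commit the remote configuration so collaborators get it too
git add .dvc/config
git commit -m "Configure DVC remote storage"

# Push data to remote
dvc push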
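
The remote definition itself lives in .dvc/config, a small INI-style file that Git tracks. As a rough illustration, the sketch below writes an equivalent file by hand into a scratch directory and reads the values back. The remote name and bucket are the ones from the example above. In a real project DVC writes this file for you when you run dvc remote add, so treat this only as a look at what ends up in the repository.

```shell
set -eu
project="$(mktemp -d)"   # scratch directory standing in for a project root
mkdir -p "$project/.dvc"

# Roughly what "dvc remote add -d myremote s3://mybucket/dvcstore" records
cat > "$project/.dvc/config" <<'EOF'
[core]
    remote = myremote
['remote "myremote"']
    url = s3://mybucket/dvcstore
EOF

# Read back the default remote name and its URL
default_remote="$(sed -n 's/^ *remote = //p' "$project/.dvc/config")"
remote_url="$(sed -n 's/^ *url = //p' "$project/.dvc/config")"
echo "default remote: $default_remote -> $remote_url"
```

Since this file is committed, keep secrets out of it. Options set with dvc remote modify --local go to .dvc/config.local, which is not tracked by Git.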

4️⃣ Track Machine Learning Models

Save and track model files after training.

# Make sure the target directory exists before moving the model
mkdir -p models
mv model.pkl models/model.pkl
dvc add models/model.pkl
git add models/model.pkl.dvc models/.gitignore
git commit -m "Track ML model with DVC"
dvc push

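5️⃣ Define & Track ML Pipeline

# dvc run was removed in DVC 3.0; dvc stage add defines a stage in dvc.yaml
dvc stage add -n preprocess \
  -d data/raw_data.csv -o data/processed \
  python scripts/preprocess.py

# An output can belong to only one stage: if models/model.pkl was tracked
# with dvc add in step 4, run "dvc remove models/model.pkl.dvc" first
dvc stage add -n train_model \
  -d src/train.py -d data/raw_data.csv \
  -o models/model.pkl \
  python src/train.py data/raw_data.csv models/model.pkl

# Execute both stages (this writes dvc.lock), then commit the pipeline files
dvc repro
git add dvc.yaml dvc.lock
git commit -m "Add preprocess and train_model stages to pipeline"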
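
The stage definitions from this step are stored in dvc.yaml, which is plain YAML and can also be edited by hand. The sketch below writes a dvc.yaml for the two stages into a scratch directory and lists the stage names. It assumes the dataset path data/raw_data.csv from step 2 and the script paths used above, and the scratch directory is only there so that no real project is touched.

```shell
set -eu
project="$(mktemp -d)"   # scratch directory so no real project is touched

# Hand-written equivalent of the preprocess and train_model stages
cat > "$project/dvc.yaml" <<'EOF'
stages:
  preprocess:
    cmd: python scripts/preprocess.py
    deps:
      - data/raw_data.csv
    outs:
      - data/processed
  train_model:
    cmd: python src/train.py data/raw_data.csv models/model.pkl
    deps:
      - src/train.py
      - data/raw_data.csv
    outs:
      - models/model.pkl
EOF

# List the stage names (the keys indented by exactly two spaces)
stages="$(grep -E '^  [a-z_]+:$' "$project/dvc.yaml" | tr -d ' :')"
echo "$stages"
```

Note that train_model reads the raw CSV here, so the two stages are independent of each other. If training should consume the preprocessed data instead, listing data/processed under its deps is what chains the stages together.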

6️⃣ Reproduce the Pipeline

To rerun pipeline stages when dependencies change:

dvc repro
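
The reason dvc repro can skip work is that it compares the current state of each stage's dependencies with what was recorded on the last run. Here is a toy version of that idea, assuming a single dependency and a single recorded checksum. The run_stage function, the stage.lock file, and the line-counting "stage" are invented for this sketch and are far simpler than what DVC does with dvc.lock.

```shell
set -eu
workdir="$(mktemp -d)"   # scratch directory
cd "$workdir"
printf 'a,b\n1,2\n' > input.csv
runs=0

# Hypothetical stage runner: re-executes the command only when the
# dependency's checksum differs from the one recorded in stage.lock
run_stage() {
  current="$(md5sum input.csv | cut -d' ' -f1)"
  recorded="$(cat stage.lock 2>/dev/null || true)"
  if [ "$current" = "$recorded" ]; then
    echo "stage is up to date, skipping"
  else
    echo "dependency changed, running stage"
    wc -l < input.csv > output.txt       # stand-in for a real command
    printf '%s' "$current" > stage.lock  # remember what we ran against
    runs=$((runs + 1))
  fi
}

run_stage                    # first call: nothing recorded yet, so it runs
run_stage                    # input unchanged: skipped
printf '3,4\n' >> input.csv
run_stage                    # input changed: runs again
echo "stage executed $runs times"
```

The same check, applied stage by stage along the dependency graph, is what lets a pipeline rerun only the parts affected by a change.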

7️⃣ Visualize the Pipeline

To visualize dependencies and outputs:

dvc dag

8️⃣ Collaborate with DVC Remotes

To sync datasets/models among team members:

git pull
dvc pull

🎯 Benefits for Pipeline Reproducibility

  • Data + Code Coupling: Git handles the code; DVC aligns data versions with code versions.

  • Reproducibility: dvc.lock records the hashes of each stage's dependencies and outputs, plus the command that was run.

  • Collaboration: Teams reproduce results reliably by syncing Git + DVC.

  • Modularity: Pipelines built from multiple stages (like preprocess, train, evaluate).



🔥 Challenges

💡 Set up DVC in a new or existing Git-based ML project
💡 Add and track a dataset (data.csv) using dvc add
💡 Commit and push the changes to GitHub and a local DVC remote
💡 Clone the project in a new folder and use dvc pull to reproduce the dataset
💡 Write a README section on “How to use DVC for data versioning in this project”
💡 Set up S3 or GCS as a remote and push/pull data to/from the cloud


🀷🏻 How to Participate?

βœ… Complete the tasks and challenges
βœ… Document your progress and key takeaways on GitHub ReadMe, Medium, or Hashnode

Follow me on LinkedIn

Follow me on GitHub

Keep Learning...
