DVC and Its Commands: A Comprehensive Guide for MLOps Engineers

In the world of Machine Learning (ML) and MLOps, managing data, models, and experiments efficiently is crucial. DVC (Data Version Control) is a powerful open-source tool designed to address these challenges. It extends Git to handle large datasets, models, and ML pipelines, making it an essential tool for MLOps engineers.

In this blog, we’ll explore what DVC is, its key features, and a detailed guide to its most important commands. By the end, you’ll have a solid understanding of how to use DVC to streamline your ML workflows.


What is DVC?

DVC is a version control system specifically designed for ML projects. It works alongside Git to manage large files, datasets, and models that are typically too big to be stored in a Git repository. DVC stores small metadata files in Git while keeping the actual data in a local cache that can be synced to remote storage (e.g., Amazon S3, Google Cloud Storage, or a directory on a shared drive).

Key Features of DVC

  1. Data Versioning: Track changes to datasets and models.

  2. Reproducibility: Ensure experiments can be reproduced by versioning data and code together.

  3. Pipeline Management: Define and execute ML pipelines with dependencies.

  4. Collaboration: Share large datasets and models with team members efficiently.

  5. Storage Agnostic: Works with many storage backends, including Amazon S3, Google Cloud Storage, Azure Blob Storage, SSH servers, and local directories.


Why Use DVC?

  1. Handles Large Files: DVC is designed to handle large datasets and models that Git cannot manage efficiently.

  2. Reproducibility: By versioning data and code together, DVC ensures that experiments can be reproduced.

  3. Pipeline Automation: DVC allows you to define and automate ML pipelines, making it easier to manage complex workflows.

  4. Collaboration: DVC makes it easy to share data and models with team members without duplicating files.


DVC Commands: A Hands-On Guide

Below is a detailed guide to the most important DVC commands, organized by functionality.


1. Initializing DVC

To start using DVC in your project, initialize it from the root of an existing Git repository.

Command:

bash

dvc init
  • Initializes DVC in the current directory.

  • Creates a .dvc directory to store metadata and configuration files.

Example:

bash

cd my-ml-project
dvc init
git add .dvc
git commit -m "Initialize DVC"

2. Adding Data to DVC

To version control a dataset or file, use the dvc add command.

Command:

bash

dvc add <file_or_directory>
  • Adds the file or directory to DVC.

  • Stores the data in DVC's cache, creates a .dvc file (metadata) next to it, and adds the data path to .gitignore so that Git does not track the data itself.

Example:

bash

dvc add data.csv
git add data.csv.dvc .gitignore
git commit -m "Add data.csv to DVC"
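
To make the role of the .dvc file concrete, here is a minimal Python sketch of the kind of pointer that dvc add records: a content hash, a size, and a path. It assumes DVC's default MD5 hashing; the helper name describe_file is hypothetical, and the real data.csv.dvc is a YAML file written by DVC that may contain additional fields depending on the version.

```python
import hashlib
import os
import tempfile


def describe_file(path, chunk_size=1024 * 1024):
    """Return a pointer-like record (hash, size, path) for a file.

    Hypothetical helper for illustration; not part of DVC's API.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large datasets do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return {
        "md5": digest.hexdigest(),
        "size": os.path.getsize(path),
        "path": os.path.basename(path),
    }


if __name__ == "__main__":
    # Stand-in for data.csv so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as workdir:
        sample = os.path.join(workdir, "data.csv")
        with open(sample, "wb") as handle:
            handle.write(b"id,label\n1,cat\n2,dog\n")
        print(describe_file(sample))
```

Because the record is only a few lines long, it is cheap to commit to Git, while the file it points to can be arbitrarily large.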

3. Tracking Data with Git

After adding data to DVC, you need to track the metadata file with Git.

Command:

bash

git add <file>.dvc .gitignore
git commit -m "Add <file> to DVC"
  • Commits the .dvc file to Git, which tracks changes to the data.

Example:

bash

git add data.csv.dvc .gitignore
git commit -m "Add data.csv to DVC"

4. Pushing and Pulling Data

DVC allows you to store data in remote storage and sync it across environments.

Commands:

  • Set up remote storage:

    bash

      dvc remote add -d myremote <remote_url>
    
  • Push data to remote storage:

    bash

      dvc push
    
  • Pull data from remote storage:

    bash

      dvc pull
    

Example:

bash

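dvc remote add -d myremote s3://mybucket/dvc-storage
git add .dvc/config
git commit -m "Configure DVC remote"
dvc push   # upload cached data to the remote
dvc pull   # in a fresh clone or on another machine: download the data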
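
Conceptually, dvc push only uploads objects that the remote does not already have, and dvc pull does the reverse. The Python sketch below models that idea with two local directories standing in for the cache and the remote. The function sync_missing and the flat directory layout are hypothetical; DVC's real transfer logic and storage layout are more involved.

```python
import os
import shutil
import tempfile


def sync_missing(source_dir, target_dir):
    """Copy files present in source_dir but absent from target_dir.

    Hypothetical model of a push (cache -> remote) or pull (remote -> cache).
    Returns the sorted names of the files that were copied.
    """
    os.makedirs(target_dir, exist_ok=True)
    existing = set(os.listdir(target_dir))
    copied = []
    for name in sorted(os.listdir(source_dir)):
        if name not in existing:
            shutil.copy2(os.path.join(source_dir, name), os.path.join(target_dir, name))
            copied.append(name)
    return copied


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        cache = os.path.join(root, "cache")
        remote = os.path.join(root, "remote")
        os.makedirs(cache)
        for name in ("aaa111", "bbb222"):
            with open(os.path.join(cache, name), "w") as handle:
                handle.write(name)
        print(sync_missing(cache, remote))  # first "push" copies both objects
        print(sync_missing(cache, remote))  # second "push" has nothing to do
```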

5. Checking Data Status

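To see whether tracked data in your workspace differs from what was last recorded in the .dvc and dvc.lock files, use dvc status. To compare your local cache with remote storage instead, add the -c (--cloud) flag.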

Command:

bash

dvc status
  • Shows which files have changed and need to be updated.

Example:

bash

dvc status      # workspace vs. last recorded state
dvc status -c   # local cache vs. default remote
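
The idea behind this check can be sketched in Python: hash the file in the workspace and compare it with the hash recorded when the file was last tracked. The function names and the recorded value below are hypothetical; DVC reads the recorded hash from its own metadata files.

```python
import hashlib
import os
import tempfile


def file_md5(path):
    """Return the MD5 hex digest of a file's contents."""
    with open(path, "rb") as handle:
        return hashlib.md5(handle.read()).hexdigest()


def has_changed(path, recorded_md5):
    """Report whether the file no longer matches the recorded hash."""
    return file_md5(path) != recorded_md5


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        data = os.path.join(workdir, "data.csv")
        with open(data, "w") as handle:
            handle.write("id,label\n1,cat\n")
        recorded = file_md5(data)           # state when the file was tracked
        print(has_changed(data, recorded))  # False: nothing changed yet
        with open(data, "a") as handle:
            handle.write("2,dog\n")
        print(has_changed(data, recorded))  # True: the file was edited
```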

6. Importing Data

You can import data from external sources using DVC.

Command:

bash

dvc import-url <url> <output_path>
  • Downloads the data from a URL, tracks it with DVC, and records the source so the data can be refreshed later with dvc update.

Example:

bash

dvc import-url https://example.com/data.zip data/

7. Building and Running Pipelines

DVC allows you to define and run ML pipelines.

Commands:

  • Define a pipeline in dvc.yaml:

    yaml

      stages:
        prepare:
          cmd: python src/prepare.py
          deps:
            - src/prepare.py
            - data.csv
          outs:
            - prepared_data.csv
        train:
          cmd: python src/train.py
          deps:
            - src/train.py
            - prepared_data.csv
          outs:
            - model.pkl
    
  • Run the pipeline:

    bash

      dvc repro
    
  • Visualize the pipeline:

    bash

      dvc dag
    

Example:

bash

dvc repro
dvc dag
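
For reference, a stage script such as src/prepare.py only needs to read its declared dependency and write its declared output. The sketch below is a hypothetical version that drops rows with empty fields. The cleaning rule and the function name prepare are assumptions for illustration, since the actual script is not shown in this post.

```python
import csv
import os
import tempfile


def prepare(input_path="data.csv", output_path="prepared_data.csv"):
    """Copy rows from input_path to output_path, skipping incomplete rows.

    Hypothetical cleaning step; returns the number of data rows written.
    """
    written = 0
    with open(input_path, newline="") as src, open(output_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader)
        writer.writerow(header)
        for row in reader:
            # Keep only rows that have a non-empty value for every column.
            if len(row) == len(header) and all(cell.strip() for cell in row):
                writer.writerow(row)
                written += 1
    return written


if __name__ == "__main__":
    # Demo on a temporary file; a real stage script would call prepare() with its defaults.
    with tempfile.TemporaryDirectory() as workdir:
        raw = os.path.join(workdir, "data.csv")
        with open(raw, "w", newline="") as handle:
            handle.write("id,label\n1,cat\n2,\n3,dog\n")
        print(prepare(raw, os.path.join(workdir, "prepared_data.csv")))
```

Keeping the input and output paths identical to the deps and outs entries in dvc.yaml is what lets dvc repro decide when the stage needs to run again.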

8. Tracking Metrics

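DVC can track metrics that your experiments write to a file. For dvc metrics show to find the file without being given a path, list it under the metrics field of the stage that produces it in dvc.yaml.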

Commands:

  • Save metrics in a JSON or YAML file:

    json

      {
        "accuracy": 0.95,
        "loss": 0.05
      }
    
  • Display metrics:

    bash

      dvc metrics show
    

Example:

bash

dvc metrics show metrics.json
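
A training script typically writes this file itself at the end of a run. Here is a minimal Python sketch, assuming the file is named metrics.json as in the example above and using a hypothetical helper write_metrics:

```python
import json
import os
import tempfile


def write_metrics(metrics, path="metrics.json"):
    """Write a flat dict of metric names to numeric values as JSON."""
    with open(path, "w") as handle:
        json.dump(metrics, handle, indent=2)
        handle.write("\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        target = os.path.join(workdir, "metrics.json")
        # Placeholder values; a real script would compute these during evaluation.
        write_metrics({"accuracy": 0.95, "loss": 0.05}, target)
        with open(target) as handle:
            print(handle.read())
```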

9. Comparing Experiments

DVC allows you to compare different experiment runs, such as those created with dvc exp run.

Command:

bash

dvc exp show
  • Displays a comparison of metrics and parameters across experiments.

Example:

bash

dvc exp show

10. Removing Data from DVC

To stop tracking a file with DVC, use:

Command:

bash

dvc remove <file>.dvc
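  • Deletes the .dvc file and the matching .gitignore entry, so DVC stops tracking the file. The data itself stays in the workspace unless you pass --outs, and cached copies remain until they are cleaned up with dvc gc.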

Example:

bash

dvc remove data.csv.dvc

DVC demo code: https://github.com/bittush8789/MLOps-Foundation-/tree/main/03.DVC-demo

Follow me on LinkedIn: https://www.linkedin.com/in/bittu-kumar-54ab13254/