In the world of Machine Learning (ML) and MLOps, managing data, models, and experiments efficiently is crucial. DVC (Data Version Control) is a powerful open-source tool designed to address these challenges. It extends Git to handle large datasets, models, and ML pipelines, making it an essential tool for MLOps engineers.
In this post, we'll explore what DVC is, its key features, and its most important commands. By the end, you'll have a solid understanding of how to use DVC to streamline your ML workflows.
What is DVC?
DVC is a version control system specifically designed for ML projects. It works alongside Git to manage large files, datasets, and models that are typically too big to be stored in a Git repository. DVC stores metadata about these files in Git while keeping the actual data in remote storage (e.g., AWS S3, Google Cloud Storage, or local storage).
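To make the split between Git-tracked metadata and externally stored data concrete, here is a minimal Python sketch of the general idea: hash a file, store its contents under that hash, and keep only a small pointer file for version control. The function name `track_file`, the cache layout, the `.ptr` suffix, and the use of SHA-256 are all illustrative assumptions. This is not DVC's actual implementation or on-disk format.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def track_file(path: Path, cache_dir: Path) -> Path:
    """Copy `path` into a content-addressed cache and write a small pointer file.

    Illustrative only: the layout and pointer format are assumptions,
    not DVC's real file format.
    """
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    cache_dir.mkdir(parents=True, exist_ok=True)
    # The cached copy is named after its content hash, so identical
    # content is stored only once.
    shutil.copyfile(path, cache_dir / digest)
    # The pointer is tiny; it is the only part that would be committed to Git.
    pointer = path.with_name(path.name + ".ptr")
    pointer.write_text(json.dumps({"hash": digest, "path": path.name}))
    return pointer


# Demo in a throwaway directory.
workdir = Path(tempfile.mkdtemp())
data = workdir / "data.csv"
data.write_text("id,label\n1,cat\n2,dog\n")
pointer = track_file(data, workdir / "cache")
pointer_info = json.loads(pointer.read_text())
print(pointer_info)
```

Because the pointer changes whenever the data changes, committing the pointer alongside the code is enough to tie a code version to a data version.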
Key Features of DVC
Data Versioning: Track changes to datasets and models.
Reproducibility: Ensure experiments can be reproduced by versioning data and code together.
Pipeline Management: Define and execute ML pipelines with dependencies.
Collaboration: Share large datasets and models with team members efficiently.
Storage Agnostic: Works with many storage backends, including AWS S3, Google Cloud Storage, Azure Blob Storage, SSH servers, and local directories.
Why Use DVC?
Handles Large Files: DVC is designed to handle large datasets and models that Git cannot manage efficiently.
Reproducibility: By versioning data and code together, DVC ensures that experiments can be reproduced.
Pipeline Automation: DVC allows you to define and automate ML pipelines, making it easier to manage complex workflows.
Collaboration: DVC makes it easy to share data and models with team members without duplicating files.
DVC Commands: A Hands-On Guide
Below is a detailed guide to the most important DVC commands, organized by functionality.
1. Initializing DVC
To start using DVC in your project, initialize it from the root of an existing Git repository.
Command:
bash
dvc init
Initializes DVC in the current directory.
Creates a .dvc directory to store metadata and configuration files.
Example:
bash
cd my-ml-project
dvc init
git add .dvc
git commit -m "Initialize DVC"
2. Adding Data to DVC
To version control a dataset or file, use the dvc add command.
Command:
bash
dvc add <file_or_directory>
Adds the file or directory to DVC.
Creates a small .dvc metadata file next to the data and adds the data path to .gitignore, so Git tracks the metadata but not the data itself.
Example:
bash
dvc add data.csv
git add data.csv.dvc .gitignore
git commit -m "Add data.csv to DVC"
3. Tracking Data with Git
After adding data to DVC, you need to track the metadata file with Git.
Command:
bash
git add <file>.dvc .gitignore
git commit -m "Add <file> to DVC"
- Commits the .dvc file to Git, which tracks changes to the data.
Example:
bash
git add data.csv.dvc .gitignore
git commit -m "Add data.csv to DVC"
4. Pushing and Pulling Data
DVC allows you to store data in remote storage and sync it across environments.
Commands:
Set up remote storage (the -d flag makes it the default remote):
bash
dvc remote add -d myremote <remote_url>
Push data to remote storage:
bash
dvc push
Pull data from remote storage:
bash
dvc pull
Example:
bash
dvc remote add -d myremote s3://mybucket/dvc-storage
dvc push
dvc pull
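Conceptually, pushing and pulling both come down to copying whichever stored objects the other side is missing. The Python sketch below illustrates that idea with two plain directories standing in for the local cache and the remote. The function `sync_objects` and the object names are hypothetical, and real DVC remotes involve far more (transfer protocols, integrity checks, directory objects).

```python
import shutil
import tempfile
from pathlib import Path


def sync_objects(src: Path, dst: Path) -> list[str]:
    """Copy objects that exist in `src` but not in `dst`; return their names."""
    dst.mkdir(parents=True, exist_ok=True)
    src_names = {p.name for p in src.iterdir()}
    dst_names = {p.name for p in dst.iterdir()}
    missing = sorted(src_names - dst_names)
    for name in missing:
        shutil.copyfile(src / name, dst / name)
    return missing


# Hypothetical local cache and "remote", both plain directories here.
root = Path(tempfile.mkdtemp())
cache, remote = root / "cache", root / "remote"
cache.mkdir()
(cache / "abc123").write_text("dataset v1")
(cache / "def456").write_text("dataset v2")

pushed = sync_objects(cache, remote)        # like a push: cache -> remote
pushed_again = sync_objects(cache, remote)  # nothing new to transfer
(cache / "abc123").unlink()                 # simulate a clone that lacks the data
pulled = sync_objects(remote, cache)        # like a pull: remote -> cache
print(pushed, pushed_again, pulled)
```

Only missing objects are transferred, which is why a second push right after the first has nothing to do.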
5. Checking Data Status
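To see whether tracked data or pipeline stages have changed in your workspace, use dvc status. To compare your local cache against remote storage instead, add the -c (--cloud) flag.
Command:
bash
dvc status
dvc status -c
- The first form shows which tracked files or stages have changed locally; the second shows what still needs to be pushed or pulled.
Example:
bash
dvc status
dvc status -c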
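A status check of this kind boils down to comparing the hashes recorded at tracking time with what is currently on disk. The following Python sketch shows that comparison under simplified assumptions: `workspace_status` and the `recorded` dictionary are hypothetical stand-ins for the metadata DVC keeps, and the labels "modified" and "deleted" are chosen for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def file_hash(path: Path) -> str:
    """Hash a file's contents (SHA-256 is an arbitrary choice for this sketch)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def workspace_status(recorded: dict[str, str], root: Path) -> dict[str, str]:
    """Return {path: state} for tracked files that differ from their recorded hash."""
    report = {}
    for rel_path, old_hash in recorded.items():
        target = root / rel_path
        if not target.exists():
            report[rel_path] = "deleted"
        elif file_hash(target) != old_hash:
            report[rel_path] = "modified"
    return report


root = Path(tempfile.mkdtemp())
(root / "data.csv").write_text("a,b\n1,2\n")
(root / "labels.csv").write_text("x\n")
recorded = {name: file_hash(root / name) for name in ("data.csv", "labels.csv")}

clean = workspace_status(recorded, root)
(root / "data.csv").write_text("a,b\n1,2\n3,4\n")  # edit a tracked file
(root / "labels.csv").unlink()                      # delete another
dirty = workspace_status(recorded, root)
print(clean, dirty)
```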
6. Importing Data
You can import data from external sources using DVC.
Command:
bash
dvc import-url <url> <output_path>
- Downloads the data from the URL, tracks it with DVC, and records the source so the data can be refreshed later with dvc update.
Example:
bash
dvc import-url https://example.com/data.zip data/
7. Building and Running Pipelines
DVC allows you to define and run ML pipelines.
Commands:
Define a pipeline in dvc.yaml:
yaml
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - data.csv
    outs:
      - prepared_data.csv
  train:
    cmd: python src/train.py
    deps:
      - prepared_data.csv
    outs:
      - model.pkl
Run the pipeline:
bash
dvc repro
Visualize the pipeline:
bash
dvc dag
Example:
bash
dvc repro
dvc dag
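The run order of a pipeline like this follows from its deps and outs: a stage must run after whichever stage produces its inputs. The Python sketch below derives that order for the two stages defined above using the standard library's graphlib (Python 3.9+). It only illustrates the dependency-graph idea; the `execution_order` helper is hypothetical and is not how DVC itself is implemented.

```python
from graphlib import TopologicalSorter

# The same two stages as in the dvc.yaml above, written as plain Python data.
stages = {
    "prepare": {"deps": ["data.csv"], "outs": ["prepared_data.csv"]},
    "train": {"deps": ["prepared_data.csv"], "outs": ["model.pkl"]},
}


def execution_order(stages: dict) -> list[str]:
    """Order stages so each one runs after the stages that produce its deps."""
    # Map every output file to the stage that produces it.
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    # For each stage, collect the stages it depends on. Deps with no
    # producer (such as the raw data.csv) are external inputs.
    graph = {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(stages)
print(order)  # ['prepare', 'train']
```

The same graph also explains why changing data.csv invalidates both stages, while changing only src/train.py would leave prepare untouched if that script were declared as a dependency of train alone.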
8. Tracking Metrics
DVC can track metrics from your experiments.
Commands:
Save metrics in a JSON or YAML file, typically written by your training or evaluation script:
json
{ "accuracy": 0.95, "loss": 0.05 }
Display metrics:
bash
dvc metrics show
Example:
bash
dvc metrics show metrics.json
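A metrics file like the one above is usually produced by the training or evaluation script rather than written by hand. Here is a minimal Python sketch of that step. The `write_metrics` helper and the numbers are placeholders, not code from the demo repository, and a real script would write to metrics.json in the project directory rather than a temporary one.

```python
import json
import tempfile
from pathlib import Path


def write_metrics(accuracy: float, loss: float, path: Path) -> dict:
    """Write evaluation results as a flat JSON object and return them."""
    metrics = {"accuracy": accuracy, "loss": loss}
    path.write_text(json.dumps(metrics, indent=2))
    return metrics


# Placeholder numbers standing in for real evaluation results.
# A temporary directory is used so the sketch does not touch your project.
metrics_path = Path(tempfile.mkdtemp()) / "metrics.json"
write_metrics(accuracy=0.95, loss=0.05, path=metrics_path)
saved = json.loads(metrics_path.read_text())
print(saved)
```

Keeping the file flat and numeric makes the values easy to display and compare from one run to the next.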
9. Comparing Experiments
DVC allows you to compare different experiment runs.
Command:
bash
dvc exp show
- Displays a comparison of metrics and parameters across experiments.
Example:
bash
dvc exp show
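At its core, comparing experiments means lining up the same metrics from different runs and looking at how they changed. The Python sketch below does this for two runs. The `compare_runs` helper and both result dictionaries are hypothetical, and the output is not meant to reproduce the table that dvc exp show prints.

```python
def compare_runs(baseline: dict, candidate: dict) -> dict:
    """Return {metric: (baseline, candidate, change)} for metrics in both runs."""
    return {
        name: (
            baseline[name],
            candidate[name],
            round(candidate[name] - baseline[name], 6),
        )
        for name in sorted(baseline.keys() & candidate.keys())
    }


# Hypothetical results from two experiment runs.
baseline = {"accuracy": 0.93, "loss": 0.08}
candidate = {"accuracy": 0.95, "loss": 0.05}

diff = compare_runs(baseline, candidate)
for name, (old, new, change) in diff.items():
    print(f"{name:<10}{old:>8}{new:>8}{change:>+10.2f}")
```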
10. Removing Data from DVC
To stop tracking a file with DVC, use:
Command:
bash
dvc remove <file>.dvc
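- Removes the .dvc file and the matching .gitignore entry, so DVC no longer tracks the data. The data file itself stays in your workspace unless you add the --outs flag.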
Example:
bash
dvc remove data.csv.dvc
DVC demo code: https://github.com/bittush8789/MLOps-Foundation-/tree/main/03.DVC-demo
Follow me on LinkedIn: https://www.linkedin.com/in/bittu-kumar-54ab13254/