DVC: the key tool for data management in ML

Version control makes it easier for developers to work independently on a project. DVC (Data Version Control) extends this concept to data, so that datasets can be managed as effectively as code. This article explores DVC, defining its role in data science and explaining how it is used.

Mouad ET-TALI Data Scientist

Overview of data versioning tools

When developing data science and machine learning projects, it is essential to have tools to manage data versioning. Data versioning allows you to track changes to datasets, save different versions, and collaborate effectively with other team members. This ensures reproducibility of experiments, facilitates comparison of results and allows reverting to previous versions if necessary.

In this section, we will present an overview of some of the main data version control tools, highlighting their specific features and approaches. Among these tools we will find:

DVC (Data Version Control)

DVC is an open-source tool designed specifically for data versioning. It offers a simple, lightweight approach to managing data versions, with seamless integration with Git. DVC stands out for its ability to maintain data flexibility and portability.

Git LFS (Large File Storage)

Git LFS is a Git extension that allows you to manage and version large files. It is commonly used in software development projects to track binary files, but can also be used to version data. However, Git LFS can run into performance limits when datasets become very large.

Pachyderm

Pachyderm is an open-source platform that combines data versioning with data processing pipelines. It helps manage large datasets and track changes to the data. Pachyderm offers advanced features such as pipeline repeatability and data versioning, but it has a steeper learning curve and a more complex setup.

MLflow

MLflow is an open-source platform developed by Databricks for managing the lifecycle of machine learning models. Although its primary function is model tracking and deployment, MLflow also has data versioning capabilities. However, it focuses on managing experiments and models more than on versioning the data itself.

Neptune.ai

Neptune.ai is a collaborative platform for data science projects. It provides experiment tracking, code and data versioning, and team collaboration capabilities. Neptune.ai focuses on experiment visualization and metadata management, providing a comprehensive view of the model development process.

You now have an idea of the landscape of data version control tools most commonly used in data science and machine learning projects. Each of these tools offers its own features and approaches to meet specific data management and collaboration needs.

In the remainder of this article, we will focus on DVC in more detail. We'll explore its advanced features, its integration with Git, and its specific benefits for data versioning in data science and machine learning projects. You'll learn how DVC can improve the management of your data sets, simplify collaboration within your team, and help you maintain rigorous traceability of data changes.

General overview of DVC

DVC is an open-source command-line tool and Python library that provides advanced features for data versioning in data science and machine learning projects. It can be installed with common package managers such as conda or pip, making it accessible to a wide range of users.

One of the special features of DVC is its free and open source extension for the Visual Studio Code (VS Code) code editor. This extension provides a dashboard that allows you to visualize different versions of machine learning data, models, and experiments. This feature makes it easier to navigate and understand version history, as well as compare results obtained at different stages of the project.

Using DVC, data scientists and developers can manage datasets, models, and configurations while maintaining reproducibility of experiments and traceability of results. DVC integrates seamlessly with version control tools like Git, making it easier for team members to collaborate and to share datasets and trained models.

In summary, DVC is a powerful Python library, accompanied by an extension for VS Code, that offers advanced features for data versioning in data science and machine learning projects.

DVC Objectives

DVC was designed to meet several essential objectives in data versioning, thereby promoting effective collaboration in data science and machine learning projects. The main objectives of DVC are:

Tracking data status: DVC allows users to precisely track the status of their data, including the version, path and usage of each dataset. This makes it easier to understand how datasets change over time and to locate the exact data used in different parts of the project.
Secure data modification: DVC provides a safe way to make changes to datasets. Users can make changes, such as pre-processing or data augmentation, while preserving the integrity of existing versions. Data can therefore be handled without risk of loss or corruption.
Effective collaboration and sharing: DVC facilitates collaboration between team members working on data science projects. Users can easily share datasets, machine learning models, metrics, and code with other team members. DVC also ensures that users can access shared data securely and reproducibly.

DVC is specifically designed to handle large files, such as datasets and machine learning models, and their associated metrics. It stores this data in a secure and distributed repository, ensuring its integrity and availability to users. Users can therefore work with confidence, knowing that their data is protected and easy to access for analysis and collaboration.

Why use DVC

In development projects, versioning code has become common practice and even essential, given the obvious benefits it brings to the software community. Tracking every code change allows developers to move back and forth in time, compare previous versions, and resolve issues while minimizing disruption to the team. After all, code is a valuable asset that must be protected at all costs!

The same principle applies to data. Here are four reasons why you should use a version control tool for data:

Reproducibility

Machine learning models are typically developed by combining code, data, and configuration files, which can make reproducing results complex. DVC solves this problem by versioning data and tracking dependencies, allowing the same results to be reproduced on different machines or at different times.
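The idea of tracking dependencies can be illustrated with a minimal sketch. It is a toy model written for this article, not DVC's implementation: the file names (train.csv, params.json, experiment.lock) and the functions snapshot and is_up_to_date are hypothetical, and SHA-256 is used as a stand-in hash. The principle is that a result can be trusted as reproducible as long as the recorded hash of every input still matches.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def snapshot(deps: list[Path]) -> dict[str, str]:
    """Record the current content hash of every dependency (data, config, ...)."""
    return {str(p): hashlib.sha256(p.read_bytes()).hexdigest() for p in deps}


def is_up_to_date(deps: list[Path], lock_file: Path) -> bool:
    """True if no dependency has changed since the lock file was written."""
    if not lock_file.exists():
        return False
    return json.loads(lock_file.read_text()) == snapshot(deps)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    data, params = root / "train.csv", root / "params.json"
    data.write_text("x,y\n1,2\n")
    params.write_text('{"lr": 0.1}')
    lock = root / "experiment.lock"  # hypothetical lock file name

    # Freeze the state of the inputs after running an experiment.
    lock.write_text(json.dumps(snapshot([data, params])))
    unchanged = is_up_to_date([data, params], lock)

    # Editing the dataset invalidates the recorded state.
    data.write_text("x,y\n1,2\n3,4\n")
    after_edit = is_up_to_date([data, params], lock)

print(unchanged, after_edit)  # True False
```

Because the check relies only on file content, it gives the same answer on any machine and at any time, which is what makes the comparison useful for reproducibility.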

Collaboration

DVC facilitates collaboration within teams working on machine learning projects. By versioning data, it becomes easier to share work and ensure that all members are working with the same inputs and outputs. It also simplifies reviewing changes and helps track contributors.

Experiment monitoring

DVC helps you track experiments and the results they produce. By versioning the data, you can easily see which experiments were conducted with which versions of the code and data, as well as the results obtained. This makes it easier to iterate on models and improve performance over time.

Scalability

As your machine learning projects grow, it becomes increasingly difficult to manage data and code. DVC addresses this problem by versioning the data, making the project easier to manage as it scales. Additionally, it optimizes storage by identifying each file by a hash of its content: when a new version of a dataset is saved, files that have not changed are not stored a second time.
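The storage saving can be sketched with a small in-memory model. This is an illustration under stated assumptions, not DVC's code: the function store_version, the image file names and the use of SHA-256 are all hypothetical choices made for the example. Two versions of a 100-file dataset that differ by a single file end up occupying 101 stored objects rather than 200.

```python
import hashlib


def store_version(files: dict[str, bytes], cache: dict[str, bytes]) -> dict[str, str]:
    """Add one dataset version to the cache; return its manifest (name -> hash)."""
    manifest = {}
    for name, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        cache.setdefault(digest, content)  # stored once, whatever the number of versions
        manifest[name] = digest
    return manifest


cache: dict[str, bytes] = {}

# Version 1: one hundred distinct files.
v1 = {f"img_{i}.png": f"pixels-{i}".encode() for i in range(100)}

# Version 2: identical except for one modified file.
v2 = dict(v1)
v2["img_7.png"] = b"pixels-7-relabelled"

manifest_v1 = store_version(v1, cache)
manifest_v2 = store_version(v2, cache)

print(len(cache))  # 101
```

Each manifest still describes a complete version of the dataset, so either version can be rebuilt in full from the shared cache.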

By adopting DVC, you will gain these benefits, improving the reproducibility, collaboration, experiment tracking, and scalability of your machine learning projects.

How DVC works

Working with DVC involves a few key steps for versioning and managing data. Here is how it works:

Initialization

To get started, run dvc init once in your Git repository, then track a dataset with the dvc add command. This creates a small file with the .dvc extension that records the path of the data and a hash of its current content. Next, you can set up remote storage with DVC. This can be a location in the cloud (for example AWS S3, Google Cloud Storage or Azure Blob Storage) or a server or directory separate from your project. You declare this remote storage with the dvc remote add command, directly from the terminal. Finally, you can push this version of the data to remote storage with the dvc push command. This keeps a safe copy of the original data that can be recovered later.
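The sequence above can be written down as a short script. This is a sketch rather than a reference workflow: the data path data/train.csv, the remote name storage, the remote location /tmp/dvc-storage and the function first_time_setup are assumptions chosen for illustration. The script only builds and prints the commands, so it runs even where DVC is not installed.

```python
import shlex
from pathlib import PurePosixPath


def first_time_setup(data_path: str, remote_name: str, remote_url: str) -> list[list[str]]:
    """Return, in order, the commands for a first DVC setup inside a Git repository."""
    # dvc add writes a .gitignore entry next to the data so that Git skips the data file.
    gitignore = str(PurePosixPath(data_path).parent / ".gitignore")
    return [
        ["dvc", "init"],                                          # once per repository
        ["dvc", "add", data_path],                                # creates <data_path>.dvc
        ["dvc", "remote", "add", "-d", remote_name, remote_url],  # -d sets the default remote
        ["git", "add", f"{data_path}.dvc", gitignore, ".dvc/config"],
        ["git", "commit", "-m", f"Track {data_path} with DVC"],
        ["dvc", "push"],                                          # upload the data to the remote
    ]


# Hypothetical values, for illustration only.
plan = first_time_setup("data/train.csv", "storage", "/tmp/dvc-storage")
for command in plan:
    print(shlex.join(command))
```

To execute the plan instead of printing it, each command can be passed to subprocess.run(command, check=True) from the root of a Git repository where DVC is installed.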

Data versioning

When you modify the original data file, the .dvc file no longer matches the data. Running the same dvc add command again updates the .dvc file so that it records the hash of the modified data. By committing each update of this .dvc file with Git, you keep a history of data versions and can find any version you want through Git's commit messages.
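The role of the pointer file can be illustrated with a simplified stand-in. The sketch below assumes a JSON file with a .pointer suffix and a SHA-256 hash; real .dvc files use their own format and DVC records an MD5 hash, so write_pointer and the field names here are hypothetical. What matters is that the pointer is tiny, keeps the same path, and changes whenever the data changes, which is exactly what Git needs in order to version it.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def write_pointer(data_file: Path) -> Path:
    """Write a small pointer file next to the data, recording its hash and path."""
    pointer = data_file.with_name(data_file.name + ".pointer")
    record = {
        "hash": hashlib.sha256(data_file.read_bytes()).hexdigest(),
        "path": data_file.name,
    }
    pointer.write_text(json.dumps(record, indent=2))
    return pointer


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "train.csv"

    # First version of the dataset and its pointer.
    data.write_text("x,y\n1,2\n")
    v1 = json.loads(write_pointer(data).read_text())

    # The dataset is modified, then the pointer is refreshed explicitly.
    data.write_text("x,y\n1,2\n3,4\n")
    v2 = json.loads(write_pointer(data).read_text())

print(v1["path"] == v2["path"], v1["hash"] == v2["hash"])  # True False
```

Committing the pointer after each refresh gives Git one small text change per data version, regardless of how large the dataset is.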

Storing data versions

By using the dvc push command, you can store successive versions of your data in remote storage and so keep a complete history of previous versions there. You can then retrieve any specific version that exists in remote storage. This provides great flexibility: you can revert to an earlier version of the data if necessary, which is particularly useful when exploring models or comparing performance across different versions of the data. With DVC, you have the peace of mind of knowing that your data is secure, accessible and reproducible at all times.
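Storing and retrieving versions can be sketched with a local directory playing the part of remote storage. This is a toy model under stated assumptions, not DVC's storage layout: the functions push and checkout, the directory name remote and the SHA-256 hash are hypothetical. In a real project, the hash of the wanted version would come from the .dvc file at the chosen Git commit, typically restored with git checkout followed by dvc checkout.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def push(data_file: Path, remote: Path) -> str:
    """Copy the file into the remote under its content hash; return that hash."""
    digest = hashlib.sha256(data_file.read_bytes()).hexdigest()
    target = remote / digest
    if not target.exists():  # a version that is already stored is not copied again
        shutil.copyfile(data_file, target)
    return digest


def checkout(digest: str, remote: Path, destination: Path) -> None:
    """Restore the version identified by `digest` into the workspace."""
    shutil.copyfile(remote / digest, destination)


with tempfile.TemporaryDirectory() as tmp:
    remote = Path(tmp) / "remote"
    remote.mkdir()
    data = Path(tmp) / "train.csv"

    data.write_text("version 1")
    h1 = push(data, remote)

    data.write_text("version 2")
    h2 = push(data, remote)

    # Go back to the first version using only its hash.
    checkout(h1, remote, data)
    restored = data.read_text()
    stored_versions = len(list(remote.iterdir()))

print(restored, stored_versions)  # version 1 2
```

Both versions stay in the remote after the rollback, so switching back to the second version is just another checkout with h2.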

An important point to note is that the data files themselves are never tracked by Git. In projects using DVC, Git tracks only small metadata files, such as the files with the .dvc extension, while the data files are listed in .gitignore. This allows large data files to be managed efficiently while their version history is kept in Git.

Using these steps, DVC makes it possible to track the version history of data, store it securely and retrieve it easily, while working in harmony with Git to track changes to the .dvc file. This makes data management easier in data science and machine learning projects.

Data versioning & MLOps

MLOps is a methodology that aims to facilitate the production and efficient maintenance of machine learning (ML) models. It is a combination of practices, tools and processes aimed at automating and standardizing ML workflows, while ensuring reproducibility, collaboration and scalability of ML projects.

A data version control tool is crucial in MLOps. ML models depend not only on the code, but also on the data used to train, validate, and deploy them. A data versioning tool like DVC helps version and track data, ensure reproducibility of experiments, facilitate cross-team collaboration, automate data pipelines, and manage the dependencies of ML models efficiently. This allows MLOps teams to maintain rigorous control over data and ensure consistency of results, while making it easier to resolve issues and deploy models into production.

A data version control tool like DVC therefore plays a critical role in MLOps by supporting rigorous data management, experiment reproducibility, cross-team collaboration, and pipeline automation. Although it does not address all aspects of MLOps, DVC can be supplemented with other specialized tools to meet specific needs such as feature management, model monitoring, and CI/CD. By combining DVC with these tools, MLOps teams can improve the efficiency, quality, and reliability of their ML projects.

Conclusion

DVC offers several key advantages in the field of MLOps. It allows you to version data, track changes, restore previous versions when needed, and facilitate collaboration by providing a consistent way to manage model input and output. Additionally, DVC easily integrates with other MLOps tools such as MLflow, enabling more comprehensive model lifecycle management.

In conclusion, using a data version control tool like DVC is crucial in an MLOps environment. It ensures reproducibility of experiments, facilitates collaboration between teams, enables efficient data management and contributes to process automation. By integrating DVC into your MLOps workflow, you can improve the efficiency, traceability, and reliability of your data science projects.
