Versioning Data and Models for Machine Learning Projects with DVC
Machine learning datasets are usually too large to upload to a GitHub repository or to track directly with Git. You may still want Git to track, or at least represent, your data without storing it alongside the code. DVC is built for exactly this situation.
DVC (Data Version Control: https://dvc.org/doc) is a data and machine learning management tool that can be used to:
- Track and distribute large files such as datasets and models as part of a Git repository with different storage backends.
- Save and compare experiment metrics and results.
- Trace how datasets, models or other artifacts were built and transformed.
In this post we walk through an example of using DVC in a machine learning project. In particular, we show how it is used to:
- Transfer data to remote servers to run model training
- Save serialized models following training
1- Our Setup
Install DVC: https://dvc.org/doc/install/linux
Then let us assume that we have this Git repository structure:
|-- Dockerfile
|-- README.md
|-- data/
|-- models/
|-- requirements.txt
|-- train.py
`-- preprocessing.py
We have two empty folders, data/ and models/, that will host our data and models respectively; a training script (train.py) that trains a neural network or any other machine learning algorithm; a helper script (preprocessing.py) that cleans and transforms the data; and a Dockerfile to package the whole project.
The next step is to download or move the data into the data/ folder so that it can be used for training:
|-- data
|   `-- raw-data/
|       `-- data.csv
2- Initialize, authenticate, and push data with DVC:
With our data ready now, we initialize DVC:
dvc init 
This will create a .dvc directory:
|-- .dvc
|   |-- .gitignore
|   |-- config
|   `-- plots/
|-- .dvcignore
|-- Dockerfile
|-- README.md
|-- data
|-- models
|-- requirements.txt
|-- train.py
`-- preprocessing.py
DVC is compatible with most data storage solutions, such as S3 buckets, GCS buckets, NFS, and Google Drive. In this article we use Google Drive as an example, following the official DVC docs (https://dvc.org/doc/user-guide/setup-google-drive-remote).
1- First, create a folder in your Google Drive. The URL in your browser should be similar to this:
https://drive.google.com/drive/u/2/folders/1zqBBkZ4GxoaS5xDwfzwjYlmMlyAAAA0_
The last part of the URL is the folder ID. We use it to create a DVC Google Drive remote:
dvc remote add -d storage gdrive://1zqBBkZ4GxoaS5xDwfzwjYlmMlyAAAA0_
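If you set up remotes often, extracting the folder ID by hand gets tedious. Here is a minimal sketch of a helper that turns a Drive folder URL into the gdrive:// form used above. The function name gdrive_remote_url is ours, not part of DVC, and we assume the ID is the path segment right after "folders".

```python
from urllib.parse import urlparse


def gdrive_remote_url(folder_url: str) -> str:
    """Build a gdrive:// URL from a Google Drive folder URL (hypothetical helper)."""
    # Only the path is inspected, so query strings such as ?usp=sharing are ignored.
    parts = [p for p in urlparse(folder_url).path.split("/") if p]
    if "folders" not in parts or parts[-1] == "folders":
        raise ValueError(f"not a Google Drive folder URL: {folder_url}")
    # Assumption: the folder ID is the segment that follows "folders".
    folder_id = parts[parts.index("folders") + 1]
    return f"gdrive://{folder_id}"


url = "https://drive.google.com/drive/u/2/folders/1zqBBkZ4GxoaS5xDwfzwjYlmMlyAAAA0_"
print(gdrive_remote_url(url))  # prints gdrive://1zqBBkZ4GxoaS5xDwfzwjYlmMlyAAAA0_
```

The printed value is what we pass as the last argument of dvc remote add.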
Running dvc remote add updates the .dvc/config file, which will then look similar to this:
[core]
    remote = storage
['remote "storage"']
    url = gdrive://1zqBBkZ4GxoaS5xDwfzwjYlmMlyAAAA0_
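To check from a script which remote is the default, we can read this file with Python's standard configparser. This is only a sketch for the simple layout shown above: DVC uses its own config loader, and the helper name default_remote_url is hypothetical.

```python
import configparser

# The config shown above, inlined so that the example is self-contained.
CONFIG_TEXT = """\
[core]
    remote = storage
['remote "storage"']
    url = gdrive://1zqBBkZ4GxoaS5xDwfzwjYlmMlyAAAA0_
"""


def default_remote_url(config_text: str) -> str:
    """Return the URL of the default remote (hypothetical helper)."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    name = parser["core"]["remote"]
    # The section header keeps its quotes verbatim: 'remote "storage"'
    section = "'remote \"" + name + "\"'"
    return parser[section]["url"]


print(default_remote_url(CONFIG_TEXT))
```

In a real repository we would pass the contents of .dvc/config instead of the inlined string.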
Then let us commit our changes:
git add .dvc/config
git commit -m "updated DVC config"
2- Next, we start tracking the data with DVC so that it can be pushed to the remote in preparation for training:
dvc add data/raw-data
The dvc add command is similar to git add, in that it makes DVC aware of the target data, in order to start versioning it. It creates a .dvc file to track the added data:
|-- data
|   |-- .gitignore
|   `-- raw-data.dvc
This command can be used to track large files, models, dataset directories, and other artifacts that are too big for Git to handle directly, which lets us version them indirectly with Git. You can read more in the dvc add command reference: https://dvc.org/doc/command-reference/add
If we print out the contents of the .dvc file we will have something like this:
outs:
- md5: 67463197ab31aacbbea49967d3e153ca.dir
  size: 671986169
  nfiles: 1
  path: raw-data
The md5 value (67463197ab31aacbbea49967d3e153ca.dir) is a hash of the tracked content. It changes whenever the data changes and we run dvc add again, so you can think of it as the counterpart of a Git commit SHA. The .dvc file is nothing more than a lightweight metafile that points to our data. That is why we version this .dvc file with Git, while the data itself is versioned by DVC:
git add data/raw-data.dvc data/.gitignore
git commit -m "added DVC metadata"
git push
The git push command pushes our Git-tracked files (the code and the .dvc metafile) to the Git repository, but not the actual raw-data/ directory, which is git-ignored.
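The md5 field in the .dvc file works as a content fingerprint. Here is a minimal sketch of that idea using only the standard library: the hash stays the same as long as the bytes are the same and changes as soon as the content changes. The helper file_md5 and the sample CSV rows are ours, and the sketch does not reproduce how DVC derives the directory-level .dir value from the per-file hashes.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so that large datasets never sit fully in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data.csv"
    data.write_bytes(b"id,label\n1,cat\n")   # made-up sample content
    first = file_md5(data)
    unchanged = file_md5(data)               # same bytes, same hash
    data.write_bytes(b"id,label\n1,dog\n")
    changed = file_md5(data)                 # different bytes, different hash

print(first == unchanged, first != changed)  # prints True True
```

This is why committing the small .dvc file is enough for Git to pin an exact version of the data.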
3- Now we want to push our DVC-tracked files (the actual raw-data/ directory) to our DVC remote. For that we simply run:
dvc push  
3- Load data using DVC and launch training:
In this section we use a Docker container as an isolated environment in which to fetch our data and launch training, which shows how DVC tracks both our data and our models. We assume that requirements.txt lists DVC with Google Drive support (for example dvc[gdrive]). An example Dockerfile would look something like this:
FROM python:3.8
RUN apt-get update && apt-get install -y git
RUN pip install pip==21.0.1
WORKDIR /app
ADD requirements.txt .
RUN pip install -r requirements.txt
ADD . .
RUN git config user.email "code@gmail.com"
RUN git config user.name "OussemaLouati"
Then we build the image and start an interactive container from it:
docker build -t dvc-example . 
docker run -it -v $HOME/.ssh:/root/.ssh dvc-example 
In the Dockerfile we set the Git identity:
RUN git config user.email "code@gmail.com"
RUN git config user.name "OussemaLouati"
And we mounted our SSH keys inside the container:
-v $HOME/.ssh:/root/.ssh
so that we can authenticate with GitHub from inside the container.
We are now inside the Docker container, a completely isolated environment. Let us fetch our data and launch our training. Fetching and downloading the data is as simple as:
dvc pull 
This will automatically download the data and save it under data/raw-data, where the training script expects to find it:
python train.py  # the model is saved to models/checkpoints/, which is also kept out of Git
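The post does not show train.py itself, so here is a purely illustrative sketch of the input/output contract the workflow relies on: read the CSV that dvc pull restored, write a checkpoint under models/checkpoints/. The "model" (the mean of a column), the column name target, and the file name model.pkl are all placeholders of ours.

```python
import csv
import pickle
import tempfile
from pathlib import Path


def train(data_path: Path, checkpoint_dir: Path) -> Path:
    """Fit a placeholder model and save it where DVC will later track it."""
    with data_path.open(newline="") as handle:
        rows = list(csv.DictReader(handle))
    if not rows:
        raise ValueError(f"no rows found in {data_path}")
    # Placeholder "model": the mean of a hypothetical 'target' column.
    targets = [float(row["target"]) for row in rows]
    model = {"mean_target": sum(targets) / len(targets), "n_rows": len(rows)}
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    checkpoint = checkpoint_dir / "model.pkl"
    with checkpoint.open("wb") as handle:
        pickle.dump(model, handle)
    return checkpoint


# Demo in a temporary directory that mirrors the repository layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    csv_path = root / "data" / "raw-data" / "data.csv"
    csv_path.parent.mkdir(parents=True)
    csv_path.write_text("feature,target\n1,2.0\n2,4.0\n")  # made-up rows
    saved = train(csv_path, root / "models" / "checkpoints")
    print(saved.name)
```

In the real project the script would call train(Path("data/raw-data/data.csv"), Path("models/checkpoints")) with an actual training loop in place of the mean.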
Finally, we save the trained model checkpoint and add it to DVC to start tracking our models:
dvc add models/checkpoints
git add models/checkpoints.dvc models/.gitignore
git commit -m "added DVC metadata for model checkpoint"
dvc push
git push
4- Verify results:
After training, the results are available to pull from any other clone of the repository:
git pull
dvc pull
Finally, if we run training again, our model checkpoints will change, and we need to keep tracking them with Git and DVC. To summarize, each time we need to:
- Download data/models
dvc pull
- Add new data/models
dvc add data/raw-data
git add data/raw-data.dvc
git commit -m "Commit message"
git push
dvc push
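Since this cycle repeats after every change, it can be handy to script it. Here is a minimal sketch that builds the same command sequence for any tracked path and, by default, only prints it. The helper names sync_commands and run_all are ours, and we also stage the sibling .gitignore, as we did when we first added the data.

```python
import shlex
import subprocess
from pathlib import PurePosixPath
from typing import List


def sync_commands(target: str, message: str) -> List[List[str]]:
    """Build the add/commit/push sequence for one DVC-tracked path."""
    path = PurePosixPath(target)  # normalizes a trailing slash
    gitignore = str(path.parent / ".gitignore")
    return [
        ["dvc", "add", str(path)],
        ["git", "add", f"{path}.dvc", gitignore],
        ["git", "commit", "-m", message],
        ["git", "push"],
        ["dvc", "push"],
    ]


def run_all(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command and execute it unless dry_run is set."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            # check=True stops the sequence at the first failing command.
            subprocess.run(argv, check=True)


run_all(sync_commands("data/raw-data", "Update raw data"))
```

Calling run_all(..., dry_run=False) inside the repository would execute the commands for real. The same helper works for models/checkpoints.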

