Difficulty: Intermediate
Estimated Time: 30-40 min

In this tutorial we get hands-on experience with a basic scenario: working with multiple versions of datasets and ML models using DVC commands.

We first train a classifier model using 1000 labeled images, then double the number of images and retrain the model. We capture both datasets and both resulting models, and show how to use dvc checkout together with git checkout to switch between the versions.

The specific algorithm that is used to train and validate the classifier is not important. The script just takes some data and produces a model file.

Note: This is based on a tutorial that François Chollet put together to show how to build a powerful image classifier, using only a small dataset.

Data Versioning

Step 1

Preparation

  1. Get the code of the tutorial:

    git clone https://github.com/iterative/example-versioning

    cd example-versioning/

    ls -al

    git status

    git remote -v

    git remote remove origin

    git remote -v

    git status

  2. Install the requirements. We recommend installing them in a virtual environment:

    virtualenv -p python3 .env

    source .env/bin/activate

    cat requirements.txt

    pip3 install -r requirements.txt

    echo -e "\n.env/" >> .gitignore

    cat .gitignore

    git add .gitignore

    git commit -m "Ignore virtualenv directory"
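
To check that the ignore rule behaves as intended, here is a minimal sketch that runs in a scratch repository, so it does not touch example-versioning. The scratch directory, the empty stand-in `activate` file, and the `demo_dir` and `ignored_entries` variable names are assumptions made for illustration. It also uses `printf` instead of `echo -e`, since `echo -e` is not handled the same way by every shell.

```shell
# Scratch repository (hypothetical), kept separate from example-versioning
demo_dir="$(mktemp -d)"
git init -q "$demo_dir"
cd "$demo_dir"

# Stand-in for the directory that virtualenv would create
mkdir -p .env/bin
touch .env/bin/activate

# Portable equivalent of: echo -e "\n.env/" >> .gitignore
printf '\n.env/\n' >> .gitignore

# Keep only the ignored entries (marked "!!") from the status output
ignored_entries="$(git status --porcelain --ignored | grep '^!!')"
echo "$ignored_entries"
```

If `.env/` appears in the printed list, Git is ignoring the virtual environment directory. In the tutorial repository itself, `git status --ignored` gives the same information.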