Difficulty: Beginner
Estimated Time: 20-25 min

How cool would it be to make Git handle arbitrarily large files and directories with the same performance it has with small code files? Imagine doing a git clone and seeing data files and ML models right in the workspace. Or running git checkout to switch to a different version of a 100 GB file in less than a second.

At its core, DVC is a small set of commands that you run alongside Git to track a large file, ML model, or directory.

If you prefer to work locally, you can also run the commands in this scenario inside a container:

docker run -it dvcorg/doc-katacoda:start-versioning

Data and Model Versioning

Step 1 of 6

Track a File or Directory

Let's get a data file from the Get Started example project:

dvc get \
  https://github.com/iterative/dataset-registry \
  get-started/data.xml -o data/data.xml

The command dvc get is like a smart wget that can be used to retrieve artifacts from DVC repositories. You don't even need an initialized DVC repository to use dvc get.

ls -lh data/
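By default, dvc get downloads the file as it appears on the source repository's default branch. If you need a specific version, dvc get also accepts a --rev option pointing at any Git branch, tag, or commit in that repository; <git-revision> below is a placeholder, not a real tag:

dvc get --rev <git-revision> \
  https://github.com/iterative/dataset-registry \
  get-started/data.xml -o data/data.xml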

To track a large file, ML model, or a whole directory with DVC, we use dvc add:

dvc add data/data.xml

DVC stores the file contents in its cache (under .dvc/cache) and lists data.xml in data/.gitignore to make sure we don't commit the large file itself to Git.

cat data/.gitignore
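You should see an entry for the data file, something like:

    /data.xml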

Instead, we track data/data.xml.dvc with Git.
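data/data.xml.dvc is a small, human-readable metafile that points to the actual data in DVC's cache, so it is cheap to version alongside the code. Its exact contents depend on your DVC version, but it should look roughly like this (the hash and size values below are placeholders):

cat data/data.xml.dvc

    outs:
    - md5: <md5 hash of data.xml>
      size: <size in bytes>
      path: data.xml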

git add data/.gitignore \
        data/data.xml.dvc

git commit -m "Add dataset to the project"
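The same workflow applies to ML model files and whole directories. For example, assuming a hypothetical data/raw directory of training images:

dvc add data/raw

git add data/raw.dvc data/.gitignore

git commit -m "Track raw data directory with DVC"

DVC hashes every file in the directory, caches the contents, and produces a single data/raw.dvc metafile for Git to track, while the directory itself is added to data/.gitignore.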