Difficulty: Beginner
Estimated Time: 20-25 min

How cool would it be to make Git handle arbitrarily large files and directories with the same performance it gives small code files? Imagine doing a git clone and seeing data files and ML models right in the workspace, or running git checkout to switch to a different version of a 100 GB file in less than a second.

At its core, DVC is a handful of commands that you run alongside Git to track large files, ML models, or whole directories.
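As a rough sketch of how those commands interleave with Git over a project's life (the path and commit message below are placeholders, and dvc push and dvc checkout assume a DVC remote has already been configured):

dvc add path/to/data       # hash the data and start tracking it with DVC
git add path/to/data.dvc path/to/.gitignore
git commit -m "Track data with DVC"
dvc push                   # upload the data itself to remote storage
dvc checkout               # after a git checkout, sync data files to match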

Data Versioning

Step 1 of 6

Track a file or directory

Let's get a data file from the Get Started example project:

dvc get \
  https://github.com/iterative/dataset-registry \
  get-started/data.xml -o data/data.xml

The dvc get command is like wget, but for data artifacts stored in DVC projects hosted on Git repositories.

Check that the file was downloaded:

ls -lh data/

To track a large file, ML model, or a whole directory with DVC, we use dvc add:

dvc add data/data.xml
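Under the hood, dvc add hashes the file, moves its content into DVC's local cache (by default .dvc/cache; the exact directory layout varies by DVC version), and links it back into the workspace. A quick, optional way to confirm the cached copy exists:

du -sh .dvc/cache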

cat data/.gitignore

DVC has listed data.xml in .gitignore to make sure that we don't accidentally commit it to Git.
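If you run the cat command above, its output should look roughly like the line below (depending on your DVC version the entry may appear with or without the leading slash):

/data.xml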

Instead, we track the small metafile data/data.xml.dvc with Git:

git add data/.gitignore \
        data/data.xml.dvc
git commit -m "Add raw data to project"
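For reference, data/data.xml.dvc is a small, human-readable YAML metafile that points to the cached data:

cat data/data.xml.dvc

Its contents look roughly like this (the hash and size are illustrative and will differ on your machine):

outs:
- md5: 22a1a2931c8370d3aeedd7183606fd7f
  size: 14445097
  path: data.xml

Because only this metafile lands in the Git history, the repository stays small no matter how large the data grows.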