How cool would it be to make Git handle arbitrarily large files and directories with the same performance it has with small code files? Imagine doing a git clone and seeing data files and ML model files in the workspace. Or doing a git checkout to switch to a different version of a 100 GB file in less than a second?
The core of DVC is a few commands that you can run along with Git to track a large file, ML model, or directory.
Track a file or directory
Let's get a data file from the Get Started example project:
dvc get https://github.com/iterative/dataset-registry \
        get-started/data.xml -o data/data.xml
dvc get is like wget, but for downloading data artifacts from DVC projects hosted in Git repositories.
ls -lh data/
To track a large file, ML model, or a whole directory with DVC, we use dvc add:
dvc add data/data.xml
DVC has listed data.xml in data/.gitignore to make sure we don't commit the large file itself to Git. Instead, we track with Git the small metafile data/data.xml.dvc.
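The .dvc file is a small, human-readable YAML metafile that records the hash and path of the tracked data, so Git only needs to version this pointer. A sketch of what data/data.xml.dvc might look like (the md5 and size values here are illustrative, not the real ones for this file):

```yaml
outs:
- md5: d41d8cd98f00b204e9800998ecf8427e   # content hash of data.xml (illustrative value)
  size: 14445097                          # file size in bytes (illustrative value)
  path: data.xml
```

Because the metafile identifies the data by its content hash, checking out an old Git commit restores the matching .dvc file, and DVC can then restore the corresponding version of the data.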
git add data/.gitignore data/data.xml.dvc
git commit -m "Add raw data to project"