How cool would it be to make Git handle arbitrarily large files and directories with the same performance it has with small code files? Imagine doing a git clone and seeing data files and ML model files in the workspace, or doing a git checkout to switch to a different version of a 100 GB file in less than a second.
The core part of DVC is a few commands that you can run along with Git to track a large file, ML model, or directory.
If you prefer to follow along locally, you can also run the commands in this scenario in a container:
docker run -it dvcorg/doc-katacoda:start-versioning
Data and Model Versioning
Track a File or Directory
Let's get a data file from the Get Started example project:
dvc get https://github.com/iterative/dataset-registry \
        get-started/data.xml -o data/data.xml
dvc get is like a smart wget that can be used to retrieve artifacts from DVC repositories. You don't even need an initialized DVC project to use it.
Check that the file is now in the workspace:

ls -lh data/
To track a large file, ML model, or a whole directory with DVC, we use dvc add:
dvc add data/data.xml
dvc add stored the file in DVC's cache and created a small metafile, data/data.xml.dvc. DVC has also listed data/data.xml in data/.gitignore to make sure that we don't commit the large file itself to Git. Instead, we track data/data.xml.dvc with Git.
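The .dvc file is a small, human-readable YAML file that records a hash of the data. You can inspect it; the output below is a sketch, and the md5 value is illustrative (yours will correspond to the actual file contents):

cat data/data.xml.dvc

outs:
- md5: 22a1a2931c8370d3aeedd7183606fd7f
  path: data.xml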
git add data/.gitignore data/data.xml.dvc
git commit -m "Add dataset to the project"
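With the metafile committed, switching to a different version of the data later is the two-step workflow promised above: check out the Git revision that has the .dvc file you want, then let DVC sync the workspace from its cache. This is a sketch; <revision> is a placeholder for a commit, tag, or branch:

git checkout <revision> data/data.xml.dvc   # restore the metafile from Git history
dvc checkout data/data.xml                  # sync the data file from DVC's cache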