How cool would it be to make Git handle arbitrarily large files and directories with the same performance it offers for small code files? Imagine doing a git clone and seeing data files and ML model files appear in the workspace, or running git checkout to switch to a different version of a 100 GB file in less than a second.
At its core, DVC is a small set of commands that you run alongside Git to track large files, ML models, or directories.
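As a minimal sketch (not one of this page's steps), this is roughly how a DVC project is initialized inside a Git repository, assuming DVC is already installed. The rest of this section assumes such a project already exists:

git init
dvc init
git status    # dvc init stages its internal files (.dvc/config, .dvc/.gitignore, ...)
git commit -m "Initialize DVC"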

Data Versioning

Step 1: Track a file or directory
Let's get a data file from the Get Started example project:
dvc get \
https://github.com/iterative/dataset-registry \
get-started/data.xml -o data/data.xml
The dvc get command is like wget, but it is used to download data artifacts from DVC projects hosted in Git repositories.
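As an aside, dvc get can also download an artifact as it existed at a specific Git revision of the source repository via the --rev option. A hedged sketch (the revision value below is a placeholder, not part of this example):

dvc get --rev <git-revision> \
https://github.com/iterative/dataset-registry \
get-started/data.xml -o data/data.xml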
Check that the file has been downloaded:
ls -lh data/
To track a large file, an ML model, or a whole directory with DVC, we use dvc add:
dvc add data/data.xml
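The same command works for whole directories. A hedged aside (the directory name below is hypothetical and not part of this example): tracking a directory produces a single .dvc metafile for all of its contents.

dvc add data/raw-images    # hypothetical directory; would create data/raw-images.dvc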
cat data/.gitignore
DVC has added data.xml to .gitignore to make sure that the large file itself is not committed to Git.
Instead, we track the small metafile data/data.xml.dvc with Git:
git add data/.gitignore \
data/data.xml.dvc
git commit -m "Add raw data to project"
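The .dvc metafile that Git now tracks is a small, human-readable YAML file. Its exact contents vary by DVC version; the sketch below is approximate and the checksum is a placeholder:

cat data/data.xml.dvc

outs:
- md5: <md5 checksum of data.xml>
  path: data.xml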