Okay, now that we've learned how to track data and models with DVC and how to version them with Git, the next questions are:
- How can we use these artifacts outside of the project?
- How do I download a model to deploy it?
- How do I download a specific version of a model?
- How do I reuse datasets across different projects?
These questions tend to come up when you browse the files that DVC saves to remote storage and see names such as s3://dvc-public/remote/get-started/fb/89904ef053f04d64eafcc3d70db673 😱 instead of the original file names, such as model.pkl or data.xml.
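The cryptic name is not random: it is derived from a hash of the file's content. As a minimal sketch, assuming the layout seen in the URL above (the first two hex characters of the MD5 hash form a directory name and the remaining 30 form the file name), the mapping could be written as follows. The function names are hypothetical and not part of DVC, and other DVC versions may lay out remote storage differently.

```python
import re

# A full MD5 digest written as 32 lowercase hex characters.
MD5_HEX = re.compile(r"[0-9a-f]{32}")


def hash_to_storage_path(md5_hex: str) -> str:
    """Split a 32-character MD5 hex digest into '<2 chars>/<30 chars>'."""
    if not MD5_HEX.fullmatch(md5_hex):
        raise ValueError(f"not an MD5 hex digest: {md5_hex!r}")
    return f"{md5_hex[:2]}/{md5_hex[2:]}"


def storage_path_to_hash(path: str) -> str:
    """Recover the digest from the last two components of a storage path."""
    directory, name = path.rstrip("/").split("/")[-2:]
    md5_hex = directory + name
    if not MD5_HEX.fullmatch(md5_hex):
        raise ValueError(f"path does not end in a hash: {path!r}")
    return md5_hex


if __name__ == "__main__":
    url = "s3://dvc-public/remote/get-started/fb/89904ef053f04d64eafcc3d70db673"
    print(storage_path_to_hash(url))
```

Because the name depends only on the content, two identical files end up at the same storage path no matter what they were originally called.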
Let's learn how any DVC-tracked ML model, dataset, or file can be accessed:
- From the CLI with dvc get
- From the Python API with dvc.api.open
- From another repository with dvc import
If you prefer to work locally, you can also run the commands in this scenario in a container:
docker run -it dvcorg/doc-katacoda:start-accessing

Accessing Data
Step 1
Download
We can download any file from a DVC repository:
dvc get \
https://github.com/iterative/dataset-registry \
get-started/data.xml
md5sum data.xml
dvc get automates this download. It reads the remote storage URL, https://remote.dvc.org/dataset-registry, from .dvc/config, and the storage path, a3/04afb96060aad90176268345e10355, which is derived from the MD5 hash recorded in get-started/data.xml.dvc.
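The lookup can be sketched in a few lines of Python. This is an illustration rather than DVC's actual implementation: the function name is hypothetical, the .dvc file content below is a simplified assumption of what get-started/data.xml.dvc contains, and a regular expression stands in for a real YAML parser.

```python
import re

# Simplified, assumed content of get-started/data.xml.dvc.
SAMPLE_DVC_FILE = """\
outs:
- md5: a304afb96060aad90176268345e10355
  path: data.xml
"""


def resolve_download_url(remote_url: str, dvc_file_text: str) -> str:
    """Combine the remote URL with the path derived from the recorded hash."""
    match = re.search(
        r"^\s*-?\s*md5:\s*([0-9a-f]{32})\s*$", dvc_file_text, re.MULTILINE
    )
    if match is None:
        raise ValueError("no md5 entry found in the .dvc file text")
    md5_hex = match.group(1)
    # Assumed layout: first two hex characters, then the remaining thirty.
    return f"{remote_url.rstrip('/')}/{md5_hex[:2]}/{md5_hex[2:]}"


if __name__ == "__main__":
    remote = "https://remote.dvc.org/dataset-registry"  # value from .dvc/config
    print(resolve_download_url(remote, SAMPLE_DVC_FILE))
```

The two inputs come from files that are versioned in Git, which is why a Git repository URL and a file path are enough for dvc get to locate the data.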
Just for fun, let's try to download it with wget:
storage="https://remote.dvc.org/dataset-registry"
path="a3/04afb96060aad90176268345e10355"
wget -O data.xml.1 "$storage/$path"
Check whether they are the same file:
diff data.xml data.xml.1
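The same check can be made without diff by hashing both files. The sketch below is plain Python and independent of DVC; the helper names are hypothetical. It assumes that the hash in the storage path is the MD5 of the file's raw bytes, which is what the md5sum call above suggests.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(first, second):
    """True if both files hash to the same digest."""
    return file_md5(first) == file_md5(second)


def matches_storage_path(path, storage_path):
    """True if the file's digest equals the hash spelled out by 'xx/yyyy...'."""
    return file_md5(path) == storage_path.replace("/", "")


if __name__ == "__main__":
    # Only meaningful after running the dvc get and wget commands above.
    if Path("data.xml").exists() and Path("data.xml.1").exists():
        print(same_content("data.xml", "data.xml.1"))
        print(matches_storage_path("data.xml", "a3/04afb96060aad90176268345e10355"))
```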
Instead of downloading the data file directly, e.g., with aws s3 cp, scp, or wget, we are accessing it using a Git repo URL as an entry point, or as a data/model registry.
By the way, we haven't initialized DVC in the current directory yet: dvc get doesn't need an initialized project.
Let's initialize DVC now.
dvc init