Difficulty: Beginner
Estimated Time: 10-15 min

Okay, now that we've learned how to track data and models with DVC and how to version them with Git, the next questions are:

  • How can we use these artifacts outside of the project?
  • How do I download a model to deploy it?
  • How do I download a specific version of a model?
  • How do I reuse datasets across different projects?

These questions tend to come up when you browse the files that DVC saves to remote storage and see entries such as s3://dvc-public/remote/get-started/fb/89904ef053f04d64eafcc3d70db673 😱 instead of the original file names, such as model.pkl or data.xml.

Let's learn how any DVC-tracked ML model, dataset, or file can be accessed:

  • From CLI with dvc get
  • From Python API with dvc.api.open
  • From another repository with dvc import

If you prefer to work locally, you can also run the commands in this scenario in a container:

docker run -it dvcorg/doc-katacoda:start-accessing

Accessing Data

Step 1 of 5

Download

We can download any file from a DVC repository:

dvc get \
  https://github.com/iterative/dataset-registry \
  get-started/data.xml

md5sum data.xml

dvc get automates the download: it reads the remote storage URL, https://remote.dvc.org/dataset-registry, from .dvc/config and the file's storage path, a3/04afb96060aad90176268345e10355, from get-started/data.xml.dvc.
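To make that lookup concrete, here is a minimal sketch of how a hash could be turned into the storage path shown above. It assumes that get-started/data.xml.dvc records the file's md5 as a single 32-character string (the path above without the slash) and that the remote uses the two-character directory layout this tutorial shows; the variable names are illustrative only.

```shell
# Assumption: this is the md5 value recorded in get-started/data.xml.dvc,
# i.e. the storage path from the text with the slash removed.
md5="a304afb96060aad90176268345e10355"
# Remote storage URL, as read from .dvc/config in the text above.
storage="https://remote.dvc.org/dataset-registry"

# The first two characters become the directory, the rest the file name.
prefix=$(printf '%s' "$md5" | cut -c1-2)
rest=$(printf '%s' "$md5" | cut -c3-)
path="$prefix/$rest"

# Print the full URL of the stored file.
echo "$storage/$path"
```

The two values the sketch combines are the same two pieces that dvc get reads for us.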

Just for fun, let's try to download it with wget:

storage="https://remote.dvc.org/dataset-registry"
path="a3/04afb96060aad90176268345e10355"
wget -O data.xml.1 "$storage/$path"

Check whether they are the same file:

diff data.xml data.xml.1
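diff prints nothing when the two files match, which can be easy to miss. As an alternative, here is a small sketch that reports the result explicitly using cmp. The same_file helper is hypothetical, not part of DVC, and the check only runs if both downloads from the previous commands are present.

```shell
# Hypothetical helper: report whether two files have identical contents.
same_file() {
  # cmp -s is silent and exits with 0 only when the files are byte-identical.
  if cmp -s "$1" "$2"; then
    echo "identical"
  else
    echo "different"
  fi
}

# Run the check only when both downloaded files exist.
if [ -f data.xml ] && [ -f data.xml.1 ]; then
  same_file data.xml data.xml.1
fi
```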

Instead of downloading the data file directly (e.g., with aws s3 cp, scp, or wget), we access it through a Git repo URL, which serves as the entry point and works as a data/model registry.

By the way, we haven't initialized DVC in the current directory yet; dvc get doesn't need an initialized project.

Let's initialize DVC now.

dvc init