Difficulty: Beginner
Estimated Time: 10-15 min

The data science process is inherently iterative and resembles R&D. A data scientist may try many different approaches and hyperparameter values, and "fail" many times before a metric reaches the required level. DVC provides a framework to capture and compare the performance of these experiments through metrics.

If you prefer to run locally, you can also run the commands in this scenario in a container:

docker run -it dvcorg/doc-katacoda:start-params

Params, Metrics, and Plots

Step 1 of 7

Adding parameters to pipelines

In the previous scenario, we defined a two-stage pipeline that splits the data into training and test sets and extracts features from each. In this scenario, the first step is to add parameters to this pipeline so that we can modify its outputs without changing the code.

The programs that run these stages already have parameter definitions in them.

example-get-started/src/prepare.py

example-get-started/src/featurization.py

Specifically, we can see that params.yaml is loaded in these files:

grep 'params.yaml' src/prepare.py src/featurization.py

and the values are used to populate global variables:

grep 'params\[.*\]' src/prepare.py src/featurization.py
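
For reference, the loading pattern in these scripts looks roughly like the sketch below. This is a simplified illustration, not the exact source; the variable names and the keys read from each section may differ slightly:

    import yaml

    # Read this stage's section of params.yaml into a dict
    with open("params.yaml") as f:
        params = yaml.safe_load(f)["prepare"]

    # The values populate variables used by the stage
    split = params["split"]  # fraction of data held out for testing
    seed = params["seed"]    # random seed, for a reproducible split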

Let's now check the content of params.yaml using the link below:

example-get-started/params.yaml
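
For orientation, the file groups values by stage and looks along these lines (the exact values live in the file itself, so treat these numbers as placeholders):

    prepare:
      split: 0.20
      seed: 20170428

    featurize:
      max_features: 500
      ngrams: 1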

In this scenario, we'll update the pipeline to see the effects of changing these variables. DVC has first-class support for parameters used in experimentation. It can track parameters defined in YAML, JSON, TOML, or Python files and rerun the stages affected by a change.

By default, DVC uses a file named params.yaml to track parameters. (What a coincidence!) However, these parameters should be referenced in dvc.yaml to associate them with the stages.
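
We'll add these references in the following steps. As a sketch of what that association looks like (not the exact dvc.yaml from this project), a stage lists the parameter keys it depends on under params:

    stages:
      prepare:
        cmd: python src/prepare.py data/data.xml
        deps:
          - data/data.xml
          - src/prepare.py
        params:
          - prepare.seed
          - prepare.split
        outs:
          - data/prepared

With a declaration like this in place, changing prepare.split in params.yaml invalidates the prepare stage, and dvc repro reruns it along with anything downstream.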

Let's begin by reproducing the pipeline without parameters as a baseline:

dvc repro

and commit changes to Git:

git add data/.gitignore dvc.yaml dvc.lock
git commit -m "baseline experiment without parameters"