The data science process is inherently iterative and R&D-like. A data scientist may try many different approaches and hyperparameter values, and "fail" many times before reaching the required level of a metric. DVC provides a framework to capture and compare the performance of experiments through metrics.
If you prefer to run locally, you can also run the commands in this scenario in a container:
docker run -it dvcorg/doc-katacoda:start-params
Params, Metrics, and Plots
Adding parameters to pipelines
In the previous scenario, we defined a two-stage pipeline to split the data into training and test sets and extract features for each. In this scenario, the first step is to add parameters to this pipeline to modify the outputs without code changes.
The programs that run these stages already have parameter definitions in them.
Specifically, we can see that params.yaml is loaded in these files:
grep 'params.yaml' src/prepare.py src/featurization.py
and the values are used to populate global variables:
grep 'params\[.*\]' src/prepare.py src/featurization.py
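The loading pattern these greps reveal typically looks like the sketch below. The key names (`prepare`, `split`, `seed`) are assumptions for illustration; check the actual source files. The sketch writes a small illustrative params.yaml first so it runs standalone:

```python
import yaml  # PyYAML, the parser commonly used for params.yaml

# Illustrative params.yaml content (the real file in the repo may differ)
with open("params.yaml", "w") as f:
    f.write("prepare:\n  split: 0.20\n  seed: 20170428\n")

# The pattern the grep output shows: load the file, then
# use its values to populate globals in the stage script
with open("params.yaml") as f:
    params = yaml.safe_load(f)

split = params["prepare"]["split"]
seed = params["prepare"]["seed"]
```

Because the values live in a file rather than in the code, changing them requires no code edits, which is what makes them trackable by DVC.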
Let's now check the content of params.yaml:
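Its content resembles the following sketch; the values and key names here are illustrative, so check the actual file in the repository:

```yaml
prepare:
  split: 0.20
  seed: 20170428

featurize:
  max_features: 500
  ngrams: 1
```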
In this scenario, we'll update the pipeline to see the effects of changes in these variables. DVC has first-class support for parameters used in experimentation. It can track the parameter changes defined in YAML, JSON, TOML, or Python files and run the affected stages.
By default, DVC uses a file named params.yaml to track parameters. (What a coincidence!) However, these parameters should be referenced in dvc.yaml to associate them with the stages.
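A stage declares the parameters it depends on under a `params` section in dvc.yaml. A minimal sketch, assuming the stage and key names used above (the actual dvc.yaml in your repo may differ):

```yaml
stages:
  prepare:
    cmd: python src/prepare.py data/data.xml
    deps:
      - src/prepare.py
      - data/data.xml
    params:
      - prepare.seed
      - prepare.split
    outs:
      - data/prepared
```

With this in place, DVC re-runs the `prepare` stage only when one of its listed parameter values changes.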
Let's begin by reproducing the pipeline without parameters as a baseline:

dvc repro
and commit changes to Git:
git add data/.gitignore dvc.yaml data/prepared/ dvc.lock
git commit -m "baseline experiment without parameters"