ML projects usually need to process data and generate outputs in a reproducible way. This requires establishing a connection between the data being processed, the program that processes it, its parameters, and the outputs.
DVC captures this process as a data pipeline. In this scenario, we start building pipelines by defining stages and connecting them together.
Stages are the basic building blocks of the pipelines in DVC. They define and execute an action, like data import or feature extraction, and usually produce some output.
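For context, a stage definition as DVC records it in dvc.yaml looks roughly like this. This is a sketch of the prepare stage used later in this scenario; the exact fields depend on how the stage is created:

```yaml
stages:
  prepare:
    # The command DVC runs to (re)produce this stage's outputs
    cmd: python3 src/prepare.py data/data.xml
    # Dependencies: if any of these change, the stage is re-run
    deps:
      - src/prepare.py
      - data/data.xml
    # Outputs: tracked by DVC and regenerated by the command above
    outs:
      - data/prepared
```

DVC uses the deps/outs lists to decide which stages are out of date and need to be re-executed.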
In this scenario, our goal is to create a project that classifies questions and assigns tags to them. When tasks like data preparation, training, testing, and evaluation are run manually, the many moving parts make the process error-prone. Pipelines provide a reproducible way to organize these tasks.
If you prefer to work locally, you can also run the commands in this scenario in a container:
docker run -it dvcorg/doc-katacoda:start-stages
Stages and Pipelines
Manual Data Preparation
src/prepare.py splits the data into train and test sets.
We first run this script without DVC to see what happens:
python3 src/prepare.py data/data.xml
Let's see the output:
ls -l data/prepared
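Conceptually, the split that src/prepare.py performs can be sketched as follows. This is a hypothetical simplification: the real script parses data/data.xml and writes files under data/prepared, while this version only shows the core train/test split logic on toy records, with an assumed test_ratio parameter and a fixed seed for reproducibility.

```python
import random

def split_records(records, test_ratio=0.2, seed=20170428):
    # Shuffle a copy deterministically so the split is reproducible,
    # then carve off the first test_ratio fraction as the test set.
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

# Toy records standing in for rows parsed from data/data.xml
train, test = split_records(list(range(100)), test_ratio=0.2)
print(len(train), len(test))  # 80 20
```

Because the seed is fixed, re-running the split on the same input always yields the same partition, which is exactly the property a DVC stage relies on for reproducibility.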
We delete these artifacts before reproducing them with DVC:
rm -fr data/prepared