Difficulty: Beginner
Estimated Time: 15-25 min

In ML projects, we usually need to process data and generate outputs in a reproducible way. This requires establishing a connection between the input data, the program that processes it, the program's parameters, and the outputs.

DVC captures this process as a data pipeline. In this scenario, we begin to build pipelines by defining stages and connecting them together.

Stages are the basic building blocks of pipelines in DVC. Each stage defines and executes an action, such as importing data or extracting features, and usually produces some output.
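As a preview of what a stage looks like, here is a minimal sketch of a `dvc.yaml` entry for a data-preparation stage. The stage name and file paths are illustrative; the actual stages in this scenario will be generated with DVC commands rather than written by hand:

```yaml
stages:
  prepare:                                  # stage name
    cmd: python3 src/prepare.py data/data.xml   # the action to execute
    deps:                                   # inputs the stage depends on
      - src/prepare.py
      - data/data.xml
    outs:                                   # outputs the stage produces
      - data/prepared
```

DVC uses the `deps` and `outs` lists to decide when a stage needs to be re-run and to connect stages into a pipeline.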

In this scenario, our goal is to create a project that classifies questions and assigns tags to them. When tasks like data preparation, training, testing, and evaluation are run manually, the process is prone to errors caused by too many moving parts. Pipelines provide a reproducible way to organize these tasks.

If you prefer to follow along locally, you can also run the commands in this scenario in a container:

docker run -it dvcorg/doc-katacoda:start-stages

Stages and Pipelines

Step 1 of 8

Manual Data Preparation

The script src/prepare.py splits the data into train and test sets.
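The contents of `src/prepare.py` are not shown here, but the core idea of a deterministic train/test split can be sketched as follows. All names in this sketch (`train_test_split`, `split_ratio`, `rows`) are illustrative, not taken from the actual script:

```python
import random

def train_test_split(rows, split_ratio=0.2, seed=42):
    """Shuffle rows deterministically and split off a test fraction.

    A fixed seed makes the split reproducible: rerunning the script
    on the same data yields the same train/test sets.
    """
    rng = random.Random(seed)
    shuffled = rows[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * split_ratio)
    return shuffled[n_test:], shuffled[:n_test]  # train, test

train, test = train_test_split(list(range(100)))
print(len(train), len(test))  # 80 20
```

Reproducibility of steps like this is exactly what DVC stages will track for us: the same inputs and parameters should always produce the same outputs.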
We first run this script without DVC to see what happens:

python3 src/prepare.py data/data.xml

Let's see the output:

ls -l data/prepared

We delete the artifacts before reproducing them with DVC.

rm -fr data/prepared