This guide uses the DVC Get Started guide as a starting point and walks you through building maintainable Machine Learning pipelines with DVC.
If you have some time, you can check the full article here (it has more in-depth explanations than this README 😉).
The principles are:
- Write a Python script for each pipeline step
- Save the parameters each script uses in a YAML file
- Specify the files each script depends on
- Specify the files each script generates
In this tutorial we're going to build a model to classify the 20newsgroups dataset.
Environment: Linux with Python 3, pip and Git installed
$ mkdir dvc_tutorial
$ cd dvc_tutorial
$ python3 -m venv .env
$ source .env/bin/activate
(.env)$ pip3 install dvc
(.env)$ git init
(.env)$ dvc init
# file params.yaml
prepare:
  categories:
    - comp.graphics
    - sci.space
Save the `prepare.py` file (it's available here on this repo) inside `src/`. Your folder structure should look like this:
├── params.yaml
└── src
└── prepare.py
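The actual `prepare.py` is in the repo; as a rough sketch of what such a script might look like (the function names, the CSV layout, and the use of scikit-learn's `fetch_20newsgroups` are my assumptions, not the repo's exact code):

```python
import os

import pandas as pd
import yaml
from sklearn.datasets import fetch_20newsgroups


def load_categories(params_path="params.yaml"):
    # Read the category list for this stage from params.yaml
    with open(params_path) as f:
        return yaml.safe_load(f)["prepare"]["categories"]


def prepare(out_dir, categories):
    # Download only the configured 20newsgroups categories
    dataset = fetch_20newsgroups(subset="train", categories=categories)
    os.makedirs(out_dir, exist_ok=True)
    # Persist raw texts and labels for the next stage
    df = pd.DataFrame({"text": dataset.data, "target": dataset.target})
    df.to_csv(os.path.join(out_dir, "train.csv"), index=False)


# Invoked by DVC as: python3 src/prepare.py
# prepare("data/prepared", load_categories())
```

Reading `categories` from `params.yaml` (instead of hardcoding it) is what lets DVC detect parameter changes later with `dvc params diff`.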
The steps for doing that are:
- Write a Python script: `prepare.py`
- Save the parameters: `categories` inside `params.yaml`
- Specify the files the script depends on: `prepare.py`
- Specify the files the script generates: the folder `data/prepared`
- Define the command line instruction to run this step
(.env)$ pip install pyyaml scikit-learn pandas
(.env)$ dvc run -n prepare -p prepare.categories -d src/prepare.py -o data/prepared python3 src/prepare.py
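Each `dvc run` call records its stage in a `dvc.yaml` file at the repo root. After the command above, the generated entry should look roughly like this:

```yaml
stages:
  prepare:
    cmd: python3 src/prepare.py
    deps:
      - src/prepare.py
    params:
      - prepare.categories
    outs:
      - data/prepared
```

This file is what `dvc repro` reads later to decide which stages are stale and need to be re-run.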
(.env)$ dvc run -n featurize -d src/featurize.py -d data/prepared -o data/features python3 src/featurize.py data/prepared data/features
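A minimal sketch of what `featurize.py` might do, assuming the prepared data is a CSV with `text` and `target` columns and that we vectorize with scikit-learn's `TfidfVectorizer` (the repo's script may differ):

```python
import os
import pickle

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer


def featurize(in_dir, out_dir):
    # Load the prepared texts produced by the previous stage
    df = pd.read_csv(os.path.join(in_dir, "train.csv"))
    # Turn raw texts into a TF-IDF feature matrix
    vectorizer = TfidfVectorizer(max_features=5000)
    features = vectorizer.fit_transform(df["text"])
    os.makedirs(out_dir, exist_ok=True)
    # Store the features together with the labels for the train stage
    with open(os.path.join(out_dir, "train.pkl"), "wb") as f:
        pickle.dump((features, df["target"].values), f)


# Invoked by DVC as: python3 src/featurize.py data/prepared data/features
```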
(.env)$ dvc run -n train -p train.alpha -d src/train.py -d data/features -o model.pkl python3 src/train.py data/features model.pkl
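For `train.py`, a sketch under the same assumptions (the choice of `MultinomialNB` and the pickle layout are mine; the command registers `train.alpha` as a parameter, so the script should read it from `params.yaml`):

```python
import os
import pickle

import yaml
from sklearn.naive_bayes import MultinomialNB


def load_alpha(params_path="params.yaml"):
    # Read the smoothing parameter for this stage from params.yaml
    with open(params_path) as f:
        return yaml.safe_load(f)["train"]["alpha"]


def train(features_dir, model_path, alpha):
    # Load the feature matrix and labels saved by the featurize stage
    with open(os.path.join(features_dir, "train.pkl"), "rb") as f:
        X, y = pickle.load(f)
    # Fit a Naive Bayes classifier with the configured smoothing
    model = MultinomialNB(alpha=alpha)
    model.fit(X, y)
    # Serialize the fitted model for the evaluate stage
    with open(model_path, "wb") as f:
        pickle.dump(model, f)


# Invoked by DVC as: python3 src/train.py data/features model.pkl
# train("data/features", "model.pkl", load_alpha())
```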
(.env)$ dvc run -n evaluate -d src/evaluate.py -d model.pkl -d data/features --metrics-no-cache scores.json --plots-no-cache plots.json python3 src/evaluate.py model.pkl data/features scores.json plots.json
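And a sketch of `evaluate.py`: the command above marks `scores.json` as a metrics file and `plots.json` as a plots file, so the script needs to write both. The metric choice (`avg_precision`) and the record field names are my assumptions; what matters is that `plots.json` contains a JSON list of records with `precision` and `recall` fields, since those are the axes used by the `dvc plots` commands below:

```python
import json
import os
import pickle

from sklearn.metrics import average_precision_score, precision_recall_curve


def evaluate(model_path, features_dir, scores_path, plots_path):
    with open(model_path, "rb") as f:
        model = pickle.load(f)
    with open(os.path.join(features_dir, "train.pkl"), "rb") as f:
        X, y = pickle.load(f)
    # Probability of the positive class drives the precision-recall curve
    probs = model.predict_proba(X)[:, 1]
    precision, recall, _ = precision_recall_curve(y, probs)
    # Scalar metrics go into the file registered with --metrics-no-cache
    with open(scores_path, "w") as f:
        json.dump({"avg_precision": float(average_precision_score(y, probs))}, f)
    # dvc plots expects a JSON list of records with the fields used as axes
    with open(plots_path, "w") as f:
        json.dump(
            [{"precision": float(p), "recall": float(r)}
             for p, r in zip(precision, recall)],
            f,
        )


# Invoked by DVC as:
# python3 src/evaluate.py model.pkl data/features scores.json plots.json
```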
# file params.yaml
prepare:
  categories:
    - comp.graphics
    - rec.sport.baseball
train:
  alpha: 0.9
(.env)$ dvc repro
(.env)$ dvc params diff
(.env)$ dvc metrics diff
(.env)$ dvc plots show -y precision -x recall plots.json
(.env)$ dvc plots diff --targets plots.json -y precision