guide: document matrix in dvc.yaml (#4761)
* guide: document matrix in dvc.yaml

fixes #4741
upstream PR: iterative/dvc#9725
available since https://github.com/iterative/dvc/releases/3.12.0

* Apply suggestions from code review

Co-authored-by: Dave Berenbaum <[email protected]>

* Restyled by prettier (#4765)

Co-authored-by: Restyled.io <[email protected]>

* redirect from foreach to matrix; add a complete example for matrix with templating

* format matrix stage list

---------

Co-authored-by: Dave Berenbaum <[email protected]>
Co-authored-by: restyled-io[bot] <32688539+restyled-io[bot]@users.noreply.github.com>
Co-authored-by: Restyled.io <[email protected]>
4 people authored Aug 17, 2023
1 parent 607c534 commit 9e0c756
Showing 1 changed file with 95 additions and 2 deletions.
97 changes: 95 additions & 2 deletions content/docs/user-guide/project-structure/dvcyaml-files.md
@@ -618,6 +618,13 @@ value), escape it with a backslash, e.g. `\${...`.

## `foreach` stages

<admon type="info">

Check out [`matrix` stages](#matrix-stages) for a more powerful way to define
multiple stages.

</admon>

You can define more than one stage in a single `dvc.yaml` entry with the
following syntax. A `foreach` element accepts a list or dictionary with values
to iterate on, while `do` contains the regular stage fields (`cmd`, `outs`,
@@ -745,6 +752,91 @@ Both individual foreach stages (`train@1`) and groups of foreach stages

</admon>

## `matrix` stages

`matrix` allows you to define multiple stages based on combinations of
variables. A `matrix` element accepts one or more variables, each iterating over
a list of values. For example:

```yaml
stages:
  train:
    matrix:
      model: [cnn, xgb]
      feature: [feature1, feature2, feature3]
    cmd: ./train.py --feature ${item.feature} ${item.model}
    outs:
      - ${item.model}.pkl
```

You can reference each variable in your stage definition using the `item`
dictionary key. In the above example, you can access `item.model` and
`item.feature`.

On `dvc repro`, DVC expands the definition into one stage for each possible
combination of the variables. In the above example, DVC creates six stages, one
for each combination of `model` and `feature`. Stage names are generated by
appending the values of the variables to the stage name after an `@`, as with
[`foreach`](#foreach-stages). For example, DVC will create the following
stages:

```cli
$ dvc stage list
train@cnn-feature1 Outputs cnn.pkl
train@cnn-feature2 Outputs cnn.pkl
train@cnn-feature3 Outputs cnn.pkl
train@xgb-feature1 Outputs xgb.pkl
train@xgb-feature2 Outputs xgb.pkl
train@xgb-feature3 Outputs xgb.pkl
```

Both individual matrix stages (e.g. `train@cnn-feature1`) and groups of matrix
stages (`train`) may be used in commands that accept stage targets.
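
The expansion described above behaves like a Cartesian product over the matrix
variables. The following Python sketch illustrates that idea only; it is not
DVC's implementation. The helper name `expand_matrix` is hypothetical, and the
sketch assumes simple scalar values combined in declaration order.

```python
from itertools import product


def expand_matrix(stage_name, matrix):
    """Yield (generated_name, item) pairs, one per combination of values.

    Illustrative only: assumes every value is a simple scalar and that
    variables are combined in the order they are declared.
    """
    keys = list(matrix)
    for values in product(*(matrix[key] for key in keys)):
        item = dict(zip(keys, values))
        suffix = "-".join(str(value) for value in values)
        yield f"{stage_name}@{suffix}", item


matrix = {"model": ["cnn", "xgb"], "feature": ["feature1", "feature2", "feature3"]}
stages = list(expand_matrix("train", matrix))
for name, item in stages:
    print(name, item)
```

Each `item` dictionary here plays the role of the `item` key available inside
the stage definition.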

The values of a variable can be simple (strings, integers, etc.) or composite
(lists, dictionaries, etc.). For example:

```yaml
matrix:
  config:
    - n_estimators: 150
      max_depth: 20
    - n_estimators: 120
      max_depth: 30
  labels:
    - [label1, label2, label3]
    - [labelX, labelY, labelZ]
```

When a value is a list or a dictionary, DVC generates stage names from the
variable name and the index of the value. In the above example, generated stages
may look like `train@labels0-config0`.
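
As a rough illustration of that naming rule, the Python sketch below derives
the fragment each value would contribute to a stage name. The function
`name_part` is hypothetical, and the rule it encodes (variable name plus index
for lists and dictionaries, string form otherwise) is an assumption based on
the description above rather than DVC's actual code. The order in which
fragments are joined is not modeled.

```python
def name_part(variable, index, value):
    """Return the fragment one matrix value contributes to a stage name.

    Assumption for illustration: lists and dictionaries are named by
    variable name plus position; everything else by its string form.
    """
    if isinstance(value, (list, dict)):
        return f"{variable}{index}"
    return str(value)


matrix = {
    "config": [
        {"n_estimators": 150, "max_depth": 20},
        {"n_estimators": 120, "max_depth": 30},
    ],
    "labels": [["label1", "label2", "label3"], ["labelX", "labelY", "labelZ"]],
}

# Map each variable to the name fragments of its values.
parts = {
    variable: [name_part(variable, i, value) for i, value in enumerate(values)]
    for variable, values in matrix.items()
}
print(parts)
```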

Templating can also be used inside `matrix`, so you can reference
[variables](#variables) defined elsewhere. For example, you can define values in
a `params.yaml` file and use them in `matrix`:

```yaml
# params.yaml
datasets: [dataset1/, dataset2/]
processors: [processor1, processor2]
```

```yaml{4-6}
# dvc.yaml
stages:
  preprocess:
    matrix:
      processor: ${processors}
      dataset: ${datasets}
    cmd: ./preprocess.py ${item.dataset} ${item.processor}
    deps:
      - ${item.dataset}
    outs:
      - ${item.dataset}-${item.processor}.json
```
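
To make the two substitution steps concrete, here is a small Python sketch that
first fills the matrix from parameter values and then renders `cmd` for every
combination. It is an approximation for illustration, not how DVC resolves
templates: the `render` helper is hypothetical, the parameter values are
hard-coded instead of read from `params.yaml`, and only flat `${item.<key>}`
references are handled.

```python
import re
from itertools import product

# Values as they would appear in params.yaml (hard-coded to stay self-contained).
params = {
    "datasets": ["dataset1/", "dataset2/"],
    "processors": ["processor1", "processor2"],
}

# Once ${processors} and ${datasets} are resolved, the matrix holds plain lists.
matrix = {"processor": params["processors"], "dataset": params["datasets"]}


def render(template, item):
    """Replace each ${item.<key>} in template with the matching value."""
    return re.sub(r"\$\{item\.(\w+)\}", lambda m: str(item[m.group(1)]), template)


# One item dictionary per combination, in declaration order.
items = [dict(zip(matrix, values)) for values in product(*matrix.values())]
commands = [
    render("./preprocess.py ${item.dataset} ${item.processor}", item)
    for item in items
]
for cmd in commands:
    print(cmd)
```

With two processors and two datasets, the sketch yields four commands, one per
generated stage.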

## dvc.lock file

To record the state of your pipeline(s) and help track its <abbr>outputs</abbr>,
@@ -803,5 +895,6 @@ and all forms of
Full <abbr>parameter dependencies</abbr> (both key and value) are listed too
(under `params`), under each parameters file name.
[templated `dvc.yaml`](#templating) files, the actual values are written to
`dvc.lock` (no `${}` expression). As for [`foreach` stages](#foreach-stages),
individual stages are expanded (no `foreach` structures are preserved).
`dvc.lock` (no `${}` expression). As for [`foreach` stages](#foreach-stages) and
[`matrix` stages](#matrix-stages), individual stages are expanded (no `foreach`
or `matrix` structures are preserved).
