From 9e0c756ad2b6d37d61d3442705af2f22dc3c7a98 Mon Sep 17 00:00:00 2001
From: skshetry <18718008+skshetry@users.noreply.github.com>
Date: Thu, 17 Aug 2023 10:55:57 +0545
Subject: [PATCH] guide: document matrix in dvc.yaml (#4761)

* guide: document matrix in dvc.yaml

fixes #4741
upstream PR: iterative/dvc#9725
available since https://github.com/iterative/dvc/releases/3.12.0

* Apply suggestions from code review

Co-authored-by: Dave Berenbaum

* Restyled by prettier (#4765)

Co-authored-by: Restyled.io

* redirect from foreach to matrix; add a complete example for matrix with templating

* format matrix stage list

---------

Co-authored-by: Dave Berenbaum
Co-authored-by: restyled-io[bot] <32688539+restyled-io[bot]@users.noreply.github.com>
Co-authored-by: Restyled.io
---
 .../project-structure/dvcyaml-files.md | 97 ++++++++++++++++++-
 1 file changed, 95 insertions(+), 2 deletions(-)

diff --git a/content/docs/user-guide/project-structure/dvcyaml-files.md b/content/docs/user-guide/project-structure/dvcyaml-files.md
index 1b8c3bb236..bcb8c1e4fa 100644
--- a/content/docs/user-guide/project-structure/dvcyaml-files.md
+++ b/content/docs/user-guide/project-structure/dvcyaml-files.md
@@ -618,6 +618,13 @@ value), escape it with a backslash, e.g. `\${...`.
 
 ## `foreach` stages
 
+
+
+Check out [`matrix` stages](#matrix-stages) for a more powerful way to define
+multiple stages.
+
+
+
 You can define more than one stage in a single `dvc.yaml` entry with the
 following syntax. A `foreach` element accepts a list or dictionary with values
 to iterate on, while `do` contains the regular stage fields (`cmd`, `outs`,
@@ -745,6 +752,91 @@ Both individual foreach stages (`train@1`) and groups of foreach stages
+## `matrix` stages
+
+`matrix` allows you to define multiple stages based on combinations of
+variables. A `matrix` element accepts one or more variables, each iterating over
+a list of values. For example:
+
+```yaml
+stages:
+  train:
+    matrix:
+      model: [cnn, xgb]
+      feature: [feature1, feature2, feature3]
+    cmd: ./train.py --feature ${item.feature} ${item.model}
+    outs:
+      - ${item.model}.pkl
+```
+
+You can reference each variable in your stage definition using the `item`
+dictionary key. In the above example, you can access `item.model` and
+`item.feature`.
+
+On `dvc repro`, dvc will expand the definition to multiple stages for each
+possible combination of the variables. In the above example, dvc will create six
+stages, one for each combination of `model` and `feature`. The name of the stages
+will be generated by appending values of the variables to the stage name after a
+`@` as with [foreach](#foreach). For example, dvc will create the following
+stages:
+
+```cli
+$ dvc stage list
+train@cnn-feature1  Outputs cnn.pkl
+train@cnn-feature2  Outputs cnn.pkl
+train@cnn-feature3  Outputs cnn.pkl
+train@xgb-feature1  Outputs xgb.pkl
+train@xgb-feature2  Outputs xgb.pkl
+train@xgb-feature3  Outputs xgb.pkl
+```
+
+Both individual matrix stages (e.g. `train@cnn-feature1`) and groups of matrix
+stages (`train`) may be used in commands that accept stage targets.
+
+The values in variables can be simple values such as string, integer, etc. and
+composite values such as list, dictionary, etc. For example:
+
+```yaml
+matrix:
+  config:
+    - n_estimators: 150
+      max_depth: 20
+    - n_estimators: 120
+      max_depth: 30
+  labels:
+    - [label1, label2, label3]
+    - [labelX, labelY, labelZ]
+```
+
+When using a list or a dictionary, dvc will generate the name of stages based on
+variable name and the index of the value. In the above example, generated stages
+may look like `train@labels0-config0`.
+
+Templating can also be used inside `matrix`, so you can reference
+[variables](#variables) defined elsewhere. For example, you can define values in
+`params.yaml` file and use them in `matrix`.
+
+```yaml
+# params.yaml
+datasets: [dataset1/, dataset2/]
+processors: [processor1, processor2]
+```
+
+```yaml{4-6}
+# dvc.yaml
+stages:
+  preprocess:
+    matrix:
+      processor: ${processors}
+      dataset: ${datasets}
+
+    cmd: ./preprocess.py ${item.dataset} ${item.processor}
+    deps:
+      - ${item.dataset}
+    outs:
+      - ${item.dataset}-${item.processor}.json
+```
+
 ## dvc.lock file
 
 To record the state of your pipeline(s) and help track its outputs,
@@ -803,5 +895,6 @@ and all forms of
 Full parameter dependencies (both key and value) are listed too
 (under `params`), under each parameters file name.
 [templated `dvc.yaml`](#templating) files, the actual values are written to
-`dvc.lock` (no `${}` expression). As for [`foreach` stages](#foreach-stages),
-individual stages are expanded (no `foreach` structures are preserved).
+`dvc.lock` (no `${}` expression). As for [`foreach` stages](#foreach-stages) and
+[`matrix` stages](#matrix-stages), individual stages are expanded (no `foreach`
+or `matrix` structures are preserved).
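
Reviewer note on the `matrix` expansion documented in this patch: the rule can be pictured as a Cartesian product over the variable lists. The Python sketch below is only an illustration and is not DVC's implementation. `expand_matrix` is a hypothetical helper, and two details are assumptions rather than documented behavior: that name parts follow the order in which variables are defined, and that list or dictionary values are named by variable name plus index.

```python
from itertools import product


def expand_matrix(stage, matrix):
    """Map generated stage names to their `item` dictionaries.

    Hypothetical helper for illustration only; it mimics the naming
    described in the docs and is not taken from DVC's source.
    """
    keys = list(matrix)
    expanded = {}
    # Pair every value with its index so composite values can be named by index.
    for combo in product(*(enumerate(matrix[key]) for key in keys)):
        item = {}
        parts = []
        for key, (index, value) in zip(keys, combo):
            item[key] = value
            if isinstance(value, (list, dict)):
                # Assumption: composite values are named "<variable><index>".
                parts.append(f"{key}{index}")
            else:
                parts.append(str(value))
        expanded[f"{stage}@{'-'.join(parts)}"] = item
    return expanded


stages = expand_matrix(
    "train",
    {"model": ["cnn", "xgb"], "feature": ["feature1", "feature2", "feature3"]},
)
for name in stages:
    print(name)
# train@cnn-feature1
# train@cnn-feature2
# train@cnn-feature3
# train@xgb-feature1
# train@xgb-feature2
# train@xgb-feature3
```

Each value in `stages` plays the role of the `item` dictionary, so `stages["train@xgb-feature2"]["model"]` is `"xgb"`. The printed names match the `dvc stage list` output shown in the patch for simple values; the ordering of name parts for composite values is not specified by the text, so treat that part of the sketch as a guess.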