Dvclive pipeline transition (#4888)
* dvclive: explain transition to pipelines

* mention other friction points in transitioning to pipelines

* Update content/docs/dvclive/how-it-works.md

Co-authored-by: David de la Iglesia Castro <[email protected]>

* minor edit

---------

Co-authored-by: David de la Iglesia Castro <[email protected]>
dberenbaum and daavoo authored Sep 29, 2023
1 parent f29c95d commit 504c285
Showing 8 changed files with 103 additions and 34 deletions.
8 changes: 7 additions & 1 deletion content/basic-concepts/pipeline.md
@@ -1,6 +1,12 @@
---
name: Pipeline
match: [pipeline, pipelines, 'data pipeline', 'data pipelines', 'dvc pipelines']
match:
- pipeline
- pipelines
- 'data pipeline'
- 'data pipelines'
- 'dvc pipelines'
- 'dvc pipeline'
tooltip: >-
DVC pipelines describe data processing workflows in a standard declarative
YAML format ([`dvc.yaml`](/doc/user-guide/project-structure/dvcyaml-files)).
88 changes: 67 additions & 21 deletions content/docs/dvclive/how-it-works.md
@@ -109,22 +109,81 @@ with Git, in which case you can use

## Setup to Run with DVC

You can create or modify the `dvc.yaml` file at the base of your repository (or
elsewhere) to define a [pipeline](#setup-to-run-with-dvc) to run experiments
with DVC or
[customize plots](/doc/user-guide/experiment-management/visualizing-plots#defining-plots).
A pipeline stage for model training might look like:
Running experiments with DVC provides a structured and reproducible
<abbr>pipeline</abbr> for end-to-end model training. To run experiments with
DVC, define a pipeline using `dvc stage add` or by editing `dvc.yaml`. A
pipeline stage for model training might look like:

<toggle>
<tab title="CLI">

```cli
$ dvc stage add --name train \
  --deps data_dir --deps src/train.py \
  --outs model.pt --outs dvclive \
  python train.py
```

</tab>
<tab title="YAML">

```yaml
stages:
  train:
    cmd: python train.py
    deps:
      - train.py
      - data_dir
    outs:
      - model.pt
      - dvclive
```

</tab>
</toggle>

Adding the DVCLive [directory] to the [outputs] will add it to the DVC [cache]
(if you previously tracked the directory in Git, you must first stop tracking it
there). If you want to keep it in Git, you can disable the cache. You can also
choose to cache only some paths, like keeping lightweight metrics in Git but
adding more heavyweight plots data to the cache:

<toggle>
<tab title="CLI">

```cli
$ dvc stage add --name train \
  --deps data_dir --deps src/train.py \
  --outs model.pt --outs-no-cache dvclive/metrics.json \
  --outs dvclive/plots \
  python train.py
```

</tab>
<tab title="YAML">

```yaml
stages:
  train:
    cmd: python train.py
    deps:
      - train.py
      - data_dir
    outs:
      - model.pt
      - dvclive/metrics.json:
          cache: false
      - dvclive/plots
```

</tab>
</toggle>

Now you can run an experiment using `dvc exp run`. Instead of DVCLive handling
caching and saving experiments, DVC will do this at the end of each run. See
examples of how to [add DVCLive to a pipeline] or [add a pipeline to DVCLive
code], including how to parametrize your code to iterate on experiments.
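
The selective-caching layout described above can be sketched as plain data. The following is a minimal, standard-library-only illustration of how cached and uncached outputs differ in the stage mapping; the helper `build_train_stage` is hypothetical and not part of DVC or DVCLive, and JSON is used only to print the structure.

```python
import json


def build_train_stage(cached_outs, uncached_outs):
    """Return a dvc.yaml-style mapping for a hypothetical `train` stage.

    Cached outputs are plain path strings. Uncached outputs use the
    {path: {"cache": False}} form, mirroring `cache: false` in the YAML.
    """
    outs = list(cached_outs)
    outs += [{path: {"cache": False}} for path in uncached_outs]
    return {
        "stages": {
            "train": {
                "cmd": "python train.py",
                "deps": ["train.py", "data_dir"],
                "outs": outs,
            }
        }
    }


stage = build_train_stage(
    cached_outs=["model.pt", "dvclive/plots"],
    uncached_outs=["dvclive/metrics.json"],
)
# JSON is used here only to display the structure with the standard library;
# dvc.yaml itself is normally written as block-style YAML.
print(json.dumps(stage, indent=2))
```

Keeping the small metrics file out of the cache leaves it in Git, while the heavier plots data and the model go to the DVC cache.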

<admon type="tip">

You may have previously tracked [outputs] with `Live.log_artifact()` that
@@ -135,24 +194,11 @@ pipeline. You can optionally drop `Live.log_artifact()` from your code.

</admon>

Optionally add any subpaths of the DVCLive [directory] to the [outputs]. DVC
will [cache] them by default, and you can use those paths as [dependencies]
downstream in your pipeline. For example, to cache all DVCLive plots:

```diff
 stages:
   train:
     cmd: python train.py
     deps:
       - train.py
     outs:
       - model.pt
+      - dvclive/plots
```

[directory]: /doc/dvclive/how-it-works#directory-structure
[cache]: /doc/start/data-management/data-versioning
[outputs]: /doc/user-guide/pipelines/defining-pipelines#outputs
[dependencies]: /doc/user-guide/pipelines/defining-pipelines#simple-dependencies
[pipelines]: /doc/start/experiments/experiment-pipelines
[pipeline]: /doc/start/experiments/experiment-pipelines
[generates]: /doc/dvclive/live/make_dvcyaml
[add DVCLive to a pipeline]: /doc/start/data-management/metrics-parameters-plots
[add a pipeline to DVCLive code]: /doc/start/experiments/experiment-pipelines
19 changes: 8 additions & 11 deletions content/docs/dvclive/index.md
@@ -153,23 +153,20 @@ with Live() as live:
## Outputs

After you run your training code, all the logged data will be stored in the
`dvclive` directory. Check the [DVCLive outputs](/doc/dvclive/how-it-works) page
for more details.
`dvclive` [directory] and [tracked] as a <abbr>DVC experiment</abbr> for
analysis and comparison.

## Run with DVC

Experimenting in Python interactively (like in notebooks) is great for
exploration, but eventually you may need a more structured way to run
reproducible experiments. By configuring DVC [pipelines], you can [run
experiments] with `dvc exp run`. This will track the inputs and outputs of code,
and enable more advanced workflows like multi-step pipelines and queueing
multiple experiments or even an entire grid search. See examples of how to [add
DVCLive to a pipeline] or [add a pipeline to DVCLive code], or get more
information about how to [setup a pipeline] to work with DVCLive.
reproducible experiments. By configuring <abbr>DVC pipelines</abbr>, you can
[run experiments] with `dvc exp run`. Pipelines help you organize your ML
workflow beyond a single notebook or script so you can modularize and
parametrize your code. See how to [setup a pipeline] to work with DVCLive.

[release notes]: https://github.com/iterative/dvclive/releases/tag/3.0.0
[directory]: /doc/dvclive/how-it-works
[tracked]: /doc/start/experiments/experiment-tracking
[run experiments]: /doc/user-guide/experiment-management/running-experiments
[pipelines]: /doc/user-guide/pipelines
[add DVCLive to a pipeline]: /doc/start/data-management/metrics-parameters-plots
[add a pipeline to DVCLive code]: /doc/start/experiments/experiment-pipelines
[setup a pipeline]: /doc/dvclive/how-it-works#setup-to-run-with-dvc
3 changes: 3 additions & 0 deletions content/docs/dvclive/live/index.md
@@ -93,6 +93,9 @@ You can use `Live()` as a context manager. When exiting the context manager,
- `cache_images` - If `True`, DVCLive will <abbr>cache</abbr> any images logged
with `Live.log_image()` as part of `Live.end()`. Defaults to `False`.

If running a <abbr>DVC pipeline</abbr>, `cache_images` will be ignored, and
you should instead cache images as pipeline <abbr>outputs</abbr>.

- `exp_message` - If not `None`, and `save_dvc_exp` is `True`, the provided
string will be passed to
[`dvc exp save --message`](/doc/command-reference/exp/save#--message).
3 changes: 3 additions & 0 deletions content/docs/dvclive/live/log_artifact.md
@@ -77,6 +77,9 @@ it in the <abbr>model registry</abbr>.
Git. Defaults to `True`, but set to `False` if you want to annotate metadata
about the artifact without storing a copy in the DVC cache.

If running a <abbr>DVC pipeline</abbr>, `cache` will be ignored, and you
should instead cache artifacts as pipeline <abbr>outputs</abbr>.

## Exceptions

- `dvclive.error.InvalidDataTypeError` - thrown if the provided `path` does not
6 changes: 6 additions & 0 deletions content/docs/dvclive/live/log_param.md
@@ -34,6 +34,9 @@ The logged params can be visualized with `dvc params`:
$ dvc params diff dvclive/params.yaml
```

If you use <abbr>DVC pipelines</abbr>, [parameter dependencies] are tracked
automatically, and you can skip logging them with DVCLive.

</admon>
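
As a rough illustration of the note above: when parameters live in a params file that the pipeline stage depends on, the training code can simply read them, and no `Live.log_param()` call is needed. The sketch below uses only the standard library, with a JSON file standing in for the project's params file. The file name `params.json`, the `train` section, and the helper `load_params` are assumptions made for this example, not part of DVC or DVCLive.

```python
import json
import tempfile
from pathlib import Path


def load_params(path, section):
    """Return one section of a JSON params file as a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)[section]


# Write a throwaway params file so the example runs on its own.
workdir = Path(tempfile.mkdtemp())
params_file = workdir / "params.json"  # hypothetical file name
params_file.write_text(json.dumps({"train": {"epochs": 3, "lr": 0.01}}))

train_params = load_params(params_file, "train")
for epoch in range(train_params["epochs"]):
    pass  # a training step would go here
```

The script only reads its parameters, and tracking them is left to the stage's parameter dependencies.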

## Parameters
@@ -57,3 +60,6 @@ $ dvc params diff dvclive/params.yaml
Dict[str, "ParamLike"]
]
```

[parameter dependencies]:
/doc/user-guide/pipelines/defining-pipelines#parameter-dependencies
6 changes: 6 additions & 0 deletions content/docs/dvclive/live/log_params.md
@@ -47,6 +47,9 @@ The logged params can be visualized with `dvc params`:
dvc params diff dvclive/params.yaml
```

If you use <abbr>DVC pipelines</abbr>, [parameter dependencies] are tracked
automatically, and you can skip logging them with DVCLive.

</admon>

## Parameters
@@ -68,3 +71,6 @@ dvc params diff dvclive/params.yaml
Dict[str, "ParamLike"]
]
```

[parameter dependencies]:
/doc/user-guide/pipelines/defining-pipelines#parameter-dependencies
4 changes: 3 additions & 1 deletion content/docs/start/experiments/experiment-pipelines.md
@@ -113,7 +113,7 @@ $ dvc stage add -n train \
$ dvc stage add -n evaluate \
-p base,evaluate \
-d src/evaluate.py -d models/model.pkl -d data/test_data \
python src/evaluate.py
-o results python src/evaluate.py
```

The `dvc.yaml` file is updated automatically and should include all the stages
@@ -155,6 +155,8 @@ stages:
    params:
      - base
      - evaluate
    outs:
      - results
```

</details>