Introduce ADR for dependencies management in Jupyter notebooks #282

Merged
merged 5 commits into from
Nov 12, 2020
35 changes: 35 additions & 0 deletions docs/0000-dependencies-management-jupyter-notebooks.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
# Dependencies management in Jupyter Notebooks

## Context and Problem Statement

How to guarantee reproducibility of Jupyter Notebooks?

To allow any user to rerun a notebook with similar behaviour, each notebook should ship with dependency requirements
that include both direct and transitive dependencies. This also supports security, reproducibility, and traceability.

Notebooks should be treated as components/services that use their own dependencies. Therefore, when notebooks are stored,
they should be stored with their dependencies so that an image can be built to run them or they can be shared and reused by others.

## Decision Drivers <!-- optional -->

* user perspective
* reproducibility
* traceability

## Considered Options

* 1. Jupyter notebook without dependencies (no reproducibility)
* 2. Jupyter notebook without dependencies embedded in the notebook's JSON file, but with Pipfile/Pipfile.lock always present (Jupyter notebook and requirements are decoupled)
* 3. Jupyter notebook with dependencies embedded in the notebook's JSON file and Pipfile/Pipfile.lock present

Does this mean that for a repo, each notebook will live in its own dir with its own Pipfile/Pipfile.lock as well as having its dependencies embedded in JSON?

I'll admit I don't fully understand the dependency management process (and @fridex can probably answer this question better 😄 ), but isn't it redundant and potentially error-prone to maintain dependencies both in the notebook and as a Pipfile? Shouldn't the decision be one or the other? In which case, I think embedded would be the way to go for each notebook, with a single overarching project Pipfile for the whole repo (kind of how projects are set up currently). Or is that what this Option 3 is saying already?

+1

We should keep dependencies embedded in the notebook all the time. Keeping them alongside the notebook as separate files should be an explicit action, triggered when exporting them or when importing a dependency listing from Pipfile/Pipfile.lock.

Contributor Author

> Does this mean that for a repo, each notebook will live in its own dir with its own Pipfile/Pipfile.lock as well as having its dependencies embedded in JSON?
>
> I'll admit I don't fully understand the dependency management process (and @fridex can probably answer this question better ), but isn't it redundant and potentially error-prone to maintain dependencies both in the notebook and as a Pipfile? Shouldn't the decision be one or the other? In which case, I think embedded would be the way to go for each notebook, with a single overarching project Pipfile for the whole repo (kind of how projects are set up currently). Or is that what this Option 3 is saying already?

No. As we discussed at the last DS meetup, we decided not to consider the option of one repo per notebook. But thinking about what you and @fridex said, maybe we can restructure the option as:

Jupyter notebook with dependencies embedded in the JSON file of the notebook that can be optionally extracted.

But what about the main Pipfile/Pipfile.lock? If I work on three different notebooks, creating dependencies for each, they will differ.

If we want to create an image to run those notebooks, we need a single Pipfile/Pipfile.lock with the dependencies from all notebooks.

How do we deal with having one single Pipfile/Pipfile.lock and different notebooks, each with their own dependencies?
Maybe notebook 1 requires only numpy, pandas and matplotlib, while notebook 2 requires only tensorflow.

Do we need a way to merge them, syncing a common Pipfile/Pipfile.lock that can be used to run them all?

> How do we deal with having one single Pipfile/Pipfile.lock and different notebooks, each with their own dependencies?
> Maybe notebook 1 requires only numpy, pandas and matplotlib, while notebook 2 requires only tensorflow.
>
> Do we need a way to merge them, syncing a common Pipfile/Pipfile.lock that can be used to run them all?

Yes, these files are just TOML and JSON files. We have tooling in thoth-python that can merge these files and keep them consistent (e.g. check the computed hash, avoid duplicates, ...). The workflow should include Thoth: only the Pipfile is created out of all the notebooks, and Thoth resolves the Pipfile.lock. The Thoth part is required because these dependencies can have issues between them.
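
To make the merge step concrete, here is a minimal sketch in plain Python. It does not use the thoth-python tooling mentioned above; `merge_requirements` and the example requirement mappings are hypothetical, and the sketch assumes each notebook's direct dependencies are available as a `{package: version specifier}` mapping. It only collapses duplicates and flags conflicting specifiers, leaving actual resolution of the lock file to a resolver such as Thoth.

```python
import json
from typing import Dict, List


def merge_requirements(per_notebook: List[Dict[str, str]]) -> Dict[str, str]:
    """Merge {package: specifier} mappings, one mapping per notebook.

    Hypothetical helper: identical entries are collapsed, while differing
    specifiers for the same package raise an error instead of being
    reconciled (a real tool would try to reconcile compatible specifiers).
    """
    merged: Dict[str, str] = {}
    for requirements in per_notebook:
        for package, specifier in requirements.items():
            name = package.lower()  # treat package names case-insensitively
            existing = merged.get(name)
            if existing is not None and existing != specifier:
                raise ValueError(
                    f"conflicting specifiers for {name}: {existing!r} vs {specifier!r}"
                )
            merged[name] = specifier
    # Sort by package name so the output is stable across runs.
    return dict(sorted(merged.items()))


# Example data taken from the discussion: two notebooks with different needs.
notebook_1 = {"numpy": "*", "pandas": "*", "matplotlib": "*"}
notebook_2 = {"tensorflow": "*", "numpy": "*"}

common = merge_requirements([notebook_1, notebook_2])
print(json.dumps(common, indent=2))
```

The merged mapping would become the `[packages]` section of the common Pipfile; the Pipfile.lock would still have to be produced by a resolver.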

Contributor Author

Thanks @fridex, I will proceed this way and update the ADR!

> Jupyter notebook with dependencies embedded in json file of the notebook that can be optionally extracted.

Sounds good to me.

Maybe add a bit more specificity to it? "Jupyter notebook with dependencies embedded in json file of the notebook that can be optionally extracted as a merged Pipfile via Thoth"

Contributor Author

@pacospace pacospace Nov 12, 2020

I think we will have two options:

  • One in the notebook itself, to extract Pipfile/Pipfile.lock from the notebook

  • Another button, possibly in the menu under the kernels tab, that would look at all notebooks and create a merged Pipfile and Pipfile.lock.

Jupyter notebook with dependencies embedded in the JSON file of the notebook, which can optionally be extracted if the user wants

  • If several notebooks are present, a common Pipfile can be created with a button that automatically extracts the dependencies from all notebooks, and a new common Pipfile.lock will be created. This would allow the creation of an image that can run the notebooks.

WDYT?
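
The first action above (extracting a Pipfile from a single notebook) could look roughly like the following sketch. The metadata key `requirements`, its `packages` sub-key, and both helper functions are assumptions made for illustration; the ADR does not name the actual key. Only the `[packages]` table of the Pipfile is rendered here.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict

# Hypothetical metadata key: the ADR does not specify where dependencies live.
METADATA_KEY = "requirements"


def read_notebook_requirements(path: Path) -> Dict[str, str]:
    """Return the {package: specifier} mapping embedded in a notebook, if any."""
    notebook = json.loads(path.read_text(encoding="utf-8"))
    return notebook.get("metadata", {}).get(METADATA_KEY, {}).get("packages", {})


def render_pipfile(packages: Dict[str, str]) -> str:
    """Render only the [packages] table of a Pipfile as TOML text."""
    lines = ["[packages]"]
    for name in sorted(packages):
        lines.append(f'{name} = "{packages[name]}"')
    return "\n".join(lines) + "\n"


# Build a throwaway notebook file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    nb_path = Path(tmp) / "example.ipynb"
    nb_path.write_text(
        json.dumps(
            {
                "cells": [],
                "metadata": {METADATA_KEY: {"packages": {"pandas": "*", "numpy": "*"}}},
                "nbformat": 4,
                "nbformat_minor": 4,
            }
        ),
        encoding="utf-8",
    )
    pipfile_text = render_pipfile(read_notebook_requirements(nb_path))

print(pipfile_text)
```

The second action (a merged Pipfile for all notebooks) would apply the same extraction to every notebook in the repo and then merge the results before resolving a common Pipfile.lock.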

👍

## Decision Outcome

The option selected is 3, because it:

* enforces reproducibility
* enforces traceability between notebook and requirements

### Positive Consequences <!-- optional -->

* Satisfies reproducibility, traceability, and shareability.
* Notebooks are coupled with dependencies in their metadata.
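
As an illustration of what coupling dependencies with notebook metadata might look like, the sketch below stores a package mapping under the notebook's top-level `metadata` object. The key name `requirements`, the `packages` sub-key, and the `embed_requirements` helper are assumptions, not something this ADR fixes.

```python
import json
from typing import Any, Dict

# Hypothetical metadata key, chosen only for this example.
METADATA_KEY = "requirements"


def embed_requirements(notebook: Dict[str, Any], packages: Dict[str, str]) -> Dict[str, Any]:
    """Return a copy of the notebook JSON with packages stored in its metadata."""
    updated = json.loads(json.dumps(notebook))  # deep copy of plain JSON data
    updated.setdefault("metadata", {})[METADATA_KEY] = {"packages": dict(packages)}
    return updated


notebook = {
    "cells": [],
    "metadata": {"kernelspec": {"name": "python3"}},
    "nbformat": 4,
    "nbformat_minor": 4,
}
with_deps = embed_requirements(notebook, {"numpy": "*"})
print(json.dumps(with_deps["metadata"], indent=2))
```

Because the mapping lives in the same JSON file as the cells, sharing the `.ipynb` file also shares its requirements.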
74 changes: 74 additions & 0 deletions docs/template.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,74 @@
# [short title of solved problem and solution]

* Status: [proposed | rejected | accepted | deprecated | … | superseded by [ADR-0005](0005-example.md)] <!-- optional -->
* Deciders: [list everyone involved in the decision] <!-- optional -->
* Date: [YYYY-MM-DD when the decision was last updated] <!-- optional -->

Technical Story: [description | ticket/issue URL] <!-- optional -->

## Context and Problem Statement

[Describe the context and problem statement, e.g., in free form using two to three sentences. You may want to articulate the problem in form of a question.]

## Decision Drivers <!-- optional -->

* [driver 1, e.g., a force, facing concern, …]
* [driver 2, e.g., a force, facing concern, …]
* … <!-- numbers of drivers can vary -->

## Considered Options

* [option 1]
* [option 2]
* [option 3]
* … <!-- numbers of options can vary -->

## Decision Outcome

Chosen option: "[option 1]", because [justification. e.g., only option, which meets k.o. criterion decision driver | which resolves force force | … | comes out best (see below)].

### Positive Consequences <!-- optional -->

* [e.g., improvement of quality attribute satisfaction, follow-up decisions required, …]
* …

### Negative Consequences <!-- optional -->

* [e.g., compromising quality attribute, follow-up decisions required, …]
* …

## Pros and Cons of the Options <!-- optional -->

### [option 1]

[example | description | pointer to more information | …] <!-- optional -->

* Good, because [argument a]
* Good, because [argument b]
* Bad, because [argument c]
* … <!-- numbers of pros and cons can vary -->

### [option 2]

[example | description | pointer to more information | …] <!-- optional -->

* Good, because [argument a]
* Good, because [argument b]
* Bad, because [argument c]
* … <!-- numbers of pros and cons can vary -->

### [option 3]

[example | description | pointer to more information | …] <!-- optional -->

* Good, because [argument a]
* Good, because [argument b]
* Bad, because [argument c]
* … <!-- numbers of pros and cons can vary -->

## Links <!-- optional -->

* [Link type] [Link to ADR] <!-- example: Refined by [ADR-0005](0005-example.md) -->
* … <!-- numbers of links can vary -->

<!-- markdownlint-disable-file MD013 -->