Introduce ADR for dependencies management in Jupyter notebooks #282
Does this mean that for a repo, each notebook will live in its own dir with its own Pipfile/Pipfile.lock, as well as having its dependencies embedded in the JSON?
I'll admit I don't fully understand the dependency management process (and @fridex can probably answer this question better 😄), but isn't it redundant and potentially error-prone to maintain dependencies both in the notebook and in a Pipfile? Shouldn't the decision be one or the other? In that case, I think embedded would be the way to go for each notebook, with a single overarching project Pipfile for the whole repo (kind of how projects are set up currently). Or is that what Option 3 is already saying?
+1
We should keep dependencies embedded in the notebook all the time. Having them aside is an action that should be triggered explicitly, either when exporting them or when importing a dependency listing from Pipfile/Pipfile.lock.
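For illustration, a minimal sketch of what "extract on demand" could look like, assuming a hypothetical `requirements` key in the notebook metadata that holds a Pipfile-like mapping (the key name, its layout, and the `analysis.ipynb` filename are assumptions of this sketch, not something the ADR prescribes):

```python
import json
from pathlib import Path

def extract_pipfile(notebook_path: str, output_dir: str = ".") -> None:
    """Write a Pipfile from dependencies embedded in a notebook's metadata."""
    notebook = json.loads(Path(notebook_path).read_text())
    # Hypothetical metadata key; the real key/format is up to the tooling.
    requirements = notebook.get("metadata", {}).get("requirements")
    if requirements is None:
        raise ValueError(f"{notebook_path} has no embedded dependencies")

    # Hand-written TOML for brevity; real tooling would use a TOML/Pipfile library.
    lines = [
        "[[source]]",
        'url = "https://pypi.org/simple"',
        "verify_ssl = true",
        'name = "pypi"',
        "",
        "[packages]",
    ]
    for name, specifier in requirements.get("packages", {}).items():
        lines.append(f'{name} = "{specifier}"')
    Path(output_dir, "Pipfile").write_text("\n".join(lines) + "\n")

extract_pipfile("analysis.ipynb")
```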
No, as we discussed at the last DS meetup, we decided not to consider the option of one repo per notebook. But thinking about what you and @fridex said, maybe we can restructure it as:
Jupyter notebook with dependencies embedded in the JSON file of the notebook, which can optionally be extracted.
But what about the main Pipfile/Pipfile.lock? If I work on three different notebooks, creating dependencies for each, they will be different.
If we want to create an image to run those notebooks, we need a single Pipfile/Pipfile.lock with the dependencies from all notebooks.
How do we deal with having one single Pipfile/Pipfile.lock and different notebooks, each with their own dependencies?
Maybe notebook 1 requires only numpy, pandas, and matplotlib, while notebook 2 requires only tensorflow.
Do we need some way to merge them, syncing a common Pipfile/Pipfile.lock that can be used to run them all?
Yes, these files are just TOML and JSON files. We have tooling in thoth-python that can merge these files and keep them consistent (e.g. check the computed hash, avoid duplicates, ...). The workflow should include Thoth - a single Pipfile is created out of all the notebooks and Thoth resolves the Pipfile.lock. The Thoth part is required because these dependencies can have conflicts between them.
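This is not the thoth-python implementation, just a naive sketch of the merge idea: combine the `[packages]` tables of several Pipfiles and report conflicting specifiers instead of resolving them, since resolution (and the hash/consistency checks) is exactly the part delegated to Thoth. The file paths are made up.

```python
import toml  # third-party "toml" package; only used to keep the sketch short

def merge_pipfiles(paths):
    """Naively merge the [packages] sections of several Pipfiles."""
    merged, conflicts = {}, []
    for path in paths:
        packages = toml.load(path).get("packages", {})
        for name, spec in packages.items():
            if name in merged and merged[name] != spec:
                conflicts.append((name, merged[name], spec))
            else:
                merged[name] = spec
    if conflicts:
        # Conflicting version specifiers need a real resolver (Thoth), not this sketch.
        raise ValueError(f"conflicting specifiers: {conflicts}")
    return {
        "source": [{"url": "https://pypi.org/simple", "verify_ssl": True, "name": "pypi"}],
        "packages": merged,
    }

with open("Pipfile", "w") as f:
    toml.dump(merge_pipfiles(["notebook1/Pipfile", "notebook2/Pipfile"]), f)
```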
Thanks @fridex, I will proceed this way! I will update the ADR.
Sounds good to me.
Maybe add a bit more specificity to it? "Jupyter notebook with dependencies embedded in the JSON file of the notebook that can optionally be extracted as a merged Pipfile via Thoth"
I think we will have two options:
One button in the notebook itself, to extract the Pipfile/Pipfile.lock from that notebook.
Another button, possibly in the menu under the kernels tab, that would look at all notebooks and create a merged Pipfile and Pipfile.lock.
Jupyter notebook with dependencies embedded in the JSON file of the notebook, which can optionally be extracted if the user wants.
If more notebooks are present, a common Pipfile can be created with a button that automatically extracts the dependencies from all notebooks, and a new common Pipfile.lock will be created. This would allow the creation of an image that can run all the notebooks.
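As a rough, hypothetical sketch of what that second button could do behind the scenes (reusing the assumed `requirements` metadata key from the sketch above; this is not actual extension code):

```python
import json
from pathlib import Path

def collect_repo_requirements(repo_root: str = "."):
    """Collect embedded dependencies from every notebook in the repo."""
    all_packages = {}
    for nb_path in Path(repo_root).rglob("*.ipynb"):
        metadata = json.loads(nb_path.read_text()).get("metadata", {})
        for name, spec in metadata.get("requirements", {}).get("packages", {}).items():
            # Last writer wins here; the real tool has to detect conflicts
            # and let Thoth resolve the common Pipfile.lock.
            all_packages[name] = spec
    return all_packages
```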
WDYT?
👍
Thanks @MichaelClifford @fridex!