This tutorial shows how to manage dependencies for Python-based Jupyter Notebooks so that they are reproducible and shareable.
Many developers (including data scientists) focus on their core problems when working on their experiments, but one often-overlooked aspect can prevent these projects from being reused: how their dependencies are managed.
One of the first steps during the development of a project is the selection of libraries or dependencies. When someone runs pip install <package-name>, they might not be aware that, along with the library being installed (a direct dependency), many other dependencies, so-called transitive dependencies, will be installed on their machine. Any change in one of those dependencies can break the experiment. It is fundamental to have a way to state all the dependencies used, including the operating system, Python interpreter and hardware that were used to run a certain experiment.
Dependency management is one of the most important requirements for reproducibility. Having dependencies clearly stated allows portability of notebooks, so they can be shared safely with others, reused in other projects or simply reproduced. If you want to know more about this issue in the data science domain, have a look at this article or this video.
Project Thoth keeps dependencies up to date by giving recommendations through developers' daily tools. Thanks to this service, developers (including data scientists) do not have to worry about managing dependencies after they are selected, since conflicts can be handled by Thoth bots and automated pipelines. This AI support can benefit AI projects, for example through better performance from optimized dependencies and additional security, since insecure libraries cannot be introduced. If you want to know more, have a look at Thoth's website.
Among the different Thoth integrations, this tutorial focuses on the JupyterLab extension for dependency management, which is called jupyterlab-requirements.
You can use this extension for each of your notebooks to guarantee they have the correct dependencies. The extension can add and remove dependencies, lock them and store them in the notebook metadata. In this way, all the dependency information required to recreate the environment is shipped with the notebook.
In particular, the following notebook metadata is created for you when you use Thoth's dependency management tool:
- requirements (Pipfile);
- requirements locked with all versions and hashes of libraries (direct and transitive ones) (Pipfile.lock);
- dependency resolution engine used (Thoth or Pipenv);
- configuration file containing runtime environment (only for Thoth resolution engine).
Together, this information makes the notebook reproducible and shareable.
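Because a notebook file is plain JSON, you can check which of these pieces a given notebook carries without opening JupyterLab. The sketch below is illustrative only: the metadata key names in `EXPECTED_KEYS` are assumptions based on the four items listed above, so compare them with a notebook saved by the extension before relying on them. The helper `dependency_metadata_report` is a hypothetical name, not part of jupyterlab-requirements.

```python
import json

# Assumed key names for the four metadata entries described above;
# confirm them against a notebook saved by the extension.
EXPECTED_KEYS = (
    "requirements",
    "requirements_lock",
    "dependency_resolution_engine",
    "thoth_config",
)


def dependency_metadata_report(notebook_json):
    """Map each expected metadata key to True or False depending on presence."""
    notebook = json.loads(notebook_json)
    notebook_metadata = notebook.get("metadata", {})
    return {key: key in notebook_metadata for key in EXPECTED_KEYS}


if __name__ == "__main__":
    # A tiny in-memory notebook, used here instead of a real .ipynb file.
    example = json.dumps(
        {
            "cells": [],
            "metadata": {
                "requirements": "{}",
                "dependency_resolution_engine": "pipenv",
            },
            "nbformat": 4,
            "nbformat_minor": 5,
        }
    )
    for key, present in dependency_metadata_report(example).items():
        print(f"{key}: {'present' if present else 'missing'}")
```

To inspect a real notebook, read the .ipynb file as text and pass its contents to the function. A notebook resolved with Pipenv would be expected to lack the Thoth configuration entry, since that one is only written for the Thoth resolution engine.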
At the end of this tutorial you will be able to manage dependencies for your projects in Jupyter Notebooks, enabling others to reproduce what you did and to contribute to it. The last section also shows how to enable the Kebechet bot to keep dependencies up to date automatically, and how to set up and use automated pipelines from AICoE CI to create releases and images of your projects that you can easily share with others.
Operate First is an open infrastructure environment started at Red Hat's Office of the CTO. It has been selected to run this tutorial since it is an open source initiative that fulfills all the requirements stated above. Anyone with a Google account can log in and start developing. To learn more about Operate First, visit the website or GitHub community.
Operate First hosts Open Data Hub with all the tools provided for data science projects (e.g. JupyterHub, Elyra, Kubeflow Pipelines, Seldon, Prometheus, Grafana, Superset) running on Red Hat OpenShift.
The project template used in this tutorial can be found here: project template. It shows the correlation between a data scientist's needs (e.g. data, notebooks, models) and those of an AI DevOps engineer (e.g. manifests). Having structure in a project ensures all the pieces required for the ML and DevOps lifecycles are present and easily discoverable.