Introduce logic for optional AI Service for optimized software stack for Python environments. #1084

Open
pacospace opened this issue Sep 15, 2021 · 2 comments

Comments

@pacospace

pacospace commented Sep 15, 2021

Proposed change

Project Thoth [1] uses artificial intelligence to analyze and recommend software stacks for Python applications, optimized for their specific runtime environment and recommendation type. The recommended stack is provided in Pipfile/Pipfile.lock format, as supported by Pipenv (https://pipenv.pypa.io/en/latest/). Thoth aims to help developers (including data scientists) manage dependencies easily by letting them state their requirements, and to make their projects reproducible and shareable.

The AI service is enabled with a .thoth.yaml file (https://github.com/thoth-station/thamos#using-custom-configuration-file-template) that states the runtime environment and the recommendation type (https://thoth-station.ninja/recommendation-types/). This can help others find out which Python interpreter, OS and hardware were used to create a given software stack.

Project Thoth [1] has several integrations:

  • A CLI/library integration called thamos [2], which handles any action involving .thoth.yaml and the interaction with the Thoth service.
  • A JupyterLab notebooks integration called jupyterlab-requirements [3], which makes notebooks reproducible and shareable: https://github.com/thoth-station/jupyterlab-requirements#usage. This could also help tools like repo2docker identify the pinned-down software stack that comes with a given notebook.
  • An S2I integration [4].
  • A GitHub App [5].

For this feature request, I would focus on the thamos library [2]. This could be an optional feature, enabled only by users who want it (by adding .thoth.yaml to their repo).

Thoth also has a provenance check feature, which verifies where packages come from before installing them. It could be optionally enabled as well.

Alternative options

Who would use this feature?

All users interested in having an optimized software stack to use for their projects.

How much effort will adding it take?

It should not take too much time to introduce optional Thoth logic that can be enabled by repo2docker only if users have .thoth.yaml in their repo.
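
A minimal sketch of the opt-in check described above, assuming the only trigger is the presence of a `.thoth.yaml` file at the repository root. The function name `thoth_enabled` is hypothetical and is not part of repo2docker or thamos.

```python
from pathlib import Path

# File name taken from the proposal; everything else here is illustrative.
THOTH_CONFIG_NAME = ".thoth.yaml"


def thoth_enabled(repo_dir: str) -> bool:
    """Return True only if the repository opts in by shipping a .thoth.yaml.

    Hypothetical helper: repo2docker does not expose this function.
    """
    return (Path(repo_dir) / THOTH_CONFIG_NAME).is_file()
```

With a check like this, repositories without the file would follow the existing build path unchanged.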

Who can do this work?

I'm available to write the part that uses the thamos library [2] to optionally enable the Thoth service, but I would need some support with testing those changes.

References

@manics
Member

manics commented Sep 15, 2021

Thoth sounds like quite a cool tool. It's not clear to me how it fits in with repo2docker though.

The aim of repo2docker is to reproducibly build an environment from a git repo, using whatever dependency specifications are provided. How do you envisage Thoth fitting in to this, and what are the pros and cons of integrating it in repo2docker vs running it as a standalone tool?

@pacospace
Author

pacospace commented Sep 16, 2021

> Thoth sounds like quite a cool tool. It's not clear to me how it fits in with repo2docker though.
>
> The aim of repo2docker is to reproducibly build an environment from a git repo, using whatever dependency specifications are provided.

A short note on reproducibility: it is achievable only if the software stack is pinned down with all versions, direct and transitive, together with the runtime environment (OS and hardware). So unless users provide a Pipfile.lock, or a requirements.txt listing all packages with versions and hashes for both direct and transitive dependencies (obtained with the pip-compile --generate-hashes requirements.in command), reproducibility cannot be achieved.
Every time you start a build from a requirements.txt that names a package without any other information, the result can differ between today and tomorrow, because a new release of that package may be published in between. Moreover, if a software stack was created on Fedora, nothing guarantees that it will run on UBI8 or Ubuntu, so you need information about the runtime that was used in order to have truly reproducible builds.
This is what Project Thoth is trying to bring to the community. We have a video that explains this concept of reproducibility in more detail, especially for Jupyter notebooks: https://www.youtube.com/watch?v=ifyQ2oSxjnU&t=1s
Thoth can support users in their GitHub repositories through the GitHub App (https://github.com/marketplace/khebhut), keeping their dependencies up to date and well maintained in Pipfile/Pipfile.lock format and letting them choose what kind of recommendations they want.
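
The pinning requirement above can be illustrated with a rough check of a requirements.txt. This is only a heuristic sketch, not how Thoth or pip validates files: it assumes the layout produced by pip-compile --generate-hashes, and the function name `unpinned_requirements` is hypothetical.

```python
def unpinned_requirements(requirements_text: str) -> list[str]:
    """Return entries that lack an exact '==' pin or a '--hash=' option.

    Heuristic only: it does not implement the full requirements file grammar.
    """
    # Join backslash continuations so each requirement is one logical line.
    logical = requirements_text.replace("\\\n", " ")
    problems = []
    for raw in logical.splitlines():
        line = raw.split(" #", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("#") or line.startswith("-"):
            continue  # skip blanks, comment lines, and pip options
        # Look for the pin before any environment marker or hash option.
        spec = line.split(";", 1)[0].split("--hash=", 1)[0].strip()
        if "==" not in spec or "--hash=" not in line:
            problems.append(spec)
    return problems
```

An empty result suggests every listed package carries a version pin and at least one hash. It says nothing about whether transitive dependencies are missing from the file, nor about the runtime environment.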

> How do you envisage Thoth fitting into this, and what are the pros and cons of integrating it in repo2docker vs running it as a standalone tool?

For the reasons explained above, Thoth can support truly reproducible builds.
Thoth can make sure that the software stack created and runtime environment are correct because it has knowledge about the installability of libraries in the runtime environments. It can discover the runtime which is running and break the builds if you are trying to build an image for a software stack not created for that environment.
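
To make the idea of failing a build on a runtime mismatch concrete, here is a small sketch using only the Python standard library. It is not Thoth's implementation: the dictionary keys (`python_version`, `machine`) and both function names are assumptions chosen for illustration.

```python
import platform


def detect_runtime() -> dict:
    """Best-effort detection of the running environment (illustrative keys)."""
    return {
        "python_version": ".".join(platform.python_version_tuple()[:2]),
        "machine": platform.machine(),
    }


def check_runtime(declared: dict, detected: dict) -> None:
    """Raise RuntimeError if any declared value differs from the detected one."""
    mismatches = {
        key: (expected, detected.get(key))
        for key, expected in declared.items()
        if detected.get(key) != expected
    }
    if mismatches:
        raise RuntimeError(f"runtime mismatch: {mismatches}")
```

A build step could call `check_runtime(declared, detect_runtime())` and abort on the exception, where `declared` would come from whatever runtime information was recorded with the stack.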

Thoth also has security features: it can check that what you are trying to install has the correct hashes and that nothing was modified between Pipfile and Pipfile.lock, for example.
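
As a rough illustration of the hash check, the sketch below compares the SHA-256 digest of an artifact with the hashes recorded for a package in an already-parsed Pipfile.lock. It assumes the usual lock layout (a `default` section whose entries carry a `hashes` list of `sha256:...` strings). The function name is hypothetical, and this does not claim to be how Thoth performs the check.

```python
import hashlib


def artifact_matches_lock(lock: dict, package: str, artifact: bytes) -> bool:
    """Check an artifact's SHA-256 digest against the hashes locked for a package.

    `lock` is the parsed JSON content of a Pipfile.lock.
    """
    entry = lock.get("default", {}).get(package, {})
    digest = "sha256:" + hashlib.sha256(artifact).hexdigest()
    return digest in entry.get("hashes", [])
```

A package missing from the lock file is treated the same as a hash mismatch, which is the conservative choice for a security check.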

Thoth can also install dependencies because it uses micropipenv under the hood: #1083
