Investigate that builds are reproducible #507

Open
joshuatcasey opened this issue Aug 8, 2022 · 12 comments · May be fixed by #514

@joshuatcasey
Contributor

joshuatcasey commented Aug 8, 2022

Describe the Enhancement

Builds with this buildpack should be reproducible, meaning given identical inputs, the SHAs of resulting buildpack-built images are the same. This means, for a given app, if I run:

pack build my-app -b paketo-buildpacks/python

and then run

pack build my-app-copy -b paketo-buildpacks/python

with the same source code and configurations, the resulting image SHAs should be the same.

Currently, builds are not reproducible because of SBOMs included in the final app image. See paketo-buildpacks/packit#367 and paketo-buildpacks/packit#368. But once those issues are resolved and a new version of packit has been released, we should expect that the buildpack builds are reproducible.

Possible Solution

Add assertions to integration tests that show that two builds with the same inputs produce identical outputs.
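
One low-tech version of that assertion, as a shell sketch (image names are illustrative; pack and docker are assumed to be available):

pack build repro-check-a -b paketo-buildpacks/python --clear-cache
pack build repro-check-b -b paketo-buildpacks/python --clear-cache
# the two image IDs should be identical if the build is reproducible
test "$(docker image inspect --format '{{.Id}}' repro-check-a)" = \
     "$(docker image inspect --format '{{.Id}}' repro-check-b)" && echo "reproducible"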

Motivation

Build reproducibility is a selling point of CNBs that we want to provide to Paketo buildpack users. We want to know if future implementation decisions compromise build reproducibility.

@ForestEckhardt ForestEckhardt self-assigned this Aug 29, 2022
@ForestEckhardt ForestEckhardt linked a pull request Aug 29, 2022 that will close this issue
@ForestEckhardt ForestEckhardt removed their assignment Sep 6, 2022
@fg-j fg-j self-assigned this Sep 27, 2022
@fg-j

fg-j commented Oct 3, 2022

As @ForestEckhardt points out on his draft PR, there are still aspects of python builds that are NOT reproducible (now that SBOM reproducibility has been resolved). Some of these may be unavoidable parts of python build processes. I'll investigate and report findings on this issue.

@fg-j

fg-j commented Oct 3, 2022

For the simple no_package_manager sample app in the Paketo samples repo, rebuilding with the same inputs produces images with identical SHAs. From this, I can surmise that the manner in which the buildpack installs python is reproducible.
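
For example, comparing the image IDs of two such builds directly (image names are illustrative):

docker image inspect --format '{{.Id}}' no-pkg-mgr-one no-pkg-mgr-two
# prints the same ID twice when the builds are reproducible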

@fg-j

fg-j commented Oct 3, 2022

For the pip sample app in the Paketo samples repo, rebuilding with the same inputs produces images with different SHAs - i.e. builds are NOT reproducible. The layers that differ between builds are the CPython layer and the pip-install buildpack's packages layer. Something about installing pip and/or running pip install both a) modifies the CPython layer in a non-reproducible way and b) installs packages in a non-reproducible way.

The cpython layer seems to differ because it contains a __pycache__ directory with *.pyc files that aren't reproducible. This issue on cpython led me to consider whether setting SOURCE_DATE_EPOCH in the build environment would result in reproducible *.pyc files. It didn't.

Likewise, the layers/paketo-buildpacks_pip-install/packages/lib/python<version>/site-packages/ directory where the pip-install buildpack installs application dependencies contains several __pycache__ directories (one in each app dependency's subdirectory) with non-reproducible *.pyc files.

I notice that this PR sets PYTHONPYCACHEPREFIX so that the __pycache__ isn't in the working directory at app launch time. But the env var isn't set at build time. I believe this is why *.pyc files are making their way into the cpython layer. Setting the env var at build time to a path outside the /layers directory may protect the cpython and pip-install layers from non-reproducible *.pyc files.

Exploration reveals that setting the environment variable to a temp location at build time results in reproducible builds of the pip sample app.

pack build pip-sample-one --buildpack paketo-buildpacks/python:2.2.0 --builder paketobuildpacks/builder-jammy-buildpackless-base:0.0.19 --env PYTHONPYCACHEPREFIX="/tmp" --clear-cache
...
pack build pip-sample-two --buildpack paketo-buildpacks/python:2.2.0 --builder paketobuildpacks/builder-jammy-buildpackless-base:0.0.19 --env PYTHONPYCACHEPREFIX="/tmp" --clear-cache
...
docker image ls | grep pip-sample
pip-sample-one                                                                                latest               3303ad6715c7   42 years ago   321MB
pip-sample-two                                                                                latest               3303ad6715c7   42 years ago   321MB

Filed paketo-buildpacks/cpython#426

@fg-j

fg-j commented Oct 12, 2022

For the pipenv sample app, both the cpython and pipenv-install layers have different SHAs for builds with identical inputs. This makes builds NOT reproducible. Setting PYTHONPYCACHEPREFIX at build time addresses the cpython layer differences, as with pip. paketo-buildpacks/cpython#426 is necessary for making pipenv builds reproducible, but it's not sufficient. Even with the PyCache location set to a temporary directory at build time, the pipenv-install packages layer has different SHAs on each build. As of now, I am still investigating which files end up differing in the layer.

Edit:
After much container exporting and diffing, I can't seem to pin down files that differ between pipenv-install layers of different builds. cc @paketo-buildpacks/python-maintainers another set of eyes on this might help straighten things out.
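
For anyone picking this up, a sketch of the export-and-diff workflow (image and container names are illustrative):

docker create --name build-a pipenv-sample-one
docker create --name build-b pipenv-sample-two
docker export build-a | tar -tvf - | sort > a.txt
docker export build-b | tar -tvf - | sort > b.txt
diff a.txt b.txt  # flags files whose size or metadata differ; extract with tar -x and diff -r to compare contents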

@fg-j

fg-j commented Oct 13, 2022

For the poetry sample app, setting PYTHONPYCACHEPREFIX also makes the cpython layer reproducible, but the poetry and poetry-install layers still are not.

This ongoing discussion on the poetry repository upstream suggests that installing poetry itself isn't currently reproducible without going to extreme lengths that diverge from the canonical ways most users install poetry.

Unless there is appetite for making the buildpack install poetry in a reproducible way, reproducible poetry-based builds are a non-starter. This is unfortunate because, theoretically, poetry enables reproducible installation of the packages it manages by introducing a lock file concept.

@fg-j

fg-j commented Oct 17, 2022

For the conda sample app, the conda-env-update layer is the one that makes builds non-reproducible. The differences are caused by:

  • as with cpython, the __pycache__ directories that are stored in the conda env layer contain non-reproducible content. Setting PYTHONPYCACHEPREFIX to a temp location at build time eliminates these differences.
  • the layer contains a conda-meta/history file which is used for viewing virtual environment modification history. See below for example contents.
    • The file contains a timestamp. A look at the source code reveals that it's hard coded to use the current time when called. That is, it doesn't respect SOURCE_DATE_EPOCH.
    • The file also contains a list of packages that are added/removed/updated. The list of packages seems to be written to the file in an inconsistent order.
    • Removing this conda-meta/history file after running a conda installation does not seem to have negative effects at build or launch time. The app also successfully rebuilds without this file. An exploration of conda support in syft indicates that other JSON files in conda-meta may be useful for SBOM data, but conda-meta/history is not needed.

In summary, setting PYTHONPYCACHEPREFIX to a temp location and removing conda-meta/history are together sufficient for making conda builds reproducible. With both tweaks, the sample app rebuilds with the same SHA each time (a sketch follows the example history file below). See this PR for the implementation I played with during this exploration. @paketo-buildpacks/python-maintainers I leave it to you to decide whether/how to implement changes for reproducibility in this case.

Example conda-meta/history
==> 2022-10-17 19:10:37 <==
# cmd: /layers/paketo-buildpacks_miniconda/conda/bin/conda-env update --prefix /layers/paketo-buildpacks_conda-env-update/conda-env --file /workspace/environment.yml
# conda version: 4.11.0
+defaults/linux-64::_libgcc_mutex-0.1-main
+defaults/linux-64::_openmp_mutex-5.1-1_gnu
+defaults/linux-64::ca-certificates-2022.07.19-h06a4308_0
+defaults/linux-64::certifi-2022.9.24-py39h06a4308_0
+defaults/linux-64::click-8.0.4-py39h06a4308_0
+defaults/linux-64::ld_impl_linux-64-2.38-h1181459_1
+defaults/linux-64::libffi-3.3-he6710b0_2
+defaults/linux-64::libgcc-ng-11.2.0-h1234567_1
+defaults/linux-64::libgomp-11.2.0-h1234567_1
+defaults/linux-64::libstdcxx-ng-11.2.0-h1234567_1
+defaults/linux-64::markupsafe-2.1.1-py39h7f8727e_0
+defaults/linux-64::ncurses-6.3-h5eee18b_3
+defaults/linux-64::openssl-1.1.1q-h7f8727e_0
+defaults/linux-64::pip-22.2.2-py39h06a4308_0
+defaults/linux-64::python-3.9.13-haa1d7c7_2
+defaults/linux-64::readline-8.1.2-h7f8727e_1
+defaults/linux-64::setuptools-63.4.1-py39h06a4308_0
+defaults/linux-64::sqlite-3.39.3-h5082296_0
+defaults/linux-64::tk-8.6.12-h1ccaba5_0
+defaults/linux-64::xz-5.2.6-h5eee18b_0
+defaults/linux-64::zlib-1.2.12-h5eee18b_3
+defaults/noarch::dataclasses-0.8-pyh6d0b6a4_7
+defaults/noarch::flask-2.0.2-pyhd3eb1b0_0
+defaults/noarch::itsdangerous-2.0.1-pyhd3eb1b0_0
+defaults/noarch::jinja2-3.0.3-pyhd3eb1b0_0
+defaults/noarch::tzdata-2022c-h04d1e81_0
+defaults/noarch::werkzeug-2.0.3-pyhd3eb1b0_0
+defaults/noarch::wheel-0.37.1-pyhd3eb1b0_0
# update specs: ['flask=2.0.2', 'python=3.9']
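
Putting the two tweaks together, a rough sketch of what the build step could do ($CONDA_ENV_LAYER is an illustrative placeholder for the conda-env-update layer path, not a real variable in the buildpack):

export PYTHONPYCACHEPREFIX=/tmp  # keep non-reproducible *.pyc files out of the layer
conda env update --prefix "$CONDA_ENV_LAYER" --file /workspace/environment.yml
rm -f "$CONDA_ENV_LAYER/conda-meta/history"  # timestamped, inconsistently ordered, and not needed for SBOM data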

@robdimsdale
Member

Just a quick note, in addition to the option of PYTHONPYCACHEPREFIX, we could also explore the environment variable: PYTHONDONTWRITEBYTECODE.

From the docs:

If this is set to a non-empty string, Python won’t try to write .pyc files on the import of source modules. This is equivalent to specifying the -B option.

If it works as advertised, it could be a more elegant option than writing files to a discarded directory (e.g. /tmp).
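
A hedged sketch of what that experiment could look like (the flags mirror the earlier pack commands):

pack build my-app --buildpack paketo-buildpacks/python --env PYTHONDONTWRITEBYTECODE=1 --clear-cache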

@fg-j fg-j removed their assignment Oct 18, 2022
@robdimsdale
Member

Following on from my previous comment, we decided to use PYTHONPYCACHEPREFIX=/tmp so that users can override that value if they wish. It is hard, if not impossible, for users to override PYTHONDONTWRITEBYTECODE once it has been set to any value.
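
For example, a user could override the default prefix at build time (the path here is illustrative):

pack build my-app --buildpack paketo-buildpacks/python --env PYTHONPYCACHEPREFIX=/some/other/dir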

@edmorley

edmorley commented Oct 27, 2022

If it's of any interest, another approach (that I'm using for the in-progress Heroku Python CNB) is to switch the pyc invalidation mode from its default of "timestamp" to one of the hash based modes ("checked-hash" or "unchecked-hash"). These modes are discussed in PEP-552, which is about deterministic pycs:
https://peps.python.org/pep-0552/

The advantage this has over not writing the pycs at all (which is the case when using PYTHONDONTWRITEBYTECODE) is that the app image will then boot faster, since it doesn't have to generate pycs at every boot.

To switch to hash based pycs, you would need to:

  1. Change the pycs in the python install itself (ideally when building Python, so it only has to be done once, rather than fixing up the files during every build)
  2. Ensure the pycs generated by pip use one of the hash invalidation modes.

For (1), see here for prior art:
https://github.com/heroku/heroku-buildpack-python/blob/7dc6bdec681d0a57fa8399ca99f48b127e65053e/builds/runtimes/python#L152-L175
(Note: That repo is for the classic buildpack, not CNB, but it's where the builds for both occur.)

For (2), there are two options:

  1. Use --no-compile with pip (to avoid wasting time generating timestamp-based pycs that would only be overwritten afterwards) and then, after pip install has completed, run compileall --invalidation-mode ... to generate the pycs with the chosen invalidation mode. One advantage of this approach is that you can pass --workers 0 to compileall and get it to run in multiple processes (pip, on the other hand, only ever runs the compileall step in a single process).
  2. Or, (and this may be appealing for cases where you can't control the args passed to pip, such as when it's wrapped by another package manager), ensure that the env var SOURCE_DATE_EPOCH is set when pip is run, which will cause compileall to use checked-hash mode automatically, per: https://docs.python.org/3.10/library/compileall.html?highlight=source_date_epoch#cmdoption-compileall-invalidation-mode
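
Roughly, as shell sketches (the site-packages path and the pinned epoch value are illustrative):

# option 1: skip pip's timestamp-based pycs, then compile with hash-based invalidation
pip install --no-compile -r requirements.txt
python -m compileall --invalidation-mode checked-hash --workers 0 /path/to/site-packages

# option 2: any non-empty SOURCE_DATE_EPOCH makes compileall choose checked-hash automatically
SOURCE_DATE_EPOCH=315532800 pip install -r requirements.txt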

@robdimsdale
Member

@edmorley very interesting, thank you for sharing! We were looking for something that would control the reproducibility of the pyc files, and we saw PEP 0552, but I think we missed this key paragraph (emphasis mine):

The compileall tool will be extended with a command new option, --invalidation-mode to generate hash-based pycs with and without the check_source bit set. --invalidation-mode will be a tristate option taking values timestamp (the default), checked-hash, and unchecked-hash corresponding to the values of PycInvalidationMode.

That would be really interesting to explore.

The main issue that I see with using SOURCE_DATE_EPOCH is that we want to respect this variable if the lifecycle has provided it. By default it is unset, causing images to have a build date of January 1, 1980. I suppose it's possible to check whether that environment variable is unset during the build phase and, if so, set it to the January 1, 1980 value. Either way, SOURCE_DATE_EPOCH would end up set to the correct value, which should result in reproducible caches.
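
For example, a defaulting sketch in shell (315532800 is 1980-01-01T00:00:00Z; the exact constant is our assumption, not something the lifecycle specifies):

# respect SOURCE_DATE_EPOCH if the lifecycle provided it; otherwise default to the 1980 epoch
: "${SOURCE_DATE_EPOCH:=315532800}"
export SOURCE_DATE_EPOCH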

Taking a step back, we opted for PYTHONPYCACHEPREFIX=/tmp during the build phase only, meaning that at runtime the python caches are created for the app and its dependencies. We didn't notice any performance difference during either build or run time for our sample apps. But perhaps our test cases aren't representative enough. Do you generally see a noticeable performance loss in buildpack-built apps (either at build or run time) when python caches are disabled? Could you point us to an example app that shows the performance issue?

Finally, I'd love to learn more about your plans for the Python buildpack on Heroku, specifically if there are any blockers to using this buildpack rather than writing your own? I understand if that's something you're not willing or able to share, but I would love to learn more about any blockers or issues you have with this buildpack in its current form.

@edmorley

edmorley commented Oct 31, 2022

The main issue that I see with using SOURCE_DATE_EPOCH is that we want to respect this variable if the lifecycle has provided it.

In my local WIP implementation, I set SOURCE_DATE_EPOCH only when calling pip, not as an --env passed to e.g. the entire pack build. Also, py_compile (used by compileall) doesn't actually use the value the SOURCE_DATE_EPOCH env var is set to - only whether it's set or unset:
https://github.com/python/cpython/blob/v3.11.0/Lib/py_compile.py#L72-L76
(Though it's possible other build processes by non-pure Python packages invoked during the pip install will use the value, so it's still worth ensuring it's not bogus)

We didn't notice any performance difference during either build or run time for our sample apps.

My testing, both locally and against non-CNB Heroku apps, showed that pycs do actually make quite a difference. Is it possible your testing was done using timestamp mode, which will always get invalidated at runtime when using CNBs, due to lifecycle's timestamp normalisation?

Here's a comparison on a Heroku 2 vCPU instance, for a non-CNB Hello World Django app running Python 3.11.0 (source):

$ h run -a getting-started-ci-python -s Performance-M bash -c 'time ./manage.py check'
Running bash -c "time ./manage.py check" on ⬢ getting-started-ci-python... up, run.4219 (Performance-M)
System check identified no issues (0 silenced).

real	0m0.414s
user	0m0.328s
sys	0m0.056s
$ h run -a getting-started-ci-python -s Performance-M bash -c 'find /app/.heroku/python/lib/ -depth -type f -name "*.pyc" -delete && time ./manage.py check'
Running bash -c "find /app/.heroku/python/lib/ -depth -type f -name \"*.pyc\" -delete && time ./manage.py check" on ⬢ getting-started-ci-python... up, run.9227 (Performance-M)
System check identified no issues (0 silenced).

real	0m1.451s
user	0m1.176s
sys	0m0.088s

Also, locally it makes a massive difference when running under QEMU with a non-ARM64 image (e.g. on an M1 Macbook) - given that (a) the upstream CNB project doesn't support multi-arch images very well yet, (b) even when they do, there will still be a delay before stack image/buildpack support is sufficient to run the native arch, and (c) even then, people may want to run the same image locally as they will run on their production AMD64 servers.

For example, running pip --version with Python 3.10 under Docker's QEMU with pycs populated takes ~1.3s, but jumps to ~5.3s when there are no pycs.

I'd love to learn more about your plans for the Python buildpack on Heroku, specifically if there are any blockers to using this buildpack rather than writing your own?

So I'm very excited that we now have a shared standard in the form of Cloud Native Buildpacks, and I'm sure there will be buildpacks that are commonly used/shared across platforms.

However, given how central the core language buildpacks are to both the user experience and reliability of builds, I don't think it would ever be viable for us to use anything but our own implementations of them, for reasons like stack compatibility, needing to be in control of uptime/security of binary hosting, needing to be in control of design/UX/new features/feature sunset/documentation links etc. For example, if a customer opens a support ticket about builds failing or needing a new feature, or a new Python security release not being available yet, or their app being broken by a buildpack change - our answer cannot be "sorry we don't own the component in question, there's nothing we can do".

I don't think this is a bad thing however - all of the CNB implementations can learn from each other (hence why I watch this repo and have commented above) - and end users will have more choices than they did with classic buildpacks :-)

@ryanmoran ryanmoran changed the title Assert that builds are reproducible Investigage that builds are reproducible Nov 7, 2022
@ryanmoran ryanmoran changed the title Investigage that builds are reproducible Investigate that builds are reproducible Nov 7, 2022
@robdimsdale
Member

Thank you for the detailed explanation! There's definitely a lot in there for us to wrap our heads around.

Hash-based pycs via SOURCE_DATE_EPOCH

At a high level, it sounds like setting SOURCE_DATE_EPOCH during the build could allow us to preserve the pycaches whilst also enabling reproducible builds. You've investigated it for pip and cpython, so we'd want to investigate it for pipenv and miniconda too. Although Frankie investigated conda and pipenv (see comments above), it's not currently clear to me whether they will both switch to / respect hash-based pycs. I think it will be easier to try this than to theorize about it, so we will explore it.

Testing / Performance

Your testing is far more rigorous than anything we've done so far 😄

It's very possible that we were running in timestamp mode. Also, our testing looked more holistically at things like total time to run integration tests, and time to run docker run on a sample python app built via pack build. The former has much more complexity and is therefore likely to introduce noise, which could further reduce the validity of our testing. The latter might just be too simple to give meaningful data.

Running under QEMU

Yeah, I basically have to run all buildpack development on a Linux x86_64 VM because QEMU on the M1 Macbook is too slow. For example, trying to compile python 3.11 with --enable-optimizations takes over 30 minutes on my M1 Macbook but about 5 minutes on the VM.

As a result, I haven't really noticed the pycache-specific issues under QEMU, but I appreciate you calling them out.

Community / Contributions / Direction

I completely understand the desire for Heroku to maintain its own set of buildpacks rather than using Paketo. Even on this thread it's clear that there are areas where Paketo and Heroku have different priorities - Paketo is currently placing a large emphasis on reproducibility (and supply-chain security/provenance more generally) which I can imagine is less of an issue for Heroku, which instead wants to prioritize performance - especially at scale.

That being said, performance isn't an anti-goal of the Paketo project, and I would hope that over time the feature sets of Paketo and Heroku will overlap more.

Your feedback on this OSS project has been very helpful and is always welcome. I would welcome any contributions you are willing to make, too 😉

Additionally, if you would like a more synchronous mechanism of communication, you can join us in the Paketo Slack instance: https://slack.paketo.io/
