The cause of reproducibility problem of bazel build in jaxlib #321920

Open
UlyssesZh opened this issue Jun 23, 2024 · 11 comments
Labels
0.kind: bug (Something is broken)
5. scope: tracking (Long-lived issue tracking long-term fixes or multiple sub-problems)

Comments

@UlyssesZh
Member

A discussion issue for the Bazel reproducibility problem that causes build failures in the jaxlib package.

I created #296737 when I first noticed this problem. After it was fixed (by #291705), another person reopened it because the problem occurred again. As per #321559 (review), I am opening a new issue to discuss the possible causes. Hopefully neither the original issue nor a similar one will need to be opened again.

CC: @ndl @samuela @natsukium @GaetanLepage


Add a 👍 reaction to issues you find important.

@UlyssesZh added the "0.kind: bug" (Something is broken) label on Jun 23, 2024
@lromor
Contributor

lromor commented Jun 23, 2024

For #296737, when it first happened, @ConnorBaker and I managed to reproduce it and found that the hash difference was due to:

Files this/external/go_sdk/versions.json and other/external/go_sdk/versions.json differ

This was the diff in the tar file:

339C 1240: 7B 0A 20 20 20 20 22 66  69 6C 65 6E 61 6D 65 22  {.    "f ilename"                                                           
339C 1250: 3A 20 22 67 6F 31 2E 32  32 2E 31 2E 73 72 63 2E  : "go1.2 2.1.src.                                                           
339C 1260: 74 61 72 2E 67 7A 22 2C  0A 20 20 20 20 22 6F 73  tar.gz", .    "os                                                           
339C 1270: 22 3A 20 22 22 2C 0A 20  20 20 20 22 61 72 63 68  ": "",.     "arch                                                           
339C 1280: 22 3A 20 22 22 2C 0A 20  20 20 20 22 76 65 72 73  ": "",.     "vers                                                           
339C 1290: 69 6F 6E 22 3A 20 22 67  6F 31 2E 32 32 2E 31 22  ion": "g o1.22.1"

vs

339C 1250: 3A 20 22 67 6F 31 2E 32  32 2E 30 2E 73 72 63 2E  : "go1.2 2.0.src.                                                           
339C 1260: 74 61 72 2E 67 7A 22 2C  0A 20 20 20 20 22 6F 73  tar.gz", .    "os                                                           
339C 1270: 22 3A 20 22 22 2C 0A 20  20 20 20 22 61 72 63 68  ": "",.     "arch                                                           
339C 1280: 22 3A 20 22 22 2C 0A 20  20 20 20 22 76 65 72 73  ": "",.     "vers                                                           
339C 1290: 69 6F 6E 22 3A 20 22 67  6F 31 2E 32 32 2E 30 22  ion": "g o1.22.0"                                                           
339C 12A0: 2C 0A 20 20 20 20 22 73  68 61 32 35 36 22 3A 20  ,.    "s ha256":  
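
A comparison like the one that produced the "differ" line above can be sketched in Python. This is only an illustration, assuming the two fetched dependency tarballs have already been unpacked into two directories; the function names `file_digests` and `differing_files` are hypothetical and not part of nixpkgs or Bazel.

```python
import hashlib
import os


def file_digests(root):
    """Map each file path (relative to root) to a SHA-256 digest of that entry."""
    digests = {}
    # os.walk does not follow directory symlinks, so those are not compared here.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            if os.path.islink(path):
                # Hash the link target rather than following the link.
                payload = os.readlink(path).encode()
            else:
                with open(path, "rb") as handle:
                    payload = handle.read()
            digests[rel] = hashlib.sha256(payload).hexdigest()
    return digests


def differing_files(this_root, other_root):
    """Return sorted relative paths that differ or exist on one side only."""
    this_digests = file_digests(this_root)
    other_digests = file_digests(other_root)
    all_paths = this_digests.keys() | other_digests.keys()
    return sorted(
        p for p in all_paths if this_digests.get(p) != other_digests.get(p)
    )
```

Calling `differing_files("this", "other")` on the two unpacked trees would list every path whose contents differ, which narrows a hash mismatch down to individual files before reaching for a hex dump.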

That file is fetched from an endpoint via: https://github.com/bazelbuild/rules_go/blob/master/go/private/sdk.bzl#L86-L87
I'm not sure why that file would change, but maybe the endpoint does not serve an identical list of SDK versions everywhere at the same time. A solution could be to store a fixed version of that file so that the fetch becomes reproducible.
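
To make the "fixed version" idea concrete, here is a minimal Python sketch that compares a stored copy of the listing against a freshly fetched one and reports which versions changed. It assumes the file is a JSON array of objects that each carry a top-level "version" string; that layout is an assumption for illustration, and the helper names are hypothetical.

```python
import hashlib
import json


def sha256_hex(data):
    """Hex SHA-256 of raw bytes, for comparing a pinned copy with a fetched one."""
    return hashlib.sha256(data).hexdigest()


def release_versions(data):
    """Collect "version" strings from a JSON array of objects (assumed layout)."""
    return {entry["version"] for entry in json.loads(data)}


def version_drift(pinned, fetched):
    """Return (added, removed) version names in fetched relative to pinned."""
    pinned_versions = release_versions(pinned)
    fetched_versions = release_versions(fetched)
    added = sorted(fetched_versions - pinned_versions)
    removed = sorted(pinned_versions - fetched_versions)
    return added, removed


# Made-up sample data mirroring the go1.22.0 vs go1.22.1 difference above.
pinned_copy = b'[{"version": "go1.22.0"}]'
fetched_copy = b'[{"version": "go1.22.1"}]'

if sha256_hex(pinned_copy) != sha256_hex(fetched_copy):
    added, removed = version_drift(pinned_copy, fetched_copy)
    print("upstream listing drifted; added:", added, "removed:", removed)
```

A build that reads only the pinned copy is unaffected by such drift; the comparison merely shows when, and how, the upstream listing has moved on.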

@SomeoneSerge
Contributor

> That file is fetched from an endpoint via: https://github.com/bazelbuild/rules_go/blob/master/go/private/sdk.bzl#L86-L87
> I'm not sure why that file would change, but maybe the endpoint does not serve an identical list of SDK versions everywhere at the same time. A solution could be to store a fixed version of that file so that the fetch becomes reproducible.

Do we know how exactly jaxlib depends on rules_go? It must be a deep transitive dependency...

@lromor
Contributor

lromor commented Jun 30, 2024

SomeoneSerge added a commit to SomeoneSerge/nixpkgs that referenced this issue Jul 1, 2024
Bazel, in hands of Google, can't even fetch reproducibly.
Cf. NixOS#321920 (comment)
@lromor
Contributor

lromor commented Jul 1, 2024

Actually, my bad, those dependencies are fetched only by the rules_python internal deps so they shouldn't be pulled.
I think this is to blame: https://github.com/grpc/grpc/blob/master/bazel/grpc_extra_deps.bzl#L24
grpc, in turn, gets pulled in by at least: https://github.com/openxla/xla/blob/main/workspace0.bzl#L52
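
One way to check which fetched repositories reference rules_go is to scan the unpacked dependency tree for its repository name. The Python sketch below is a rough illustration: it assumes rules_go is referred to as `io_bazel_rules_go` in the Starlark files, and `find_references` is a hypothetical helper, not an existing tool.

```python
import os


def find_references(root, needle="io_bazel_rules_go",
                    suffixes=(".bzl", ".bazel", "WORKSPACE")):
    """Return sorted (relative path, line number) pairs where needle occurs."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(suffixes):
                continue
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                # Skip broken symlinks and other non-regular entries.
                continue
            with open(path, encoding="utf-8", errors="replace") as handle:
                for lineno, line in enumerate(handle, start=1):
                    if needle in line:
                        hits.append((os.path.relpath(path, root), lineno))
    return sorted(hits)
```

Running `find_references("external")` on the unpacked tree would list the files and lines that load from rules_go, which helps trace the transitive path from jaxlib down to the Go SDK download.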

@nixos-discourse

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/nix-buildproxy-reproducible-http-https-responder-in-sandboxed-nix-builds/40081/18

@tomodachi94 added the "5. scope: tracking" (Long-lived issue tracking long-term fixes or multiple sub-problems) label on Oct 28, 2024
@SpidFightFR

Any news on that?

@GaetanLepage
Contributor

> Any news on that?

Is it currently failing?

The plan anyway is to update the jax package to the latest version.
Unfortunately, they have changed a lot of things and I have not managed to account for them yet.
I don't know if the update will change anything related to this specific issue though.

draft PR: #318995

@SpidFightFR

> Is it currently failing?

Yep, we currently have a hash mismatch in 24.11 or 24.05.

@SomeoneSerge
Contributor

SomeoneSerge commented Oct 29, 2024

> Yep, we currently have a hash mismatch in 24.11 or 24.05.

Oh! I didn't backport #323681.

> in 24.11

You mean unstable? It seems to work on Hydra:

https://hydra.nix-community.org/job/nixpkgs/cuda/python3Packages.jaxlib.x86_64-linux
https://hydra.nixos.org/job/nixpkgs/trunk/python312Packages.jaxlib.x86_64-linux

@SpidFightFR

> > Yep, we currently have a hash mismatch in 24.11 or 24.05.
>
> Oh! I didn't backport #323681.
>
> > in 24.11
>
> You mean unstable? It seems to work on Hydra:
>
> https://hydra.nix-community.org/job/nixpkgs/cuda/python3Packages.jaxlib.x86_64-linux
> https://hydra.nixos.org/job/nixpkgs/trunk/python312Packages.jaxlib.x86_64-linux

Indeed, Hydra reports no error.
If I have some free time, I'll try another build on unstable and let you know when I have the results.

@UlyssesZh
Member Author

Has the issue not recurred since #323681 was merged (though the root cause still seems unclear)?

7 participants