{2023.06}[foss/2023a] TensorFlow v2.15.1 w/ CUDA 12.1.1 #717

Draft: wants to merge 2 commits into base: 2023.06-software.eessi.io

casparvl
Collaborator

No description provided.

eessi-bot bot commented Sep 18, 2024

Instance eessi-bot-mc-aws is configured to build for:

  • architectures: x86_64/generic, x86_64/intel/haswell, x86_64/intel/skylake_avx512, x86_64/amd/zen2, x86_64/amd/zen3, aarch64/generic, aarch64/neoverse_n1, aarch64/neoverse_v1
  • repositories: eessi-hpc.org-2023.06-compat, eessi-hpc.org-2023.06-software, eessi.io-2023.06-software, eessi.io-2023.06-compat

eessi-bot bot commented Sep 18, 2024

Instance eessi-bot-mc-azure is configured to build for:

  • architectures: x86_64/amd/zen4
  • repositories: eessi-hpc.org-2023.06-software, eessi-hpc.org-2023.06-compat, eessi.io-2023.06-software, eessi.io-2023.06-compat

Instance boegel-bot-deucalion is configured to build for:

  • architectures: aarch64/a64fx
  • repositories: eessi.io-2023.06-software

@boegel added the 2023.06-software.eessi.io label (2023.06 version of software.eessi.io) on Sep 25, 2024
@casparvl
Collaborator Author

bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80

eessi-bot bot commented Sep 27, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

eessi-bot bot commented Sep 27, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted

Updates by the bot instance boegel-bot-deucalion (click for details)
  • account casparvl has NO permission to send commands to the bot
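
The bot output above shows a short command form (`repo:`, `arch:`, `accel:`) next to its expanded form (`repository:`, `architecture:`, `accelerator:`). A minimal sketch of that expansion follows, assuming only the three key mappings visible in this thread. The function name `expand_bot_command` is hypothetical, and this is not the bot's actual parser.

```python
# Illustrative only: these key mappings are the ones visible in the bot
# output in this thread; the real bot may accept more keys.
SHORT_TO_LONG = {
    "repo": "repository",
    "arch": "architecture",
    "accel": "accelerator",
}


def expand_bot_command(command: str) -> str:
    """Expand short argument keys in a 'build key:value ...' command."""
    parts = command.split()
    if not parts:
        return ""
    action, args = parts[0], parts[1:]
    expanded = [action]
    for arg in args:
        key, sep, value = arg.partition(":")
        if not sep:
            # Not a key:value argument; pass it through unchanged.
            expanded.append(arg)
            continue
        expanded.append(f"{SHORT_TO_LONG.get(key, key)}:{value}")
    return " ".join(expanded)


if __name__ == "__main__":
    print(expand_bot_command(
        "build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80"
    ))
```

Unknown keys are passed through unchanged, so the sketch does not reject arguments it has no mapping for.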

eessi-bot bot commented Sep 27, 2024

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.09/pr_717/20153

date | job status | comment
Sep 27 08:31:27 UTC 2024 | submitted | job id 20153 awaits release by job manager
Sep 27 08:31:37 UTC 2024 | released | job awaits launch by Slurm scheduler
Sep 27 08:36:39 UTC 2024 | running | job 20153 is running
Sep 27 16:44:00 UTC 2024 | finished |
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-20153.out
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1727450321.tar.gz
size: 1466 MiB (1537246599 bytes)
entries: 395
modules under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/modules/all
Bazel/6.1.0-GCCcore-12.3.0.lua
cuDNN/8.9.2.26-CUDA-12.1.1.lua
ml_dtypes/0.3.2-gfbf-2023a.lua
software under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/software
Bazel/6.1.0-GCCcore-12.3.0
cuDNN/8.9.2.26-CUDA-12.1.1
ml_dtypes/0.3.2-gfbf-2023a
other under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80
no other files in tarball
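
The artefact name above appears to encode the EESSI version, the tarball type, the OS, the CPU target, and a numeric suffix. The sketch below splits a name of that shape into its parts. The pattern is inferred from this single file name, and treating the numeric suffix as Unix epoch seconds is an assumption; `parse_tarball_name` is a hypothetical helper, not part of any EESSI tool.

```python
import re
from datetime import datetime, timezone

# Pattern inferred from the one artefact name in this thread (an assumption).
TARBALL_RE = re.compile(
    r"^eessi-(?P<version>[\d.]+)-(?P<kind>\w+)-(?P<os>\w+)-"
    r"(?P<arch>.+)-(?P<timestamp>\d+)\.tar\.gz$"
)


def parse_tarball_name(name: str) -> dict:
    """Split an artefact file name into its components."""
    match = TARBALL_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected tarball name: {name!r}")
    info = match.groupdict()
    # Assumption: the numeric suffix is seconds since the Unix epoch.
    info["created_utc"] = datetime.fromtimestamp(
        int(info["timestamp"]), tz=timezone.utc
    )
    return info


if __name__ == "__main__":
    parsed = parse_tarball_name(
        "eessi-2023.06-software-linux-x86_64-amd-zen2-1727450321.tar.gz"
    )
    for key, value in parsed.items():
        print(f"{key}: {value}")
```

Under the epoch assumption, the suffix decodes to a time on Sep 27, 2024 that falls between the job's start and finish times reported above.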
Sep 27 16:44:00 UTC 2024 | test result |
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 9/9 test case(s) from 9 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-20153.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
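
The build checklist in this report reads as a set of strings that must be absent from or present in the job output. A minimal sketch of such a check follows, assuming plain substring matching on the five build-stage messages listed above. `BUILD_CHECKS` and `check_build_log` are hypothetical names, and the bot's real implementation may differ.

```python
from typing import List, Tuple

# (message, must_be_present) pairs mirroring the build checklist above.
BUILD_CHECKS: List[Tuple[str, bool]] = [
    ("ERROR: ", False),
    ("FAILED: ", False),
    ("required modules missing:", False),
    ("No missing installations", True),
    (".tar.gz created!", True),
]


def check_build_log(log_text: str) -> Tuple[bool, List[Tuple[str, bool]]]:
    """Return (overall_ok, per-check results) for a job output log."""
    results = []
    for message, must_be_present in BUILD_CHECKS:
        found = message in log_text
        # A check passes when presence matches the expectation.
        results.append((message, found == must_be_present))
    overall_ok = all(ok for _, ok in results)
    return overall_ok, results


if __name__ == "__main__":
    sample = "ERROR: something broke\nfoo.tar.gz created!\n"
    ok, details = check_build_log(sample)
    print("SUCCESS" if ok else "FAILURE")
    for message, passed in details:
        print(("ok   " if passed else "fail ") + message)
```

Requiring every check to pass matches what the report shows: the tarball was created, yet the build stage is still marked as a failure because other checks did not pass.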

@casparvl
Collaborator Author

Let's see how this goes. Note that we need a proper cuDNN deployment that strips the necessary files first... So we will need to rebuild in any case.

Contributor

@boegel left a comment

@casparvl Please use an easystack file under /accel/nvidia

@boegel marked this pull request as draft September 27, 2024 08:48
@boegel
Contributor

boegel commented Sep 27, 2024

> Let's see how this goes. Note that we need a proper cuDNN deployment that strips the necessary files first... So we will need to rebuild in any case.

I've marked this as a draft; we definitely don't want to deploy with a full cuDNN installation.

@bedroge
Collaborator

bedroge commented Oct 1, 2024

The build succeeded, but many tests failed due to:

ImportError: libnccl.so.2: cannot open shared object file: No such file or directory

This is already available in the CPU-only stack, so I'm not sure why it didn't pick up the library from that module.
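
One way to narrow down an ImportError like this is to test, in the same environment the tests run in, whether the dynamic loader can resolve the library at all. The sketch below is only a diagnostic aid under stated assumptions: it tries a plain `ctypes.CDLL` load and lists matches on `LD_LIBRARY_PATH`, and it does not inspect any RPATH or RUNPATH entries in the TensorFlow extension modules. The helper names are hypothetical.

```python
import ctypes
import os
from typing import List


def can_load(libname: str) -> bool:
    """Return True if the dynamic loader can open the given library."""
    try:
        ctypes.CDLL(libname)
    except OSError:
        return False
    return True


def find_in_ld_library_path(libname: str) -> List[str]:
    """List files named libname in the LD_LIBRARY_PATH directories."""
    dirs = os.environ.get("LD_LIBRARY_PATH", "").split(os.pathsep)
    hits = []
    for directory in dirs:
        if not directory:
            continue
        candidate = os.path.join(directory, libname)
        if os.path.isfile(candidate):
            hits.append(candidate)
    return hits


if __name__ == "__main__":
    name = "libnccl.so.2"
    print(f"{name} loadable: {can_load(name)}")
    for path in find_in_ld_library_path(name):
        print(f"found on LD_LIBRARY_PATH: {path}")
```

Running this with and without the relevant modules loaded would show whether the library is reachable through the environment at all, which is separate from whether the TensorFlow build itself records a path to it.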

@trz42
Collaborator

trz42 commented Oct 29, 2024

> The build succeeded, but many tests failed due to:
>
> ImportError: libnccl.so.2: cannot open shared object file: No such file or directory
>
> This is already available in the CPU-only stack, so I'm not sure why it didn't pick up the library from that module.

I just opened easybuilders/easybuild-easyblocks#3497, which may fix the libnccl.so.2 error.

Labels
2023.06-software.eessi.io (2023.06 version of software.eessi.io), accel:nvidia
4 participants