{2023.06}[foss/2023a] CUDA 12.1.1 (rebuild) + limit CUDA hook to EESSI installs only, and remove duplication when creating symlinks #735

Merged on Sep 25, 2024 (13 commits)

ocaisa (Member) commented Sep 24, 2024

Also includes #720 to allow for testing

eessi-bot bot commented Sep 24, 2024

Instance eessi-bot-mc-aws is configured to build for:

  • architectures: x86_64/generic, x86_64/intel/haswell, x86_64/intel/skylake_avx512, x86_64/amd/zen2, x86_64/amd/zen3, aarch64/generic, aarch64/neoverse_n1, aarch64/neoverse_v1
  • repositories: eessi.io-2023.06-compat, eessi-hpc.org-2023.06-software, eessi-hpc.org-2023.06-compat, eessi.io-2023.06-software

eessi-bot bot commented Sep 24, 2024

Instance eessi-bot-mc-azure is configured to build for:

  • architectures: x86_64/amd/zen4
  • repositories: eessi-hpc.org-2023.06-software, eessi-hpc.org-2023.06-compat, eessi.io-2023.06-software, eessi.io-2023.06-compat
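
Each bot instance appears to act only on build commands that match its configured architectures and repositories. A minimal sketch of that matching, using the configuration reported above; the `INSTANCES` table and the `instances_for` helper are hypothetical names for illustration, not the bot's actual code:

```python
# Configuration as reported by the two bot instances in this thread.
REPOSITORIES = {
    "eessi.io-2023.06-compat",
    "eessi.io-2023.06-software",
    "eessi-hpc.org-2023.06-compat",
    "eessi-hpc.org-2023.06-software",
}

INSTANCES = {
    "eessi-bot-mc-aws": {
        "architectures": {
            "x86_64/generic", "x86_64/intel/haswell", "x86_64/intel/skylake_avx512",
            "x86_64/amd/zen2", "x86_64/amd/zen3",
            "aarch64/generic", "aarch64/neoverse_n1", "aarch64/neoverse_v1",
        },
        "repositories": REPOSITORIES,
    },
    "eessi-bot-mc-azure": {
        "architectures": {"x86_64/amd/zen4"},
        "repositories": REPOSITORIES,
    },
}


def instances_for(arch, repo):
    """Return the (sorted) names of instances configured for this arch and repo."""
    return sorted(
        name
        for name, cfg in INSTANCES.items()
        if arch in cfg["architectures"] and repo in cfg["repositories"]
    )
```

Under this assumption, a build request for `x86_64/amd/zen2` is picked up by `eessi-bot-mc-aws` only, and `eessi-bot-mc-azure` has nothing to submit.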

ocaisa (Member, Author) commented Sep 24, 2024

bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80

eessi-bot bot commented Sep 24, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)

eessi-bot bot commented Sep 24, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from ocaisa

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted
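
The bot reports an "expanded format" in which the short filter keys `repo`, `arch` and `accel` become `repository`, `architecture` and `accelerator`. A rough sketch of that expansion, assuming a simple whitespace-separated `key:value` syntax; `ABBREVIATIONS` and `expand_command` are hypothetical names, and the bot's real parser may differ:

```python
# Short filter keys and their long forms, as seen in the bot's "expanded format" line.
ABBREVIATIONS = {"repo": "repository", "arch": "architecture", "accel": "accelerator"}


def expand_command(comment):
    """Expand 'bot: build repo:X arch:Y accel:Z' into the long-key form."""
    prefix = "bot:"
    if not comment.startswith(prefix):
        raise ValueError("not a bot command")
    action, *filters = comment[len(prefix):].split()
    expanded = [action]
    for item in filters:
        # Split only at the first colon; values such as 'nvidia/cc80' contain no colon.
        key, _, value = item.partition(":")
        expanded.append(f"{ABBREVIATIONS.get(key, key)}:{value}")
    return " ".join(expanded)
```

Applied to the command in this comment, the function reproduces the expanded string shown by the bot.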

eessi-bot bot commented Sep 24, 2024

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.09/pr_735/19693

date | job status | comment
Sep 24 10:56:04 UTC 2024 | submitted | job id 19693 awaits release by job manager
Sep 24 10:56:43 UTC 2024 | released | job awaits launch by Slurm scheduler
Sep 24 11:01:46 UTC 2024 | running | job 19693 is running
Sep 24 11:40:24 UTC 2024 | finished | 😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-19693.out
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1727176631.tar.gz
size: 2067 MiB (2167685151 bytes)
entries: 5519
modules under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/modules/all
CUDA/12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/software
CUDA/12.1.1
other under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80
2023.06/init/easybuild/eb_hooks.py
Sep 24 11:40:24 UTC 2024 | test result | 😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 9/9 test case(s) from 9 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-19693.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
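
In the test-result check, the brackets around `FAILED` presumably have to be literal brackets rather than a character class; the backslashes may simply have been lost when the comment was rendered. A small sketch of the difference, assuming the pattern is applied as a Python-style regular expression search (this thread does not show how the bot actually applies it); the `[ FAILED ]` line is a made-up example:

```python
import re

# Pattern with unescaped brackets: '[...]' is a character class here.
UNESCAPED = r"[\s*FAILED\s*].*Ran .* test case"
# Pattern with escaped brackets: matches a literal '[ FAILED ]' marker.
ESCAPED = r"\[\s*FAILED\s*\].*Ran .* test case"

# Summary line from this job, and a hypothetical failing counterpart.
passed_line = "[ PASSED ] Ran 9/9 test case(s) from 9 check(s) (0 failure(s), 0 skipped, 0 aborted)"
failed_line = "[ FAILED ] Ran 9/9 test case(s) from 9 check(s) (1 failure(s), 0 skipped, 0 aborted)"


def matches(pattern, line):
    """Return True if the pattern is found anywhere in the line."""
    return re.search(pattern, line) is not None
```

With unescaped brackets, a single space is enough to satisfy the character class, so the pattern would also match the `[ PASSED ]` summary; with escaped brackets only the failing line matches.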

[Resolved review thread on eb_hooks.py (outdated)]
ocaisa (Member, Author) commented Sep 24, 2024

@casparvl I've checked this build: the symlinks do exist, and they point to the location under host_injections without the accel/nvidia/cc80 component.
The one open question is whether this fully addresses the issue you saw with the EESSI EasyBuild hooks when installing CUDA.
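
The expected symlink target can be illustrated with a short sketch. It assumes the installation lives under a `versions/` tree and that the host-injections tree mirrors it under `host_injections/`; the `/cvmfs/software.eessi.io` prefix, the `versions` directory name and the `host_injections_target` helper are assumptions for illustration, not code taken from `eb_hooks.py`:

```python
import re


def host_injections_target(install_path):
    """Map an accelerator-specific install path to its host_injections counterpart."""
    # Assumed layout: swap the 'versions' tree for the 'host_injections' tree.
    path = install_path.replace("/versions/", "/host_injections/", 1)
    # Drop the 'accel/<vendor>/<capability>' component, e.g. 'accel/nvidia/cc80'.
    return re.sub(r"/accel/[^/]+/[^/]+", "", path, count=1)


install = (
    "/cvmfs/software.eessi.io/versions/2023.06/software/linux/"
    "x86_64/amd/zen2/accel/nvidia/cc80/software/CUDA/12.1.1"
)
target = host_injections_target(install)
```

Here `target` no longer contains `accel/nvidia/cc80`, which is the property checked in the comment above.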

[Resolved review thread on eb_hooks.py (outdated)]
ocaisa (Member, Author) commented Sep 24, 2024

bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80

eessi-bot bot commented Sep 24, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)

eessi-bot bot commented Sep 24, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from ocaisa

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted

eessi-bot bot commented Sep 24, 2024

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.09/pr_735/19697

date | job status | comment
Sep 24 14:05:27 UTC 2024 | submitted | job id 19697 awaits release by job manager
Sep 24 14:06:13 UTC 2024 | released | job awaits launch by Slurm scheduler
Sep 24 14:11:29 UTC 2024 | finished | 🤷 UNKNOWN (click triangle for detailed information)
  • Job results file _bot_job19697.result does not exist in job directory or reading it failed.
  • No artefacts were found/reported.
Sep 24 14:11:29 UTC 2024 | test result | 🤷 UNKNOWN (click triangle for detailed information)
  • Job test file _bot_job19697.test does not exist in job directory or reading it failed.

Manually cancelled
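
The UNKNOWN status follows from the missing result file: a cancelled job never writes `_bot_job19697.result`, so there is nothing to evaluate. A minimal sketch of that fallback, assuming the file name pattern shown in the bot's message; the helper names are hypothetical, and the format of the result file is not shown in this thread, so the sketch does not try to parse it:

```python
from pathlib import Path


def read_job_result(job_dir, job_id):
    """Return the raw text of the job result file, or None if it cannot be read."""
    result_file = Path(job_dir) / f"_bot_job{job_id}.result"
    try:
        return result_file.read_text()
    except OSError:
        # Covers both a missing file and a failed read.
        return None


def job_status(job_dir, job_id):
    """Report UNKNOWN when no result file is available (e.g. a cancelled job)."""
    if read_job_result(job_dir, job_id) is None:
        return "UNKNOWN"
    return "RESULT AVAILABLE"
```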

ocaisa (Member, Author) commented Sep 24, 2024

bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80

eessi-bot bot commented Sep 24, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)

eessi-bot bot commented Sep 24, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from ocaisa

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted

eessi-bot bot commented Sep 24, 2024

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.09/pr_735/19699

date | job status | comment
Sep 24 14:11:01 UTC 2024 | submitted | job id 19699 awaits release by job manager
Sep 24 14:11:26 UTC 2024 | released | job awaits launch by Slurm scheduler
Sep 24 14:12:33 UTC 2024 | running | job 19699 is running
Sep 24 14:53:47 UTC 2024 | finished | 😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-19699.out
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1727188209.tar.gz
size: 2067 MiB (2167693735 bytes)
entries: 5519
modules under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/modules/all
CUDA/12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/software
CUDA/12.1.1
other under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80
2023.06/init/easybuild/eb_hooks.py
Sep 24 14:53:47 UTC 2024 | test result | 😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 9/9 test case(s) from 9 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-19699.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
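
The artefact name and size reported for this job can be unpacked with a short sketch. It assumes the trailing number in the tarball name is a Unix timestamp (not stated in this thread, although it decodes to a time inside this job's run window) and that the MiB figure is the byte count divided by 2^20 and rounded down; `TARBALL_RE` and `describe_tarball` are hypothetical names:

```python
import re
from datetime import datetime, timezone

# eessi-<version>-software-<os>-<cpu arch with dashes>-<number>.tar.gz
TARBALL_RE = re.compile(
    r"eessi-(?P<version>[\d.]+)-software-(?P<os>[a-z]+)-(?P<arch>.+)-(?P<stamp>\d+)\.tar\.gz"
)


def describe_tarball(name, size_bytes):
    """Split a tarball name into its parts and convert the size to whole MiB."""
    match = TARBALL_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"unexpected tarball name: {name}")
    return {
        "version": match["version"],
        "os": match["os"],
        "arch": match["arch"],
        # Assumption: the trailing number is seconds since the Unix epoch.
        "created": datetime.fromtimestamp(int(match["stamp"]), tz=timezone.utc),
        "size_mib": size_bytes // 2**20,
    }


info = describe_tarball(
    "eessi-2023.06-software-linux-x86_64-amd-zen2-1727188209.tar.gz", 2167693735
)
```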

[Resolved review thread on eb_hooks.py (outdated)]
boegel added the 2023.06-software.eessi.io label (2023.06 version of software.eessi.io) on Sep 25, 2024
ocaisa (Member, Author) commented Sep 25, 2024

bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80

eessi-bot bot commented Sep 25, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)

Updates by the bot instance boegel-bot-deucalion (click for details)
  • account ocaisa has NO permission to send commands to the bot

eessi-bot bot commented Sep 25, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from ocaisa

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted

eessi-bot bot commented Sep 25, 2024

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.09/pr_735/19834

date | job status | comment
Sep 25 08:58:13 UTC 2024 | submitted | job id 19834 awaits release by job manager
Sep 25 08:58:32 UTC 2024 | released | job awaits launch by Slurm scheduler
Sep 25 08:59:35 UTC 2024 | running | job 19834 is running
Sep 25 09:38:55 UTC 2024 | finished | 😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-19834.out
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1727255692.tar.gz
size: 2067 MiB (2167690890 bytes)
entries: 5519
modules under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/modules/all
CUDA/12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/software
CUDA/12.1.1
other under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80
2023.06/init/easybuild/eb_hooks.py
Sep 25 09:38:55 UTC 2024 | test result | 😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 9/9 test case(s) from 9 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-19834.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
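
The status timestamps for job 19834 make it easy to see where the time goes. A small sketch that computes the time spent between consecutive statuses; the `phase_durations` helper is a hypothetical name, and the timestamps are copied from the status table above:

```python
from datetime import datetime

# Timestamp layout used in the bot's status table, e.g. "Sep 25 08:58:13 UTC 2024".
STAMP_FORMAT = "%b %d %H:%M:%S UTC %Y"

events = [
    ("Sep 25 08:58:13 UTC 2024", "submitted"),
    ("Sep 25 08:58:32 UTC 2024", "released"),
    ("Sep 25 08:59:35 UTC 2024", "running"),
    ("Sep 25 09:38:55 UTC 2024", "finished"),
]


def phase_durations(events):
    """Return a dict mapping 'status->next status' to the elapsed timedelta."""
    times = [(status, datetime.strptime(stamp, STAMP_FORMAT)) for stamp, status in events]
    return {
        f"{a}->{b}": later - earlier
        for (a, earlier), (b, later) in zip(times, times[1:])
    }


durations = phase_durations(events)
```

For this job, almost all of the elapsed time falls between `running` and `finished` (39 minutes 20 seconds); queueing takes under two minutes.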

boegel previously requested changes on Sep 25, 2024
[Resolved review thread on eb_hooks.py (outdated)]
[Resolved review thread on eb_hooks.py]
[Resolved review thread on eb_hooks.py (outdated)]
boegel changed the title from "Limit CUDA hook to EESSI installs only, and remove duplication when creating symlinks" to "{2023.06}[foss/2023a] CUDA 12.1.1 (rebuild) + limit CUDA hook to EESSI installs only, and remove duplication when creating symlinks" on Sep 25, 2024
ocaisa (Member, Author) commented Sep 25, 2024

bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80

eessi-bot bot commented Sep 25, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)

Updates by the bot instance boegel-bot-deucalion (click for details)
  • account ocaisa has NO permission to send commands to the bot

eessi-bot bot commented Sep 25, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from ocaisa

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted

eessi-bot bot commented Sep 25, 2024

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.09/pr_735/19837

date | job status | comment
Sep 25 09:49:20 UTC 2024 | submitted | job id 19837 awaits release by job manager
Sep 25 09:50:23 UTC 2024 | released | job awaits launch by Slurm scheduler
Sep 25 09:51:28 UTC 2024 | running | job 19837 is running
Sep 25 10:30:23 UTC 2024 | finished | 😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-19837.out
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1727258780.tar.gz
size: 2067 MiB (2167689551 bytes)
entries: 5519
modules under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/modules/all
CUDA/12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/software
CUDA/12.1.1
other under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80
2023.06/init/easybuild/eb_hooks.py
Sep 25 10:30:23 UTC 2024 | test result | 😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 9/9 test case(s) from 9 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-19837.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
Sep 25 11:01:27 UTC 2024 | uploaded | transfer of eessi-2023.06-software-linux-x86_64-amd-zen2-1727258780.tar.gz to S3 bucket succeeded
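
The build verdict in the bot's report comes down to a handful of message checks on the job output: some messages must be absent and others must be present. A rough sketch of that logic, assuming plain substring matching (the bot may use regular expressions instead); `FORBIDDEN`, `REQUIRED` and both helper functions are hypothetical names, and the sample outputs are made up:

```python
# Messages that must NOT appear, and messages that MUST appear, as listed in the report.
FORBIDDEN = ["ERROR:", "FAILED:", "required modules missing:"]
REQUIRED = ["No missing installations", ".tar.gz created!"]


def check_build_output(output):
    """Return (description, passed) pairs for each check on the job output text."""
    results = [(f"no message matching {msg}", msg not in output) for msg in FORBIDDEN]
    results += [(f"found message matching {msg}", msg in output) for msg in REQUIRED]
    return results


def build_succeeded(output):
    """A build counts as successful only if every individual check passes."""
    return all(passed for _, passed in check_build_output(output))


# Made-up job outputs for illustration only.
good_output = "No missing installations, party time!\n/tmp/example.tar.gz created!\n"
bad_output = "ERROR: something went wrong\n"
```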

ocaisa (Member, Author) commented Sep 25, 2024

bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3 accel:nvidia/cc80

eessi-bot bot commented Sep 25, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)

Updates by the bot instance boegel-bot-deucalion (click for details)
  • account ocaisa has NO permission to send commands to the bot

eessi-bot bot commented Sep 25, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3 accel:nvidia/cc80 from ocaisa

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3 accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted

eessi-bot bot commented Sep 25, 2024

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen3 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.09/pr_735/19838

date | job status | comment
Sep 25 09:58:39 UTC 2024 | submitted | job id 19838 awaits release by job manager
Sep 25 09:58:46 UTC 2024 | released | job awaits launch by Slurm scheduler
Sep 25 10:04:00 UTC 2024 | running | job 19838 is running
Sep 25 10:36:36 UTC 2024 | finished | 😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-19838.out
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen3-1727259416.tar.gz
size: 2067 MiB (2167683174 bytes)
entries: 5519
modules under 2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80/modules/all
CUDA/12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80/software
CUDA/12.1.1
other under 2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80
2023.06/init/easybuild/eb_hooks.py
Sep 25 10:36:36 UTC 2024 | test result | 😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 9/9 test case(s) from 9 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-19838.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
Sep 25 11:02:07 UTC 2024 | uploaded | transfer of eessi-2023.06-software-linux-x86_64-amd-zen3-1727259416.tar.gz to S3 bucket succeeded
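
The artefact paths in these reports follow a regular layout: the CPU target and the accelerator target are both path components, and the tarball name uses the CPU target with dashes instead of slashes. A minimal sketch that rebuilds the paths for the zen3 job; `accel_prefix` and `tarball_arch_tag` are hypothetical helpers written only to mirror the paths shown above:

```python
def accel_prefix(version, os_type, cpu_arch, accel):
    """Build the accelerator-specific prefix, e.g. '2023.06/software/linux/<cpu>/accel/<accel>'."""
    return "/".join([version, "software", os_type, cpu_arch, "accel", accel])


def tarball_arch_tag(cpu_arch):
    """CPU target as it appears in tarball names: slashes replaced by dashes."""
    return cpu_arch.replace("/", "-")


prefix = accel_prefix("2023.06", "linux", "x86_64/amd/zen3", "nvidia/cc80")
modules_dir = prefix + "/modules/all"
software_dir = prefix + "/software"
```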

ocaisa (Member, Author) commented Sep 25, 2024

Verified the links are as we expect, ready to deploy

ocaisa added the ready-to-deploy label (Mark a PR as ready to deploy) on Sep 25, 2024
boegel added the bot:deploy label (Ask bot to deploy missing software installations to EESSI) and removed the ready-to-deploy label (Mark a PR as ready to deploy) on Sep 25, 2024
bedroge (Collaborator) commented Sep 25, 2024

Staging PRs have been merged, and the tarballs have been ingested.

bedroge dismissed boegel’s stale review on September 25, 2024 11:48, with the note: "comments have been addressed"

bedroge merged commit 1cbb7b7 into EESSI:2023.06-software.eessi.io on Sep 25, 2024 (35 checks passed)

eessi-bot bot commented Sep 25, 2024

PR merged! Moved ['/project/def-users/SHARED/jobs/2024.09/pr_735/19693', '/project/def-users/SHARED/jobs/2024.09/pr_735/19697', '/project/def-users/SHARED/jobs/2024.09/pr_735/19699', '/project/def-users/SHARED/jobs/2024.09/pr_735/19834', '/project/def-users/SHARED/jobs/2024.09/pr_735/19837', '/project/def-users/SHARED/jobs/2024.09/pr_735/19838'] to /project/def-users/SHARED/trash_bin/EESSI/software-layer/2024.09.25

PR merged! Moved [] to $HOME/trash_bin/EESSI/software-layer/2024.09.25

eessi-bot bot commented Sep 25, 2024

PR merged! Moved [] to /project/def-users/SHARED/trash_bin/EESSI/software-layer/2024.09.25
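
After the merge, each instance moves the job directories it owns into a dated trash directory; instances without jobs for this PR report an empty list. A rough sketch of how such a destination could be derived, assuming a `<trash base>/<org>/<repo>/<YYYY.MM.DD>` layout as seen in the messages above; `trash_dir` and `planned_moves` are hypothetical helpers, and the sketch only computes paths without touching the file system:

```python
from datetime import date
from pathlib import PurePosixPath


def trash_dir(trash_base, repo_full_name, merged_on):
    """Dated trash directory, e.g. <base>/EESSI/software-layer/2024.09.25."""
    return str(PurePosixPath(trash_base) / repo_full_name / merged_on.strftime("%Y.%m.%d"))


def planned_moves(job_dirs, destination):
    """Pair each job directory with its new location inside the trash directory."""
    return [(src, str(PurePosixPath(destination) / PurePosixPath(src).name)) for src in job_dirs]


dest = trash_dir("/project/def-users/SHARED/trash_bin", "EESSI/software-layer", date(2024, 9, 25))
moves = planned_moves(["/project/def-users/SHARED/jobs/2024.09/pr_735/19693"], dest)
```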

Labels
  • 2023.06-software.eessi.io: 2023.06 version of software.eessi.io
  • accel:nvidia
  • bot:deploy: Ask bot to deploy missing software installations to EESSI
3 participants