
Add CUDA 12.1.1, CUDA samples, and CUDA related hooks and lmodrc changes #434

Merged

Conversation

casparvl
Collaborator

Equivalent of #381, now targeting the new eessi.io domain.

Caspar van Leeuwen added 6 commits December 19, 2023 18:36
…DA module if a full CUDA SDK was also installed in host_injections (otherwise you have dead links to the non-redistributable parts of the CUDA SDK). Furthermore, for GPU-enabled modules, it checks whether the drivers have been linked into the host_injections directory, and whether they are new enough for the CUDA version that was used as a dependency of the GPU-enabled module you are trying to load. If any of these checks fails, it prints an error message with advice on how to proceed.
… are allowed to redistribute. It will create symlinks to the host_injections directory for the rest of the files that we are not allowed to redistribute. Additionally, create a hook to inject the GPU Lmod property when creating module files for modules that have CUDA as a dependency.
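
The load-time checks described in the first commit message can be sketched as follows. This is a Python illustration of the logic only, not the code from this PR (the real checks presumably live in the Lmod configuration produced by create_lmodrc.py). The function names, the `cuda_sdk_dir` path, and the idea of comparing against "the newest CUDA version the driver supports" are assumptions made for the example.

```python
from pathlib import Path
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '12.1.1' into (12, 1, 1)."""
    return tuple(int(part) for part in text.strip().split("."))


def check_gpu_module(
    cuda_sdk_dir: Path,
    driver_max_cuda: Optional[str],
    module_cuda: str,
) -> Optional[str]:
    """Return an error message with advice, or None if every check passes.

    cuda_sdk_dir:    hypothetical location of the full CUDA SDK in host_injections
    driver_max_cuda: newest CUDA version the linked driver supports,
                     or None if no driver libraries were linked
    module_cuda:     CUDA version the GPU-enabled module depends on
    """
    # Check 1: without the full SDK, the symlinks to the
    # non-redistributable files would be dead links.
    if not cuda_sdk_dir.is_dir():
        return (
            f"No full CUDA SDK found in {cuda_sdk_dir}; "
            "install it into host_injections first."
        )
    # Check 2: the host's driver libraries must have been linked in.
    if driver_max_cuda is None:
        return "No NVIDIA driver libraries linked into host_injections."
    # Check 3: the driver must be new enough (compare major.minor only).
    if parse_version(driver_max_cuda)[:2] < parse_version(module_cuda)[:2]:
        return (
            f"Driver supports CUDA up to {driver_max_cuda}, but this module "
            f"needs CUDA {module_cuda}; update the driver."
        )
    return None
```

A caller would print the returned message and refuse to load the module whenever the result is not `None`.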

eessi-bot bot commented Dec 19, 2023

Instance eessi-bot-mc-aws is configured to build:

  • arch x86_64/generic for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/generic for repo eessi-hpc.org-2023.06-software
  • arch x86_64/generic for repo eessi.io-2023.06-compat
  • arch x86_64/generic for repo eessi.io-2023.06-software
  • arch x86_64/intel/haswell for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/intel/haswell for repo eessi-hpc.org-2023.06-software
  • arch x86_64/intel/haswell for repo eessi.io-2023.06-compat
  • arch x86_64/intel/haswell for repo eessi.io-2023.06-software
  • arch x86_64/intel/skylake_avx512 for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/intel/skylake_avx512 for repo eessi-hpc.org-2023.06-software
  • arch x86_64/intel/skylake_avx512 for repo eessi.io-2023.06-compat
  • arch x86_64/intel/skylake_avx512 for repo eessi.io-2023.06-software
  • arch x86_64/amd/zen2 for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/amd/zen2 for repo eessi-hpc.org-2023.06-software
  • arch x86_64/amd/zen2 for repo eessi.io-2023.06-compat
  • arch x86_64/amd/zen2 for repo eessi.io-2023.06-software
  • arch x86_64/amd/zen3 for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/amd/zen3 for repo eessi-hpc.org-2023.06-software
  • arch x86_64/amd/zen3 for repo eessi.io-2023.06-compat
  • arch x86_64/amd/zen3 for repo eessi.io-2023.06-software
  • arch aarch64/generic for repo eessi-hpc.org-2023.06-compat
  • arch aarch64/generic for repo eessi-hpc.org-2023.06-software
  • arch aarch64/generic for repo eessi.io-2023.06-compat
  • arch aarch64/generic for repo eessi.io-2023.06-software
  • arch aarch64/neoverse_n1 for repo eessi-hpc.org-2023.06-compat
  • arch aarch64/neoverse_n1 for repo eessi-hpc.org-2023.06-software
  • arch aarch64/neoverse_n1 for repo eessi.io-2023.06-compat
  • arch aarch64/neoverse_n1 for repo eessi.io-2023.06-software
  • arch aarch64/neoverse_v1 for repo eessi-hpc.org-2023.06-compat
  • arch aarch64/neoverse_v1 for repo eessi-hpc.org-2023.06-software
  • arch aarch64/neoverse_v1 for repo eessi.io-2023.06-compat
  • arch aarch64/neoverse_v1 for repo eessi.io-2023.06-software

Member

@ocaisa ocaisa left a comment


There's some really impressive stuff going on here :P

Review threads on create_lmodrc.py: 4 (all resolved, 3 marked outdated)
@ocaisa
Member

ocaisa commented Dec 19, 2023

This doesn't work as is; it requires #410.

@ocaisa
Member

ocaisa commented Dec 19, 2023

@casparvl This needs a sync with the target branch

@casparvl casparvl changed the title from "Add CUDA, CUDA samples, and CUDA related hooks and lmodrc changes" to "Add CUDA, CUDA samples, and CUDA related hooks and lmodrc changes [WIP, do NOT deploy unless sure it works!]" Dec 20, 2023
@casparvl
Collaborator Author

Just want to see what happens if I trigger the bot on this...

@casparvl
Collaborator Author

bot: build repo:eessi.io-2023.06-software arch:x86_64/intel/skylake_avx512


eessi-bot bot commented Dec 20, 2023

Updates by the bot instance eessi-bot-mc-aws
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/intel/skylake_avx512 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/intel/skylake_avx512
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/intel/skylake_avx512 resulted in:
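
The expansion the bot reports above (`repo:` to `repository:`, `arch:` to `architecture:`) can be illustrated with a few lines of Python. This is a minimal sketch under the assumption that only these two abbreviations exist; it is not the eessi-bot's actual parser, and `expand_bot_command` is a hypothetical name.

```python
# Abbreviations seen in this conversation; the real bot may accept others.
ABBREVIATIONS = {"repo": "repository", "arch": "architecture"}


def expand_bot_command(line: str) -> str:
    """Expand 'bot: build repo:X arch:Y' into 'build repository:X architecture:Y'."""
    text = line.strip()
    if text.startswith("bot:"):
        text = text[len("bot:"):].strip()
    words = text.split()
    if not words:
        raise ValueError("empty bot command")
    action, args = words[0], words[1:]
    expanded = []
    for arg in args:
        # Split on the first ':' only, so values may contain ':' themselves.
        key, sep, value = arg.partition(":")
        expanded.append(f"{ABBREVIATIONS.get(key, key)}{sep}{value}")
    return " ".join([action, *expanded])


print(expand_bot_command(
    "bot: build repo:eessi.io-2023.06-software arch:x86_64/intel/skylake_avx512"
))
```

Run against the command from this thread, it prints the same expanded format the bot reported.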


eessi-bot bot commented Dec 20, 2023

New job on instance eessi-bot-mc-aws for architecture x86_64-intel-skylake_avx512 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2023.12/pr_434/2696

date job status comment
Dec 20 09:49:04 UTC 2023 submitted job id 2696 awaits release by job manager
Dec 20 09:49:45 UTC 2023 released job awaits launch by Slurm scheduler
Dec 20 09:54:55 UTC 2023 running job 2696 is running
Dec 20 10:13:38 UTC 2023 finished
😢 FAILURE
✅ job output file slurm-2696.out
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
No artefacts were created or found.
Dec 20 10:13:38 UTC 2023 test result (no tests yet)
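
The five checks in the job report above amount to scanning the Slurm output file for a handful of marker strings. The sketch below shows one way to reproduce that report in Python. The marker strings are taken from the report itself; the rule that a job succeeds only when all five checks pass is an assumption inferred from the reports in this thread, and `check_job_output` is a hypothetical helper, not the bot's code.

```python
from typing import List, Tuple

# Markers that must NOT appear in a successful build log.
MUST_BE_ABSENT = ["ERROR: ", "FAILED: ", "required modules missing:"]
# Markers that MUST appear in a successful build log.
MUST_BE_PRESENT = ["No missing installations", ".tar.gz created!"]


def check_job_output(log_text: str) -> Tuple[bool, List[Tuple[bool, str]]]:
    """Return (overall_success, [(check_passed, description), ...])."""
    rows = []
    for marker in MUST_BE_ABSENT:
        found = marker in log_text
        prefix = "found" if found else "no"
        rows.append((not found, f"{prefix} message matching {marker.strip()}"))
    for marker in MUST_BE_PRESENT:
        found = marker in log_text
        prefix = "found" if found else "no"
        rows.append((found, f"{prefix} message matching {marker}"))
    # Assumption: the job counts as a success only if every check passed.
    return all(ok for ok, _ in rows), rows


if __name__ == "__main__":
    sample = "ERROR: build failed\nfoo.tar.gz created!\n"  # made-up log text
    success, rows = check_job_output(sample)
    print("SUCCESS" if success else "FAILURE")
    for ok, description in rows:
        print("OK  " if ok else "FAIL", description)
```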

Caspar van Leeuwen added 2 commits December 20, 2023 11:12
@casparvl
Collaborator Author

bot: build repo:eessi.io-2023.06-software arch:x86_64/intel/skylake_avx512

@boegel
Contributor

boegel commented Dec 21, 2023

bot: build repo:eessi.io-2023.06-software arch:x86_64/generic
bot: build repo:eessi.io-2023.06-software arch:x86_64/intel/haswell
bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2
bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3
bot: build repo:eessi.io-2023.06-software arch:aarch64/generic
bot: build repo:eessi.io-2023.06-software arch:aarch64/neoverse_n1
bot: build repo:eessi.io-2023.06-software arch:aarch64/neoverse_v1


eessi-bot bot commented Dec 21, 2023

Updates by the bot instance eessi-bot-mc-aws


eessi-bot bot commented Dec 21, 2023

New job on instance eessi-bot-mc-aws for architecture x86_64-generic for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2023.12/pr_434/2735

date job status comment
Dec 21 13:40:05 UTC 2023 submitted job id 2735 awaits release by job manager
Dec 21 13:40:08 UTC 2023 released job awaits launch by Slurm scheduler
Dec 21 13:45:44 UTC 2023 running job 2735 is running
Dec 21 14:02:06 UTC 2023 finished
😁 SUCCESS
✅ job output file slurm-2735.out
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-generic-1703167048.tar.gz
size: 1451 MiB (1522389011 bytes)
entries: 5959
modules under 2023.06/software/linux/x86_64/generic/modules/all
CUDA/12.1.1.lua
CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1.lua
software under 2023.06/software/linux/x86_64/generic/software
CUDA/12.1.1
CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1
other under 2023.06/software/linux/x86_64/generic
2023.06/scripts/gpu_support/nvidia/install_cuda_host_injections.sh
2023.06/scripts/gpu_support/nvidia/link_nvidia_host_libraries.sh
2023.06/scripts/utils.sh
.lmod/cache/spiderT.lua
.lmod/cache/spiderT.luac_5.1
.lmod/cache/timestamp
.lmod/lmodrc.lua
Dec 21 14:02:06 UTC 2023 test result (no tests yet)
Dec 21 14:05:49 UTC 2023 uploaded transfer of eessi-2023.06-software-linux-x86_64-generic-1703167048.tar.gz to S3 bucket succeeded
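
The tarball names in these reports follow a visible pattern: EESSI version, component, OS, CPU target, and a trailing number. The Python sketch below splits a name into those parts. Treating the trailing number as a Unix timestamp is an assumption (it is consistent with the job times shown here), and `parse_tarball_name` and the field names are hypothetical.

```python
import re
from datetime import datetime, timezone

# eessi-<version>-<component>-<os>-<cpu target with '-' separators>-<number>.tar.gz
TARBALL_RE = re.compile(
    r"^eessi-(?P<version>[^-]+)-(?P<component>[^-]+)-(?P<os>[^-]+)-"
    r"(?P<arch>.+)-(?P<timestamp>\d+)\.tar\.gz$"
)


def parse_tarball_name(name: str) -> dict:
    """Split a build tarball name into its parts."""
    match = TARBALL_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected tarball name: {name}")
    info = match.groupdict()
    # The CPU target uses '-' in the file name but '/' in the install path.
    info["arch"] = info["arch"].replace("-", "/")
    # Assumption: the trailing number is seconds since the Unix epoch.
    info["created"] = datetime.fromtimestamp(int(info["timestamp"]), tz=timezone.utc)
    return info


info = parse_tarball_name(
    "eessi-2023.06-software-linux-x86_64-generic-1703167048.tar.gz"
)
print(info["version"], info["component"], info["os"], info["arch"])
print(info["created"].isoformat())
```

For the x86_64/generic tarball this yields a creation time of 2023-12-21T13:57:28+00:00, which falls between the "running" and "finished" entries of job 2735.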


eessi-bot bot commented Dec 21, 2023

New job on instance eessi-bot-mc-aws for architecture x86_64-intel-haswell for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2023.12/pr_434/2736

date job status comment
Dec 21 13:40:09 UTC 2023 submitted job id 2736 awaits release by job manager
Dec 21 13:41:25 UTC 2023 released job awaits launch by Slurm scheduler
Dec 21 13:48:10 UTC 2023 running job 2736 is running
Dec 21 14:04:09 UTC 2023 finished
😁 SUCCESS
✅ job output file slurm-2736.out
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-intel-haswell-1703167226.tar.gz
size: 1451 MiB (1522381745 bytes)
entries: 5959
modules under 2023.06/software/linux/x86_64/intel/haswell/modules/all
CUDA/12.1.1.lua
CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1.lua
software under 2023.06/software/linux/x86_64/intel/haswell/software
CUDA/12.1.1
CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1
other under 2023.06/software/linux/x86_64/intel/haswell
2023.06/scripts/gpu_support/nvidia/install_cuda_host_injections.sh
2023.06/scripts/gpu_support/nvidia/link_nvidia_host_libraries.sh
2023.06/scripts/utils.sh
.lmod/cache/spiderT.lua
.lmod/cache/spiderT.luac_5.1
.lmod/cache/timestamp
.lmod/lmodrc.lua
Dec 21 14:04:09 UTC 2023 test result (no tests yet)
Dec 21 14:07:29 UTC 2023 uploaded transfer of eessi-2023.06-software-linux-x86_64-intel-haswell-1703167226.tar.gz to S3 bucket succeeded


eessi-bot bot commented Dec 21, 2023

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2023.12/pr_434/2737

date job status comment
Dec 21 13:40:13 UTC 2023 submitted job id 2737 awaits release by job manager
Dec 21 13:41:20 UTC 2023 released job awaits launch by Slurm scheduler
Dec 21 13:46:53 UTC 2023 running job 2737 is running
Dec 21 14:01:02 UTC 2023 finished
😁 SUCCESS
✅ job output file slurm-2737.out
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1703167014.tar.gz
size: 1451 MiB (1522355425 bytes)
entries: 5959
modules under 2023.06/software/linux/x86_64/amd/zen2/modules/all
CUDA/12.1.1.lua
CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen2/software
CUDA/12.1.1
CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1
other under 2023.06/software/linux/x86_64/amd/zen2
2023.06/scripts/gpu_support/nvidia/install_cuda_host_injections.sh
2023.06/scripts/gpu_support/nvidia/link_nvidia_host_libraries.sh
2023.06/scripts/utils.sh
.lmod/cache/spiderT.lua
.lmod/cache/spiderT.luac_5.1
.lmod/cache/timestamp
.lmod/lmodrc.lua
Dec 21 14:01:02 UTC 2023 test result (no tests yet)
Dec 21 14:06:55 UTC 2023 uploaded transfer of eessi-2023.06-software-linux-x86_64-amd-zen2-1703167014.tar.gz to S3 bucket succeeded


eessi-bot bot commented Dec 21, 2023

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen3 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2023.12/pr_434/2738

date job status comment
Dec 21 13:40:17 UTC 2023 submitted job id 2738 awaits release by job manager
Dec 21 13:41:22 UTC 2023 released job awaits launch by Slurm scheduler
Dec 21 13:46:55 UTC 2023 running job 2738 is running
Dec 21 13:59:54 UTC 2023 finished
😁 SUCCESS
✅ job output file slurm-2738.out
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen3-1703166956.tar.gz
size: 1451 MiB (1522373174 bytes)
entries: 5959
modules under 2023.06/software/linux/x86_64/amd/zen3/modules/all
CUDA/12.1.1.lua
CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen3/software
CUDA/12.1.1
CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1
other under 2023.06/software/linux/x86_64/amd/zen3
2023.06/scripts/gpu_support/nvidia/install_cuda_host_injections.sh
2023.06/scripts/gpu_support/nvidia/link_nvidia_host_libraries.sh
2023.06/scripts/utils.sh
.lmod/cache/spiderT.lua
.lmod/cache/spiderT.luac_5.1
.lmod/cache/timestamp
.lmod/lmodrc.lua
Dec 21 13:59:54 UTC 2023 test result (no tests yet)
Dec 21 14:08:02 UTC 2023 uploaded transfer of eessi-2023.06-software-linux-x86_64-amd-zen3-1703166956.tar.gz to S3 bucket succeeded


eessi-bot bot commented Dec 21, 2023

New job on instance eessi-bot-mc-aws for architecture aarch64-generic for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2023.12/pr_434/2739

date job status comment
Dec 21 13:40:20 UTC 2023 submitted job id 2739 awaits release by job manager
Dec 21 13:41:11 UTC 2023 released job awaits launch by Slurm scheduler
Dec 21 13:46:47 UTC 2023 running job 2739 is running
Dec 21 14:01:00 UTC 2023 finished
😁 SUCCESS
✅ job output file slurm-2739.out
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-aarch64-generic-1703166970.tar.gz
size: 1416 MiB (1485496032 bytes)
entries: 3940
modules under 2023.06/software/linux/aarch64/generic/modules/all
CUDA/12.1.1.lua
CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1.lua
software under 2023.06/software/linux/aarch64/generic/software
CUDA/12.1.1
CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1
other under 2023.06/software/linux/aarch64/generic
2023.06/scripts/gpu_support/nvidia/install_cuda_host_injections.sh
2023.06/scripts/gpu_support/nvidia/link_nvidia_host_libraries.sh
2023.06/scripts/utils.sh
.lmod/cache/spiderT.lua
.lmod/cache/spiderT.luac_5.1
.lmod/cache/timestamp
.lmod/lmodrc.lua
Dec 21 14:01:00 UTC 2023 test result (no tests yet)
Dec 21 14:08:35 UTC 2023 uploaded transfer of eessi-2023.06-software-linux-aarch64-generic-1703166970.tar.gz to S3 bucket succeeded


eessi-bot bot commented Dec 21, 2023

New job on instance eessi-bot-mc-aws for architecture aarch64-neoverse_n1 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2023.12/pr_434/2740

date job status comment
Dec 21 13:40:25 UTC 2023 submitted job id 2740 awaits release by job manager
Dec 21 13:41:14 UTC 2023 released job awaits launch by Slurm scheduler
Dec 21 13:42:28 UTC 2023 running job 2740 is running
Dec 21 13:55:19 UTC 2023 finished
😁 SUCCESS
✅ job output file slurm-2740.out
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-aarch64-neoverse_n1-1703166655.tar.gz
size: 1416 MiB (1485464609 bytes)
entries: 3940
modules under 2023.06/software/linux/aarch64/neoverse_n1/modules/all
CUDA/12.1.1.lua
CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1.lua
software under 2023.06/software/linux/aarch64/neoverse_n1/software
CUDA/12.1.1
CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1
other under 2023.06/software/linux/aarch64/neoverse_n1
2023.06/scripts/gpu_support/nvidia/install_cuda_host_injections.sh
2023.06/scripts/gpu_support/nvidia/link_nvidia_host_libraries.sh
2023.06/scripts/utils.sh
.lmod/cache/spiderT.lua
.lmod/cache/spiderT.luac_5.1
.lmod/cache/timestamp
.lmod/lmodrc.lua
Dec 21 13:55:19 UTC 2023 test result (no tests yet)
Dec 21 14:06:22 UTC 2023 uploaded transfer of eessi-2023.06-software-linux-aarch64-neoverse_n1-1703166655.tar.gz to S3 bucket succeeded


eessi-bot bot commented Dec 21, 2023

New job on instance eessi-bot-mc-aws for architecture aarch64-neoverse_v1 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2023.12/pr_434/2741

date job status comment
Dec 21 13:40:29 UTC 2023 submitted job id 2741 awaits release by job manager
Dec 21 13:41:17 UTC 2023 released job awaits launch by Slurm scheduler
Dec 21 13:42:31 UTC 2023 running job 2741 is running
Dec 21 13:52:59 UTC 2023 finished
😁 SUCCESS
✅ job output file slurm-2741.out
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-aarch64-neoverse_v1-1703166553.tar.gz
size: 1416 MiB (1485467840 bytes)
entries: 3940
modules under 2023.06/software/linux/aarch64/neoverse_v1/modules/all
CUDA/12.1.1.lua
CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1.lua
software under 2023.06/software/linux/aarch64/neoverse_v1/software
CUDA/12.1.1
CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1
other under 2023.06/software/linux/aarch64/neoverse_v1
2023.06/scripts/gpu_support/nvidia/install_cuda_host_injections.sh
2023.06/scripts/gpu_support/nvidia/link_nvidia_host_libraries.sh
2023.06/scripts/utils.sh
.lmod/cache/spiderT.lua
.lmod/cache/spiderT.luac_5.1
.lmod/cache/timestamp
.lmod/lmodrc.lua
Dec 21 13:52:59 UTC 2023 test result (no tests yet)

@boegel boegel added the bot:deploy label (Ask bot to deploy missing software installations to EESSI) Dec 21, 2023
@boegel
Contributor

boegel commented Dec 21, 2023

The bot's upload of the tarball for aarch64/neoverse_v1 failed due to a connection error:

[20231221-T14:09:06] WARNING: A crash occurred!
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 384, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 380, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib64/python3.6/http/client.py", line 1365, in getresponse
    response.begin()
  File "/usr/lib64/python3.6/http/client.py", line 320, in begin
    version, status, reason = self._read_status()
  File "/usr/lib64/python3.6/http/client.py", line 289, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

I've uploaded eessi-2023.06-software-linux-aarch64-neoverse_v1-1703166553.tar.gz manually using the bot account...

@boegel
Contributor

boegel commented Dec 21, 2023

All 8 staging PRs merged, deploy is under way...

@ocaisa ocaisa dismissed boegel’s stale review December 21, 2023 15:17

This has all been heavily reviewed and tested!

@ocaisa
Member

ocaisa commented Dec 21, 2023

Going in, thanks to everyone for all the efforts to get this merged, especially @huebner-m who put a lot of the initial effort in. It's been a journey!

@ocaisa ocaisa merged commit 5c322b0 into EESSI:2023.06-software.eessi.io Dec 21, 2023
33 checks passed
@ocaisa ocaisa changed the title from "Add CUDA, CUDA samples, and CUDA related hooks and lmodrc changes [WIP, do NOT deploy unless sure it works!]" to "Add CUDA, CUDA samples, and CUDA related hooks and lmodrc changes" Dec 21, 2023
@boegel boegel changed the title from "Add CUDA, CUDA samples, and CUDA related hooks and lmodrc changes" to "Add CUDA 12.1.1, CUDA samples, and CUDA related hooks and lmodrc changes" Jan 9, 2024
Labels
  • 2023.06-software.eessi.io (2023.06 version of software.eessi.io)
  • accel:nvidia
  • bot:deploy (Ask bot to deploy missing software installations to EESSI)

3 participants