{2023.06}[foss/2023a] UCX-CUDA v1.14.1 w/ CUDA 12.1.1 (rebuild) #719

Merged: 4 commits into EESSI:2023.06-software.eessi.io on Sep 25, 2024

Conversation

casparvl
Collaborator

@casparvl casparvl commented Sep 18, 2024


eessi-bot bot commented Sep 18, 2024

Instance eessi-bot-mc-aws is configured to build for:

  • architectures: x86_64/generic, x86_64/intel/haswell, x86_64/intel/skylake_avx512, x86_64/amd/zen2, x86_64/amd/zen3, aarch64/generic, aarch64/neoverse_n1, aarch64/neoverse_v1
  • repositories: eessi.io-2023.06-compat, eessi-hpc.org-2023.06-software, eessi-hpc.org-2023.06-compat, eessi.io-2023.06-software


eessi-bot bot commented Sep 18, 2024

Instance eessi-bot-mc-azure is configured to build for:

  • architectures: x86_64/amd/zen4
  • repositories: eessi-hpc.org-2023.06-software, eessi-hpc.org-2023.06-compat, eessi.io-2023.06-software, eessi.io-2023.06-compat

Instance boegel-bot-deucalion is configured to build for:

  • architectures: aarch64/a64fx
  • repositories: eessi.io-2023.06-software

@casparvl
Collaborator Author

bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80


eessi-bot bot commented Sep 18, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:


eessi-bot bot commented Sep 18, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted

Updates by the bot instance boegel-bot-deucalion (click for details)
  • account casparvl has NO permission to send commands to the bot
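
Note on the "expanded format" lines above: the bot reports the short keys `repo`, `arch` and `accel` rewritten as `repository`, `architecture` and `accelerator`. A minimal Python sketch of that expansion follows. It is not the bot's actual code; the function name `expand_bot_command` and the alias table are assumptions inferred only from the lines in this thread.

```python
# Hypothetical sketch: the alias table is inferred from the "expanded format"
# lines in this thread, not taken from the bot's source code.
ALIASES = {
    "repo": "repository",
    "arch": "architecture",
    "accel": "accelerator",
}


def expand_bot_command(comment_line):
    """Expand 'bot: build repo:X arch:Y accel:Z' into the long-key form."""
    prefix = "bot:"
    if not comment_line.startswith(prefix):
        raise ValueError("not a bot command")
    tokens = comment_line[len(prefix):].split()
    if not tokens:
        raise ValueError("empty bot command")
    action, args = tokens[0], tokens[1:]
    expanded = [action]
    for arg in args:
        # Split on the first ':' only, so the value keeps any later characters.
        key, sep, value = arg.partition(":")
        if not sep:
            raise ValueError(f"argument without ':' separator: {arg!r}")
        # Unknown keys are passed through unchanged.
        expanded.append(f"{ALIASES.get(key, key)}:{value}")
    return " ".join(expanded)


if __name__ == "__main__":
    print(expand_bot_command(
        "bot: build repo:eessi.io-2023.06-software "
        "arch:x86_64/amd/zen2 accel:nvidia/cc80"
    ))
```

Keys that are already in long form pass through unchanged, so the sketch accepts both spellings.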


eessi-bot bot commented Sep 18, 2024

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.09/pr_719/18983

date job status comment
Sep 18 19:35:02 UTC 2024 submitted job id 18983 awaits release by job manager
Sep 18 19:35:12 UTC 2024 released job awaits launch by Slurm scheduler
Sep 18 19:40:19 UTC 2024 running job 18983 is running
Sep 18 19:41:19 UTC 2024 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-18983.out
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
No artefacts were created or found.
Sep 18 19:41:19 UTC 2024 test result
😢 FAILURE (click triangle for details)
Reason
EESSI test suite was not run, test step itself failed to execute.
Details
✅ job output file slurm-18983.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
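
Note on the report above: the build result is derived from text patterns searched in the Slurm output file, yet job 18983 passed every pattern check and was still reported as FAILURE with no artefacts. The Python sketch below shows one way such a check could be combined. The pattern strings are copied from the report; the function names and the rule that at least one artefact must exist are assumptions, not the bot's confirmed logic.

```python
import re

# Patterns copied from the bot report above. How the bot combines them is an
# assumption made for this sketch.
MUST_NOT_MATCH = [r"ERROR:", r"FAILED:", r"required modules missing:"]
MUST_MATCH = [r"No missing installations", r"\.tar\.gz created!"]


def check_build_output(text):
    """Return a list of (passed, description) tuples for the job output."""
    results = []
    for pattern in MUST_NOT_MATCH:
        found = re.search(pattern, text) is not None
        results.append((not found, f"no message matching {pattern}"))
    for pattern in MUST_MATCH:
        found = re.search(pattern, text) is not None
        results.append((found, f"found message matching {pattern}"))
    return results


def summarize(results, artefacts):
    """Report SUCCESS only if all checks pass and an artefact exists.

    The artefact requirement is an assumption based on job 18983, which
    passed every text check but had no artefacts and was marked FAILURE.
    """
    all_passed = all(passed for passed, _ in results)
    return "SUCCESS" if all_passed and len(artefacts) > 0 else "FAILURE"
```

For example, `summarize(check_build_output(open("slurm-18983.out").read()), [])` would yield `FAILURE` under these assumptions regardless of the text checks, because the artefact list is empty.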

@casparvl
Collaborator Author

bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80


eessi-bot bot commented Sep 19, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

Updates by the bot instance boegel-bot-deucalion (click for details)
  • account casparvl has NO permission to send commands to the bot


eessi-bot bot commented Sep 19, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted


eessi-bot bot commented Sep 19, 2024

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.09/pr_719/19105

date job status comment
Sep 19 07:18:48 UTC 2024 submitted job id 19105 awaits release by job manager
Sep 19 07:19:48 UTC 2024 released job awaits launch by Slurm scheduler
Sep 19 07:26:04 UTC 2024 running job 19105 is running
Sep 19 07:32:29 UTC 2024 finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-19105.out
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1726730817.tar.gz
size: 0 MiB (45 bytes)
entries: 0
modules under 2023.06/software/linux/x86_64/amd/zen2/modules/all
no module files in tarball
software under 2023.06/software/linux/x86_64/amd/zen2/software
no software packages in tarball
other under 2023.06/software/linux/x86_64/amd/zen2
no other files in tarball
Sep 19 07:32:29 UTC 2024 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 9/9 test case(s) from 9 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-19105.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@casparvl
Collaborator Author

bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80

Updates by the bot instance boegel-bot-deucalion (click for details)
  • account casparvl has NO permission to send commands to the bot


eessi-bot bot commented Sep 19, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:


eessi-bot bot commented Sep 19, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted


eessi-bot bot commented Sep 19, 2024

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.09/pr_719/19111

date job status comment
Sep 19 08:32:04 UTC 2024 submitted job id 19111 awaits release by job manager
Sep 19 08:32:24 UTC 2024 released job awaits launch by Slurm scheduler
Sep 19 08:37:47 UTC 2024 running job 19111 is running
Sep 19 08:43:13 UTC 2024 finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-19111.out
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1726735091.tar.gz
size: 0 MiB (45 bytes)
entries: 0
modules under 2023.06/software/linux/x86_64/amd/zen2/modules/all
no module files in tarball
software under 2023.06/software/linux/x86_64/amd/zen2/software
no software packages in tarball
other under 2023.06/software/linux/x86_64/amd/zen2
no other files in tarball
Sep 19 08:43:13 UTC 2024 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 9/9 test case(s) from 9 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-19111.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@casparvl
Collaborator Author

bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80

Updates by the bot instance boegel-bot-deucalion (click for details)
  • account casparvl has NO permission to send commands to the bot


eessi-bot bot commented Sep 19, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:


eessi-bot bot commented Sep 19, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted


eessi-bot bot commented Sep 19, 2024

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.09/pr_719/19113

date job status comment
Sep 19 08:35:08 UTC 2024 submitted job id 19113 awaits release by job manager
Sep 19 08:35:36 UTC 2024 released job awaits launch by Slurm scheduler
Sep 19 08:41:02 UTC 2024 running job 19113 is running
Sep 19 08:48:36 UTC 2024 finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-19113.out
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1726735419.tar.gz
size: 0 MiB (778217 bytes)
entries: 38
modules under 2023.06/software/linux/x86_64/amd/zen2/modules/all
no module files in tarball
software under 2023.06/software/linux/x86_64/amd/zen2/software
no software packages in tarball
other under 2023.06/software/linux/x86_64/amd/zen2
accel/nvidia/cc80/modules/all/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1.lua
accel/nvidia/cc80/modules/lib/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1.lua
accel/nvidia/cc80/software/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1/
accel/nvidia/cc80/software/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1/easybuild/
accel/nvidia/cc80/software/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1/easybuild/easybuild-UCX-CUDA-1.14.1-20240919.084238.log.bz2
accel/nvidia/cc80/software/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1/easybuild/easybuild-UCX-CUDA-1.14.1-20240919.084238_test_report.md
accel/nvidia/cc80/software/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1/easybuild/reprod/
accel/nvidia/cc80/software/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1/easybuild/reprod/easyblocks/
accel/nvidia/cc80/software/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1/easybuild/reprod/easyblocks/configuremake.py
accel/nvidia/cc80/software/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1/easybuild/reprod/easyblocks/ucx_plugins.py
accel/nvidia/cc80/software/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1/easybuild/reprod/hooks/
accel/nvidia/cc80/software/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1/easybuild/reprod/hooks/eb_hooks.py
accel/nvidia/cc80/software/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1/easybuild/reprod/UCX-CUDA-1.14.1-GCCcore-12.3.0-CUDA-12.1.1.eb
accel/nvidia/cc80/software/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1/easybuild/reprod/UCX-CUDA-1.14.1-GCCcore-12.3.0-CUDA-12.1.1.env
accel/nvidia/cc80/software/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1/easybuild/UCX-CUDA-1.11.0_link_against_existing_UCX_libs.patch
accel/nvidia/cc80/software/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1/easybuild/UCX-CUDA-1.14.1-GCCcore-12.3.0-CUDA-12.1.1-easybuild-devel
accel/nvidia/cc80/software/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1/easybuild/UCX-CUDA-1.14.1-GCCcore-12.3.0-CUDA-12.1.1.eb
accel/nvidia/cc80/software/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1/ucx/
accel/nvidia/cc80/software/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1/ucx/libucm_cuda.a
accel/nvidia/cc80/software/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1/ucx/libucm_cuda.la
accel/nvidia/cc80/software/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1/ucx/libucm_cuda.so
accel/nvidia/cc80/software/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1/ucx/libucm_cuda.so.0
accel/nvidia/cc80/software/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1/ucx/libucm_cuda.so.0.0.0
accel/nvidia/cc80/software/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1/ucx/libuct_cuda.a
accel/nvidia/cc80/software/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1/ucx/libuct_cuda_gdrcopy.a
accel/nvidia/cc80/software/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1/ucx/libuct_cuda_gdrcopy.la
accel/nvidia/cc80/software/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1/ucx/libuct_cuda_gdrcopy.so
accel/nvidia/cc80/software/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1/ucx/libuct_cuda_gdrcopy.so.0
accel/nvidia/cc80/software/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1/ucx/libuct_cuda_gdrcopy.so.0.0.0
accel/nvidia/cc80/software/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1/ucx/libuct_cuda.la
accel/nvidia/cc80/software/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1/ucx/libuct_cuda.so
accel/nvidia/cc80/software/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1/ucx/libuct_cuda.so.0
accel/nvidia/cc80/software/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1/ucx/libuct_cuda.so.0.0.0
accel/nvidia/cc80/software/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1/ucx/libucx_perftest_cuda.a
accel/nvidia/cc80/software/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1/ucx/libucx_perftest_cuda.la
accel/nvidia/cc80/software/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1/ucx/libucx_perftest_cuda.so
accel/nvidia/cc80/software/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1/ucx/libucx_perftest_cuda.so.0
accel/nvidia/cc80/software/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1/ucx/libucx_perftest_cuda.so.0.0.0
Sep 19 08:48:36 UTC 2024 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 9/9 test case(s) from 9 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-19113.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
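
Note on the artefact listing for job 19113: every file sits under `accel/nvidia/cc80/`, but the summary groups entries relative to the CPU prefix (`.../amd/zen2`), so the module file and the installation both land in "other". The report for job 19864 groups relative to the `accel/nvidia/cc80` prefix instead. The Python sketch below reproduces that effect. The function `classify_entries` is hypothetical and only illustrates prefix-based grouping; it lists individual files, whereas the bot's report lists package directories.

```python
def classify_entries(entries, prefix):
    """Group tarball paths into modules / software / other relative to prefix.

    Hypothetical helper: it mirrors the three headings in the bot report,
    not the bot's real implementation.
    """
    base = prefix.rstrip("/") + "/"
    modules_dir = base + "modules/all/"
    software_dir = base + "software/"
    groups = {"modules": [], "software": [], "other": []}
    for path in entries:
        if path.startswith(modules_dir):
            groups["modules"].append(path[len(modules_dir):])
        elif path.startswith(software_dir):
            groups["software"].append(path[len(software_dir):])
        elif path.startswith(base):
            # Anything else below the prefix is reported as "other".
            groups["other"].append(path[len(base):])
    return groups


if __name__ == "__main__":
    cpu = "2023.06/software/linux/x86_64/amd/zen2"
    pkg = "UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1"
    entries = [
        f"{cpu}/accel/nvidia/cc80/modules/all/{pkg}.lua",
        f"{cpu}/accel/nvidia/cc80/software/{pkg}/ucx/libucm_cuda.so",
    ]
    print(classify_entries(entries, cpu))
    print(classify_entries(entries, cpu + "/accel/nvidia/cc80"))
```

With the CPU prefix both entries fall into "other"; with the accelerator prefix they are split into "modules" and "software".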

@boegel boegel added the 2023.06-software.eessi.io 2023.06 version of software.eessi.io label Sep 25, 2024
@boegel boegel changed the title from {2023.06}[foss/2023a] UCX-CUDA v1.14.1 w/ CUDA 12.1.1 to {2023.06}[foss/2023a] UCX-CUDA v1.14.1 w/ CUDA 12.1.1 (rebuild) Sep 25, 2024
@casparvl
Collaborator Author

bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80

Updates by the bot instance boegel-bot-deucalion (click for details)
  • account casparvl has NO permission to send commands to the bot


eessi-bot bot commented Sep 25, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:


eessi-bot bot commented Sep 25, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted


eessi-bot bot commented Sep 25, 2024

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.09/pr_719/19864

date job status comment
Sep 25 13:51:09 UTC 2024 submitted job id 19864 awaits release by job manager
Sep 25 13:51:26 UTC 2024 released job awaits launch by Slurm scheduler
Sep 25 13:57:59 UTC 2024 running job 19864 is running
Sep 25 14:16:50 UTC 2024 finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-19864.out
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1727273093.tar.gz
size: 0 MiB (778855 bytes)
entries: 38
modules under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/modules/all
UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/software
UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1
other under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80
no other files in tarball
Sep 25 14:16:50 UTC 2024 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 9/9 test case(s) from 9 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-19864.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
Sep 25 18:33:05 UTC 2024 uploaded transfer of eessi-2023.06-software-linux-x86_64-amd-zen2-1727273093.tar.gz to S3 bucket succeeded

@casparvl
Collaborator Author

bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3 accel:nvidia/cc80


eessi-bot bot commented Sep 25, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3 accel:nvidia/cc80 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3 accelerator:nvidia/cc80 resulted in:

Updates by the bot instance boegel-bot-deucalion (click for details)
  • account casparvl has NO permission to send commands to the bot


eessi-bot bot commented Sep 25, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3 accel:nvidia/cc80 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3 accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted


eessi-bot bot commented Sep 25, 2024

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen3 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.09/pr_719/19865

date job status comment
Sep 25 13:51:53 UTC 2024 submitted job id 19865 awaits release by job manager
Sep 25 13:52:49 UTC 2024 released job awaits launch by Slurm scheduler
Sep 25 13:54:13 UTC 2024 running job 19865 is running
Sep 25 14:10:57 UTC 2024 finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-19865.out
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen3-1727272789.tar.gz
size: 0 MiB (778901 bytes)
entries: 38
modules under 2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80/modules/all
UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80/software
UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1
other under 2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80
no other files in tarball
Sep 25 14:10:57 UTC 2024 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 9/9 test case(s) from 9 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-19865.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
Sep 25 18:33:24 UTC 2024 uploaded transfer of eessi-2023.06-software-linux-x86_64-amd-zen3-1727272789.tar.gz to S3 bucket succeeded

@casparvl
Collaborator Author

I checked the tarball for job 19864:

$ readelf -d libucm_cuda.so | grep RPATH
 0x000000000000000f (RPATH)              Library rpath: [/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/software/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1/lib:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/software/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1/lib64:$ORIGIN:$ORIGIN/../lib:$ORIGIN/../lib64:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/GDRCopy/2.3.1-GCCcore-12.3.0/lib64:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/UCX/1.14.1-GCCcore-12.3.0/lib64:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/software/CUDA/12.1.1/lib64:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/pkgconf/1.9.5-GCCcore-12.3.0/lib64:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/GCCcore/12.3.0/lib64:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/GCCcore/12.3.0/lib:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/numactl/2.0.16-GCCcore-12.3.0/lib64:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/GCCcore/12.3.0/lib/gcc/x86_64-pc-linux-gnu/12.3.0:/cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/lib/../lib64:/cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/usr/lib/../lib64:/cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/lib:/cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/usr/lib:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/UCX/1.14.1-GCCcore-12.3.0/lib]

Looks like it's using the correct path to the newly deployed CUDA installation at /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/software/CUDA/12.1.1/lib64. One thing I did notice is that we have GDRCopy in the CPU tree. We might consider moving that to the GPU tree (first).

@boegel
Contributor

boegel commented Sep 25, 2024

Also for job 19865, looks good:

[bot@login1 19865]$ readelf -d 2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80/software/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1/ucx/libuct_cuda_gdrcopy.so.0| grep RPATH | tr ':' '\n' | grep '/CUDA/'
/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80/software/CUDA/12.1.1/lib64

W.r.t. GDRCopy: since it doesn't depend on CUDA at all, I wouldn't move it under accel/, even though it's really only useful on an NVIDIA GPU system, just to keep the rule simple as to what goes under accel/ (stuff that depends on CUDA).
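
Note on the RPATH checks above: both comments inspect `readelf -d` output and look for the entry that points at the CUDA installation under `accel/nvidia/cc80`. A minimal Python sketch of the same filter follows. The function names are hypothetical; the parsing assumes the `(RPATH)` / `(RUNPATH)` line layout shown in the `readelf` output quoted in this thread.

```python
import re

# Matches the dynamic-section line printed by `readelf -d`, for example:
#  0x000000000000000f (RPATH)   Library rpath: [/a/lib:/b/lib64]
_RPATH_LINE = re.compile(
    r"\((?:RPATH|RUNPATH)\)\s+Library (?:rpath|runpath): \[(.*)\]"
)


def rpath_entries(readelf_output):
    """Return the individual RPATH/RUNPATH directories as a list."""
    entries = []
    for line in readelf_output.splitlines():
        match = _RPATH_LINE.search(line)
        if match:
            entries.extend(p for p in match.group(1).split(":") if p)
    return entries


def cuda_entries(readelf_output):
    """Keep only entries with a '/CUDA/' path component.

    This is the same filter as `tr ':' '\\n' | grep '/CUDA/'` in the
    comment above; 'UCX-CUDA' directories do not match it.
    """
    return [p for p in rpath_entries(readelf_output) if "/CUDA/" in p]
```

The functions take the text produced by `readelf -d <library>`; running `readelf` itself (for instance through `subprocess.run`) is left out so the sketch stays independent of the build host.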

@ocaisa
Member

ocaisa commented Sep 25, 2024

GDRCopy is a tough one, but I think I lean towards @boegel's point of view: it doesn't need CUDA and builds just fine for all CPUs (at least currently). It allows us to keep the hook exception-free, which has value, and if it does cause problems later we can fix it in the next version.
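
Note on the rule discussed above ("stuff that depends on CUDA" goes under `accel/`): a rough Python sketch of such a rule follows. The function `install_subdir` is hypothetical and is not the code in `eb_hooks.py`. It only looks at the dependency list it is given, so whether the real rule considers direct or transitive dependencies is left open.

```python
def install_subdir(dependencies, accel_target):
    """Return the subdirectory, relative to the CPU prefix, for an install.

    dependencies: list of (name, version) tuples.
    accel_target: accelerator string such as 'nvidia/cc80', or None.

    Hypothetical sketch of the "depends on CUDA" rule from this thread,
    with no per-package exceptions.
    """
    depends_on_cuda = any(name == "CUDA" for name, _version in dependencies)
    if depends_on_cuda and accel_target:
        return f"accel/{accel_target}"
    # No CUDA dependency: the installation stays in the CPU tree.
    return ""


if __name__ == "__main__":
    # Dependency names and versions as they appear in the RPATH quoted above.
    ucx_cuda_deps = [("GDRCopy", "2.3.1"), ("UCX", "1.14.1"), ("CUDA", "12.1.1")]
    print(install_subdir(ucx_cuda_deps, "nvidia/cc80"))
    # GDRCopy's own dependency list is not given in this thread; the only
    # stated fact is that it does not depend on CUDA, so an empty list
    # stands in for it here.
    print(repr(install_subdir([], "nvidia/cc80")))
```

Under this rule GDRCopy stays in the CPU tree without any special case, which is the "exception-free" property mentioned above.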

@boegel boegel added the bot:deploy Ask bot to deploy missing software installations to EESSI label Sep 25, 2024
@boegel
Contributor

boegel commented Sep 25, 2024

staging PRs merged

@boegel
Contributor

boegel commented Sep 25, 2024

ingested, so merging this...

@boegel boegel merged commit 808161d into EESSI:2023.06-software.eessi.io Sep 25, 2024
35 checks passed

eessi-bot bot commented Sep 25, 2024

PR merged! Moved ['/project/def-users/SHARED/jobs/2024.09/pr_719/18983', '/project/def-users/SHARED/jobs/2024.09/pr_719/19105', '/project/def-users/SHARED/jobs/2024.09/pr_719/19111', '/project/def-users/SHARED/jobs/2024.09/pr_719/19113', '/project/def-users/SHARED/jobs/2024.09/pr_719/19864', '/project/def-users/SHARED/jobs/2024.09/pr_719/19865'] to /project/def-users/SHARED/trash_bin/EESSI/software-layer/2024.09.25

PR merged! Moved [] to $HOME/trash_bin/EESSI/software-layer/2024.09.25


eessi-bot bot commented Sep 25, 2024

PR merged! Moved [] to /project/def-users/SHARED/trash_bin/EESSI/software-layer/2024.09.25

Labels
2023.06-software.eessi.io 2023.06 version of software.eessi.io accel:nvidia bot:deploy Ask bot to deploy missing software installations to EESSI

3 participants