DEBUG only {2023.06}[2023a] PyTorch-Bundle v2.1.2 -- tweaked hooks employing LD_PRELOAD #688

Open
wants to merge 20 commits into
base: 2023.06-software.eessi.io

trz42
Collaborator

@trz42 trz42 commented Aug 27, 2024

This PR only tests whether the tweaked hooks solve the build issue for SentencePiece on aarch64 CPUs. See #585.

1 out of 2 required modules missing:

* libmad/0.15.1b-GCCcore-12.3.0 (libmad-0.15.1b-GCCcore-12.3.0.eb)

and

12 out of 137 required modules missing:

* parameterized/0.9.0-GCCcore-12.3.0 (parameterized-0.9.0-GCCcore-12.3.0.eb)
* tqdm/4.66.1-GCCcore-12.3.0 (tqdm-4.66.1-GCCcore-12.3.0.eb)
* Scalene/1.5.26-GCCcore-12.3.0 (Scalene-1.5.26-GCCcore-12.3.0.eb)
* gperftools/2.12-GCCcore-12.3.0 (gperftools-2.12-GCCcore-12.3.0.eb)
* SentencePiece/0.2.0-GCC-12.3.0 (SentencePiece-0.2.0-GCC-12.3.0.eb)
* imageio/2.33.1-gfbf-2023a (imageio-2.33.1-gfbf-2023a.eb)
* tensorboard/2.15.1-gfbf-2023a (tensorboard-2.15.1-gfbf-2023a.eb)
* libmad/0.15.1b-GCCcore-12.3.0 (libmad-0.15.1b-GCCcore-12.3.0.eb)
* SoX/14.4.2-GCCcore-12.3.0 (SoX-14.4.2-GCCcore-12.3.0.eb)
* NLTK/3.8.1-foss-2023a (NLTK-3.8.1-foss-2023a.eb)
* scikit-image/0.22.0-foss-2023a (scikit-image-0.22.0-foss-2023a.eb)
* PyTorch-bundle/2.1.2-foss-2023a (PyTorch-bundle-2.1.2-foss-2023a.eb)
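The two lists above overlap: libmad shows up in both. A minimal sketch for collapsing such reports into a unique set of modules could look like the following. The function name `parse_missing_modules` and the regular expression are assumptions made for illustration; they are not part of EasyBuild or the EESSI tooling.

```python
import re

# Matches report lines such as
# "* tqdm/4.66.1-GCCcore-12.3.0 (tqdm-4.66.1-GCCcore-12.3.0.eb)"
MISSING_LINE = re.compile(r"^\*\s+(?P<module>\S+)\s+\((?P<easyconfig>\S+\.eb)\)$")


def parse_missing_modules(report: str) -> dict:
    """Map each missing module to its easyconfig file, keeping first-seen order."""
    missing = {}
    for line in report.splitlines():
        match = MISSING_LINE.match(line.strip())
        if match:
            # setdefault keeps the first occurrence and ignores repeats
            missing.setdefault(match["module"], match["easyconfig"])
    return missing


# Shortened excerpt of the report in the PR description
report = """\
1 out of 2 required modules missing:

* libmad/0.15.1b-GCCcore-12.3.0 (libmad-0.15.1b-GCCcore-12.3.0.eb)

12 out of 137 required modules missing:

* SentencePiece/0.2.0-GCC-12.3.0 (SentencePiece-0.2.0-GCC-12.3.0.eb)
* libmad/0.15.1b-GCCcore-12.3.0 (libmad-0.15.1b-GCCcore-12.3.0.eb)
"""

modules = parse_missing_modules(report)
```

Here `modules` holds two entries even though the excerpt lists three lines, because libmad is counted once.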

@trz42 trz42 added the labels aarch64 (related to Arm 64-bit targets) and 2023.06-software.eessi.io (2023.06 version of software.eessi.io) on Aug 27, 2024

eessi-bot bot commented Aug 27, 2024

Instance eessi-bot-mc-aws is configured to build for:

  • architectures: x86_64/generic, x86_64/intel/haswell, x86_64/intel/skylake_avx512, x86_64/amd/zen2, x86_64/amd/zen3, aarch64/generic, aarch64/neoverse_n1, aarch64/neoverse_v1
  • repositories: eessi-hpc.org-2023.06-compat, eessi-hpc.org-2023.06-software, eessi.io-2023.06-software, eessi.io-2023.06-compat

Instance boegel-bot-deucalion is configured to build for:

  • architectures: aarch64/a64fx
  • repositories: eessi.io-2023.06-software


eessi-bot bot commented Aug 27, 2024

Instance eessi-bot-mc-azure is configured to build for:

  • architectures: x86_64/amd/zen4
  • repositories: eessi.io-2023.06-compat, eessi-hpc.org-2023.06-compat, eessi-hpc.org-2023.06-software, eessi.io-2023.06-software

@trz42
Collaborator Author

trz42 commented Aug 27, 2024

bot: build arch:aarch64/generic repo:eessi.io-2023.06-software


eessi-bot bot commented Aug 27, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)

Updates by the bot instance boegel-bot-deucalion (click for details)
  • account trz42 has NO permission to send commands to the bot


eessi-bot bot commented Aug 27, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build arch:aarch64/generic repo:eessi.io-2023.06-software from trz42

    • expanded format: build architecture:aarch64/generic repository:eessi.io-2023.06-software
  • handling command build architecture:aarch64/generic repository:eessi.io-2023.06-software resulted in:

    • no jobs were submitted
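The expansion shown above, and a plausible reason why this instance submits no jobs, can be sketched as follows. This is not the bot's actual implementation: `expand_command`, `would_submit`, and the alias table are hypothetical names, and the assumption that both filters must match the instance's configured architectures and repositories is inferred from the configuration comments earlier in this thread.

```python
# Short keyword -> long keyword, as seen in the bot's "expanded format" line
ALIASES = {"arch": "architecture", "repo": "repository"}


def expand_command(comment_line: str) -> dict:
    """Parse 'bot: build arch:X repo:Y' into a dict with expanded filter names."""
    prefix, _, rest = comment_line.partition(":")
    if prefix.strip() != "bot":
        raise ValueError("not a bot command")
    action, *filters = rest.split()
    parsed = {"action": action}
    for item in filters:
        key, _, value = item.partition(":")
        parsed[ALIASES.get(key, key)] = value
    return parsed


def would_submit(parsed: dict, architectures: list, repositories: list) -> bool:
    """Assumed rule: a job is submitted only if both filters match the instance."""
    return (parsed.get("architecture") in architectures
            and parsed.get("repository") in repositories)


cmd = expand_command("bot: build arch:aarch64/generic repo:eessi.io-2023.06-software")

# eessi-bot-mc-azure is configured for x86_64/amd/zen4 only
azure = would_submit(cmd, ["x86_64/amd/zen4"], ["eessi.io-2023.06-software"])
# eessi-bot-mc-aws lists aarch64/generic among its architectures
aws = would_submit(cmd, ["aarch64/generic", "aarch64/neoverse_n1"],
                   ["eessi.io-2023.06-software"])
```

Under that assumption, the Azure instance reports "no jobs were submitted" because aarch64/generic is not among its configured architectures, while the AWS instance picks the command up.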


eessi-bot bot commented Aug 27, 2024

New job on instance eessi-bot-mc-aws for architecture aarch64-generic for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.08/pr_688/17105

date job status comment
Aug 27 20:37:22 UTC 2024 submitted job id 17105 awaits release by job manager
Aug 27 20:38:12 UTC 2024 released job awaits launch by Slurm scheduler
Aug 27 20:39:16 UTC 2024 running job 17105 is running
Aug 27 21:47:17 UTC 2024 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-17105.out
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-aarch64-generic-1724792235.tar.gz
size: 142 MiB (149144297 bytes)
entries: 4815
modules under 2023.06/software/linux/aarch64/generic/modules/all
gperftools/2.12-GCCcore-12.3.0.lua
imageio/2.33.1-gfbf-2023a.lua
libmad/0.15.1b-GCCcore-12.3.0.lua
NLTK/3.8.1-foss-2023a.lua
parameterized/0.9.0-GCCcore-12.3.0.lua
Scalene/1.5.26-GCCcore-12.3.0.lua
scikit-image/0.22.0-foss-2023a.lua
SentencePiece/0.2.0-GCC-12.3.0.lua
SoX/14.4.2-GCCcore-12.3.0.lua
tensorboard/2.15.1-gfbf-2023a.lua
tqdm/4.66.1-GCCcore-12.3.0.lua
software under 2023.06/software/linux/aarch64/generic/software
gperftools/2.12-GCCcore-12.3.0
imageio/2.33.1-gfbf-2023a
libmad/0.15.1b-GCCcore-12.3.0
NLTK/3.8.1-foss-2023a
parameterized/0.9.0-GCCcore-12.3.0
Scalene/1.5.26-GCCcore-12.3.0
scikit-image/0.22.0-foss-2023a
SentencePiece/0.2.0-GCC-12.3.0
SoX/14.4.2-GCCcore-12.3.0
tensorboard/2.15.1-gfbf-2023a
tqdm/4.66.1-GCCcore-12.3.0
other under 2023.06/software/linux/aarch64/generic
2023.06/init/easybuild/eb_hooks.py
Aug 27 21:47:17 UTC 2024 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 18/18 test case(s) from 18 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-17105.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
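The build verdict in the report above appears to be a conjunction of simple text checks on the job output. A minimal sketch of that logic, assuming plain substring matching and using the hypothetical name `check_build_log` (the bot's real check script may differ), could be:

```python
# Messages that must NOT appear in the job output, taken from the report above
FORBIDDEN = ["ERROR:", "FAILED:", "required modules missing:"]
# Messages that MUST appear
REQUIRED = ["No missing installations", ".tar.gz created!"]


def check_build_log(text: str):
    """Return (ok, details); ok is True only if every single check passes."""
    details = []
    for message in FORBIDDEN:
        found = message in text
        label = "found" if found else "no"
        details.append((not found, f"{label} message matching {message}"))
    for message in REQUIRED:
        found = message in text
        label = "found" if found else "no"
        details.append((found, f"{label} message matching {message}"))
    ok = all(passed for passed, _ in details)
    return ok, details


# A log with no errors but without the "No missing installations" line
log = "building ...\n/tmp/example.tar.gz created!\n"
ok, details = check_build_log(log)
failed = [text for passed, text in details if not passed]
```

With this rule a single failed check is enough for a FAILURE verdict, which matches later jobs in this thread where only the "No missing installations" check fails.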

@trz42
Collaborator Author

trz42 commented Aug 28, 2024

Try to use bash from compat layer...

bot: build arch:aarch64/generic repo:eessi.io-2023.06-software


eessi-bot bot commented Aug 28, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build arch:aarch64/generic repo:eessi.io-2023.06-software from trz42
    • expanded format: build architecture:aarch64/generic repository:eessi.io-2023.06-software


eessi-bot bot commented Aug 28, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build arch:aarch64/generic repo:eessi.io-2023.06-software from trz42

    • expanded format: build architecture:aarch64/generic repository:eessi.io-2023.06-software
  • handling command build architecture:aarch64/generic repository:eessi.io-2023.06-software resulted in:

    • no jobs were submitted


eessi-bot bot commented Aug 28, 2024

688.diff:29: trailing whitespace.

error: cannot apply binary patch to 'scripts/2023.06/aarch64/bash' without full index line
error: scripts/2023.06/aarch64/bash: patch does not apply
error: cannot apply binary patch to 'scripts/2023.06/x86_64/bash' without full index line
error: scripts/2023.06/x86_64/bash: patch does not apply

Unable to download or merge changes between the source branch and the destination branch.
Tip: This can usually be resolved by syncing your branch and resolving any merge conflicts.

@trz42
Collaborator Author

trz42 commented Aug 28, 2024

Second attempt to use bash from compat layer (now obtaining it from CVMFS)...

bot: build arch:aarch64/generic repo:eessi.io-2023.06-software


eessi-bot bot commented Aug 28, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)


eessi-bot bot commented Aug 28, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)


eessi-bot bot commented Aug 28, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build arch:aarch64/generic repo:eessi.io-2023.06-software from trz42

    • expanded format: build architecture:aarch64/generic repository:eessi.io-2023.06-software
  • handling command build architecture:aarch64/generic repository:eessi.io-2023.06-software resulted in:

    • no jobs were submitted


eessi-bot bot commented Aug 28, 2024

New job on instance eessi-bot-mc-aws for architecture aarch64-generic for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.08/pr_688/17163

date job status comment
Aug 28 22:20:36 UTC 2024 submitted job id 17163 awaits release by job manager
Aug 28 22:21:22 UTC 2024 released job awaits launch by Slurm scheduler
Aug 28 22:22:23 UTC 2024 running job 17163 is running
Aug 28 22:34:35 UTC 2024 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-17163.out
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
No artefacts were created or found.
Aug 28 22:34:35 UTC 2024 test result
🤷 UNKNOWN (click triangle for detailed information)
  • Job test file _bot_job17163.test does not exist in job directory or reading it failed.

@trz42
Collaborator Author

trz42 commented Aug 28, 2024

Next...

bot: build arch:aarch64/generic repo:eessi.io-2023.06-software


eessi-bot bot commented Aug 28, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)


eessi-bot bot commented Aug 28, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build arch:aarch64/generic repo:eessi.io-2023.06-software from trz42

    • expanded format: build architecture:aarch64/generic repository:eessi.io-2023.06-software
  • handling command build architecture:aarch64/generic repository:eessi.io-2023.06-software resulted in:

    • no jobs were submitted


eessi-bot bot commented Aug 28, 2024

New job on instance eessi-bot-mc-aws for architecture aarch64-generic for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.08/pr_688/17164

date job status comment
Aug 28 22:46:21 UTC 2024 submitted job id 17164 awaits release by job manager
Aug 28 22:46:39 UTC 2024 released job awaits launch by Slurm scheduler
Aug 28 22:47:40 UTC 2024 running job 17164 is running
Aug 28 22:49:42 UTC 2024 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-17164.out
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
No artefacts were created or found.
Aug 28 22:49:42 UTC 2024 test result
🤷 UNKNOWN (click triangle for detailed information)
  • Job test file _bot_job17164.test does not exist in job directory or reading it failed.

@trz42
Collaborator Author

trz42 commented Aug 28, 2024

One more...

bot: build arch:aarch64/generic repo:eessi.io-2023.06-software


eessi-bot bot commented Aug 28, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)


eessi-bot bot commented Aug 28, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build arch:aarch64/generic repo:eessi.io-2023.06-software from trz42

    • expanded format: build architecture:aarch64/generic repository:eessi.io-2023.06-software
  • handling command build architecture:aarch64/generic repository:eessi.io-2023.06-software resulted in:

    • no jobs were submitted


eessi-bot bot commented Aug 28, 2024

New job on instance eessi-bot-mc-aws for architecture aarch64-generic for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.08/pr_688/17165

date job status comment
Aug 28 22:49:51 UTC 2024 submitted job id 17165 awaits release by job manager
Aug 28 22:50:45 UTC 2024 released job awaits launch by Slurm scheduler
Aug 28 22:51:47 UTC 2024 running job 17165 is running
Aug 28 22:55:51 UTC 2024 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-17165.out
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
No artefacts were created or found.
Aug 28 22:55:51 UTC 2024 test result
🤷 UNKNOWN (click triangle for detailed information)
  • Job test file _bot_job17165.test does not exist in job directory or reading it failed.

@trz42
Collaborator Author

trz42 commented Aug 28, 2024

+1

bot: build arch:aarch64/generic repo:eessi.io-2023.06-software


eessi-bot bot commented Aug 28, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)


eessi-bot bot commented Aug 28, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build arch:aarch64/generic repo:eessi.io-2023.06-software from trz42

    • expanded format: build architecture:aarch64/generic repository:eessi.io-2023.06-software
  • handling command build architecture:aarch64/generic repository:eessi.io-2023.06-software resulted in:

    • no jobs were submitted


eessi-bot bot commented Aug 28, 2024

New job on instance eessi-bot-mc-aws for architecture aarch64-generic for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.08/pr_688/17166

date job status comment
Aug 28 22:56:39 UTC 2024 submitted job id 17166 awaits release by job manager
Aug 28 22:56:53 UTC 2024 released job awaits launch by Slurm scheduler
Aug 28 22:57:55 UTC 2024 running job 17166 is running
Aug 29 00:07:19 UTC 2024 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-17166.out
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-aarch64-generic-1724887036.tar.gz
size: 142 MiB (149143012 bytes)
entries: 4815
modules under 2023.06/software/linux/aarch64/generic/modules/all
gperftools/2.12-GCCcore-12.3.0.lua
imageio/2.33.1-gfbf-2023a.lua
libmad/0.15.1b-GCCcore-12.3.0.lua
NLTK/3.8.1-foss-2023a.lua
parameterized/0.9.0-GCCcore-12.3.0.lua
Scalene/1.5.26-GCCcore-12.3.0.lua
scikit-image/0.22.0-foss-2023a.lua
SentencePiece/0.2.0-GCC-12.3.0.lua
SoX/14.4.2-GCCcore-12.3.0.lua
tensorboard/2.15.1-gfbf-2023a.lua
tqdm/4.66.1-GCCcore-12.3.0.lua
software under 2023.06/software/linux/aarch64/generic/software
gperftools/2.12-GCCcore-12.3.0
imageio/2.33.1-gfbf-2023a
libmad/0.15.1b-GCCcore-12.3.0
NLTK/3.8.1-foss-2023a
parameterized/0.9.0-GCCcore-12.3.0
Scalene/1.5.26-GCCcore-12.3.0
scikit-image/0.22.0-foss-2023a
SentencePiece/0.2.0-GCC-12.3.0
SoX/14.4.2-GCCcore-12.3.0
tensorboard/2.15.1-gfbf-2023a
tqdm/4.66.1-GCCcore-12.3.0
other under 2023.06/software/linux/aarch64/generic
2023.06/init/easybuild/eb_hooks.py
Aug 29 00:07:19 UTC 2024 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 18/18 test case(s) from 18 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-17166.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@bedroge
Collaborator

bedroge commented Aug 29, 2024

      /bin/sh: /lib/aarch64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/gperftools/2.12-GCCcore-12.3.0/lib64/libtcmalloc_minimal.so)

@trz42 looks like you need to do the same for /bin/sh?
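The error above is presumably a side effect of how `LD_PRELOAD` works: the variable is inherited by every child process, so the host's `/bin/sh` also tries to load `libtcmalloc_minimal.so`, which needs a newer glibc than the host provides. The sketch below only illustrates that mechanism. It strips the variable for host binaries, which is a different technique from the one tried in this PR (using the shell from the compat layer); the helper names and the preload path are hypothetical.

```python
import subprocess


def env_without_preload(env: dict) -> dict:
    """Return a copy of env with LD_PRELOAD removed; the input is left untouched."""
    clean = dict(env)
    clean.pop("LD_PRELOAD", None)
    return clean


def run_host_shell(command: str, env: dict):
    """Run a command through the host's /bin/sh without the preloaded library."""
    return subprocess.run(
        ["/bin/sh", "-c", command],
        env=env_without_preload(env),
        capture_output=True,
        text=True,
        check=False,
    )


# Hypothetical build environment in which a hook has set LD_PRELOAD
build_env = {
    "PATH": "/usr/bin:/bin",
    "LD_PRELOAD": "/hypothetical/path/lib64/libtcmalloc_minimal.so",
}
host_env = env_without_preload(build_env)
```

`host_env` no longer carries `LD_PRELOAD`, while `build_env` still does, so processes that need the preloaded library keep it and host binaries do not.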

@trz42
Collaborator Author

trz42 commented Aug 29, 2024

Also replace /bin/sh with sh (which usually symlinks to bash)...

bot: build arch:aarch64/generic repo:eessi.io-2023.06-software


eessi-bot bot commented Aug 29, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)


eessi-bot bot commented Aug 29, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build arch:aarch64/generic repo:eessi.io-2023.06-software from trz42

    • expanded format: build architecture:aarch64/generic repository:eessi.io-2023.06-software
  • handling command build architecture:aarch64/generic repository:eessi.io-2023.06-software resulted in:

    • no jobs were submitted


eessi-bot bot commented Aug 29, 2024

New job on instance eessi-bot-mc-aws for architecture aarch64-generic for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.08/pr_688/17215

date job status comment
Aug 29 15:19:29 UTC 2024 submitted job id 17215 awaits release by job manager
Aug 29 15:19:34 UTC 2024 released job awaits launch by Slurm scheduler
Aug 29 15:25:37 UTC 2024 running job 17215 is running
Aug 29 17:32:22 UTC 2024 finished
🤷 UNKNOWN (click triangle for detailed information)
  • Job results file _bot_job17215.result does not exist in job directory or reading it failed.
  • No artefacts were found/reported.
Aug 29 17:32:22 UTC 2024 test result
🤷 UNKNOWN (click triangle for detailed information)
  • Job test file _bot_job17215.test does not exist in job directory or reading it failed.

@trz42
Collaborator Author

trz42 commented Aug 29, 2024

Revert change on /bin/sh and build libmad first...

bot: build arch:aarch64/generic repo:eessi.io-2023.06-software


eessi-bot bot commented Aug 29, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)


eessi-bot bot commented Aug 29, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build arch:aarch64/generic repo:eessi.io-2023.06-software from trz42

    • expanded format: build architecture:aarch64/generic repository:eessi.io-2023.06-software
  • handling command build architecture:aarch64/generic repository:eessi.io-2023.06-software resulted in:

    • no jobs were submitted


eessi-bot bot commented Aug 29, 2024

New job on instance eessi-bot-mc-aws for architecture aarch64-generic for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.08/pr_688/17216

date job status comment
Aug 29 18:43:35 UTC 2024 submitted job id 17216 awaits release by job manager
Aug 29 18:44:31 UTC 2024 released job awaits launch by Slurm scheduler
Aug 29 18:50:33 UTC 2024 running job 17216 is running
Aug 29 20:00:01 UTC 2024 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-17216.out
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-aarch64-generic-1724958582.tar.gz
size: 142 MiB (149142604 bytes)
entries: 4815
modules under 2023.06/software/linux/aarch64/generic/modules/all
gperftools/2.12-GCCcore-12.3.0.lua
imageio/2.33.1-gfbf-2023a.lua
libmad/0.15.1b-GCCcore-12.3.0.lua
NLTK/3.8.1-foss-2023a.lua
parameterized/0.9.0-GCCcore-12.3.0.lua
Scalene/1.5.26-GCCcore-12.3.0.lua
scikit-image/0.22.0-foss-2023a.lua
SentencePiece/0.2.0-GCC-12.3.0.lua
SoX/14.4.2-GCCcore-12.3.0.lua
tensorboard/2.15.1-gfbf-2023a.lua
tqdm/4.66.1-GCCcore-12.3.0.lua
software under 2023.06/software/linux/aarch64/generic/software
gperftools/2.12-GCCcore-12.3.0
imageio/2.33.1-gfbf-2023a
libmad/0.15.1b-GCCcore-12.3.0
NLTK/3.8.1-foss-2023a
parameterized/0.9.0-GCCcore-12.3.0
Scalene/1.5.26-GCCcore-12.3.0
scikit-image/0.22.0-foss-2023a
SentencePiece/0.2.0-GCC-12.3.0
SoX/14.4.2-GCCcore-12.3.0
tensorboard/2.15.1-gfbf-2023a
tqdm/4.66.1-GCCcore-12.3.0
other under 2023.06/software/linux/aarch64/generic
2023.06/init/easybuild/eb_hooks.py
Aug 29 20:00:01 UTC 2024 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 18/18 test case(s) from 18 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-17216.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@trz42
Collaborator Author

trz42 commented Aug 30, 2024

The last job failed while building the wheel for torchtext. The build runs ninja, which invokes /bin/sh. It looks like ninja has this path hard-coded instead of using sh from the compat layer. See below:

$ strings /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/Ninja/1.11.1-GCCcore-12.3.0/bin/ninja | grep /bin
#!/usr/bin/env python
/bin/sh

We could try to figure out how to use /bin/sh (and maybe /usr/bin/env) from the compat layer. For that we have to look into how ninja was built, change the build accordingly, and rebuild it. We could probably test whether such a change solves the problem by installing a modified ninja into some other location using EESSI-extend/2023.06-easybuild.
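The `strings ... | grep /bin` check above can be repeated for other binaries with a few lines of Python. This is a rough approximation of `strings` (runs of at least four printable ASCII bytes), not a replacement for it; the function names and the sample bytes are made up for illustration.

```python
import re


def printable_strings(data: bytes, min_length: int = 4):
    """Yield runs of printable ASCII of at least min_length bytes, similar to `strings`."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    for match in pattern.finditer(data):
        yield match.group().decode("ascii")


def hardcoded_paths(data: bytes, needle: str = "/bin"):
    """Return the printable strings that contain needle, like `strings | grep /bin`."""
    return [s for s in printable_strings(data) if needle in s]


# Synthetic stand-in for the contents of a binary
sample = b"\x7fELF\x00\x01#!/usr/bin/env python\x00\x02\x03/bin/sh\x00ab\x00"
paths = hardcoded_paths(sample)
```

To inspect a real file, read it in binary mode, for example `hardcoded_paths(open(path, "rb").read())`, where `path` points at the binary in question.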

Labels
2023.06-software.eessi.io (2023.06 version of software.eessi.io)
aarch64 (related to Arm 64-bit targets)

4 participants