mpi4py bug #417
Comments
Well thanks for the detailed reproducer on this! Will take a look and see if we can get this sorted out for you.
@aowen87 I've a few more questions for you on this:
I'm running this from the login node, and the commands I'm using are the exact commands shown above (no srun/mpirun/etc., just maestro).
Ok, @aowen87, I finally made some headway here. Part of the problem appears to be that pgen calling MPI ends up setting a bunch of MPI-related environment variables, which confuses the batch job in an unintuitive way: the default Slurm configuration is to treat missing …
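A minimal diagnostic sketch along these lines (not from the thread) could confirm which variables leak; the assumption here is that the mpi4py import is what mutates the environment, and the PMI/PMIX/OMPI_/SLURM_ prefixes are guesses at what a typical MPI stack sets:

```python
# Diagnostic sketch (assumption: importing mpi4py is what mutates the
# environment): diff os.environ around the import to see which MPI/PMI/Slurm
# related variables appear. sbatch exports the submitting environment by
# default (--export=ALL), so anything added here is inherited by the batch job.
import os

before = set(os.environ)

from mpi4py import MPI  # noqa: E402,F401 -- MPI_Init runs on import

added = {k: os.environ[k] for k in os.environ if k not in before}
for key, value in sorted(added.items()):
    if key.startswith(("PMI", "PMIX", "OMPI_", "SLURM_")):
        print(f"{key}={value}")
```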
Great info! Thanks for digging into this!
Description
Hello,
I'm working on an LLNL project that uses maestro to manage ML workflows, and we've recently encountered an odd bug: if the parameter generation (pgen) script imports mpi4py, directly or indirectly, the submitted job hangs indefinitely.
Reproducer
I've included files to reproduce the issue below: one YAML file, one Python script that maestro will launch with srun, and three parameter generation files. One of the parameter generation files works fine because it doesn't import mpi4py; the other two import mpi4py directly or indirectly and cause the job to hang.
Here are commands to reproduce each scenario:
This works:
maestro run -p param_gen.py mpi_bug.yaml
This causes the job to hang:
maestro run -p mpi_param_gen.py mpi_bug.yaml
This causes the job to hang:
maestro run -p kosh_param_gen.py mpi_bug.yaml
Files to reproduce:
mpi_bug.yaml
hello_world.py
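The original hello_world.py isn't included in this extract; a minimal mpi4py script along these lines would fit the description of a Python script that maestro launches with srun (the rank/size printout is an assumption, not from the report):

```python
# Hypothetical stand-in for hello_world.py: a minimal mpi4py script for each
# srun task to run. The actual file from the report is not shown here.
from mpi4py import MPI


def main():
    comm = MPI.COMM_WORLD
    print(f"Hello from rank {comm.Get_rank()} of {comm.Get_size()}")


if __name__ == "__main__":
    main()
```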
param_gen.py
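The working pgen file is also not shown; a typical maestro custom parameter generator that avoids mpi4py looks roughly like this (the parameter name TRIAL and its values are placeholders, not from the report):

```python
# Hypothetical stand-in for param_gen.py: a standard Maestro custom parameter
# generator with no mpi4py anywhere in its import chain.
from maestrowf.datastructures.core import ParameterGenerator


def get_custom_generator(env, **kwargs):
    p_gen = ParameterGenerator()
    # Placeholder parameter; the real spec's parameters are not shown in the report.
    p_gen.add_parameter("TRIAL", [0, 1, 2], "TRIAL.%%")
    return p_gen
```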
mpi_param_gen.py
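Per the report, this variant differs only in that it imports mpi4py directly; a sketch under that assumption:

```python
# Hypothetical stand-in for mpi_param_gen.py: identical to param_gen.py except
# for the module-level mpi4py import, which initializes MPI in the process
# that later submits the batch job.
from mpi4py import MPI  # noqa: F401 -- the import alone triggers MPI_Init
from maestrowf.datastructures.core import ParameterGenerator


def get_custom_generator(env, **kwargs):
    p_gen = ParameterGenerator()
    p_gen.add_parameter("TRIAL", [0, 1, 2], "TRIAL.%%")
    return p_gen
```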
kosh_param_gen.py
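And the kosh variant, which per the report pulls mpi4py in indirectly through kosh's import chain; again a sketch, not the actual file:

```python
# Hypothetical stand-in for kosh_param_gen.py: no direct mpi4py import, but
# importing kosh reportedly drags mpi4py in transitively, which is enough to
# reproduce the hang.
import kosh  # noqa: F401 -- indirect mpi4py import, per the report
from maestrowf.datastructures.core import ParameterGenerator


def get_custom_generator(env, **kwargs):
    p_gen = ParameterGenerator()
    p_gen.add_parameter("TRIAL", [0, 1, 2], "TRIAL.%%")
    return p_gen
```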