
64 safely checking for mpi4py import #126

Open: wants to merge 7 commits into develop

Conversation

jeinstei (Member)

Duplicate of previous PR but to develop instead of main

@jeinstei (Member, Author)

@cchall Any chance you or someone else could try this in the salloc? Generally speaking, I haven't seen the same issues since making this change. I've been trying to reproduce the original report ("As far as I can tell this doesn't change anything for just running in salloc. If mpi4py is installed there then you will still run into the same failure when MPI.Init() is run."), but I haven't been able to make it NOT work now, using the config_chwirut.yml example.
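
For context, a minimal sketch of the kind of guarded mpi4py check being discussed; the function name, return shape, and error handling here are illustrative assumptions, not necessarily what this PR implements:

import importlib.util

def get_mpi_environment():
    """Return basic MPI info, or None when MPI is unavailable or unusable."""
    # Check importability without triggering MPI_Init
    if importlib.util.find_spec("mpi4py") is None:
        return None

    import mpi4py
    mpi4py.rc.initialize = False  # defer MPI_Init; must run before mpi4py.MPI is first imported
    from mpi4py import MPI

    try:
        if not MPI.Is_initialized():
            MPI.Init()
    except MPI.Exception:
        # mpi4py is installed but MPI cannot start in this launch environment
        # (whether an Init failure raises or aborts depends on the MPI implementation)
        return None

    comm = MPI.COMM_WORLD
    if comm.Get_size() <= 1:
        return None  # single rank: treat as a serial run

    return {'comm': comm, 'size': comm.Get_size(), 'rank': comm.Get_rank()}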

@cchall (Member) commented Jul 18, 2022

@jeinstei What command(s) are you using to execute the example within salloc?

@jeinstei (Member, Author)

As far as I can tell, just rsopt optimize configuration config_chwirut.yml. Previously this threw an error.

@cchall (Member) commented Jul 18, 2022

Something doesn't seem quite right.
Is MPI defined here:

if not MPI.COMM_WORLD.Get_size() - 1:

I would think this test would always result in an error there, at least.
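
For context, a minimal sketch of the concern, assuming the mpi4py import in that module is guarded (the surrounding rsopt code is not shown in this thread, so the fallback here is illustrative):

try:
    from mpi4py import MPI
except ImportError:
    MPI = None

# If the guard takes the except branch, the quoted check
#   if not MPI.COMM_WORLD.Get_size() - 1:
# raises an AttributeError instead of detecting a single-rank run.
# A guarded version would need to test MPI first:
if MPI is not None and MPI.COMM_WORLD.Get_size() > 1:
    pass  # parallel path
else:
    pass  # serial (single-rank or no-MPI) path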

@jeinstei (Member, Author)

Yup, that is most definitely a bug; it shouldn't have worked at all. I tested before the mpi.py change and everything was working. Let me make sure this doesn't break anything else.

@cchall (Member) commented Jul 18, 2022

I'm seeing at least one discrepancy between what is on this branch and what is installed in the rsopt_software_jec environment:
ls /global/common/software/m3687/rsopt_software_jec/lib/python3.8/site-packages/rsopt/run.py

def startup_sequence(config: str) -> Configuration:
    """
    Safely initialize rsopt accounting for the local MPI configuration (if any)
    :param config: Path to configuration file that will be used
    :return: (rsopt.configuration.configuration.Configuration) object
    """
    config_yaml = parse.read_configuration_file(config)
    _config = parse.parse_yaml_configuration(config_yaml)
    mpi_environment = None  # mpi.get_mpi_environment() <======================
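
For comparison, the branch version of that line would presumably restore the call that is commented out above (inferred from the inline comment, not from the PR diff itself):

    mpi_environment = mpi.get_mpi_environment()  # as presumably intended on this branch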

@jeinstei (Member, Author)

You are correct -- it looks like I need to pull main back in.

Review threads on rsopt/util.py (resolved; one marked outdated)
@cchall (Member) commented Jul 20, 2022

Looks like this check is falsely reporting no MPI even when it should find an MPI comm, since an error is thrown during the subprocess check:

Traceback (most recent call last):
  File "/global/common/software/m3687/rsopt_develop/lib/python3.8/site-packages/rsopt/__main__.py", line 2, in <module>
    MPI.Init()
  File "mpi4py/MPI/MPI.pyx", line 114, in mpi4py.MPI.Init
mpi4py.MPI.Exception: Other MPI error, error stack:
MPI_Init(141): Cannot call MPI_INIT or MPI_INIT_THREAD more than once

The evaluation at

if pp.returncode != 0:

fails, and subsequently utils.return_unused_node returns all nodes, causing the shifter model evaluation to run on the first, already occupied node, at which point everything hangs waiting for that node to become available.
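
For reference, one plausible reading of the double initialization in the traceback above, given that importing mpi4py.MPI runs MPI_Init automatically by default (this is a reconstruction, not the actual contents of rsopt/__main__.py):

from mpi4py import MPI  # importing mpi4py.MPI calls MPI_Init automatically by default
MPI.Init()              # explicit call is then a second init:
                        # "Cannot call MPI_INIT or MPI_INIT_THREAD more than once"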

@jeinstei (Member, Author) commented Jul 20, 2022

I don't understand why it is trying to init MPI more than once -- I didn't see that when I was testing. Do you have a record of the commands you ran as well? I have a feeling we might be running different commands or environments at this point, as I never saw a double init while testing yesterday.

@cchall (Member) commented Jul 20, 2022

I've used:
srun -N 1 rsopt optimize configuration ...
and
srun --ntasks=2 -N 1 rsopt optimize configuration ...
