
MPI-related test failure using mpich 4.2.0, gcc 13.2.0 #3015

Open
WardF opened this issue Sep 6, 2024 · 2 comments

Comments

WardF (Member) commented Sep 6, 2024

Update: For clarity, the tests pass when using mpich 4.0, gcc 11.4.0.


I'm observing a failure when building with mpicc and running nc_test4/run_par_test.sh.

This issue occurs when mpicc wraps gcc 13.x, but does not occur on systems where it wraps gcc 11.x. This is most easily observed on my end using Ubuntu 22.04 vs. 24.04. I've created a couple of docker images that can be used to observe this. They can be run as follows:

$ docker run --rm -it docker.unidata.ucar.edu/h5par:2204 

and

$ docker run --rm -it docker.unidata.ucar.edu/h5par:2404

You can enter the environment by appending bash to the end of either docker command.
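For example, a shell in the 24.04 image can be started with:

$ docker run --rm -it docker.unidata.ucar.edu/h5par:2404 bash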

It seems that the issue is related to the difference in mpicc (i.e. underlying gcc) versions, but I'm still trying to sort out what exactly is going on. Any suggestions would be appreciated.
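(For reference, the toolchain inside each image can be confirmed with:

$ mpicc --version   # reports the underlying compiler, gcc 11.x vs 13.x
$ mpichversion      # reports the MPICH release, 4.0 vs 4.2.0

assuming mpich's mpichversion utility is installed in the images.)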

The error specifically is as follows:

153: Testing simple parallel I/O with 16 processors...
153: 
153: *** Testing more advanced parallel access.
153: *** Testing parallel IO for raw-data with MPI-IO (driver)...
153: *** Testing more advanced parallel access.
153: *** Testing parallel IO for raw-data with MPI-IO (driver)...
153: *** Testing more advanced parallel access.
153: *** Testing parallel IO for raw-data with MPI-IO (driver)...Sorry! Unexpected result, /root/hdf5-1.14.3/netcdf-c/nc_test4/tst_parallel3.c, line: 284
153: Sorry! Unexpected result, /root/hdf5-1.14.3/netcdf-c/nc_test4/tst_parallel3.c, line: 91
153: 
153: *** Testing more advanced parallel access.
153: *** Testing parallel IO for raw-data with MPI-IO (driver)...
153: *** Testing more advanced parallel access.
153: *** Testing parallel IO for raw-data with MPI-IO (driver)...Sorry! Unexpected result, /root/hdf5-1.14.3/netcdf-c/nc_test4/tst_parallel3.c, line: 284
153: Sorry! Unexpected result, /root/hdf5-1.14.3/netcdf-c/nc_test4/tst_parallel3.c, line: 91
153:
(...)
WardF (Member, Author) commented Sep 6, 2024

@edwardhartnett @jhendersonHDF if anything leaps out at you, feel free to chime in, it might save some time as I dig through this! And if not, no worries XD. Thanks!

WardF added this to the 4.9.3 milestone Sep 6, 2024

WardF (Member, Author) commented Sep 9, 2024

Additional notes:

On Ubuntu 24.04, installing libhdf5-mpi-dev installs openmpi and related tools. This version of libhdf5 works just fine, although the nc_test4/run_par_test.sh script requires that --oversubscribe be passed to mpiexec -n 16 ./tst_parallel3; otherwise, mpiexec complains if the machine has fewer than 16 cores/processors/what-have-you.
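That is, the working OpenMPI invocation looks like:

$ mpiexec --oversubscribe -n 16 ./tst_parallel3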

Using mpich and a custom-built libhdf5, we cannot oversubscribe. That turns out not to matter, though: invoking mpiexec -n 2 ./tst_parallel3 produces the same failure as passing 4, 8, or 16. Running tst_parallel3 directly works, but of course that bypasses MPI entirely.
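So a minimal reproduction under mpich, assuming you're in the nc_test4 directory of the build tree, is just:

$ mpiexec -n 2 ./tst_parallel3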

Installing libhdf5-mpich-dev shows the same behavior as the custom-built libhdf5. This suggests the problem is specific to mpich rather than to MPI in general.
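(For anyone reproducing on stock Ubuntu 24.04, the two packaged variants compared above are, roughly:

$ apt install libhdf5-mpi-dev    # pulls in openmpi; tests pass, given --oversubscribe
$ apt install libhdf5-mpich-dev  # pulls in mpich; tests fail as described

with the exact dependency sets being whatever the distro resolves.)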
