My application sporadically deadlocks in mpi_barrier calls. It seems to happen when the network is under very heavy load and/or the machines are being oversubscribed. (I don't have any control over that.) The application runs OpenMPI 4.1.4 on SuSE Linux 12. My admins attached a debugger and captured a backtrace for every running process; the result is:
PID 62880:
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00002aead70f655d in poll () from /lib64/libc.so.6
#0 0x00002aead70f655d in poll () from /lib64/libc.so.6
#1 0x00002aeae04d504e in poll_dispatch (base=0x2fd2ba0, tv=0x12) at ../../../../.././openmpi-4.1.4/opal/mca/event/libevent2022/libevent/poll.c:165
#2 0x00002aeae04c9881 in opal_libevent2022_event_base_loop (base=0x2fd2ba0, flags=18) at ../../../../.././openmpi-4.1.4/opal/mca/event/libevent2022/libevent/event.c:1630
#3 0x00002aeae047254e in opal_progress () from /tools/openmpi/4.1.4/lib/libopen-pal.so.40
#4 0x00002aeaf0818d74 in mca_pml_ob1_send () from /tools/openmpi/4.1.4/lib/openmpi/mca_pml_ob1.so
#5 0x00002aead6b51f9d in ompi_coll_base_barrier_intra_recursivedoubling () from /tools/openmpi/4.1.4/lib/libmpi.so.40
#6 0x00002aead6b03e11 in PMPI_Barrier () from /tools/openmpi/4.1.4/lib/libmpi.so.40
#7 0x00002aead6877f43 in pmpi_barrier__ () from /tools/openmpi/4.1.4/lib/libmpi_mpifh.so.40
#8 0x00002aead6400be2 in mpi_barrier_f08_ () from /tools/openmpi/4.1.4/lib/libmpi_usempif08.so.40
#9 0x00000000005bb177 in (same location in application code)
All other PIDs:
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00002afe6e6c355d in poll () from /lib64/libc.so.6
#0 0x00002afe6e6c355d in poll () from /lib64/libc.so.6
#1 0x00002afe77aa204e in poll_dispatch (base=0x119bbe0, tv=0x9) at ../../../../.././openmpi-4.1.4/opal/mca/event/libevent2022/libevent/poll.c:165
#2 0x00002afe77a96881 in opal_libevent2022_event_base_loop (base=0x119bbe0, flags=9) at ../../../../.././openmpi-4.1.4/opal/mca/event/libevent2022/libevent/event.c:1630
#3 0x00002afe77a3f54e in opal_progress () from /tools/openmpi/4.1.4/lib/libopen-pal.so.40
#4 0x00002afe6e0ba42b in ompi_request_default_wait () from /tools/openmpi/4.1.4/lib/libmpi.so.40
#5 0x00002afe6e11ef0e in ompi_coll_base_barrier_intra_recursivedoubling () from /tools/openmpi/4.1.4/lib/libmpi.so.40
#6 0x00002afe6e0d0e11 in PMPI_Barrier () from /tools/openmpi/4.1.4/lib/libmpi.so.40
#7 0x00002afe6de44f43 in pmpi_barrier__ () from /tools/openmpi/4.1.4/lib/libmpi_mpifh.so.40
#8 0x00002afe6d9cdbe2 in mpi_barrier_f08_ () from /tools/openmpi/4.1.4/lib/libmpi_usempif08.so.40
#9 0x00000000005bb177 in (same location in application code)
What I see: PID 62880 is stuck in mca_pml_ob1_send; all the other ranks are in ompi_request_default_wait, inside the same recursive-doubling barrier.
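For context on why one stuck send would wedge everyone: recursive doubling pairs each rank with a partner at distance 1, 2, 4, ... in successive rounds, and no rank can finish a round until its partner shows up. A rough sketch of the idea (power-of-two communicator sizes only; the real ompi_coll_base routine also handles the non-power-of-two case, so this is illustrative, not the actual implementation):

```c
/* Illustrative sketch of a recursive-doubling barrier, power-of-two
 * communicator sizes only; the real ompi_coll_base routine also has
 * extra pre/post steps for non-power-of-two sizes. */
#include <mpi.h>

static void barrier_recursive_doubling(MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    for (int mask = 1; mask < size; mask <<= 1) {
        int partner = rank ^ mask;               /* exchange partner for this round */
        MPI_Request rreq;
        MPI_Irecv(NULL, 0, MPI_BYTE, partner, 0, comm, &rreq);
        MPI_Send(NULL, 0, MPI_BYTE, partner, 0, comm);   /* ~ the mca_pml_ob1_send frame */
        MPI_Wait(&rreq, MPI_STATUS_IGNORE);               /* ~ ompi_request_default_wait */
    }
}
```

Under that structure, one rank wedged in the blocking send while everyone else sits in the wait on the matching receive looks consistent with a single message being lost or badly delayed on the loaded network.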
This problem only occurs sporadically. I chewed through ~10,000 core-hours this weekend trying to reproduce it and failed, likely because the system was less loaded over the weekend. The jobs are run with -map-by socket --bind-to socket --rank-by core --mca btl_tcp_if_include 10.216.0.0/16 in order to force all traffic over a single interface.
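Since I can't reproduce it on demand, one thing I'm considering is wrapping the barrier in a non-blocking watchdog so a hang at least gets logged instead of sitting silently until the job is killed. A rough sketch (MPI_Ibarrier is standard MPI-3; the 300-second threshold and the message text are arbitrary placeholders):

```c
/* Sketch only: MPI_Ibarrier plus a polling loop, so a wedged barrier
 * gets logged instead of hanging silently.  The 300 s threshold and
 * the message text are arbitrary placeholders. */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

static void barrier_with_watchdog(MPI_Comm comm)
{
    int rank, done = 0, warned = 0;
    MPI_Request req;

    MPI_Comm_rank(comm, &rank);
    double start = MPI_Wtime();
    MPI_Ibarrier(comm, &req);

    while (!done) {
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);
        if (!done) {
            if (!warned && MPI_Wtime() - start > 300.0) {
                fprintf(stderr, "rank %d: barrier stuck for >300 s\n", rank);
                warned = 1;
            }
            usleep(1000);   /* back off so the poll loop doesn't burn a core */
        }
    }
}
```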
Also, puzzlingly, I see the following printed to stderr:
[hpap14n4:08897] mca_base_component_repository_open: unable to open mca_fs_lustre: liblustreapi.so: cannot open shared object file: No such file or directory (ignored)
[hpap14n4:08911] mca_base_component_repository_open: unable to open mca_fs_lustre: liblustreapi.so: cannot open shared object file: No such file or directory (ignored)
[hpap14n4:08913] mca_base_component_repository_open: unable to open mca_fs_lustre: liblustreapi.so: cannot open shared object file: No such file or directory (ignored)
[hpap14n4:08929] mca_base_component_repository_open: unable to open mca_fs_lustre: liblustreapi.so: cannot open shared object file: No such file or directory (ignored)
This is odd because liblustreapi.so should resolve reliably to /usr/lib64/liblustreapi.so, which is installed locally on each machine, so nothing network-mounted should be involved.
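If it helps narrow that down, I can run a trivial dlopen check under the same job environment as the MPI processes; the library name below is taken verbatim from the error message (compile with -ldl):

```c
/* Trivial check: does "liblustreapi.so" (the exact name in the error
 * message) resolve under the same environment the job runs in? */
#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    void *handle = dlopen("liblustreapi.so", RTLD_NOW);
    if (handle == NULL) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }
    puts("liblustreapi.so resolved");
    dlclose(handle);
    return 0;
}
```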
Does anyone have any guesses as to what might be going on, or how I might mitigate these kinds of failures?