Precice test fails on perlmutter #50

Open

wspear opened this issue Dec 5, 2022 · 8 comments
@wspear
Collaborator

wspear commented Dec 5, 2022

@MakisH @fsimonis

The precice test defined here: https://github.com/E4S-Project/testsuite/tree/master/validation_tests/precice

Fails on Perlmutter for this variant, installed with E4S 22.11:

-- linux-sles15-zen3 / [email protected] -------------------------------
[email protected]~ipo+mpi+petsc~python+shared build_system=cmake build_type=RelWithDebInfo

With the following console output:

DUMMY: Running solver dummy with preCICE config file "precice-config.xml", participant name "SolverOne", and mesh name "MeshOne".
preCICE: This is preCICE version 2.5.0
preCICE: Revision info: no-info [git failed to run]
preCICE: Build type: Release (without debug log)
preCICE: Configuring preCICE with configuration "precice-config.xml"
preCICE: I am participant "SolverOne"
preCICE: Setting up primary communication to coupling partner/s
MPICH ERROR [Rank 0] [job id ] [Mon Nov 21 12:25:26 2022] [nid001032] - Abort(1616271) (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(171).......:
MPID_Init(495)..............:
MPIDI_OFI_mpi_init_hook(816):
create_endpoint(1353).......: OFI EP enable failed (ofi_init.c:1353:create_endpoint:Address already in use)

DUMMY: Running solver dummy with preCICE config file "precice-config.xml", participant name "SolverTwo", and mesh name "MeshTwo".
aborting job:
Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(171).......:
MPID_Init(495)..............:
MPIDI_OFI_mpi_init_hook(816):
create_endpoint(1353).......: OFI EP enable failed (ofi_init.c:1353:create_endpoint:Address already in use)
@fsimonis

fsimonis commented Dec 6, 2022

Are there any specifics of the spec used to build preCICE?

Also, as a note: the tilde in the variant doesn't render well in Markdown. I suggest wrapping it in a code block.

@wspear
Collaborator Author

wspear commented Dec 6, 2022

@fsimonis Fixed that variant. Here is the full dependency tree. Is there anything else that would help pin this down?

-- linux-sles15-zen3 / [email protected] -------------------------------
egt4cn6 [email protected]~ipo+mpi+petsc~python+shared build_system=cmake build_type=RelWithDebInfo
5sebukm     [email protected]~atomic~chrono~clanglibcpp~container~context~contract~coroutine~date_time~debug~exception~fiber+filesystem~graph~graph_parallel~icu~iostreams~json~locale+log~math+mpi+multithreaded~nowide~numpy~pic+program_options~python~random~regex~serialization+shared~signals~singlethreaded~stacktrace+system~taggedlayout+test+thread~timer~type_erasure~versionedlayout~wave build_system=generic cxxstd=98 patches=a440f96 visibility=hidden
bnyqmik         [email protected]+wrappers build_system=generic
4allaay     [email protected]~doc+ncurses~ownlibs~qt build_system=generic build_type=Release
s5aelxi         [email protected]~gssapi~ldap~libidn2~librtmp~libssh~libssh2~nghttp2 build_system=autotools libs=shared,static tls=gnutls
56brvrh             [email protected]~guile+zlib build_system=autotools
s3iopwe                 [email protected]+bzip2+curses+git~libunistring+libxml2+tar+xz build_system=autotools
g2bpsoz                     [email protected]~debug~pic+shared build_system=generic
rnafwos                         [email protected] build_system=autotools
xfogkcu                             [email protected] build_system=autotools libs=shared,static
jbbwlo5                     [email protected]~python build_system=autotools
savxweu                         [email protected] build_system=autotools
yucs7bj                         [email protected]+pic build_system=autotools libs=shared,static
76b2zrq                         [email protected]+optimize+pic+shared build_system=makefile
igbrz2c                     [email protected]~symlinks+termlib abi=none build_system=autotools
a35zenx                     [email protected] build_system=autotools zip=pigz
dmtmfzy                         [email protected] build_system=makefile
crilnoq                         [email protected]+programs build_system=makefile compression=none libs=shared,static
7sx44ru                 [email protected] build_system=autotools
omjzrqu                     [email protected] build_system=autotools
rv7bhhx                 [email protected] build_system=autotools
5n3nphp                     [email protected] build_system=autotools libs=shared,static
4my7pdm                         [email protected] build_system=autotools patches=35c4492,7793209,a49dd5b
yasn2hy                             [email protected]+sigsegv build_system=autotools patches=9dc5fbd,bfdffa7
ni76haj                                 [email protected] build_system=autotools
ucjrwtm                             [email protected]+cpanm+shared+threads build_system=generic
gqdvawb                                 [email protected]+cxx~docs+stl build_system=autotools patches=26090f4,b231fcc
otqsxvg                                 [email protected] build_system=autotools
6mvf2em                                     [email protected] build_system=autotools
t3onfyz                         [email protected] build_system=autotools
xyihrmc                         [email protected] build_system=autotools
jfoyxbd         [email protected]+libbsd build_system=autotools
uo7vnpu             [email protected] build_system=autotools
bcya2vp                 [email protected] build_system=autotools
di26ddu         [email protected]+iconv build_system=autotools compression=bz2lib,lz4,lzma,lzo2,zlib,zstd crypto=mbedtls libs=shared,static programs=none xar=expat
z67fidq             [email protected] build_system=makefile libs=shared,static
7a4tsiy             [email protected] build_system=autotools libs=shared,static
mskuajx             [email protected]+pic build_system=makefile build_type=Release libs=static
k5mmyyz         [email protected] build_system=autotools
qrkehbg         [email protected] build_system=makefile patches=093518c,3fbfe46
wzlxfkh     [email protected]~ipo build_system=cmake build_type=RelWithDebInfo
bpqapvu     [email protected]~X~batch~cgns~complex~cuda~debug+double~exodusii~fftw+fortran~giflib+hdf5~hpddm~hwloc+hypre~int64~jpeg~knl~kokkos~libpng~libyaml~memkind+metis~mkl-pardiso~mmg~moab~mpfr+mpi~mumps~openmp~p4est~parmmg~ptscotch~random123~rocm~saws~scalapack+shared~strumpack~suite-sparse+superlu-dist~tetgen~trilinos~valgrind build_system=generic clanguage=C
qztwosa         [email protected]+mpi~openmp+shared build_system=generic
vw5amky         [email protected]~cxx+fortran+hl~ipo~java+mpi+shared~szip~threadsafe+tools api=default build_system=cmake build_type=RelWithDebInfo
dbfenpi         [email protected]~complex~cuda~debug+fortran~gptune~int64~internal-superlu~mixedint+mpi~openmp~rocm+shared~superlu-dist~umpire~unified-memory build_system=autotools
f2t5phj         [email protected]~gdb~int64~ipo~real64+shared build_system=cmake build_type=RelWithDebInfo patches=4991da9,93a7903,b1225da
4wdj56e         [email protected]~gdb~int64~ipo+shared build_system=cmake build_type=RelWithDebInfo patches=4f89253,50ed208,704b84f
iespikt         [email protected]+bz2+ctypes+dbm~debug+libxml2+lzma~nis~optimizations+pic+pyexpat+pythoncmd+readline+shared+sqlite3~ssl~tix~tkinter~ucs4+uuid+zlib build_system=generic patches=0d98e93,f2fd060
kd4a5vc             [email protected] build_system=autotools
czakzn2             [email protected]+column_metadata+dynamic_extensions+fts~functions+rtree build_system=autotools
4qyr4mp             [email protected] build_system=autotools
auhajyt         [email protected]~cuda~int64~ipo~openmp~rocm+shared build_system=cmake build_type=RelWithDebInfo

@wspear
Collaborator Author

wspear commented Dec 6, 2022

It looks like this was caused by a poisoned runtime environment. This error doesn't appear on a fresh run node.

@wspear wspear closed this as completed Dec 6, 2022
@wspear
Collaborator Author

wspear commented Dec 6, 2022

@fsimonis I resolved this too quickly. I still get a hang/timeout; here is the run output in a clean environment:

DUMMY: Running solver dummy with preCICE config file "precice-config.xml", participant name "SolverOne", and mesh name "MeshOne".
DUMMY: Running solver dummy with preCICE config file "precice-config.xml", participant name "SolverTwo", and mesh name "MeshTwo".
MPICH ERROR [Rank 0] [job id ] [Tue Dec  6 15:18:11 2022] [nid001901] - Abort(1616271) (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(171).......: 
MPID_Init(495)..............: 
MPIDI_OFI_mpi_init_hook(816): 
create_endpoint(1353).......: OFI EP enable failed (ofi_init.c:1353:create_endpoint:Address already in use)

aborting job:
Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(171).......: 
MPID_Init(495)..............: 
MPIDI_OFI_mpi_init_hook(816): 
create_endpoint(1353).......: OFI EP enable failed (ofi_init.c:1353:create_endpoint:Address already in use)
wspear@nid001901:~/SPACK-SPACE/wspear/perlmutter/22.11/gnu/testsuite/validation_tests/precice> preCICE: This is preCICE version 2.5.0
preCICE: Revision info: no-info [git failed to run]
preCICE: Build type: Release (without debug log)
preCICE: Configuring preCICE with configuration "precice-config.xml"
preCICE: I am participant "SolverOne"
preCICE: Setting up primary communication to coupling partner/s

@wspear wspear reopened this Dec 6, 2022
@fsimonis

Both solvers fail in MPI_Init with the same error: create_endpoint: Address already in use.

Given that both of them fail with the same error, I expect that this is some kind of problem in the environment.

We don't do any fancy things in preCICE, so this should be reproducible with any dummy MPI code.
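
For reference, here is a minimal sketch of such a dummy MPI code (the file name, messages, and structure are illustrative, not part of the preCICE solver dummies or the E4S test suite). The reported abort happens inside MPI_Init itself, before any preCICE code runs, so if two concurrently launched instances of this program fail with the same create_endpoint error, preCICE can be ruled out:

/* mpi_dummy.c -- illustrative stand-in for "any dummy MPI code";
   not part of the preCICE solver dummies or the E4S test suite. */
#include <mpi.h>
#include <stdio.h>
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);    /* the reported abort happens here */
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("MPI_Init succeeded on rank %d\n", rank);
    MPI_Finalize();
    return 0;
}

Building this with the same MPICH (mpicc mpi_dummy.c -o mpi_dummy) and starting two instances at once, the way the test script starts SolverOne and SolverTwo, should reproduce the failure if the environment is at fault.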

@wspear
Collaborator Author

wspear commented Dec 14, 2022

I'm seeing the same issue on Crusher. The error and variants/dependencies for the Crusher install are below. This is in a clean environment (basically all I've done is spack load precice), with other MPI-based products generally testing successfully.

Skipping load: Environment already setup
MPICH ERROR [Rank 0] [job id ] [Fri Nov 18 11:40:41 2022] [crusher131] - Abort(1616271) (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(171).......:
MPID_Init(495)..............:
MPIDI_OFI_mpi_init_hook(816):
create_endpoint(1353).......: OFI EP enable failed (ofi_init.c:1353:create_endpoint:Address already in use)

DUMMY: Running solver dummy with preCICE config file "precice-config.xml", participant name "SolverOne", and mesh name "MeshOne".
aborting job:
Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(171).......:
MPID_Init(495)..............:
MPIDI_OFI_mpi_init_hook(816):
create_endpoint(1353).......: OFI EP enable failed (ofi_init.c:1353:create_endpoint:Address already in use)
DUMMY: Running solver dummy with preCICE config file "precice-config.xml", participant name "SolverTwo", and mesh name "MeshTwo".
preCICE: This is preCICE version 2.5.0
preCICE: Revision info: no-info [git failed to run]
preCICE: Build type: Release (without debug log)
preCICE: Configuring preCICE with configuration "precice-config.xml"
preCICE: I am participant "SolverTwo"
preCICE: Setting up primary communication to coupling partner/s
-- linux-sles15-zen3 / [email protected] -------------------------------
2weu3di [email protected]~ipo+mpi+petsc~python+shared build_system=cmake build_type=RelWithDebInfo
trtrf3b     [email protected]~atomic~chrono~clanglibcpp~container~context~contract~coroutine~date_time~debug~exception~fiber+filesystem~graph~graph_parallel~icu~iostreams~json~locale+log~math+mpi+multithreaded~nowide~numpy~pic+program_options~python~random~regex~serialization+shared~signals~singlethreaded~stacktrace+system~taggedlayout+test+thread~timer~type_erasure~versionedlayout~wave build_system=generic cxxstd=98 patches=a440f96 visibility=hidden
oaykapp         [email protected]+wrappers build_system=generic
c6gpjyk     [email protected]~doc+ncurses+ownlibs~qt build_system=generic build_type=Release
igbrz2c         [email protected]~symlinks+termlib abi=none build_system=autotools
savxweu             [email protected] build_system=autotools
kq7i44v         [email protected]~docs~shared build_system=generic certs=mozilla
6ki4n47             ca-certificates-mozilla@2022-10-11 build_system=generic
ucjrwtm             [email protected]+cpanm+shared+threads build_system=generic
gqdvawb                 [email protected]+cxx~docs+stl build_system=autotools patches=26090f4,b231fcc
g2bpsoz                 [email protected]~debug~pic+shared build_system=generic
rnafwos                     [email protected] build_system=autotools
xfogkcu                         [email protected] build_system=autotools libs=shared,static
otqsxvg                 [email protected] build_system=autotools
6mvf2em                     [email protected] build_system=autotools
76b2zrq                 [email protected]+optimize+pic+shared build_system=makefile
3oefhug     [email protected]~ipo build_system=cmake build_type=RelWithDebInfo
jbbwlo5     [email protected]~python build_system=autotools
yucs7bj         [email protected]+pic build_system=autotools libs=shared,static
hn5xr53     [email protected]~X+batch~cgns~complex~cuda~debug+double~exodusii~fftw+fortran~giflib+hdf5~hpddm~hwloc+hypre~int64~jpeg~knl~kokkos~libpng~libyaml~memkind+metis~mkl-pardiso~mmg~moab~mpfr+mpi~mumps~openmp~p4est~parmmg~ptscotch~random123~rocm~saws~scalapack+shared~strumpack~suite-sparse+superlu-dist~tetgen~trilinos~valgrind build_system=generic clanguage=C
dc5jfan         [email protected]~cxx+fortran+hl~ipo~java+mpi+shared~szip~threadsafe+tools api=default build_system=cmake build_type=RelWithDebInfo
e5s4iy7         [email protected]~complex~cuda~debug+fortran~gptune~int64~internal-superlu~mixedint+mpi~openmp~rocm+shared~superlu-dist~umpire~unified-memory build_system=autotools
bgpvt5g             [email protected]~bignuma~consistent_fpcsr+fortran~ilp64+locking+pic+shared build_system=makefile patches=d3d9b15 symbol_suffix=none threads=openmp
jfxbkfk         [email protected]~gdb~int64~ipo~real64+shared build_system=cmake build_type=RelWithDebInfo patches=4991da9,93a7903,b1225da
f3ztx6d         [email protected]~gdb~int64~ipo+shared build_system=cmake build_type=RelWithDebInfo patches=4f89253,50ed208,704b84f
du4hbnl         [email protected]+bz2+ctypes+dbm~debug+libxml2+lzma~nis~optimizations+pic+pyexpat+pythoncmd+readline+shared+sqlite3+ssl~tix~tkinter~ucs4+uuid+zlib build_system=generic patches=0d98e93,f2fd060
jfoyxbd             [email protected]+libbsd build_system=autotools
uo7vnpu                 [email protected] build_system=autotools
bcya2vp                     [email protected] build_system=autotools
s3iopwe             [email protected]+bzip2+curses+git~libunistring+libxml2+tar+xz build_system=autotools
a35zenx                 [email protected] build_system=autotools zip=pigz
dmtmfzy                     [email protected] build_system=makefile
crilnoq                     [email protected]+programs build_system=makefile compression=none libs=shared,static
kd4a5vc             [email protected] build_system=autotools
czakzn2             [email protected]+column_metadata+dynamic_extensions+fts~functions+rtree build_system=autotools
4qyr4mp             [email protected] build_system=autotools
kzjsqlm         [email protected]~cuda~int64~ipo~openmp~rocm+shared build_system=cmake build_type=RelWithDebInfo

@fsimonis

fsimonis commented Dec 15, 2022

We test MPICH in our CI using Fedora, which is still at version 34 (MPICH 3.4.1).
I'll upgrade to Fedora 37 (MPICH 4.0.2) and see if this succeeds.
In the meantime, I'll build preCICE 2.5.0 using the newest Spack with MPICH to see if that succeeds on my workstation.
Then I'll get back to you.

Have you tried launching multiple other MPI programs simultaneously to see if the system can handle this? We experienced problems on SuperMUC(-NG) with multiple MPI programs running simultaneously on the same slots while spanning multiple nodes. This could be another symptom of the same problem. (Of course this is more of a guess, as you don't actually run the solver dummies with mpirun.)
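
A sketch of that experiment, reusing the dummy above (the 10-second pause is an arbitrary choice, added only to force the overlap): keeping each process alive after MPI_Init guarantees that two independently launched jobs actually coexist on the node, which is the situation the two solver dummies create.

/* concurrent_init.c -- illustrative variant for the "multiple MPI
   programs simultaneously" experiment; the 10 s pause is arbitrary. */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);    /* the OFI endpoint is created here */
    printf("init ok, holding the endpoint open\n");
    sleep(10);                 /* keep the process alive so a second,
                                  independently launched instance overlaps */
    MPI_Finalize();
    return 0;
}

Launching two copies back to back (./concurrent_init & ./concurrent_init) mirrors the test's launch pattern without involving preCICE at all.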

@fsimonis

Your test runs fine locally with:

spack --version
0.20.0.dev0 (7056a4bffd8f37615bc5efee8f02a400dceaec5c)

Using the spec:

-- linux-archrolling-zen3 / [email protected] --------------------------
t4mqo7z [email protected]~ipo+mpi+petsc~python+shared build_system=cmake build_type=RelWithDebInfo
us4udt5     [email protected]~atomic~chrono~clanglibcpp~container~context~contract~coroutine~date_time~debug~exception~fiber+filesystem~graph~graph_parallel~icu~iostreams~json~locale+log~math~mpi+multithreaded~nowide~numpy~pic+program_options~python~random~regex~serialization+shared~signals~singlethreaded~stacktrace+system~taggedlayout+test+thread~timer~type_erasure~versionedlayout~wave build_system=generic cxxstd=98 patches=a440f96 visibility=hidden
7xgan6m     [email protected]~doc+ncurses+ownlibs~qt build_system=generic build_type=Release
2tmrrpw     [email protected]~ipo build_system=cmake build_type=RelWithDebInfo
vy67cbo     [email protected]~python build_system=autotools
6ltr5dl         [email protected] build_system=autotools libs=shared,static
5ggmxkn         [email protected]~pic build_system=autotools libs=shared,static
dpj4bms         [email protected]+optimize+pic+shared build_system=makefile
zq4eoyj     [email protected]~argobots~cuda+fortran+hwloc+hydra+libxml2+pci~rocm+romio~slurm~two_level_namespace~vci~verbs+wrapperrpath build_system=autotools datatype-engine=auto device=ch4 netmod=ofi patches=d4c0e99 pmi=pmi
53lepuv         [email protected] build_system=autotools patches=440b954
wnrcksl         [email protected]~cairo~cuda~gl~libudev+libxml2~netloc~nvml~oneapi-level-zero~opencl+pci~rocm build_system=autotools libs=shared,static
tehwqeo             [email protected]~symlinks+termlib abi=none build_system=autotools
kq4iabz         [email protected]~debug~kdreg build_system=autotools fabrics=sockets,tcp,udp
jidochn         [email protected] build_system=autotools
ehr3efd             [email protected] build_system=autotools
53b3qec             [email protected] build_system=autotools
yy5vpjv         [email protected]~cuda~rocm build_system=autotools
bd6cvfl             [email protected] build_system=autotools
s3bwkg4             [email protected] build_system=autotools
ztsqh6m             [email protected]+sigsegv build_system=autotools patches=9dc5fbd,bfdffa7
cehm5ed     [email protected]~X~batch~cgns~complex~cuda~debug+double~exodusii~fftw+fortran~giflib+hdf5~hpddm~hwloc+hypre~int64~jpeg~knl~kokkos~libpng~libyaml~memkind+metis~mkl-pardiso~mmg~moab~mpfr+mpi~mumps~openmp~p4est~parmmg~ptscotch~random123~rocm~saws~scalapack+shared~strumpack~suite-sparse+superlu-dist~tetgen~trilinos~valgrind build_system=generic clanguage=C
4ivuxig         [email protected] build_system=autotools
jyct3ow         [email protected]~cxx~fortran~hl~ipo~java+mpi+shared~szip~threadsafe+tools api=default build_system=cmake build_type=RelWithDebInfo
lamojl4         [email protected]~complex~cuda~debug+fortran~gptune~int64~internal-superlu~mixedint+mpi~openmp~rocm+shared~superlu-dist~umpire~unified-memory build_system=autotools
gkgqte5         [email protected]~gdb~int64~ipo~real64+shared build_system=cmake build_type=RelWithDebInfo patches=4991da9,93a7903,b1225da
23ihaez         [email protected]~bignuma~consistent_fpcsr+fortran~ilp64+locking+pic+shared build_system=makefile patches=d3d9b15 symbol_suffix=none threads=none
3wvtlf6             [email protected]+cpanm+shared+threads build_system=generic
fogt6mt                 [email protected]+cxx~docs+stl build_system=autotools patches=26090f4,b231fcc
skhcew2         [email protected]~gdb~int64~ipo+shared build_system=cmake build_type=RelWithDebInfo patches=4f89253,50ed208,704b84f
tdmgiza         [email protected]+bz2+crypt+ctypes+dbm~debug+libxml2+lzma~nis~optimizations+pic+pyexpat+pythoncmd+readline+shared+sqlite3+ssl~tkinter+uuid+zlib build_system=generic patches=0d98e93,7d40923,f2fd060
pzibomt             [email protected]~debug~pic+shared build_system=generic
qp2r7iz             [email protected]+libbsd build_system=autotools
eb55tgs                 [email protected] build_system=autotools
h643yv4                     [email protected] build_system=autotools
aap5vzx             [email protected] build_system=autotools
q7goc63             [email protected]+bzip2+curses+git~libunistring+libxml2+tar+xz build_system=autotools
mlrmz6k                 [email protected] build_system=autotools zip=pigz
jndjnxn             [email protected] build_system=autotools
ojhzllf             [email protected]~obsolete_api build_system=autotools
xu5sfij             [email protected]~docs~shared build_system=generic certs=mozilla
sqdghw3                 ca-certificates-mozilla@2022-10-11 build_system=generic
crr6ch5             [email protected] build_system=autotools patches=bbf97f1
46hrmmf             [email protected]+column_metadata+dynamic_extensions+fts~functions+rtree build_system=autotools
6fveg3y             [email protected] build_system=autotools
wc4fllt         [email protected]~cuda~int64~ipo~openmp~rocm+shared build_system=cmake build_type=RelWithDebInfo
vgr5oe6     [email protected] build_system=autotools
