Builds of mpi-serial case with intel and DEBUG on are failing on Derecho #130

Closed
ekluzek opened this issue Dec 1, 2023 · 7 comments · Fixed by ESMCI/cime#4533 or #144
Labels
bug Something isn't working

Comments

@ekluzek
Contributor

ekluzek commented Dec 1, 2023

Builds of mpi-serial cases with the intel compiler and DEBUG on are failing on Derecho. I see this with ccs_config_cesm0.0.84 and ctsm5.1.dev156-43-g84bab54dc, which will become ctsm5.1.dev157 (ESCOMP/CTSM#2269).

Two test cases that fail are:

ERS_D_Mmpi-serial_Ld5.1x1_brazil.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesCold
SMS_Lm3_D_Mmpi-serial.1x1_brazil.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesColdHydro

The build fails at the link step, as follows, with undefined references coming from the mpich library. That is odd
because this is built with mpi-serial, so mpich shouldn't be anywhere in here.

model_only is True
         - Building atm Library
Building atm with output to /glade/derecho/scratch/erik/tests_ctsm51d155derechofs/ERS_D_Mmpi-serial_Ld5.1x1_brazil.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesCold.GC.ctsm51d155derechofs_int/bld/atm.bldlog.231201-010530
datm built in 0.957645 seconds
Building cesm from /glade/work/erik/ctsm_worktrees/external_updates/components/cmeps/cime_config/buildexe with output to /glade/derecho/scratch/erik/tests_ctsm51d155derechofs/ERS_D_Mmpi-serial_Ld5.1x1_brazil.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesCold.GC.ctsm51d155derechofs_int/bld/cesm.bldlog.231201-010530
Component cesm exe build complete with 43 warnings
Building test for ERS in directory /glade/derecho/scratch/erik/tests_ctsm51d155derechofs/ERS_D_Mmpi-serial_Ld5.1x1_brazil.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesCold.GC.ctsm51d155derechofs_int
ld: /opt/cray/pe/mpich/8.1.25/ofi/intel/19.0/lib/libmpi_intel.so.12: undefined reference to `fi_strerror@FABRIC_1.0'
ld: /opt/cray/pe/mpich/8.1.25/ofi/intel/19.0/lib/libmpi_intel.so.12: undefined reference to `fi_fabric@FABRIC_1.1'
ld: /opt/cray/pe/mpich/8.1.25/ofi/intel/19.0/lib/libmpi_intel.so.12: undefined reference to `fi_getinfo@FABRIC_1.3'
ld: /opt/cray/pe/mpich/8.1.25/ofi/intel/19.0/lib/libmpi_intel.so.12: undefined reference to `fi_dupinfo@FABRIC_1.3'
ld: /opt/cray/pe/mpich/8.1.25/ofi/intel/19.0/lib/libmpi_intel.so.12: undefined reference to `fi_freeinfo@FABRIC_1.3'
ld: /opt/cray/pe/mpich/8.1.25/ofi/intel/19.0/lib/libmpi_intel.so.12: undefined reference to `fi_version@FABRIC_1.0'
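For triage, it can help to group linker errors by the library that reports them. The sketch below is a minimal parser that assumes the "ld: LIBRARY: undefined reference to SYMBOL" line shape seen in the log above; the name `group_undefined` is made up for illustration. In this log every missing symbol has an `fi_` prefix and a `FABRIC_x.y` version tag, so these look like libfabric symbols needed by the Cray mpich library rather than MPI symbols themselves.

```python
import re
from collections import defaultdict

# Matches linker lines of the form:
#   ld: <library>: undefined reference to `<symbol>'
UNDEF_RE = re.compile(
    r"^ld: (?P<lib>\S+): undefined reference to `(?P<sym>[^']+)'"
)


def group_undefined(log_text):
    """Return {library_path: [symbol, ...]} for each undefined-reference line."""
    by_lib = defaultdict(list)
    for line in log_text.splitlines():
        match = UNDEF_RE.match(line.strip())
        if match:
            by_lib[match.group("lib")].append(match.group("sym"))
    return dict(by_lib)


# Two lines copied from the build log above.
sample = """\
ld: /opt/cray/pe/mpich/8.1.25/ofi/intel/19.0/lib/libmpi_intel.so.12: undefined reference to `fi_strerror@FABRIC_1.0'
ld: /opt/cray/pe/mpich/8.1.25/ofi/intel/19.0/lib/libmpi_intel.so.12: undefined reference to `fi_fabric@FABRIC_1.1'
"""

for lib, symbols in group_undefined(sample).items():
    print(lib)
    for symbol in symbols:
        print("   ", symbol)
```

If every entry falls under a single mpich library, the question becomes why that library is on the link line at all, not which MPI routine is missing.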

I can see references to mpich in the software_environment.txt for my case, which seems odd...

software_environment.txt:LMOD_SYSTEM_DEFAULT_MODULES=ncarenv/23.09:craype/2.7.23:intel/2023.2.1:ncarcompilers/1.0.0:cray-mpich/8.1.27:netcdf/4.9.2
software_environment.txt:PBS_O_PATH=/glade/u/apps/derecho/23.06/spack/opt/spack/netcdf/4.9.2/oneapi/2023.0.0/iijr/bin:/glade/u/apps/derecho/23.06/spack/opt/spack/hdf5/1.12.2/oneapi/2023.0.0/d6xa/bin:/glade/u/apps/derecho/23.06/spack/opt/spack/ncarcompilers/1.0.0/oneapi/2023.0.0/ec7b/bin/mpi:/opt/cray/pe/pals/1.2.11/bin:/opt/cray/libfabric/1.15.2.0/bin:/opt/cray/pe/mpich/8.1.25/ofi/intel/19.0/bin:/opt/cray/pe/mpich/8.1.25/bin:/glade/u/apps/derecho/23.06/spack/opt/spack/ncarcompilers/1.0.0/oneapi/2023.0.0/ec7b/bin:/glade/u/apps/common/23.04/spack/opt/spack/intel-oneapi-compilers/2023.0.0/compiler/2023.0.0/linux/lib/oclfpga/bin:/glade/u/apps/common/23.04/spack/opt/spack/intel-oneapi-compilers/2023.0.0/compiler/2023.0.0/linux/bin/intel64:/glade/u/apps/common/23.04/spack/opt/spack/intel-oneapi-compilers/2023.0.0/compiler/2023.0.0/linux/bin:/opt/cray/pe/craype/2.7.20/bin:/glade/u/apps/derecho/23.06/opt/utils/bin:/opt/clmgr/sbin:/opt/clmgr/bin:/opt/sgi/sbin:/opt/sgi/bin:/glade/u/home/erik/bin:/usr/sbin:/opt/c3/bin:/usr/lib/mit/bin:/usr/lib/mit/sbin:/opt/pbs/bin:/glade/u/apps/derecho/23.06/opt/bin:/usr/local/bin:/usr/bin:/sbin:/bin:/opt/cray/pe/bin
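A quick way to see where mpich leaks into the case environment is to split each colon-separated variable and keep only the entries that mention it. This is a minimal sketch, assuming KEY=VALUE lines like the grep output above (optionally prefixed with the file name); `find_entries` is a hypothetical helper, and the sample PBS_O_PATH value is shortened from the real one.

```python
def find_entries(env_text, needle="mpich", prefix="software_environment.txt:"):
    """Return {VAR: [entries containing needle]} from KEY=VALUE lines."""
    hits = {}
    for line in env_text.splitlines():
        # Drop the "filename:" prefix that grep adds, if present.
        if line.startswith(prefix):
            line = line[len(prefix):]
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a KEY=VALUE line
        matches = [entry for entry in value.split(":") if needle in entry]
        if matches:
            hits[key] = matches
    return hits


# Values taken from the output above; PBS_O_PATH is abbreviated for the example.
sample_env = "\n".join([
    "software_environment.txt:LMOD_SYSTEM_DEFAULT_MODULES=ncarenv/23.09:"
    "craype/2.7.23:intel/2023.2.1:ncarcompilers/1.0.0:cray-mpich/8.1.27:netcdf/4.9.2",
    "software_environment.txt:PBS_O_PATH=/opt/cray/libfabric/1.15.2.0/bin:"
    "/opt/cray/pe/mpich/8.1.25/ofi/intel/19.0/bin:/opt/cray/pe/mpich/8.1.25/bin:/usr/bin",
])

for var, entries in find_entries(sample_env).items():
    print(var, entries)
```

On the values quoted above, this also makes it easy to spot that the default module list names cray-mpich/8.1.27 while PBS_O_PATH points at mpich 8.1.25.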

@ekluzek
Contributor Author

ekluzek commented Dec 2, 2023

I thought the mpich part of this might have come from the modules I had loaded before running a case. But it looks like that isn't so, as both cesmdev and ncarenv seem to add mpich, at least to the LMOD_SYSTEM_DEFAULT_MODULES env variable. Unsetting that env variable beforehand doesn't help, as they both set it for you.

The module purge in env_mach_specific.xml doesn't completely unload the user's environment when modules they loaded are sticky.

@ekluzek
Contributor Author

ekluzek commented Dec 2, 2023

OK, I ran a production case that worked and a debug one that failed. In comparing the link step between the two I think the key difference is the PIO library here...

< -L/glade/u/apps/derecho/23.09/spack/opt/spack/parallelio/2.6.2/mpi-serial/2.3.0/oneapi/2023.2.1/mdpq/lib
---
> -L/glade/u/apps/cseg/derecho/23.09/spack/opt/spack/linux-sles15-x86_64_v3/oneapi-2023.2.1/parallelio-2.6.2-ztld6j4qg5warlaaek3eql6bo2mlq4bm/lib

The first path includes a directory explicitly named mpi-serial, while the second does not. So I hacked the Makefile to use the PIO library from the working case, but that still failed. When I also hacked the Makefile to use the ESMF library from the non-debug version, I got it to work.

So using the non-debug ESMF and PIO versions allows the code to build.
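The comparison of the two link steps can be sketched as a small script that extracts the `-L` directories from each command and reports the ones unique to each side. This is a sketch under assumptions: the full link commands are not shown in this thread, so the `ifort ... -lpiof` command lines below are hypothetical stand-ins wrapped around the two PIO paths quoted above, and `lib_dirs` and `diff_lib_dirs` are illustrative names.

```python
import shlex


def lib_dirs(link_cmd):
    """Return the -L directories from a link command, in order."""
    dirs = []
    tokens = shlex.split(link_cmd)
    for i, tok in enumerate(tokens):
        if tok == "-L" and i + 1 < len(tokens):
            dirs.append(tokens[i + 1])  # "-L dir" form
        elif tok.startswith("-L") and len(tok) > 2:
            dirs.append(tok[2:])        # "-Ldir" form
    return dirs


def diff_lib_dirs(working_cmd, failing_cmd):
    """Return (dirs only in working_cmd, dirs only in failing_cmd)."""
    working, failing = lib_dirs(working_cmd), lib_dirs(failing_cmd)
    only_working = [d for d in working if d not in failing]
    only_failing = [d for d in failing if d not in working]
    return only_working, only_failing


# Hypothetical link commands; only the -L paths come from the thread.
working = (
    "ifort -o cesm.exe main.o "
    "-L/glade/u/apps/derecho/23.09/spack/opt/spack/parallelio/2.6.2/"
    "mpi-serial/2.3.0/oneapi/2023.2.1/mdpq/lib -lpiof"
)
failing = (
    "ifort -o cesm.exe main.o "
    "-L/glade/u/apps/cseg/derecho/23.09/spack/opt/spack/linux-sles15-x86_64_v3/"
    "oneapi-2023.2.1/parallelio-2.6.2-ztld6j4qg5warlaaek3eql6bo2mlq4bm/lib -lpiof"
)

only_working, only_failing = diff_lib_dirs(working, failing)
print("only in working build:", only_working)
print("only in failing build:", only_failing)
```

Running the same comparison on the real link lines from the two cesm.bldlog files should list any other library directories, such as ESMF, that differ between the production and debug builds.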

Here's the diff of the hacked Makefile, showing what I did to make it work:

 diff -c Tools/Makefile.orig Tools/Makefile
*** Tools/Makefile.orig 2023-12-02 13:12:40.356436000 -0700
--- Tools/Makefile      2023-12-02 15:21:35.517732412 -0700
***************
*** 260,265 ****
--- 260,266 ----
     SLIBS += -L$(LIB_PNETCDF) -lpnetcdf
  endif

+ ESMFMKFILE := /glade/u/apps/cseg/derecho/23.09/spack/opt/spack/linux-sles15-x86_64_v3/oneapi-2023.2.1/esmf-8.6.0b04-kvqb7p62vw5d6dgsbyhnh6j2esucma2t/lib/esmf.mk
  # Set esmf.mk location with ESMF_LIBDIR having precedence over ESMFMKFILE
  CIME_ESMFMKFILE := undefined_ESMFMKFILE
  ifdef ESMFMKFILE
***************
*** 446,451 ****
--- 447,453 ----
    MCT_LIBDIR=$(INSTALL_SHAREDPATH)/lib
  endif

+ PIO_LIBDIR := /glade/u/apps/derecho/23.09/spack/opt/spack/parallelio/2.6.2/mpi-serial/2.3.0/oneapi/2023.2.1/mdpq/lib
  ifdef PIO_LIBDIR
    ifeq ($(PIO_VERSION),$(PIO_VERSION_MAJOR))
      INCLDIR += -I$(PIO_INCDIR)

So it sounds like the ESMF and PIO libraries used with DEBUG on for intel have issues and aren't really built against mpi-serial, at least as set up in the module environment?

@jedwards4b
Collaborator

mpi-serial is an installed module on Derecho, but there is a problem with the install: mpi.mod is missing.
I have opened NCAR/spack-derecho#18 for CISL to correct that problem. This will also
require a cime PR and a ccs_config PR, coming soon.

@jedwards4b
Collaborator

After fixing the mpi-serial install I am still getting the error:

ld: /opt/cray/pe/mpich/8.1.25/ofi/intel/19.0/lib/libmpi_intel.so.12: undefined reference to `fi_strerror@FABRIC_1.0'
ld: /opt/cray/pe/mpich/8.1.25/ofi/intel/19.0/lib/libmpi_intel.so.12: undefined reference to `fi_fabric@FABRIC_1.1'
ld: /opt/cray/pe/mpich/8.1.25/ofi/intel/19.0/lib/libmpi_intel.so.12: undefined reference to `fi_getinfo@FABRIC_1.3'
ld: /opt/cray/pe/mpich/8.1.25/ofi/intel/19.0/lib/libmpi_intel.so.12: undefined reference to `fi_dupinfo@FABRIC_1.3'
ld: /opt/cray/pe/mpich/8.1.25/ofi/intel/19.0/lib/libmpi_intel.so.12: undefined reference to `fi_freeinfo@FABRIC_1.3'
ld: /opt/cray/pe/mpich/8.1.25/ofi/intel/19.0/lib/libmpi_intel.so.12: undefined reference to `fi_version@FABRIC_1.0'

Still trying to understand why.

@jedwards4b
Collaborator

If I simplify to SMS_Mmpi-serial.f19_g17.X.derecho_intel it works.
SMS_Mmpi-serial.f19_g17.A.derecho_intel also builds correctly.
I also tried
SMS_Mmpi-serial.f19_g17.2000_DATM%CRUv7_CLM50%BGC_SICE_SOCN_SROF_SGLC_SWAV_SESP.derecho_intel
and that one fails.

@ekluzek
Contributor Author

ekluzek commented Dec 12, 2023

I tried in CTSM with

ccs_config_cesm0.0.85
cime6.0.193

and it's still failing for me. What set of externals did you use in ESMCI/cime#4533 that you got to work?

Also note that tests with DEBUG off were working for me; it's the tests with DEBUG on that fail. So do the debug versions of the X and A compset tests, listed below, work?

SMS_D_Mmpi-serial.f19_g17.X.derecho_intel

and

SMS_D_Mmpi-serial.f19_g17.A.derecho_intel
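Since the distinction here rests on the `_D` modifier in the test name, a small parser makes it easy to check which tests in a list actually run with DEBUG on. This is a minimal sketch, assuming the TESTTYPE_MODIFIERS.GRID.COMPSET.MACHINE_COMPILER layout that the names in this thread follow, with `D` meaning debug and `M<name>` selecting the MPI library; `parse_test_name` is a hypothetical helper, not part of cime.

```python
def parse_test_name(name):
    """Split a test name into its fields (assumes at least four dot-separated parts)."""
    parts = name.split(".")
    head = parts[0].split("_")
    modifiers = head[1:]
    # The MPI library is the modifier that starts with "M", e.g. "Mmpi-serial".
    mpilib = next((m[1:] for m in modifiers if m.startswith("M")), None)
    return {
        "testtype": head[0],
        "debug": "D" in modifiers,
        "mpilib": mpilib,
        "grid": parts[1],
        "compset": parts[2],
        "machine_compiler": parts[3],
    }


# Test names copied from this thread.
names = [
    "SMS_Mmpi-serial.f19_g17.X.derecho_intel",
    "SMS_D_Mmpi-serial.f19_g17.X.derecho_intel",
    "ERS_D_Mmpi-serial_Ld5.1x1_brazil.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesCold",
]

for name in names:
    info = parse_test_name(name)
    print(info["testtype"], info["compset"], "debug" if info["debug"] else "non-debug")
```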

@ekluzek
Contributor Author

ekluzek commented Jan 13, 2024

I tried again with the latest externals, updated as follows, and it's still failing:

@@ -34,7 +34,7 @@ hash = 34723c2
 required = True
 
 [ccs_config]
-tag = ccs_config_cesm0.0.84
+tag = ccs_config_cesm0.0.87
 protocol = git
 repo_url = https://github.com/ESMCI/ccs_config_cesm.git
 local_path = ccs_config
@@ -44,11 +44,11 @@ required = True
 local_path = cime
 protocol = git
 repo_url = https://github.com/ESMCI/cime
-tag = cime6.0.175
+tag = cime6.0.198
 required = True
 
 [cmeps]
-tag = cmeps0.14.43
+tag = cmeps0.14.47
 protocol = git
 repo_url = https://github.com/ESCOMP/CMEPS.git
 local_path = components/cmeps
