Builds of mpi-serial case with intel and DEBUG on are failing on Derecho #130
I thought the mpich part of this might have come from my modules environment before running a case. But it looks like that isn't so, as both cesmdev and ncarenv seem to add mpich, at least to the LMOD_SYSTEM_DEFAULT_MODULES env variable. Unsetting that env variable beforehand doesn't help, as they both set it for you. The module purge in env_mach_specific.xml doesn't completely unload the user's environment for the modules they loaded that are sticky.
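One way to check the observation above is to split the variable into its entries and look for mpich. This is a minimal sketch, assuming LMOD_SYSTEM_DEFAULT_MODULES is a colon-separated list (as Lmod documents it); the helper name `modules_matching` is made up for illustration.

```shell
# Hypothetical helper: print the entries of a colon-separated module list
# that mention a given name (case-insensitive).
# Returns non-zero when nothing matches.
modules_matching() {
    printf '%s\n' "$1" | tr ':' '\n' | grep -i -- "$2"
}

# Example use on the current environment (prints nothing if there is no match):
#   modules_matching "${LMOD_SYSTEM_DEFAULT_MODULES:-}" mpich
```

Lmod also documents `module --force purge` as the way to unload sticky modules; whether that is the right change for the machine configuration is not something this thread establishes.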
OK, I ran a production case that worked and a debug one that failed. Comparing the link step between the two, I think the key difference is the PIO library here:

< -L/glade/u/apps/derecho/23.09/spack/opt/spack/parallelio/2.6.2/mpi-serial/2.3.0/oneapi/2023.2.1/mdpq/lib
---
> -L/glade/u/apps/cseg/derecho/23.09/spack/opt/spack/linux-sles15-x86_64_v3/oneapi-2023.2.1/parallelio-2.6.2-ztld6j4qg5warlaaek3eql6bo2mlq4bm/lib

The first path includes a directory named mpi-serial explicitly, while the second does not. So I hacked the Makefile to use the PIO library from the working one; that still failed. But when I also hacked the Makefile to use the ESMF library from the non-debug version, I got it to work. So using the non-debug ESMF and PIO versions allows the code to compile. Here's the diff of the hacked Makefile, showing what I did to make it work:

diff -c Tools/Makefile.orig Tools/Makefile
*** Tools/Makefile.orig 2023-12-02 13:12:40.356436000 -0700
--- Tools/Makefile 2023-12-02 15:21:35.517732412 -0700
***************
*** 260,265 ****
--- 260,266 ----
SLIBS += -L$(LIB_PNETCDF) -lpnetcdf
endif
+ ESMFMKFILE := /glade/u/apps/cseg/derecho/23.09/spack/opt/spack/linux-sles15-x86_64_v3/oneapi-2023.2.1/esmf-8.6.0b04-kvqb7p62vw5d6dgsbyhnh6j2esucma2t/lib/esmf.mk
# Set esmf.mk location with ESMF_LIBDIR having precedence over ESMFMKFILE
CIME_ESMFMKFILE := undefined_ESMFMKFILE
ifdef ESMFMKFILE
***************
*** 446,451 ****
--- 447,453 ----
MCT_LIBDIR=$(INSTALL_SHAREDPATH)/lib
endif
+ PIO_LIBDIR := /glade/u/apps/derecho/23.09/spack/opt/spack/parallelio/2.6.2/mpi-serial/2.3.0/oneapi/2023.2.1/mdpq/lib
ifdef PIO_LIBDIR
ifeq ($(PIO_VERSION),$(PIO_VERSION_MAJOR))
INCLDIR += -I$(PIO_INCDIR)

So it sounds like the ESMF and PIO libraries built with DEBUG on for intel must have issues and aren't really using mpi-serial. At least, maybe, in the module environment?
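The link-step comparison described in this comment can be done mechanically. This is a rough sketch, assuming the two link commands have been copied out of the build logs as single strings and that library paths contain no spaces; the function names `lib_paths` and `only_in_first` are invented for this example.

```shell
# Hypothetical helper: print the -L search paths of a link command,
# one per line, sorted and de-duplicated.
lib_paths() {
    printf '%s\n' "$1" | tr ' ' '\n' | grep '^-L' | sort -u
}

# Hypothetical helper: print the -L paths that appear in the first
# link command but not in the second.
only_in_first() {
    first=$(lib_paths "$1")
    second=$(lib_paths "$2")
    printf '%s\n' "$first" | while IFS= read -r p; do
        [ -n "$p" ] || continue
        # -x: whole-line match, -F: fixed string, --: path starts with "-"
        printf '%s\n' "$second" | grep -qxF -- "$p" || printf '%s\n' "$p"
    done
}
```

Calling `only_in_first` once in each direction gives the same information as the `<` and `>` lines of a plain diff, restricted to library search paths.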
mpi-serial is an installed module on Derecho, but there is a problem with the install, as mpi.mod is missing.
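A quick way to confirm whether an install tree contains the Fortran module file is to search for it. This is a minimal sketch; the function name `has_mpi_mod` is hypothetical, and the directory to pass in is whatever the mpi-serial module points at on your system (not stated in this thread).

```shell
# Hypothetical helper: succeed if a file named mpi.mod exists
# anywhere under the given install directory.
has_mpi_mod() {
    [ -n "$(find "$1" -type f -name mpi.mod 2>/dev/null | head -n 1)" ]
}

# Example use (the path is a placeholder, not a real Derecho location):
#   has_mpi_mod /path/to/mpi-serial/install || echo "mpi.mod is missing"
```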
After fixing the mpi-serial install I am still getting the error.
Still trying to understand why.
If I simplify to SMS_Mmpi-serial.f19_g17.X.derecho_intel it works.
I tried in CTSM with ccs_config_cesm0.0.85 and it's still failing for me. What set of externals did you use in ESMCI/cime#4533 that you got to work? Also note that DEBUG off tests were working for me; it's the DEBUG on tests that fail. So do debug tests for the X and A compsets work? That is, SMS_D_Mmpi-serial.f19_g17.X.derecho_intel and SMS_D_Mmpi-serial.f19_g17.A.derecho_intel?
I tried again with the following latest externals and it's still failing:
Builds of mpi-serial cases with the intel compiler and DEBUG on are failing on Derecho in ccs_config_cesm0.0.84 with ctsm5.1.dev156-43-g84bab54dc, in what will become ctsm5.1.dev157 (ESCOMP/CTSM#2269).
Two test cases that fail are:
ERS_D_Mmpi-serial_Ld5.1x1_brazil.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesCold
SMS_Lm3_D_Mmpi-serial.1x1_brazil.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesColdHydro
The build fails at the link step with undefined references to MPI for mpich, which is odd because this is built with mpi-serial, so mpich shouldn't be anywhere in here.
I can see references for mpich in my software_env.txt for my case, which seems odd...
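To see exactly which entries in the case's environment dump mention mpich, a line-numbered search is enough. This is a small sketch; the function name `mpich_lines` is made up, and it assumes software_env.txt is a plain text file in the case directory.

```shell
# Hypothetical helper: print the lines of a file that mention mpich
# (case-insensitive), prefixed with their line numbers.
# Returns non-zero when there are no matches.
mpich_lines() {
    grep -n -i 'mpich' "$1"
}

# Example use from the case directory:
#   mpich_lines software_env.txt
```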