Two tests that fail early on in the driver/atm #2914

Open
ekluzek opened this issue Dec 18, 2024 · 2 comments
Labels
bug: something is working incorrectly
external: issue needs to be addressed elsewhere (submodule); issue here for the sake of project tracking
investigation: needs to be verified and more investigation into what's going on

Comments

@ekluzek
Collaborator

ekluzek commented Dec 18, 2024

Brief summary of bug

In what will be ctsm5.3.016, the following two tests fail at the run step, during initialization:

ERP_P64x2_Ld765.f10_f10_mg37.I2000Clm60BgcCrop.derecho_intel.clm-monthly (NLCOMP RUN)
ERS_P128x1_Ld765.f10_f10_mg37.I2000Clm60Fates.derecho_intel.clm-FatesColdNoComp (NLCOMP RUN)

General bug information

CTSM version you are using: ctsm5.3.016

Does this bug cause significantly incorrect results in the model's science? No

Configurations affected: Maybe ER tests for 765 days?

Details of bug

Important details of your setup / configuration so we can reproduce the bug

The initial case runs fine; it's the restart step that fails, in the case2/$CASE directory.

Important output or errors that show the problem

ERP_P64x2_Ld765.f10_f10_mg37.I2000Clm60BgcCrop.derecho_intel.clm-monthly fails as follows; only the cesm.log file exists.

cesm.log

dec2343.hsn.de.hpc.ucar.edu 0:  (t_initf)       profile_outpe_stride=               0
dec2343.hsn.de.hpc.ucar.edu 0:  (t_initf)       profile_single_file=      F
dec2343.hsn.de.hpc.ucar.edu 0:  (t_initf)       profile_global_stats=     T
dec2343.hsn.de.hpc.ucar.edu 0:  (t_initf)       profile_ovhd_measurement= F
dec2343.hsn.de.hpc.ucar.edu 0:  (t_initf)       profile_add_detail=       F
dec2343.hsn.de.hpc.ucar.edu 0:  (t_initf)       profile_papi_enable=      F
dec2343.hsn.de.hpc.ucar.edu 0:  ESMF_Finalize: Error closing trace stream
dec2343.hsn.de.hpc.ucar.edu 0: MPICH ERROR [Rank 0] [job id 2dd16cc6-e949-427e-bb59-48726c16f9fa] [Wed Dec 18 15:47:41 2024] [dec2343] - Abort(1) (rank 0 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 0
dec2343.hsn.de.hpc.ucar.edu 0: 
dec2343.hsn.de.hpc.ucar.edu 0: forrtl: severe (174): SIGSEGV, segmentation fault occurred
dec2343.hsn.de.hpc.ucar.edu 0: Image              PC                Routine            Line        Source             
dec2343.hsn.de.hpc.ucar.edu 0: libpthread-2.31.s  000015004133C8C0  Unknown               Unknown  Unknown
dec2343.hsn.de.hpc.ucar.edu 0: libmpi_intel.so.1  000015003F2FBE7E  Unknown               Unknown  Unknown
dec2343.hsn.de.hpc.ucar.edu 0: libmpi_intel.so.1  000015003F10A22F  Unknown               Unknown  Unknown
dec2343.hsn.de.hpc.ucar.edu 0: libmpi_intel.so.1  000015003D7376A8  MPI_Abort             Unknown  Unknown
dec2343.hsn.de.hpc.ucar.edu 0: libesmf.so         0000150049332277  _ZN5ESMCI3VMK5abo     Unknown  Unknown
dec2343.hsn.de.hpc.ucar.edu 0: libesmf.so         0000150049330814  _ZN5ESMCI2VM5abor     Unknown  Unknown
dec2343.hsn.de.hpc.ucar.edu 0: libesmf.so         00001500493476E5  c_esmc_vmabort_       Unknown  Unknown
dec2343.hsn.de.hpc.ucar.edu 0: libesmf.so         0000150049B5C7A8  esmf_vmmod_mp_esm     Unknown  Unknown
dec2343.hsn.de.hpc.ucar.edu 0: libesmf.so         00001500499CC1EE  esmf_initmod_mp_e     Unknown  Unknown
dec2343.hsn.de.hpc.ucar.edu 0: cesm.exe           0000000000433ADA  MAIN__                    132  esmApp.F90
dec2343.hsn.de.hpc.ucar.edu 0: cesm.exe           00000000004230FD  Unknown               Unknown  Unknown
dec2343.hsn.de.hpc.ucar.edu 0: libc-2.31.so       000015003C7E129D  __libc_start_main     Unknown  Unknown
dec2343.hsn.de.hpc.ucar.edu 0: cesm.exe           000000000042302A  Unknown               Unknown  Unknown

It looks like the problem is that DRV_RESTART_POINTER is wrong for case2: it names a 2001-01-18 rpointer file, but the only one in the run directory is dated 2001-01-19, as we see here:

./xmlquery DRV_RESTART_POINTER
	DRV_RESTART_POINTER: rpointer.cpl.2001-01-18-00000
(ctsm_pylib) case2/ERP_P64x2_Ld765.f10_f10_mg37.I2000Clm60BgcCrop.derecho_intel.clm-monthly.GC.ctsm5316acl_int> ls ../../run/rpointer.cpl.*
../../run/rpointer.cpl.2001-01-19-00000
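
To make the mismatch above explicit, here is a minimal Python sketch that parses the date stamp out of the two rpointer names and reports how far apart they are. The helper `rpointer_date` is hypothetical (it is not part of cime), and it assumes the `rpointer.<comp>.YYYY-MM-DD-SSSSS` naming seen in the output above. It uses Python's Gregorian `datetime.date`, which may not match the model calendar, so it is only meant for comparing nearby dates like these two.

```python
from datetime import date


def rpointer_date(name: str) -> date:
    """Parse the date from a name like rpointer.cpl.2001-01-18-00000.

    Hypothetical helper for illustration; assumes the stamp after the
    last '.' has the form YYYY-MM-DD-SSSSS.
    """
    stamp = name.rsplit(".", 1)[-1]              # e.g. "2001-01-18-00000"
    year, month, day, _seconds = stamp.split("-")
    return date(int(year), int(month), int(day))


# Values copied from the xmlquery and ls output in this issue.
requested = "rpointer.cpl.2001-01-18-00000"          # DRV_RESTART_POINTER
on_disk = "../../run/rpointer.cpl.2001-01-19-00000"  # file actually present

offset_days = (rpointer_date(on_disk) - rpointer_date(requested)).days
print(f"DRV_RESTART_POINTER is {offset_days} day(s) behind the file on disk")
```

For the two names in this issue, the requested pointer is one day earlier than the file that was actually written.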

The other problem is that there is no graceful error message saying that the requested rpointer file doesn't exist and what needs to be done about it.
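
As a rough illustration of the kind of error reporting that would help here, the sketch below checks for the requested rpointer file before the run starts and, if it is missing, lists the rpointer files that do exist. This is not how cime or the driver implements anything; `check_restart_pointer` and its message text are hypothetical, and the suggestion to fix things with `xmlchange` is an assumption about what a user would want to do next.

```python
from pathlib import Path


def check_restart_pointer(run_dir, pointer_name):
    """Return the path to the rpointer file, or raise a descriptive error.

    Hypothetical pre-flight check, not part of cime.
    """
    run_dir = Path(run_dir)
    target = run_dir / pointer_name
    if target.is_file():
        return target

    # Look for siblings with the same prefix, e.g. "rpointer.cpl.*".
    prefix = pointer_name.rsplit(".", 1)[0]
    available = sorted(p.name for p in run_dir.glob(prefix + ".*"))
    found = ", ".join(available) if available else "none"
    raise FileNotFoundError(
        f"Restart pointer file '{pointer_name}' not found in {run_dir}. "
        f"Files matching '{prefix}.*': {found}. "
        "Check DRV_RESTART_POINTER (./xmlquery DRV_RESTART_POINTER) and, "
        "if it names the wrong date, correct it with ./xmlchange."
    )
```

Called with the case2 run directory and `rpointer.cpl.2001-01-18-00000`, a check like this would name `rpointer.cpl.2001-01-19-00000` as the file that is actually present, instead of ending in an MPI_Abort and a segmentation fault.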

@ekluzek ekluzek added investigation Needs to be verified and more investigation into what's going on. bug something is working incorrectly external issue needs to be addressed elsewhere (submodule); issue here for the sake of project tracking labels Dec 18, 2024
@ekluzek ekluzek added this to the cesm3_0_beta06 milestone Dec 19, 2024
@ekluzek
Collaborator Author

ekluzek commented Dec 20, 2024

@jedwards4b has a fix for this in a cime PR.

ESMCI/cime#4723

I ran aux_clm and it passes, and these tests compare exactly to ctsm5.3.015, as they should.

@ekluzek
Collaborator Author

ekluzek commented Dec 20, 2024

The cime version with the fix is:

cime6.1.54
