
After #75, Compsets using MPAS-O on GPUs fail during run-time #84

Open
gdicker1 opened this issue Dec 4, 2024 · 3 comments

gdicker1 (Contributor) commented Dec 4, 2024

Although #75 added an initial GPU port of MPAS-Ocean, tests that run MPAS-O compsets on GPUs now fail at run-time: some fields become NaN and MPAS-O aborts.

Tests with GPU builds of the ewm-2.3.006 tag (before the GPU MPAS-O port) succeed.

Example steps to re-create this problem

  1. Clone EarthWorks, using a version equivalent to tag ewm-2.3.010 or later.
  2. Create a case that uses GPUs and some non-simple CAM physics (e.g. F2000dev, which uses cam7 physics).
    • Using GPUs in EarthWorks and CESM is under active development; please ask if you are unsure how to request GPUs for a case.
  3. Run ./case.setup, ./case.build, and ./case.submit.
  4. The simulation will run for some time, then eventually fail because MPAS-O aborts after finding NaNs in its fields.
    • During the run, MPAS-O output goes to a file like fort.99, with extra error information also written to the mpas_ocean_block_stats_${RANK} files (a quick grep for locating these is sketched just after this list).
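A quick way to locate the failing output (a sketch only; it assumes the file names from step 4 and the standard CIME run-directory layout):

# Find the run directory from the case directory, then list the MPAS-O log and
# block-stats files that contain the NaN abort message from step 4 above.
RUNDIR=$(./xmlquery RUNDIR --value)
grep -l "NaN Detected" "${RUNDIR}"/fort.* "${RUNDIR}"/mpas_ocean_block_stats_*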

Example output

Excerpt from a mpas_ocean_block_stats_0 file (footnote 1):

  ERROR: NaN Detected in state see below for which field contained a NaN.
    -- Statistics information for block fields
      Field: latCell
          Min:   -0.6014597403056250
          Max:   -0.1402792456120920
      Field: lonCell
          Min:     4.136882946640790
          Max:     4.613029934416640
      Field: xCell
          Min:    -3356045.861461610
          Max:    -605355.3493913120
      Field: yCell
          Min:    -6131081.962771160
          Max:    -4865592.014447640
      Field: zCell
          Min:    -3605138.567225930
          Max:    -890822.8347366140
      Field: areaCell
          Min:     12398456274.30820
          Max:     12854325071.01830
 
 
  ERROR: NaN Detected in layerThickness.
    -- Statistics information for layerThickness fields
...

Footnotes

  1. Full file path on Derecho: "/glade/derecho/scratch/gdicker/ewv24_2024Nov18170000/ew-v24test-gpu/ERS_Ln9_P64_G4-a100-openacc_Vnuopc.T62_oQU120.MPASOOnly.derecho_nvhpc.ew-outfrq9s.G.20241118_170035_wv9uyx/run/mpas_ocean_block_stats_0"

gdicker1 added the invalid, external, EW specific, and OpenACC labels on Dec 4, 2024
dazlich (Contributor) commented Jan 13, 2025

@gdicker1 Rich brought this issue to my attention. I'd like to test this out - do you have a script I can use to try this?

gdicker1 (Contributor, Author) commented

@dazlich, sure!

# Pick any case directory path here.
CASEDIR="you name it"
./cime/scripts/create_newcase --case "${CASEDIR}" --res mpasa120_oQU120 --compset CHAOS2000dev \
    --project UCSU0085 --input-dir /glade/campaign/univ/ucsu0085/inputdata --output-root "${CASEDIR}/.." \
    --driver nuopc --compiler nvhpc --ngpus-per-node 4 --gpu-type a100 --gpu-offload openacc
cd "${CASEDIR}"
./case.setup
# Build the shared libraries and the model in separate batch jobs.
qcmd -A UCSU0085 -l walltime=06:00:00 -- ./case.build --sharedlib-only
qcmd -A UCSU0085 -l walltime=06:00:00 -- ./case.build --model-only
./case.submit

Very little needs to change. Just note:

  • The last four arguments to create_newcase are the most important: you must use the nvhpc compiler and supply correct values for the three GPU-related arguments.
  • The test infrastructure builds the shared libraries and the model in separate steps, and I think this also speeds up the NVHPC build. Builds still take a long time with NVHPC (even longer for GPU builds than CPU-only), but this approach seems faster to me.
  • CHAOS2000dev isn't required; I got the "NaN Detected in layerThickness" error with an "MPASOOnly" compset in this Derecho test directory: "/glade/derecho/scratch/gdicker/ewv24_2024Nov18170000/ew-v24test-gpu/ERS_Ln9_P64_G4-a100-openacc_Vnuopc.T62_oQU120.MPASOOnly.derecho_nvhpc.ew-outfrq9s.G.20241118_170035_wv9uyx" (a hedged create_test sketch for this test follows this list).
    • MPASOOnly = 2000_DATM%NYF_SLND_DICE%SSMI_MPASO_SROF_SGLC_SWAV_SESP
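If you would rather go through the test infrastructure directly, something like the following should recreate the failing test named above (a sketch only: the test name is taken from that directory name, while --test-id and the project account are values you may want to change):

# Hypothetical sketch: recreate the ERS GPU test via CIME's create_test driver.
# The long test name is copied from the Derecho test directory above; --test-id is a placeholder.
./cime/scripts/create_test ERS_Ln9_P64_G4-a100-openacc_Vnuopc.T62_oQU120.MPASOOnly.derecho_nvhpc.ew-outfrq9s \
    --project UCSU0085 --test-id gpu-nan-repro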

dazlich (Contributor) commented Jan 15, 2025

I've dug a little further. I've run GPU and CPU (nvhpc) cases for ewm-2.3.006 and ewm-2.3.007, using both the split-explicit and split-implicit time integration schemes.

  • All CPU cases complete successful two-month simulations.
  • ewm-2.3.006, split-explicit: fails after about 15 days, dying from signal 15. The mpas_ocean_block_stats files are empty, so apparently there was a state-validation failure but no useful diagnostic message.
  • ewm-2.3.006, split-implicit: fails after about two days, again from signal 15. There is an mpas_ocean_block_stats file implying a state-validation failure, but it is again empty.
  • ewm-2.3.007, split-explicit: fails on timestep 1. Here the messages are explicit (state-validation failure) and the mpas_ocean_block_stats files contain data. The exit code is 255.
  • ewm-2.3.007, split-implicit: fails on timestep 1, this time because it exceeds an iteration limit. The exit code is 255.

The code never ran satisfactorily on GPU, but the ewm-2.3.007 tag fails immediately. I will now see if I can track where the GPU solutions diverge from the CPU solutions (one possible comparison is sketched below).
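For that comparison, one option (just a sketch, with placeholder paths and file names; it assumes a cprnc executable is available, e.g. built from the cime repository) is to diff matching MPAS-O history files from a CPU run and a GPU run field by field:

# Placeholder paths: point these at the run directories of a CPU case and a GPU case.
CPU_RUN=/path/to/cpu_case/run
GPU_RUN=/path/to/gpu_case/run
HIST=hist.mpaso.0001-01-01_00.00.00.nc   # placeholder history file name

# cprnc reports per-field min/max/RMS differences between two netCDF files.
cprnc "${CPU_RUN}/${HIST}" "${GPU_RUN}/${HIST}" | tail -n 40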

dazlich self-assigned this Jan 17, 2025