I'm running create_newcase underneath SLURM #98

Open
cponder opened this issue Jan 9, 2025 · 13 comments
Labels: user support (Helping a specific user (or group) with a problem building or running the code)

Comments

cponder commented Jan 9, 2025

I'm trying to generate a test case that will be run under SLURM.
However, I'm also running create_newcase inside a container that doesn't have SLURM installed.
I get this error at the end:

srun: job 1786165 queued and waiting for resources
srun: job 1786165 has been allocated resources
pyxis: importing docker image: registry.gitlab.com/cponder/containers/ubuntu-pgi-openmpi/eos:latest
pyxis: imported docker image: registry.gitlab.com/cponder/containers/ubuntu-pgi-openmpi/eos:latest
Compset longname is 2000_CAM%HS94_SLND_SICE_SOCN_SROF_SGLC_SWAV
Compset specification file is /lustre/fsw/coreai_devtech_all/cponder/EarthWorks/2025-01-08_2.4/Source/components/cam//cime_config/config_compsets.xml
Automatically adding SESP to compset
Compset forcing is 1972-2004
ATM component is CAM simplified and non-versioned physics :CAM dry Held-Suarez forcing (Held and Suarez (1994)):
LND component is Stub land component
ICE component is Stub ice component
OCN component is Stub ocn component
ROF component is Stub river component
GLC component is Stub glacier (land ice) component
WAV component is Stub wave component
ESP component is Stub external system processing (ESP) component
Pes     specification file is /lustre/fsw/coreai_devtech_all/cponder/EarthWorks/2025-01-08_2.4/Source/components/cam//cime_config/config_pes.xml
Machine is Eos
Pes setting: grid          is a%mpasa120_l%null_oi%null_r%null_g%null_w%null_z%null_m%gx1v7 
Pes setting: compset       is 2000_CAM%HS94_SLND_SICE_SOCN_SROF_SGLC_SWAV_SESP 
Pes setting: tasks       is {'NTASKS_ATM': -1, 'NTASKS_LND': -1, 'NTASKS_ROF': -1, 'NTASKS_ICE': -1, 'NTASKS_OCN': -1, 'NTASKS_GLC': -1, 'NTASKS_WAV': -1, 'NTASKS_CPL': -1} 
Pes setting: threads     is {'NTHRDS_ATM': 1, 'NTHRDS_LND': 1, 'NTHRDS_ROF': 1, 'NTHRDS_ICE': 1, 'NTHRDS_OCN': 1, 'NTHRDS_GLC': 1, 'NTHRDS_WAV': 1, 'NTHRDS_CPL': 1} 
Pes setting: rootpe      is {'ROOTPE_ATM': 0, 'ROOTPE_LND': 0, 'ROOTPE_ROF': 0, 'ROOTPE_ICE': 0, 'ROOTPE_OCN': 0, 'ROOTPE_GLC': 0, 'ROOTPE_WAV': 0, 'ROOTPE_CPL': 0} 
Pes setting: pstrid      is {} 
Pes other settings: {}
Pes other settings append: {}
Pes comments: none
setting additional fields from config_pes: {}, append {}
 Compset is: 2000_CAM%HS94_SLND_SICE_SOCN_SROF_SGLC_SWAV_SESP 
 Grid is: a%mpasa120_l%null_oi%null_r%null_g%null_w%null_z%null_m%gx1v7 
 Components in compset are: ['cam', 'slnd', 'sice', 'socn', 'srof', 'sglc', 'swav', 'sesp'] 
No charge_account info available, using value from PROJECT
cesm model version found: release-ew2.4
Batch_system_type is slurm
job is case.run USER_REQUESTED_WALLTIME None USER_REQUESTED_QUEUE None WALLTIME_FORMAT %H:%M:%S
WARNING: No queue on this system met the requirements for this job. Falling back to defaults
ERROR: No queues found

What is it trying to use the queue for?

@briandobbins

This is almost certainly because in your ccs_config/machines/ file, you have CONFIG_BATCH set to slurm. Try setting it to none.
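
For reference, a minimal sketch of that change as it would appear in the machine entry under ccs_config/machines/ (the element name matches the <BATCH_SYSTEM> line quoted later in this thread; the exact file layout for your machine is an assumption):

<!-- in the Eos entry of config_machines.xml -->
<BATCH_SYSTEM>none</BATCH_SYSTEM>   <!-- was: slurm -->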

To be sure, what's your full 'create_newcase' command? And what 'machine' are you running on (e.g., ./xmlquery MACH)? And what do the contents of that ccs_config/machines//config_machines.xml look like?

@gdicker1 (Contributor)

@briandobbins thanks for that suggestion! For your other questions, see #96 where I asked many of the same questions.

cponder (Author) commented Jan 10, 2025

Ok that got me past the failure.

Batch_system_type is none
 Creating Case directory /lustre/fsw/coreai_devtech_all/cponder/EarthWorks/2025-01-08_2.4/Cases/FHS94.mpasa120

There's this file here:

Cases/FHS94.mpasa120/case.submit

but I can't tell how it's supposed to be run; I don't see any SLURM details in there.

@gdicker1 (Contributor)

I think that's to be expected with CONFIG_BATCH set to none; you're asking CIME (CESM) not to use any batch details, so the initial setup for the case doesn't provide them. (Which would be reasonable if you can't see the queues from inside your container.)

I think you'd get around this by managing the SLURM plus container details yourself and then using ./case.submit --no-batch in whatever final script starts CESM from within your container.
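
A rough sketch of what that could look like, assuming a pyxis-style srun launch like the one in your log (the sbatch directives, image name, and mount are placeholders to adapt, not anything CIME generates for you):

#!/bin/bash
#SBATCH --job-name=FHS94.mpasa120
#SBATCH --nodes=1
#SBATCH --time=01:00:00
# Start the container on the allocation, then launch the case from inside it,
# telling CIME not to go back through the batch system.
srun --container-image=registry.gitlab.com/cponder/containers/ubuntu-pgi-openmpi/eos:latest \
     --container-mounts=/lustre:/lustre \
     bash -c 'cd /lustre/fsw/coreai_devtech_all/cponder/EarthWorks/2025-01-08_2.4/Cases/FHS94.mpasa120 && ./case.submit --no-batch'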

One other idea might be to not set CONFIG_BATCH to none and instead run create_newcase with --non-local (from inside the container). This may allow create_newcase to succeed; you should then be able to do the setup and build steps from inside the container too, and the scripts to run CESM under SLURM would be configured correctly (depending on what you provide in the config_batch.xml file).
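
For concreteness, a create_newcase invocation along those lines might look like this (the compset alias, resolution, and case name are taken from the log at the top of the issue; the script path and machine name are assumptions):

./cime/scripts/create_newcase --case ../Cases/FHS94.mpasa120 --compset FHS94 --res mpasa120 --machine eos --non-local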

cponder (Author) commented Jan 13, 2025

If I leave the SLURM setting in place

<BATCH_SYSTEM>slurm</BATCH_SYSTEM>

but create the case with --non-local, I still get the detection failures:

Batch_system_type is slurm
job is case.run USER_REQUESTED_WALLTIME None USER_REQUESTED_QUEUE None WALLTIME_FORMAT %H:%M:%S
WARNING: No queue on this system met the requirements for this job. Falling back to defaults
ERROR: No queues found

I would expect the non-local setting to suppress the local detection.

cponder (Author) commented Jan 13, 2025

This case.submit script, can it be used to generate the SLURM job-control file without submitting it?
Then I can (edit and) submit it manually from outside the container.

@gdicker1 (Contributor)

I would expect the non-local setting to suppress the local detection.

Huh, I would expect that too.

Looking at it a bit, I think this is because the slurm entry in ccs_config/machines/config_batch.xml doesn't define any queues. The error comes from not finding queues defined in a file, not necessarily on the system. If you want to keep trying --non-local, I think you would need to add some (even fake) queue info.
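
A minimal sketch of that kind of (possibly fake) queue entry, following the layout used elsewhere in config_batch.xml (the queue name and limits here are made up; match them to whatever partition you actually submit to from outside the container):

<batch_system MACH="eos" type="slurm">
  <queues>
    <queue walltimemax="01:00:00" nodemax="64" default="true">batch</queue>
  </queues>
</batch_system>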

Really it might be best to just leave BATCH_SYSTEM as none, and handle the batch configuration on your own for now.

This case.submit script, can it be used to generate the SLURM job-control file without submitting it?

I'm unsure. The .case.run script inside a case may be closest to what you're looking for. I know that case.submit does whatever is needed so that the .case.run script (and really the mpirun ... cesm.exe ... command inside it) is started correctly.

gdicker1 added the user support label Jan 13, 2025
briandobbins commented Jan 13, 2025

Yeah, just to add briefly to this -- I often, when debugging, just write a very short script that changes into the run directory for a case, and calls 'mpiexec ../bld/cesm.exe'. You can easily write your own to do that if the model is building.
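
For what it's worth, a minimal sketch of that kind of script (the run-directory path and task count are placeholders; ../bld/cesm.exe is relative to the case's run directory, as above):

#!/bin/bash
# Bypass case.submit: go to the case's run directory and start the model directly.
cd /path/to/scratch/FHS94.mpasa120/run    # e.g. whatever ./xmlquery RUNDIR reports
mpiexec -np 64 ../bld/cesm.exe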

Alternatively, as Dylan alludes to, you can also add a queue inside your config_machines.xml, and it'll generate the .case.run script for Slurm. All that ./case.submit does is some checks, reads some queue settings, and submits (via 'sbatch' for Slurm) the .case.run file.

The only thing you need to watch out for when bypassing the full method is changing namelists: if you do, you'll need to run ./preview_namelists manually, or the changes won't propagate.
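
In other words, something like this after any hand edit to a user_nl_* file (the namelist variable here is purely illustrative):

echo " some_namelist_var = .true." >> user_nl_cam   # hypothetical namelist change
./preview_namelists                                 # regenerate the run-time namelists so the change propagates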

cponder (Author) commented Jan 13, 2025

I'm inclined to run these components individually so I can control the invocations more directly.
So after running create_newcase, I suppose I should run these in sequence?

create_newcase
case.build
case.cmpgen_namelists
case.setup
check_input_data
srun ... cesm.exe 
check_case

I don't much care if any of the steps are redundant after the first usage; I'm more concerned about bulletproofing the sequence of steps that I'd be running.

gdicker1 (Contributor) commented Jan 13, 2025

I think I would go more for:

./create_newcase
./case.setup
./case.build
./preview_namelists && ./check_input_data
srun ... cesm.exe

Basically that:

  • (At least in my flow) setup is always the step after create_newcase. CIME will complain if you try to build before setup.
  • I use ./preview_namelists && ./check_input_data because that always works as long as case.setup completed. The build step (IIRC) generates the namelists, but I usually just run these two together. I haven't used case.cmpgen_namelists directly; it's usually called by other scripts. I'd consider the input-data check optional, but worth doing.
  • check_case is a good idea, but may produce some wild-goose-chase errors. I would run it on the side and look at its error statements.
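
Put together as one script run from the case directory (paths taken from the log above; the srun step belongs wherever SLURM is actually visible, so treat this as a sketch rather than a recipe):

#!/bin/bash
set -e
cd /lustre/fsw/coreai_devtech_all/cponder/EarthWorks/2025-01-08_2.4/Cases/FHS94.mpasa120
./case.setup
./case.build
./preview_namelists && ./check_input_data
# Launch from the run directory, as suggested earlier in the thread; -n 64 is a placeholder.
cd "$(./xmlquery --value RUNDIR)"
srun -n 64 ../bld/cesm.exe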

cponder (Author) commented Jan 14, 2025

The case.setup step gives this failure:

ERROR: Could not find a matching MPI for attributes: {'compiler': 'nvhpc', 'mpilib': 'openmpi', 'threaded': False, 'queue': '', 'unit_testing': False, 'comp_interface': 'nuopc'}

I'll be using srun to run the MPI jobs, but it's not visible here since I'm inside the container.
Is there some "MPI=none" setting that I need to use? I'll still need to use the mpicc etc. compilers that are inside the container, though.

@briandobbins

Carl, can you share the container image or Docker / Singularity file to make it?

These are almost certainly all simple omissions in the configuration file, but it's hard for us to debug without being able to fully see how things are configured, and try changes out.

In this specific case, are you defining an 'openmpi' option for modules with the NVHPC compiler? The case.setup shouldn't need to know anything outside the container, so I don't think it should matter that it doesn't find your actual srun. (I'm assuming you're using the container to run, too, and thus the MPI in the container is compatible with the launch mechanism for the container on the cluster.)
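
For reference, the "Could not find a matching MPI" error means none of the <mpirun> blocks in the machine's config_machines.xml entry matched those attributes (mpilib openmpi, etc.). A sketch of an entry that would match, assuming the in-container Open MPI is launched with mpiexec (the executable and arguments are assumptions to adjust):

<mpirun mpilib="openmpi">
  <executable>mpiexec</executable>
  <arguments>
    <arg name="num_tasks">-np {{ total_tasks }}</arg>
  </arguments>
</mpirun>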

cponder (Author) commented Jan 14, 2025

OK, I uploaded the Dockerfile here:
Dockerfile_24.3.txt
I had to rename it to .txt for the uploader to accept it.
