Running from cron no longer possible on Derecho #998

Open
mkavulich opened this issue Jan 2, 2024 · 6 comments
Labels
bug Something isn't working documentation Improvements or additions to documentation

Comments

@mkavulich
Collaborator

mkavulich commented Jan 2, 2024

Expected behavior

Previously the USE_CRON_TO_RELAUNCH=true option worked on all Tier-1 platforms (as far as I know).

Current behavior

Due to a policy change with the new machine, there are now special procedures for setting up cron tables on Derecho. Because these procedures require logging in to a separate machine, the workflow cannot modify the crontab automatically, so it is not feasible to support running the workflow automatically (USE_CRON_TO_RELAUNCH=true) on Derecho.

Machines affected

Derecho

Steps To Reproduce

1. Generate an experiment with the USE_CRON_TO_RELAUNCH=true option on Derecho
2. Observe that the workflow does not run.

So far, it looks like cron jobs do still work on Derecho. But I have been told by CISL that we need to migrate away from this system ASAP.

Detailed Description of Fix

The User's Guide run instructions will need to be updated with Derecho-specific instructions, as the crontab functionality is currently the recommended way to run the workflow.

Possible Implementation

One way to get around this would be to leverage the WE2E functionality currently present for running and monitoring experiments for general use. This would require some tweaking of the current setup to be more user-friendly outside of the WE2E context.
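As a rough illustration of what a cron-free relauncher could look like, here is a minimal Python sketch that repeatedly calls a launch step until it reports a terminal status. Everything in it is an assumption made for illustration: the function name `relaunch_until_done`, the status strings, and the idea that the launch step returns a status are hypothetical and are not taken from the WE2E code.

```python
import time
from typing import Callable

# Hypothetical status strings; the real workflow may report status differently.
TERMINAL_STATUSES = ("SUCCESS", "FAILURE")


def relaunch_until_done(
    launch: Callable[[], str],
    interval_s: float = 180.0,
    max_iterations: int = 100,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call launch() repeatedly until it returns a terminal status.

    `launch` stands in for whatever command a user would otherwise run by
    hand (or that cron would run); it must return a status string.
    Returns the last status seen, which is non-terminal if the iteration
    limit is reached first.
    """
    status = "IN PROGRESS"
    for _ in range(max_iterations):
        status = launch()
        if status in TERMINAL_STATUSES:
            return status
        # Wait before the next relaunch, as cron would between invocations.
        sleep(interval_s)
    return status
```

The `sleep` parameter is injectable so the loop can be exercised without real delays. A loop like this has to keep running in a login session or a batch job, which is the main trade-off compared with cron.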

Output (optional)

Currently, the error message that appears in log.launch_FV3LAM_wflow is:

Running "module reset". Resetting modules to system default. The following $MODULEPATH directories have been removed: None
ERROR:
Loading of platform-specific module file (WFLOW_MOD_FN) for the workflow 
task failed:
  WFLOW_MOD_FN = "wflow_derecho"

But that may change in the future, as I was informed by CISL that the crontab functionality will stop working altogether at some point.

@SarahLu-NOAA

@mkavulich
My SRW experiment on Derecho has been sitting in the queue for 12 hours, since last night. The experiment with USE_CRON_TO_RELAUNCH=true ran fine until then. You posted this issue last week. Are my jobs sitting in the queue because Derecho is very busy, or because the cron option in UFS/SRW is no longer working or supported?

@mkavulich
Collaborator Author

@SarahLu-NOAA It looks like the cron jobs have not yet been disabled (though this change is "imminent"), so this is likely unrelated to the issues you saw.

@gspetro-NOAA
Collaborator

@mkavulich In the chapters for running SRW, it explains how to run with or without cron. Do you think we should just add a note that on Derecho, people should use the methods to run without cron? Can't think what else would be needed, but feel free to suggest something!

@gspetro-NOAA gspetro-NOAA added the documentation Improvements or additions to documentation label Dec 6, 2024
@SarahLu-NOAA

@mkavulich and @gspetro-NOAA
While this conversation is between the two of you, I would like to chime in if you don't mind. I anticipate that a significant number of UFS users will use Derecho, so such a note would be helpful and would keep the EPIC developers from being burdened with the same cron-related questions.

@gspetro-NOAA
Collaborator

Thanks for the input @SarahLu-NOAA !
After further inquiry, it appears that cron jobs do still run on Derecho, so there is no need for a special note in the documentation after all. The special procedure that @mkavulich pointed out is used to set up long-term cron jobs outside of an HPC system so that system maintenance won't adversely affect cron jobs.
Since cron runs fine on Derecho, I will close this issue. Folks interested in long-term cron jobs on Derecho can take a look at cron.hpc.ucar.edu, but this shouldn't affect most SRW users on Derecho.

@mkavulich
Collaborator Author

We have recently received email notices that the old method of running with cron on Derecho will be discontinued on February 11th, so this issue needs to be addressed soon. Would it be worth including logic so that USE_CRON_TO_RELAUNCH is disabled on Derecho?
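A minimal sketch of what such logic might look like, in Python. The function name `resolve_cron_relaunch`, the `CRON_UNSUPPORTED_MACHINES` set, and the choice to warn and fall back rather than abort are all assumptions for illustration; none of this is taken from the existing SRW setup code.

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical list of machines where the workflow cannot edit the crontab.
CRON_UNSUPPORTED_MACHINES = frozenset({"DERECHO"})


def resolve_cron_relaunch(machine: str, use_cron_to_relaunch: bool) -> bool:
    """Return the effective USE_CRON_TO_RELAUNCH value for a machine.

    If cron relaunch was requested on a machine that does not support it,
    log a warning and return False so the user relaunches manually instead.
    """
    if use_cron_to_relaunch and machine.strip().upper() in CRON_UNSUPPORTED_MACHINES:
        logger.warning(
            "USE_CRON_TO_RELAUNCH is not supported on %s; disabling it. "
            "Relaunch the workflow manually instead.",
            machine,
        )
        return False
    return use_cron_to_relaunch
```

Whether to silently disable the option, warn as above, or stop experiment generation with an error is a design choice for the maintainers; a warning keeps existing configurations usable while making the change visible.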

@mkavulich mkavulich reopened this Jan 23, 2025

3 participants