Timeout errors from qstat cause UNAVAILABLE status #110
Comments
Rocoto has no way to control the behavior of the host machine and batch system. The commands take as long as they take. The PBSPro interface has already been highly tuned to be as performant as possible (after observing issues on Cheyenne) so, while I can take another look, it's very unlikely further optimization is possible. PBSPro has a history of being prone to scaling/threading problems and it has been easy to overwhelm it. You can try running qstat commands manually and see how long they take:
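One way to measure this from a script is sketched below. This is a hypothetical helper, not part of Rocoto: the name `time_command` and its defaults are assumptions made for illustration, and the 45-second default simply mirrors the timeout mentioned in the issue description. It runs any command, discards its output, and reports how long it took and whether it exceeded the limit.

```python
import subprocess
import time


def time_command(cmd, timeout=45.0):
    """Run cmd and return (elapsed_seconds, timed_out).

    Hypothetical helper for illustration only; the 45-second default
    is taken from the timeout quoted in the issue, not from Rocoto's code.
    """
    start = time.monotonic()
    try:
        # Output is discarded; only the duration matters here.
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
        timed_out = False
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception.
        timed_out = True
    return time.monotonic() - start, timed_out
```

On a login node where `qstat` is on the `PATH`, calling `time_command(["qstat"])` several times in a row would give a rough picture of how the batch system is responding at that moment.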
You can also set |
Another thing to keep in mind. If Rocoto cannot get the status of a job because status commands (e.g. qstat, squeue, etc.) hang and time out, Rocoto will mark the status as "UNAVAILABLE" because the status is not knowable. If the system issues persist long enough that the job status is purged by the batch system, because it only keeps status of jobs that have completed in the recent past, then the "UNAVAILABLE" status will become permanent. However, if the system recovers from the failure, a status update will succeed and the status will be retrieved. There isn't anything Rocoto can do other than make multiple attempts to retrieve the status and time out commands that hang.
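The retry-and-time-out behavior described above can be sketched roughly as follows. This is not Rocoto's implementation (Rocoto is written in Ruby); the function name `query_status`, the number of attempts, and the use of the exit code as the success test are all assumptions made for illustration. The sketch only shows why a hung status command ends in "UNAVAILABLE" rather than in a guessed job state.

```python
import subprocess

UNAVAILABLE = "UNAVAILABLE"


def query_status(cmd, timeout=45.0, attempts=3):
    """Return the status command's output, or UNAVAILABLE if every attempt fails.

    Hypothetical sketch: the attempt count and the 45-second timeout
    are assumptions, not values read from Rocoto.
    """
    for _ in range(attempts):
        try:
            result = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout
            )
        except subprocess.TimeoutExpired:
            # The command hung; it has been killed, so try again.
            continue
        if result.returncode == 0:
            return result.stdout
    # The status is not knowable right now; a later call may still succeed.
    return UNAVAILABLE
```

A caller that treats `UNAVAILABLE` as "ask again later" rather than as a failure matches the recovery case described above, where a later status update succeeds once the batch system is responsive again.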
Thanks for the explanation, @christopherwharrop-noaa. I see now that our CI system does not check for |
Resolved by NOAA-EMC/global-workflow#2820.
On occasion, rocotostat will fail to get the status of a job via qstat within 45 seconds. This results in an UNAVAILABLE status being reported for the job.
An example log file is available here /u/terry.mcguinness/ROCOTO.org/1.3.5/C96_atmaerosnowDA_d443bf9c/log.20240808 on WCOSS2:
More details on this failure are available in these comments:
NOAA-EMC/global-workflow#2755 (comment)
NOAA-EMC/global-workflow#2755 (comment)