Timeout errors from qstat cause UNAVAILABLE status #110

DavidHuber-NOAA · 2024-08-09T14:55:14Z

Occasionally, rocotostat fails to get a job's status from qstat within 45 seconds. When that happens, the job is reported with an UNAVAILABLE status.

An example log file is available on WCOSS2 at /u/terry.mcguinness/ROCOTO.org/1.3.5/C96_atmaerosnowDA_d443bf9c/log.20240808:

08/08/24 20:56:49 UTC :: C96_atmaerosnowDA_d443bf9c.xml :: WARNING! The command 'qstat -x -f 147475818 | sed -e ':a' -e 'N' -e '$!ba' -e 's/\n\t/ /g'' timed out after 45 seconds.
08/08/24 20:56:49 UTC :: C96_atmaerosnowDA_d443bf9c.xml :: Timeout::Error

More details on this failure are available in these comments:
NOAA-EMC/global-workflow#2755 (comment)
NOAA-EMC/global-workflow#2755 (comment)


DavidHuber-NOAA · 2024-08-09T15:00:43Z

FYI @TerryMcGuinness-NOAA

christopherwharrop-noaa · 2024-08-09T15:22:09Z

Rocoto has no way to control the behavior of the host machine and batch system. The commands take as long as they take. The PBSPro interface has already been highly tuned to be as performant as possible (after observing issues on Cheyenne) so, while I can take another look, it's very unlikely further optimization is possible. PBSPro has a history of being prone to scaling/threading problems and it has been easy to overwhelm it.

You can try running qstat commands manually and see how long they take:

qstat -x -f <joblist> | sed -e ':a' -e 'N' -e '$!ba' -e 's/\n\t/ /g'
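To time a status command by hand against the same 45-second limit, a small wrapper around `timeout` can help. This is only a sketch and not part of Rocoto: the function name `time_status_cmd` is made up for illustration, and it assumes GNU coreutils `timeout` (which exits with status 124 when the limit is hit) is available on the host.

```shell
# time_status_cmd LIMIT CMD [ARGS...]
# Hypothetical helper: run CMD under `timeout` and report either the elapsed
# wall-clock seconds or that the limit (in seconds) was exceeded.
time_status_cmd() {
  tsc_limit="$1"
  shift
  tsc_start=$(date +%s)
  timeout "$tsc_limit" "$@" >/dev/null 2>&1
  tsc_rc=$?
  tsc_end=$(date +%s)
  if [ "$tsc_rc" -eq 124 ]; then
    # GNU timeout returns 124 when it had to kill the command
    echo "TIMEOUT after ${tsc_limit}s"
  else
    echo "exit=${tsc_rc} elapsed=$((tsc_end - tsc_start))s"
  fi
}

# Example (job ID taken from the log above):
#   time_status_cmd 45 qstat -x -f 147475818
```

Running this a few times during a period when UNAVAILABLE statuses appear should show whether qstat itself is slow, independent of Rocoto.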

You can also set JobQueueTimeout and JobAcctTimeout in the rocotorc file to something longer than 45 seconds, but keep in mind that if you're running rocotorun every 60 seconds, you will encounter issues if you make the timeout much longer than 45.
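As a rough sanity check on those settings, the sketch below reads the two timeouts from a rocotorc file and compares them with the interval between rocotorun invocations. It is not a Rocoto tool: the function name `check_rocoto_timeouts` is hypothetical, it assumes the settings appear as `JobQueueTimeout: N` lines (optionally with a leading colon), and the "warn when the timeout is not shorter than the interval" rule is just one conservative reading of the advice above.

```shell
# check_rocoto_timeouts FILE INTERVAL
# Hypothetical helper: print each timeout found in FILE and warn when it is
# not shorter than INTERVAL (seconds between rocotorun invocations).
check_rocoto_timeouts() {
  crt_file="$1"
  crt_interval="$2"
  # Assumed line format: "JobQueueTimeout: 45" or ":JobQueueTimeout: 45"
  sed -n -E 's/^:?(JobQueueTimeout|JobAcctTimeout):[[:space:]]*([0-9]+).*$/\1 \2/p' "$crt_file" |
  while read -r crt_name crt_value; do
    if [ "$crt_value" -ge "$crt_interval" ]; then
      echo "WARN ${crt_name}=${crt_value}s is not shorter than the ${crt_interval}s rocotorun interval"
    else
      echo "OK ${crt_name}=${crt_value}s"
    fi
  done
}
```

For a 60-second cron interval, call it as `check_rocoto_timeouts /path/to/rocotorc 60`, substituting the location of your own rocotorc file.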

christopherwharrop-noaa · 2024-08-09T15:31:50Z

Another thing to keep in mind: if Rocoto cannot get the status of a job because status commands (e.g. qstat, squeue, etc.) hang and time out, Rocoto will mark the status as "UNAVAILABLE" because the status is not knowable. The batch system only keeps the status of jobs that have completed in the recent past, so if the system issues persist long enough for the job status to be purged, the "UNAVAILABLE" status becomes permanent. However, if the system recovers from the failure first, a later status update will succeed and the status will be retrieved. There isn't anything Rocoto can do other than make multiple attempts to retrieve the status and time out commands that hang.
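Because UNAVAILABLE can clear on a later status update, a script that polls rocotostat may want to treat it as "check again" rather than as a failure. The sketch below shows one way to do that; it is not the handling used by Rocoto or by any particular CI system. The function name `classify_rocoto_states` is hypothetical, and it assumes rocotostat-style rows where the state is the fourth whitespace-separated field (CYCLE TASK JOBID STATE ...).

```shell
# classify_rocoto_states
# Hypothetical helper: read rocotostat-style rows on stdin and print
#   "failed" if any task is DEAD,
#   "retry"  if none is DEAD but at least one is UNAVAILABLE,
#   "ok"     otherwise.
classify_rocoto_states() {
  awk '
    $4 == "DEAD"        { dead = 1 }
    $4 == "UNAVAILABLE" { unavailable = 1 }
    END {
      if (dead)             print "failed"
      else if (unavailable) print "retry"
      else                  print "ok"
    }'
}
```

A caller could pipe rocotostat output into this function and, on "retry", wait for the next rocotorun cycle before deciding anything. A separate cap on the number of retries would still be needed for the case where the status has been purged and never recovers.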

DavidHuber-NOAA · 2024-08-09T21:09:12Z

Thanks for the explanation, @christopherwharrop-noaa. I see now that our CI system does not check for UNAVAILABLE statuses, so I have added handling for such statuses in NOAA-EMC/global-workflow#2820. I'll close this issue once that PR is merged.

DavidHuber-NOAA · 2024-08-14T12:29:57Z

Resolved by NOAA-EMC/global-workflow#2820.

DavidHuber-NOAA mentioned this issue Aug 9, 2024 in NOAA-EMC/global-workflow#2755, "Add fixes to products for when REPLAY IC's are used" (merged)

DavidHuber-NOAA mentioned this issue Aug 9, 2024 in NOAA-EMC/global-workflow#2820, "Hotfix: Handle UNAVAILABLE rocoto status in Bash CI" (merged)

DavidHuber-NOAA closed this as completed Aug 14, 2024
