Hotfix: Handle UNAVAILABLE rocoto status in Bash CI #2820

Merged

DavidHuber-NOAA
Contributor

@DavidHuber-NOAA DavidHuber-NOAA commented Aug 9, 2024

Description

From time to time, PBS Pro cannot return a qstat response within the time limit set by rocoto (the default is 45 seconds). When that happens, rocoto reports an UNAVAILABLE status for the affected job. This PR adds a check for that status so that CI processing can continue.
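
A minimal sketch of the kind of check described above, not the code from this PR's diff. It assumes rocotostat-style text on stdin in which each job line contains its state as a whole word. The function name `needs_recheck` and the `WAIT`/`OK` return strings are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not taken from this PR.
# Reads rocotostat-style text on stdin and prints:
#   WAIT - at least one job line reports UNAVAILABLE or UNKNOWN
#   OK   - no such job line was seen
needs_recheck() {
    local stat_output num_unavailable num_unknown
    stat_output=$(cat)
    # grep -c prints the number of matching lines; it exits non-zero when
    # the count is 0, so "|| true" keeps that from aborting under "set -e".
    num_unavailable=$(grep -cw "UNAVAILABLE" <<< "${stat_output}" || true)
    num_unknown=$(grep -cw "UNKNOWN" <<< "${stat_output}" || true)
    if (( num_unavailable + num_unknown > 0 )); then
        echo "WAIT"
    else
        echo "OK"
    fi
}
```

A caller could pipe the status output into `needs_recheck` and skip its failure or stall handling for this pass when the result is `WAIT`.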

Refs #2755 christopherwharrop/rocoto#110

Type of change

  • New feature (adds functionality)

Change characteristics

  • Is this a breaking change (a change in existing functionality)? NO
  • Does this change require a documentation update? NO
  • Does this change require an update to any of the following submodules? NO (If YES, please add a link to any PRs that are pending.)

How has this been tested?

Visual inspection

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings
  • New and existing tests pass with my changes

Collaborator

@TerrenceMcGuinness-NOAA TerrenceMcGuinness-NOAA left a comment

This all needs to be married with the Bash scripts and the pipeline that use it, and tested, before this goes to a PR.
In practice, the CI system flagged PRs that stayed in those UNKNOWN states for too long as Stalled, specifically because they could not advance. I'm not sure yet whether this update would change that.

@TerrenceMcGuinness-NOAA
Collaborator

TerrenceMcGuinness-NOAA commented Aug 12, 2024

On closer inspection, handling UNAVAILABLE (which, by the way, Rocoto only reports for PBS) is a good idea, but it is not specifically what produces the "fix": the script would have continued on to the STALL check anyway, and that is still a logically valid check. It seems to solve the issue because it simply gives the system more time to self-repair in this state. So this extra step of continuing to check is indeed helpful and still valid.

@DavidHuber-NOAA
Contributor Author

After discussion with @TerrenceMcGuinness-NOAA, I will also add a check for UNKNOWN statuses.

@DavidHuber-NOAA DavidHuber-NOAA self-assigned this Aug 12, 2024
@TerrenceMcGuinness-NOAA
Collaborator

TerrenceMcGuinness-NOAA commented Aug 13, 2024

@DavidHuber-NOAA what David means is that he is adding the "extra" wait in these cases of UNKNOWN and UNAVAILABLE to give the system extra time to self-repair with subsequent runs of rocotorun. UNAVAILABLE is logically equivalent to UNKNOWN but more specific to PBS. The extra logic may also cover corner cases when all states are UNAVAILABLE, adding robustness to the checker.
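
The extra wait described in this comment can be sketched as a bounded polling loop. This is an illustration under stated assumptions, not the logic merged in this PR: `poll_until_known` is a hypothetical name, and the command passed to it stands in for whatever step reruns rocotorun and reports a single state string.

```bash
#!/usr/bin/env bash
# Hypothetical sketch, not taken from this PR.
# Usage: poll_until_known MAX_TRIES SLEEP_SECONDS CMD [ARGS...]
# CMD must print one state word. The loop keeps polling while that state
# is UNAVAILABLE or UNKNOWN, which gives the scheduler time to recover.
poll_until_known() {
    local max_tries=$1 sleep_seconds=$2
    shift 2
    local try state=""
    for (( try = 1; try <= max_tries; try++ )); do
        state=$("$@")
        case "${state}" in
            UNAVAILABLE | UNKNOWN)
                # Transient scheduler state: wait, then ask again.
                sleep "${sleep_seconds}"
                ;;
            *)
                # Any other state is passed back to the caller as-is.
                echo "${state}"
                return 0
                ;;
        esac
    done
    # Still unresolved after max_tries; report the last state and fail so
    # that the caller can fall through to its own stall handling.
    echo "${state}"
    return 1
}
```

Bounding the number of tries keeps the existing stall check meaningful: a job that never leaves UNAVAILABLE or UNKNOWN is still reported to the caller eventually.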

@WalterKolczynski-NOAA
Contributor

> @DavidHuber-NOAA what David means is that he is adding the "extra" wait in these cases of UNKNOWN and UNAVAILABLE to give the system extra time to self-repair with subsequent runs of rocotorun. UNAVAILABLE is logically equivalent to UNKNOWN but more specific to PBS. The extra logic may also cover corner cases when all states are UNAVAILABLE, adding robustness to the checker.

I think they designate different things. UNKNOWN means the scheduler no longer has information on a job. UNAVAILABLE means the scheduler did not respond before the time-out.

…e_rocoto

* origin/develop:
  Jenkins Pipeline Updates (NOAA-EMC#2815)
  Add Gaea C5 to CI (NOAA-EMC#2814)
  Add support for forecast-only runs on AWS (NOAA-EMC#2711)
  Add fixes to products for when REPLAY IC's are used  (NOAA-EMC#2755)
  Add capability to run forecast in segments (NOAA-EMC#2795)
Collaborator

@TerrenceMcGuinness-NOAA TerrenceMcGuinness-NOAA left a comment

Nice update. I tested it, and the side effects are all valid and work within the framework.

@DavidHuber-NOAA DavidHuber-NOAA marked this pull request as ready for review August 13, 2024 16:36
@DavidHuber-NOAA DavidHuber-NOAA merged commit 336b78a into NOAA-EMC:develop Aug 13, 2024
5 checks passed
@DavidHuber-NOAA DavidHuber-NOAA deleted the feature/unavailable_rocoto branch August 13, 2024 18:55