Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Partial #6914 job_agent separated terminating vs destroying #7337

Merged
merged 4 commits into from
Nov 13, 2024

Conversation

robnagler
Copy link
Member

  • does not cancel the sbatch job when terminating
  • job_agent _SBATCH_ID_FILE write
  • job_supervisor concept of verify_status but doesn't change semantics yet
  • pkcli.elegant-schema better approach to updating schema
  • const.DEV_SRC_RADIASOFT_DIR

- does not cancel the sbatch job when terminating
- job_agent _SBATCH_ID_FILE write
- job_supervisor concept of verify_status but doesn't change semantics yet
- pkcli.elegant-schema better approach to updating schema
- const.DEV_SRC_RADIASOFT_DIR
@robnagler robnagler requested a review from e-carlin October 30, 2024 22:06
robnagler added a commit that referenced this pull request Nov 5, 2024
Copy link
Member

@e-carlin e-carlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some style comments. Not sure what behavior should be implemented with this PR but I saw this:

  • start an sbatch job
  • kill -9 supervisor # No chance to terminate so agent is left running
  • restart supervisor
  • login to sbatch agent
  • simulation is canceled

Is this sim supposed to be canceled?

sirepo/pkcli/job_agent.py Show resolved Hide resolved
if self._status_cb:
self._status_cb.stop()
self._status_cb = None
self._start_ready.set()
if self._sbatch_id:
if self._sbatch_id and not self._terminating:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to use such an abstract name like _terminating? I feel like we have a had differences on words like this in the past. I much prefer a specific and unique name that indicates what the heck is going on. Ex _want_kill_cmds or even _want_scancel.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kind of depends on which place you are reading. Inside terminate, terminating is set to True and that cascades down several levels. The intent in terminate() is to signal that the agent is terminating. This use parallels self._destroying, which indicates the _Cmd is in destroy(). There are many conditionals which use _destroying for different purposes, but they are all relying on the fact that the object is in destroy().

In this particular case, there's only one reason to know that the process is in terminate(), but there could be other reasons, and the way the value cascades in the call to terminate() makes sense (to me at least). It's also a private attribute of the class so it's use is localized, e.g. the Dispatcher doesn't refer to it.

That's the reasoning of "what the heck is going on". It's just a different perspective: from the whole process vs the specific conditional.

@robnagler
Copy link
Member Author

  • start an sbatch job
  • kill -9 supervisor # No chance to terminate so agent is left running
  • restart supervisor
  • login to sbatch agent
  • simulation is canceled

Is this sim supposed to be canceled?

This is caused by the scancel in sbatch.py when the agent starts. This PR didn't fix that.

Copy link
Member

@e-carlin e-carlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM then. Do what you will with my style comments.

@robnagler robnagler merged commit 10927b1 into master Nov 13, 2024
3 checks passed
@robnagler robnagler deleted the 6914-sbatch-id-phase-1 branch November 13, 2024 15:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants