Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reload: wait for pending tasks to submit and pause the workflow #5592

Merged
merged 5 commits into from
Jul 6, 2023

Conversation

oliver-sanders
Copy link
Member

@oliver-sanders oliver-sanders commented Jun 19, 2023

  • Closes Making reload safer #5107
  • Reload now waits for pending tasks to submit before attempting to reload the config itself.
  • Reload now also puts the workflow into the paused state during the reload process. This doesn't actually achieve anything as the reload command is blocking in the main loop, but it does help to communicate that the workflow will not de-queue or submit and new tasks during this process.
  • The workflow status message is now updated to reflect the reload progress.

This should also close #5579 by removing any potential for interaction between reload and preparing tasks. But note, I've not actually reproduced #5579.

Needs extensive functional testing.

The second commit straightens out the reload logic which is currently split between four routines:

  • Scheduler.command_reload_workflow
  • Scheduler.main_loop
  • Pool.set_do_reload
  • Pool.reload_taskdefs

With command_reload_workflow calling set_do_reload somewhere in the middle of the routine, then handing back over to the main loop which does its part before calling reload_taskdefs. I think this is just Cylc 7 hangover, removing the main-loop interaction from reload should reduce the chance of unexpected state changes occurring during the process.

Check List

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to both setup.cfg (and conda-environment.yml if present).
  • Tests are included (or explain why tests are not needed).
  • CHANGES.md entry included if this is a change that can affect users
  • Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.
  • If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

self.process_workflow_db_queue() # see #5593

# flush out preparing tasks before attempting reload
self.reload_pending = 'waiting for pending tasks to submit'
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This gets used in the workflow status so progress will be indicated in the GUI.

script = """
cylc reload "${CYLC_WORKFLOW_ID}"
# wait for the command to complete
cylc__job__poll_grep_workflow_log 'Reload completed'
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Orthogonal change: This test was timing dependent because it didn't wait for reload to complete.

)
]
)
async def test_illegal_config_load(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Covered by the new test.

@oliver-sanders oliver-sanders marked this pull request as ready for review June 19, 2023 17:19
@oliver-sanders oliver-sanders added the bug Something is wrong :( label Jun 20, 2023
@oliver-sanders
Copy link
Member Author

oliver-sanders commented Jun 20, 2023

To test, you can simulate a slow remote-init and a slow config load by jamming in sleeps like so:

diff --git a/cylc/flow/scheduler.py b/cylc/flow/scheduler.py
index d3cab5225..28bd79ac8 100644
--- a/cylc/flow/scheduler.py
+++ b/cylc/flow/scheduler.py
@@ -1096,6 +1096,7 @@ class Scheduler:
         LOG.info("Reloading the workflow definition.")
         try:
             config = self.load_flow_file(is_reload=True)
+            sleep(20)
         except (ParsecError, CylcConfigError) as exc:
             if cylc.flow.flags.verbosity > 1:
                 # log full traceback in debug mode
diff --git a/cylc/flow/scripts/remote_init.py b/cylc/flow/scripts/remote_init.py
index 7706cecfa..9c52ae8af 100755
--- a/cylc/flow/scripts/remote_init.py
+++ b/cylc/flow/scripts/remote_init.py
@@ -60,7 +60,8 @@ def get_option_parser() -> COP:
 
 @cli_function(get_option_parser)
 def main(parser, options, install_target, rund, *dirs_to_be_symlinked):
-
+    from time import sleep
+    sleep(20)
     remote_init(
         install_target,
         rund,

In the GUI you should see the following when you reload:

  • Workflow will enter the paused state.
  • Status message (in toolbar) will change to reloading: waiting for pending tasks to submit.
  • Status message will change to reloading: loading the workflow configuration.
  • Workflow will unpause and continue.

@oliver-sanders
Copy link
Member Author

Got a couple of functional tests to fix, but otherwise the functionality should be good.

One question, in order to keep the scheduler responsive whilst we're waiting for pending tasks to submit, I added process_command_queue into the mini event-loop in command_reload_workflow. This means any arbitrary command could be actioned during this period.

This is probably fine, the point of this loop is just to wait for the workflow to enter a safe state for reload to be performed. It's probably even possible to trigger tasks at this stage if you want them to be submitted with the pre-reload config! But there's the potential for unplanned interactions so we may want to consider whitelisting this somehow.

Copy link
Member

@wxtim wxtim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Code read

Functional Review

  • Try this with a variety of workflows
  • Check that this works the same way with cylc vr and cylc reinstall
  • Check what reloading with these changes looks like in the GUI.
  • Try reloading a change which breaks the workflow.
  • Try to get tasks in horrible states to break the logic at line 1418 of scheduler.py

cylc/flow/scheduler.py Outdated Show resolved Hide resolved
@oliver-sanders
Copy link
Member Author

The two test failures have flagged a legitimate issue with reload causing duplicate stall events which I'll sort out soon, will require a line or two of change.

Copy link
Member

@hjoliver hjoliver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some trivial, and possibly annoying, comments. Code LGTM. I'll do some more brutal functional testing on Monday, but I'm pretty confident it's all good 🎉

cylc/flow/scheduler.py Outdated Show resolved Hide resolved
cylc/flow/scheduler.py Outdated Show resolved Hide resolved
cylc/flow/scheduler.py Outdated Show resolved Hide resolved
Comment on lines +2028 to +2050
A user-facing string explaining why the workflow was paused if
helpful.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
A user-facing string explaining why the workflow was paused if
helpful.
A user-facing string explaining why the workflow was paused.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, so I'm not going to do that!

Copy link
Member

@hjoliver hjoliver Jun 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not? Isn't the "if helpful" redundant? (We wouldn't put it there if it not helpful!)

Comment on lines +2048 to +2070
Whether to log anything in the event the workflow is not
paused.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Whether to log anything in the event the workflow is not
paused.
Whether to log anything if the workflow is not paused.

cylc/flow/scheduler.py Outdated Show resolved Hide resolved
cylc/flow/task_pool.py Show resolved Hide resolved
tests/conftest.py Show resolved Hide resolved
@wxtim
Copy link
Member

wxtim commented Jun 23, 2023

Using a super-basic workflow I tried:

$ cylc vip workflow
$ sed 's@allow implicit tasks = True@allow implicit tasks = False@' -i ~/cylc-src/workflow/flow.cylc
$ cylc vr workflow
#  ... Fails horribly - it's now broken
$ cylc reinstall workflow
#  ... succeeds, does no validation checks
$ cylc reload workflow
Done
$ cylc log workflow 
.... CRITICAL Reload failed - WorkflowConfigError ...
# Contains the evidence that the reload did not in fact work.

Is that what you want? I think it's dangerous/misleading...And this behaviour seems to be matched in the GUI.

@oliver-sanders
Copy link
Member Author

Yes:

  • cylc vr protects us from configuration issues.
  • If users decide to bypass cylc vr and a config error is present then reload will log the error and leave the workflow paused.
  • Cylc commands are async so don't return their completion status.

@oliver-sanders
Copy link
Member Author

Fixed the test failures, the problem was that update_workflow_structure has a role to play in stall event detection, so I've come up with a cut-down version of the method which is enough to update the workflow status and status-message so that users can see what's going on, but doesn't try to update the task pool or mess with stall detection.

https://github.com/cylc/cylc-flow/compare/d5f331fba83b67acd5e12613e786051020859aa7..e7545d86ee27c0f6e51a226297299bbf5b1656d5

* Closes cylc#5107
* Reload now waits for pending tasks to submit before attempting to
  reload the config itself.
* Reload now also puts the workflow into the paused state during the
  reload process. This doesn't actually achieve anything as the
  reload command is blocking in the main loop, but it does help to
  communicate that the workflow will not de-queue or submit and new
  tasks during this process.
* The workflow status message is now updated to reflect the reload
  progress.
* The reload code was spread around four places:
  * Scheduler.command_reload_workflow
  * Scheduler.main_loop
  * Pool.set_do_reload
  * Pool.reload_taskdefs
* This commit co-locates the Scheduler/Pool parts and turns them into a
  single synchronous operation (no main-loop moving parts) to simplify
  the pathway.
* This removes the need for the `do_reload` pool flag.
CHANGES.md Outdated Show resolved Hide resolved
Co-authored-by: Tim Pillinger <[email protected]>
@wxtim wxtim self-requested a review July 6, 2023 07:48
Copy link
Member

@wxtim wxtim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy

@wxtim wxtim merged commit 5108f25 into cylc:master Jul 6, 2023
@oliver-sanders oliver-sanders deleted the safer-reload branch July 7, 2023 08:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something is wrong :(
Projects
None yet
Development

Successfully merging this pull request may close these issues.

reload: stuck preparing task Making reload safer
3 participants