
Duplicate job submissions on reload #6344

Closed
hjoliver opened this issue Sep 2, 2024 · 9 comments · Fixed by #6345

@hjoliver (Member) commented Sep 2, 2024

Original bug report from @sjrennie and @ColemanTom tagged onto the end of a similar-but-different issue:

#6329 (comment)

Reloading the workflow while a task is in the preparing state results in duplicate job submissions. Reproducible test case:

  1. Restrict the scheduler process pool size, to keep tasks in the preparing state:

     # global.cylc
     [scheduler]
         process pool size = 1

  2. Add a slow event handler to tie up the restricted process pool for 10 seconds:

     # flow.cylc
     [scheduling]
         [[graph]]
             R1 = foo => bar
     [runtime]
         [[foo]]
             [[[events]]]
                 started handlers = "sleep 10; echo"
         [[bar]]

  3. Reload the workflow while bar is in the preparing state (see the command sketch just below).
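For convenience, a minimal reproduction sketch (assuming the global.cylc above is in your Cylc config directory, typically ~/.cylc/flow/global.cylc, and the flow.cylc is installed under the hypothetical name dup-test):

    cylc install ./dup-test   # install the source directory
    cylc play dup-test        # start the scheduler
    # while the scheduler log shows 1/bar as preparing:
    cylc reload dup-test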

Result:

INFO - [1/bar:waiting(queued)] => waiting
INFO - [1/bar:waiting] => preparing  # <---- BAR PREPARING
INFO - Command "reload_workflow" received. ID=49f11bf0-e1af-45a6-88c8-82d29fb8feaa
    reload_workflow()
INFO - Pausing the workflow: Reloading workflow  # <----- RELOAD
INFO - [1/bar/01:preparing] submitted to localhost:background[9940]
INFO - [1/bar/01:preparing] => submitted  # <----- BAR/01 SUBMITTED
INFO - [1/bar/01:submitted] => running
INFO - [1/bar/01:running] => succeeded
INFO - Reloading the workflow definition.
INFO - LOADING workflow parameters
INFO - + workflow UUID = 624ad812-f4a0-4557-9d38-147016a3fee0
INFO - + UTC mode = False
INFO - + run mode = None
INFO - + cycle point time zone = Z
INFO - + paused = True
INFO - Reloading task definitions.
INFO - LOADING job data
INFO - Reload completed.
INFO - RESUMING the workflow now
INFO - Command "reload_workflow" actioned. ID=49f11bf0-e1af-45a6-88c8-82d29fb8feaa
INFO - [1/bar/01:succeeded] submitted to localhost:background[9962]  # BAR/01 SUBMITTED AGAIN !!
WARNING - Undeliverable task messages received and ignored:
      1/bar/01: INFO - "started"
      1/bar/01: INFO - "succeeded"
INFO - Waiting for the command process pool to empty for shutdown
INFO - [1/bar/01:succeeded] submitted to localhost:background[9981]  # <------ AND AGAIN!!!
INFO - Workflow shutting down - AUTOMATIC
INFO - DONE
@hjoliver added the bug label Sep 2, 2024
@hjoliver added this to the 8.3.4 milestone Sep 2, 2024
@hjoliver (Member, Author) commented Sep 2, 2024

I also get 3 job submissions with this variant (still with process pool size = 1 in global.cylc):

[scheduler]
    [[events]]
        startup handlers = "sleep 10; echo"  # tie up the restricted process pool for 10 sec
[scheduling]
    [[graph]]
        R1 = foo
[runtime]
    [[foo]]

@ColemanTom (Contributor) commented

Well done on replicating it so fast. Hopefully it is an easy fix.

@hjoliver (Member, Author) commented Sep 2, 2024

Yep, still working on that bit!

@hjoliver (Member, Author) commented Sep 2, 2024

Fix posted. It was easy enough to fix inside the subprocess pool code, but I want to see if it should be fixed at a higher level instead.

@oliver-sanders (Member) commented

Replicated.

My trigger-finger isn't as fast as yours, so I jammed in a sleep:

diff --git a/cylc/flow/task_state.py b/cylc/flow/task_state.py
index a6a6bc125..3c55f77b4 100644
--- a/cylc/flow/task_state.py
+++ b/cylc/flow/task_state.py
@@ -403,6 +403,10 @@ class TaskState:
             Whether state changed or not (bool)
 
         """
+        if status == 'preparing':
+            from time import sleep
+            sleep(2)
+
         req = status
 
         if forced and req in [TASK_STATUS_SUBMITTED, TASK_STATUS_RUNNING]:

@hjoliver (Member, Author) commented Sep 3, 2024

> My trigger-finger isn't as fast as yours, so I jammed in a sleep:

I did it by strapping the process pool size down to 1, then running a 10-second event handler to keep the next task stuck in the preparing state while the handler ran - so no fast trigger finger needed!

@oliver-sanders (Member) commented

I tried this, but foo still passes through the preparing state too quickly to catch. Were you reloading whilst bar was in the preparing state instead?

@hjoliver (Member, Author) commented Sep 3, 2024

I was using my second example above, which only has R1 = foo (there is no bar). The critical bit is process pool size = 1 plus a 10-second workflow startup event handler. That makes foo stay in the preparing state for 10 seconds.
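In other words, the 10-second startup handler guarantees a wide reload window; a rough timing sketch, using the same hypothetical workflow name as above:

    cylc play dup-test
    sleep 5                   # foo is still stuck in the preparing state here
    cylc reload dup-test      # lands well inside the 10 second window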

@oliver-sanders linked a pull request (#6345) Sep 11, 2024 that will close this issue
@oliver-sanders (Member) commented

Closed by #6345
