
Duplicate job submissions on reload #6344

Closed
hjoliver opened this issue Sep 2, 2024 · 9 comments · Fixed by #6345

@hjoliver (Member) commented Sep 2, 2024

Original bug report from @sjrennie and @ColemanTom tagged onto the end of a similar-but-different issue:

#6329 (comment)

Reloading the workflow while a task is in the preparing state results in duplicate job submissions. Reproducible test case:

  1. Restrict the scheduler process pool size, to keep tasks in the preparing state:

     # global.cylc
     [scheduler]
         process pool size = 1

  2. Add a slow event handler to tie up the restricted process pool for 10 seconds:

     # flow.cylc
     [scheduling]
         [[graph]]
             R1 = foo => bar
     [runtime]
         [[foo]]
             [[[events]]]
                 started handlers = "sleep 10; echo"
         [[bar]]

  3. Reload the workflow while bar is in the preparing state (see the command sketch just below).
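For convenience, a minimal reproduction sketch (assuming the global.cylc above is in your Cylc config directory, typically ~/.cylc/flow/global.cylc, and the flow.cylc is installed under the hypothetical name dup-test):

    cylc install ./dup-test   # install the source directory
    cylc play dup-test        # start the scheduler
    # while the scheduler log shows 1/bar as preparing:
    cylc reload dup-test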

Result:

INFO - [1/bar:waiting(queued)] => waiting
INFO - [1/bar:waiting] => preparing  # <---- BAR PREPARING
INFO - Command "reload_workflow" received. ID=49f11bf0-e1af-45a6-88c8-82d29fb8feaa
    reload_workflow()
INFO - Pausing the workflow: Reloading workflow  # <----- RELOAD
INFO - [1/bar/01:preparing] submitted to localhost:background[9940]
INFO - [1/bar/01:preparing] => submitted  # <----- BAR/01 SUBMITTED
INFO - [1/bar/01:submitted] => running
INFO - [1/bar/01:running] => succeeded
INFO - Reloading the workflow definition.
INFO - LOADING workflow parameters
INFO - + workflow UUID = 624ad812-f4a0-4557-9d38-147016a3fee0
INFO - + UTC mode = False
INFO - + run mode = None
INFO - + cycle point time zone = Z
INFO - + paused = True
INFO - Reloading task definitions.
INFO - LOADING job data
INFO - Reload completed.
INFO - RESUMING the workflow now
INFO - Command "reload_workflow" actioned. ID=49f11bf0-e1af-45a6-88c8-82d29fb8feaa
INFO - [1/bar/01:succeeded] submitted to localhost:background[9962]  # BAR/01 SUBMITTED AGAIN !!
WARNING - Undeliverable task messages received and ignored:
      1/bar/01: INFO - "started"
      1/bar/01: INFO - "succeeded"
INFO - Waiting for the command process pool to empty for shutdown
INFO - [1/bar/01:succeeded] submitted to localhost:background[9981]  # <------ AND AGAIN!!!
INFO - Workflow shutting down - AUTOMATIC
INFO - DONE
@hjoliver added the bug label Sep 2, 2024
@hjoliver added this to the 8.3.4 milestone Sep 2, 2024
@hjoliver (Member, Author) commented Sep 2, 2024

I also get 3 job submissions with this variant (still with process pool size = 1 in global.cylc):

[scheduler]
    [[events]]
        startup handlers = "sleep 10; echo"  # tie up the restricted process pool for 10 sec
[scheduling]
    [[graph]]
        R1 = foo
[runtime]
    [[foo]]

@ColemanTom (Contributor) commented

Well done on replicating it so fast. Hopefully it is an easy fix.

@hjoliver (Member, Author) commented Sep 2, 2024

Yep, still working on that bit!

@hjoliver (Member, Author) commented Sep 2, 2024

Fix posted. It was easy enough to fix inside the subprocess pool code, but I want to see if it should be fixed at a higher level instead.

@oliver-sanders (Member) commented

Replicated.

My trigger-finger isn't as fast as yours, so I jammed in a sleep:

diff --git a/cylc/flow/task_state.py b/cylc/flow/task_state.py
index a6a6bc125..3c55f77b4 100644
--- a/cylc/flow/task_state.py
+++ b/cylc/flow/task_state.py
@@ -403,6 +403,10 @@ class TaskState:
             Whether state changed or not (bool)
 
         """
+        if status == 'preparing':
+            from time import sleep
+            sleep(2)
+
         req = status
 
         if forced and req in [TASK_STATUS_SUBMITTED, TASK_STATUS_RUNNING]:

@hjoliver (Member, Author) commented Sep 3, 2024

> My trigger-finger isn't as fast as yours, so I jammed in a sleep:

I did it by strapping the process pool size down to 1, then running a 10-second event handler to keep the next task stuck in the preparing state while the handler ran - so no fast trigger finger needed!

@oliver-sanders (Member) commented

I tried this, but foo still passes through the preparing state too quickly to catch. Were you reloading whilst bar was in the preparing state instead?

@hjoliver (Member, Author) commented Sep 3, 2024

I was using my second example above, which only has R1 = foo (there is no bar). The critical bit is process pool size = 1 plus a 10-second workflow startup event handler. That makes foo stay in the preparing state for 10 seconds.
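In other words, the 10-second startup handler guarantees a wide reload window; a rough timing sketch, using the same hypothetical workflow name as above:

    cylc play dup-test
    sleep 5                   # foo is still stuck in the preparing state here
    cylc reload dup-test      # lands well inside the 10 second window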

@oliver-sanders linked a pull request (#6345) Sep 11, 2024 that will close this issue
@oliver-sanders (Member) commented

Closed by #6345
