two instances of the same task running in parallel #6329

Closed · oliver-sanders opened this issue Aug 27, 2024 · 18 comments
Label: bug Something is wrong :(
@oliver-sanders (Member) commented Aug 27, 2024

A bug where two instances of the same task have been observed running at the same time.

Whilst investigating #6315, I discovered an example where there were two parallel active submissions of the same task. This doesn't appear to be the cause of #6315, as #6315 has also been observed in the absence of this problem.

State Transitions

$ grep '20190331T0000Z/archive_logs' log/scheduler/02-start-01.log
2024-07-29T08:46:51Z INFO - [20190331T0000Z/archive_logs:waiting] => waiting(queued)
2024-07-29T08:46:51Z INFO - [20190331T0000Z/archive_logs:waiting(queued)] => waiting
2024-07-29T08:46:54Z INFO - [20190331T0000Z/archive_logs:waiting] => preparing
2024-07-29T08:47:06Z INFO - [20190331T0000Z/archive_logs/01:preparing] submitted to xcf:pbs[7678629]
2024-07-29T08:47:06Z INFO - [20190331T0000Z/archive_logs/01:preparing] => submitted
2024-07-29T08:47:07Z INFO - [20190331T0000Z/archive_logs/01:submitted] => running
2024-07-29T08:47:19Z INFO - [20190331T0000Z/archive_logs/02:running] => preparing
2024-07-29T08:47:28Z INFO - [20190331T0000Z/archive_logs/02:preparing] submitted to xcf:pbs[7678729]
2024-07-29T08:47:28Z INFO - [20190331T0000Z/archive_logs/02:preparing] => submitted
2024-07-29T08:47:36Z INFO - [20190331T0000Z/archive_logs/02:submitted] => running
2024-07-29T08:47:54Z WARNING - [20190331T0000Z/archive_logs/02:running] (received-ignored)succeeded for job(01) != job(02)
2024-07-29T08:48:00Z INFO - [20190331T0000Z/archive_logs/02:running] => succeeded
2024-07-29T08:48:00Z INFO - [20190331T0000Z/archive_logs/02:running(runahead)] => running
2024-07-29T09:10:27Z INFO - [20190331T0000Z/archive_logs/02:running] => running(held)

This line is particularly concerning:

INFO - [20190331T0000Z/archive_logs/02:running] => preparing

It is followed by this line, which provides some evidence of the two parallel tasks:

WARNING - [20190331T0000Z/archive_logs/02:running] (received-ignored)succeeded for job(01) != job(02)

A task should never be able to regress from running to preparing, right?!

Looking through the log for this workflow, there are 17 instances of the first message and 11 of the second. These messages relate to various tasks, but they always relate to even-numbered jobs (02, 04, 06), never odd-numbered ones.

Job Timings

For further evidence of the parallel jobs, we can inspect the timings in the remote logs:

$ grep 'INFO -' log/job/20190331T0000Z/archive_logs/*/job.out
log/job/20190331T0000Z/archive_logs/01/job.out:2024-07-29T08:47:06Z INFO - started
log/job/20190331T0000Z/archive_logs/01/job.out:2024-07-29T08:47:52Z INFO - succeeded
log/job/20190331T0000Z/archive_logs/02/job.out:2024-07-29T08:47:33Z INFO - started
log/job/20190331T0000Z/archive_logs/02/job.out:2024-07-29T08:47:59Z INFO - succeeded

The task was triggered (along with many others) at 2024-07-29T08:46:51Z via cylc trigger '20190331T0000Z/*'.

Reproducible(ish) example

See #6329 (comment)

oliver-sanders added the bug label Aug 27, 2024
oliver-sanders added this to the 8.3.x milestone Aug 27, 2024
@oliver-sanders (Member, Author) commented Aug 27, 2024

Have managed to reproduce using this example:

[scheduler]
    allow implicit tasks = True

[task parameters]
    a = 1..3
    b = 1..50

[scheduling]
    [[queues]]
        [[[default]]]
            limit = 40
    [[graph]]
        R1 = """
            <a> => <b>
        """

[runtime]
    [[<a=1>]]
        script = """
            if [[ $CYLC_TASK_SUBMIT_NUMBER -lt 2 ]]; then
                false
            fi
        """

When you run this example, one of the first three tasks will fail.

Unstick the workflow by triggering all tasks.

# start the workflow
cylc vip .

# wait for the first task to fail

# trigger all tasks
cylc trigger '<wid>//*'

Then look for evidence of parallel submissions in the workflow log:

$ cylc cat-log tmp.zu3j5ygsks | grep 'succeeded for job'
2024-08-27T12:45:19+01:00 WARNING - [1/_b28/02:preparing] (received-ignored)succeeded for job(01) != job(02)
2024-08-27T12:45:19+01:00 WARNING - [1/_b24/02:preparing] (received-ignored)succeeded for job(01) != job(02)
2024-08-27T12:45:19+01:00 WARNING - [1/_b30/02:preparing] (received-ignored)succeeded for job(01) != job(02)
2024-08-27T12:45:19+01:00 WARNING - [1/_b26/02:preparing] (received-ignored)succeeded for job(01) != job(02)
2024-08-27T12:45:19+01:00 WARNING - [1/_b35/02:preparing] (received-ignored)succeeded for job(01) != job(02)
2024-08-27T12:45:19+01:00 WARNING - [1/_b41/02:preparing] (received-ignored)succeeded for job(01) != job(02)
2024-08-27T12:45:19+01:00 WARNING - [1/_b37/02:preparing] (received-ignored)succeeded for job(01) != job(02)
2024-08-27T12:45:19+01:00 WARNING - [1/_b27/02:preparing] (received-ignored)succeeded for job(01) != job(02)

I think it's the triggering of large numbers of tasks that leads to this bug.

...

Confirmed! Manually triggered tasks are being released via task_pool.release_queued_tasks, causing duplicate submissions.
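
For anyone following along, here is a toy Python model of the race as described above. This is not the actual Cylc code; LimitedQueue, force_trigger and the task IDs are invented names purely for illustration of how a force-trigger that submits directly, while the task also remains in a limited queue, produces a second submission on the next release pass.

# Toy model of the suspected race (NOT the real Cylc internals).
class LimitedQueue:
    def __init__(self, limit):
        self.limit = limit
        self.queued = []

    def push(self, task):
        self.queued.append(task)

    def release(self, n_active):
        """Release queued tasks up to the queue limit."""
        released = []
        while self.queued and n_active + len(released) < self.limit:
            released.append(self.queued.pop(0))
        return released


def force_trigger(queue, task, submit):
    # First submission happens immediately (job 01) ...
    submit(task)
    # ... but, as modelled here, the task is never removed from the queue,
    # so the next release pass hands it out again (job 02).


submissions = []
queue = LimitedQueue(limit=40)
queue.push("1/_b24")
force_trigger(queue, "1/_b24", submissions.append)
submissions.extend(queue.release(n_active=0))
print(submissions)  # ['1/_b24', '1/_b24'] -> the duplicate submission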

@hjoliver (Member)

> Confirmed! Manually triggered tasks are being released via task_pool.release_queued_tasks, causing duplicate submissions.

Confirmed here too.

@oliver-sanders (Member, Author)

It looks like the same task is coming off the queue twice; could this be a queue-side bug?

hjoliver self-assigned this Aug 29, 2024
hjoliver modified the milestones: 8.3.x, 8.3.4 Aug 29, 2024
@hjoliver (Member)

That was difficult to track down, but I got it in the end. It's sort of a queue bug, but not quite; see the PR.

@ColemanTom (Contributor)

Just adding this here: I'm not sure if it's related or not, but someone at my workplace reported this on Friday evening.

2024-08-30T07:09:38Z INFO - [20231130T1100Z/aa_um_recon_host:waiting(runahead)] => waiting
2024-08-30T07:09:38Z INFO - [20231130T1100Z/aa_um_recon_host:waiting] => waiting(queued)
2024-08-30T07:09:39Z INFO - [20231130T1100Z/aa_um_recon_host:waiting(queued)] => waiting
2024-08-30T07:09:39Z INFO - [20231130T1100Z/aa_um_recon_host:waiting] => preparing
2024-08-30T07:09:43Z INFO - [20231130T1100Z/aa_um_recon_host/01:preparing] submitted to PLATFORM[1]
2024-08-30T07:09:43Z INFO - [20231130T1100Z/aa_um_recon_host/01:preparing] => submitted
2024-08-30T07:09:43Z INFO - [20231130T1100Z/aa_um_recon_host/01:submitted] submitted to PLATFORM[2]
2024-08-30T07:09:45Z INFO - [20231130T1100Z/aa_um_recon_host/01:submitted] reloaded task definition
2024-08-30T07:09:45Z WARNING - [20231130T1100Z/aa_um_recon_host/01:submitted] active with pre-reload settings
2024-08-30T07:09:45Z INFO - [20231130T1100Z/aa_um_recon_host/01:submitted] submitted to PLATFORM[3]
2024-08-30T07:09:45Z INFO - [20231130T1100Z/aa_um_recon_host/01:submitted] submitted to PLATFORM[4]
2024-08-30T07:09:56Z INFO - [20231130T1100Z/aa_um_recon_host/01:submitted] => running

Four submissions in the space of 3 seconds. You can see she was reloading the workflow at the time, if that has any impact. I think this is Cylc 8.2.

@sjrennie commented Sep 1, 2024

Thanks Tom. Actually, the version I had the issue in was 8.3.3.

@ColemanTom (Contributor)

I stand corrected: 8.3.3. Also, as @sjrennie pointed out, all of hers were submitted as job 01.

@hjoliver (Member) commented Sep 1, 2024

Hmm, this looks different, I think. There's no resetting of submitted or running states back to preparing as was the case in this issue. Are those PLATFORM placeholders literally 4 different platforms?

@sjrennie commented Sep 1, 2024

No, same platform, different PBS IDs.

@hjoliver (Member) commented Sep 1, 2024

So you can confirm those were all real job submissions?

@sjrennie commented Sep 1, 2024

Yes, PBS thought so! I could qstat them all, they all reported back to job.status, and they ran for 20 minutes before I decided to kill them, since they wouldn't have survived all trying to write to the same file at once at the end of the job.

@hjoliver (Member) commented Sep 1, 2024

OK, is that a grepped subset of the log? It might be useful to see the whole thing (you can email the relevant excerpt to me if it's large) to see exactly where the reload occurred.

@sjrennie commented Sep 1, 2024

Here is the exact bit where it started to go wrong. I can extract a larger excerpt to send to you if necessary.

2024-08-30T07:09:39Z INFO - [20231130T1100Z/aa_um_recon_host:waiting(queued)] => waiting
2024-08-30T07:09:39Z INFO - [20231130T1100Z/aa_um_recon_host:waiting] => preparing
2024-08-30T07:09:40Z INFO - Command "reload_workflow" received. ID=5430acf1-4fd7-40d4-b040-ff0fa20e3c68
reload_workflow()
2024-08-30T07:09:40Z INFO - Pausing the workflow: Reloading workflow
2024-08-30T07:09:43Z INFO - [20231130T1100Z/aa_um_recon_host/01:preparing] submitted to PLATFORM[1]
2024-08-30T07:09:43Z INFO - [20231130T1100Z/aa_um_recon_host/01:preparing] => submitted
2024-08-30T07:09:43Z INFO - [20231130T1100Z/aa_um_recon_host/01:submitted] submitted to PLATFORM[2]

@ColemanTom (Contributor)

The [1], [2], etc. are just PBS job IDs. I didn't really need to mask them out. So the 1, 2, 3, 4 were really 6872518, 6872521, 6872525, 6872527.

@hjoliver (Member) commented Sep 2, 2024

OK thanks for the more detailed log. Maybe it's something to do with reloading while the task is in the preparing state. I'll see if I can reproduce this...
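
Purely as an illustrative sketch of the kind of race being suggested (this is not the actual Cylc reload logic; ToyPool, submit_preparing and the task ID below are invented names): if the reload rebuilds the task pool while a preparing task's job 01 is already in flight, and the rebuilt proxy is fed through job preparation again without a new submit number being allocated, the same job 01 can be submitted repeatedly.

# Toy model only: a reload that carries "preparing" tasks into a rebuilt
# pool without recording that their jobs are already in flight, so the
# submission pass runs again for the same job number (01).

submissions = []

def submit(task, job_num):
    submissions.append((task, job_num))

class ToyPool:
    def __init__(self, preparing=()):
        self.preparing = list(preparing)

    def submit_preparing(self):
        for task in self.preparing:
            submit(task, 1)  # still job 01: no new submit number allocated

pool = ToyPool(["20231130T1100Z/aa_um_recon_host"])
pool.submit_preparing()          # first submission of job 01

# reload: rebuild the pool, carrying the "preparing" task across without
# any record that its job has already gone to the batch system
pool = ToyPool(pool.preparing)
pool.submit_preparing()          # job 01 submitted again

print(submissions)  # [(task, 1), (task, 1)] -> duplicate submissions of job 01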

@hjoliver (Member) commented Sep 2, 2024

A-ha, reproduced - thanks for the bug report! I'll open a new issue for this.

@hjoliver (Member) commented Sep 2, 2024

Closed by #6337

@hjoliver (Member) commented Sep 4, 2024

@ColemanTom @sjrennie: I backed out of my first attempt at a fix for the new duplicate-submission bug you reported, but the real thing is up now: #6345
