platforms: broadcasted platform ignored after ssh failure #6320

Closed
oliver-sanders opened this issue Aug 22, 2024 · 7 comments · Fixed by #6330

@oliver-sanders
Member

oliver-sanders commented Aug 22, 2024

We can use broadcasts to change the platform a task submits to.

Under normal circumstances this works fine. However, when hosts go down and the submission is retried, the broadcast appears to be forgotten and the new submission uses the originally configured platform.

This could lead to jobs being submitted to the wrong platform.

Reproducible example:

Run the following workflow.

Once the "remote_init_one" and "remote_init_two" tasks have submitted, break your SSH config to force subsequent calls to fail.

[scheduling]
    [[graph]]
        R1 = remote_init_one & remote_init_two & local => remote

[runtime]
    # ensure that the workflow has remote-init'ed on platforms "one" and "two"
    [[remote_init_one]]
        platform = one-bg
    [[remote_init_two]]
        platform = two-bg

    # change the platform of "remote" via broadcast
    [[local]]
        script = """
            cylc broadcast "${CYLC_WORKFLOW_ID}" -n remote -p "${CYLC_TASK_CYCLE_POINT}" -s 'platform=one'
            sleep 10
        """

    [[remote]]
        platform = localhost

The "remote" task should attempt to submit to each of the hosts in the "one" platform. All SSH connections will fail so the task will run out of hosts and become submit-failed.

However, that's not what happens! Running this command reveals that after running out of hosts, the task then attempted to submit to localhost (the platform defined before the broadcast):

$ grep 'DEBUG - \[jobs-submit cmd\].*1/remote/01' --color=never ~/cylc-run/<workflow>/log/scheduler/log
... ssh ... one.01 ... cylc jobs-submit ... 1/remote/01
... ssh ... one.02 ... cylc jobs-submit ... 1/remote/01
... cylc jobs-submit ... 1/remote/01

Note: This erroneous submission appears to happen after all the hosts of the broadcasted platform have been exhausted, which may help pin down the offending code pathway.

Interestingly, when I try this, the attempted submission to localhost actually fails due to the qsub command not being in $PATH. In my case platform one uses PBS so this suggests that it is attempting to submit to localhost, but with the configuration of one?!
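
For what it's worth, here is a toy sketch (plain Python, nothing to do with the actual Cylc internals) of the kind of stale-state pathway that could explain this: a retry callback holding on to a copy of the config taken before the broadcast was applied. All names here are made up for illustration.

config = {"platform": "localhost"}
retry_callbacks = []

def register_retry(cfg):
    # The callback closes over the config as it was at registration time.
    retry_callbacks.append(lambda: print("retry submits to", cfg["platform"]))

register_retry(config.copy())      # registered before the broadcast arrives
config["platform"] = "one"         # broadcast changes the platform
print("first submission goes to", config["platform"])   # -> one
retry_callbacks[0]()               # -> localhost, the stale pre-broadcast value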

@oliver-sanders oliver-sanders added the bug Something is wrong :( label Aug 22, 2024
@oliver-sanders oliver-sanders added this to the 8.3.x milestone Aug 22, 2024
@wxtim wxtim self-assigned this Aug 23, 2024
@wxtim
Member

wxtim commented Aug 23, 2024

Looks like it's remote-initing on the same host?

    [[remote_init_one]]
        platform = one-bg
    [[remote_init_two]]
        platform = one-bg

@oliver-sanders
Member Author

Typo, corrected in OP

@wxtim
Member

wxtim commented Aug 23, 2024

Replicated it with a local site installation. Now working out how to reproduce it in a more debuggable way.

@oliver-sanders
Member Author

I think this example should be enough to debug with. Here's my stab in the dark at a debugging strategy, if it helps...

I would start by identifying the bits of the code where a host is selected and logging each of these. This should allow you to pinpoint the particular branch / method where the incorrect host comes from. Given the convoluted nature of the call/callback code, the same method can be called multiple times, so this might not actually be that much help. If so, I would then try to log the relevant function calls (likely prep/submit methods and their 255 callbacks) so you can map out the callchain. After that, no idea!
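
To illustrate the "log the relevant function calls" step, here is a minimal sketch of the kind of tracing decorator I mean (plain Python; select_host is a made-up stand-in, not a real Cylc function):

import functools
import logging

logging.basicConfig(level=logging.DEBUG, format="%(levelname)s - %(message)s")
LOG = logging.getLogger("callchain")

def trace(fn):
    """Log each call to fn with its arguments so the callchain can be mapped out."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        LOG.debug("CALL %s args=%r kwargs=%r", fn.__qualname__, args, kwargs)
        return fn(*args, **kwargs)
    return wrapper

# Temporarily decorate the prep/submit methods and their 255 callbacks to see
# which branch supplies the wrong host.
@trace
def select_host(platform_name):
    return f"{platform_name}.01"

select_host("one")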

@wxtim
Member

wxtim commented Aug 29, 2024

Checks for similar bugs:

Search for rtconfig\[["']platform["']\]:

  • data_store_mgr.runtime_from_config - Looks like it's used to initialize fields at startup, so it should be safe not to check for broadcasts. Checked by looking at the TUI.
  • subprocpool.SubProcPoll.run_command_exit - Functionally safe because it's only used for logging. It might conceivably produce strange log output, but even that shouldn't happen if the callback is given sensible arguments. An apparent bug found in this code during the investigation disappeared once the fix in Ensure that platform from group selection checks broadcast manager #6330 was made.
  • All other lookups are in task_job_mgr.TaskJobManager._prep_submit_task_job on a function-scoped copy of the rtconfig which has broadcasts applied (the pattern sketched below).
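
For reference, the safe pattern described in the last point looks roughly like this. This is a simplified sketch using plain dicts and made-up names, not the actual Cylc code:

from copy import deepcopy

# Made-up stand-ins for a task's parsed runtime config and an active broadcast.
rtconfig = {"platform": "localhost", "script": "true"}
broadcast = {"platform": "one"}

def prep_submit(rtconfig, broadcast):
    """Read 'platform' from a function-scoped copy with broadcasts applied."""
    cfg = deepcopy(rtconfig)      # never mutate the task's stored definition
    cfg.update(broadcast or {})   # broadcast settings override the workflow config
    return cfg["platform"]

print(prep_submit(rtconfig, broadcast))   # -> "one", the broadcast value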

@oliver-sanders
Member Author

Here's a version of the workflow in the OP that has been adapted to use [remote]host and [job]batch system rather than platform.

This example does not replicate the bug (presumably because it uses a different code pathway):

[scheduling]
    [[graph]]
        R1 = remote_init_one & remote_init_two & local => remote

[runtime]
    # ensure that the workflow has remote-init'ed on platforms "one" and "two"
    [[remote_init_one]]
        [[[remote]]]
            host = one.login.01
    [[remote_init_two]]
        [[[remote]]]
            host = two.login.01

    # change the platform of "remote" via broadcast
    [[local]]
        script = """
            cylc broadcast "${CYLC_WORKFLOW_ID}" -n remote -p "${CYLC_TASK_CYCLE_POINT}" -s '[remote]host=one.login.01'
            sleep 10
        """

    [[remote]]
        [[[remote]]]
            host = localhost
        [[[job]]]
            batch system = pbs

Posting this here as I'm using this to test the fix to ensure it still works as intended.

@wxtim wxtim linked a pull request Sep 5, 2024 that will close this issue
@oliver-sanders
Member Author

Closed by #6330

@oliver-sanders oliver-sanders modified the milestones: 8.3.x, 8.3.4 Sep 26, 2024