Use full path to Cylc when submitting jobs #6302

Draft
wants to merge 1 commit into
base: master

Conversation

ScottWales
Contributor

@ScottWales ScottWales commented Aug 15, 2024

Use the full path to the Cylc script when submitting jobs

Fixes #6301

Check List

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to both setup.cfg (and conda-environment.yml if present).
  • Tests are included (or explain why tests are not needed).
  • Changelog entry included if this is a change that can affect users
  • Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.
  • If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

@ScottWales
Contributor Author

The 00-torture test is failing because the Cylc path is explicitly set to a bad path.

@ScottWales ScottWales marked this pull request as draft August 15, 2024 02:34
@ColemanTom
Contributor

Any ability for a small unit test for the new function?

@ColemanTom
Contributor

I'm wondering if this can go into 8.3.x rather than master too? Cylc devs may disagree whether this is a feature or bugfix though. Although I think this only became an issue for us with 8.3.x and was not an issue in 8.2.x, so some change happened somewhere.

@ScottWales ScottWales marked this pull request as ready for review August 15, 2024 05:39
@oliver-sanders
Member

I'm a bit confused by this. In order to start a Cylc workflow, the cylc command must be in $PATH.

So how could a subprocess started from within a process where cylc is in the $PATH end up with an environment where cylc is not in $PATH?!

Two things come to mind:

  1. Cylc runs commands via Bash login shells. This allows Cylc to be configured via Bash startup files. These Bash startup files could potentially overwrite the environment, i.e. export PATH=....
  2. The clean job submission environment configuration will purposefully remove entries from $PATH, but only for job-submission. This might explain why this PR only changes the logic for job-submission, and not for the other subprocess calls? If this is the case, there is a pass-through configuration that goes with clean job submission environment. This allows you to whitelist paths from less standard locations that you do not want removed for job-submission.

@oliver-sanders
Member

oliver-sanders commented Aug 15, 2024

I think this only became an issue for us with 8.3.x and was not an issue in 8.2.x

Now that's really confusing! I don't think there were any changes during this period that could have explained this behaviour. Nothing is jumping out from this list:

https://github.com/cylc/cylc-flow/milestone/117?closed=1

Cylc devs may disagree whether this is a feature or bugfix though.

It looks like a clear bug to me, but we're going to need to understand what's going on here a bit better to determine whether this is a Cylc bug or a setup issue.

Suggestions:

  1. Make sure you're not using clean job submission environment.
  2. Make sure there's no $PATH overriding in your shell startup files.
  3. Try inspecting $PATH in the job submission environment.
    • You can do this by copying the Cylc "job runner" and overriding the SUBMIT_CMD_TMPL (e.g. pbs job runner). How does the environment there differ from the one the scheduler is running in?
    • Or, you could probably hack the code to replace cylc with the env command where you made the modifications for this PR.

@ColemanTom
Contributor

ColemanTom commented Aug 15, 2024

Ok, the "worked in 8.2" may be wrong. There may be other things affecting that or I may have been misled. I just tried in a clean environment and this same thing happens. I put more details in the Issue, but I can put some full details here if you prefer.

As you can see below, there is nothing in .bashrc and global.cylc is basic. Both 8.2.4 and 8.3.3 fail at job submission because cylc is not in my login shell PATH; I moved it to ~/cb/ to demonstrate. The job scripts do add it to the PATH.

$ cat ~/.bashrc
# .bashrc

# Source global definitions (Required for modules)
if [ -f /etc/bashrc ]; then
        . /etc/bashrc
fi
return
$ cat ~/.cylc/flow/global.cylc
# default configuration for the Cylc "swarm" platforms
# override configurations in your global.cylc/global-tests.cylc as required

[platforms]
    [[localhost]]
        cylc path = HOME/cb

$ cylc play rewind/8.2.4

 ▪ ■  Cylc Workflow Engine 8.2.4
 ██   Copyright (C) 2008-2024 NIWA
▝▘    & British Crown (Met Office) & Contributors

2024-08-16T08:39:09+10:00 INFO - Extracting job.sh to HOME/cylc-run/rewind/8.2.4/.service/etc/job.sh
rewind/8.2.4: HOST PID=50245
$ export CYLC_VERSION=8.3.3; ~/cb/cylc vip --run-name $CYLC_VERSION $PWD
$ cylc validate HOME/code/rewind
Valid for cylc-8.3.3
$ cylc install HOME/code/rewind
INSTALLED rewind/8.3.3 from HOME/code/rewind
NOTE: 1 run of "rewind" is already active:
  ▶ rewind/8.2.4 HOST:43012 50245
You can stop it with:
  cylc stop rewind/8.2.4
See "cylc stop --help" for options.
$ cylc play rewind/8.3.3

 ▪ ■  Cylc Workflow Engine 8.3.3
 ██   Copyright (C) 2008-2024 NIWA
▝▘    & British Crown (Met Office) & Contributors

INFO - Extracting job.sh to HOME/cylc-run/rewind/8.3.3/.service/etc/job.sh
rewind/8.3.3: HOST PID=52062

$ ~/cb/cylc cat-log rewind/8.2.4 | tail
2024-08-16T08:39:11+10:00 ERROR - [jobs-submit cmd] cylc jobs-submit --path=/bin --path=/usr/bin --path=/usr/local/bin --path=/sbin --path=/usr/sbin --path=/usr/local/sbin -- '$HOME/cylc-run/rewind/8.2.4/log/job' 20240806T0000Z/a/01
    [jobs-submit ret_code] 1
    [jobs-submit out] 2024-08-16T08:39:11+10:00|20240806T0000Z/a/01|1
2024-08-16T08:39:11+10:00 CRITICAL - [20240806T0000Z/a preparing job:01 flows:1] submission failed
2024-08-16T08:39:11+10:00 INFO - [20240806T0000Z/a preparing job:01 flows:1] => submit-failed
2024-08-16T08:39:11+10:00 WARNING - [20240806T0000Z/a submit-failed job:01 flows:1] did not complete required outputs: ['submitted', 'succeeded']
2024-08-16T08:39:11+10:00 ERROR - Incomplete tasks:
      * 20240806T0000Z/a did not complete required outputs: ['submitted', 'succeeded']
2024-08-16T08:39:11+10:00 CRITICAL - Workflow stalled
2024-08-16T08:39:11+10:00 WARNING - PT1H stall timer starts NOW

$ ~/cb/cylc cat-log rewind/8.3.3 | tail
    [jobs-submit out] 2024-08-16T08:39:39+10:00|20240806T0000Z/a/01|1
2024-08-16T08:39:39+10:00 CRITICAL - [20240806T0000Z/a/01:preparing] submission failed
2024-08-16T08:39:39+10:00 INFO - [20240806T0000Z/a/01:preparing] => submit-failed
2024-08-16T08:39:39+10:00 WARNING - [20240806T0000Z/a/01:submit-failed] did not complete the required outputs:
    ⨯ ⦙  succeeded
2024-08-16T08:39:39+10:00 ERROR - Incomplete tasks:
    * 20240806T0000Z/a did not complete the required outputs:
      ⨯ ⦙  succeeded
2024-08-16T08:39:39+10:00 CRITICAL - Workflow stalled
2024-08-16T08:39:39+10:00 WARNING - PT1H stall timer starts NOW

$ grep PATH ~/cylc-run/rewind/*/log/job/20240806T0000Z/a/NN/job
HOME/cylc-run/rewind/8.2.4/log/job/20240806T0000Z/a/NN/job:export PATH=HOME/cb:$PATH
HOME/cylc-run/rewind/8.3.3/log/job/20240806T0000Z/a/NN/job:export PATH=HOME/cb:$PATH

As mentioned in #6301, this platforms approach is one option. The other would be making sure the PATH, which is obviously known to Cylc given it adds it to the job script, is passed through in the subprocess call for cylc jobs-submit. The latter approach is probably better IMO.

@ColemanTom
Contributor

The other would be making sure the PATH, which is obviously known to Cylc given it adds it to the job script, is passed through in the subprocess call for cylc jobs-submit. The latter approach is probably better IMO.

Actually, a correction to this: the job script uses what is defined in job_conf['platform']['cylc path']. So the behaviour added here is consistent with the job script change.

@ScottWales
Contributor Author

We're seeing the issue in our automated environment which I don't have access to myself, but I would guess the automation is running with a clean user environment and calling cylc explicitly e.g. /path/to/wrapper/cylc vip workflow so there's nothing on PATH to find.
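
To illustrate that guess, here is a minimal Python sketch (not Cylc code) of why starting a program by its absolute path does not make it findable by bare name in its children. The stand-in `cylc` file and the temporary directories are hypothetical; the only thing that differs between the two lookups is `PATH`.

```python
import os
import subprocess
import sys
import tempfile


def which_cylc_in_child(path_value):
    """Run a child process with the given PATH and ask it to resolve 'cylc'."""
    env = dict(os.environ, PATH=path_value)
    code = "import shutil; print(shutil.which('cylc') or 'NOT FOUND')"
    # The child is started by absolute path (sys.executable), so it runs
    # even though its own directory is not on PATH.
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def demo():
    with tempfile.TemporaryDirectory() as wrapper_dir, \
            tempfile.TemporaryDirectory() as empty_dir:
        # Hypothetical stand-in for /path/to/wrapper/cylc.
        wrapper = os.path.join(wrapper_dir, "cylc")
        with open(wrapper, "w") as handle:
            handle.write("#!/bin/sh\necho stand-in\n")
        os.chmod(wrapper, 0o755)

        # A "clean" PATH that does not contain the wrapper directory.
        clean = which_cylc_in_child(empty_dir)
        # The same lookup once the wrapper directory is on PATH.
        with_wrapper = which_cylc_in_child(wrapper_dir + os.pathsep + empty_dir)
        return wrapper, clean, with_wrapper


wrapper, clean, with_wrapper = demo()
print("clean PATH:       ", clean)
print("wrapper dir added:", with_wrapper)
```

If the automation really does behave like the first lookup, any subprocess that calls plain `cylc` would fail in the same way, regardless of how the parent itself was started.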

@oliver-sanders
Member

Thanks for the info, but I'm still confused.

In order for you to run the cylc play command, the cylc command must be in your $PATH. This $PATH will be inherited by the subprocesses that Cylc spawns.

For example:

$ FOO=42 python -c 'import subprocess; subprocess.Popen(["env"])' | grep FOO
FOO=42

Your report is suggesting that an environment variable has changed in the subprocess, along the lines of this:

$ PATH=aaa python -c 'import subprocess; subprocess.Popen(["env"])' | grep PATH
PATH=bbb

We need to understand what mechanism is causing $PATH to be changed.

If this mechanism is inside of Cylc (quite possible), it needs fixing inside of Cylc. If the mechanism is inside the deployment, it needs fixing in the deployment.

I'm not sure that hardcoding the path to the Cylc executable is the right way to go about this. What if job submission requires other things to be in $PATH, such as the job-submission command for example?

@ColemanTom
Contributor

must be in your $PATH

Not if you call the wrapper directly.

@oliver-sanders
Member

oliver-sanders commented Aug 16, 2024

We're seeing the issue in our automated environment which I don't have access to myself, but I would guess the automation is running with a clean user environment and calling cylc explicitly e.g. /path/to/wrapper/cylc vip workflow so there's nothing on PATH to find.

We should be able to determine whether this is the cause by adding something along the lines of export PATH="$(dirname "$0"):$PATH" into the /path/to/wrapper/cylc script. If that fixes the issue, we know what the cause is and can start thinking about the fix.

Calling the wrapper directly might conflict with other aspects of how Cylc works as the stack Cylc relies on might not function in this configuration [1].

[1] That is, if the wrapper does not modify the environment correctly.

@oliver-sanders
Member

must be in your $PATH

Not if you call the wrapper directly.

The example you gave used the command $ cylc play rewind/8.2.4?

@ColemanTom
Contributor

ColemanTom commented Aug 16, 2024

The example you gave used the command $ cylc play rewind/8.2.4?

That must be a copy error. Look at the 8.3.3 one after it. Sorry.

@oliver-sanders
Member

oliver-sanders commented Aug 16, 2024

Right.

Can you confirm that setting the $PATH in the wrapper script resolves the issue? Once we know that's the cause, we can think about how to resolve it (there are other local Cylc subprocess calls besides job-submission).

@ColemanTom
Contributor

Can you confirm that setting the $PATH in the wrapper script resolves the issue? Once we know that's the cause, we can think about how to resolve it (there are other local Cylc subprocess calls besides job-submission).

I have just modified the wrapper and run again

$ export CYLC_VERSION=8.3.3; ~/cb/cylc vip --run-name $CYLC_VERSION.patch $PWD
$ cylc validate HOME/code/rewind
Valid for cylc-8.3.3
$ cylc install HOME/code/rewind
INSTALLED rewind/8.3.3.patch from HOME/code/rewind
$ cylc play rewind/8.3.3.patch

 ▪ ■  Cylc Workflow Engine 8.3.3
 ██   Copyright (C) 2008-2024 NIWA
▝▘    & British Crown (Met Office) & Contributors

INFO - Extracting job.sh to HOME/cylc-run/rewind/8.3.3.patch/.service/etc/job.sh
rewind/8.3.3.patch: HOST PID=2995626

...

2024-08-20T10:04:27+10:00 DEBUG - [jobs-submit cmd] cylc jobs-submit --debug --path=/bin --path=/usr/bin --path=/usr/local/bin --path=/sbin --path=/usr/sbin --path=/usr/local/sbin -- '$HOME/cylc-run/rewind/8.3.3.patch/log/job' 20240806T0000Z/a/01
    [jobs-submit ret_code] 0
    [jobs-submit out]

So yes, an update to the wrapper allows it to pass.

Related to this, the patch provided by @ScottWales doesn't fully do what is needed. One of our developers found this issue happened for xtriggers too. And this patch did not resolve that.

@ColemanTom
Contributor

Summary: I think the best change is a modification to the suggested wrapper. Perhaps a simple change at https://github.com/cylc/cylc-flow/blob/master/cylc/flow/etc/cylc#L176 (with appropriate commenting)

- if [[ ${0##*/} == "cylc" && ${1:-} == "hub" ]]; then
+ if [[ ${0##*/} == "cylc" ]]; then

@ColemanTom
Contributor

Summary: I think the best change is a modification to the suggested wrapper. Perhaps a simple change at https://github.com/cylc/cylc-flow/blob/master/cylc/flow/etc/cylc#L176 (with appropriate commenting)

- if [[ ${0##*/} == "cylc" && ${1:-} == "hub" ]]; then
+ if [[ ${0##*/} == "cylc" ]]; then

Actually, this can't be right. Looking at how things were run where we saw these failures, the PATH had been modified to include the bin location. So whilst adding it to the PATH fixed it in my little user space test case, the one which actually is used, which I do not have access to to fiddle with and explore, already had it in the PATH.

@jarich

jarich commented Aug 21, 2024

This has also tripped us up for seeing cylc logs in the hub. The cause is these lines:

  1. https://github.com/cylc/cylc-uiserver/blob/7b492b7b7909fdf6a9ef93990276a3cdbf2d3c4b/cylc/uiserver/resolvers.py#L360
  2. https://github.com/cylc/cylc-uiserver/blob/7b492b7b7909fdf6a9ef93990276a3cdbf2d3c4b/cylc/uiserver/resolvers.py#L441

By default these seem to call the miniconda environment's bin/cylc, but that cylc doesn't know enough to populate the jinja in our global.cylc and therefore the required platform is not found.

@oliver-sanders
Member

Summary: I think the best change is a modification to the suggested wrapper. Perhaps a simple change at https://github.com/cylc/cylc-flow/blob/master/cylc/flow/etc/cylc#L176 (with appropriate commenting)

We have been very careful to avoid environment manipulation in the wrapper script. The exception for cylc hub is more of a hack to work around its requirement on an external non-python system (configurable-http-proxy).

The reason we have avoided environment manipulation is that this alters the environment of Rose apps in a way that is very difficult to undo. It may also alter the environment of Cylc tasks, e.g. tasks submitted to localhost platforms. So this is not a change that we would want to use at our site as it would have functional impacts.

I think the root cause of this is the desire to use multiple Cylc wrapper scripts, but not wanting to load the appropriate wrapper script in the $PATH. I do not understand the requirement for this, but whatever the motivation, it should be solvable by either modifying the Cylc wrapper to add this extra logic, or by creating a higher-level wrapper which invokes the appropriate wrapper which could then be put in the $PATH?

@ColemanTom
Contributor

As I mentioned, I do not believe modifying the wrapper was right. We are setting the PATH variable to include the cylc wrapper already. The problem only occurs when Cylc launches the workflow on a remote host. If it runs on the localhost (--host=localhost), we have no problems.

So, looking at remote.py, which is what does this: it assumes that, when logging onto a remote host to launch a workflow, the cylc wrapper is available either via platform['cylc path'] or from the login bashrc. I have confirmed that if I put PATH into ssh forward environment variables in the global.cylc then the error disappears.

From a very quick thought process, something in https://github.com/cylc/cylc-flow/blob/master/cylc/flow/remote.py#L297-L330 would be modified so that, on SSH, it would pass in something like env CYLC_VERSION=.... CYLC_ENV_NAME=.... PATH={CYLC_HOME}:$PATH to ensure the path to the wrapper script is passed through to the remote host too.
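
As a rough sketch of that idea only (this is not the code in remote.py, and `build_remote_command`, its parameters and the example paths are hypothetical), the remote command could be assembled along these lines. Forwarded values are quoted locally, while `$PATH` is left unexpanded so that it is resolved on the remote host.

```python
import shlex


def build_remote_command(host, cylc_args, cylc_dir, forwarded):
    """Return an ssh argv that prefixes the remote cylc call with `env ...`.

    host:      remote host name
    cylc_args: arguments for the remote cylc call, e.g. ["play", "my-flow"]
    cylc_dir:  directory containing the cylc wrapper on the remote host
    forwarded: mapping of environment variable names to values to forward
    """
    assignments = [
        f"{name}={shlex.quote(value)}"
        for name, value in sorted(forwarded.items())
    ]
    # Prepend the wrapper directory; "$PATH" stays literal here so that the
    # remote shell expands it, not the local one.
    assignments.append(f'PATH={shlex.quote(cylc_dir)}:"$PATH"')
    remote = " ".join(
        ["env", *assignments, "cylc", *(shlex.quote(a) for a in cylc_args)]
    )
    return ["ssh", host, remote]


command = build_remote_command(
    "cylc-2",
    ["play", "rewind/8.3.3"],
    "/opt/cylc/bin",
    {"CYLC_VERSION": "8.3.3"},
)
print(command)
```

Whether prepending to PATH like this is acceptable is a separate question, since it changes the environment of everything the remote command goes on to launch.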

@oliver-sanders
Member

oliver-sanders commented Aug 22, 2024

The way we work is like this:

  • The wrapper script is in the $PATH.
  • The Conda bin/ directory is never explicitly placed in $PATH (except for cylc hub use so it can pick up configurable-http-proxy).
  • Otherwise we don't modify $PATH at all in the wrapper script (for good reason, this would override the system env).
  • The cylc-path can be configured for remote platforms as needed, point this at a wrapper script if you are using one.
  • The cylc-path does not need to be configured for localhost as the wrapper is already in the path.

If you are having issues, perhaps it is with cylc hub use? Are your Cylc UI Servers being spawned with a path inherited from the hub? That would make sense if you are using LocalProcessSpawner. If so, perhaps this is messing with the logic.

If so, try making a copy of the cylc wrapper for configurable-http-proxy (like how we do for rose and isodatetime) and removing the if [[ ${0##*/} == "cylc" && ${1:-} == "hub" ]]; then hack, and restart your hub/servers. This will prevent the PATH modification from leaking through the hub into the uiserver into the scheduler.

It sounds like you have modified your wrapper script in other ways which is likely exacerbating the issue. I'm guessing that this is in relation to per-workflow symlink-dir config? However, this should not be necessary as we added special logic to allow env vars in the rose-suite.conf file to be used in the global.cylc to allow per-workflow customisations to be configured in this way. It was a right pain in the neck as the global.cylc has to be reloaded to support this, but we have confirmed the approach works for configuring [symlink dirs].

@hjoliver
Member

hjoliver commented Aug 22, 2024

The cylc-path can be configured for remote platforms as needed, point this at a wrapper script if you are using one.

Probably stating the obvious @ColemanTom but this config is only needed if you haven't arranged for the wrapper path to be in the default PATH on the remote.

Is it not possible to do that at your site because you don't have a central wrapper? (I think you are deploying a separate wrapper with each workflow?)

https://cylc.github.io/cylc-doc/stable/html/reference/config/global.html#global.cylc[platforms][%3Cplatform%20name%3E]cylc%20path

@ScottWales
Contributor Author

I think I have a better understanding of what's happening now. Still doing some testing but I think it's as Hilary says.

We have two Cylc servers, cylc-1 and cylc-2. The automation goes to one of these, say cylc-1, and starts a workflow with PATH set to find the wrapper, all good there.

Cylc then decides to load-balance the run to run on cylc-2, submitting with the equivalent of

ssh cylc-2 /path/to/wrapper/cylc play

using the Cylc path defined in the [[localhost]] wrapper explicitly. This is what catches us, since we're not forwarding the PATH variable or adding the wrapper to $PATH in /etc/profile or equivalent. Adding the real Cylc to $PATH in the wrapper should fix the issue.

@oliver-sanders
Member

oliver-sanders commented Aug 23, 2024

That doesn't quite make sense as the load-balancing works by re-invoking the same cylc command on the selected host. This means the new process should be going through the same wrapper script as the original call. As long as the Cylc wrapper is in $PATH everything should work fine. If you don't want the Cylc wrapper in $PATH, then you're going to have to come up with some other mechanism, but I don't understand the motive for this.

Adding the real Cylc to $PATH in the wrapper should fix the issue.

Note, you don't want to add the "real" Cylc to $PATH, you want to add the Cylc wrapper to $PATH, but you shouldn't need to add the Cylc wrapper to $PATH because it should already be in $PATH.

@ScottWales
Contributor Author

ScottWales commented Aug 23, 2024 via email

@hjoliver
Member

It shouldn’t matter if subprocesses are called with the wrapper, the environment should be the same either way shouldn’t it?

Not sure if I've understood the question exactly, but the main point of the wrapper is to cause the right version of Cylc (i.e., the right Cylc environment) to be invoked.

@hjoliver
Member

hjoliver commented Aug 26, 2024

@ColemanTom and @ScottWales - it might help (me at least) to take a step back and compare what you're doing with the way we intend and recommend Cylc to be used:

  1. Every scheduler host, and every job host, needs a central wrapper script called cylc that is in the default $PATH on the host (or if not in $PATH, specified via the global config cylc path platform setting) [of course, on a shared FS this may entail a single wrapper accessible on all hosts].
  2. The cylc wrapper (being in the default $PATH) will automatically be invoked whenever the scheduler ssh's to a job platform to submit a job there (or to kill a job, or whatever)
  3. The cylc wrapper knows where all the Cylc versions are installed, and will invoke the right version as specified by $CYLC_HOME or $CYLC_VERSION or $CYLC_ENV_NAME in the environment (initially set by the user to start the right scheduler, and then exported by the scheduler to ensure that remote jobs call the same Cylc version too).
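
Point 3 can be sketched roughly as follows. This is a Python illustration of the selection idea only, not the wrapper itself (which is a Bash script); the function name, the `/opt/cylc` install root, the `cylc-<version>` directory naming, the default and the precedence order are all assumptions made for the example.

```python
import os


def resolve_cylc_home(env, install_root="/opt/cylc"):
    """Pick the Cylc installation directory from the environment.

    Assumed precedence: CYLC_HOME, then CYLC_ENV_NAME, then CYLC_VERSION,
    then a default environment named "cylc" under install_root.
    """
    if env.get("CYLC_HOME"):
        # An explicit installation path is used as-is.
        return env["CYLC_HOME"]
    if env.get("CYLC_ENV_NAME"):
        # A named environment under the central install root.
        return os.path.join(install_root, env["CYLC_ENV_NAME"])
    if env.get("CYLC_VERSION"):
        # Assumed naming convention: one environment per version.
        return os.path.join(install_root, "cylc-" + env["CYLC_VERSION"])
    return os.path.join(install_root, "cylc")


print(resolve_cylc_home({"CYLC_VERSION": "8.3.3"}))
print(resolve_cylc_home({"CYLC_ENV_NAME": "cylc-dev", "CYLC_VERSION": "8.3.3"}))
```

Because the scheduler exports the same variables to its jobs, the same lookup on a remote host would select the matching version there too.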

This is rather simple, and it works really well - even for users who need to run distributed workflows under different versions of Cylc at the same time.

So I think the question should be: why can you not just do it this way?

Then, if you do have a valid reason for doing something different to what we have recommended and tested, what exactly is different - because that might help us understand how best to achieve it. E.g.:

  • I think I was told that every individual workflow is deployed with its own wrapper?
  • has someone decreed that there is not allowed to be a central wrapper on the system?
  • and if so,
    • what's the point of even having wrappers - the purpose of which is to select between Cylc versions?
    • does each workflow get its own isolated Cylc environment too, or are they still installed centrally?

@ScottWales
Contributor Author

This is pretty much what we're doing, steps 2 and 3 work fine. Our issue is that we are trying to do

(or if not in $PATH, specified via the global config cylc path platform setting)

And then when Cylc spawns subprocesses it fails, as subprocesses do not look at this platform setting.

It has been decreed by the higher ups that all of our workflows should have independent environments, including the Cylc wrapper and Cylc Conda environments alongside whatever the workflow itself needs, so yes no central wrapper. Details of operational use are outside my area, but I think they still want the wrapper to be able to change Cylc versions; it's just that each of the workflows will have its own independent cylc-8.3, cylc-8.4 etc. environments. We also use the wrapper to manage project output directories with some customisations. We've not yet migrated to the rose feature, though that does sound useful.

@hjoliver
Member

hjoliver commented Aug 26, 2024

And then when Cylc spawns subprocesses it fails, as subprocesses do not look at this platform setting.

Right, hopefully re-reading the posts above will show which subprocesses exactly. It should work - all job submission, for instance, happens in subprocesses, and I'm pretty sure we test those platform settings.

OK I've confirmed that the cylc path global config setting does get through the job-submission subprocess, to a remote host where cylc is not in the default $PATH, and the value of it gets added to $PATH in the job script so that the job itself has access to Cylc on the remote.

[scheduling]
    [[graph]]
        R1 = "foo"
[runtime]
    [[foo]]
        script = "cylc version --long"
        platform = <my-remote> 

Tested with and without clean job submission environment = True

@hjoliver
Member

This has also tripped us up for seeing cylc logs in the hub. The cause is these lines:

  1. https://github.com/cylc/cylc-uiserver/blob/7b492b7b7909fdf6a9ef93990276a3cdbf2d3c4b/cylc/uiserver/resolvers.py#L360
  2. https://github.com/cylc/cylc-uiserver/blob/7b492b7b7909fdf6a9ef93990276a3cdbf2d3c4b/cylc/uiserver/resolvers.py#L441

By default these seem to call the miniconda environment's bin/cylc, but that cylc doesn't know enough to populate the jinja in our global.cylc and therefore the required platform is not found.

Not sure if this is a side-issue or not, but I've just checked that cylc cat-log (from the CLI, at least) does respect the platform-specific cylc path global config setting too.

@hjoliver
Member

It has been decreed by the higher ups that all of our workflows should have independent environments, including the Cylc wrapper and Cylc Conda environments alongside whatever the workflow itself needs, so yes no central wrapper.

@ScottWales - does this imply that you cannot use the cylc path platform setting, because you do not have a platform-specific cylc wrapper (every workflow has its own wrapper)?

@hjoliver
Member

hjoliver commented Aug 27, 2024

This has also tripped us up for seeing cylc logs in the hub. The cause is these lines:

  1. https://github.com/cylc/cylc-uiserver/blob/7b492b7b7909fdf6a9ef93990276a3cdbf2d3c4b/cylc/uiserver/resolvers.py#L360
  2. https://github.com/cylc/cylc-uiserver/blob/7b492b7b7909fdf6a9ef93990276a3cdbf2d3c4b/cylc/uiserver/resolvers.py#L441

By default these seem to call the miniconda environment's bin/cylc, but that cylc doesn't know enough to populate the jinja in our global.cylc and therefore the required platform is not found.

@jarich - a quick check shows that cylc cat-log respects the cylc path global config setting, for platforms that don't have cylc in $PATH.

But as per my previous question to @ScottWales it's looking like:

  • you cannot put cylc in $PATH because there is no central wrapper (every workflow has its own wrapper)
  • you cannot use the cylc path platform setting either (for the same reason)

Is that right? (Just to be 100% clear)

[UPDATE] or does @ScottWales' comment contradict that:

And then when Cylc spawns subprocesses it fails, as subprocesses do not look at this platform setting.

Anyhow, I'm not grokking this bit. Subprocesses should not need to look up platform settings. What runs in the subprocess is a command constructed by the scheduler, which does look up platform settings.

@jarich

jarich commented Aug 27, 2024 via email

@hjoliver
Member

hjoliver commented Aug 27, 2024

What does cylc look for for local submissions?

Do you mean how does Cylc distinguish local job submissions from remote ones?

Also, @jarich - could you answer my question just above, to help me get a handle on exactly what you're trying to do at your site?

  • you cannot put cylc in $PATH because there is no central wrapper
    (every workflow has its own wrapper)
  • you cannot use the cylc path platform setting either (for the same reasons)

Is that right? (Just to be 100% clear)

Or if this is getting too complicated for a GitHub Issue discussion, maybe we should tee up another video chat?

@oliver-sanders
Member

help me get a handle on exactly what you're trying to do at your site?

Additionally, we could also do with answers to other questions asked above:

  1. Are these issues you are reporting in relation to the use of cylc hub?

    If you are having issues, perhaps it is with cylc hub use? Are your Cylc UI Servers being spawned with a path inherited from the hub? That would make sense if you are using LocalProcessSpawner. If so, perhaps this is messing with the logic.

  2. What is the motivation behind this multiple wrapper script approach? This seems to be the root cause of the issues.

    It sounds like you have modified your wrapper script in other ways which is likely exacerbating the issue. I'm guessing that this is in relation to "install: configure remote platform symlink dirs per workflow" (#5418)? However, this should not be necessary as we added special logic ("Environment variables set in rose-suite.conf should apply when the global config is loaded", cylc-rose#237) to allow env vars in the rose-suite.conf file to be used in the global.cylc to allow per-workflow customisations to be configured in this way. It was a right pain in the neck as the global.cylc has to be reloaded to support this, but we have confirmed the approach works for configuring [symlink dirs].

As it stands, I don't think that this PR is going to solve your problems as it only targets job submission. Cylc launches many other subprocesses besides the job submission command which would not be fixed by this patch, you've also reported issues with subcommands launched by Cylc UI Server which this will not address.

We need a better grapple on the problem before we can suggest a solution.

@jarich

jarich commented Aug 28, 2024 via email

@oliver-sanders
Member

Sorry to hear that. No pressure from us, happy to park this for now.

@jarich

jarich commented Sep 2, 2024

Our original goal was to install workflows in a similar fashion to containers without actually using containers. Ideally this means that every workflow has everything it needs in its own space (except the actual global configuration file and some other things that we couldn't localise due to the not-actually-containers aspect of it). We have had to compromise on this to get everything to work, but there are legacies left in how we're deploying things because of that original goal.

Originally we were installing a cylc-bin/ for the wrapper, symlink for rose etc and Cylc conda environment in a separate location for each workflow. Currently we're still installing the individual cylc-bin/{cylc,rose,*} in a separate location for each workflow, but now installing Cylc conda environments in a separate location for each user (but shared across their workflows). The correct path for each workflow's cylc-bin directory is exported to their $PATH when we start the Cylc workflow. It is not added to the workflow owner's .bashrc file. Child processes should inherit the exported $PATH, but new connections will not.

To enable the UI Servers (running on a separate host than the schedulers) to find Cylc on the scheduler hosts we installed the cylc wrapper in /opt/cylc8/cylc-bin/cylc on all hosts. This is added to $PATH when the UI Servers are started. The global configuration sets cylc path as /opt/cylc8/cylc-bin/cylc for localhost submissions.

Workflows that are started with --host localhost do not run into the problems we have hit since moving to Cylc 8.3, which is why we only started to see these problems more recently. We needed to wait for Cylc 8.3's ability to forward environment variables so that we can pass through enough information that the cylc wrapper in /opt/cylc8/cylc-bin/cylc can find the correct conda environment for the workflow. However, as per the start of this ticket, cylc path from the global config does not appear to be used in all cases where cylc is called. We have discussed, but been wary of, passing PATH as one of the forwarded environment variables.

In cylc/uiserver/resolvers.py (from whichever was the youngest uiserver version 3 weeks ago) the calls to cylc cat-log appear to invoke the conda/bin/cylc not the /opt/cylc8/cylc-bin/cylc we exported into $PATH when we started the UI Server. Hard-coding the path in those calls resolves the issue, but demonstrates the problem.

* you cannot put `cylc` in `$PATH` because there is no central wrapper (every workflow has its own wrapper)

As per the above, we do put the relevant cylc-bin/cylc into $PATH when we start the Cylc workflow, and we put /opt/cylc8/cylc-bin/cylc into $PATH when we start UI Servers.

* you cannot use the `cylc path` platform setting either (for the same reason)

We make good use of the cylc path platform setting for both job host and localhost submissions. Our problem seems to be that somehow or another we aren't seeing Cylc honour it in some situations. Calls to cylc psutil from the WUI to the scheduler hosts correctly use the cylc path setting. The various checks that decide which host to run the workflow on (load etc) also correctly use the cylc path setting. The cylc path setting is used correctly to find the correct cylc-bin on the job host. Calls to cylc cat-log from resolvers.py do not seem to honour the cylc path setting, and we've run into some issues with localhost job submission and xtrigger settings as well.

@hjoliver
Member

hjoliver commented Sep 2, 2024

While I try to unpack that response, one clarification on my part:

  • you cannot put cylc in $PATH because there is no central wrapper

I should have said in the default $PATH on all Cylc hosts, so that any cylc invocations over ssh will automatically find the central wrapper.

I'll try to reproduce a failure to use the cylc path setting as you've said above...

@oliver-sanders
Member

Ideally this means that every workflow has everything it needs in its own space (except the actually global configuration file and some other things that we couldn't localise due to the not-actually containers aspect of it)

Wow, ok. This is a new deployment pattern for us and not one we presently support or test so I would expect to hit a few stumbling blocks when trailblazing it.

We (and others) have deployed containerised Cylc orchestrations onto cloud platforms. This pushes the environment management problem (solved by the wrapper script) up a level to an orchestration problem. However, this half-way solution is a bit different.

I don't think that trying to adopt this pattern is going to help ease the transition to container orchestrations later, because this part of the solution (the Cylc environment management bit) will need to be re-implemented when you move to containers anyway. In a Cylc container orchestration:

  • The Cylc installation will be per-container not per-workflow.
  • There will be no wrapper script as the container will only have one Cylc environment to manage and it will be installed into the system environment.
  • The job-submission script will need to configure the container that jobs run in rather than the per-workflow wrapper.

As long as you're deploying a conventional-ish distributed system, I would personally suggest sticking with the supported centralised deployment approach and only changing this when you move to a container orchestration (as time / resources / systems allow). This is going to be a lot simpler to set up and, because you're following the supported pattern, your systems won't be exposed to the unnecessary risk of an untested deployment pattern.

Assuming you want to push ahead with this approach, here are the immediate problems:

Problem 1) The cylc hub wrapper

I think you are hitting an issue caused by this nasty hack in the wrapper:

# Set PATH when running cylc hub so that configurable-http-proxy can find node
if [[ ${0##*/} == "cylc" && ${1:-} == "hub" ]]; then
    PATH=${CYLC_HOME}/bin:${PATH}
fi

This adds the bin/ directory containing the real Cylc executable (not the wrapper) into the $PATH. This is a lazy workaround to ensure configurable-http-proxy is in the $PATH, but it bypasses the wrapper.
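
A minimal sketch of the effect, using two fake `cylc` scripts in temporary directories (the directory names and the `wrapper`/`real` output are made up for the demo, this is not the real wrapper):

```shell
demo="$(mktemp -d)"
mkdir -p "$demo/wrapper-bin" "$demo/cylc-home/bin"

# Two fake executables that only report which one was run.
printf '#!/bin/sh\necho wrapper\n' > "$demo/wrapper-bin/cylc"
printf '#!/bin/sh\necho real\n' > "$demo/cylc-home/bin/cylc"
chmod +x "$demo/wrapper-bin/cylc" "$demo/cylc-home/bin/cylc"

CYLC_HOME="$demo/cylc-home"
wrapper_first="$demo/wrapper-bin:$PATH"

# With only the wrapper directory prepended, "cylc" resolves to the wrapper.
before="$(PATH="$wrapper_first" sh -c 'cylc')"

# After the hack prepends ${CYLC_HOME}/bin, subprocesses that call "cylc"
# resolve to the real executable and never go through the wrapper.
after="$(PATH="${CYLC_HOME}/bin:$wrapper_first" sh -c 'cylc')"

echo "before the PATH change: $before"
echo "after the PATH change:  $after"
```

Any subprocess started by the hub inherits the modified `$PATH`, so its `cylc` calls skip the wrapper in the same way.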

I think the "correct" way to resolve this issue (under the current wrapper script approach) would be to create a "configurable-http-proxy" wrapper (suggested here). With this wrapper script, the hack would no longer be required, which removes the nasty $PATH manipulation (which actually causes another issue).

I.e., you would need wrappers for:

  • cylc
  • isodatetime
  • rose (if using rose; note I'm not sure how rose will work with the per-workflow wrapper approach, as it also runs subprocesses).
  • configurable-http-proxy (or whatever proxy you have configured Jupyter Hub to work with).
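
As a rough sketch of what that could look like, assuming a wrapper that dispatches on the name it was invoked as (as the `${0##*/}` test in the snippet above suggests) and is exposed under each command name via symlinks. The directories and fake executables here are stand-ins, not a real installation:

```shell
demo="$(mktemp -d)"
env_home="$demo/env"            # stand-in for the real environment
wrapper_dir="$demo/cylc-bin"    # stand-in for the wrapper directory
mkdir -p "$env_home/bin" "$wrapper_dir"

# Fake "real" executables that report their name and arguments.
for name in cylc configurable-http-proxy; do
    printf '#!/bin/sh\necho "real %s $*"\n' "$name" > "$env_home/bin/$name"
    chmod +x "$env_home/bin/$name"
done

# One wrapper script, written once.
cat > "$wrapper_dir/wrapper" <<EOF
#!/bin/sh
# Run the real executable that has the same name this script was invoked as.
exec "$env_home/bin/\${0##*/}" "\$@"
EOF
chmod +x "$wrapper_dir/wrapper"

# Expose the wrapper under each command name.
ln -s wrapper "$wrapper_dir/cylc"
ln -s wrapper "$wrapper_dir/configurable-http-proxy"

# Only the wrapper directory is added to PATH, never the environment's bin/.
proxy_out="$(PATH="$wrapper_dir:$PATH" sh -c 'configurable-http-proxy example-arg')"
cylc_out="$(PATH="$wrapper_dir:$PATH" sh -c 'cylc example-arg')"
echo "$proxy_out"
echo "$cylc_out"
```

With every command routed this way, nothing needs to put the real `bin/` directory into `$PATH`.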

Problem 2) cylc path

We make good use of the cylc path platform setting for both job host and localhost submissions. Our problem seems to be that somehow or another we aren't seeing Cylc honour it in some situations.

It's not so much that Cylc isn't "respecting" this configuration; rather, the "cylc path" is being used for something a bit different from the problem it was designed to solve. I think the behaviour you are seeing is actually by design, but would require modification to support your use case.

Cylc does not support configuring the cylc path for the localhost platform. It sounds like you are defining an alternate local platform (with cylc path configured) to get around this. This may work in some situations, however, it will not work (reliably) in others.

For example, some functionalities know that they need to perform an action once on each "install target". So they will go through the list of platforms and pick the first one that matches the "install target" they require. This means that Cylc may well use the "localhost" platform settings rather than the "local" platform you've defined.

The "clean", "remote-tidy" (and I think "cat-log") functionalities are examples of this.

In other words, Cylc requires that the "localhost" platform works whether you are submitting jobs to it or not.
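
To illustrate the "first match on install target" behaviour, here is a toy sketch. It is not Cylc's implementation: the platform table, its ordering and the `first_platform_for_target` helper are all assumptions made up for the example.

```shell
# Hypothetical platform table, one "platform-name install-target" pair per
# line. The ordering is an assumption made for this example.
platforms='localhost localhost
local localhost
hpc hpc-login'

# Print the first platform whose install target matches the requested one.
first_platform_for_target() {
    target="$1"
    printf '%s\n' "$platforms" | while read -r name install_target; do
        if [ "$install_target" = "$target" ]; then
            printf '%s\n' "$name"
            break
        fi
    done
}

chosen="$(first_platform_for_target localhost)"
echo "platform used for install target localhost: $chosen"
```

Here both "localhost" and the custom "local" platform share an install target, so whichever is found first supplies the settings, and a cylc path configured only on "local" is never consulted.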

In order to permit your intended usage, we would need to broaden the capabilities of cylc path to cover all cylc commands that are submitted to localhost (not just job submission as targeted by this PR). I think this should be relatively easy as we have collected most commands into a single interface at Cylc 8, but we'll need to check all Popen calls are covered by this as intended (event handlers, auto-restart, job-submission, psutil, cat-log, there are a bunch).
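
Conceptually, each of those call sites would need to go through something like the helper below. This is only a sketch of the idea, not Cylc code: `cylc_executable` is a hypothetical name, and it assumes `cylc path` names the directory containing the executable.

```shell
# Hypothetical helper: pick the executable to run for a platform.
# Assumes the "cylc path" value is a directory containing "cylc".
cylc_executable() {
    cylc_path="${1:-}"
    if [ -n "$cylc_path" ]; then
        # Strip one trailing slash, then append the executable name.
        printf '%s/cylc\n' "${cylc_path%/}"
    else
        # No cylc path configured: fall back to a $PATH lookup.
        printf 'cylc\n'
    fi
}

configured="$(cylc_executable /opt/cylc8/cylc-bin/)"
unconfigured="$(cylc_executable "")"
echo "with cylc path set:    $configured cat-log ..."
echo "without cylc path set: $unconfigured cat-log ..."
```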

@oliver-sanders
Member

(marking as draft pending resolution of the above)

Development

Successfully merging this pull request may close these issues.

localhost cylc path should come from global.cylc platform if defined there
5 participants