-
Notifications
You must be signed in to change notification settings - Fork 94
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use full path to Cylc when submitting jobs #6302
base: master
Are you sure you want to change the base?
Conversation
The Cylc command is assumed to be on PATH in several other places:
It may make sense to use the full path in these places as well |
Test 00-torture test is failing because the Cylc path is explicitly set to a bad path |
Any ability for a small unit test for the new function? |
I'm wondering if this can go into 8.3.x rather than master too? Cylc devs may disagree whether this is a feature or bugfix though. Although I think this only became an issue for us with 8.3.x and was not an issue in 8.2.x so some change happned somewhere. |
d64b0d5
to
131fa03
Compare
I'm a bit confused by this. In order to start a Cylc workflow, the So how could a subprocess started from within a process where Two things come to mind:
|
Now that's really confusing! I don't think there were any changes during this period that could have explained this behaviour. Nothing is jumping out from this list: https://github.com/cylc/cylc-flow/milestone/117?closed=1
It looks like a clear bug to me, but we're going to need to understand what's going on here a bit better to determine whether this is a Cylc bug or a setup issue. Suggestions:
|
Ok, the "worked in 8.2" may be wrong. There may be other things affecting that or I may have been misled. I just tried in a clean environment and this same thing happens. I put more details in the Issue, but I can put some full details here if you prefer. Below, you can see, nothing in .bashrc, global.cylc is basic. Both 8.2.4 and 8.3.3 at least have job submission failure as I don't have
As mentioned in #6301, this |
Actually, correction to this, the |
We're seeing the issue in our automated environment which I don't have access to myself, but I would guess the automation is running with a clean user environment and calling cylc explicitly e.g. |
Thanks for the info, but I'm still confused. In order for you run the For example: $ FOO=42 python -c 'import subprocess; subprocess.Popen(["env"])' | grep FOO
FOO=42 Your report is suggesting that an environment variable has changed in the subprocess, along the lines of this: $ PATH=aaa python -c 'import subprocess; subprocess.Popen(["env"])' | grep PATH
PATH=bbb We need to understand what mechanism is causing If this mechanism is inside of Cylc (quite possible), it needs fixing inside of Cylc. If the mechanism is inside the deployment, it needs fixing in the deployment. I'm not sure that hardcoding the path to the Cylc executable is the right way to go about this. What if job submission requires other things to be in |
Not if you call the wrapper directly. |
We should be able to determine whether this is the cause by adding something along the lines of Calling the wrapper directly might conflict with other aspects of how Cylc works as the stack Cylc relies on might not function in this configuration [1].
|
The example you gave used the command |
That must be a copy error. Look at the 8.3.3 one after it. Sorry. |
Right. Can you confirm that setting the |
I have just modified the wrapper and run again
So yes, an update to the wrapper allows it to pass. Related to this, the patch provided by @ScottWales doesn't fully do what is needed. One of our developers found this issue happened for xtriggers too. And this patch did not resolve that. |
Summary: I think the best change is a modification to the suggested wrapper. Perhaps a simple change at https://github.com/cylc/cylc-flow/blob/master/cylc/flow/etc/cylc#L176 (with appropriate commenting)
|
Actually, this can't be right. Looking at how things were run where we saw these failures, the PATH had been modified to include the bin location. So whilst adding it to the PATH fixed it in my little user space test case, the one which actually is used, which I do not have access to to fiddle with and explore, already had it in the PATH. |
This has also tripped us up for seeing cylc logs in the hub. The cause are these lines:
By default these seem to call the miniconda environment's |
We have been very careful to avoid environment manipulation in the wrapper script. The exception for The reason we have avoided environment manipulation is that this alters the environment of Rose apps in a way that is very difficult to undo. It may also alter the environment of Cylc tasks, e.g. tasks submitted to localhost platforms. So this is not a change that we would want to use at our site as it would have functional impacts. I think the root cause of this is the desire to use multiple Cylc wrapper scripts, but not wanting to load the appropriate wrapper script in the |
As I mentioned, I do not believe modifying the wrapper was right. We are setting the PATH variable to include the cylc wrapper already. The problem only occurs when Cylc launches the workflow on a remote host. If it runs on the localhost ( So, looking at From a very quick thought process, something in https://github.com/cylc/cylc-flow/blob/master/cylc/flow/remote.py#L297-L330 would be modified so that, on SSH, it would pass in something like |
The way we work is like this:
If you are having issues, perhaps it is with If so, try making a copy of the It sounds like you have modified your wrapper script in other ways which is likely exacerbating the issue. I'm guessing that this is relation to per-workflow symlink-dir config? However, this should not be necessary as we added special logic to allow env vars in the |
Probably stating the obvious @ColemanTom but this config is only needed if you haven't arranged for the wrapper path to be in the default PATH on the remote. Is it not possible to do that at your site because you don't have a central wrapper? (I think you are deploying a separate wrapper with each workflow?) |
I think I have a better understanding of what's happening now. Still doing some testing but I think it's as Hilary says. We have two Cylc servers, Cylc then decides to load-balance the run to run on
using the Cylc path defined in the |
That doesn't quite make sense as the load-balancing works by re-invoking the same
Note, you don't want to add the "real" Cylc to |
The new process does go through the wrapper script because that’s how our global.cylc is set up, with the Cylc path set on localhost, but it does not set PATH. We could do so using the ssh forward environment variables, but are not at the moment.
It shouldn’t matter if subprocesses are called with the wrapper, the environment should be the same either way shouldn’t it?
On 23 Aug 2024, at 6:52 PM, Oliver Sanders ***@***.***> wrote:
That doesn't quite make sense as the load-balancing works by re-invoking the same cylc command on the selected host. This means the new process should be going through the same wrapper script as the original call. As long as the Cylc wrapper is in $PATH everything should work fine.
Adding the real Cylc to $PATH in the wrapper should fix the issue.
Note, you don't want to add the "real" Cylc to $PATH, you want to add the Cylc wrapper to $PATH, but you shouldn't need to add the Cylc wrapper to $PATH because it should already be in $PATH.
—
Reply to this email directly, view it on GitHub<#6302 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AABHK3JYEBXH7FR3AXSG4NDZS3Z3BAVCNFSM6AAAAABMRNREISVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMBWGYYTONBSGE>.
You are receiving this because you were mentioned.Message ID: ***@***.***>
|
Not sure if I've understood the question exactly, but the main point of the wrapper is to cause the right version of Cylc (i.e., the right Cylc environment) to be invoked. |
@ColemanTom and @ScottWales - it might help (me at least) to take a step back and compare what you're doing with the way we intend and recommend Cylc to be used:
This is rather simple, and it works really well - even for users who need to run distributed workflows under different versions of Cylc at the same time. So I think the question should be: why can you not just do it this way? Then, if you do have a valid reason for doing something different to what we have recommended and tested, what exactly is different - because that might help us understand how best to achieve it. E.g.:
|
This is pretty much what we're doing, steps 2 and 3 work fine. Our issue is that we are trying to do
And then when Cylc spawns subprocesses it fails, as subprocesses do not look at this platform setting. It has been decreed by the higher ups that all of our workflows should have independent environments, including the Cylc wrapper and Cylc Conda environments alongside whatever the workflow itself needs, so yes no central wrapper. Details of operational use are outside my area, but I think they still want the wrapper to be able to change Cylc versions, it's just each of the workflows will have their own independent |
Right, hopefully re-reading the posts above will show which subprocesses exactly. It should work - all job submission, for instance, happens in subprocesses, and I'm pretty sure we test those platform settings. OK I've confirmed that the [scheduling]
[[graph]]
R1 = "foo"
[runtime]
[[foo]]
script = "cylc version --long"
platform = <my-remote> Tested with and without |
Not sure if this is a side-issue or not, but I've just checked that |
@ScottWales - does this imply that you cannot use the |
@jarich - a quick check shows that But as per my previous question to @ScottWales it's looking like:
Is that right? (Just to be 100% clear) [UPDATE] or does @ScottWales' comment contradict that:
Anyhow, I'm not grokking this bit. Subprocesses should not need to look up platform settings. What runs in the subprocess is a command constructed by the scheduler, which does look up platform settings. |
I guess something I should clarify is the name of the localhost platform. I
had been using [local] and I'm sure that had been working better for me,
but we made a change recently and changed it to [localhost] (to match the
platform our users were specifying).
Maybe I should have kept both. What does cylc look for for local
submissions?
…On Tue, 27 Aug 2024, 12:00 Hilary James Oliver, ***@***.***> wrote:
This has also tripped us up for seeing cylc logs in the hub. The cause are
these lines:
1.
https://github.com/cylc/cylc-uiserver/blob/7b492b7b7909fdf6a9ef93990276a3cdbf2d3c4b/cylc/uiserver/resolvers.py#L360
2.
https://github.com/cylc/cylc-uiserver/blob/7b492b7b7909fdf6a9ef93990276a3cdbf2d3c4b/cylc/uiserver/resolvers.py#L441
By default these seem to call the miniconda environment's bin/cylc, but
that cylc doesn't know enough to populate the jinja in our global.cylc
and therefore the required platform is not found.
@jarich <https://github.com/jarich> - a quick check shows that cylc
cat-log respects the cylc path global config setting, for platforms that
don't have cylc in $PATH.
But as per my previous question to @ScottWales
<https://github.com/ScottWales> it's looking like:
- you cannot put cylc in $PATH because there is no central wrapper
(every workflow has its own wrapper)
- you cannot use the cylc path platform setting either (for the same
reasons)
Is that right? (Just to be 100% clear)
—
Reply to this email directly, view it on GitHub
<#6302 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAADI6JX3OOIAMRI3GZ26HTZTPMSJAVCNFSM6AAAAABMRNREISVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMJRGQZDGMZQHE>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Do you mean how does Cylc distinguish local job submissions from remote ones? Also, @jarich - could you answer my question just above, to help me get a handle on exactly what you're trying to do at your site?
Or if this is getting to complicated for a GitHub Issue discussion, maybe we should tee up another video chat? |
Additionally, we could also do with answers to other questions asked above:
As it stands, I don't think that this PR is going to solve your problems as it only targets job submission. Cylc launches many other subprocesses besides the job submission command which would not be fixed by this patch, you've also reported issues with subcommands launched by Cylc UI Server which this will not address. We need a better grapple on the problem before we can suggest a solution. |
I am currently on stress leave. I will try to answer your questions in
detail on Friday, but it might end up being early next week. My apologies
that I can't respond sooner.
…On Wed, 28 Aug 2024, 19:28 Oliver Sanders, ***@***.***> wrote:
help me get a handle on exactly what you're trying to do at your site?
Additionally, we could also do with answers to other questions asked above:
1.
Are these issues you are reporting in relation to the use of cylc hub?
If you are having issues, perhaps it is with cylc hub use? Are your
Cylc UI Servers being spawned with a path inherited from the hub? That
would make sense if you are using LocalProcessSpawner
<https://jupyterhub.readthedocs.io/en/stable/reference/api/spawner.html#localprocessspawner>.
If so, perhaps this is messing with the logic.
2.
What is the motivation behind this multiple wrapper script approach?
This seems to be the root cause of the issues.
It sounds like
<https://cylc.discourse.group/t/job-logs-not-showing-in-wui/1002/9>
you have modified your wrapper script in other ways which is likely
exacerbating the issue. I'm guessing that this is relation to #5418
<#5418>? However, this should
not be necessary as cylc/cylc-rose#237
<cylc/cylc-rose#237> to allow env vars in
the rose-suite.conf file to be used in the global.cylc to allow
per-workflow customisations to be configured in this way. It was a right
pain in the neck as the global.cylc has to be reloaded to support this, but
we have confirmed the approach works for configuring [symlink dirs].
As it stands, I don't think that this PR is going to solve your problems
as it only targets job submission. Cylc launches many other subprocesses
besides the job submission command which would not be fixed by this patch,
you've also reported issues with subcommands launched by Cylc UI Server
which this will not address.
We need a better grapple on the problem before we can suggest a solution.
—
Reply to this email directly, view it on GitHub
<#6302 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAADI6KVWPZYJZIXCXGHG5LZTWJ33AVCNFSM6AAAAABMRNREISVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMJUHAYDMOBSGU>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Sorry to hear that. No pressure from us, happy to park this for now. |
Our original goal was to install workflows in a similar fashion to containers without actually using containers. Ideally this means that every workflow has everything it needs in its own space (except the actually global configuration file and some other things that we couldn't localise due to the not-actually containers aspect of it). We have had to compromise on this to get everything to work, but there are legacies left in how we're deploying things because of that original goal. Originally we were installing a To enable the UI Servers (running on a separate host than the schedulers) to find Cylc on the scheduler hosts we installed the cylc wrapper in Workflows that are started with In
As per the above, we do put the relevant
We make good use of the |
While I try to unpack that response, one clarification on my part:
I should have said in the default I'll try to reproduce a failure to use the |
Wow, ok. This is a new deployment pattern for us and not one we presently support or test so I would expect to hit a few stumbling blocks when trailblazing it. We (and others) have deployed containerised Cylc orchestrations onto cloud platforms. This pushes the environment management problem (solved by the wrapper script) up a level to an orchestration problem. However, this half-way solution is a bit different. I don't think that trying to adopt this pattern is going to help ease the transition to container orchestrations laster because this part of the solution (the Cylc environment management bit) will need to be re-implemented when you move to containers anyway. In a Cylc container orchestration:
As long as you're deploying a conventional-ish distributed system, I would personally suggest sticking with the supported centralised deployment approach and only changing this when you move to a container orchestration (as time / resources / systems allow). This is going to be a lot simpler to set up and, because you're following the supported pattern, your systems won't be exposed to the unnecessary risk of an untested deployment pattern. Assuming you want to push ahead with this approach, here are the immediate problems: Problem 1) The
|
# Set PATH when running cylc hub so that configurable-http-proxy can find node | |
if [[ ${0##*/} == "cylc" && ${1:-} == "hub" ]]; then | |
PATH=${CYLC_HOME}/bin:${PATH} | |
fi |
This adds the bin/
directory containing the real Cylc executable (not the wrapper) into the $PATH
. This is a lazy workaround to ensure configurable-http-proxy
is in the $PATH
, but it bypasses the wrapper.
I think the "correct" way to resolve this issue (under the current wrapper script approach) would be to create a "configurable-http-proxy" wrapper (suggested here). With this wrapper script, the hack will no longer be required removing the nasty $PATH
manipulation (which actually causes another issue).
I.E, you would need wrappers for:
cylc
isodatetime
rose
(if usingrose
, note I'm not sure howrose
will work with the per-workflow wrapper approach, it also runs subprocesses).configurable-http-proxy
(or whatever proxy you have configured Jupyter Hub to work with).
Problem 2) cylc path
We make good use of the cylc path platform setting for both job host and localhost submissions. Our problem seems to be that somehow or another we aren't seeing Cylc honour it in some situations.
It's not so much that Cylc isn't "respecting" this configuration, the "cylc path" is being used for something a bit different to the problem it was designed to solve. I think the behaviour you are seeing is actually by-design but would require modification to support your use case.
Cylc does not support configuring the cylc path
for the localhost
platform. It sounds like you are defining an alternate local
platform (with cylc path
configured) to get around this. This may work in some situations, however, it will not work (reliably) in others.
For example, some functionalities know that they need to perform an action once on each "install target". So they will go through the list of platforms and pick the first one that matches the "install target" they require. This means that Cylc may well use the "localhost" platform settings rather than the "local" platform you've defined.
The "clean", "remote-tidy" (and I think "cat-log") functionalities are examples of this.
In other words, Cylc requires that the "localhost" platform works whether you are submitting jobs to it or not.
In order to permit your intended usage, we would need to broaden the capabilities of cylc path
to cover all cylc
commands that are submitted to localhost
(not just job submission as targetted by this PR). I think this should be relatively easy as we have collected most commands into a single interface at Cylc 8, but we'll need to check all Popen
calls are covered by this as intended (event handlers, auto-restart, job-submission, psutil, cat-log, there are a bunch).
(marking as draft pending resolution of the above) |
Use the full path to the Cylc script when submitting jobs
Fixes #6301
Check List
CONTRIBUTING.md
and added my name as a Code Contributor.setup.cfg
(andconda-environment.yml
if present).?.?.x
branch.