LSF: Accept use_stdin in the constructor #360

stuarteberg · 2019-10-24T18:14:28Z

Right now, all LSF options can be specified either in the config OR in the constructor arguments, except the new use-stdin option (introduced in #347). Unlike all the others, that option may only be specified in the config (not the constructor).

I don't see why we'd want use-stdin to be different from all the other LSF options, so this PR allows the user to pass use_stdin to the LSFCluster constructor if she wants to. (As usual, the config is used if no value was passed in the constructor.)

from dask_jobqueue import LSFCluster

cluster = LSFCluster(cores=15, memory='25GB',
                     use_stdin=True)  # <-- now allowed
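
For readers skimming the thread, the "constructor first, config as fallback" rule described above can be sketched in isolation. This is a minimal illustration, not the code from this PR: `CONFIG` and `resolve_use_stdin` are hypothetical names, and a plain dict stands in for the LSF section of the dask config.

```python
# Hypothetical stand-in for the LSF section of the dask config.
CONFIG = {"use-stdin": True}


def resolve_use_stdin(use_stdin=None, config=CONFIG):
    """Return the constructor value if one was passed, else the config value."""
    if use_stdin is None:
        # Nothing was passed to the constructor, so fall back to the config.
        use_stdin = config.get("use-stdin")
    return use_stdin


print(resolve_use_stdin())       # config value is used
print(resolve_use_stdin(False))  # explicit argument takes precedence
```

Using `None` as the "not passed" sentinel keeps an explicit `False` distinguishable from "no value given".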

Side note: I suspect this new setting will be needed by many, if not most, LSF users, so I added some verbose documentation for it.

FWIW, I tested these changes on my LSF cluster, and they work as expected.

stuarteberg · 2019-10-25T13:17:19Z

OK, I got the tests passing, but I'm seeing intermittent failures from Travis (unrelated to this PR).

FWIW, here's the error:

PREFIX=/opt/anaconda

Unpacking payload ...

  0%|          | 0/35 [00:00<?, ?it/s]

No output has been received in the last 10m0s, this potentially
indicates a stalled build or something wrong with the build itself.

Check the details on how to adjust your build configuration on:
https://docs.travis-ci.com/user/common-build-problems/#build-times-out-because-no-output-was-received

The build has been terminated

At first this error appeared in the JOBQUEUE=pbs build, so I triggered a new build. Now pbs passes, but I see the same error in the JOBQUEUE=slurm build.

I'll force-push one more time to see if I can get lucky with a successful build.

guillaumeeb

Thanks very much @stuarteberg, that looks very good!

Just one little thing: could you add a test that shows that lsf_job.use_stdin can be modified according to the constructor argument you added?

lesteve · 2019-10-29T09:57:48Z

Side note: I suspect this new setting will be needed by many, if not most, LSF users, so I added some verbose documentation for it.

@stuarteberg do you think we should default to use_stdin=True? As mentioned in #328 (comment), it feels like we may have made the wrong choice, mostly because it was based on @mrocklin's experience on Summit, which may not be representative of other LSF clusters.

Also for further reference could you answer these two questions:

what is the output of lsid | head -n1
does bsub < job_script (rather than bsub job_script) work for you?

stuarteberg · 2019-10-30T14:32:11Z

@guillaumeeb

could you add a test that shows that lsf_job.use_stdin can be modified according to the constructor argument you added?

OK, done.

stuarteberg · 2019-10-30T14:37:40Z

@lesteve

Disclaimer: I am not an LSF expert, and I only have experience with one LSF cluster, of which I am merely a user, not an administrator.

do you think we should default to use_stdin=True?

Yes, I think we should. If use_stdin=False, then LSFCluster will write the jobscript to /tmp (or whichever directory distributed.utils.tmpfile() chooses), and then launch the command as follows:

bsub /tmp/jobscript.sh

Unless tmpfile() chooses a location on the shared file system (which seems unlikely), that command is going to fail, because the LSF host that actually executes the job has a completely different /tmp directory.

When using bsub < /tmp/jobscript.sh (i.e. use_stdin=True), jobscript.sh is fed to the LSF scheduler and then passed to the LSF host when the job is executed. There is no need for the execution host to have access to the original jobscript file, so it doesn't matter where it was written originally.
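
The difference between the two submission styles can be made concrete with a small sketch. `submit_command` is a hypothetical helper written for illustration, not the function LSFCluster uses; it only builds the shell command line and does not call bsub.

```python
import shlex


def submit_command(script_path, use_stdin):
    """Build the shell command line for submitting a job script to LSF."""
    quoted = shlex.quote(script_path)
    if use_stdin:
        # The shell opens the file and feeds its contents to bsub on stdin,
        # so the execution host never needs to read the original path.
        return "bsub < " + quoted
    # bsub is given the path itself, so the file must be readable
    # on the host that eventually executes the job.
    return "bsub " + quoted


print(submit_command("/tmp/jobscript.sh", use_stdin=True))
print(submit_command("/tmp/jobscript.sh", use_stdin=False))
```

With a node-local path such as /tmp, only the first form is expected to work, for the reason given above.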

Again, I'm not an expert, but if bsub < /tmp/jobscript.sh didn't work when @mrocklin tried it at Summit, then my hunch is that there's something weird about Summit's configuration. It's even stranger that bsub /tmp/jobscript.sh apparently worked.

In any case, I don't think our current default heuristic is correct, because bsub < is still supported in LSF 10 (as documented in the LSF 10 manual).

Also for further reference could you answer these two questions:

what is the output of lsid | head -n1

$ lsid | head -n1
IBM Spectrum LSF Standard 10.1.0.8, May 10 2019
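
As a rough illustration of how a version-based default could be derived from that line, here is a sketch. Both function names are hypothetical, and the rule "use stdin only for versions before 10" is an assumption inferred from this thread, not a copy of the actual lsf_version() logic.

```python
import re


def parse_lsf_major_version(lsid_first_line):
    """Extract the major version number from the first line of `lsid` output."""
    match = re.search(r"(\d+)\.\d+", lsid_first_line)
    if match is None:
        raise ValueError("no version number found in %r" % lsid_first_line)
    return int(match.group(1))


def heuristic_use_stdin(major_version):
    # Assumption for illustration: stdin submission is chosen only for
    # LSF versions before 10.
    return major_version < 10


line = "IBM Spectrum LSF Standard 10.1.0.8, May 10 2019"
major = parse_lsf_major_version(line)
print(major, heuristic_use_stdin(major))
```

Under that assumed rule, a 10.x cluster like this one would default to use_stdin=False, which matches the default being questioned here.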

does bsub < job_script (rather than bsub job_script) work for you?

Yes, both of those work for me, as long as job_script is located in a shared location, such as my home directory. But (as explained above) if job_script is located in /tmp/, then only bsub < /tmp/job_script works.

lesteve · 2019-10-30T15:25:40Z

Disclaimer: I am not an LSF expert, and I only have experience with one LSF cluster, of which I am merely a user, not an administrator.

None of us are LSF experts, let alone LSF administrators ... As someone who has access to an LSF cluster, and judging from your earlier comments, I think you qualify as the dask-jobqueue LSF expert ;-).

Thanks a lot for your feedback, it is extremely useful! It also aligns closely with my understanding of the problem, so I think we should switch to use_stdin=True by default.

stuarteberg · 2019-10-30T15:39:54Z

I think we should switch to use_stdin=True by default.

OK, assuming @guillaumeeb agrees, would you like me to open a separate PR for that, or simply change it now, as part of this PR?

lesteve · 2019-10-30T15:57:17Z

conftest.py

@@ -27,3 +29,18 @@ def pytest_runtest_setup(item):
    if envnames:
        if item.config.getoption("-E") not in envnames:
            pytest.skip("test requires env in %r" % envnames)
+
+
+@pytest.fixture(autouse=True)


Would it be possible to not use autouse here, so that the fixture is explicitly used in test_use_stdin? My preference would be to avoid pytest magic if possible.

Since lsf_version() is called by default (unless use-stdin is specified in the config), this monkey-patch is needed by all tests that instantiate LSFCluster(). That includes every test in test_lsf.py and also half of the tests in test_jobqueue_core.py.

I'm not a pytest expert, but IIUC, we need to use autouse=True or we need to add this fixture to every test that needs it in those two files. Is there some better mechanism I'm missing?

BTW, in the future, if we simply use use-stdin: true by default, then we can forbid use-stdin: null. At that point, there will be no need for lsf_version() anyway. We can delete it, along with this test fixture.

In other words, it's probably not worth debating the technical details of this fixture if we're going to delete it soon, anyway.
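
For context, the effect of the monkey-patch can be sketched without pytest. This is not the fixture from the diff: it uses unittest.mock so that it runs standalone, `FakeLSF` is a hypothetical stand-in for the code that calls lsf_version(), and the "before version 10" rule is an assumption made for illustration.

```python
from unittest import mock


class FakeLSF:
    """Hypothetical stand-in for the code that calls lsf_version()."""

    @staticmethod
    def lsf_version():
        # On a machine without LSF there is no `lsid` to run.
        raise RuntimeError("lsid is not available on this machine")

    @classmethod
    def default_use_stdin(cls):
        # Assumed heuristic: consult the LSF version when nothing is configured.
        return cls.lsf_version() < 10


# Replace the version lookup for the duration of the block, much as an
# autouse fixture would replace it for the duration of every test.
with mock.patch.object(FakeLSF, "lsf_version", return_value=10):
    patched_result = FakeLSF.default_use_stdin()

print(patched_result)
```

Outside the `with` block the original lsf_version() is restored, so any code path that reaches it unpatched would fail again. That is why every test that builds a cluster needs the patch.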

Right, I missed that. I think we can keep it like this for this PR.

When we switch to use_stdin=True, we should remove the lsid logic (and with it this autouse fixture). Basically, we thought there was a change in behaviour linked to LSF 10. My current understanding is that this is not the case, and that what we saw is linked to some quirks on Summit ...

Looks like our messages crossed, oh well ... looks like we agree anyway.

lesteve · 2019-10-30T16:00:23Z

OK, assuming @guillaumeeb agrees, would you like me to open a separate PR for that, or simply change it now, as part of this PR?

If you don't mind, a separate PR would be preferable. This PR is adding the use_stdin parameter, which was an oversight of #307.

Changing to use_stdin=True is a behaviour change that we may want to understand again in six months' time. As such, I feel it deserves its own PR with its own dedicated discussion (rather than plenty of comments scattered across separate issues).

stuarteberg · 2019-10-30T20:14:22Z

If you don't mind, a separate PR would be preferable.

I'll open a PR once this one is merged. (It will touch the same files as this one.)

lesteve · 2019-10-30T20:44:18Z

Thanks, I'll merge this one, since it seems perfectly reasonable to me. @guillaumeeb don't hesitate to comment if you think we missed something.

lesteve · 2019-10-30T20:45:20Z

My current thinking is that there are a few fixes that would be nice to include in a release in the near future (say 1-2 weeks) and use_stdin=True by default is one of them.

stuarteberg force-pushed the lsf-use_stdin-arg branch 4 times, most recently from 7f948a4 to 03d3d67 Compare October 24, 2019 22:11

stuarteberg force-pushed the lsf-use_stdin-arg branch from 03d3d67 to b524b3e Compare October 25, 2019 13:17

guillaumeeb reviewed Oct 28, 2019

lsf: Allow use_stdin to be passed in via the LSFCluster constructor.

ee6fb77

stuarteberg force-pushed the lsf-use_stdin-arg branch from b524b3e to ee6fb77 Compare October 30, 2019 14:06

lesteve reviewed Oct 30, 2019

lesteve merged commit be60856 into dask:master Oct 30, 2019

lesteve mentioned this pull request Dec 1, 2019

Switch to use_stdin=True by default in LSFCluster #372

Closed
