LSFCluster may be overly specific? #328
Comments
Not at UM anymore. Pinging a colleague who may use this or know someone who can give feedback @milancurcic |
Hi,
- submit_command = "bsub <"
+ submit_command = "bsub"
I know
I can confirm that this is necessary to allocate all processes on the same node. I have run into situations (with manually scheduled jobs on 32 cores) where not giving this argument resulted in the cores being divided between multiple machines (like 16+16). For smaller numbers of cores it might not be a problem, as LSF probably won't kill your job aggressively, but on principle you should declare what you are actually using.
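The resource request described above can be sketched as a small helper that builds the `#BSUB` directive lines of a job script. This is only an illustration: `bsub_header_lines` and its `span_single_host` parameter are hypothetical names, not dask-jobqueue's API. The `-n` and `-R "span[hosts=1]"` directives are standard LSF options.

```python
def bsub_header_lines(ncpus, span_single_host=True):
    """Return '#BSUB' directive lines for a job that needs `ncpus` slots.

    Hypothetical helper for illustration; not taken from dask-jobqueue.
    """
    lines = [f"#BSUB -n {ncpus}"]
    # span[hosts=1] asks LSF to place every requested slot on one machine,
    # so a 32-slot request cannot be split into, say, 16+16 across two hosts.
    if span_single_host and ncpus > 1:
        lines.append('#BSUB -R "span[hosts=1]"')
    return lines


print("\n".join(bsub_header_lines(32)))
```

Making the directive optional, as the `span_single_host` flag does here, is one way a site that rejects the line could opt out while everyone else keeps single-node placement.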
This is weird; from what I remember, reading from stdin was needed in @raybellwaves' case (#78). Edit: I found this: #78 (comment) |
@mrocklin, what is your LSF version? It's pretty certain you've got to use But I can find it in newer IBM docs. For |
|
So this is the last one, maybe they changed the job script behaviour with this one, do you have access to IBM support through Summit staff? The non working |
I'm using |
@louisabraham do you use |
@guillaumeeb |
@louisabraham I am a bit confused:
In an ideal world we would figure out a way to make |
@lesteve Sorry for not being clear, I'm not currently using |
Can you first try to post the error you get if you do: If you can give
import time
from dask.distributed import Client
from dask_jobqueue import LSFCluster
# you may need additional arguments like queue and possibly others in `LSFCluster`
# look at cluster.job_script(), this is the script that is actually used with bsub
# and LSFCluster docstring
cluster = LSFCluster(cores=1, memory='1GB')
cluster.scale(1)
client = Client(cluster)
while len(client.scheduler_info()['workers']) < 1:
    print('waiting for workers')
    time.sleep(5)
fut = client.submit(lambda x: x, 1)
result = fut.result()
print('Got result:', result)
What I am expecting is that you'll get an error at the If you have any feed-back about why you stopped using |
On the cluster we use |
I'm also having trouble with this change:
- submit_command = "bsub <"
+ submit_command = "bsub"
LSF outputs the following error:
So, If Thankfully, I notice that there is a new configuration option available, introduced in #347:
jobqueue:
  lsf:
    use-stdin: true
Using that setting fixes the issue for me. BTW, I notice that FWIW, here's my LSF version:
|
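For readers comparing the two submission styles discussed in this thread, here is a minimal sketch of how a launcher might build either invocation. It is not dask-jobqueue's implementation: `build_submit_call` is a hypothetical helper, and its `use_stdin` argument only mirrors the name of the `use-stdin` option above. As I understand it, redirecting the script to stdin lets `bsub` read the embedded `#BSUB` directives, while passing the path as an argument makes `bsub` treat the script as the command to run, which at least some LSF installations handle differently.

```python
import shlex


def build_submit_call(script_path, use_stdin, submit_command="bsub"):
    """Return (args, use_shell) for subprocess.run(args, shell=use_shell).

    Hypothetical helper for illustration; not taken from dask-jobqueue.
    """
    if use_stdin:
        # "bsub < script": the redirection is shell syntax, so the whole
        # command is passed as one string and must run with shell=True.
        return f"{submit_command} < {shlex.quote(script_path)}", True
    # "bsub script": plain argument list, no shell needed.
    return shlex.split(submit_command) + [script_path], False


print(build_submit_call("/tmp/job.sh", use_stdin=True))
print(build_submit_call("/tmp/job.sh", use_stdin=False))
```

Keeping the choice behind a single boolean is what lets one site set the option to true while another leaves it off, without either needing to patch the submit command string.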
@stuarteberg this is very useful feed-back, thanks a lot!
Do you know if for LSF 10, there is an alternative way to spool the script that is different from
The main reason was @mrocklin's feed-back that Some additional general comments:
|
OK, I tried that. It turns out that option has a different meaning. For example, the following two lines are equivalent:
...except that in the latter case, The two options are orthogonal: I think all of the following are equivalent, assuming
Good point, I was looking at the LSF 9 docs. In the LSF 10 docs, it's harder to find. But the |
Thanks for your great feed-back. I guess I was too optimistic in thinking that "spool" was meaning the same thing in one of your earlier message and in the LSF docs. Full disclosure I had no idea what spool means (and I don't have the bandwidth to investigate). At this point, I am starting to think that Oak Ridge Summit may be the odd one out since Here are a few action points:
from dask_jobqueue import LSFCluster
cluster = LSFCluster(
    cores=128,
    memory="600 GB",
    project="GEN119",
    walltime="00:30",
    use_stdin=True,
)
cluster.scale(jobs=3)  # ask for three nodes
Side-comment: for dask-jobqueue/dask_jobqueue/oar.py Lines 71 to 92 in c1e0a21
|
I was trying out dask-jobqueue on the Summit supercomputer at Oak Ridge National Labs. I ran into a number of problems with our current configuration that seem to be special cases. I propose un-special-casing these, but I would like to get feedback from others who use LSF.
bsub command
We currently do some odd things with bsub. This made things fail for me on a login node (although they did work on a compute node). Removing this special-cased behavior made things work well for me in both places.
cc'ing @raybellwaves and @guillaumeeb, who show up under git blame for this code
-R "span[hosts=1]"
My particular deployment didn't like these lines. I don't know if these are very useful generally though, and I can work around them.
cc @louisabraham (also pointed to by git blame)