Adaptive scaling and dask-jobqueue goes into endless loop when a job launches several worker processes (was: Different configs result in worker death) #498
This seems to have an answer on SO, so I'm going to close this out here.
Hi @jacobtomlinson - sorry for not being more transparent; I recreated this question here even with that answer in place. I don't think the current answer addresses the problem, which led me to repost here. The suggestion on SO is that it is a Slurm issue. However, through my testing I've found that the jobs are not cancelled due to Slurm, and that the resource requests remain the same. Rather, it's the balance between threads and processes that changes the outcome (or using `cluster.scale` rather than `cluster.adapt`).
Hi @AlecThomson, sorry for the delay. This seems to be a complicated issue; the problem probably comes from a mix of how Slurm is set up on your system, adaptive scaling, dask-jobqueue, and your workflow. First, a simple question: do you really need adaptive scaling? It is a great feature, but it introduces a more complex way of managing resources: your workers (and jobs) are probably stopped and restarted several times, which Slurm may not like. This might be the cause of the error you reported on SO:
Changing the balance of processes / threads might very well change the lifetime of a worker, for several reasons:
To investigate further, we would need more information, like the stderr/stdout of one worker job and the job script generated by dask-jobqueue.
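One concrete mechanism behind this, added here as an editorial illustration rather than something from the comment above: dask-jobqueue splits a job's memory request evenly across its worker processes, so increasing `processes` shrinks each worker's memory limit and makes memory-based restarts or kills more likely.

```python
# Illustrative only: how the per-worker memory limit shrinks as the number of
# worker processes per job grows, for a fixed 60 GB job memory request
# (the value used in the reproduction script later in this thread).
memory_per_job_gb = 60

for processes in (1, 2, 10, 20):
    per_worker_gb = memory_per_job_gb / processes
    print(f"processes={processes:2d} -> ~{per_worker_gb:.1f} GB memory limit per worker")
```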
Hi @guillaumeeb, Thanks very much for following up!
I would very much like to make use of these features, if possible. First, I'm on a highly-subscribed system, so it's possible jobs could be sitting in the queue for some time; my understanding is that adaptive scaling should help in that situation.

On memory management, this would not seem to be the case. I've tested (using the example below) purposely exceeding the memory limit, which results in a clear output:
followed by
As requested, I've put together an example that reproduces my issue:

```python
import time

from dask import delayed
from dask.distributed import Client, progress, LocalCluster
from dask_jobqueue import SLURMCluster
import numpy as np


def inner_job(i):
    return i + 1


@delayed
def job(x):
    time.sleep(1)
    y = inner_job(x)
    # Enable for OOM error
    # large_arr = np.zeros([100000, 10000])  # 8GB
    # y = y + large_arr
    return y


def main(client):
    njobs = int(1000)
    outputs = []
    for i in range(njobs):
        output = job(i)
        outputs.append(output)
    results = client.persist(outputs)
    print("Running test...")
    progress(results)


def cli():
    cluster = SLURMCluster(
        # Set up for Galaxy
        cores=20,
        # processes=1,
        processes=20,
        name='spice-worker',
        memory="60GB",
        project='askap',
        queue='workq',
        walltime='12:00:00',
        job_extra=['-M galaxy'],
        # interface for the workers
        interface="ipogif0",
        log_directory='logs',
        python='srun -n 1 -c 20 python',
        extra=[
            "--lifetime", "11h",
            "--lifetime-stagger", "5m",
        ],
        death_timeout=300,
        local_directory='/dev/shm',
    )
    print('Submitted scripts will look like: \n', cluster.job_script())
    # cluster.adapt(maximum_jobs=2)
    cluster.scale(jobs=2)
    client = Client(cluster)
    main(client)


if __name__ == "__main__":
    cli()
```

I've tested this using several configurations:
- Using …
- Using …
- Using anything other than …
The accompanying output from the control script was:
At which point I stopped the job. The output looks exactly the same when using e.g. … Let me know if you'd like any more info, or if there are any other diagnostic tests I can run.

EDIT: Accidentally left some extra arguments in the example.
I'm not sure what you mean by timeout in the queue.
This, I think, is the principal interest of adaptive. I also often advise using it only in interactive mode (not for batch jobs).
OK, so in the end there is clearly a bug when using adaptive scaling and dask-jobqueue when a job launches several worker processes. I've also seen this just yesterday. Adaptive scales with the number of processes (workers), not the number of jobs, and I think this leads to calls of … Unfortunately, I have no time to dig into this currently...
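To make that mismatch concrete, here is a small editorial sketch (not code from this thread), assuming 20 worker processes per Slurm job as in the reproduction script above:

```python
import math

# Assumption for illustration: 20 worker processes per Slurm job, matching the
# reproduction script posted earlier in this thread.
PROCESSES_PER_JOB = 20

def jobs_for(target_workers: int) -> int:
    # dask-jobqueue can only grow the cluster in whole jobs.
    return math.ceil(target_workers / PROCESSES_PER_JOB)

# Adaptive asks for a single worker at start-up...
target = 1
running_workers = jobs_for(target) * PROCESSES_PER_JOB   # a whole job: 20 workers
excess = running_workers - target                        # 19 "surplus" workers

# ...then tries to retire the surplus workers. Cancelling them cancels the job
# they all share, dropping back to 0 workers, so adaptive requests a worker
# again and the start/stop cycle repeats.
print(f"job starts {running_workers} workers; adaptive sees {excess} too many")
```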
We're currently experiencing the exact same issue with our Grid Engine cluster. Losing a job during the use of adaptive scaling with multiple processes (workers) per job results in permanent loss of workers; it seems that no additional job is submitted to replace the lost workers.
Hi @jeiche, I'm not sure it is the same issue, since here we're talking about a race condition and an endless loop when using adaptive mode. We do see new jobs being launched, but they are almost immediately deleted; is that what you see too? Anyway, it's a complicated issue to debug; we should look at SpecCluster and the Adaptive code in distributed to fix this. I'm not sure I'll have the time to try to understand the problem soon.
@guillaumeeb I believe I found a solution to the problem (code). When adapt kills a worker, it calls scancel on the worker's job, inevitably killing the other worker processes under the same job. To circumvent this, … Hope that helps.
That sounds really interesting!
Okay, so I can clearly reproduce the problem. Using processes > 1 and adapt leads to an endless loop of starting and stopping workers and jobs. When activating debug mode, I see a lot of these messages:

```
...
DEBUG:Starting worker: spice-worker-1
...
DEBUG:Starting job: 31257350
DEBUG:Stopping worker: spice-worker-1 job: 31257350
...
DEBUG:Closed job 31257350
...
DEBUG:Starting worker: spice-worker-1
...
DEBUG:Starting job: 31257351
DEBUG:Stopping worker: spice-worker-1 job: 31257351
...
DEBUG:Closed job 31257351
```

@jasonkena's suggestion modifies this behavior, but both kwargs must be passed to adapt:
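As a rough sketch of such a call, with assumed keyword arguments rather than the exact ones from this thread: distributed's `Adaptive` accepts a `worker_key` function for grouping workers when scaling down, which can be combined with the job-level bounds that dask-jobqueue adds.

```python
# Hedged sketch: group workers that share a job so adaptive scaling retires
# whole jobs instead of single worker processes. `cluster` is assumed to be the
# SLURMCluster from the reproduction script; the grouping function is a guess,
# so check your actual worker names (e.g. "spice-worker-1-0") before using it.
def job_of(worker_state):
    # Drop the trailing per-process index to recover the job-level name.
    return worker_state.name.rsplit("-", 1)[0]

cluster.adapt(
    minimum_jobs=1,
    maximum_jobs=6,
    worker_key=job_of,  # forwarded to distributed's Adaptive
)
```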
Ongoing investigation here: it seems the problem is at the initialization of adaptive mode, e.g. when starting the first worker process. The problem occurs when we launch adaptive without any minimum number of workers. Using `cluster.adapt(minimum_jobs=1, maximum_jobs=6)` is also a workaround, but you'll always have at least one running job (which is not that bad).
So, if I'm not mistaken, I've tracked down the problem to the adaptive code in distributed. It's a conjunction of two things:
So this needs to be fixed upstream.
We just encountered this issue when using adaptive scaling with Dask through Prefect (on a PBS cluster rather than Slurm). We built on the solution from @jasonkena and the latest comment from @guillaumeeb to enforce that the adaptive minimum is always equal to the number of processes specified. This seems like a fairly simple workaround for the core issue, and it can be used regardless of cluster type until potential upstream fixes such as dask/distributed#7019 are implemented.
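A minimal sketch of that workaround, with illustrative names and values rather than the referenced Prefect/PBS configuration:

```python
from dask_jobqueue import PBSCluster

# Illustrative values; match these to your own job configuration.
processes_per_job = 8
cluster = PBSCluster(cores=8, processes=processes_per_job, memory="32GB")

# Keep the adaptive minimum at one full job's worth of workers so adapt never
# targets fewer workers than a single job provides -- the situation that
# triggers the start/stop loop described earlier in this thread.
cluster.adapt(minimum=processes_per_job, maximum=processes_per_job * 10)
```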
I cannot thank you guys enough for finally finding an answer to the bug I am experiencing. The solution was really difficult to track down, especially because we don't get any information about why the job is being cancelled, not even in DEBUG mode. While the problem is not solved upstream, I suggest at least adding a log message when a job is cancelled or started to comply with adapt.
What happened:
(Reposting from SO)
I'm using Dask Jobqueue on a Slurm supercomputer (I'll note that this is also a Cray machine). My workload includes a mix of threaded (e.g. numpy) and pure-Python workloads, so I think a balance of threads and processes would be best for my deployment (which is the default behaviour). However, in order for my jobs to run I need to use this basic configuration:
which is entirely threaded. The tasks also seem to take longer than I would naively expect (a large part of this is a lot of file reading/writing). Switching to purely processes, i.e.
results in Slurm jobs which are immediately killed as they are launched, with the only output being something like:
Choosing a balanced configuration (i.e. default)
results in a strange intermediate behaviour. The tasks will run nearly to completion (e.g. 900/1000 work tasks), then a number of the workers will be killed, and progress will drop back down to, say, 400/1000 tasks.
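For reference, a sketch of the three configurations being contrasted; the values mirror the reproduction script in this thread and are illustrative, not the exact configuration blocks from the original post:

```python
from dask_jobqueue import SLURMCluster

common = dict(cores=20, memory="60GB", queue="workq", walltime="12:00:00")

# Entirely threaded: runs, but tasks take longer than expected.
cluster = SLURMCluster(processes=1, **common)

# Purely processes: Slurm kills the jobs as soon as they launch.
# cluster = SLURMCluster(processes=20, **common)

# Default thread/process balance: runs near completion, then workers die.
# cluster = SLURMCluster(**common)
```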
Further, I've found that using `cluster.scale`, rather than `cluster.adapt`, results in a successful run of the work. Perhaps the issue here is how adapt is trying to scale the number of jobs?

What you expected to happen:
I would expect that changing the balance of processes / threads shouldn't change the lifetime of a worker.
Anything else we need to know?:
Possibly related to #20 and #363
As an aside, the current configuration of processes / threads is confusing, and seems to conflict with how e.g. a `LocalCluster` is specified. Is there any progress on #231?

Environment: