Clarity in the documentation about `tar_resources_clustermq` #963

multimeric · 2022-10-25T23:47:34Z

multimeric
Oct 25, 2022

Something that has bitten me as I transitioned from the batchtools backend to the clustermq backend is the way that resources are specified. In batchtools, if you set some global options via `tar_option_set(resources = tar_resources(...))`, you get per-task resources. For example, if you set a default walltime of 1 hour and then submit 100 jobs, each job gets its own 1-hour walltime, as expected, and they will likely succeed.
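For comparison, a minimal sketch of what the per-task setup described above might look like in `_targets.R`. The `walltime` field name and its unit (seconds here) are assumptions: they depend entirely on the batchtools template file in use, which is not shown in this thread.

```r
# _targets.R (fragment): per-target resources with the future/batchtools backend
library(targets)

tar_option_set(
  resources = tar_resources(
    # Assumption: the batchtools template reads a `walltime` field, in seconds.
    # Each target is submitted as its own job, so each job gets this limit.
    future = tar_resources_future(resources = list(walltime = 3600))
  )
)
```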

The equivalent for clustermq is

tar_option_set(
  resources = tar_resources(
    clustermq = tar_resources_clustermq(template = list(walltime = 60))
  )
)

Here, we are actually setting the walltime of the persistent workers, which means that, even if we have 100 targets to build, all of them must finish within 60 minutes or else the pipeline will get stuck in limbo where it thinks it's running but has no workers. This specific limbo issue is solved by mschubert/clustermq#150; however, I think it would be helpful to explain this resources behaviour in the docs, if indeed I am understanding it correctly.

Another interesting point is the impact of the `workers` argument to `tar_make_clustermq()`. As you increase this, you are more likely to have your pipeline succeed, since you have queued up workers that will take over even if a previous worker times out. This is slightly different to batchtools where workers will affect the concurrency of processing, but won't affect whether the pipeline succeeds or not.

Also, something that still isn't clear to me with clustermq is how the individual target resources affect the pipeline, if the worker resources must be determined upfront. For example, if I have the above configuration, but then define a target like this:

tar_target(
  name,
  command(),
  resources = tar_resources(
    clustermq = tar_resources_clustermq(template = list(walltime = 120))
  )
)

What happens in this case?

wlandau · 2022-10-26T14:40:07Z

wlandau
Oct 26, 2022
Maintainer

> Here, we are actually setting the walltime of the persistent workers, which means that, even if we have 100 targets to build, all of them must finish within 60 minutes or else the pipeline will get stuck in limbo where it thinks it's running but has no workers.

Right, this is part of what it means for workers to be persistent. A persistent worker is an R process that launches early in the pipeline and stays running until the whole pipeline starts to wind down. A persistent worker usually runs multiple targets during its lifecycle, and it is not possible to precisely predict in advance which targets will be assigned to which workers. https://books.ropensci.org/targets/hpc.html discusses persistent vs transient workers, and I just made some edits in ropensci-books/targets@90401f0 and ec088ff to emphasize the concepts.

> As you increase this, you are more likely to have your pipeline succeed, since you have queued up workers that will take over even if a previous worker times out.

All the persistent workers launch at the same time in a single array job. Some may be queued for longer than others, but if the job queue is accommodating enough, then all the workers will start at the same time and thus time out at the same time. But more workers = more parallelization, so the pipeline may be more likely to finish before any timeouts occur.
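To make the "more workers = more parallelization" point concrete, here is a rough back-of-the-envelope sketch in base R. It is an idealized estimate, not something targets or clustermq computes: `estimate_makespan()` is a hypothetical helper, and it assumes independent targets of equal length (5 minutes is an invented figure), workers that all start together, and no queueing or scheduling overhead.

```r
# Idealized estimate of total pipeline run time with persistent workers.
# Assumptions: equal-length independent targets, all workers start together,
# no queue delay and no per-target overhead.
estimate_makespan <- function(n_targets, minutes_per_target, workers) {
  # Each worker runs its share of the targets back to back.
  ceiling(n_targets / workers) * minutes_per_target
}

walltime <- 60  # worker walltime in minutes, as in the example in this thread

few_workers  <- estimate_makespan(100, 5, workers = 4)   # 125 minutes
many_workers <- estimate_makespan(100, 5, workers = 10)  # 50 minutes

few_workers <= walltime   # FALSE: workers would expire before the pipeline ends
many_workers <= walltime  # TRUE: the pipeline fits inside one worker lifetime
```

Under these assumptions, adding workers helps only because the whole pipeline finishes sooner, not because fresh workers replace expired ones.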

> This is slightly different to batchtools where workers will affect the concurrency of processing, but won't affect whether the pipeline succeeds or not.

The difference is that tar_make_future() launches one short-lived worker for each target, whereas tar_make_clustermq() launches a bunch of long-running workers up front that can each run more than one target before shutting down.

> What happens in this case?

clustermq resources assigned in `tar_target()` are ignored. clustermq workers are not target-specific, so the correct way to assign clustermq resources is through `tar_option_set()`. This is now mentioned in ae3dd34 and ropensci-books/targets@90401f0.
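A minimal sketch of a `_targets.R` that follows this advice might look like the following. The target names, commands, and the value 240 are hypothetical, the `walltime` field and its unit depend on the clustermq template in use, and the scheduler configuration (for example the `clustermq.scheduler` and `clustermq.template` options) is omitted.

```r
# _targets.R (sketch): clustermq resources are set once, at the worker level
library(targets)

tar_option_set(
  resources = tar_resources(
    clustermq = tar_resources_clustermq(
      # Assumption: the scheduler template reads a `walltime` field.
      # The value must cover the lifetime of a worker, that is, the whole
      # pipeline run, not just the longest single target.
      template = list(walltime = 240)
    )
  )
)

list(
  # No per-target clustermq resources: any worker may run either target.
  tar_target(quick_summary, summary(mtcars)),
  tar_target(slow_model, lm(mpg ~ wt, data = mtcars))
)
```

The pipeline would then be launched with something like `targets::tar_make_clustermq(workers = 4)`, where the worker count is again only an example.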

3 replies

multimeric Oct 26, 2022
Author

Thanks for all the improvements in the docs, they're all useful additions. However, I still feel that my main point isn't obvious for new users: the system resources you specify in `tar_option_set(resources = )` are worker resources, which therefore need to be high enough to run any of your jobs, and the walltime is the lifetime of the worker, not just the length of each job.

I'm happy to contribute if you'd prefer.

wlandau Oct 27, 2022
Maintainer

Always happy to review PRs.

multimeric Oct 27, 2022
Author

The changes you made make this very clear I think, thanks!
