Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add timeout on scheduling wait #3736

Merged
merged 5 commits into from
Sep 10, 2023
Merged

Add timeout on scheduling wait #3736

merged 5 commits into from
Sep 10, 2023

Conversation

pditommaso
Copy link
Member

@pditommaso pditommaso commented Mar 9, 2023

This PR adds a new directive maxAwait that allows controlling how long a task can be in submission (pending execution) state, elapsed that time a failure is reported.

This, along with the use of dynamic directives evaluation and the task.submitAttempt attribute make it possible to attempt the execution of the task to a different queue or set of resources requirement.

Signed-off-by: Paolo Di Tommaso <[email protected]>
@netlify
Copy link

netlify bot commented Aug 25, 2023

Deploy Preview for nextflow-docs-staging ready!

Name Link
🔨 Latest commit 03836d7
🔍 Latest deploy log https://app.netlify.com/sites/nextflow-docs-staging/deploys/64fd93e381419e0008c49c2d
😎 Deploy Preview https://deploy-preview-3736--nextflow-docs-staging.netlify.app/process
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@pditommaso pditommaso marked this pull request as ready for review August 25, 2023 08:07
@pditommaso
Copy link
Member Author

Think this is ready to go. @bentsherman please give a try to it

@bentsherman bentsherman self-requested a review August 29, 2023 20:21
@bentsherman
Copy link
Member

As I understand it, submit timeout errors are limited by maxRetries, but the submit retries are counted separately from other retries. So with maxRetries 3, a task could fail 3 times due to submit timeout and 3 times due to other retry-able errors, in any order, before the workflow would terminate. Is that right?

@pditommaso
Copy link
Member Author

Yes, exactly

@bentsherman
Copy link
Member

It seems like an odd choice. I would expect either two separate limits for submit retries and regular retries (e.g. maxSubmitRetries and maxRetries), or for them to be counted together and be subject to the same limit (e.g. submitRetries + retries <= maxRetries).

Also, maybe we should call it maxSubmitAwait instead of maxAwait to denote that it is for the submit time.

@pditommaso
Copy link
Member Author

I would expect either two separate limits for submit retries and regular retries

I see your point, at the same time the submit retry requires setting errorStrategy=retry this is why it uses the same setting.

It could make sense to use a separate setting, but in that case, it should be independent of the errorStrategy.

Let's move forward as "experimental", collect some user feedback and refine it later.

maybe we should call it maxSubmitAwait instead of maxAwait to denote that it is for the submit time

Agreed

@pditommaso pditommaso merged commit 5686bf1 into master Sep 10, 2023
18 of 19 checks passed
@pditommaso pditommaso deleted the submit-max-await-v2 branch September 10, 2023 12:51
@pditommaso pditommaso added this to the 23.10.0 milestone Sep 10, 2023
abhi18av pushed a commit to abhi18av/nextflow that referenced this pull request Oct 28, 2023
This commit adds a new directive named `maxSubmitAwait` that allows defining 
how long a task can be in a pending status waiting to be scheduled for execution 
by the underlying executor. 

When the timeout is reached, an execution error is returned. The usual `errorStrategy`  
directive can be used to control how handle the error condition and, if required, attempt 
a new job submission. The `task.submitAttempt` attribute can be used to determine 
which submitted attempt failed.  


Signed-off-by: Paolo Di Tommaso <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

option to resubmit long-queued tasks with smaller resource requests
2 participants