Revert changes to HyP3 job retry strategy #1944

Closed
jtherrmann opened this issue Nov 22, 2023 · 2 comments
Labels: Jira Bug (Create a Jira Bug for this issue)


jtherrmann commented Nov 22, 2023

Jira: https://asfdaac.atlassian.net/browse/TOOL-2366

Note: The above link is accessible only to members of ASF.


The new retry strategy was implemented in #1871.

Under the new retry strategy:

  • Each attempt is a separate Batch object, so inspecting all three failures is more difficult.
  • When a job is retried, it goes to the back of the queue. This is particularly painful during giant processing campaigns in custom deployments.
  • If a HyP3 deployment occurs between retry attempts, then all subsequent retries fail because the Step Function doesn’t have permission to submit the outdated Batch job definition.
    • If the AMI has changed with a deployment, then all jobs fail their current attempt, which means that under the new retry strategy, all jobs fail permanently when the AMI changes.
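
The queueing problem in the second bullet can be illustrated with a toy model. The sketch below is not HyP3 code: `run_queue`, its parameters, and the status strings are hypothetical names chosen for illustration, and it assumes a single strictly FIFO queue in which a retry either keeps the job's place at the front or is resubmitted at the back.

```python
from collections import deque


def run_queue(jobs, failures, requeue_at_back, max_attempts=3):
    """Return (job, status) pairs in the order jobs finish.

    jobs: job names in submission order.
    failures: maps a job name to how many of its attempts fail.
    requeue_at_back: True models a retry resubmitted at the back of the
        queue; False models a retry that keeps its place at the front.
    """
    queue = deque((name, 1) for name in jobs)
    finished = []
    while queue:
        name, attempt = queue.popleft()
        failed = attempt <= failures.get(name, 0)
        if not failed:
            finished.append((name, "SUCCEEDED"))
        elif attempt >= max_attempts:
            # Out of attempts: the job fails permanently.
            finished.append((name, "FAILED"))
        elif requeue_at_back:
            queue.append((name, attempt + 1))
        else:
            queue.appendleft((name, attempt + 1))
    return finished
```

With jobs `["a", "b", "c"]` where `a` fails once, `requeue_at_back=True` finishes `a` last, behind `b` and `c`, while `requeue_at_back=False` finishes it first. In a real campaign, the jobs ahead of a requeued retry could number in the thousands.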

jtherrmann added the Jira Bug label Nov 22, 2023


jtherrmann commented Nov 22, 2023

jtherrmann mentioned this issue Nov 22, 2023


jtherrmann commented Nov 27, 2023

Validating changes in prod:

These jobs should succeed: https://hyp3-api.asf.alaska.edu/jobs?user_id=jtherrmann&name=jth%20test%20retry%20revert

This job should fail after 3 attempts: https://hyp3-api.asf.alaska.edu/jobs/00147ea7-77f6-4e98-8d1c-4210bdd11b1a

Edit: Works as expected.
