Revert changes to HyP3 job retry strategy #1944

jtherrmann · 2023-11-22T19:17:19Z

Jira: https://asfdaac.atlassian.net/browse/TOOL-2366

Note: The above link is accessible only to members of ASF.

The new retry strategy was implemented in #1871.

Under the new retry strategy:

- Each attempt is a separate Batch object, so inspecting all three failures is more difficult.
- When a job is retried, it goes to the back of the queue. This is particularly painful during giant processing campaigns in custom deployments.
- If a HyP3 deployment occurs between retry attempts, then all subsequent retries fail because the Step Function doesn't have permission to submit an outdated Batch job definition.
  - If the AMI has changed with a deployment, then all jobs fail their current attempt, which means that under the new retry strategy, all jobs fail permanently when the AMI changes.
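
To make the first point concrete, here is a rough sketch of the two places a retry can live in AWS. It is not taken from the HyP3 repository: the job, queue, and definition names are hypothetical, and it assumes the pre-#1871 behavior relied on Batch's own `retryStrategy` while the new strategy retries from a Step Functions Task state. The helper `batch_jobs_created` is likewise made up for illustration and assumes a single retrier.

```python
# Hypothetical sketch: neither dictionary is copied from the HyP3 repository.

# Option A: retries handled by AWS Batch itself.
# A `retryStrategy` on the submitted job keeps every attempt under one
# Batch job ID, so all attempts show up in that job's `attempts` list.
batch_level_retry = {
    "jobName": "example-job",               # hypothetical name
    "jobQueue": "example-queue",            # hypothetical queue
    "jobDefinition": "example-definition",  # hypothetical definition
    "retryStrategy": {"attempts": 3},
}

# Option B: retries handled by a Step Functions Task state.
# Each retry re-runs the state, which submits a brand-new Batch job.
step_function_level_retry = {
    "Type": "Task",
    "Resource": "arn:aws:states:::batch:submitJob.sync",
    "Parameters": {
        "JobName": "example-job",
        "JobQueue": "example-queue",
        "JobDefinition": "example-definition",
    },
    # One initial try plus two retries gives three attempts in total.
    "Retry": [{"ErrorEquals": ["States.ALL"], "MaxAttempts": 2}],
    "End": True,
}


def batch_jobs_created(config: dict) -> int:
    """Count the separate Batch jobs left behind when every attempt fails."""
    if "retryStrategy" in config:
        # Batch retries in place: one job, several attempts inside it.
        return 1
    # Step Functions resubmits: one new Batch job per try (single retrier assumed).
    return 1 + config["Retry"][0]["MaxAttempts"]


print(batch_jobs_created(batch_level_retry))
print(batch_jobs_created(step_function_level_retry))
```

Under these assumptions, a job that fails every attempt leaves one Batch job to inspect in option A and three in option B. Option B also resubmits against whatever job definition the state references at retry time, which is where a deployment between attempts can interfere.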

jtherrmann · 2023-11-22T20:03:20Z

Edit: Works as expected.

jtherrmann · 2023-11-27T20:26:15Z

Validating changes in prod:

Edit: Works as expected.

jtherrmann added the Jira Bug label Nov 22, 2023

jtherrmann mentioned this issue Nov 22, 2023

Revert changes to retry strategy #1945

Merged

jtherrmann mentioned this issue Nov 22, 2023

Release v4.4.1 #1946

Merged

2 tasks

jtherrmann closed this as completed Nov 27, 2023
