
feat(controller): retry strategy support on daemon containers, fixes #13705 #13738

Open
wants to merge 57 commits into main from feat/daemon-retry-strategy

Conversation

@MenD32 commented Oct 10, 2024

Addresses #13705 and #2963.

Motivation

Add retryStrategy support to daemon container templates.

Example: our use case

We have a workflow with a model running as a daemon container and subsequent steps that query the model in order to test its performance. When the workflow runs for a long time, the daemon can be preempted by the cloud provider due to GPU node preemption, causing the whole workflow to fail.

With this feature, the workflow could survive a cloud preemption by retrying the daemon container, given the correct cluster autoscaler configuration.

When considering how to solve this problem with existing Argo Workflows tooling, I came to the following conclusions:

retryStrategy on daemon containers currently doesn't work as expected:
when a daemon container exits, the whole workflow fails, even if the daemon template has a retryStrategy.

Using the k8s resource template is problematic for these reasons:

  • artifacts cannot be passed to such deployments (and we currently rely on artifact passing)
  • our team consists mostly of MLEs who are not very proficient in k8s; I do not want to split the model logs and status away from the workflow page, so as to reduce their confusion and technical overhead

I see this feature as vital for anyone using daemon containers for testing or benchmarking, since problems like these are quite common in those circumstances.

Modifications

Execution functions now treat "succeeded" daemoned nodes as failed.

When a daemoned container completes execution, it is considered failed; if its template has a retryStrategy, it will be retried.

The node's IP change will cascade down to subsequent steps that reference it.
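
For illustration, here is a minimal sketch of the intended usage, assuming the behavior described above: a steps workflow whose daemon template carries a retryStrategy, with a later step consuming the daemon's IP via the existing {{steps.<name>.ip}} variable. The template names, images, and parameters are hypothetical placeholders; this is not the content of the example files listed under Verification.

```yaml
# Illustrative sketch only; names and images are hypothetical placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: daemon-retry-sketch-
spec:
  entrypoint: main
  templates:
    - name: main
      steps:
        - - name: model-server
            template: server          # starts the daemon
        - - name: query
            template: client
            arguments:
              parameters:
                # resolves to the daemon's current pod IP, so a retried
                # daemon's new IP cascades to this step
                - name: server-ip
                  value: "{{steps.model-server.ip}}"
    - name: server
      daemon: true
      retryStrategy:
        limit: 3                      # retry the daemon if its pod fails (e.g. preemption)
      container:
        image: nginx:1.25             # stand-in for the model server
        readinessProbe:
          httpGet:
            path: /
            port: 80
    - name: client
      inputs:
        parameters:
          - name: server-ip
      container:
        image: curlimages/curl:8.5.0
        command: [curl, "--fail", "http://{{inputs.parameters.server-ip}}/"]
```

With this change, if the daemon pod is deleted or preempted, the daemon node is retried and later steps resolve the IP variable to the new pod's address, as described above.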

Verification

e2e tests

For manual verification, I also added examples to test and observe the behavior locally:

  • examples/dag-daemon-retry-strategy.yaml
  • examples/steps-daemon-retry-strategy.yaml

I simulated daemon failures by deleting the daemon pod manually.

@MenD32 force-pushed the feat/daemon-retry-strategy branch from bce39fd to 2e1c501 on October 11, 2024 22:37
@MenD32 changed the title from "Feat/daemon retry strategy" to "Feat: retry strategy support on daemon containers" on Oct 12, 2024
@MenD32 changed the title from "Feat: retry strategy support on daemon containers" to "feat: retry strategy support on daemon containers" on Oct 12, 2024
@MenD32 force-pushed the feat/daemon-retry-strategy branch from fd95b21 to 6f4efc7 on October 13, 2024 14:38
@MenD32 marked this pull request as ready for review on October 14, 2024 15:00
@MenD32 marked this pull request as draft on October 14, 2024 15:25
@MenD32 force-pushed the feat/daemon-retry-strategy branch from 4eb8b59 to fa12577 on October 14, 2024 19:44
@MenD32 (Author) commented Oct 19, 2024

I added an E2E test and I can't get it to work on CI (even though make test-functional works locally). Any clue why?

The controller behaves as if it were built from main instead of from this branch.

Edit: never mind, I got it to work.

@MenD32 force-pushed the feat/daemon-retry-strategy branch from cb60281 to 80ffbb5 on November 2, 2024 14:31
@MenD32 force-pushed the feat/daemon-retry-strategy branch from ca07755 to 2695fdf on December 8, 2024 13:37
@MenD32 force-pushed the feat/daemon-retry-strategy branch from fc780f7 to ec06bdb on December 8, 2024 13:41