fix: Do not reset the root node by default. Fixes #13196 #13198

Merged
merged 2 commits into argoproj:main on Aug 27, 2024

Conversation

@jswxstw (Member) commented Jun 17, 2024

Fixes #13196

Motivation

A single-node workflow with a failed exit handler cannot be retried correctly.

Modifications

Only reset the root node when it is FailedOrError; remove the default reset logic (see the sketch below).

Verification

Local tests and e2e tests.
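
For illustration, a minimal, self-contained sketch of the intended rule. The names and types below (Node, NodePhase, resetRootForRetry) are simplified stand-ins, not the actual controller code in workflow/util; the sketch only demonstrates "reset the root node only when it failed or errored":

package main

import "fmt"

type NodePhase string

const (
	NodeRunning   NodePhase = "Running"
	NodeSucceeded NodePhase = "Succeeded"
	NodeFailed    NodePhase = "Failed"
	NodeError     NodePhase = "Error"
)

// Node is a simplified stand-in for a workflow node status.
type Node struct {
	Name  string
	Type  string // "Pod", "Steps", "DAG", ...
	Phase NodePhase
}

func (n Node) failedOrError() bool {
	return n.Phase == NodeFailed || n.Phase == NodeError
}

// resetRootForRetry applies the rule from the Modifications section: the root
// node keeps its phase unless it actually failed or errored; previously it was
// reset to Running unconditionally.
func resetRootForRetry(root Node) Node {
	if root.failedOrError() {
		root.Phase = NodeRunning
	}
	return root
}

func main() {
	// A Succeeded Pod-type root (a single-node workflow whose exit handler
	// failed) stays Succeeded, so its finished pod is not re-run.
	root := Node{Name: "entrypoint-fail", Type: "Pod", Phase: NodeSucceeded}
	fmt.Println(resetRootForRetry(root).Phase) // Succeeded
}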

@terrytangyuan (Member) left a comment

A single-node workflow with a failed exit handler cannot be retried correctly.

Can you add a test case?

@jswxstw (Member, Author) commented Jun 18, 2024

Sure.

@jswxstw (Member, Author) commented Jun 18, 2024

/retest

@agilgur5 added the area/exit-handler and area/retry-manual labels on Jul 2, 2024
@juliev0 (Contributor) commented Aug 23, 2024

Thanks for fixing. Would you mind taking me through what happened to help me understand? I don't know how the exit handler is related.

First, the root node got reset to "Running" as you mentioned. So, how did the node being in "Running" cause the retry to fail? (any links into the code if possible will help)

Thanks!

@jswxstw (Member, Author) commented Aug 24, 2024

First, the root node got reset to "Running" as you mentioned. So, how did the node being in "Running" cause the retry to fail? (any links into the code if possible will help)

The Pod-type root node got reset from Succeeded to Running, which is not as expected:

  • Firstly, it should not re-run, because it is already Succeeded.
  • Secondly, its pod has completed, so this node cannot run correctly: a new pod will not be created because one already exists, and the node then turns to Error with the message "pod deleted".

@juliev0 (Contributor) commented Aug 24, 2024

Got it, thanks. I see the difference is that the Group node would be a DAG, Steps, etc., which doesn't run.

And regarding the Description, does it matter that there was an exit handler, or could it just as easily have been a single-node workflow that failed, and would've had the same issue?

@jswxstw (Member, Author) commented Aug 24, 2024

And regarding the Description, does it matter that there was an exit handler, or could it just as easily have been a single-node workflow that failed, and would've had the same issue?

If there is only one single node without an exit handler, the workflow will be Succeeded, which is not retryable, so it will not cause any problems.

@juliev0 (Contributor) commented Aug 24, 2024

I'm confused. Is this not a single node Workflow?:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: entrypoint-fail
spec:
  entrypoint: failing-entrypoint
  templates:
  - name: failing-entrypoint
    container:
      image: alpine:3.18
      command: [sh, -c]
      args: ["exit 1"]

@jswxstw (Member, Author) commented Aug 24, 2024

I'm confused. Is this not a single node Workflow?:

Yes, this is a single-node workflow. But the root node is of Pod type, which does not need to be reset, since it will be deleted after a manual retry.

# argo retry entrypoint-fail
INFO[2024-08-25T00:25:22.288Z] Deleting pod                                  podDeleted=entrypoint-fail
INFO[2024-08-25T00:25:22.295Z] Workflow to be dehydrated                     Workflow Size=902
Name:                entrypoint-fail
Namespace:           argo
ServiceAccount:      unset (will run with the default ServiceAccount)
Status:              Running
Conditions:          
 PodRunning          False
 Completed           False
Created:             Sun Aug 25 00:24:35 +0800 (47 seconds ago)
Started:             Sun Aug 25 00:25:22 +0800 (now)
Duration:            0 seconds
Progress:            0/1
ResourcesDuration:   0s*(1 cpu),2s*(100Mi memory)

@juliev0 (Contributor) commented Aug 25, 2024

So, just trying to understand - this statement isn't necessarily true in all cases, right?:

If there is only one single node without an exit handler, the workflow will be Succeeded, which is not retryable, so it will not cause any problems.

since the Workflow I have above is an example of an unsucceeded Workflow, right? But you're saying that in my example, the Root node is of type Pod?

And in your example what type is the Root node? I imagine there are many nodes in yours, right? Can you help me understand what the tree looks like for that?

Thank you for bearing with me. I'm guessing your change is a good one. I just want to make sure I fully understand.

@juliev0 (Contributor) commented Aug 25, 2024

Okay, I just decided to try running yours. This is the result:

  nodes:
    workflow-exit-handler-fail:
      displayName: workflow-exit-handler-fail
      finishedAt: "2024-08-25T18:42:37Z"
      hostNodeName: lima-rancher-desktop
      id: workflow-exit-handler-fail
      name: workflow-exit-handler-fail
      outputs:
        artifacts:
        - name: main-logs
          s3:
            key: workflow-exit-handler-fail/workflow-exit-handler-fail/main.log
        exitCode: "0"
      phase: Succeeded
      progress: 1/1
      resourcesDuration:
        cpu: 0
        memory: 5
      startedAt: "2024-08-25T18:42:28Z"
      templateName: echo
      templateScope: local/workflow-exit-handler-fail
      type: Pod
    workflow-exit-handler-fail-925896592:
      children:
      - workflow-exit-handler-fail-1000917866
      displayName: workflow-exit-handler-fail.onExit
      finishedAt: "2024-08-25T18:42:47Z"
      id: workflow-exit-handler-fail-925896592
      message: child 'workflow-exit-handler-fail-2100426797' failed
      name: workflow-exit-handler-fail.onExit
      nodeFlag:
        hooked: true
      outboundNodes:
      - workflow-exit-handler-fail-2100426797
      phase: Failed
      progress: 0/1
      resourcesDuration:
        cpu: 0
        memory: 2
      startedAt: "2024-08-25T18:42:40Z"
      templateName: exit-handler
      templateScope: local/workflow-exit-handler-fail
      type: Steps
    workflow-exit-handler-fail-1000917866:
      boundaryID: workflow-exit-handler-fail-925896592
      children:
      - workflow-exit-handler-fail-2100426797
      displayName: '[0]'
      finishedAt: "2024-08-25T18:42:47Z"
      id: workflow-exit-handler-fail-1000917866
      message: child 'workflow-exit-handler-fail-2100426797' failed
      name: workflow-exit-handler-fail.onExit[0]
      nodeFlag: {}
      phase: Failed
      progress: 0/1
      resourcesDuration:
        cpu: 0
        memory: 2
      startedAt: "2024-08-25T18:42:40Z"
      templateScope: local/workflow-exit-handler-fail
      type: StepGroup
    workflow-exit-handler-fail-2100426797:
      boundaryID: workflow-exit-handler-fail-925896592
      displayName: exit-handler-task
      finishedAt: "2024-08-25T18:42:44Z"
      hostNodeName: lima-rancher-desktop
      id: workflow-exit-handler-fail-2100426797
      message: Error (exit code 1)
      name: workflow-exit-handler-fail.onExit[0].exit-handler-task
      outputs:
        artifacts:
        - name: main-logs
          s3:
            key: workflow-exit-handler-fail/workflow-exit-handler-fail-fail-2100426797/main.log
        exitCode: "1"
      phase: Failed
      progress: 0/1
      resourcesDuration:
        cpu: 0
        memory: 2
      startedAt: "2024-08-25T18:42:40Z"
      templateName: fail
      templateScope: local/workflow-exit-handler-fail
      type: Pod

So, the root is a Pod type. And in addition there's a Steps node with a child StepGroup with a child Pod, representing the ExitHandler.

@juliev0 (Contributor) commented Aug 25, 2024

The thing I'm not clear on is why this special logic for resetting the root node even exists at all. Do we know why it's not sufficient to just catch all errored GroupNodes here (including the root node)? Is it important to also reset a Succeeded root node, not just an Errored one?

@jswxstw (Member, Author) commented Aug 26, 2024

And in your example what type is the Root node? I imagine there are many nodes in yours, right? Can you help me understand what the tree looks like for that?

The phase of the root node and that of the workflow are usually the same if there is no exit handler.
The type of the root node is usually Steps or DAG, since a workflow usually has many steps or tasks.
My example is very special in that it breaks both of the above rules:

  • The root node is of Pod type, because there is only one node.
  • The root node is Succeeded but the workflow is Failed, because the exit handler failed.

The workflow is Failed, so I can retry it; the Pod-type root node is then reset from Succeeded to Running, which is unacceptable.

@jswxstw (Member, Author) commented Aug 26, 2024

The thing I'm not clear on is why this special logic for resetting the root node even exists at all. Do we know why it's not sufficient to just catch all errored GroupNodes here (including the root node)?

I didn't notice this before. I agree with you that this special logic for resetting the root node seems unnecessary now, since Steps belongs to GroupNode (see the sketch after this comment).

Is it important to also reset a Succeeded root node, not just an Errored one?

Perhaps to keep it consistent with the workflow phase?
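
To make the point concrete, here is a rough, self-contained sketch of the idea being agreed on here (simplified stand-in types and names, not the actual code in workflow/util): if group nodes that failed or errored are already reset, that check also covers a group-typed root node, so the separate default reset of the root node can be dropped.

package main

import "fmt"

type NodePhase string

const (
	NodeFailed NodePhase = "Failed"
	NodeError  NodePhase = "Error"
)

// Node is a simplified stand-in for a workflow node status.
type Node struct {
	Type  string // "Pod", "Steps", "DAG", "StepGroup", ...
	Phase NodePhase
}

func (n Node) failedOrError() bool {
	return n.Phase == NodeFailed || n.Phase == NodeError
}

// shouldReset sketches the discussed condition: failed or errored group nodes
// (DAG, Steps, StepGroup) are reset so they can be re-evaluated. Because a
// group-typed root node is caught by this same check, no separate
// "always reset the root" branch is needed.
func shouldReset(n Node) bool {
	isGroup := n.Type == "DAG" || n.Type == "Steps" || n.Type == "StepGroup"
	return isGroup && n.failedOrError()
}

func main() {
	fmt.Println(shouldReset(Node{Type: "Steps", Phase: NodeFailed})) // true
	fmt.Println(shouldReset(Node{Type: "Pod", Phase: "Succeeded"}))  // false: a Succeeded Pod root is left untouched
}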

@juliev0 (Contributor) commented Aug 26, 2024

Hey @terrytangyuan - I know this code was written a couple years back but do you have any input for this question?

@terrytangyuan (Member)

It should be fine to remove that special handling.

@juliev0 (Contributor) commented Aug 26, 2024

It should be fine to remove that special handling.

Thanks for the quick response @terrytangyuan !

@jswxstw changed the title from "fix: Only reset root node of group type. Fixes #13196" to "fix: Do not reset the root node by default. Fixes #13196" on Aug 27, 2024
@juliev0 merged commit f8f1893 into argoproj:main on Aug 27, 2024
28 checks passed
Joibel pushed a commit to pipekit/argo-workflows that referenced this pull request Sep 19, 2024
Joibel pushed a commit that referenced this pull request Sep 20, 2024
@agilgur5 added this to the v3.5.x patches milestone on Sep 20, 2024
Labels
area/exit-handler, area/retry-manual (Manual workflow "Retry" Action (API/CLI/UI); see retryStrategy for template-level retries)
Development

Successfully merging this pull request may close these issues.

Single-node workflow with failed exit handler cannot be retried correctly
4 participants