Experiment step lifecycle is not respecting defined behaviour around requiredForCompletion #4016

Open

meeech opened this issue Dec 19, 2024 · 1 comment
Labels: bug (Something isn't working)

meeech commented Dec 19, 2024

Looking at https://argoproj.github.io/argo-rollouts/features/experiment/#experiment-lifecycle

I want to check whether this is a bug or whether I'm misunderstanding the described behaviour.

Given this setup:

...
  strategy:
    canary:
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 1
      steps:
        - experiment:
            duration: 1m
            templates:
              - name: canary-exp
                specRef: canary
            analyses:
              - name: canary-test
                requiredForCompletion: true
                templateName: success-rate
        - setWeight: 10
        - pause: { duration: 120s }
        - setWeight: 100
...

The analysis run metric in the experiment step is set to count: 20 with interval: 10s, so it should run for about 200 seconds.
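For context, a minimal sketch of what the success-rate template's metric settings would look like (the success condition, provider, and query below are placeholder assumptions for illustration, not the actual template):

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    - name: success-rate
      # 20 measurements at 10s intervals ~= 200 seconds total
      count: 20
      interval: 10s
      # placeholder condition and Prometheus provider -- assumed for illustration
      successCondition: result[0] >= 0.95
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc.cluster.local:9090
          query: |
            sum(rate(http_requests_total{status=~"2.."}[1m])) /
            sum(rate(http_requests_total[1m]))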
What I am observing is that the experiment step doesn't wait for the analysis run to finish before tearing down the experiment pod, even though I can see the analysis run still running. The experiment pod itself is torn down with some delay: once I pass the duration set in the experiment, I can see a delay indicator

│ ├──⧉ basic-rollout-exp-steps-765fc64d58-17-0-canary-exp ReplicaSet ✔ Healthy 61s delay:29s

show up in the output of kubectl-argo-rollouts get rollout basic-rollout-exp-steps, after which the ReplicaSet gets scaled down.

⟳ basic-rollout-exp-steps                                       Rollout      ◌ Progressing  27h
├──# revision:17
│  ├──⧉ basic-rollout-exp-steps-765fc64d58                      ReplicaSet   • ScaledDown   106s   canary
│  └──Σ basic-rollout-exp-steps-765fc64d58-17-0                 Experiment   ◌ Running      106s
│     ├──⧉ basic-rollout-exp-steps-765fc64d58-17-0-canary-exp   ReplicaSet   • ScaledDown   106s   delay:passed
│     └──α basic-rollout-exp-steps-765fc64d58-17-0-canary-test  AnalysisRun  ◌ Running      104s   ✔ 11

My expectation is that the experiment pod is not torn down, respecting this documented behaviour:

If one or more of the referenced AnalysisTemplates is marked with requiredForCompletion: true, the Experiment will not complete until those AnalysisRuns have completed, even if it exceeds the Experiment duration.

Another behaviour I am observing: when requiredForCompletion: true is set on an analysis for an experiment step, and the analysis run completes BEFORE the duration set for the experiment, the experiment ends prematurely.
So if the experiment duration is 10 minutes, but I have a k8s Job analysis run for the experiment which takes 1 minute with requiredForCompletion: true, then the experiment ends in 1 minute, not 10 as expected.
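A minimal sketch of that second scenario, assuming a hypothetical Job-based template named quick-job-check (the image and command are placeholders; the Job just sleeps for a minute to stand in for the real check):

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: quick-job-check
spec:
  metrics:
    - name: quick-job-check
      provider:
        job:
          spec:
            backoffLimit: 0
            template:
              spec:
                restartPolicy: Never
                containers:
                  - name: check
                    image: busybox
                    # stand-in for a real check that finishes in ~1 minute
                    command: ["sh", "-c", "sleep 60"]

...
      steps:
        - experiment:
            duration: 10m
            templates:
              - name: canary-exp
                specRef: canary
            analyses:
              - name: quick-check
                requiredForCompletion: true
                templateName: quick-job-check
...

Per the documented behaviour, requiredForCompletion should only extend an Experiment past its duration while the analysis is still running, not cut the duration short when the analysis finishes early.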

Argo Rollouts version: 1.7.2

Checklist:

  • I've included steps to reproduce the bug.
  • I've included the version of argo rollouts.

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

meeech added the bug label Dec 19, 2024
meeech changed the title from "Experiment step lifecycle is not respecting defined behaviour" to "Experiment step lifecycle is not respecting defined behaviour around requiredForCompletion" Dec 19, 2024
meeech commented Dec 19, 2024

The expected behaviour is also described in the FAQ: "If the `requiredForCompletion` field is set, the Experiment only marks itself as Successful and scales down the created ReplicaSets when the AnalysisRun finishes Successfully." So I'm pretty sure this is a bug.
