Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: rollback to stable with dynamicStableScale could overwhelm stable pods #3077

Merged

Conversation

jessesuen
Copy link
Member

@jessesuen jessesuen commented Oct 4, 2023

Resolves #3020

Changes made:

1. Safer scaledown decisioning of "old" ReplicaSets

When scaling down "old" ReplicaSets, we now additionally check if they are still referenced by services before allowing them to be scaled down. This is an overall safety improvement to make sure we scale down older ReplicaSets only after it doesn't matter anymore (nothing is pointing to them).

2. Delay canary service selector switch, if we are rolling back to stable with dynamicStableScale and it is not fully available

Before this PR, when the user re-applied stable pod spec, the rollout controller would immediately change the service selector of the canary service back to the stable RS. This meant that all production traffic immediately hit the stable RS, even if it was undersized due to dynamicStableScale. This PR detects if dynamicStableScale is used, and if stableRS is not fully available, then we delay the service switch until stableRS is fully available. Now that we no longer touch the the canary service in this scenario, it allows the controller to continue to balance traffic between stable vs. previous desiredRS, so that traffic can be safely shifted away from it (improvement 3 below).

3. Gradually traffic shift back to stable RS to avoid overwhelming stable pods

A user can re-apply the stable manifest in the middle of an update. When this happens, stableRS == desiredRS. Before this PR, when this happened in the middle of an update, we would immediately shift 100% of the traffic back to stableRS (by setting weight to 0), even though it may have been undersized due to dyanamicStableScaling. With this PR, we will now follow similar logic as abort with dynamicStableScaling, and only increase traffic back to stable in accordance with stable's available replica count. A key difference between abort vs. stable rollback is that:

  • With Abort: traffic is balanced between desired vs. stable
  • With Stable Rollback: traffic is balanced between previous desired vs. stable

Because of this difference, the trafficrouting logic has special case logic for traffic splitting to happen between previous desired vs. stable in the event of a rollback to stable.

I will be adding an e2e test for this.

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this is a chore.
  • The title of the PR is (a) conventional with a list of types and scopes found here, (b) states what changed, and (c) suffixes the related issues number. E.g. "fix(controller): Updates such and such. Fixes #1234".
  • I've signed my commits with DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My builds are green. Try syncing with master if they are not.
  • My organization is added to USERS.md.

@jessesuen jessesuen force-pushed the fix/rollback-stable-unavailability branch from 319d0a8 to 4c4bfd7 Compare October 4, 2023 21:44
@jessesuen
Copy link
Member Author

jessesuen commented Oct 4, 2023

Here is the new order of events after this PR:

  1. First rollout is updated and reaches a pause step where 90% is to canary (this is the same as before):
8s          Normal    Updated VirtualService   rollout/istio-canary                 VirtualService `istio-canary` set to desiredWeight '90'
8m33s       Normal    TrafficWeightUpdated     rollout/istio-canary                 Traffic weight updated from 0 to 90
8m33s       Normal    RolloutStepCompleted     rollout/istio-canary                 Rollout step 1/2 completed (setWeight: 90)
8m33s       Normal    ScalingReplicaSet        rollout/istio-canary                 Scaled down ReplicaSet istio-canary-f7d4dcd68 (revision 15) from 10 to 1
8m33s       Normal    RolloutPaused            rollout/istio-canary                 Rollout is paused (CanaryPauseStep)
  1. Next rollout is updated to go back to the stable pod spec. This is detected as a "rollback to stable". By looking at the Kubernetes events, we can see that we now do things in the correct order:
  • scale up stable
  • shift traffic away from the previous desired RS (80 -> 20 -> 0)
  • switch service selector of canary back to stable RS
  • scaled down the old rs (previous desired)
8s          Normal    RolloutUpdated           rollout/istio-canary                 Rollout updated to revision 17
8s          Normal    SkipSteps                rollout/istio-canary                 Rollback to stable ReplicaSets
8s          Normal    RolloutResumed           rollout/istio-canary                 Rollout is resumed
8s          Normal    ScalingReplicaSet        rollout/istio-canary                 Scaled up ReplicaSet istio-canary-f7d4dcd68 (revision 17) from 1 to 10
0s          Normal    Updated VirtualService   rollout/istio-canary                 VirtualService `istio-canary` set to desiredWeight '80'
0s          Normal    TrafficWeightUpdated     rollout/istio-canary                 Traffic weight updated from 90 to 80
0s          Normal    Updated VirtualService   rollout/istio-canary                 VirtualService `istio-canary` set to desiredWeight '20'
0s          Normal    TrafficWeightUpdated     rollout/istio-canary                 Traffic weight updated from 80 to 20
0s          Normal    SwitchService            rollout/istio-canary                 Switched selector for service 'istio-canary-canary' from '7b8bcb8869' to 'f7d4dcd68'
0s          Normal    Updated VirtualService   rollout/istio-canary                 VirtualService `istio-canary` set to desiredWeight '0'
0s          Normal    TrafficWeightUpdated     rollout/istio-canary                 Traffic weight updated from 20 to 0
0s          Normal    ScalingReplicaSet        rollout/istio-canary                 Scaled down ReplicaSet istio-canary-7b8bcb8869 (revision 16) from 9 to 0

@jessesuen jessesuen changed the title fix: rollback to stable with dynamicStableScale could go under maxUnavailable fix: rollback to stable with dynamicStableScale could overwhelm stable pods Oct 4, 2023
@codecov
Copy link

codecov bot commented Oct 4, 2023

Codecov Report

Attention: 13 lines in your changes are missing coverage. Please review.

Comparison is base (047f2a2) 81.82% compared to head (cc1a941) 81.85%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3077      +/-   ##
==========================================
+ Coverage   81.82%   81.85%   +0.02%     
==========================================
  Files         134      134              
  Lines       20507    20556      +49     
==========================================
+ Hits        16780    16826      +46     
- Misses       2864     2866       +2     
- Partials      863      864       +1     
Files Coverage Δ
rollout/canary.go 76.11% <100.00%> (+1.17%) ⬆️
rollout/service.go 79.13% <100.00%> (+0.94%) ⬆️
utils/replicaset/replicaset.go 87.63% <ø> (-0.82%) ⬇️
rollout/replicaset.go 73.82% <97.36%> (+4.10%) ⬆️
rollout/bluegreen.go 68.56% <0.00%> (+0.25%) ⬆️
utils/replicaset/canary.go 94.02% <57.14%> (+0.05%) ⬆️
rollout/trafficrouting.go 72.85% <73.91%> (-1.08%) ⬇️

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@github-actions
Copy link
Contributor

github-actions bot commented Oct 4, 2023

Go Published Test Results

2 083 tests   2 083 ✔️  2m 48s ⏱️
   118 suites         0 💤
       1 files           0

Results for commit cc1a941.

♻️ This comment has been updated with latest results.

@github-actions
Copy link
Contributor

github-actions bot commented Oct 4, 2023

E2E Tests Published Test Results

0 tests   0 ✔️  0s ⏱️
0 suites  0 💤
0 files    0

Results for commit cc1a941.

♻️ This comment has been updated with latest results.

@jessesuen jessesuen force-pushed the fix/rollback-stable-unavailability branch 2 times, most recently from cadd42b to 722ebb4 Compare October 5, 2023 02:42
@jessesuen jessesuen marked this pull request as ready for review October 5, 2023 16:00
@jessesuen
Copy link
Member Author

jessesuen commented Oct 5, 2023

This is ready for review (remaining work is to add unit + e2e tests)

EDIT: unit tests and e2e tests have been added

rollout/canary.go Outdated Show resolved Hide resolved
@jessesuen jessesuen force-pushed the fix/rollback-stable-unavailability branch from 13099ef to 86d83ed Compare October 17, 2023 03:25
@zachaller zachaller enabled auto-merge (squash) October 23, 2023 11:43
@zachaller zachaller force-pushed the fix/rollback-stable-unavailability branch from 86d83ed to cc1a941 Compare October 23, 2023 11:44
@sonarqubecloud
Copy link

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot E 4 Security Hotspots
Code Smell A 1 Code Smell

No Coverage information No Coverage information
12.1% 12.1% Duplication

@zachaller zachaller merged commit 7f25955 into argoproj:master Oct 23, 2023
22 checks passed
zachaller pushed a commit that referenced this pull request Oct 25, 2023
…e pods (#3077)

* fix: rollback to stable with dynamicStableScale could go under maxUnavailable

Signed-off-by: Jesse Suen <[email protected]>

* test: add unit tests

Signed-off-by: Jesse Suen <[email protected]>

* test: add e2e tests

Signed-off-by: Jesse Suen <[email protected]>

* refactor: move isReplicaSetReferenced to replicaset.go

Signed-off-by: Jesse Suen <[email protected]>

---------

Signed-off-by: Jesse Suen <[email protected]>
zachaller pushed a commit that referenced this pull request Oct 25, 2023
…e pods (#3077)

* fix: rollback to stable with dynamicStableScale could go under maxUnavailable

Signed-off-by: Jesse Suen <[email protected]>

* test: add unit tests

Signed-off-by: Jesse Suen <[email protected]>

* test: add e2e tests

Signed-off-by: Jesse Suen <[email protected]>

* refactor: move isReplicaSetReferenced to replicaset.go

Signed-off-by: Jesse Suen <[email protected]>

---------

Signed-off-by: Jesse Suen <[email protected]>
Signed-off-by: zachaller <[email protected]>
@jessesuen jessesuen deleted the fix/rollback-stable-unavailability branch October 25, 2023 17:02
balasoiuroxana pushed a commit to balasoiuroxana/argo-rollouts that referenced this pull request Oct 27, 2023
…e pods (argoproj#3077)

* fix: rollback to stable with dynamicStableScale could go under maxUnavailable

Signed-off-by: Jesse Suen <[email protected]>

* test: add unit tests

Signed-off-by: Jesse Suen <[email protected]>

* test: add e2e tests

Signed-off-by: Jesse Suen <[email protected]>

* refactor: move isReplicaSetReferenced to replicaset.go

Signed-off-by: Jesse Suen <[email protected]>

---------

Signed-off-by: Jesse Suen <[email protected]>
Signed-off-by: balasoiu <[email protected]>
@zachaller zachaller added cherry-pick-completed Used once we have cherry picked the PR to all requested releases and removed cherry-pick/release-1.4 labels Oct 30, 2023
@Gabryel8818
Copy link

Hello guys, I have same problem in 1.6.6
image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cherry-pick/release-1.5 cherry-pick/release-1.6 cherry-pick-completed Used once we have cherry picked the PR to all requested releases
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fast-tracked rollback to stable with dynamicStableScale went under maxUnavailable
3 participants