✨ Refactor node drain #11074

Merged

@sbueringer
Member

sbueringer commented Aug 20, 2024

What this PR does / why we need it:

The high-level idea of this PR is to inline the parts of the drain helper that we actually want to use, then improve them and make sure the state of Node drains can be easily observed through logs & conditions.

Goals are:

  • Stop actively waiting for 20s for Pods to go away after eviction was triggered
    • Bad for performance at scale. Controllers should just requeue in that situation
  • Stop spawning goroutines to evict Pods in parallel
    • Without the "wait for Pod deletion" step there is no reason to evict Pods in parallel anymore
  • Provide more precise information in logs & conditions about the current state of the drain
  • Ideally we should use a controller-runtime client from the ClusterCacheTracker (CCT) instead of creating a separate client-go client

Notes:

  • policy/v1.Eviction has been available since Kubernetes v1.22. So we don't have to fall back to policy/v1beta1 or Pod deletion anymore
  • I'll work on some additional e2e test coverage, but that won't be part of this PR

Big Kudos to @chrischdi for collaborating on this PR and figuring out a few technical challenges together!!

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #10056

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress (Indicates that a PR should not merge because it is a work in progress.), cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.) and do-not-merge/needs-area (PR is missing an area label) labels Aug 20, 2024
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Aug 20, 2024
@sbueringer sbueringer added the area/machine Issues or PRs related to machine lifecycle management label Aug 20, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/needs-area PR is missing an area label label Aug 20, 2024
@sbueringer
Member Author

/test pull-cluster-api-e2e-main

@chrischdi
Member

Related to:

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 22, 2024
@sbueringer sbueringer force-pushed the pr-improve-node-drain branch 2 times, most recently from 55a1a03 to e7f7978 Compare August 23, 2024 13:15
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 23, 2024
@sbueringer sbueringer changed the title [WIP] ✨ Refactor node drain ✨ Refactor node drain Aug 23, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 23, 2024
@sbueringer
Member Author

/test ?

@k8s-ci-robot
Contributor

@sbueringer: The following commands are available to trigger required jobs:

  • /test pull-cluster-api-build-main
  • /test pull-cluster-api-e2e-blocking-main
  • /test pull-cluster-api-e2e-conformance-ci-latest-main
  • /test pull-cluster-api-e2e-conformance-main
  • /test pull-cluster-api-e2e-main
  • /test pull-cluster-api-e2e-mink8s-main
  • /test pull-cluster-api-e2e-upgrade-1-31-1-32-main
  • /test pull-cluster-api-test-main
  • /test pull-cluster-api-test-mink8s-main
  • /test pull-cluster-api-verify-main

The following commands are available to trigger optional jobs:

  • /test pull-cluster-api-apidiff-main

Use /test all to run the following jobs that were automatically triggered:

  • pull-cluster-api-apidiff-main
  • pull-cluster-api-build-main
  • pull-cluster-api-e2e-blocking-main
  • pull-cluster-api-test-main
  • pull-cluster-api-verify-main

@sbueringer
Member Author

/assign @fabriziopandini @chrischdi @vincepri

/assign @adilGhaffarDev @guettli
(as you previously showed interest in this topic, would be great if you can give this a try)

@k8s-ci-robot
Contributor

@sbueringer: GitHub didn't allow me to assign the following users: guettli.

Note that only kubernetes-sigs members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

@sbueringer sbueringer added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Aug 23, 2024
@sbueringer
Member Author

/test pull-cluster-api-e2e-conformance-ci-latest-main
/test pull-cluster-api-e2e-conformance-main
/test pull-cluster-api-e2e-main
/test pull-cluster-api-e2e-mink8s-main
/test pull-cluster-api-e2e-upgrade-1-31-1-32-main

@fabriziopandini
Member

Latest commits look fine to me,
again great work! This is a huge improvement in observability for machine deletion

Happy to lgtm as soon as we agree on the last few discussion threads

@sbueringer
Member Author

@fabriziopandini Fixed & answered

/hold
(I want to do a proper squash with a good commit message after lgtm's)

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 30, 2024
@sbueringer
Member Author

/test ?

@k8s-ci-robot
Contributor

@sbueringer: The following commands are available to trigger required jobs:

  • /test pull-cluster-api-build-main
  • /test pull-cluster-api-e2e-blocking-main
  • /test pull-cluster-api-e2e-conformance-ci-latest-main
  • /test pull-cluster-api-e2e-conformance-main
  • /test pull-cluster-api-e2e-latestk8s-main
  • /test pull-cluster-api-e2e-main
  • /test pull-cluster-api-e2e-mink8s-main
  • /test pull-cluster-api-e2e-upgrade-1-31-1-32-main
  • /test pull-cluster-api-test-main
  • /test pull-cluster-api-test-mink8s-main
  • /test pull-cluster-api-verify-main

The following commands are available to trigger optional jobs:

  • /test pull-cluster-api-apidiff-main

Use /test all to run the following jobs that were automatically triggered:

  • pull-cluster-api-apidiff-main
  • pull-cluster-api-build-main
  • pull-cluster-api-e2e-blocking-main
  • pull-cluster-api-test-main
  • pull-cluster-api-verify-main

@chrischdi
Member

/lgtm

Thanks!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 30, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: d8865a6d9b19d8d6af7157a19abf8ea512d2a70d

@fabriziopandini
Member

/lgtm

@sbueringer
Member Author

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 30, 2024
@sbueringer
Member Author

/test pull-cluster-api-e2e-conformance-ci-latest-main
/test pull-cluster-api-e2e-conformance-main
/test pull-cluster-api-e2e-latestk8s-main
/test pull-cluster-api-e2e-main
/test pull-cluster-api-e2e-mink8s-main
/test pull-cluster-api-e2e-upgrade-1-31-1-32-main

@sbueringer
Member Author

@enxebre Should be ready for a final review :) (@fabriziopandini @chrischdi only squashed since your last review, so lgtm was preserved)

Documentation PR and a PR to extend e2e test coverage will follow shortly

@k8s-ci-robot
Contributor

@sbueringer: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-cluster-api-apidiff-main | 466bf1a | link | false | /test pull-cluster-api-apidiff-main |

@enxebre
Member

enxebre commented Sep 2, 2024

This is great! We should be able to come up with repeatable e2e checks now. Thanks!
/lgtm
/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 2, 2024
@k8s-ci-robot k8s-ci-robot merged commit 3232abc into kubernetes-sigs:main Sep 2, 2024
33 of 35 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.9 milestone Sep 2, 2024
@sbueringer
Member Author

Thx everyone for the quick reviews :)

@sbueringer sbueringer deleted the pr-improve-node-drain branch September 2, 2024 09:07
@adilGhaffarDev
Contributor

@sbueringer can we backport it to v1.8?

@sbueringer
Member Author

In my opinion the change is too big for that

@enxebre
Member

enxebre commented Sep 2, 2024

Agree with the size concern. Also I think this needs to be released in latest and rolled out to running clusters so we can gather real-world feedback and cover any regressions before considering backporting at all.

Successfully merging this pull request may close these issues.

Add more descriptive Message to DrainingFailedReason and DrainingReason
7 participants