🌱 Improve node drain e2e test #11127

Merged


sbueringer
Member

@sbueringer sbueringer commented Sep 2, 2024

What this PR does / why we need it:
Extends the test coverage for Node drain.

The test goes through the following steps:

  • Create cluster with 3 CP & 1 worker Machine
  • Ensure Node label is set & NodeDrainTimeout is set to 0 (wait forever)
  • Deploy Deployment with unevictable Pods on CP & MD Nodes
  • Deploy Deployment with evictable Pods with finalizer on CP & MD Nodes
  • Trigger Scale down to 1 CP and 0 MD Machines
  • Verify Node drains for control plane and MachineDeployment Machines are blocked (PDBs & Pods with finalizer)
    • DrainingSucceeded conditions should:
      • show 1 evicted Pod with deletionTimestamp (still exists because of finalizer)
      • show 1 Pod which could not be evicted because of PDB
    • Verify the evicted Pod has terminated (i.e. succeeded) and it was evicted
  • Unblock deletion of evicted Pods by removing the finalizer
  • Verify Node drains for control plane and MachineDeployment Machines are blocked (only PDBs)
    • DrainingSucceeded conditions should:
      • not contain any Pods with deletionTimestamp
      • show 1 Pod which could not be evicted because of PDB
  • Set NodeDrainTimeout to 1s to unblock drain
  • Verify scale down succeeded because Node drains were unblocked.
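
The steps above can be pictured with a small toy model. This is a minimal sketch, not the Cluster API drain code or the e2e test itself: the `pod` and `drainStatus` types and the `drain` and `runScenario` functions are hypothetical names introduced only for illustration, and the model uses a single Node with two Pods. The one rule taken from the description is that a NodeDrainTimeout of 0 means "wait forever".

```go
package main

import (
	"fmt"
	"time"
)

// pod is a toy stand-in for a Kubernetes Pod (hypothetical, not corev1.Pod).
type pod struct {
	name         string
	blockedByPDB bool // eviction is rejected by a PodDisruptionBudget
	hasFinalizer bool // the Pod object lingers after eviction until the finalizer is removed
	deleting     bool // models a set deletionTimestamp
}

// drainStatus summarizes what a DrainingSucceeded-style condition might report.
type drainStatus struct {
	deleting   int  // Pods evicted but still present (deletionTimestamp set)
	notEvicted int  // Pods that could not be evicted because of a PDB
	done       bool // true once nothing is left to drain or the timeout has expired
}

// drain attempts to evict every Pod once and returns the Pods that remain.
// A timeout of 0 is treated as "wait forever", as in the PR description.
func drain(pods []*pod, timeout, elapsed time.Duration) ([]*pod, drainStatus) {
	var remaining []*pod
	var st drainStatus
	for _, p := range pods {
		switch {
		case p.blockedByPDB:
			st.notEvicted++
			remaining = append(remaining, p)
		case p.hasFinalizer:
			p.deleting = true
			st.deleting++
			remaining = append(remaining, p)
		}
		// Any other Pod is evicted and disappears from the Node.
	}
	timedOut := timeout > 0 && elapsed >= timeout
	st.done = len(remaining) == 0 || timedOut
	return remaining, st
}

// runScenario replays the three phases of the test on the toy model.
func runScenario() []drainStatus {
	pods := []*pod{
		{name: "unevictable", blockedByPDB: true},
		{name: "evictable", hasFinalizer: true},
	}
	var out []drainStatus

	// Phase 1: drain is blocked by the PDB and by the finalizer.
	pods, st := drain(pods, 0, time.Minute)
	out = append(out, st)

	// Phase 2: finalizers are removed, so only the PDB still blocks.
	for _, p := range pods {
		p.hasFinalizer = false
	}
	pods, st = drain(pods, 0, 2*time.Minute)
	out = append(out, st)

	// Phase 3: the timeout is set to 1s, which has long expired.
	_, st = drain(pods, time.Second, 3*time.Minute)
	out = append(out, st)
	return out
}

func main() {
	for i, st := range runScenario() {
		fmt.Printf("phase %d: %+v\n", i+1, st)
	}
}
```

In this model the first phase reports one deleting Pod and one Pod blocked by the PDB, the second reports only the blocked Pod, and the third is marked done because the 1s timeout has expired even though the PDB still blocks eviction.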

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Part of #10056

@sbueringer sbueringer added the area/machine Issues or PRs related to machine lifecycle management label Sep 2, 2024
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 2, 2024
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 2, 2024
@sbueringer sbueringer changed the title Improve node drain e2e test 🌱 Improve node drain e2e test Sep 2, 2024
@sbueringer sbueringer changed the title 🌱 Improve node drain e2e test 🌱 KCP: propagate timeouts to deleting Machines / Improve node drain e2e test Sep 2, 2024
@sbueringer
Member Author

/hold

Needs slightly more work

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 2, 2024
@sbueringer
Member Author

/test pull-cluster-api-e2e-main

@sbueringer sbueringer changed the title 🌱 KCP: propagate timeouts to deleting Machines / Improve node drain e2e test [WIP] 🌱 KCP: propagate timeouts to deleting Machines / Improve node drain e2e test Sep 2, 2024
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 2, 2024
@sbueringer sbueringer changed the title [WIP] 🌱 KCP: propagate timeouts to deleting Machines / Improve node drain e2e test [WIP] 🌱 Improve node drain e2e test Sep 2, 2024
@sbueringer sbueringer changed the title [WIP] 🌱 Improve node drain e2e test 🌱 Improve node drain e2e test Sep 2, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 2, 2024
@sbueringer
Member Author

@chrischdi @fabriziopandini @enxebre
PR should be ready for review. Please ignore the "KCP: propagate timeouts to Machines with deletionTimestamp" commit

/hold
(for rebase when the other PR for KCP is merged)

@sbueringer
Member Author

/test ?

@k8s-ci-robot
Contributor

@sbueringer: The following commands are available to trigger required jobs:

  • /test pull-cluster-api-build-main
  • /test pull-cluster-api-e2e-blocking-main
  • /test pull-cluster-api-e2e-conformance-ci-latest-main
  • /test pull-cluster-api-e2e-conformance-main
  • /test pull-cluster-api-e2e-latestk8s-main
  • /test pull-cluster-api-e2e-main
  • /test pull-cluster-api-e2e-mink8s-main
  • /test pull-cluster-api-e2e-upgrade-1-31-1-32-main
  • /test pull-cluster-api-test-main
  • /test pull-cluster-api-test-mink8s-main
  • /test pull-cluster-api-verify-main

The following commands are available to trigger optional jobs:

  • /test pull-cluster-api-apidiff-main

Use /test all to run the following jobs that were automatically triggered:

  • pull-cluster-api-apidiff-main
  • pull-cluster-api-build-main
  • pull-cluster-api-e2e-blocking-main
  • pull-cluster-api-test-main
  • pull-cluster-api-verify-main

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
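
For readers unfamiliar with these comment commands, here is a rough sketch of how slash commands such as `/test <job>` or `/hold cancel` could be pulled out of a comment body. It is an assumption-laden illustration, not Prow's actual parser: the `command` type and `parseCommands` function are hypothetical, and the only rule modelled is "a line whose first word starts with `/` is a command".

```go
package main

import (
	"fmt"
	"strings"
)

// command is a hypothetical representation of one slash command.
type command struct {
	name string   // e.g. "test", "hold", "retest"
	args []string // e.g. ["pull-cluster-api-e2e-main"] or ["cancel"]
}

// parseCommands returns every line of the comment whose first word starts with "/".
// Other lines (ordinary prose) are ignored.
func parseCommands(comment string) []command {
	var cmds []command
	for _, line := range strings.Split(comment, "\n") {
		fields := strings.Fields(line)
		if len(fields) == 0 || len(fields[0]) < 2 || !strings.HasPrefix(fields[0], "/") {
			continue
		}
		cmds = append(cmds, command{
			name: strings.TrimPrefix(fields[0], "/"),
			args: fields[1:],
		})
	}
	return cmds
}

func main() {
	comment := "/test pull-cluster-api-e2e-main\nNeeds slightly more work\n/hold cancel"
	for _, c := range parseCommands(comment) {
		fmt.Printf("command=%s args=%v\n", c.name, c.args)
	}
}
```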

@sbueringer
Member Author

/test pull-cluster-api-e2e-conformance-ci-latest-main
/test pull-cluster-api-e2e-conformance-main
/test pull-cluster-api-e2e-main
/test pull-cluster-api-e2e-mink8s-main
/test pull-cluster-api-e2e-upgrade-1-31-1-32-main

@fabriziopandini
Member

Nice! We are now testing that drain blocks and that it can be unblocked! (We also dropped a few templates.)
Ping me when ready for approval.

@sbueringer
Member Author

> Nice! We are now testing that drain blocks and that it can be unblocked! (We also dropped a few templates.)
> Ping me when ready for approval.

Oops, forgot to remove the hold after I finalized the PR :)

/hold cancel

@fabriziopandini Should be ready

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 3, 2024
@sbueringer
Member Author

sbueringer commented Sep 3, 2024

Never mind, I forgot about the underlying commit. :D (I only read my own first hold message and missed the second.)

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 3, 2024
@enxebre
Member

enxebre commented Sep 4, 2024

Note for myself: Add a test case for unreachable Node.
/lgtm

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Sep 5, 2024
@sbueringer
Member Author

/test pull-cluster-api-e2e-conformance-ci-latest-main
/test pull-cluster-api-e2e-conformance-main
/test pull-cluster-api-e2e-main
/test pull-cluster-api-e2e-mink8s-main
/test pull-cluster-api-e2e-upgrade-1-31-1-32-main

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Sep 5, 2024
@sbueringer
Member Author

/test pull-cluster-api-e2e-mink8s-main

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 6, 2024
@sbueringer
Member Author

sbueringer commented Sep 6, 2024

@enxebre @chrischdi @fabriziopandini Should be ready for review again :)

Extended the coverage further: the test now also verifies that Node drain actually works :) (i.e. Pods are evicted and terminate)

(see PR description or the godoc comment for the test spec for a summary of what the test does)

@sbueringer
Member Author

/test pull-cluster-api-e2e-conformance-ci-latest-main
/test pull-cluster-api-e2e-conformance-main
/test pull-cluster-api-e2e-latestk8s-main
/test pull-cluster-api-e2e-main
/test pull-cluster-api-e2e-mink8s-main
/test pull-cluster-api-e2e-upgrade-1-31-1-32-main

@kubernetes-sigs kubernetes-sigs deleted a comment from k8s-ci-robot Sep 6, 2024
@sbueringer
Member Author

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 6, 2024
@sbueringer sbueringer removed the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Sep 6, 2024
@chrischdi
Member

/retest

Docker Flake:

msg: "could not find a log line that matches \"Reached target .*Multi-User System.*|detected cgroup v1\"",
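
To make the failure message concrete, the sketch below applies the quoted expression to a few log lines using Go's standard `regexp` package. The thread does not say which component emits this message or how it scans the logs, so the `nodeBooted` helper and the sample log lines are hypothetical; only the pattern itself is taken from the message.

```go
package main

import (
	"fmt"
	"regexp"
)

// bootPattern is the expression quoted in the failure message.
var bootPattern = regexp.MustCompile(`Reached target .*Multi-User System.*|detected cgroup v1`)

// nodeBooted reports whether any log line matches bootPattern.
// It is a hypothetical helper for illustration only.
func nodeBooted(logLines []string) bool {
	for _, line := range logLines {
		if bootPattern.MatchString(line) {
			return true
		}
	}
	return false
}

func main() {
	// Sample log lines are made up for this example.
	booted := []string{
		"Starting init...",
		"[  OK  ] Reached target Multi-User System.",
	}
	stuck := []string{
		"Starting init...",
		"Failed to start something",
	}
	fmt.Println(nodeBooted(booted)) // true
	fmt.Println(nodeBooted(stuck))  // false
}
```

A flake of this kind would correspond to the second case: the expected line never shows up in the logs that were inspected.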

@enxebre
Member

enxebre commented Sep 9, 2024

This is a huge improvement in this area, thanks!
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 9, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: f068faff6ee13450e2a3931541b137f06e73e293

@fabriziopandini
Member

fabriziopandini commented Sep 9, 2024

Awesome work! 🥇
/lgtm
/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fabriziopandini


@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 9, 2024
@sbueringer
Member Author

Looks pretty bad, but unrelated (@chrischdi, right?)

/retest

@sbueringer
Member Author

Same flake again...

/retest

@sbueringer
Member Author

Still no clue how this could be related

/retest

@k8s-ci-robot k8s-ci-robot merged commit eebff7b into kubernetes-sigs:main Sep 9, 2024
32 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.9 milestone Sep 9, 2024
@sbueringer sbueringer deleted the pr-improve-node-drain-e2e-test branch September 9, 2024 16:28
@sbueringer
Member Author

Let's keep an eye on CI to see if this increases the number of flakes somehow

Labels
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/machine: Issues or PRs related to machine lifecycle management
  • cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
  • lgtm: "Looks good to me", indicates that a PR is ready to be merged.
  • size/XL: Denotes a PR that changes 500-999 lines, ignoring generated files.

5 participants