feat: Node Repair implementation #1793

Open
wants to merge 5 commits into
base: main
from

Conversation

engedaam
Contributor

@engedaam engedaam commented Oct 30, 2024

Fixes #N/A

Description

  • RFC: RFC: Node Auto Repair #1768
  • This PR implements the recommended solution defined in the node repair RFC.
  • It defines a cloud provider interface, RepairPolicy, that describes the node conditions for which Karpenter will forcefully terminate a node. Each cloud provider policy names an unhealthy condition a node can enter and the duration Karpenter waits before reacting.
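A minimal sketch of what such a policy could look like, for illustration only. The method name `RepairPolicies()` and the type name `RepairPolicy` appear in the code under review; the field names, the plain `string` condition fields, the `fakeProvider` type, and the 30-minute values are assumptions made here to keep the example self-contained.

```go
package main

import (
	"fmt"
	"time"
)

// RepairPolicy is a hypothetical sketch of one cloud provider policy: an
// unhealthy node condition plus how long it may persist before the node is
// terminated. Field names are assumptions, not the PR's definitions.
type RepairPolicy struct {
	ConditionType       string        // e.g. "Ready"
	ConditionStatus     string        // e.g. "False"
	TerminationDuration time.Duration // how long the condition is tolerated
}

// CloudProvider shows only the method relevant to node repair.
type CloudProvider interface {
	RepairPolicies() []RepairPolicy
}

// fakeProvider is an illustrative implementation with made-up values.
type fakeProvider struct{}

func (fakeProvider) RepairPolicies() []RepairPolicy {
	return []RepairPolicy{
		{ConditionType: "Ready", ConditionStatus: "False", TerminationDuration: 30 * time.Minute},
		{ConditionType: "Ready", ConditionStatus: "Unknown", TerminationDuration: 30 * time.Minute},
	}
}

func main() {
	var cp CloudProvider = fakeProvider{}
	for _, p := range cp.RepairPolicies() {
		fmt.Printf("%s=%s tolerated for %s\n", p.ConditionType, p.ConditionStatus, p.TerminationDuration)
	}
}
```

Keeping the policy as plain data lets each cloud provider declare which conditions count as unhealthy without the core controller knowing provider-specific details.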

How was this change tested?

  • make presubmit

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 30, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: engedaam
Once this PR has been reviewed and has the lgtm label, please assign jonathan-innis for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 30, 2024
@coveralls

coveralls commented Oct 30, 2024

Pull Request Test Coverage Report for Build 11874233136

Details

  • 53 of 75 (70.67%) changed or added relevant lines in 5 files are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.06%) to 80.887%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| pkg/controllers/controllers.go | 0 | 8 | 0.0% |
| pkg/controllers/node/health/controller.go | 39 | 53 | 73.58% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| pkg/test/expectations/expectations.go | 2 | 94.73% |

| Totals | Coverage Status |
|---|---|
| Change from base Build 11874015687: | -0.06% |
| Covered Lines: | 8642 |
| Relevant Lines: | 10684 |

💛 - Coveralls

@engedaam engedaam changed the title feat: Node Auto Repair implementation feat: Node Repair implementation Nov 7, 2024
@engedaam engedaam marked this pull request as ready for review November 7, 2024 23:53
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 7, 2024
@engedaam engedaam force-pushed the node-repair-implementation branch 2 times, most recently from 2338123 to 8cefba7 Compare November 8, 2024 00:07
@engedaam engedaam force-pushed the node-repair-implementation branch 9 times, most recently from c8bed26 to 390c056 Compare November 8, 2024 16:14
@k8s-ci-robot k8s-ci-robot added do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 10, 2024
@engedaam engedaam force-pushed the node-repair-implementation branch 8 times, most recently from b20e724 to 8397b07 Compare November 16, 2024 20:09
@engedaam engedaam force-pushed the node-repair-implementation branch 9 times, most recently from 445cb11 to be92bc0 Compare November 17, 2024 01:58
@engedaam engedaam force-pushed the node-repair-implementation branch 2 times, most recently from ffb2103 to f0186f9 Compare November 17, 2024 03:33
ctx = injection.WithControllerName(ctx, "node.health")
ctx = log.IntoContext(ctx, log.FromContext(ctx).WithValues("Node", klog.KRef(node.Namespace, node.Name)))

// Validate that the node is owned by us and is not being deleted
Member

Suggested change
- // Validate that the node is owned by us and is not being deleted
+ // Validate that the node is owned by us

nit: we aren't doing the deletion check anymore

cpPolicyFound := cloudprovider.RepairPolicy{}
// Find a node with a condition that matches one of the unhealthy conditions defined by the cloud provider
// If there are multiple unhealthy status condition we will requeue based on the condition closest to its terminationDuration
for _, policy := range c.cloudProvider.RepairPolicies() {
Member

nit: This feels like it should be a separate function given it feels like it has "arguments" and it's pretty self-contained
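One way the suggested extraction might look, as a rough sketch rather than the code that landed in the PR. The helper name `findClosestRepairPolicy`, the simplified `NodeCondition` struct, the field names, and the sample data are all hypothetical; the real controller works with Kubernetes node conditions.

```go
package main

import (
	"fmt"
	"time"
)

// RepairPolicy and NodeCondition are simplified stand-ins (assumptions) for
// the cloud provider policy and the Kubernetes node condition types.
type RepairPolicy struct {
	ConditionType       string
	ConditionStatus     string
	TerminationDuration time.Duration
}

type NodeCondition struct {
	Type               string
	Status             string
	LastTransitionTime time.Time
}

// findClosestRepairPolicy returns the policy whose matching condition reaches
// its termination deadline first, together with that deadline. ok is false
// when no node condition matches any policy.
func findClosestRepairPolicy(policies []RepairPolicy, conditions []NodeCondition) (policy RepairPolicy, deadline time.Time, ok bool) {
	for _, p := range policies {
		for _, c := range conditions {
			if c.Type != p.ConditionType || c.Status != p.ConditionStatus {
				continue
			}
			d := c.LastTransitionTime.Add(p.TerminationDuration)
			if !ok || d.Before(deadline) {
				policy, deadline, ok = p, d, true
			}
		}
	}
	return policy, deadline, ok
}

// example runs the helper on made-up data and reports the chosen condition type.
func example() (string, bool) {
	start := time.Date(2024, 11, 17, 0, 0, 0, 0, time.UTC)
	policies := []RepairPolicy{
		{ConditionType: "Ready", ConditionStatus: "False", TerminationDuration: 30 * time.Minute},
		{ConditionType: "DiskPressure", ConditionStatus: "True", TerminationDuration: 10 * time.Minute},
	}
	conditions := []NodeCondition{
		{Type: "Ready", Status: "False", LastTransitionTime: start},
		{Type: "DiskPressure", Status: "True", LastTransitionTime: start.Add(5 * time.Minute)},
	}
	p, _, ok := findClosestRepairPolicy(policies, conditions)
	return p.ConditionType, ok
}

func main() {
	conditionType, ok := example()
	fmt.Println(conditionType, ok)
}
```

With the loop isolated like this, the reconcile function only has to decide between requeueing until the returned deadline and deleting the node, and the matching logic can be unit tested on its own.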

return reconcile.Result{}, client.IgnoreNotFound(err)
}
if err := c.kubeClient.Delete(ctx, nodeClaim); err != nil {
return reconcile.Result{}, err
Member

This delete could have a NotFound error -- consider adding a client.IgnoreNotFound
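The pattern being suggested can be sketched with standard-library stand-ins. In the real controller this would be controller-runtime's `client.IgnoreNotFound` wrapped around the error from `c.kubeClient.Delete`; here `errNotFound`, `errConflict`, `ignoreNotFound`, `deleteFunc`, and `deleteNodeClaim` are hypothetical names used only so the example runs without Kubernetes dependencies.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for a Kubernetes NotFound API error, and errConflict
// for any other failure. Both are assumptions for this sketch.
var (
	errNotFound = errors.New("not found")
	errConflict = errors.New("conflict")
)

// ignoreNotFound imitates the behavior of client.IgnoreNotFound: it returns
// nil for a NotFound error and passes every other error through unchanged.
func ignoreNotFound(err error) error {
	if errors.Is(err, errNotFound) {
		return nil
	}
	return err
}

// deleteFunc is a stand-in for the API client's Delete call.
type deleteFunc func(name string) error

// deleteNodeClaim treats "already gone" as success, since the desired end
// state (the object no longer exists) has been reached either way.
func deleteNodeClaim(del deleteFunc, name string) error {
	return ignoreNotFound(del(name))
}

func main() {
	alreadyGone := func(string) error { return errNotFound }
	failing := func(string) error { return errConflict }

	fmt.Println(deleteNodeClaim(alreadyGone, "nodeclaim-a") == nil)
	fmt.Println(deleteNodeClaim(failing, "nodeclaim-b") == nil)
}
```

Swallowing NotFound here avoids a spurious reconcile error when something else deletes the object between the Get and the Delete.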

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
5 participants