Proactively bounce capi-controller-manager in case of netsplits #7445

Closed

jayunit100 opened this issue Oct 24, 2022 · 17 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@jayunit100
Contributor

jayunit100 commented Oct 24, 2022

(edit: removed "CCM", because it's an ambiguous abbreviation)

User Story

As a user on the edge, if one node in my management cluster (MC) somehow gets netsplit off from my control plane and worker nodes, I'd like the capi-controller-manager to fail, so that it bounces to a "potentially" healthy, connected node elsewhere in my MC.

Detailed Description

As an example of how I was stepping through this earlier, the steps I followed in kubernetes-sigs/cluster-api-provider-vsphere#1660 worked...

Anything else you would like to add:

I saw this in a very odd environment, admittedly, but I think it's still a good idea to broaden the definition of health for the capi-controller-manager, if possible. We wouldn't want a single capi-controller-manager that is having issues to slow down the remediation of a fleet of clusters running on different networks.

/kind feature

The diagram below shows the issue I ran into in this netsplit situation.

[diagram of the netsplit scenario]

@k8s-ci-robot added the kind/feature and needs-triage labels Oct 24, 2022
@chrischdi
Member

I think this is a general problem for all kinds of controllers and is not specific to CAPI's controllers. It should be (or maybe already is) solved in Kubernetes itself.

Did you check whether the pod gets evicted after five minutes? That would map to the kube-controller-manager's pod-eviction-timeout parameter (which defaults to 5m0s, xref).

In addition, there are two other potentially interesting configuration parameters for kube-apiserver, which I got from this issue: the DefaultTolerationSeconds admission plugin, default-not-ready-toleration-seconds, and default-unreachable-toleration-seconds (xref).
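For reference, a minimal sketch of where those knobs live on a kubeadm-based management cluster (assuming kubeadm is in use; the values below are illustrative, not recommendations):

```yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
apiServer:
  extraArgs:
    # Defaults applied to Pods by the DefaultTolerationSeconds admission plugin
    # (how long a Pod tolerates a not-ready/unreachable Node before eviction).
    default-not-ready-toleration-seconds: "60"
    default-unreachable-toleration-seconds: "60"
controllerManager:
  extraArgs:
    # How long the kube-controller-manager waits before evicting Pods from a
    # Node that stopped reporting; defaults to 5m0s.
    pod-eviction-timeout: "1m0s"
```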

A workaround would be to run the CAPI controller manager (and the other controllers, too) with multiple replicas and anti-affinity. A failed leader election should then lead to failover to another replica; a rough sketch follows below.
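A sketch of that workaround, assuming a default CAPI installation in the capi-system namespace (the label, image tag, and replica count here are assumptions, not the project's shipped defaults):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: capi-controller-manager
  namespace: capi-system
spec:
  replicas: 3                      # several replicas instead of the usual single one
  selector:
    matchLabels:
      cluster.x-k8s.io/provider: cluster-api
  template:
    metadata:
      labels:
        cluster.x-k8s.io/provider: cluster-api
    spec:
      affinity:
        podAntiAffinity:
          # Force replicas onto different nodes so a netsplit on one node
          # does not take out every candidate for the leader election.
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  cluster.x-k8s.io/provider: cluster-api
              topologyKey: kubernetes.io/hostname
      containers:
        - name: manager
          image: registry.k8s.io/cluster-api/cluster-api-controller:v1.3.0  # example tag
          args:
            - --leader-elect   # standby replicas take over if the leader loses its lease
```

Note that this only helps if the replica that wins the next leader election actually sits on a node with working connectivity to the affected workload cluster.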

@sbueringer
Member

sbueringer commented Oct 24, 2022

I'd like the CCM to fail, that way it bounces to a "potentially" healthy, connected other node in my MC.

Not sure if I understood the issue correctly, but changes to the CCM should be out of scope for the Cluster API project?

I thought the issue was that the CCM in the workload cluster doesn't set the providerID, not the CAPI controller in the mgmt cluster?

@sbueringer
Member

To be clear:

  • CCM = cloud controller manager
  • not capi controller manager

Wasn't your problem that the cloud controller manager was not setting the providerIDs on nodes?

@jayunit100
Contributor Author

  • I think the cloud controller manager issue may also have been a problem, but I saw that
  • during a netsplit situation, capi-controller-manager flips out if it can't connect to the control plane of a WL cluster, but it doesn't actually shut down or bounce to a new node.

@killianmuldoon
Contributor

killianmuldoon commented Oct 24, 2022

during a netsplit situation, capi-controller-manager flips out if it can't connect to the control plane of a WL cluster, but it doesn't actually shut down or bounce to a new node

Is the CAPI controller manager running with multiple replicas? If the controller is cut off from the network (split from the leader API server on the mgmt cluster), it should lose the lease and another replica should take it over.

If the management cluster is cut off from the workload cluster network entirely, it will just keep retrying to contact the workload cluster.

@jayunit100
Contributor Author

In this case, the capi-controller-manager's node

  • is able to connect to its own apiserver (so it doesn't lose the lease),
  • but not to the apiservers of other WL clusters (so it keeps failing at any MHC-type steps).

@chrischdi
Member

chrischdi commented Oct 24, 2022

So it looks something like this, and the red or orange connection here is broken in some way?

[diagram of the suspected broken connections]

@jayunit100
Contributor Author

Yup!

@jayunit100
Contributor Author

[diagram]

similar pic to yours

@killianmuldoon
Contributor

There are a few questions that come to mind:

  1. Should CAPI be considered to be in a non-healthy state if it can't contact a single workload cluster, or a majority, or all of them? How long should this be the case before taking action?
  2. Is the management cluster HA? Is the assumption that a CAPI controller on another node would be able to communicate with the workload cluster? Would it be able to communicate with the API server on the initial CAPI controller node?
  3. A compromise / best-effort mitigation could be for the CAPI controller to lose the lease if it can't contact all workload clusters (or some subset of them?). Then it's likely that another controller would get the lease and possibly be able to contact those clusters.

Could this be approached by watching a metric and alerting that the CP is not able to contact workloads? This is discussed here.

That would give someone with insight into the network setup a chance to remediate manually, or to set up some deployment-specific automation.
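As a concrete starting point for the alerting route, here is a rough sketch assuming the Prometheus Operator is installed and scraping the CAPI metrics endpoint; the controller label value and the threshold are assumptions, not recommendations:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: capi-reconcile-errors
  namespace: capi-system
spec:
  groups:
    - name: cluster-api
      rules:
        - alert: CAPIReconcileErrorsHigh
          # controller_runtime_reconcile_errors_total is exposed by
          # controller-runtime based managers such as capi-controller-manager;
          # the "machinehealthcheck" controller label value is an assumption.
          expr: rate(controller_runtime_reconcile_errors_total{controller="machinehealthcheck"}[10m]) > 0.1
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "CAPI is persistently failing to reconcile; possibly unable to reach a workload cluster"
```

An alert like this would page whoever understands the network topology, and they could then cordon the affected node or restart the controller elsewhere.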

@sbueringer
Member

sbueringer commented Nov 29, 2022

I think the answer is monitoring, to be honest. Monitor Cluster API and then produce corresponding alerts, with either manual playbooks or some automatic mitigation (although automation is super hard with things like netsplits).

Some thoughts:

the capi-controller-manager to fail, so that it bounces to a "potentially" healthy, connected node elsewhere in my MC.

If we include "can the CAPI controller contact workload clusters" in our readiness/liveness probes (or if the controller just shuts down), we get the following consequences (a rough probe sketch follows below):

  • we will get container restarts and CrashLoopBackOffs
  • this doesn't trigger a Pod eviction, and the Pod is not rescheduled
  • even if the CAPI controller Pod deletes itself, it would very likely get rescheduled on the same Node
    • afaik the kube-scheduler by default prefers Nodes which already have the container image locally

I don't think we could maintain stable operations with behavior like that.

Let's assume we have a bunch of workload clusters and some of them are offline (either because they are simply broken / misconfigured, or because of some "temporarily not reachable" edge scenario). The Cluster API controller would just be restarted permanently. Imagine what happens if you have a few hundred clusters...

The result is the same if you have some workload clusters that are reachable from some mgmt cluster nodes and others only from other nodes.
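To make that concrete, a connectivity-aware probe would look roughly like the sketch below. The probe path is hypothetical (today the manager only exposes the generic controller-runtime /healthz and /readyz endpoints), and repeated failures only restart the container in place:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: capi-controller-manager
spec:
  containers:
    - name: manager
      image: registry.k8s.io/cluster-api/cluster-api-controller:v1.3.0  # example tag
      livenessProbe:
        httpGet:
          path: /healthz-workload-connectivity   # hypothetical endpoint, does not exist today
          port: 9440                              # health probe port used in the default CAPI manifests
        periodSeconds: 30
        failureThreshold: 10
      # On failure the kubelet restarts this container in place (eventually
      # CrashLoopBackOff); the Pod object never moves to another Node.
```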

@jayunit100
Contributor Author

jayunit100 commented Nov 29, 2022

(deleting my comment as I meant to put it in the other issue) but I still think we should proactively time-bomb this container :)

@fabriziopandini
Member

/triage accepted

I'm trying to make up my mind on two sides of the problem:

  • how we can distinguish a netsplit (a problem local to the machine the controller runs on) from other network problems that affect connectivity to all or a subset of workload clusters
  • how the bounce should happen: what triggers the bounce, and how do we ensure the newly scheduled controller doesn't end up on the same machine?

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label Nov 30, 2022
@jayunit100
Contributor Author

jayunit100 commented Dec 3, 2022

Maybe it's just a naive time-bomb expiration? Like, once a week or so, the pod self-terminates gracefully.
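A rough sketch of that idea, reworked as a weekly rolling restart driven from outside the Pod rather than the Pod killing itself (the ServiceAccount and image are hypothetical, and the RBAC that lets it patch the Deployment is omitted):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: capi-weekly-bounce
  namespace: capi-system
spec:
  schedule: "0 3 * * 0"   # every Sunday at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: capi-bouncer     # hypothetical; needs RBAC to patch the Deployment
          restartPolicy: OnFailure
          containers:
            - name: kubectl
              image: bitnami/kubectl:latest    # example image
              command:
                - kubectl
                - rollout
                - restart
                - deployment/capi-controller-manager
                - --namespace=capi-system
```

Note that a restart alone doesn't guarantee the new Pod lands on a different node; that still needs anti-affinity, a cordon, or similar, which ties back to the scheduling question above.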

@k8s-triage-robot

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot added the needs-triage label and removed the triage/accepted label Jan 19, 2024
@fabriziopandini
Member

/close

Unfortunately we did not reach an agreement on a way forward, and the issue has not been active in the last year.

@k8s-ci-robot
Contributor

@fabriziopandini: Closing this issue.

In response to this:

/close

Unfortunately we did not reach an agreement on a way forward, and the issue has not been active in the last year.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
