
Unable to use kube-vip-cloud-provider in CAPV clusters #3164

Open
ybizeul opened this issue Aug 21, 2024 · 6 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments


ybizeul commented Aug 21, 2024

/kind bug

What steps did you take and what happened:

  • Provision a new cluster with a control plane node and a worker node
  • Deploy kube-vip-cloud-provider
  • Configure IP ranges for the provider (see the ConfigMap sketch below)
  • Create a new LoadBalancer Service
  • The provider annotates the Service correctly with an IP from the pool
  • kube-vip fails with:
time="2024-08-21T18:55:27Z" level=error msg="[endpoint] unable to find shortname from my-cluster-lbd6d"
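For context, kube-vip-cloud-provider takes its address pools from a ConfigMap named kubevip in the kube-system namespace. A minimal sketch of the configuration used here, with an illustrative address range (the real range in my cluster differs):

apiVersion: v1
kind: ConfigMap
metadata:
  name: kubevip
  namespace: kube-system
data:
  # pool used for LoadBalancer Services in any namespace;
  # per-namespace keys such as range-default also work
  range-global: 192.168.1.220-192.168.1.230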

What did you expect to happen:

I expected kube-vip to assign the Service's external IP to the address given by kube-vip-cloud-provider.
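For reference, after the provider processed the Service it looked roughly like the sketch below (the Service name and address are illustrative). If I understand the provider's behavior correctly, it sets the kube-vip.io/loadbalancerIPs annotation, which kube-vip is then expected to pick up and advertise:

apiVersion: v1
kind: Service
metadata:
  name: my-service                          # hypothetical name
  annotations:
    # added by kube-vip-cloud-provider from the configured pool
    kube-vip.io/loadbalancerIPs: 192.168.1.220
spec:
  type: LoadBalancer
  ports:
  - port: 80
    targetPort: 8080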

Anything else you would like to add:

This seems to be related to kube-vip/kube-vip#723, which prevents provisioning of the IP on the Service when node names aren't FQDNs (which is apparently what CAPV produces; I haven't found a way to change that yet, and adding a search domain leads to the same result).

I changed the kube-vip Pod image to v0.8.2 in cluster.yaml as follows:

apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
[...]
spec:
  kubeadmConfigSpec:
[...]
    files:
    - content: |
[...]
            image: ghcr.io/kube-vip/kube-vip:v0.8.2
[...]
      path: /etc/kubernetes/manifests/kube-vip.yaml

Environment:

  • Cluster-api-provider-vsphere version:
v1.11
  • Kubernetes version: (use kubectl version):
Client Version: v1.31.0
Kustomize Version: v5.4.2
Server Version: v1.31.0
  • OS (e.g. from /etc/os-release):
NAME="Flatcar Container Linux by Kinvolk"
ID=flatcar
ID_LIKE=coreos
VERSION=3975.2.0
VERSION_ID=3975.2.0
BUILD_ID=2024-08-05-2103
SYSEXT_LEVEL=1.0
PRETTY_NAME="Flatcar Container Linux by Kinvolk 3975.2.0 (Oklo)"
ANSI_COLOR="38;5;75"
HOME_URL="https://flatcar.org/"
BUG_REPORT_URL="https://issues.flatcar.org"
FLATCAR_BOARD="amd64-usr"
CPE_NAME="cpe:2.3:o:flatcar-linux:flatcar_linux:3975.2.0:*:*:*:*:*:*:*"
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Aug 21, 2024
lubronzhan (Contributor) commented

This is more of a limitation in kube-vip, and as we discussed in Slack, it is fixed in kube-vip 0.8.

But even if kube-vip is bumped to 0.8, since the node name doesn't match the FQDN, CPI will fail to find the corresponding VM for the node.

I think we should revert this PR fcd243d

It was aiming to address the node name length issue, but if a customer has a search domain defined for the hostname and the length is within 64 characters, CPI will fail to initialize the node.

If you all think that's reasonable, I can create an issue to revert it.

ybizeul (Author) commented Aug 22, 2024

Thank you @lubronzhan.

I will also add my remarks from this morning.

We have that issue with node names, related to v0.6.4, which isn't a problem for the initial deployment and control plane HA.
But as it stands, the kube-vip static pods are only deployed on control plane nodes anyway, and those, as far as I know, don't have LB Service resources.

Maybe CAPV's intended way to implement kube-vip for workloads is through a new DaemonSet on the workload cluster; the problem with that is that the control plane kube-vip has svc_enable set to true, so I'm afraid both would try to handle the new resources.

lubronzhan (Contributor) commented

This is more of a limitation in kube-vip, and as we discussed in Slack, it is fixed in kube-vip 0.8.

But even if kube-vip is bumped to 0.8, since the node name doesn't match the FQDN, CPI will fail to find the corresponding VM for the node.

I think we should revert this PR fcd243d

It was aiming to address the node name length issue, but if a customer has a search domain defined for the hostname and the length is within 64 characters, CPI will fail to initialize the node.

If you all think that's reasonable, I can create an issue to revert it.

Ah, I should correct my comment above about the real issue. It's actually that /etc/hosts is missing the short-hostname IP entry, so kube-vip can't resolve it, and that should be fixed in kube-vip 0.8. This is unrelated to reverting fcd243d

Maybe CAPV's intended way to implement kube-vip for workloads is through a new DaemonSet on the workload cluster; the problem with that is that the control plane kube-vip has svc_enable set to true, so I'm afraid both would try to handle the new resources.

The current way supports externalTrafficPolicy: Cluster, which most people use; maybe that's why it's not removed. If you want to deploy a different set of kube-vip, you can modify your KubeadmControlPlane to remove svc_enable: true from the files section that contains the kube-vip manifest (see the sketch below).
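As a rough sketch (assuming the stock CAPV template, with unrelated fields elided), the entry to remove or flip lives in the kube-vip static pod manifest embedded in the KubeadmControlPlane files section:

apiVersion: v1
kind: Pod
metadata:
  name: kube-vip
  namespace: kube-system
spec:
  containers:
  - name: kube-vip
    image: ghcr.io/kube-vip/kube-vip:v0.8.2
    env:
    # controls whether this kube-vip instance watches LoadBalancer Services
    - name: svc_enable
      value: "false"    # the default template sets "true"
[...]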

ybizeul (Author) commented Aug 22, 2024

But there is still the problem that 0.6.4 wouldn't work even with the Cluster policy, because of the short-name bug, right?

lubronzhan (Contributor) commented

But there is still the problem that 0.6.4 wouldn't work even with the Cluster policy, because of the short-name bug, right?

Yes, so you need to upgrade to the new 0.8.

This PR fcd243d#diff-e76b4b2137138f55f29ce20dd0ab8287648f3d8eb23a841a2e5c51ff88949750R120 also adds the local_hostname to /etc/hosts. Maybe that's also one of the reasons only the short hostname is in your /etc/hosts. Could you check that?
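A quick way to check, assuming shell access to a control plane node (the commands below are just one way to inspect it):

# print the node's short hostname, then any /etc/hosts entries mentioning it
hostname -s
grep "$(hostname -s)" /etc/hosts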

chrischdi (Member) commented Aug 23, 2024

Maybe also relevant: the kube-vip static pod currently runs with its own /etc/hosts:

https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/blob/main/templates/cluster-template.yaml#L179-L182
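For context, the referenced template lines (paraphrased from memory; see the link above for the authoritative version) write a static hosts file that the kube-vip static pod then mounts as its /etc/hosts:

files:
- content: 127.0.0.1 localhost kubernetes
  owner: root:root
  path: /etc/kube-vip.hosts
  permissions: "0644"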

This is to work around the issue at:

An improved way may be a preKubeadmCommand which copies the original /etc/hosts and just adds kubernetes to it.

Note: this workaround is only required when kube-vip runs as a static pod.

So instead of adding the static file /etc/kube-vip.hosts, we could maybe use a preKubeadmCommand like the following if this helps:

sed -E -e 's/^(127.0.0.1.*)/\1 kubernetes/g' /etc/hosts > /etc/kube-vip.hosts
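Wired into the cluster template, that could look roughly like this (a sketch only; it assumes the static /etc/kube-vip.hosts file entry above is dropped at the same time):

apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
spec:
  kubeadmConfigSpec:
    preKubeadmCommands:
    # copy the node's real /etc/hosts, appending the kubernetes alias to the
    # loopback entry, into the file the kube-vip static pod mounts as /etc/hosts
    - sed -E -e 's/^(127.0.0.1.*)/\1 kubernetes/g' /etc/hosts > /etc/kube-vip.hosts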
