add MachineHealthCheck #146

Draft · 3deep5me wants to merge 5 commits into main
Conversation

@3deep5me (Contributor) commented Nov 14, 2023

Hi @sp-yduck,

this adds a MachineHealthCheck for all Machines of a new cluster. It can help when a node does not reach the Running state, e.g. because of #145, network problems during startup, or other reasons.
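For reference, the shape of such a MachineHealthCheck (a minimal sketch; the apiVersion and the unhealthyConditions values follow the Cluster API docs and are illustrative rather than copied verbatim from this PR):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: ${CLUSTER_NAME}-mhc
spec:
  clusterName: ${CLUSTER_NAME}
  # Allow remediation even if every Machine is unhealthy at once.
  maxUnhealthy: 100%
  # Match every Machine that belongs to this cluster.
  selector:
    matchLabels:
      cluster.x-k8s.io/cluster-name: ${CLUSTER_NAME}
  # Remediate nodes whose Ready condition is Unknown/False for 5 minutes.
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 300s
    - type: Ready
      status: "False"
      timeout: 300s
```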

@sp-yduck (Collaborator)

Thank you for the PR!
As you may know, this template file is used for the Quick Start, and I want to keep the Quick Start minimal. So if you want to include this, the option is (see the note on flavors below):

  1. create a new template for it, so that users can choose a specific template to try specific features
    ref: https://cluster-api.sigs.k8s.io/clusterctl/commands/generate-cluster.html?highlight=flavor#flavors
    ref: https://cluster-api.sigs.k8s.io/clusterctl/commands/generate-cluster.html?highlight=flavor#alternative-source-for-cluster-templates
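For context on how a flavor surfaces to users: `clusterctl` resolves `--flavor <name>` to a release asset named `cluster-template-<name>.yaml` (the default template is `cluster-template.yaml`). So, hypothetically, publishing a `cluster-template-mhc.yaml` asset would let users run `clusterctl generate cluster my-cluster --flavor mhc` to opt into the MachineHealthCheck; the flavor name `mhc` here is purely illustrative.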

@3deep5me (Contributor, Author)

I also changed the location of the quick start; I hope this is fine for you.
I don't know how clusterctl knows where to find the cluster templates.
Do I have to change something else to make this work?

@sp-yduck (Collaborator)

clusterctl checks the assets of the release, so the file changes are OK. The thing is, I am using make release to output these assets for each release:

  1. make release-templates
    https://github.com/sp-yduck/cluster-api-provider-proxmox/blob/bbcdd56993d21f9e3581eed558806fc909b71cec/Makefile#L219-L220

  2. make generate-e2e-templates
    https://github.com/sp-yduck/cluster-api-provider-proxmox/blob/bbcdd56993d21f9e3581eed558806fc909b71cec/Makefile#L111-L113

I think you can use kustomize build template/base or something similar for both of them.
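A sketch of what that could look like, assuming the new manifest lives next to a kustomization in templates/flavors/advanced and the quick-start base is at templates/base (paths beyond what this PR shows are assumptions):

```yaml
# templates/flavors/advanced/kustomization.yaml (hypothetical)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base               # existing quick-start manifests
  - machinehealthcheck.yaml  # the MachineHealthCheck added in this PR
```

Then something like `kustomize build templates/flavors/advanced > cluster-template-advanced.yaml` could be wired into both make targets above.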

templates/flavors/advanced/machinehealthcheck.yaml (review thread; earlier comments outdated and resolved)

```yaml
  maxUnhealthy: 100%
  selector:
    matchLabels:
      cluster.x-k8s.io/cluster-name: '${CLUSTER_NAME}'
```
@sp-yduck (Collaborator)

Is it a good idea to have a single MHC for all the nodes in a cluster?
Maybe it's better to have separate MHCs for the control plane and the others. What do you think?

@3deep5me (Contributor, Author)

I think it's more common to have two MHCs, but in this simple setup I see no reason for that.
But I could be wrong; do you have an idea how we could utilize two MHCs?

@3deep5me (Contributor, Author)

To be precise, I don't know how the two MHCs should differ.

@sp-yduck (Collaborator)

https://cluster-api.sigs.k8s.io/tasks/automated-machine-management/healthchecking.html#creating-a-machinehealthcheck
There is an example MHC for both workers (node-unhealthy-5m) and the control plane (kcp-unhealthy-5m); a sketch of how that split could look here follows.
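Adapted from the linked docs, the two MHCs could look roughly like this (a sketch; the worker selector label `nodepool: nodepool-0` comes from the docs example, while `cluster.x-k8s.io/control-plane` is the label Cluster API sets on control-plane Machines):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: ${CLUSTER_NAME}-node-unhealthy-5m
spec:
  clusterName: ${CLUSTER_NAME}
  # Stop remediating workers if too many are unhealthy at once.
  maxUnhealthy: 40%
  selector:
    matchLabels:
      nodepool: nodepool-0
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 300s
    - type: Ready
      status: "False"
      timeout: 300s
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: ${CLUSTER_NAME}-kcp-unhealthy-5m
spec:
  clusterName: ${CLUSTER_NAME}
  maxUnhealthy: 100%
  selector:
    matchLabels:
      # Set automatically on KubeadmControlPlane-owned Machines.
      cluster.x-k8s.io/control-plane: ""
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 300s
    - type: Ready
      status: "False"
      timeout: 300s
```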

@3deep5me (Contributor, Author)

Mhhhh... okay. But the examples do not differ. We can definitely do two MHCs, though, if you want to be more Cluster API conformant.

@sp-yduck (Collaborator) commented Nov 20, 2023

By the way, I believe an MHC does not help for:

> E.g. #145 or network problems during startup or other reasons.

An MHC checks the Machine and Node objects to confirm whether the Node (in the workload cluster) is ready. So in a case like issue #145, if the VM becomes unhealthy before it joins the Kubernetes cluster, the MHC cannot find a Node associated with that unhealthy VM and cannot remediate it.

@3deep5me (Contributor, Author) commented Nov 23, 2023

I'm not sure about the detailed mechanics.
But I can confirm that if a VM does not boot, it is deleted and recreated.
(But at the moment no VM boots, because I get the error every time 😢)

I think the MHC by default also checks the status.conditions[].type Ready field. At least, that is the only way I can explain the behavior.
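For what it's worth, one mechanism that would explain remediation of VMs that never join is the MHC's nodeStartupTimeout (a MachineHealthCheckSpec field, 10 minutes by default in Cluster API): a Machine with no associated Node after that window is considered failed and remediated, no Node object required. A sketch, with an illustrative value:

```yaml
spec:
  # Machines whose Node has not appeared after this long are remediated,
  # which would cover VMs that never boot or never join the cluster.
  nodeStartupTimeout: 10m
```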
