Add taint to user nodes #2605
base: main
Conversation
@@ -41,10 +41,33 @@ class ExistingInputVars(schema.Base):
    kube_context: str


class DigitalOceanNodeGroup(schema.Base):
Duplicate class
This method works as intended when tested on GCP. However, one issue is that certain daemonsets won't run on the tainted nodes. I saw the issue with the rook-ceph csi-cephfsplugin daemonset in my rook PR, but I expect it would also be an issue for the monitoring daemonset pods. So we'd likely need to add the appropriate toleration to those daemonsets.
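For the monitoring daemonset, the fix would likely be a toleration in its Helm values along these lines (a sketch only; the `dedicated`/`user` taint key and value are placeholders, not the actual taint this PR applies):

```hcl
# Hypothetical Helm values fragment for the monitoring daemonset;
# "dedicated" / "user" are placeholder taint key/value, not the
# taint actually used by this PR.
tolerations = [
  {
    key      = "dedicated"
    operator = "Equal"
    value    = "user"
    effect   = "NoSchedule"
  }
]
```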
@@ -45,6 +45,13 @@ resource "helm_release" "rook-ceph" {
  },
  csi = {
    enableRbdDriver     = false, # necessary to provision block storage, but saves some cpu and memory if not needed
    provisionerReplicas = 1,     # default is 2 on different nodes
    pluginTolerations = [
Runs the csi driver on all nodes, even those with NoSchedule taints; it doesn't run on nodes with NoExecute taints. This is what the nebari-prometheus-node-exporter daemonset does, so I copied it here.
        effect   = "NoSchedule"
      },
      {
        operator = "Exists"
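Expanded, the wildcard entry being described here is just an operator/effect pair with no `key` attribute, which matches every taint key (a sketch of the same shape the node-exporter chart uses):

```hcl
# Omitting "key" with operator "Exists" tolerates any NoSchedule
# taint regardless of key; NoExecute taints are still not tolerated.
pluginTolerations = [
  {
    operator = "Exists"
    effect   = "NoSchedule"
  }
]
```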
Runs promtail on all nodes, even those with NoSchedule taints; it doesn't run on nodes with NoExecute taints. This is what the nebari-prometheus-node-exporter daemonset does, so I copied it here. Promtail is what exports logs from the node, so we still want it to run on the user and worker nodes.
      {
        key      = "node-role.kubernetes.io/master"
        operator = "Exists"
        effect   = "NoSchedule"
      },
      {
        key      = "node-role.kubernetes.io/control-plane"
        operator = "Exists"
        effect   = "NoSchedule"
      },
These top two entries are the default values for this Helm chart.
Okay, so things are working for the user node group. I tried adding a taint to the worker node group, but the dask scheduler won't run on the tainted worker node group. See this commit for what I tried in a quick test. I do see the new scheduler_pod_extra_config value in
so I think possibly the merge isn't going as expected, but I need to verify. The docs say that "This dict will be deep merged with the scheduler pod spec (a V1PodSpec object) before submission. Keys should match those in the kubernetes spec, and should be camelCase."
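Given that quoted deep-merge behavior, the dict would need camelCase V1PodSpec keys, e.g. a tolerations list like the sketch below (the taint key/value are placeholders, and exactly where this dict gets set in Nebari's dask-gateway config is an assumption):

```python
# Sketch of a scheduler_pod_extra_config value using camelCase keys
# matching the V1PodSpec "tolerations" field, as the docs describe.
# "dedicated" / "worker" are placeholder taint key/value.
scheduler_pod_extra_config = {
    "tolerations": [
        {
            "key": "dedicated",
            "operator": "Equal",
            "value": "worker",
            "effect": "NoSchedule",
        }
    ]
}
```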
I managed to get the taints applied to the scheduler pod in this commit. I would have expected the
I still need to apply the toleration to the dask workers.
Reference Issues or PRs
Fixes #2507
What does this implement/fix?

WIP

Put an x in the boxes that apply

Testing
Any other comments?