Set dqlite failure-domain based on the k8s node AZ label #989
This commit listens for node label changes and configures the dqlite failure domain based on the availability zone node label. Dqlite adjusts the cluster node promotion order based on the failure domain, attempting to spread the quorum members over different failure domains.
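As a rough illustration of the mapping described above, the sketch below derives a numeric failure domain from a node's labels. It assumes the failure domain is a 64-bit integer (as in go-dqlite's `app.WithFailureDomain` option) and that the availability zone label is the well-known `topology.kubernetes.io/zone`. The helper name `failureDomainFromLabels` and the choice of an FNV-1a hash are hypothetical and not taken from this PR.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// zoneLabel is the well-known Kubernetes topology label. Assumption: this is
// the "availability zone node label" the commit message refers to.
const zoneLabel = "topology.kubernetes.io/zone"

// failureDomainFromLabels is a hypothetical helper, not the PR's code.
// It returns 0 when the zone label is missing or empty (treated here as
// "no failure domain"), and otherwise the FNV-1a 64-bit hash of the zone name,
// so that nodes in the same zone always get the same value.
func failureDomainFromLabels(labels map[string]string) uint64 {
	zone := labels[zoneLabel] // reading from a nil map is safe in Go
	if zone == "" {
		return 0
	}
	h := fnv.New64a()
	h.Write([]byte(zone)) // hash.Hash.Write never returns an error
	return h.Sum64()
}

func main() {
	a := failureDomainFromLabels(map[string]string{zoneLabel: "us-east-1a"})
	b := failureDomainFromLabels(map[string]string{zoneLabel: "us-east-1b"})
	unset := failureDomainFromLabels(nil)
	fmt.Println(a, b, unset)
}
```

A hash keeps the mapping stateless: every node can compute its own value from its labels without coordinating with the rest of the cluster.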
I can add some tests after the current implementation is approved.
Great work @petrutlucian94 - did a first round, mainly focused on overall structure.
* move some of the logic to the node label controller
* if just one of the db configs was updated, avoid restarting the other one
* allow clearing the failure domain

* add node label controller tests
* move some of the dqlite utils to separate functions that can be reused by the tests
@bschimke95 thanks for reviewing this PR. I've addressed the comments and submitted some unit tests.
We added the getNewK8sClientWithRetries helper to avoid code duplication among k8sd controllers but missed the fact that some controllers need a node (kubelet) client while others need an admin client. We'll add a boolean to request the expected client.
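The boolean switch described above could look roughly like the sketch below. Everything here is illustrative: the `client` type, the `clientFactory` type, the `getClientWithRetries` name and its parameters are hypothetical stand-ins, not the actual signature of `getNewK8sClientWithRetries` in k8sd.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// client is a stand-in for the real Kubernetes client type.
type client struct{ identity string }

// clientFactory is a hypothetical constructor for one kind of client.
type clientFactory func() (*client, error)

// errNotReady simulates a transient failure while the API server is starting.
var errNotReady = errors.New("kubernetes client not ready")

// getClientWithRetries picks the admin or node (kubelet) factory based on the
// admin flag, then retries it up to `attempts` times with a fixed delay.
func getClientWithRetries(admin bool, newAdmin, newNode clientFactory, attempts int, delay time.Duration) (*client, error) {
	factory := newNode
	if admin {
		factory = newAdmin
	}
	if attempts < 1 {
		attempts = 1
	}
	var lastErr error
	for i := 0; i < attempts; i++ {
		c, err := factory()
		if err == nil {
			return c, nil
		}
		lastErr = err
		if i < attempts-1 {
			time.Sleep(delay) // wait before the next attempt only
		}
	}
	return nil, fmt.Errorf("failed to create client after %d attempts: %w", attempts, lastErr)
}

func main() {
	calls := 0
	flakyAdmin := func() (*client, error) {
		calls++
		if calls < 3 {
			return nil, errNotReady
		}
		return &client{identity: "admin"}, nil
	}
	node := func() (*client, error) { return &client{identity: "node"}, nil }

	c, err := getClientWithRetries(true, flakyAdmin, node, 5, time.Millisecond)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(c.identity, calls)
}
```

Keeping a single helper with a flag preserves the shared retry loop while letting each controller state which identity it needs.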
We're adding a test that sets k8s node availability zone labels, ensuring that the failure domain is applied correctly and that the cluster remains functional.
Great work, LGTM. Left a few small nits.
"until" expects a callable, so we need to pass a lambda to ensure that the check is correctly retried if it raises an exception.