Set dqlite failure-domain based on the k8s node AZ label #989

Open
wants to merge 14 commits into
base: main

Conversation

petrutlucian94
Contributor

This commit listens for node label changes and configures the dqlite failure domain based on the availability zone node label.

Dqlite adjusts the cluster node promotion order based on the failure domain, attempting to spread the quorum members over different failure domains.
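For illustration, here is a minimal sketch of how an availability zone label could be turned into a numeric failure domain. It assumes the zone comes from the well-known `topology.kubernetes.io/zone` label and that the failure domain is an unsigned 64-bit number. The `failureDomainFromLabels` helper and the FNV hashing are assumptions made for this example, not necessarily what this PR implements.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// zoneLabel is the well-known Kubernetes topology label for availability zones.
const zoneLabel = "topology.kubernetes.io/zone"

// failureDomainFromLabels is a hypothetical helper. It hashes the AZ label
// value into a uint64 so that nodes in the same zone share a failure domain.
// A missing or empty label maps to 0, which this sketch treats as "unset".
func failureDomainFromLabels(labels map[string]string) uint64 {
	zone := labels[zoneLabel] // reading from a nil map is safe in Go
	if zone == "" {
		return 0
	}
	h := fnv.New64a()
	h.Write([]byte(zone)) // Write on a hash.Hash never returns an error
	return h.Sum64()
}

func main() {
	a := failureDomainFromLabels(map[string]string{zoneLabel: "us-east-1a"})
	b := failureDomainFromLabels(map[string]string{zoneLabel: "us-east-1b"})
	fmt.Println("zones map to different domains:", a != b)
	fmt.Println("no label means unset:", failureDomainFromLabels(nil) == 0)
}
```

Hashing keeps the mapping stateless: every node derives the same number from the same zone name without any coordination.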

@petrutlucian94 petrutlucian94 requested a review from a team as a code owner January 20, 2025 16:52
@petrutlucian94 petrutlucian94 marked this pull request as draft January 20, 2025 16:52
@petrutlucian94 petrutlucian94 changed the title Set dqlite failure-domain based on the k8s node AZ label wip: Set dqlite failure-domain based on the k8s node AZ label Jan 20, 2025
@petrutlucian94 petrutlucian94 changed the title wip: Set dqlite failure-domain based on the k8s node AZ label Set dqlite failure-domain based on the k8s node AZ label Jan 20, 2025
@petrutlucian94 petrutlucian94 marked this pull request as ready for review January 20, 2025 19:04
@petrutlucian94
Contributor Author

I can add some tests after the current implementation is approved.

Contributor

@bschimke95 bschimke95 left a comment

Great work @petrutlucian94 - did a first round, mainly focused on overall structure.

src/k8s/pkg/k8sd/controllers/node_label.go (outdated, resolved)
src/k8s/pkg/k8sd/controllers/node_label.go (outdated, resolved)
src/k8s/pkg/k8sd/controllers/node_label.go (outdated, resolved)
src/k8s/pkg/k8sd/controllers/utils.go (outdated, resolved)
src/k8s/pkg/k8sd/controllers/node_label.go (outdated, resolved)
src/k8s/pkg/snap/util/dqlite.go (resolved)
src/k8s/pkg/snap/util/dqlite.go (outdated, resolved)
* move some of the logic to the node label controller
  * if just one of the db configs was updated, avoid restarting the
    other one
* allow clearing the failure domain
* add node label controller tests
* move some of the dqlite utils to separate functions that can
  be reused by the tests
@petrutlucian94
Contributor Author

@bschimke95 thanks for reviewing this PR, I've addressed the comments and submitted some unit tests.

We added the getNewK8sClientWithRetries helper to avoid code
duplication among k8sd controllers but missed the fact that
some controllers need a node (kubelet) client while others
need an admin client.

We'll add a boolean to request the expected client.
We're adding a test that sets k8s node availability zone labels,
ensuring that the failure domain is applied correctly and that
the cluster remains functional.
Member

@berkayoz berkayoz left a comment

Great work, LGTM. Left small nits

src/k8s/pkg/k8sd/controllers/node_label.go (outdated, resolved)
src/k8s/pkg/snap/util/dqlite.go (outdated, resolved)
"until" expects a callable, so we need to pass a lambda to
ensure that the check is retried correctly when it raises an
exception.
3 participants