
Clickhouse keeper reconfiguration settings: testing and tweaks #6910

Open
andrewjstone opened this issue Oct 21, 2024 · 1 comment
andrewjstone commented Oct 21, 2024

We trigger ClickHouse Keeper reconfigurations by writing new XML configuration files to each keeper and then relying on the current leader to reload the configuration, diff it against the active configuration, and issue any necessary Raft membership changes.

This all happens in the keeper code.
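
For context, here's a minimal sketch of the Raft membership section of the keeper XML that gets rewritten on each push (the server ids, hostnames, and ports here are hypothetical):

```xml
<clickhouse>
    <keeper_server>
        <server_id>1</server_id>
        <raft_configuration>
            <!-- The leader reloads this list and diffs it against the
                 active membership; any difference becomes a Raft
                 add/remove server operation. -->
            <server>
                <id>1</id>
                <hostname>keeper-1.internal</hostname>
                <port>9234</port>
            </server>
            <server>
                <id>2</id>
                <hostname>keeper-2.internal</hostname>
                <port>9234</port>
            </server>
            <server>
                <id>3</id>
                <hostname>keeper-3.internal</hostname>
                <port>9234</port>
            </server>
        </raft_configuration>
    </keeper_server>
</clickhouse>
```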

There is a little-documented parameter in the keeper code called `CoordinationSettings::configuration_change_tries_count` that controls how many times the leader retries a membership change. With a high retry count, a reconfiguration can run for a long time even when it is ultimately going to fail. My guess is that we should instead set this to 1 or 2: even if the reconfiguration fails, it should succeed on the next push. On failure, though, we'd likely need to remove the cached copy of the settings from #6909 to allow the rewrite of the configuration to go through. The problem is that it's next to impossible to figure out when a reconfiguration has failed. In essence, this issue is somewhat in conflict with #6909.
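
As a sketch, assuming we settle on 2 tries and that this parameter is exposed through the usual `coordination_settings` block, the change would look something like:

```xml
<clickhouse>
    <keeper_server>
        <coordination_settings>
            <!-- Fail fast instead of retrying a doomed membership change;
                 the next config push will retry it anyway. -->
            <configuration_change_tries_count>2</configuration_change_tries_count>
        </coordination_settings>
    </keeper_server>
</clickhouse>
```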

Ideally, we'd use Keeper's `reconfig` command, which blocks and tells us immediately whether the reconfiguration succeeded or failed. However, this command is not available in the version of ClickHouse we have deployed. We need to upgrade, which we should do anyway.
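
For reference, after an upgrade the blocking path would look roughly like this via `keeper-client` (hostnames, ids, and port here are made up, and the exact syntax may differ by version):

```sh
# `reconfig` returns once the membership change is applied,
# or reports an error if it fails.
clickhouse keeper-client -h keeper-1.internal -p 9181 \
    -q 'reconfig add "server.4=keeper-4.internal:9234"'
```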

For now though, we need to test this thoroughly and make any necessary config changes.

More details can be found in the following ClickHouse issue: ClickHouse/ClickHouse#69355

Also, while we're in here, we should almost certainly set `quorum_reads = true` for correctness.
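
That would presumably live in the same `coordination_settings` block, something like:

```xml
<coordination_settings>
    <!-- Serve reads through the Raft quorum rather than locally,
         trading read latency for linearizability. -->
    <quorum_reads>true</quorum_reads>
</coordination_settings>
```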

andrewjstone self-assigned this Oct 21, 2024
andrewjstone commented

I tested with clickward that I can trigger a failure to add a keeper node by having the new node not running. Then I started the new node, wrote the same configuration to the other nodes, watched the current keeper leader load it, and watched the reconfiguration succeed.

This was pretty manual testing, as I had to modify some code to make it happen. I'm currently testing expungement on a4x2, but I'm having some problems that I haven't diagnosed yet.
