We trigger clickhouse keeper reconfigurations by writing new XML configuration files to each keeper and then relying on the current leader to reload the configuration, diff it with the active configuration, and issue any necessary raft membership changes. This all happens in the keeper code.
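For concreteness, the piece of the keeper configuration that gets rewritten is the `<raft_configuration>` block under `<keeper_server>`. This is just an illustrative sketch; the server IDs, hostnames, and ports are made up and don't reflect our actual deployment.

```xml
<!-- Illustrative keeper config fragment; IDs, hostnames, and ports are hypothetical. -->
<clickhouse>
    <keeper_server>
        <server_id>1</server_id>
        <tcp_port>9181</tcp_port>
        <raft_configuration>
            <server>
                <id>1</id>
                <hostname>keeper-1.example</hostname>
                <port>9234</port>
            </server>
            <server>
                <id>2</id>
                <hostname>keeper-2.example</hostname>
                <port>9234</port>
            </server>
            <server>
                <id>3</id>
                <hostname>keeper-3.example</hostname>
                <port>9234</port>
            </server>
        </raft_configuration>
    </keeper_server>
</clickhouse>
```

When the leader reloads this file, it compares the `<raft_configuration>` servers against the active raft membership and issues membership changes for the difference.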
There is a little-documented parameter there called `CoordinationSetting::configuration_change_tries_count`, which bounds how many times the leader retries a pending configuration change. This can lead to reconfigurations lasting a long time, even if they will eventually fail. My guess is that we should instead set this to 1 or 2. Even if the configuration change fails, it should succeed on the next push. On failure, though, we'd likely need to remove the cached copy of settings from #6909 to allow the rewrite of the configuration to go through. The problem is that it's next to impossible to figure out when the reconfiguration failed. In essence, this issue is somewhat in conflict with #6909.
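If we do lower it, a minimal sketch of what that might look like in the keeper config is below. This assumes `configuration_change_tries_count` is exposed under `<coordination_settings>` like other coordination settings; I haven't verified that against the version we deploy, and the value of 2 is just the guess from above.

```xml
<!-- Sketch only: assumes configuration_change_tries_count is settable via
     <coordination_settings>; the value 2 is the guess discussed above. -->
<clickhouse>
    <keeper_server>
        <coordination_settings>
            <configuration_change_tries_count>2</configuration_change_tries_count>
        </coordination_settings>
    </keeper_server>
</clickhouse>
```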
Ideally, we'd use the `reconfig` command with keeper, which blocks and tells us immediately whether the reconfiguration succeeded or failed. However, this command is not available in the version of clickhouse we have deployed. We need to upgrade, which we should do anyway.
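For reference, here's roughly what that would look like via `clickhouse-keeper-client` in a newer release. The syntax is from the upstream docs and may differ in whatever version we end up deploying; the server id, hostname, and raft port are hypothetical.

```sh
# Hypothetical use of keeper's reconfig command (newer clickhouse releases only);
# server id, hostname, and ports are made up.
clickhouse-keeper-client -h keeper-1.example -p 9181 -q 'reconfig add "server.4=keeper-4.example:9234"'
clickhouse-keeper-client -h keeper-1.example -p 9181 -q 'reconfig remove "4"'
```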
For now, though, we need to test this thoroughly and make any necessary config changes.
Using clickward, I verified that I can trigger a failure to add a keeper node because it isn't running. I then start the new node, write the same configuration to the other nodes, watch the current keeper leader load it, and watch the reconfiguration succeed.
This is pretty manual testing, as I had to modify some code to make it happen. I'm currently testing expungement on a4x2, but am hitting some problems that I haven't diagnosed yet.
More details on the underlying keeper behavior can be found in the following clickhouse issue: ClickHouse/ClickHouse#69355
Also, while we are in here we should almost certainly configure `quorum_reads=true` for correctness.
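For completeness, a sketch of what that could look like, again assuming `quorum_reads` is set under `<coordination_settings>` in the keeper config; I haven't confirmed the exact spelling or behavior in our deployed version. The trade-off is that reads then go through raft consensus, so they become linearizable at the cost of some latency.

```xml
<!-- Sketch only: assumes quorum_reads is a coordination setting in our keeper version. -->
<clickhouse>
    <keeper_server>
        <coordination_settings>
            <quorum_reads>true</quorum_reads>
        </coordination_settings>
    </keeper_server>
</clickhouse>
```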