From e3e4d33ad99fa470ef385d4cd993892446a2caae Mon Sep 17 00:00:00 2001
From: Gantigmaa Selenge
Date: Thu, 6 Jun 2024 10:02:44 +0100
Subject: [PATCH] Address some of the review comments from Jakub and Paolo

Signed-off-by: Gantigmaa Selenge
---
 06x-new-kafka-roller.md | 117 ++++++++++++++++++++--------------------
 1 file changed, 60 insertions(+), 57 deletions(-)

diff --git a/06x-new-kafka-roller.md b/06x-new-kafka-roller.md
index 115489d6..b5802296 100644
--- a/06x-new-kafka-roller.md
+++ b/06x-new-kafka-roller.md
@@ -4,10 +4,10 @@ The Kafka Roller is an internal Cluster Operator component that's responsible
 for coordinating the rolling restart or reconfiguration of Kafka pods when:
 - non-dynamic reconfigurations needs to be applied
-- update in Kafka CRD is detected
-- a certificate is renewed
-- pods have been manually annotated by the user for controlled restarts
-- pod is stuck and is out of date
+- an update in the Kafka CR is detected
+- a TLS certificate is renewed
+- pods have been [manually annotated](https://strimzi.io/docs/operators/latest/full/deploying#rolling_pods_manually_alternative_to_drain_cleaner) by the user for controlled restarts
+- pod is stuck and has a pending update (e.g. not running with the desired version or configuration)
 - Kafka broker is unresponsive to Kafka Admin connections

 A pod is considered stuck if it is in one of following states:
@@ -20,15 +20,13 @@ A pod is considered stuck if it is in one of following states:
 The existing KafkaRoller suffers from the following shortcomings:
 - Although it is safe and straightforward to restart one broker at a time, this process is slow in large clusters ([related issue](https://github.com/strimzi/strimzi-kafka-operator/issues/8547)).
-- It does not account for partition preferred leadership. As a result, there may be more leadership changes than necessary during a rolling restart, consequently impacting tail latency.
+- It does not account for partition preferred leadership. As a result, there may be more leadership changes than necessary during a rolling restart, consequently impacting clients, which would need to reconnect every time the leadership moves.
 - Hard to reason about when things go wrong. The code is complex to understand and it's not easy to determine why a pod was restarted from logs that tend to be noisy.
 - Potential race condition between Cruise Control rebalance and KafkaRoller that could cause partitions under minimum in sync replica. This issue is described in more detail in the `Future Improvements` section.
 - The current code for KafkaRoller does not easily allow growth and adding new functionality due to its complexity.

 The following non-trivial fixes and changes are missing from the current KafkaRoller's KRaft implementation:
-- Currently KafkaRoller has to connect to brokers successfully in order to get KRaft quorum information and determine whether a controller node can be restarted. This is because it was not possible to directly talk to KRaft controllers at the time before [KIP 919](https://cwiki.apache.org/confluence/display/KAFKA/KIP-919%3A+Allow+AdminClient+to+Talk+Directly+with+the+KRaft+Controller+Quorum+and+add+Controller+Registration) was implemented. The issue is raised [here](https://github.com/strimzi/strimzi-kafka-operator/issues/9692).
-
 - KafkaRoller takes a long time to reconcile mixed nodes if they are all in `Pending` state. This is because a mixed node does not become ready until the quorum is formed and KafkaRoller waits for a pod to become ready before it attempts to restart other nodes.
In order for the quorum to form, at least the majority of controller nodes need to be running at the same time. This is not easy to solve in the current KafkaRoller without introducing some major changes because it processes each node individually and there is no mechanism to restart multiple nodes in parallel. More information can be found [here](https://github.com/strimzi/strimzi-kafka-operator/issues/9426).
 - The quorum health check relies on the `controller.quorum.fetch.timeout.ms` configuration, which is determined by the desired configuration values. However, during certificate reconciliation or manual rolling updates, KafkaRoller doesn't have access to these desired configuration values since they shouldn't prompt any configuration changes. As a result, the quorum health check defaults to using the hard-coded default value of `controller.quorum.fetch.timeout.ms` instead of the correct configuration value during manual rolling updates or when rolling nodes for certificate renewal.
@@ -44,11 +42,14 @@ As you can see above, the current KafkaRoller still needs various changes and po
 ## Proposal

-The objective of this proposal is to introduce a new KafkaRoller with simplified logic having a structured design resembling a finite state machine. KafkaRoller decisions are informed by observations coming from different sources (e.g. Kubernetes API, KafkaAgent, Kafka Admin API). These sources will be abstracted so that KafkaRoller is not dependent on their specifics as long as it's getting the information it needs. The abstractions also enable much better unit testing.
+The objective of this proposal is to introduce a new KafkaRoller with a more structured design resembling a finite state machine. Given the number of new features and changes related to KRaft, it is easier to rewrite it from scratch than to refactor the existing component. With a more structured design, the process of evaluating pods in various states (such as not running, unready, or lacking a connection) and deciding whether to restart them becomes more clearly defined and easier to follow.
+
+KafkaRoller decisions would be informed by observations coming from different sources (e.g. Kubernetes API, KafkaAgent, Kafka Admin API). These sources will be abstracted so that KafkaRoller is not dependent on their specifics as long as it's getting the information it needs. The abstractions also enable much better unit testing.

-Depending on the observed states, the roller will perform specific actions. Those actions should cause a subsequent observation to cause a state transition. This iterative process continues until each node's state aligns with the desired state.
+Nodes would be categorised based on the observed states, and the roller will perform specific actions on the nodes in each category. Those actions should cause a subsequent observation to result in a state transition. This iterative process continues until each node's state aligns with the desired state.
+
+In addition, the new KafkaRoller will introduce an algorithm to restart brokers in parallel when safety conditions are met. These conditions ensure Kafka producer availability and minimize the impact on controllers and overall cluster stability. It will also wait for partition leadership to return to the preferred leaders, to avoid unnecessary leadership changes during the roll.

-It will also introduce an algorithm that can restart brokers in parallel when safety conditions are not violated.
These conditions guarantee Kafka producer availability and cause minimal impact on controllers and overall stability of clusters.

 ### Node State

 When a new reconciliation starts up, a context object is created for each node to store the state and other useful information used by the roller. It will have the following fields:
@@ -57,30 +58,31 @@ When a new reconciliation starts up, a context object is created for each node t
  - currentNodeRole: Currently assigned process roles for this node (e.g. controller, broker).
  - lastKnownState: It contains the last known state of the node based on information collected from the abstracted sources (Kubernetes API, KafkaAgent and Kafka Admin API). The table below describes the possible states.
  - restartReason: It is updated based on the current predicate logic from the `Reconciler`. For example, an update in the Kafka CR is detected.
- - numRestartAttempts: The value is incremented each time the node has been attempted to restart.
- - numReconfigAttempts: The value is incremented each time the node has been attempted to reconfigure.
- - numRetries: The value is incremented each time the node cannot be restarted/reconfigured due to not meeting safety conditions (more on this later).
+ - numRestartAttempts: The value is incremented each time the node has been restarted or attempted to be restarted.
+ - numReconfigAttempts: The value is incremented each time the node has been reconfigured or attempted to be reconfigured.
+ - numRetries: The value is incremented each time the node is evaluated/processed but is not restarted/reconfigured because a safety condition was not met (for example, the availability check failed, the node is still performing log recovery, or the roller timed out waiting for the pod to become ready).
 - lastTransitionTime: System.nanoTime of last observed state transition.

- The following table illustrates the proposed node states and the possible transitions:
+ The following table illustrates the possible states of the `lastKnownState` field and the states it can transition into:

 | State | Description | Possible transitions |
 | :--------------- | :--------------- | :----------- |
- | UNKNOWN | The initial state when creating `Context` for a node. We expect to transition from this state fairly quickly after creating the context for nodes. | `NOT_RUNNING` `NOT_READY` `RECOVERING` `SERVING` |
- | NOT_RUNNING | Node is not running (Kafka process is not running). This is determined via Kubernetes API, more details for it below. | `NOT_READY`, `NOT_RUNNING`. | `RESTARTED` `SERVING` |
- | NOT_READY | Node is running but not ready to serve requests which is determined by Kubernetes readiness probe (broker state < 2 OR == 127 OR controller is not listening on port). | `RESTARTED` `SERVING` |
- | RESTARTED | After successful `kubectl delete pod`. | `NOT_RUNNING` `NOT_READY` `RECOVERING` `SERVING` |
- | RECONFIGURED | After successful Kafka node config update via Admin client. | `NOT_RUNNING` `NOT_READY` `RESTARTED` `SERVING` |
- | RECOVERING | Node has started but is in log recovery (broker state == 2). | `SERVING` |
- | SERVING | Node is in running state and ready to serve requests which is determined by Kubernetes readiness probe (broker state >= 3 AND != 127 OR controller is listening on port). | `RESTARTED` `RECONFIGURED` `LEADING_ALL_PREFERRED` |
- | LEADING_ALL_PREFERRED | Node is in running state and leading all preferred replicas. | None
+ | UNKNOWN | The initial state when creating `Context` for a node. We expect to transition from this state fairly quickly after creating the context for nodes. | `NOT_RUNNING` `NOT_READY` `RECOVERING` `READY` |
+ | NOT_RUNNING | Node is not running (Kafka process is not running). This is determined via the Kubernetes API; more details below. | `NOT_READY` `RESTARTED` `READY` |
+ | NOT_READY | Node is running but not ready to serve requests, which is determined by the Kubernetes readiness probe (broker state < 2 OR == 127 OR controller is not listening on port). | `RESTARTED` `READY` |
+ | RESTARTED | After successful `kubectl delete pod`. | `NOT_RUNNING` `NOT_READY` `RECOVERING` `READY` |
+ | RECONFIGURED | After successful Kafka node config update via Admin client. | `NOT_RUNNING` `NOT_READY` `RESTARTED` `READY` |
+ | RECOVERING | Node has started but is in log recovery (broker state == 2). This is determined via the KafkaAgent. | `READY` |
+ | READY | Node is in running state and ready to serve requests, which is determined by the Kubernetes readiness probe (broker state >= 3 AND != 127 OR controller is listening on port). | `LEADING_ALL_PREFERRED` `NOT_READY` `NOT_RUNNING` |
+ | LEADING_ALL_PREFERRED | Node is leading all the partitions that it is the preferred leader for. A node's state can transition into this only from the `READY` state. | `NOT_READY` `NOT_RUNNING` |

-The definitions of broker states can be found via the following link: [Broker States](https://github.com/apache/kafka/blob/3.7/metadata/src/main/java/org/apache/kafka/metadata/BrokerState.java).
+Context about broker states and restart reasons:
+- To determine if the node is ready or performing log recovery, we use the [Broker States](https://github.com/apache/kafka/blob/3.7/metadata/src/main/java/org/apache/kafka/metadata/BrokerState.java) metric emitted by Kafka. KafkaAgent collects and exposes this metric via a REST endpoint. This is what the current KafkaRoller does already, and the new roller will use it the same way.

-The definitions of the possible restart reasons can be found via the following link: [Restart Reasons](https://github.com/strimzi/strimzi-kafka-operator/blob/0.40.0/cluster-operator/src/main/java/io/strimzi/operator/cluster/model/RestartReason.java)
+- If the Kafka pod is ready, the restart reasons are checked to determine whether it needs to be restarted. The definitions of the possible restart reasons can be found via the following link: [Restart Reasons](https://github.com/strimzi/strimzi-kafka-operator/blob/0.40.0/cluster-operator/src/main/java/io/strimzi/operator/cluster/model/RestartReason.java). This is also what the current KafkaRoller does, and the new roller will use it the same way.

 #### NOT_RUNNING state

-If one of the following is true, then node's state is `NOT_RUNNING:
+If one of the following is true, then the node's state is `NOT_RUNNING`:
 - no pod exists for this node
 - unable to get the `Pod Status` for the pod
 - the pod has `Pending` status with `Unschedulable` reason
@@ -92,15 +94,15 @@ If none of the above is true but the node is not ready, then its state would be

 ### Configurability

-The following can be the configuration options for the new KafkaRoller:
+The following are the configuration options for the new KafkaRoller. If exposed to the user, they can be configured via `STRIMZI_` environment variables.
Otherwise, the operator will set them to the default values (which are similar to what the current roller has):

| Configuration | Default value | Exposed to user | Description |
|:-----------------------|:--------------|:----------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| maxRestartAttempts | 3 | No | The maximum number of times a node can be restarted before failing the reconciliation. This is checked against the node's `numRestartAttempts`. |
-| maxReconfigAttempts | 3 | No | The maximum number of times a node can be reconfigured before restarting it. This is checked against the node's `numReconfigAttempts`. |
-| maxRetries | 10 | No | The maximum number of times a node can be retried after not meeting the safety conditions. This is checked against the node's `numRetries`. |
-| postOperationTimeoutMs | 60 seconds | Yes | The maximum amount of time we will wait for nodes to transition to `SERVING` state after an operation in each retry. This will be based on the operation timeout that is already exposed to the user via environment variable `STRIMZI_OPERATION_TIMEOUT_MS`. |
-| maxRestartParallelism | 1 | Yes | The maximum number of broker nodes that can be restarted in parallel. |
+| maxReconfigAttempts | 3 | No | The maximum number of times a node can be dynamically reconfigured before restarting it. This is checked against the node's `numReconfigAttempts`. |
+| maxRetries | 10 | No | The maximum number of times a node can be retried after not meeting the safety conditions (e.g. the availability check failed). This is checked against the node's `numRetries`. |
+| operationTimeoutMs | 60 seconds | Yes | The maximum amount of time we will wait for nodes to transition to `READY` state after an operation in each retry. This is already exposed to the user via the environment variable `STRIMZI_OPERATION_TIMEOUT_MS`. |
+| maxRestartParallelism | 1 | Yes | The maximum number of broker nodes that can be restarted in parallel. This will be exposed to the user via the new environment variable `STRIMZI_MAX_RESTART_BATCH_SIZE`. |

 ### Algorithm

@@ -119,9 +121,9 @@ Context: {
 ```
 Contexts are recreated in each reconciliation with the initial data above.

-2. Observe and transition each node's state to the corresponding state based on the information collected from the abstracted sources. If it can't retrieve any information from the sources, the reconciliation fails and the next reconciliation would start from step 1.
+2. Observe and transition each node's state to the corresponding state based on the information collected from the abstracted sources, as explained further in the [Node State](#node-state) section. If it can't retrieve any information from the sources, the reconciliation fails and the next reconciliation would start from step 1.

-3. If there are nodes in `NOT_READY` state, wait for them to have `SERVING` within the `postOperationTimeoutMs`.
+3. If there are nodes in `NOT_READY` state, wait for them to have `READY` within the `operationTimeoutMs`. We want to give nodes a chance to get ready before we try to connect to them or consider them for rolling. This is especially important for nodes that were just started. This is consistent with how the current roller handles unready nodes.
   - If the timeout is reached, proceed to the next step and check if any of the nodes need to be restarted.
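For illustration, the per-node context and the states described in the [Node State](#node-state) section could be sketched in Java roughly as follows. This is only a sketch to make the state machine concrete; the type and field names are illustrative assumptions rather than the proposed API.

```java
// Illustrative sketch only -- type and field names are assumptions, not the proposed API.
import java.util.HashSet;
import java.util.Set;

/** Possible values of the context's lastKnownState field (see the state table above). */
enum State {
    UNKNOWN,                 // initial state when the context is created
    NOT_RUNNING,             // Kafka process is not running (determined via the Kubernetes API)
    NOT_READY,               // pod is running but the readiness probe is failing
    RESTARTED,               // the roller has just deleted the pod
    RECONFIGURED,            // a dynamic config update was sent via the Admin API
    RECOVERING,              // node has started but is performing log recovery
    READY,                   // readiness probe passes; node can serve requests
    LEADING_ALL_PREFERRED    // node leads all partitions it is the preferred leader for
}

/** Per-node bookkeeping, recreated at the start of each reconciliation. */
final class Context {
    final String nodeRef;                                  // e.g. "mixed-3/3"
    Set<String> currentNodeRoles = new HashSet<>();        // e.g. {"controller", "broker"}
    State lastKnownState = State.UNKNOWN;
    Set<String> restartReasons = new HashSet<>();          // e.g. "MANUAL_ROLLING_UPDATE"
    int numRestartAttempts = 0;
    int numReconfigAttempts = 0;
    int numRetries = 0;
    long lastTransitionTime = System.nanoTime();

    Context(String nodeRef) {
        this.nodeRef = nodeRef;
    }

    /** Record a state change observed from one of the abstracted sources. */
    void transitionTo(State newState) {
        if (newState != lastKnownState) {
            lastKnownState = newState;
            lastTransitionTime = System.nanoTime();
        }
    }
}
```

Each observation from the abstracted sources would result in a call such as `transitionTo(State.READY)`, and the attempt and retry counters are what the thresholds in the Configurability section are compared against.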
@@ -129,13 +131,13 @@ Contexts are recreated in each reconciliation with the initial data above.
4. Group the nodes into the following categories based on their state and connectivity:
   - `RESTART_FIRST` - Nodes that have `NOT_READY` or `NOT_RUNNING` state in their contexts. The nodes that we cannot connect to via Admin API will also be put into this group with its reason updated with `POD_UNRESPONSIVE`.
   - `WAIT_FOR_LOG_RECOVERY` - Nodes that have `RECOVERING` state.
+   - `MAYBE_RECONFIGURE_OR_RESTART` - Broker nodes (including mixed nodes) that have an empty list of reasons and have not been restarted or reconfigured yet (Context.numReconfigAttempts == 0 && Context.numRestartAttempts == 0). The nodes in this group will then be further refined and put into the `RECONFIGURE` or `RESTART` categories later (an illustrative sketch of this grouping is included further below).
   - `RESTART` - Nodes that have non-empty list of reasons from the predicate function and have not been restarted yet (Context.numRestartAttempts == 0).
-   - `MAYBE_RECONFIGURE` - Broker nodes (including mixed nodes) that have an empty list of reasons and not been reconfigured yet (Context.numReconfigAttempts == 0).
   - `NOP` - Nodes that have at least one restart or reconfiguration attempt (Context.numRestartAttempts > 0 || Context.numReconfigAttempts > 0 ) and have either
-   `LEADING_ALL_PREFERRED` or `SERVING` state.
+   `LEADING_ALL_PREFERRED` or `READY` state. This means no operation is needed for this node.

5. Wait for nodes in `WAIT_FOR_LOG_RECOVERY` group to finish performing log recovery.
-   - Wait for nodes to have `SERVING` within the `postOperationTimeoutMs`.
+   - Wait for nodes to have `READY` within the `operationTimeoutMs`.
   - If the timeout is reached for a node and its `numRetries` is greater than or equal to `maxRetries`, throw `UnrestartableNodesException` with the log recovery progress (number of remaining logs and segments). Otherwise increment node's `numRetries` and repeat from step 2.

6. Restart nodes in `RESTART_FIRST` category:
@@ -146,7 +148,7 @@ Contexts are recreated in each reconciliation with the initial data above.
   - If a node is in `NOT_RUNNING` state, the restart it only if it has `POD_HAS_OLD_REVISION` or `POD_UNRESPONSIVE` reason. This is because, if the node is not running at all, then restarting it likely won't make any difference unless the node is out of date.
     > For example, if a pod is in pending state due to misconfigured affinity rule, there is no point restarting this pod again or restarting other pods, because that would leave them in pending state as well. If the user then fixed the misconfigured affinity rule, then we should detect that the pod has an old revision, therefore should restart it so that pod is scheduled correctly and runs.
-   - At this point either we started nodes or decided not to because nodes did not have `POD_HAS_OLD_REVISION` reason. Regardless, wait for nodes to have `SERVING` within `postOperationTimeoutMs`. If the timeout is reached and the node's `numRetries` is greater than or equal to `maxRetries`, throw `TimeoutException`. Otherwise increment node's `numRetries` and repeat from step 2.
+   - At this point we have either restarted the nodes or decided not to because they did not have the `POD_HAS_OLD_REVISION` reason. Regardless, wait for nodes to have `READY` within `operationTimeoutMs`. If the timeout is reached and the node's `numRetries` is greater than or equal to `maxRetries`, throw `TimeoutException`. Otherwise increment node's `numRetries` and repeat from step 2.
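For illustration only, the grouping performed in step 4 could be expressed along the following lines, reusing the `Context` and `State` sketch shown earlier. The `Category` values mirror the groups listed in step 4; the class and method names, and the connectivity check, are assumptions rather than the proposed implementation.

```java
// Illustrative sketch of the step 4 grouping -- not the proposed implementation.
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

enum Category { RESTART_FIRST, WAIT_FOR_LOG_RECOVERY, MAYBE_RECONFIGURE_OR_RESTART, RESTART, NOP }

final class NodeGrouping {

    /** Assign every node context to exactly one category based on its last observed state. */
    static Map<Category, List<Context>> group(List<Context> contexts) {
        return contexts.stream().collect(Collectors.groupingBy(NodeGrouping::categorize));
    }

    static Category categorize(Context c) {
        if (c.lastKnownState == State.NOT_READY
                || c.lastKnownState == State.NOT_RUNNING
                || !adminConnectionWorks(c)) {
            // Unresponsive nodes also get their reason updated with POD_UNRESPONSIVE (omitted here).
            return Category.RESTART_FIRST;
        } else if (c.lastKnownState == State.RECOVERING) {
            return Category.WAIT_FOR_LOG_RECOVERY;
        } else if (c.currentNodeRoles.contains("broker")
                && c.restartReasons.isEmpty()
                && c.numRestartAttempts == 0
                && c.numReconfigAttempts == 0) {
            // Refined into RECONFIGURE or RESTART later, once configs have been described (step 7).
            return Category.MAYBE_RECONFIGURE_OR_RESTART;
        } else if (!c.restartReasons.isEmpty() && c.numRestartAttempts == 0) {
            return Category.RESTART;
        } else {
            // Everything else, including nodes already restarted/reconfigured and READY or LEADING_ALL_PREFERRED.
            return Category.NOP;
        }
    }

    // Placeholder for the real connectivity check against the Kafka Admin API.
    private static boolean adminConnectionWorks(Context c) {
        return true;
    }
}
```

Nodes that end up in `MAYBE_RECONFIGURE_OR_RESTART` are re-categorised into `RECONFIGURE` or `RESTART` in step 7, once their current configuration has been described via the Admin API.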
- Otherwise the nodes will be attempted to restart one by one in the following order:
@@ -154,23 +156,24 @@ Contexts are recreated in each reconciliation with the initial data above.
   - Mixed nodes
   - Broker only nodes

-   - Wait for the restarted node to have `SERVING` within `postOperationTimeoutMs`. If the timeout is reached and the node's `numRetries` is greater than or equal to `maxRetries`, throw `TimeoutException`. Otherwise increment node's `numRetries` and repeat from step 2.
+   - Wait for the restarted node to have `READY` within `operationTimeoutMs`. If the timeout is reached and the node's `numRetries` is greater than or equal to `maxRetries`, throw `TimeoutException`. Otherwise increment node's `numRetries` and repeat from step 2.

-7. Further refine the broker nodes in `MAYBE_RECONFIGURE` group:
+7. Further refine the broker nodes in the `MAYBE_RECONFIGURE_OR_RESTART` group:
   - Describe Kafka configurations for each node via Admin API and compare them against the desired configurations. This is essentially the same mechanism we use today for the current KafkaRoller.
   - If a node has configuration changes and they can be dynamically updated, add the node into another group called `RECONFIGURE`.
   - If a node has configuration changes but they cannot be dynamically updated, add the node into the `RESTART` group.
   - If a node has no configuration changes, put the node into the `NOP` group.

8. Reconfigure each node in `RECONFIGURE` group:
-   - If `numReconfigAttempts` of a node is greater than the configured `maxReconfigAttempts`, add a restart reason to its context and repeat from step 2. Otherwise continue.
+   - If `numReconfigAttempts` of a node is greater than the configured `maxReconfigAttempts`, add a restart reason to its context and repeat from step 2. This will cause the node to be restarted instead of reconfigured. Otherwise continue.
   - Send `incrementalAlterConfig` request with its config updates.
   - Transitions the node's state to `RECONFIGURED` and increment its `numReconfigAttempts`.
-   - Wait for each node that got configurations updated until they have `LEADING_ALL_PREFERRED` within the `postOperationTimeoutMs`.
-   - If the `postOperationTimeoutMs` is reached, repeat from step 2.
+   - If the request to update the configuration fails, retry the node by repeating from step 2.
+   - Wait for each node that got configurations updated until they have `READY` within the `operationTimeoutMs`.
+   - If the `operationTimeoutMs` is reached, repeat from step 2.

9. If at this point, the `RESTART` group is empty and if there is no nodes that is in `NOT_READY` state, the reconciliation will be completed successfully.
-   - If there are nodes in `NOT_READY` state, wait for them to have `SERVING` within the `postOperationTimeoutMs`.
+   - If there are nodes in `NOT_READY` state, wait for them to have `READY` within the `operationTimeoutMs`.
   - If the timeout is reached for a node and its `numRetries` is greater than or equal to `maxRetries`, throw `TimeoutException`.
   - Otherwise increment node's `numRetries` and repeat from step 2. This is consistent with how the current roller handles unready nodes.
@@ -201,9 +204,9 @@ Contexts are recreated in each reconciliation with the initial data above.
11. Restart the nodes from the returned batch in parallel:
   - If `numRestartAttempts` of a node is larger than `maxRestartAttempts`, throw `MaxRestartsExceededException`.
   - Otherwise, restart each node and transition its state to `RESTARTED` and increment its `numRestartAttempts`.
- - After restarting all the nodes in the batch, wait for their states to become `SERVING` until the configured `postOperationTimeoutMs` is reached. + - After restarting all the nodes in the batch, wait for their states to become `READY` until the configured `operationTimeoutMs` is reached. - If the timeout is reached, throw `TimeoutException`. If a node's `numRetries` is greater than or equal to `maxRetries`. Otherwise increment their `numRetries` and repeat from step 2. - - After all the nodes are in `SERVING` state, trigger preferred leader elections via Admin client. Wait for their states to become `LEADING_ALL_PREFERRED` until the configured `postOperationTimeoutMs` is reached. If the timeout is reached, log a `WARN` message. + - After all the nodes are in `READY` state, trigger preferred leader elections via Admin client. Wait for their states to become `LEADING_ALL_PREFERRED` until the configured `operationTimeoutMs` is reached. If the timeout is reached, log a `WARN` message. 12. If there are no exceptions thrown at this point, the reconciliation completes successfully. If there were `UnrestartableNodesException`, `TimeoutException`, `MaxRestartsExceededException` or any other unexpected exceptions throws, the reconciliation fails. @@ -244,7 +247,7 @@ All the nodes except `mixed-3` have the following Context with `nodeRef` being t ``` nodeRef: controller-0/0 nodeRoles: controller - state: SERVING + state: READY lastTransition: 0123456 reason: MANUAL_ROLLING_UPDATE numRestartAttempts: 0 @@ -262,7 +265,7 @@ The `mixed-3` node has the following context because the operator could not esta numReconfigAttempts: 0 numRetries: 0 ``` -2. The roller checks if all of the controller nodes are mixed and in `NOT_RUNNING` state. Since they are not and it has `POD_UNRESPONSIVE` reason, it restarts `mixed-3` node and waits for it to have `SERVING` state. The `mixed-3`'s context becomes: +2. The roller checks if all of the controller nodes are mixed and in `NOT_RUNNING` state. Since they are not and it has `POD_UNRESPONSIVE` reason, it restarts `mixed-3` node and waits for it to have `READY` state. The `mixed-3`'s context becomes: ``` nodeRef: mixed-3/3 nodeRoles: controller,broker @@ -273,12 +276,12 @@ The `mixed-3` node has the following context because the operator could not esta numReconfigAttempts: 0 numRetries: 0 ``` -3. `mixed-3` state becomes `SERVING` and since its `numRestartAttempts` is greater than 1, the roller checks the rest of the nodes. +3. `mixed-3` state becomes `READY` and since its `numRestartAttempts` is greater than 1, the roller checks the rest of the nodes. 4. The roller checks which node is the active controller and finds that `controller-0` is. It then sends a request to the active controller via AdminClient to describe its `controller.quorum.fetch.timeout` config value. 5. It then considers restarting `controller-1` and checks if the quorum health would be impacted. The operator sends a request to the active controller to describe the quorum replication state. It finds that majority of the follower controllers have caught up with the quorum leader within the `controller.quorum.fetch.timeout.ms`. -6. The roller restarts `controller-1` as it has no impact on the quorum health. When it has `SERVING` state, the roller repeats the quorum check and restarts `controller-2` and then `controller-0`. +6. The roller restarts `controller-1` as it has no impact on the quorum health. 
When it has `READY` state, the roller repeats the quorum check and restarts `controller-2` and then `controller-0`. 7. It then considers restarting `mixed-4`, so it performs quorum healthcheck and then availability check. Both check passes therefore `mixed-4` is restarted. The same is repeated for `mixed-5`. -8. All the controller and mixed nodes have `SERVING` state and `numRestartAttempts` set to greater than 1. This means, they have been successfuly restarted, therefore the roller considers restarting the broker nodes. +8. All the controller and mixed nodes have `READY` state and `numRestartAttempts` set to greater than 1. This means, they have been successfuly restarted, therefore the roller considers restarting the broker nodes. 9. It sends requests to describe all the topic partitions and their `min.insync.replicas` configuration, and the following list of topics is returned: ``` topic("topic-A"), Replicas(9, 10, 11), ISR(9, 10), MinISR(2) @@ -293,10 +296,10 @@ topic("topic-E"), Replicas(6, 10, 11), ISR(6, 10, 11), MinISR(2) - (7) - `broker-7` and `broker-10` do not share any topic partitions, however topic-A is at min ISR, therefore 10 cannot be restarted and is removed from the batch. - (6) - `broker-6` and `broker-9` do not share any topic partitions, however topic-A is at min ISR, therefore 9 cannot be restarted and is removed from the batch. -11. The roller picks the largest batch containing `broker-11` and `broker-8` and restarts them together. It waits for the nodes to have `SERVING` and then `LEADING_ALL_PREFERRED` state. -12. It then restarts the batch containing only `broker-7`. It waits for it to have `SERVING` and then `LEADING_ALL_PREFERRED` state. -13. It then restarts the batch containing only `broker-6`. It times out waiting for it to have `SERVING` state because it's still performing log recovery. -14. The roller retries waiting for `broker-6` to have `SERVING` state for a number of times and results in the following context: +11. The roller picks the largest batch containing `broker-11` and `broker-8` and restarts them together. It waits for the nodes to have `READY` and then `LEADING_ALL_PREFERRED` state. +12. It then restarts the batch containing only `broker-7`. It waits for it to have `READY` and then `LEADING_ALL_PREFERRED` state. +13. It then restarts the batch containing only `broker-6`. It times out waiting for it to have `READY` state because it's still performing log recovery. +14. The roller retries waiting for `broker-6` to have `READY` state for a number of times and results in the following context: ``` nodeRef: broker-6/6 nodeRoles: broker @@ -308,14 +311,14 @@ topic("topic-E"), Replicas(6, 10, 11), ISR(6, 10, 11), MinISR(2) numRetries: 10 ``` 15. The `maxRetries` of 10 is reached for `broker-6`, therefore the roller throws `UnrestartableNodesException` and the reconciliation fails. The operator logs the number of remaining segments and logs to recover. -16. When the next reconciliation starts, all the nodes are observed and their contexts are updated. `broker-6` node has finished performing log recovery therefore have `SERVING` state. All nodes have `SERVING` state and no reason to restart except `broker-9` and `broker-10`. +16. When the next reconciliation starts, all the nodes are observed and their contexts are updated. `broker-6` node has finished performing log recovery therefore have `READY` state. All nodes have `READY` state and no reason to restart except `broker-9` and `broker-10`. 17. 
Broker nodes that have no reason to restart are checked if their configurations have been updated. The `min.insync.replicas` has been updated to 1 therefore the roller sends a request containing the configuration update to the brokers and then transitions nodes' state to `RECONFIGURED`.
18. Observe the broker nodes that have configuration updated, and wait until they have `LEADING_ALL_PREFERRED` state.
19. The roller considers restarting `broker-10` and `broker-9` as they still have `MANUAL_ROLLING_UPDATE` reason.
20. It sends requests to describe all the topic partitions and their `min.insync.replicas` configuration and finds that all topic partitions are fully replicated.
21. The roller create 2 batches with a single node in each because `broker-10` and `broker-9` share topic partition, "topic-A":
-22. It then restarts the batch containing `broker-10`. It waits for it to have `SERVING` and then `LEADING_ALL_PREFERRED` state. The same is repeated for the batch containing `broker-9`.
-23. All nodes have `SERVING` or `LEADING_ALL_PREFERRED` and no exception was thrown therefore the reconciliation completes successfully.
+22. It then restarts the batch containing `broker-10`. It waits for it to have `READY` and then `LEADING_ALL_PREFERRED` state. The same is repeated for the batch containing `broker-9`.
+23. All nodes have `READY` or `LEADING_ALL_PREFERRED` state and no exception was thrown, therefore the reconciliation completes successfully.

### Switching from the old KafkaRoller to the new KafkaRoller
@@ -332,7 +335,7 @@ The new KafkaRoller will only work with KRaft clusters therefore when running in
 - KafkaRoller is performing a rolling update to the cluster. It checks the availability impact for foo-0 partition before rolling broker 1. Since partition foo-0 has ISR [1, 2, 4], KafkaRoller decides that it is safe to restart broker 1. It is unaware of the `removingReplicas` request that is about to be processed.
 - The reassignment request is processed and foo-0 partition now has ISR [1, 4].
 - KafkaRoller restarts broker 1 and foo-0 partition now has ISR [4] which is below the configured minimum in sync replica of 2 resulting in producers with acks-all no longer being able to produce to this partition.
-
+Addressing this race condition would likely need its own proposal.

## Affected/not affected projects