Introducing new KafkaRoller #103
base: main
Conversation
Made some improvements on the structure Signed-off-by: Gantigmaa Selenge <[email protected]>
Just a first pass, as I need more time to digest this. I think it would be useful to illustrate the new behavior with a couple of examples of the form: with this roller configuration and cluster state, these are the node groups and their restart order. Wdyt?
Tidy up Signed-off-by: Gantigmaa Selenge <[email protected]>
@fvaleri Thank you for the feedback. I have added an example of a rolling update. Please let me know what you think.
Nice proposal. Thanks for it 👍.
STs POV:
I think we would also need to design multiple tests to cover all the states of KafkaRoller v2. We have a few tests, but that's certainly not 100% coverage, so maybe we should have a meeting to talk about this...
Side note about performance:
What would be appropriate performance metrics for us to consider when designing performance tests? Are there any critical ones? I can certainly imagine that we would see a significant improvement in rolling updates of multiple nodes when we use the batching mechanism...
Co-authored-by: Maros Orsak <[email protected]> Signed-off-by: Gantigmaa Selenge <[email protected]>
@tinaselenge thanks for the example, it really helps.
I left some comments, let me know if something is not clear or you want to discuss further.
- Cruise Control sends a `removingReplicas` request to un-assign the partition from broker 2.
- KafkaRoller is performing a rolling update of the cluster. It checks the availability impact for partition foo-0 before rolling broker 1. Since partition foo-0 has ISR [1, 2, 4], KafkaRoller decides that it is safe to restart broker 1. It is unaware of the `removingReplicas` request that is about to be processed.
- The reassignment request is processed and partition foo-0 now has ISR [1, 4].
- KafkaRoller restarts broker 1 and partition foo-0 now has ISR [4], which is below the configured minimum in-sync replicas of 2, resulting in producers with acks=all no longer being able to produce to this partition.
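For context, the kind of availability check described in the scenario above ("is it safe to restart broker 1 given the current ISR?") might look roughly like the following sketch using the Kafka Admin API. This is only an illustration of the point-in-time check that the race defeats, not the actual implementation; a single cluster-wide `min.insync.replicas` is assumed for brevity.

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

import java.util.Map;
import java.util.Set;

public class AvailabilityCheckSketch {

    // Returns true if removing brokerId from the ISRs it is part of would still leave
    // every partition with at least minIsr in-sync replicas.
    static boolean safeToRestart(Admin admin, int brokerId, int minIsr) throws Exception {
        Set<String> topics = admin.listTopics().names().get();
        Map<String, TopicDescription> descriptions = admin.describeTopics(topics).allTopicNames().get();
        for (TopicDescription td : descriptions.values()) {
            for (TopicPartitionInfo p : td.partitions()) {
                boolean inIsr = p.isr().stream().anyMatch(n -> n.id() == brokerId);
                long isrWithoutBroker = p.isr().stream().filter(n -> n.id() != brokerId).count();
                if (inIsr && isrWithoutBroker < minIsr) {
                    return false; // restarting brokerId would drop this partition below min ISR
                }
            }
        }
        // This is exactly the point-in-time decision that a concurrent removingReplicas
        // reassignment can invalidate before the restart actually happens.
        return true;
    }
}
```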
In addition to rebalance, we have the same race condition with replication factor change (the new integration between CC and the TO), maybe you can mention this.
The roller should be able to call CC's user_tasks endpoint and check whether there is any pending task. In that case, the roller has two options: wait for all tasks to complete, or continue as today with the potential issue you describe here. You can't really stop the tasks, because the current batch will still be completed, and the operators will try to submit a new task in the next reconciliation loop.
I think we should let the user decide which policy to apply through a configuration. By default the roller would wait for all CC tasks to complete, logging a warning. If the user sets or switches to a "force" policy, then the roller would behave like today. Wdyt?
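A minimal sketch of what such a check against Cruise Control's `user_tasks` endpoint could look like is below. The path, query parameter and status strings are assumptions about the CC REST API, and a real implementation would parse the JSON response properly instead of substring matching.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CruiseControlTaskCheckSketch {

    // Returns true if Cruise Control appears to have unfinished tasks, in which case the
    // roller could wait (default policy) or proceed anyway (a "force" policy).
    static boolean cruiseControlBusy(String ccBaseUrl) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(ccBaseUrl + "/kafkacruisecontrol/user_tasks?json=true"))
                .GET()
                .build();
        String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
        // Crude check for tasks that are still active or being executed.
        return body.contains("\"Active\"") || body.contains("\"InExecution\"");
    }
}
```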
Should this be perhaps included/discussed in a separate proposal or issue? The idea was to mention that there is a race condition we could fix with the new roller in the future, which is not easy to fix with the old roller. How we fix it and other similar problems should be a separate discussion I think.
This should have a dedicated proposal IMO, but let's start by logging an issue.
Would calling the ListReassigningPartitions API be enough to know this?
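(For reference, that presumably maps to `Admin#listPartitionReassignments`. A minimal check might look like the sketch below; note that it only reveals reassignments already submitted to Kafka, not tasks still queued inside Cruise Control, so on its own it may not fully close the race described above.)

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.PartitionReassignment;
import org.apache.kafka.common.TopicPartition;

import java.util.Map;

public class ReassignmentCheckSketch {

    // Returns true if any partition reassignment is currently in flight on the cluster.
    static boolean reassignmentInProgress(Admin admin) throws Exception {
        Map<TopicPartition, PartitionReassignment> ongoing =
                admin.listPartitionReassignments().reassignments().get();
        return !ongoing.isEmpty();
    }
}
```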
Overall this looks good to me, but I had a few questions and wording suggestions. I definitely think this will be useful since I've experienced first hand how tricky it is to debug the existing code.
Add possible transitions Signed-off-by: Gantigmaa Selenge <[email protected]>
If none of the above is true but the node is not ready, then its state would be `NOT_READY`.

#### Flow diagram describing the overall flow of the states
![The new roller flow](./images/06x-new-roller-flow.png)
Some comments about the diagram of the FSM ...
- Each state has to be unique and not duplicated.
- If you are grouping states together, does that really mean they are different states? Or are they the same state with different context? It looks weird to me.
- Where you use "Action and Transition State", I think the Action should be clarified, because it represents an output of the FSM? Is there a real output or just a transition? You don't need "Transition State" because it's already defined by the arrow you have.
My general gut feeling is that you are describing a very complex state machine and its transitions in the "Algorithm" section, but the visualization of it is really weak.
I meant the diagram to be more high-level, and then broken down with more details in the Algorithm section. Given that we may have to change various things around the states anyway, I will take this suggestion and recreate the diagram.
Has the diagram changed after my comments? It seems almost the same. Also, what are "Serving", "Waiting", "Restarted" or "Reconfigured" ... they are not listed as states in the table above.
I can't really match the table with the diagram, sorry :-(
@ppatierno I have updated the diagram now. It's still more of a high-level flow showing the transitions. The possible states are listed, but that does not mean they are grouped together. Depending on the state, a different action is taken, and the possible actions are also listed. Do you think it's clearer now? Or do you think I should break it down so that each state maps to its possible actions, instead of listing them together in the same bubble?
Well, an FSM should have single states (but I can understand your grouping of more than one here, so let's leave it for now). What I can't understand is the hexagon and its content. They are not states, right? I can't see them in the table, so their presence confuses me.
Also, you are duplicating circles for Not_Ready/Not_Running/Recovery: you have one with them and Ready, and one with them alone. I think you should split them, having Ready on its own, as well as Leading_All_Preferred.
Finally, as an FSM state diagram it should not contain other elements like "Start" or "Desired State Reached"; you are describing an FSM, and those are not states.
It's starting to take a better shape :-), but I would still improve it by reducing duplication. I think you can have just one "Unknown" (instead of 2) and one "Ready" state (instead of 3), especially because, AFAIU, the final desired state is "LeadingAllPreferred", right? And in the current graph not all "Ready" states end there.
Next we'll talk about the pink/red states, which have some duplication as well.
Good to hear that it is in the right direction. I've taken the suggestion and made another update :). Thanks @ppatierno
What exactly does "Iterate" on the orange arrows mean?
That means repeating the process on the node if the desired state is not reached.
I have updated the diagram with a bit more explanation instead of the "iterate" part.
## Rejected

- Why not use rack information when batching brokers that can be restarted at the same time?
When all replicas of all partitions have been assigned in a rack-aware way, then brokers in the same rack trivially share no partitions, and so racks provide a safe partitioning. However, nothing in a broker, controller or Cruise Control is able to enforce the rack-aware property, therefore assuming this property is unsafe. Even if CC is being used and rack-aware replicas is a hard goal, we can't be certain that other tooling hasn't reassigned some replicas since the last rebalance, or that no topics have been created in a rack-unaware way.
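(As an aside, the property above is checkable but not enforceable: something like the following Admin API sketch could verify rack-aware placement at a point in time, but nothing guarantees it still holds a moment later, which is exactly why assuming it for batching is unsafe.)

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.Node;
import org.apache.kafka.common.TopicPartitionInfo;

import java.util.Set;

public class RackAwarenessCheckSketch {

    // Returns true only if every partition has all of its replicas in distinct racks.
    static boolean placementIsRackAware(Admin admin) throws Exception {
        Set<String> topics = admin.listTopics().names().get();
        for (TopicDescription td : admin.describeTopics(topics).allTopicNames().get().values()) {
            for (TopicPartitionInfo p : td.partitions()) {
                long distinctRacks = p.replicas().stream().map(Node::rack).distinct().count();
                if (distinctRacks < p.replicas().size()) {
                    return false; // two replicas share a rack, or rack information is missing
                }
            }
        }
        return true;
    }
}
```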
I am not sure the quoted section above counts as a rejected alternative. I mean, this section is for rejected solutions to the same goals, while here it seems to be used just to highlight that an "idea" within the current proposal was rejected.
Of course, we would need to agree on whether we are rejecting this idea. Perhaps I should rename this section to "Other ideas considered"?
- Improve the names for categories and states
- Remove restarted/reconfigured states
- Add a configuration for delay between restarts
- Add a configuration for delay between restart and trigger of preferred leader election
- Restart NOT_RUNNING nodes in parallel for quicker recovery
- Improve the overall algorithm section, to make it clearer and concise
Signed-off-by: Gantigmaa Selenge <[email protected]>
Thanks to everyone who reviewed the PR! I believe I have now addressed all the review comments except the update of the diagram (I will push that in a follow-up commit). @scholzj @ppatierno @fvaleri @tombentley, could you please take another look when you have time? Thank you very much.
Contexts are recreated in each reconciliation with the above initial data.

2. **Transition Node States:**
Update each node's state based on information from the abstracted sources. If it fails to retrieve the information, the reconciliation fails and restarts from step 1.
I would say that this is about "loading" or "building" the current state of the node. Usually in our FSMs (e.g. rebalancing) this state comes from a custom resource; here it's coming from different sources. Maybe we can word it better instead of "update each node's ....".
I'm not sure what would be better. Does "Load each node's state based on information..." sound better?
We build a context in step 1, which has state UNKNOWN. In this step, we are updating this state based on the information from the sources. So to me, "update each node's state" sounds fine.
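For readers following the thread, here is a hypothetical sketch of the per-node context being discussed: built in step 1 with state `UNKNOWN`, then updated (or "loaded") in step 2 from the abstracted sources. All names below are illustrative, not the proposal's exact definitions.

```java
import java.util.Set;

public class NodeContextSketch {

    // States mentioned in the discussion; the proposal's state table is authoritative.
    enum State { UNKNOWN, NOT_RUNNING, NOT_READY, RECOVERY, READY, LEADING_ALL_PREFERRED }

    // Recreated at the start of every reconciliation with the initial data.
    record NodeContext(
            int nodeId,
            State state,                 // starts as UNKNOWN, refreshed from the observed sources
            Set<String> restartReasons,  // empty when no restart is needed
            int numRetries,              // compared against maxRetries
            int numReconfigAttempts) {   // compared against maxReconfigAttempts

        NodeContext withState(State newState) {
            return new NodeContext(nodeId, newState, restartReasons, numRetries, numReconfigAttempts);
        }
    }
}
```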
Hi @tinaselenge, I had another look at the example and I think it is great. I left a few more comments, but I think this would work.
Thanks for answering all my questions. Good job.
Thank you @fvaleri, I really appreciate you reviewing the proposal thoroughly.
- Although it is safe and straightforward to restart one broker at a time, this process is slow in large clusters ([related issue](https://github.com/strimzi/strimzi-kafka-operator/issues/8547)).
- It does not account for partition preferred leadership. As a result, there may be more leadership changes than necessary during a rolling restart, consequently impacting tail latency.
- It is hard to reason about when things go wrong. The code is complex to understand and it's not easy to determine why a pod was restarted from the logs, which tend to be noisy.
- There is a potential race condition between a Cruise Control rebalance and KafkaRoller that could cause partitions to drop below the minimum in-sync replicas. This issue is described in more detail in the `Future Improvements` section.
In general Slack is not really ideal for keeping details of problems in the long term. Better to create an issue, which can be discovered more easily by anyone who faces a similar problem.
Update each node's state based on information from the abstracted sources. If it fails to retrieve the information, the current reconciliation immediately fails. When the next reconciliation is triggered, it will restart from step 1.

3. **Handle `NOT_READY` Nodes:**
Wait for `NOT_READY` nodes to become `READY` within `operationTimeoutMs`.
Judging from the fact that the next step covers `NOT_READY`, I'm guessing that we just fall through if the node is still `NOT_READY` after `operationTimeoutMs`. But you need to say that! And also explain, if we're prepared to fall through to the next step, why this timeout is even necessary.
I have explained why we do the wait and that it falls through to the next step.
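To make the "wait, then fall through" behaviour discussed above concrete, a sketch might look like the following; the `State` enum is the hypothetical one from the earlier context sketch, redeclared here for self-containment, and the polling loop is purely illustrative. The node gets up to `operationTimeoutMs` to become `READY`, but the roller proceeds either way and lets the later steps decide whether a restart is needed.

```java
import java.util.function.Supplier;

public class AwaitReadinessSketch {

    enum State { UNKNOWN, NOT_RUNNING, NOT_READY, RECOVERY, READY, LEADING_ALL_PREFERRED }

    static void awaitReady(int nodeId, Supplier<State> observeState, long operationTimeoutMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + operationTimeoutMs;
        while (System.currentTimeMillis() < deadline && observeState.get() != State.READY) {
            Thread.sleep(1_000); // polling interval is arbitrary here
        }
        // Whether READY or still NOT_READY, we fall through: the following steps decide
        // whether node `nodeId` needs to be restarted.
    }
}
```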
- `NOP`: Nodes needing no operation.

5. **Wait for Log Recovery:**
Wait for `WAIT_FOR_LOG_RECOVERY` nodes to become `READY` within `operationTimeoutMs`. If timeout is reached and `numRetries` exceeds `maxRetries`, throw `UnrestartableNodesException`. Otherwise, increment `numRetries` and repeat from step 2.
For all these steps I think it would be really valuable to explain the why. Here we're willing to wait for brokers in log recovery because the following steps will result in actions, like restarting other brokers, which will be directly visible to clients. We prefer to start from a cluster that's as close to fully functional as possible.
We also need to explain why we are willing to wait for log recovery here, but not willing to wait for all the broker's replicas to rejoin the ISR.
IIRC the reason was that KafkaRoller's job was to restart things, but we didn't want to give it any responsibility for throttling inter-broker replication, and we can't guarantee (for all possible workloads) that brokers will always be able to catch up to the LEO within a reasonable time.
Thanks @tombentley. I have added the reasons.
Reconfigure nodes in the `RECONFIGURE` group:
- Check if `numReconfigAttempts` exceeds `maxReconfigAttempts`. If exceeded, add a restart reason and repeat from step 2. Otherwise, continue.
- Send `incrementalAlterConfig` request, transition state to `UNKNOWN`, and increment `numReconfigAttempts`.
- Wait for each node's state to transition to `READY` within `operationTimeoutMs`. If timeout is reached, repeat from step 2, otherwise continue.
By definition the node's state will already be `READY` (otherwise it would have been in the `RESTART_NOT_RUNNING` group), therefore there is no transition to observe.
It's never been clear to me what would be a good safety check on a dynamic reconfig. Some of the reconfigurable configs could easily result in a borked cluster, so it feels like some kind of check is needed. I think we need to take into account any effects of reconfiguring this node on the other nodes in the cluster. I guess step 10 is intended to achieve this, but it's not clear to me how step 10 differs from just always restarting from step 2 after each reconfiguration.
The roller will transition the node to the UNKNOWN state after taking an action so that the state can be observed again, but you are right, that would likely return READY immediately. As you said, step 10 will repeat from step 2 if at that point the reconfigured node has gone bad. When repeating from step 2, if the node is not ready but there is no reason to restart or reconfigure it (because it's already been reconfigured), we would end up waiting for it to become ready until the reconciliation fails. Perhaps we could fail the reconciliation with an error indicating that a node is not ready after reconfiguration, so that we notify the human operator to investigate through the log.
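For concreteness, the reconfiguration request being discussed would presumably be something along the lines of the following Kafka Admin API sketch; the specific config entry is only an example, and the real roller would apply whatever config differences it detected.

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.List;
import java.util.Map;

public class ReconfigureSketch {

    // Applies a single dynamic broker config change via incrementalAlterConfigs.
    // The config key/value here is purely illustrative.
    static void reconfigureBroker(Admin admin, int brokerId) throws Exception {
        ConfigResource broker =
                new ConfigResource(ConfigResource.Type.BROKER, Integer.toString(brokerId));
        AlterConfigOp op = new AlterConfigOp(
                new ConfigEntry("log.retention.ms", "604800000"),
                AlterConfigOp.OpType.SET);
        admin.incrementalAlterConfigs(Map.of(broker, List.of(op))).all().get();
        // The proposal then transitions the node back to UNKNOWN and re-observes it,
        // which, as noted above, will usually come straight back as READY.
    }
}
```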
Updated the text on the diagram Signed-off-by: Gantigmaa Selenge <[email protected]>
Hi @tombentley @scholzj @ppatierno, do you have any further comments on this proposal?
For more implementation details, the POC implementation code can be checked in RackRolling.java. All the related classes are in the same package, `rolling`.
The tests illustrating various cases with different sets of configurations are in RackRollingTest.java.
The logic for switching to the new roller is in the KafkaReconciler.java class.