Broker: Specified group generation id is not valid #331
Comments
Hi @rao-mayur , Thanks for your ticket. You can increase the |
Hi @LGouellec Thanks for the response. I have a few follow up :
|
1- Yes exactly, there are a lot of blog posts on that. Example: https://medium.com/codex/dealing-with-long-running-jobs-using-apache-kafka-192f053e1691. Streamiz internally has a mechanism based on a back-pressure pattern, but in some cases that does not prevent the consumer from being kicked out of the group.
2- Yes exactly, all the async processing is done with a dedicated consumer group (
3- Not totally. If an internal exception is raised (consumer group evicted, for instance), the external consumer is closed properly, and another instance of the consumer group
4- This is the default behavior. FAIL means the thread will die and nothing more, so there is no more external processing in your current application instance. |
Ok thank you @LGouellec |
Hi @LGouellec I set the InnerExceptionHandler to CONTINUE in the stream config settings. Got the warning. Even after making the config update, the issue I continue to see is that not all of the consumers in a group (I have 6 consumers running) rejoin the group as part of the rebalance. This is for the -external group (ForeachAsync() processor). This causes processing to slow down and lag to build up. Manually restarting the processors seems to help, and more consumers join the group. Do you have any advice for this? How can I be sure that all consumers rejoin the group after InnerExceptionHandler returns CONTINUE? Thanks, |
Hi @rao-mayur, can you share your latest logs, please? Best regards, |
Hi @LGouellec Please see the attached file. Basically, those logs show that after the InnerExceptionHandler CONTINUE hits on Jul 31, 2024 14:41:02.428, only one GUID is reporting "Processed X total records". I am running 6 containers, so 6 threads. As I understand it, after the CONTINUE all 6 should restart, and in the logs I should be able to see 6 different GUIDs report "Processed X total records", right? Thank you for your help on this. |
Hey @rao-mayur, I don't understand your logs and use case. So you have 6 pods, and therefore 1 external stream thread per pod. Btw, I'm currently conducting a satisfaction survey to understand how I can better serve users, and I would love to get your feedback on the product. Your insights are invaluable and will help shape the future of the product to better meet your needs. The survey will only take a few minutes, and your responses will be completely confidential. Thank you for your time and feedback! |
@LGouellec I see this error but not because it took too long to process messages; rather the state transitioned BACK from RUNNING to PARTITIONS_ASSIGNED. Here is the timeline:
As you can see, the thread processed 62 messages in a minute, which is well under the default 5 min MaxPollInterval.
This causes the following error. Note that it's unable to close out the thread because it tries to commit offsets, which isn't going to work:
Is the expectation here to CONTINUE on inner exceptions like you proposed earlier? I don't think increasing MaxPollIntervalMs helps in this particular case. Also, if the external stream thread is shut down, why isn't a new one launched so that stream processing continues? It seems the stream threads can fail and shut down for several reasons: SSL handshake errors or network errors leading to the member being kicked out, message processing taking too long, state transitions from RUNNING to PARTITIONS_ASSIGNED, and so on. Moreover, |
Another case: On startup, task restoration takes > 5 min, which is the default for MaxPollIntervalMs. As a result, Consume() fails:
We are able to work around this by changing InnerExceptionHandler to CONTINUE and by increasing MaxPollIntervalMs. Would a better alternative be to create the Consumer instance in StreamThread after the restoration is complete? Is that feasible? |
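The workaround described above could look roughly like the following sketch. It assumes that `MaxPollIntervalMs`, `AutoOffsetReset`, and `InnerExceptionHandler` are settable on `StreamConfig` in the way the thread implies; the application id, broker address, and the 10-minute interval are hypothetical values chosen only for illustration, and the right interval depends on how long restoration and processing actually take.

```csharp
using System;
using Confluent.Kafka;
using Streamiz.Kafka.Net;
using Streamiz.Kafka.Net.SerDes;

// Hypothetical application id and broker address, for illustration only.
var config = new StreamConfig<StringSerDes, StringSerDes>
{
    ApplicationId = "my-streams-app",
    BootstrapServers = "localhost:9092",
    AutoOffsetReset = AutoOffsetReset.Earliest,

    // Assumed value: 10 minutes instead of the 300000 ms (5 min) default,
    // so that a long restoration or slow batch is less likely to get the
    // consumer evicted from the group.
    MaxPollIntervalMs = 600000,

    // Log the inner exception and keep the thread alive instead of the
    // default FAIL behavior, which lets the thread die.
    InnerExceptionHandler = exception =>
    {
        Console.Error.WriteLine($"Inner exception: {exception.Message}");
        return ExceptionHandlerResponse.CONTINUE;
    }
};
```

As reported earlier in the thread, returning CONTINUE did not by itself guarantee that every consumer rejoined the -external group, so this is a mitigation rather than a fix.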
Thanks for your input. For your information, I will completely refactor the external stream thread processor in the 1.8.0 release. |
Yes, this is a common error: when task restoration takes longer than 5 minutes (the default), the consumer fails because it is considered inactive.
For your information, I will work on improving the restoration time soon (planned for 1.8.0) by leveraging bulk upserts into RocksDb instead of single upserts. |
Description
I have a streams application running which uses 2 ForeachAsync() loops for processing. I am seeing an error/exception being logged by the Streamiz Library:
external-stream-thread[] Encountered the following unexpected Kafka exception during processing, this usually indicate Streams internal errors:
{
"error.class": "Confluent.Kafka.KafkaException",
"error.message": "Broker: Specified group generation id is not valid",
"error.stack": "\t at Confluent.Kafka.Impl.SafeKafkaHandle.Commit(IEnumerable`1 offsets) \n\t at Confluent.Kafka.Consumer`2.Commit(IEnumerable`1 offsets) \n\t at Streamiz.Kafka.Net.Processors.ExternalStreamThread.CommitOffsets(Boolean clearBuffer) \n\t at Streamiz.Kafka.Net.Processors.ExternalStreamThread.Commit() \n\t at Streamiz.Kafka.Net.Processors.ExternalStreamThread.<>c__DisplayClass44_1.b__2() \n\t at Streamiz.Kafka.Net.Crosscutting.ActionHelper.MeasureLatency(Action action) \n\t at Streamiz.Kafka.Net.Processors.ExternalStreamThread.Run()",
"level": "ERROR",
"message": "external-stream-thread[] Encountered the following unexpected Kafka exception during processing, this usually indicate Streams internal errors:",
}
There are also warnings being logged like this:
stream-task[] Error with a non-fatal error during committing offset (ignore this, and try to commit during next time): Broker: Specified group generation id is not valid
I notice issues with the rebalance when this happens: not all consumers rejoin the consumer group until a manual restart. Although the error is not fatal, I see lag build up and overall slow processing.
How to reproduce
I am not sure this can be easily reproduced. Use a topology with 2 ForeachAsync() loops (each doing some work to process the message), then add consumers to an existing group and check whether the rebalance happens correctly.
Config:
AutoOffsetReset = earliest,
AutoRegisterSchemas = false,
UseLatestVersion = true,
MaxPollIntervalMs = 300000