erlkaf_consumer stop timeout #21
Comments
Hello, in erlkaf 2.0.0 we did a major rewrite to improve the behavior when you have a lot of topics. (https://github.com/silviucpp/erlkaf#upgrading-from-v1x-to-v200)
Hello, looking at the logs, I think the problem is that one of your callbacks is crashing during the [...]. Is this the case? Silviu
I might be wrong :) The logs are in a pretty strange format :). I'll dig more.
@thusnjak Can you please identify where the consumer spends so much time when it's stopped? When a rebalance occurs:
2a. Once that stop message is received by the gen_server (if it's idle), it will call: https://github.com/silviucpp/erlkaf/blob/master/src/erlkaf_consumer.erl#L128
2b. If the consumer is in the middle of processing messages, then after returning from the app callback (your handle_message/2 callback) it will check whether stop was received: https://github.com/silviucpp/erlkaf/blob/master/src/erlkaf_consumer.erl#L193
If your handle_message/2 callback takes more than 5 seconds, that can explain the issue, or maybe it's a bug in our code. It would be nice if you could add some logs and identify which function takes such a large amount of time.
Silviu
So I think I replicated the problem:
I created a handle_message that adds a delay of 30 seconds, and the problem seems to replicate. I will try to find out exactly why and make a fix, but it would be nice to confirm that your handle_message also takes more than 5 seconds to return.
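The reproduction described above can be sketched without erlkaf itself. The module below is only an illustration, not erlkaf's code: the module name `stop_timeout_sketch`, the function `run/2`, and the message shapes `{stop, From, Ref}` and `{stopped, Ref}` are all hypothetical. It models a consumer that looks for a stop request only after its callback returns, and a caller that gives up after a timeout.

```erlang
-module(stop_timeout_sketch).
-export([run/2]).

%% Hypothetical simulation; names and messages are assumptions, NOT erlkaf's API.
%% run(CallbackDelayMs, StopTimeoutMs) -> stopped | timeout
run(CallbackDelayMs, StopTimeoutMs) ->
    Parent = self(),
    Ref = make_ref(),
    Pid = spawn(fun() -> consume(CallbackDelayMs) end),
    %% Give the simulated consumer time to enter its callback.
    timer:sleep(10),
    Pid ! {stop, Parent, Ref},
    receive
        {stopped, Ref} ->
            stopped
    after StopTimeoutMs ->
        %% The caller gives up, as a stop call with a timeout would.
        exit(Pid, kill),
        timeout
    end.

%% The consumer checks its mailbox for a stop request only after
%% the callback has returned.
consume(DelayMs) ->
    ok = handle_message(DelayMs),
    receive
        {stop, From, Ref} ->
            From ! {stopped, Ref}
    after 0 ->
        consume(DelayMs)
    end.

%% Stand-in for a slow application callback.
handle_message(DelayMs) ->
    timer:sleep(DelayMs),
    ok.
```

With a callback slower than the stop timeout, for example `stop_timeout_sketch:run(300, 50)`, the caller should give up and return `timeout`. With a fast callback, for example `stop_timeout_sketch:run(20, 500)`, it should return `stopped`. The delays are scaled down from the 30 seconds and 5 seconds mentioned in the thread so the sketch finishes quickly.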
Please try the latest master and let me know if you can still replicate.
Hi, thank you for your quick response. We did some crash testing and it all seemed fine. The fix was deployed to our staging today, so most likely we'll know for sure by Monday. I'll let you know. Thanks, T
Thanks a lot. I'm waiting for your feedback. Silviu
So far we haven't noticed any problems. We'll keep monitoring and let you know if any strange behaviour happens. Thank you.
We went into production with this and didn't experience any problems so far :) Tx
Hello,
after upgrading to erlkaf 2.0.0 (commit 1d706c6) we started experiencing consumer stop timeouts on consumer group rebalance.
Our consumer group has only one topic with 36 partitions, and originally we have 2 application instances consuming 18 partitions each. Under higher load we scale horizontally up to 9 application instances and the consumer group rebalances.
Sometimes, when a rebalance occurs, the consumer times out on stop (erlkaf_consumer.erl:50) and the gen_server crashes. That alone wouldn't be a problem, but the consumer group rebalance then continues in a loop and our app never recovers. The consumer just keeps starting and crashing, and a stable group rebalance never happens.
erlkaf_timeout.txt
erlkaf_timeout_extended.txt
This never happened with erlkaf 1.1.9 which we used for quite a while. What is different?