bug: follower fetch halts rather than switch to leader if replica is stopped/restarted #2090
Comments
https://share.getcloudapp.com/v1uLoRK5 Here are the unique stacks found by
actually, not convinced this isn't apache/kafka#10326 rearing its head again. sigh. it looks like what caused things to wake back up again was the broker re-entering ISR for the partition, rather than the broker coming back at all. so yes, this is an odd interaction between the Sarama library and the Kafka broker.
based on what I'm reading, we got punted off 1290 and had an active subscription to 1289 the whole time we weren't processing any messages, but it was not feeding us any messages (or we were ignoring the results), until 1290 re-entered ISR, at which point it sent us a message that we understood to mean 1290 was the new preferred broker and we should re-dispatch. so why wasn't the leader giving us responses to Fetch, while our high watermark reader in a separate thread was indicating we were falling further and further behind?
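For context on the exchange above: with KIP-392 follower fetching, the leader can answer a fetch with no records and instead hand back a preferred read replica ID, and the client is expected to redirect its subsequent FetchRequests to that broker. Below is a minimal sketch of that decision point; the types and names are hypothetical stand-ins, not Sarama's actual implementation.

```go
package main

import "fmt"

// noPreferredReplica marks a FetchResponse that did not suggest a read replica.
const noPreferredReplica = -1

// FetchBlock is a hypothetical stand-in for the per-partition part of a
// FetchResponse: either records, or a redirect to a preferred read replica.
type FetchBlock struct {
	PreferredReadReplica int32  // broker ID to fetch from next, or -1
	Records              []byte // raw record batch; empty when redirected
}

// nextBrokerForFetch decides which broker the next FetchRequest should go to.
func nextBrokerForFetch(current, leader int32, block FetchBlock) int32 {
	if block.PreferredReadReplica != noPreferredReplica &&
		block.PreferredReadReplica != current {
		// The leader is pointing us at a (usually same-rack) follower.
		return block.PreferredReadReplica
	}
	if len(block.Records) == 0 && current != leader {
		// No data and no usable preference: fall back to the leader.
		return leader
	}
	return current
}

func main() {
	// Leader 1289 sends no records and redirects us to follower 1290.
	fmt.Println(nextBrokerForFetch(1289, 1289, FetchBlock{PreferredReadReplica: 1290}))
}
```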
Thanks for the detailed report; this sounds interesting and I look forward to diving into it, but unfortunately I'm on vacation now, so it's unlikely I'll get a chance to properly look into it until the New Year.
happy new year! let me know if/when you'd like to pair to look at this.
After Sarama had been given a preferred replica to consume from, it was mistakenly latching onto that value and not unsetting it in the case that the preferred replica broker was shut down and left the cluster metadata.

Fetches continued to work as long as that broker remained shut down, because they were now being sent to the Leader, which would service them itself as it had no better preferred replica to point the client at. However, consumption would then hang after the broker came back online, because the Leader would stop returning records in the FetchResponse and would instead just return the preferred replica ID, expecting the client to send its FetchRequests over there. Because the partitionConsumer had latched the value of preferredReplica, it never dispatched to (re-)connect to the preferred replica and instead just continued to send FetchRequests to the leader and received no records back.

Contributes-to: #2090
Signed-off-by: Dominic Evans <[email protected]>
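The gist of the fix, as described in that commit message, is that the stored preferred-replica value has to be cleared once that broker drops out of the cluster metadata, so that a later redirect from the leader registers as a change and triggers a re-dispatch. A rough sketch of that reset logic follows; the names and structure are illustrative, not the actual Sarama code.

```go
package main

import "fmt"

// none means no preferred read replica is currently latched.
const none = -1

type partitionFetcher struct {
	leader           int32
	preferredReplica int32          // latched replica ID, or none
	brokers          map[int32]bool // broker IDs currently in cluster metadata
}

// brokerFor returns the broker to fetch from, clearing a stale preference.
func (p *partitionFetcher) brokerFor() int32 {
	if p.preferredReplica != none && !p.brokers[p.preferredReplica] {
		// The preferred replica left the cluster: forget it. Otherwise a later
		// redirect back to the same ID would look like "no change" and the
		// consumer would keep fetching (fruitlessly) from the leader.
		p.preferredReplica = none
	}
	if p.preferredReplica != none {
		return p.preferredReplica
	}
	return p.leader
}

// handleRedirect records a preferred replica from a FetchResponse and reports
// whether the consumer needs to re-dispatch to a different broker.
func (p *partitionFetcher) handleRedirect(replica int32) bool {
	if replica == none || replica == p.preferredReplica {
		return false
	}
	p.preferredReplica = replica
	return true
}

func main() {
	p := &partitionFetcher{
		leader:           1289,
		preferredReplica: 1290,
		brokers:          map[int32]bool{1289: true}, // 1290 has been shut down
	}
	fmt.Println(p.brokerFor())          // 1289: stale preference cleared, fall back to leader
	fmt.Println(p.handleRedirect(1290)) // true: 1290 is back, re-dispatch to it
}
```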
@lizthegrey sorry for not being in touch before now; it has been a busy start to the year, but I finally managed to carve out some time to look into this problem. I put together a small functional test to exercise the behaviour and immediately spotted one bug in the current implementation, which I've pushed a fix for under PR #2108. I tried to describe the issue I found in the description of that PR. I'm not 100% certain this was the issue that you hit in your testing, based on how you described when consumption stalled and when it successfully resumed, but I wonder if you might be able to re-test your scenario with that fix applied.
That indeed sounds like the bug we were seeing, thanks! We'll retry in the next few days and get back to you.
Yup, the proposed fix indeed fixed it.
Versions
Configuration
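The configuration used was not included here. For reference, follower fetching in Sarama is opted into by setting the client's RackID (against brokers on Kafka 2.4+ with a rack-aware replica selector, so fetches can be served by a same-rack replica). A minimal sketch under those assumptions, not the reporter's actual configuration; the broker address and rack value are placeholders:

```go
package main

import (
	"log"

	"github.com/Shopify/sarama" // import path at the time of this issue (now github.com/IBM/sarama)
)

func main() {
	cfg := sarama.NewConfig()
	// Follower fetching (KIP-392) is driven by the client's rack: with RackID
	// set, the broker may answer fetches with a preferred same-rack replica.
	cfg.RackID = "us-east-1b"     // assumed, matching the AZ mentioned in the report
	cfg.Version = sarama.V2_4_0_0 // follower fetching requires Kafka >= 2.4.0
	cfg.Consumer.Return.Errors = true

	consumer, err := sarama.NewConsumer([]string{"broker:9092"}, cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer consumer.Close()
}
```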
Logs
Broker 1289 is the leader; broker 1290 is our local replica in us-east-1b
Problem Description
When a broker that we are follower-fetching from is restarted, Sarama stays pinned to the unavailable broker waiting for it to come back instead of falling back to the leader. Consumption remains hung and no messages are delivered until the restarted broker is accepting connections again AND has re-entered the ISR for the partition being read.
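One way to make this failure mode visible, rather than just noticing that messages stop arriving, is to compare the partition's high watermark (which kept advancing during the stall) against the offset of the last message actually delivered. A small sketch using Sarama's public consumer API; the broker address, topic, and partition are placeholders:

```go
package main

import (
	"log"
	"time"

	"github.com/Shopify/sarama"
)

func main() {
	cfg := sarama.NewConfig()
	consumer, err := sarama.NewConsumer([]string{"broker:9092"}, cfg)
	if err != nil {
		log.Fatal(err)
	}
	pc, err := consumer.ConsumePartition("example-topic", 0, sarama.OffsetNewest)
	if err != nil {
		log.Fatal(err)
	}

	var lastConsumed int64 = -1
	ticker := time.NewTicker(30 * time.Second)
	for {
		select {
		case msg := <-pc.Messages():
			lastConsumed = msg.Offset
		case <-ticker.C:
			// During the stall described above, the high watermark keeps rising
			// while no messages are delivered, so this lag grows without bound.
			lag := pc.HighWaterMarkOffset() - lastConsumed
			log.Printf("consumer lag on partition 0: %d", lag)
		}
	}
}
```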