Restarted brokers are kept with a broken connection (producer) #665
When the producer sends a message to a broker and that network connection is gone, sarama should receive an error from the networking stack, refresh its metadata, and either find a new broker or return an error if it cannot. Are you saying that sarama is not doing that in some case? Is it completely hanging? Is it returning errors even after the connection has recovered? What does "cannot be used" mean? Or is the behaviour I described not the one you want?
Yes, it seems that sarama does NOT reconnect automatically to the broker that went down, but it should. We saw that after one of our kafka servers went down, sarama was still not able to deliver all the data (errors returned). The ratio of messages not delivered pointed to one partition not being available. So my first test was to check whether the connection held by the broker object produced a networking error when used after the server went offline and came back online again - and it did.
And presumably came back online? If one of the servers is actually down and the partition has no other replicas, there isn't much we can do.
What error(s) exactly? If you query the kafka metadata in zookeeper at this point, do all three partitions show up with valid leaders?
This should be proved (or refuted) from the logs. Do you have a copy of the logs that do not include your workaround or whatever consumers are running?
Also worth checking: how long did this go on for? Could it just be waiting for one of the …
I already attached the logs of the reproduction case to my first posting.
I will try to reproduce the exact behavior on my local machine without the fix; this will hopefully draw a clearer picture.
If there are two replicas this should not have happened in the first place; Kafka should have flipped leadership to the other replica and continued on its way.
I see almost nothing to do with producing in these logs; just some consumer lines and a lot of metadata lines which appear to be from the consumer (or from your workaround code?) rather than the producer itself. Based on what I do have, the heavy recurrence of the lines
suggests something wrong with the kafka cluster setup (or possibly zookeeper). Timestamps would help here; were these lines repeating for minutes, or did they all spew out at once? What version of kafka are you running? On certain older versions we have seen bugs where one of the brokers caches incorrect metadata from zookeeper. My working hypothesis is:
Reproduced the problem of a "broken" broker connection with a local setup and better logs now. You can see it in the application log via the message "Broker 192.168.35.23:32002 found to have an invalid connection." You can also find a dump of the sarama config used in the application log.
Please note that after the broker comes back up, metadata requests return to looking "normal", but when the first message is sent and the broker connection is tested (a metadata request for the message topic), an error (EOF = not connected) is returned, causing the message stated above.
We had to restart our kafka servers today and managed to capture a log for the live servers. The behavior was still the same: 1 broker down = 1/num_brokers of the messages returned via the error channel. During the restart we had another server running librdkafka instead, and that one did not report any errors, so I'm pretty sure that we are either using Sarama incorrectly or there is at least one bug related to a broker being restarted.
Oh sorry, this is a dupe of #294. I should have recognized the symptoms sooner, but I didn't review your default configuration in depth because I wasn't sure which fields you overrode in production. That ticket is unfortunately complex; if you can run with retries enabled, that is by far the simplest solution. If this is a blocker, we can continue discussion of a better solution there.
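For anyone landing here later: "retries enabled" refers to the producer retry settings in the sarama config. A minimal sketch, assuming a reasonably recent sarama version; the broker addresses and retry values are placeholders, not taken from this issue:

```go
package main

import (
	"log"
	"time"

	"github.com/Shopify/sarama"
)

func main() {
	// Placeholder seed brokers; replace with your own.
	brokers := []string{"192.168.35.21:32000", "192.168.35.22:32001", "192.168.35.23:32002"}

	config := sarama.NewConfig()
	// Let the producer retry a message after refreshing metadata instead of
	// failing it immediately when the cached broker connection is dead.
	config.Producer.Retry.Max = 3
	config.Producer.Retry.Backoff = 250 * time.Millisecond
	// Also retry metadata refreshes so a restarted broker can be rediscovered.
	config.Metadata.Retry.Max = 3
	config.Metadata.Retry.Backoff = 500 * time.Millisecond

	producer, err := sarama.NewAsyncProducer(brokers, config)
	if err != nil {
		log.Fatalln(err)
	}
	defer producer.Close()

	// ... feed producer.Input() and drain producer.Errors() as usual ...
}
```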
Ah! Good to know. Thank you.
Apologies for reviving a 2-year-old thread, but I'm reasonably certain I bumped into this issue the other day. Even though it does not treat the root cause, would there be interest in a PR that has the client choose a random seed broker (as opposed to the first one in the list)? Otherwise, the client is stuck on the same broker, which keeps feeding it bad metadata.
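Until such a PR exists, a possible client-side approximation is to shuffle the seed broker list before handing it to sarama, so not every client starts from the same (possibly stale) broker. A rough sketch; newShuffledClient is a hypothetical helper, not part of sarama:

```go
package main

import (
	"math/rand"
	"time"

	"github.com/Shopify/sarama"
)

// newShuffledClient randomizes the order of the seed brokers before creating
// the client, so the first metadata request is not always sent to the same one.
func newShuffledClient(addrs []string, config *sarama.Config) (sarama.Client, error) {
	shuffled := make([]string, len(addrs))
	copy(shuffled, addrs)
	r := rand.New(rand.NewSource(time.Now().UnixNano()))
	r.Shuffle(len(shuffled), func(i, j int) {
		shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
	})
	return sarama.NewClient(shuffled, config)
}
```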
When connecting a producer to a set of brokers (let's say 3 for a given topic) and one of these brokers goes down (e.g. during a restart), Sarama keeps the broken connection open and does not attempt to reconnect.
There is a function in broker.go called "Connected", but it does not actually check whether the connection is still valid, because a broken connection is not nil. See https://github.com/Shopify/sarama/blob/master/broker.go#L118
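To illustrate the point: Connected only reports whether a connection object exists, so the only way to find out whether the TCP connection is actually alive is to send a request over it. A small sketch of such a probe, assuming the Shopify/sarama package; isBrokerAlive is a hypothetical helper:

```go
package main

import "github.com/Shopify/sarama"

// isBrokerAlive actively probes a broker instead of trusting Connected(),
// which can still report true after the remote side has gone away.
func isBrokerAlive(b *sarama.Broker, topic string) bool {
	if ok, _ := b.Connected(); !ok {
		return false
	}
	// A lightweight metadata request; it fails (e.g. with EOF) when the
	// underlying connection is broken.
	_, err := b.GetMetadata(&sarama.MetadataRequest{Topics: []string{topic}})
	return err == nil
}
```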
How to reproduce the problem:
1. Connect a producer to a set of brokers for a given topic.
2. Take one of the brokers down (e.g. stop its docker container) and bring it back up after a few seconds.
3. Send a message to the topic once the broker is back online.
-> The connection is invalid and, unless manually tested, cannot be used.
What we use as a workaround is to do a metadata request to each leader of a topic and see if one fails.
If one does, we call broker.Close and broker.Open and hope for the best.
This seems to work but it still feels like Sarama should take care of this as connection handling is normally done internally.
The code we used can be found here (maybe we just use it wrongly?): https://github.com/trivago/gollum/blob/cde9dfe8072f801e9f6a1ba20035865e9e3a1fea/producer/kafka.go
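For readers who don't want to dig through the gollum source, here is a condensed sketch of that workaround, assuming an already-opened sarama client; the function name and error handling are illustrative only:

```go
package main

import "github.com/Shopify/sarama"

// revalidateLeaders probes the leader of every partition of a topic and
// recycles any broker whose connection turns out to be dead.
func revalidateLeaders(client sarama.Client, config *sarama.Config, topic string) error {
	partitions, err := client.Partitions(topic)
	if err != nil {
		return err
	}
	checked := make(map[int32]bool) // broker ID -> already probed
	for _, partition := range partitions {
		leader, err := client.Leader(topic, partition)
		if err != nil {
			continue // no leader known for this partition right now
		}
		if checked[leader.ID()] {
			continue
		}
		checked[leader.ID()] = true
		// Probe the connection with a metadata request for the topic.
		if _, err := leader.GetMetadata(&sarama.MetadataRequest{Topics: []string{topic}}); err != nil {
			// Close and reopen the broker connection and hope for the best.
			_ = leader.Close()
			_ = leader.Open(config)
		}
	}
	return nil
}
```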
I also attached the log output of the test:
log.txt
This log follows the reproduction steps above using docker containers. The container was kept down for a few seconds before the restart. After the restart we waited a few seconds and sent a message, which detected the broken connection at this line: https://github.com/trivago/gollum/blob/cde9dfe8072f801e9f6a1ba20035865e9e3a1fea/producer/kafka.go#L489