Restarted brokers are kept with a broken connection (producer) #665
When the producer sends a message to a broker and that network connection is gone, sarama should receive an error from the networking stack, refresh its metadata, and either find a new broker or return an error if it cannot. Are you saying that sarama is not doing that in some case? Is it completely hanging? Is it returning errors even after the connection has recovered? What does "cannot be used" mean? Or is the behaviour I described not the one you want?
Yes, it seems that sarama does NOT reconnect automatically to the broker that went down, but it should. We saw that after one of our kafka servers went down, sarama was still not able to deliver all the data (errors returned). The ratio of messages not delivered pointed to one partition not being available. So my first test was to check whether the connection held by the broker object produced a networking error when used after the server went offline and came back online again - and it did.
And presumably came back online? If one of the servers is actually down and the partition has no other replicas, there isn't much we can do.
What error(s) exactly? If you query the kafka metadata in zookeeper at this point, do all three partitions show up with valid leaders?
This should be proved (or refuted) from the logs. Do you have a copy of the logs that do not include your workaround or whatever consumers are running?
Also worth checking: how long did this go on for? Could it just be waiting for one of the …
I already attached the logs of the reproduction case to my first posting.
I will try to reproduce the exact behavior on my local machine without the fix; this will hopefully draw a clearer picture.
If there are two replicas this should not have happened in the first place; Kafka should have flipped leadership to the other replica and continued on its way.
I see almost nothing to do with producing in these logs; just some consumer lines and a lot of metadata lines which appear to be from the consumer (or from your workaround code?) rather than the producer itself. Based on what I do have, the heavy recurrence of the lines
suggests something wrong with the kafka cluster setup (or possibly zookeeper). Timestamps would help here; were these lines repeating for minutes, or did they all spew out at once? What version of kafka are you running? On certain older versions we have seen bugs where one of the brokers caches incorrect metadata from zookeeper. My working hypothesis is:
Reproduced the problem of a "broken" broker connection with a local setup and better logs now. You can see it in the application log via the message "Broker 192.168.35.23:32002 found to have an invalid connection." You can also find a dump of the sarama config used in the application log.
Please note that after the broker comes back up, metadata requests return to looking "normal", but when the first message is sent and the broker connection is tested (a metadata request for the message topic), an error (EOF = not connected) is returned, causing the message stated above.
We had to restart our kafka servers today and managed to capture a log for the live servers. The behavior was still the same: 1 broker down = 1/num_brokers of the messages returned via the error channel. During the restart we had another server running librdkafka instead, and that one did not report any errors, so I'm pretty sure that we are either using Sarama incorrectly or there is at least one bug related to a broker being restarted.
Oh sorry, this is a dupe of #294. I should have recognized the symptoms sooner, but I didn't review your default configuration in depth because I wasn't sure which fields you overrode in production. That ticket is unfortunately complex; if you can run with retries enabled, that is by far the simplest solution. If this is a blocker, we can continue discussion of a better solution there.
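For anyone landing here later: "retries enabled" refers to the producer retry settings in the sarama config. A minimal sketch, assuming a reasonably recent sarama version; the broker addresses and retry values are placeholders, not taken from this issue:

```go
package main

import (
	"log"
	"time"

	"github.com/Shopify/sarama"
)

func main() {
	// Placeholder seed brokers; replace with your own.
	brokers := []string{"192.168.35.21:32000", "192.168.35.22:32001", "192.168.35.23:32002"}

	config := sarama.NewConfig()
	// Let the producer retry a message after refreshing metadata instead of
	// failing it immediately when the cached broker connection is dead.
	config.Producer.Retry.Max = 3
	config.Producer.Retry.Backoff = 250 * time.Millisecond
	// Also retry metadata refreshes so a restarted broker can be rediscovered.
	config.Metadata.Retry.Max = 3
	config.Metadata.Retry.Backoff = 500 * time.Millisecond

	producer, err := sarama.NewAsyncProducer(brokers, config)
	if err != nil {
		log.Fatalln(err)
	}
	defer producer.Close()

	// ... feed producer.Input() and drain producer.Errors() as usual ...
}
```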
Ah! Good to know. Thank you.
Apologies for reviving a 2-year-old thread, but I'm reasonably certain I bumped into this issue the other day. Even though it does not treat the root cause, would there be interest in a PR that has the client choose a random seed broker (as opposed to the first one in the list)? Otherwise, the client is stuck on the same broker, which keeps feeding it bad metadata.
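Until such a PR exists, a possible client-side approximation is to shuffle the seed broker list before handing it to sarama, so not every client starts from the same (possibly stale) broker. A rough sketch; newShuffledClient is a hypothetical helper, not part of sarama:

```go
package main

import (
	"math/rand"
	"time"

	"github.com/Shopify/sarama"
)

// newShuffledClient randomizes the order of the seed brokers before creating
// the client, so the first metadata request is not always sent to the same one.
func newShuffledClient(addrs []string, config *sarama.Config) (sarama.Client, error) {
	shuffled := make([]string, len(addrs))
	copy(shuffled, addrs)
	r := rand.New(rand.NewSource(time.Now().UnixNano()))
	r.Shuffle(len(shuffled), func(i, j int) {
		shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
	})
	return sarama.NewClient(shuffled, config)
}
```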
When connecting a producer to a set of brokers (let's say 3 for a given topic) and one of these brokers goes down (e.g. during a restart), Sarama keeps the broken connection open and does not attempt to reconnect.
There is a function in broker.go called "Connected", but it does not actually check whether the connection is still valid, because a broken connection is not nil. See https://github.com/Shopify/sarama/blob/master/broker.go#L118
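To illustrate the point: Connected only reports whether a connection object exists, so the only way to find out whether the TCP connection is actually alive is to send a request over it. A small sketch of such a probe, assuming the Shopify/sarama package; isBrokerAlive is a hypothetical helper:

```go
package main

import "github.com/Shopify/sarama"

// isBrokerAlive actively probes a broker instead of trusting Connected(),
// which can still report true after the remote side has gone away.
func isBrokerAlive(b *sarama.Broker, topic string) bool {
	if ok, _ := b.Connected(); !ok {
		return false
	}
	// A lightweight metadata request; it fails (e.g. with EOF) when the
	// underlying connection is broken.
	_, err := b.GetMetadata(&sarama.MetadataRequest{Topics: []string{topic}})
	return err == nil
}
```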
How to reproduce the problem:
1. Connect a producer to a set of brokers for a given topic.
2. Take one of the brokers down (e.g. stop its docker container) and bring it back up after a few seconds.
3. Send a message to the topic once the broker is back online.
-> The connection is invalid and, unless manually tested, cannot be used.
What we use as a workaround is to do a metadata request to each leader of a topic and see if one fails.
If one does, we call broker.Close and broker.Open and hope for the best.
This seems to work but it still feels like Sarama should take care of this as connection handling is normally done internally.
The code we used can be found here (maybe we just use it wrongly?): https://github.com/trivago/gollum/blob/cde9dfe8072f801e9f6a1ba20035865e9e3a1fea/producer/kafka.go
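For readers who don't want to dig through the gollum source, here is a condensed sketch of that workaround, assuming an already-opened sarama client; the function name and error handling are illustrative only:

```go
package main

import "github.com/Shopify/sarama"

// revalidateLeaders probes the leader of every partition of a topic and
// recycles any broker whose connection turns out to be dead.
func revalidateLeaders(client sarama.Client, config *sarama.Config, topic string) error {
	partitions, err := client.Partitions(topic)
	if err != nil {
		return err
	}
	checked := make(map[int32]bool) // broker ID -> already probed
	for _, partition := range partitions {
		leader, err := client.Leader(topic, partition)
		if err != nil {
			continue // no leader known for this partition right now
		}
		if checked[leader.ID()] {
			continue
		}
		checked[leader.ID()] = true
		// Probe the connection with a metadata request for the topic.
		if _, err := leader.GetMetadata(&sarama.MetadataRequest{Topics: []string{topic}}); err != nil {
			// Close and reopen the broker connection and hope for the best.
			_ = leader.Close()
			_ = leader.Open(config)
		}
	}
	return nil
}
```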
I also attached the log output of the test:
log.txt
This log follows the reproduction steps above using docker containers. The container was kept down for a few seconds before the restart. After the restart we waited a few seconds and sent a message, which detected the broken connection at this line: https://github.com/trivago/gollum/blob/cde9dfe8072f801e9f6a1ba20035865e9e3a1fea/producer/kafka.go#L489