Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SyncProducer SendMessage blocks for infinity when encounters a Network Timeout #2129

Closed
eafzali opened this issue Feb 7, 2022 · 1 comment

Comments

@eafzali
Copy link

eafzali commented Feb 7, 2022

Versions
Sarama Kafka Go
1.31.0 2.8.1 1.17.5
Configuration
func main() {
	sarama.Logger = log.New(os.Stderr, "[Sarama] ", log.LstdFlags)

	config := sarama.NewConfig()
	config.Producer.Retry.Max = 10
	config.Producer.Return.Successes = true
	config.Version = sarama.V1_0_0_0
	config.Metadata.Retry.Backoff = time.Second * 2

	sp, err := sarama.NewSyncProducer([]string{"0.0.0.0:55728", "0.0.0.0:55729", "0.0.0.0:55727"}, config)
	if err != nil {
		panic(err)
	}

	i := 0
	for {
		time.Sleep(2 * time.Second)
		go func(m string) {
			fmt.Println("sending", m)
			p, o, e := sp.SendMessage(&sarama.ProducerMessage{
				Topic: "mytopic",
				Value: sarama.ByteEncoder([]byte(m)),
			})
			fmt.Println("sent", m, p, o, e)
		}(strconv.Itoa(i))
		i++
	}
}

I'm running KAFKA_CREATE_TOPICS: "mytopic:3:3" on a 3 broker cluster locally using https://github.com/wurstmeister/kafka-docker/.

I run the above application and after a while I docker pause $(docker ps -f name=kafka -q), wait for all 3 network timeouts to happen and then docker unpause $(docker ps -f name=kafka -q).
The application does not recover and I can see the number of goroutines raising on the metrics. I have waited up to an hour.

The same scenario works fine with the previous version of Sarama (1.30.1), you can see it recovers a bit after unpausing the container. And I can confirm that 1.31.1 has not solved this issue.

Logs 1.31.0 and 1.31.1
logs: CLICK ME

[Sarama] 2022/02/07 11:45:00 Initializing new client
[Sarama] 2022/02/07 11:45:00 client/metadata fetching metadata for all topics from broker 0.0.0.0:55728
[Sarama] 2022/02/07 11:45:00 Connected to broker at 0.0.0.0:55728 (unregistered)
[Sarama] 2022/02/07 11:45:00 client/brokers registered new broker #1001 at host.docker.internal:55728
[Sarama] 2022/02/07 11:45:00 client/brokers registered new broker #1003 at host.docker.internal:55729
[Sarama] 2022/02/07 11:45:00 client/brokers registered new broker #1002 at host.docker.internal:55727
[Sarama] 2022/02/07 11:45:00 Successfully initialized new client
sending 0
[Sarama] 2022/02/07 11:45:02 producer/broker/1002 starting up
[Sarama] 2022/02/07 11:45:02 producer/broker/1002 state change to [open] on mytopic/1
[Sarama] 2022/02/07 11:45:02 Connected to broker at host.docker.internal:55727 (registered as #1002)
sent 0 1 387 <nil>
sending 1
[Sarama] 2022/02/07 11:45:04 producer/broker/1003 starting up
[Sarama] 2022/02/07 11:45:04 producer/broker/1003 state change to [open] on mytopic/2
[Sarama] 2022/02/07 11:45:04 Connected to broker at host.docker.internal:55729 (registered as #1003)
sent 1 2 456 <nil>
sending 2
sent 2 2 457 <nil>
sending 3
sent 3 1 388 <nil>
sending 4
sent 4 2 458 <nil>
sending 5
sent 5 2 459 <nil>
sending 6
[Sarama] 2022/02/07 11:45:14 producer/broker/1001 starting up
[Sarama] 2022/02/07 11:45:14 producer/broker/1001 state change to [open] on mytopic/0
[Sarama] 2022/02/07 11:45:14 Connected to broker at host.docker.internal:55728 (registered as #1001)
sent 6 0 467 <nil>
sending 7
sent 7 1 389 <nil>
sending 8
sent 8 1 390 <nil>
sending 9
sent 9 0 468 <nil>
sending 10
sending 11
sent 11 2 460 <nil>
sending 12
sent 12 1 391 <nil>
sending 13
sending 14
sent 14 1 392 <nil>
sending 15
sent 15 1 393 <nil>
sending 16
sending 17
sent 17 1 394 <nil>
sending 18
sending 19
sending 20
sending 21
sending 22
sending 23
sending 24
[Sarama] 2022/02/07 11:45:52 producer/broker/1001 state change to [closing] because read tcp 127.0.0.1:56291->127.0.0.1:55728: i/o timeout
sending 25
sending 26
sending 27
sending 28
sending 29
sending 30
sending 31
sending 32
sending 33
sending 34
[Sarama] 2022/02/07 11:46:12 producer/broker/1002 state change to [closing] because read tcp 127.0.0.1:56284->127.0.0.1:55727: i/o timeout
sending 35
[Sarama] 2022/02/07 11:46:14 producer/broker/1003 state change to [closing] because read tcp 127.0.0.1:56286->127.0.0.1:55729: i/o timeout
sending 36
sending 37
sending 38
sending 39
sending 40
sending 41
sending 42
sending 43
sending 44
sending 45
sending 46
sending 47
sending 48
sending 49
sending 50
sending 51
sending 52
sending 53
sending 54
sending 55
sending 56
sending 57
sending 58
...

Logs 1.30.1
logs: CLICK ME

[Sarama] 2022/02/07 12:05:22 Initializing new client
[Sarama] 2022/02/07 12:05:22 client/metadata fetching metadata for all topics from broker 0.0.0.0:55727
[Sarama] 2022/02/07 12:05:22 Connected to broker at 0.0.0.0:55727 (unregistered)
[Sarama] 2022/02/07 12:05:22 client/brokers registered new broker #1001 at host.docker.internal:55728
[Sarama] 2022/02/07 12:05:22 client/brokers registered new broker #1003 at host.docker.internal:55729
[Sarama] 2022/02/07 12:05:22 client/brokers registered new broker #1002 at host.docker.internal:55727
[Sarama] 2022/02/07 12:05:22 Successfully initialized new client
sending 0
[Sarama] 2022/02/07 12:05:24 producer/broker/1002 starting up
[Sarama] 2022/02/07 12:05:24 producer/broker/1002 state change to [open] on mytopic/1
[Sarama] 2022/02/07 12:05:24 Connected to broker at host.docker.internal:55727 (registered as #1002)
sent 0 1 403 <nil>
sending 1
sent 1 1 404 <nil>
sending 2
[Sarama] 2022/02/07 12:05:28 producer/broker/1001 starting up
[Sarama] 2022/02/07 12:05:28 producer/broker/1001 state change to [open] on mytopic/0
[Sarama] 2022/02/07 12:05:28 Connected to broker at host.docker.internal:55728 (registered as #1001)
sent 2 0 473 <nil>
sending 3
sending 4
[Sarama] 2022/02/07 12:05:32 producer/broker/1003 starting up
[Sarama] 2022/02/07 12:05:32 producer/broker/1003 state change to [open] on mytopic/2
[Sarama] 2022/02/07 12:05:32 Connected to broker at host.docker.internal:55729 (registered as #1003)
sending 5
sending 6
sending 7
sending 8
sending 9
sending 10
sending 11
sending 12
sending 13
sending 14
sending 15
sending 16
sending 17
[Sarama] 2022/02/07 12:06:00 producer/broker/1001 state change to [closing] because read tcp 127.0.0.1:57149->127.0.0.1:55728: i/o timeout
[Sarama] 2022/02/07 12:06:00 Closed connection to broker host.docker.internal:55728
[Sarama] 2022/02/07 12:06:00 producer/leader/mytopic/0 state change to [retrying-1]
[Sarama] 2022/02/07 12:06:00 producer/leader/mytopic/0 abandoning broker 1001
[Sarama] 2022/02/07 12:06:00 producer/broker/1001 input chan closed
[Sarama] 2022/02/07 12:06:00 producer/broker/1001 shut down
sending 18
[Sarama] 2022/02/07 12:06:00 client/metadata fetching metadata for [mytopic] from broker 0.0.0.0:55727
[Sarama] 2022/02/07 12:06:02 producer/broker/1003 state change to [closing] because read tcp 127.0.0.1:57166->127.0.0.1:55729: i/o timeout
[Sarama] 2022/02/07 12:06:02 Closed connection to broker host.docker.internal:55729
[Sarama] 2022/02/07 12:06:02 producer/leader/mytopic/2 state change to [retrying-1]
[Sarama] 2022/02/07 12:06:02 producer/leader/mytopic/2 abandoning broker 1003
[Sarama] 2022/02/07 12:06:02 producer/broker/1003 input chan closed
[Sarama] 2022/02/07 12:06:02 producer/broker/1003 shut down
sending 19
[Sarama] 2022/02/07 12:06:02 client/metadata fetching metadata for [mytopic] from broker 0.0.0.0:55727
sending 20
[Sarama] 2022/02/07 12:06:06 producer/broker/1002 state change to [closing] because read tcp 127.0.0.1:57147->127.0.0.1:55727: i/o timeout
[Sarama] 2022/02/07 12:06:06 Closed connection to broker host.docker.internal:55727
[Sarama] 2022/02/07 12:06:06 producer/leader/mytopic/1 state change to [retrying-1]
[Sarama] 2022/02/07 12:06:06 producer/leader/mytopic/1 abandoning broker 1002
[Sarama] 2022/02/07 12:06:06 producer/broker/1002 input chan closed
[Sarama] 2022/02/07 12:06:06 producer/broker/1002 shut down
sending 21
[Sarama] 2022/02/07 12:06:06 client/metadata fetching metadata for [mytopic] from broker 0.0.0.0:55727
sending 22
sending 23
sending 24
sending 25
sending 26
[Sarama] 2022/02/07 12:06:17 producer/broker/1001 starting up
[Sarama] 2022/02/07 12:06:17 producer/broker/1001 state change to [open] on mytopic/0
[Sarama] 2022/02/07 12:06:17 producer/leader/mytopic/0 selected broker 1001
[Sarama] 2022/02/07 12:06:17 producer/leader/mytopic/0 state change to [flushing-1]
[Sarama] 2022/02/07 12:06:17 producer/leader/mytopic/0 state change to [normal]
[Sarama] 2022/02/07 12:06:17 Connected to broker at host.docker.internal:55728 (registered as #1001)
[Sarama] 2022/02/07 12:06:17 producer/broker/1003 starting up
[Sarama] 2022/02/07 12:06:17 producer/broker/1003 state change to [open] on mytopic/2
[Sarama] 2022/02/07 12:06:17 producer/leader/mytopic/2 selected broker 1003
[Sarama] 2022/02/07 12:06:17 producer/leader/mytopic/2 state change to [flushing-1]
[Sarama] 2022/02/07 12:06:17 producer/leader/mytopic/2 state change to [normal]
[Sarama] 2022/02/07 12:06:17 Connected to broker at host.docker.internal:55729 (registered as #1003)
[Sarama] 2022/02/07 12:06:17 producer/broker/1002 starting up
[Sarama] 2022/02/07 12:06:17 producer/broker/1002 state change to [open] on mytopic/1
[Sarama] 2022/02/07 12:06:17 producer/leader/mytopic/1 selected broker 1002
[Sarama] 2022/02/07 12:06:17 producer/leader/mytopic/1 state change to [flushing-1]
[Sarama] 2022/02/07 12:06:17 producer/leader/mytopic/1 state change to [normal]
[Sarama] 2022/02/07 12:06:17 Connected to broker at host.docker.internal:55727 (registered as #1002)
sent 4 2 478 <nil>
sent 12 2 480 <nil>
sent 22 2 482 <nil>
sent 13 2 481 <nil>
sent 25 2 483 <nil>
sent 10 2 479 <nil>
sent 5 0 476 <nil>
sent 3 0 475 <nil>
sent 11 0 478 <nil>
sent 9 0 477 <nil>
sent 19 0 480 <nil>
sent 23 0 482 <nil>
sent 21 0 481 <nil>
sent 16 0 479 <nil>
sent 7 1 407 <nil>
sent 6 1 406 <nil>
sent 14 1 409 <nil>
sent 8 1 408 <nil>
sent 17 1 411 <nil>
sent 15 1 410 <nil>
sent 20 1 413 <nil>
sent 18 1 412 <nil>
sent 26 1 415 <nil>
sent 24 1 414 <nil>
sending 27
sent 27 0 483 <nil>
sending 28
[Sarama] 2022/02/07 12:06:20 producer/broker/1003 state change to [retrying] on mytopic/2 because kafka server: Tried to send a message to a replica that is not the leader for some partition. Your metadata is out of date.
[Sarama] 2022/02/07 12:06:20 producer/leader/mytopic/2 state change to [retrying-1]
[Sarama] 2022/02/07 12:06:20 producer/leader/mytopic/2 abandoning broker 1003
[Sarama] 2022/02/07 12:06:20 producer/broker/1003 state change to [closed] on mytopic/2
[Sarama] 2022/02/07 12:06:20 producer/broker/1003 input chan closed
[Sarama] 2022/02/07 12:06:20 producer/broker/1003 shut down
[Sarama] 2022/02/07 12:06:20 client/metadata fetching metadata for [mytopic] from broker 0.0.0.0:55727
[Sarama] 2022/02/07 12:06:20 producer/leader/mytopic/2 selected broker 1002
[Sarama] 2022/02/07 12:06:20 producer/broker/1002 state change to [open] on mytopic/2
[Sarama] 2022/02/07 12:06:20 producer/leader/mytopic/2 state change to [flushing-1]
[Sarama] 2022/02/07 12:06:20 producer/leader/mytopic/2 state change to [normal]
sent 28 2 477 <nil>
sending 29
sent 29 1 416 <nil>
sending 30
[Sarama] 2022/02/07 12:06:24 producer/broker/1001 state change to [retrying] on mytopic/0 because kafka server: Tried to send a message to a replica that is not the leader for some partition. Your metadata is out of date.
[Sarama] 2022/02/07 12:06:24 producer/leader/mytopic/0 state change to [retrying-1]
[Sarama] 2022/02/07 12:06:24 producer/leader/mytopic/0 abandoning broker 1001
[Sarama] 2022/02/07 12:06:24 producer/broker/1001 state change to [closed] on mytopic/0
[Sarama] 2022/02/07 12:06:24 producer/broker/1001 input chan closed
[Sarama] 2022/02/07 12:06:24 producer/broker/1001 shut down
[Sarama] 2022/02/07 12:06:24 client/metadata fetching metadata for [mytopic] from broker 0.0.0.0:55727
[Sarama] 2022/02/07 12:06:24 producer/leader/mytopic/0 selected broker 1002
[Sarama] 2022/02/07 12:06:24 producer/broker/1002 state change to [open] on mytopic/0
[Sarama] 2022/02/07 12:06:24 producer/leader/mytopic/0 state change to [flushing-1]
[Sarama] 2022/02/07 12:06:24 producer/leader/mytopic/0 state change to [normal]
sent 30 0 474 <nil>
sending 31
sent 31 1 417 <nil>
sending 32
sent 32 2 478 <nil>

Problem Description

I think this is the same issue as what was described in #2121 (comment) and #2121 (comment) in #2121 but since they seem not to be exactly the same problem as reported in the original issue, I decided to open a new issue.

@dnwe
Copy link
Collaborator

dnwe commented Feb 7, 2022

@slaunay I haven't had a chance to look into these reports about the Sync producer deadlocking yet, but I'm guessing they are due the changes that were made to the async Producer in #2094 (the sync producer just wrappers the async one in a sync get of the future)

slaunay added a commit to slaunay/sarama that referenced this issue Feb 8, 2022
- add unit test to reproduce the deadlock by simulating a network error
- document possible deadlock when closing the Broker from an AsyncProduce
  callback when handling a response error
- add closeBroker goroutine and channel to asynchronously close a Broker
- reuse the stopchan channel to signal that the closeBroker goroutine is
  done
- update TestBrokerProducerShutdown to check goroutine leak by closing
  the input vs the stopchan channel
- fixes IBM#2129
slaunay added a commit to slaunay/sarama that referenced this issue Feb 8, 2022
- add unit test to reproduce the deadlock by simulating a network error
- document possible deadlock when closing the Broker from an AsyncProduce
  callback when handling a response error
- add closeBroker goroutine and channel to asynchronously close a Broker
  once
- reuse the stopchan channel to signal that the closeBroker goroutine is
  done
- update TestBrokerProducerShutdown to check goroutine leak by closing
  the input vs the stopchan channel
- fixes IBM#2129
@dnwe dnwe closed this as completed in 06513c1 Feb 13, 2022
docmerlin added a commit to influxdata/kapacitor that referenced this issue Mar 10, 2022
docmerlin added a commit to influxdata/kapacitor that referenced this issue Mar 10, 2022
docmerlin added a commit to influxdata/kapacitor that referenced this issue Mar 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants