Producer.Close() hangs in syscall to rd_kafka_destroy #463
Comments
Looking at the code: control flow makes it through most of the method, stalling on the last line. This means the channels are closed before the destroy is attempted, so no delivery notifications will be processed from that point on. I'm wondering what would happen if we try to close a producer with in-flight messages, where there are outstanding delivery notifications. What if a notification arrives while we are halfway through? This is pure speculation, of course, as I'm unfamiliar with the inner workings here.
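To make that concern concrete, here is a minimal, purely illustrative Go sketch. It is not confluent-kafka-go's actual code and every name in it is made up; it only shows how a teardown that stops servicing an events channel and then waits on outstanding work can block forever if an in-flight delivery report arrives late.

```go
// Hypothetical sketch only; this does not reproduce confluent-kafka-go internals.
package main

import (
	"fmt"
	"time"
)

func main() {
	events := make(chan string) // unbuffered: every send needs a live reader
	done := make(chan struct{})

	// Stand-in for the goroutine that forwards librdkafka delivery
	// reports to the application-facing events channel.
	go func() {
		time.Sleep(50 * time.Millisecond) // an in-flight delivery arrives late
		events <- "delivery report"       // blocks forever: nobody is reading anymore
		close(done)
	}()

	// Stand-in for a Close()-like teardown: stop reading events first,
	// then wait for the forwarding goroutine, which never finishes.
	select {
	case <-done:
		fmt.Println("teardown completed")
	case <-time.After(2 * time.Second):
		fmt.Println("teardown stuck waiting on an in-flight delivery report")
	}
}
```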
Some more data: it appears the issue is not limited to producers; it can also affect consumers. Here is a trace showing a hung consumer:
I am seeing this as well with v1.4.0.
Are all other references to the consumer or producer deleted prior to calling Close()?
For the consumer, yes. For the producer, there is just a goroutine consuming from the events channel.
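For context, the setup being described is roughly the following pattern (a sketch against the v1 confluent-kafka-go API; the broker address and error handling are placeholders, not the reporter's actual code):

```go
package main

import (
	"fmt"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

func main() {
	p, err := kafka.NewProducer(&kafka.ConfigMap{"bootstrap.servers": "localhost:9092"})
	if err != nil {
		panic(err)
	}

	// The only other reference to the producer: a goroutine draining
	// the events channel and logging failed deliveries.
	go func() {
		for e := range p.Events() {
			if m, ok := e.(*kafka.Message); ok && m.TopicPartition.Error != nil {
				fmt.Println("delivery failed:", m.TopicPartition.Error)
			}
		}
	}()

	// ... produce messages elsewhere ...

	p.Close() // the call that occasionally never returns
}
```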
Is there any possibility to spin up gdb or pstack to get a backtrace of the client when it is in a stalled state?
I'm seeing this frequently as well with my consumer running 2.2.0. I ran gdb and looked at the backtraces for all threads – these seem to be the relevant ones. @edenhill is this helpful at all?
Full backtrace log: output.log |
Description
The problem occurred as part of volume testing goneli with aggressive injection of random faults, against a Kafka broker that had been (intentionally) subjected to very high load.
The test code frequently creates and destroys consumer and producer instances.
On one such occasion, Producer.Close() was left in a blocked state. The stack trace shows this:
How to reproduce
Rapidly cycle through producer instances: open a producer, produce some messages, then close the producer with some messages still in flight, without flushing or waiting for delivery confirmations. (I suspect this last point is pivotal to being able to reproduce this.)
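A minimal sketch of that cycle, assuming the v1 confluent-kafka-go API (the broker address, topic name, and iteration counts are placeholders):

```go
package main

import (
	"github.com/confluentinc/confluent-kafka-go/kafka"
)

func main() {
	topic := "stress-topic" // placeholder topic
	for i := 0; i < 10000; i++ {
		p, err := kafka.NewProducer(&kafka.ConfigMap{"bootstrap.servers": "localhost:9092"})
		if err != nil {
			panic(err)
		}

		// Drain delivery events so Produce never blocks on a full channel.
		go func() {
			for range p.Events() {
			}
		}()

		for j := 0; j < 100; j++ {
			_ = p.Produce(&kafka.Message{
				TopicPartition: kafka.TopicPartition{Topic: &topic, Partition: kafka.PartitionAny},
				Value:          []byte("payload"),
			}, nil)
		}

		// Deliberately no p.Flush(...): close while deliveries are still in flight.
		p.Close()
	}
}
```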
The specific test that caused the issue is here; however, it took several days of uninterrupted running for the issue to surface.
Checklist
Please provide the following information:
- Client version (LibraryVersion()): 1.4
- Client configuration: ConfigMap{...}
- Client logs ("debug": ".." as necessary)