Fix coordinator process exiting due to heartbeat race #578

Merged
merged 1 commit into kafka4beam:master from fix_heartbeat_timeout on May 16, 2024

Conversation

laurglia
Contributor

@laurglia laurglia commented May 8, 2024

Brod group coordinator periodically sends heartbeats to the Kafka broker. If it does not receive a response to a heartbeat request within the configured timeout, it exits with reason `hb_timeout`.

There was a race condition where the connection to the Kafka broker was closed after a heartbeat was sent out but before the heartbeat response was received. When this happened, brod still expected a response to the heartbeat; since the connection had been closed, that response never came and the process exited with `hb_timeout`.
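
Roughly, the heartbeat handling can be pictured like this (a minimal illustrative sketch, not brod's actual code; `send_heartbeat/1` and the state fields are made up for the example):

```erlang
%% Illustrative sketch only, not brod's actual code.
-record(state, {connection, hb_ref}).

%% Periodic tick: if there is no heartbeat in flight, send one and
%% remember its request reference.
handle_info(lo_cmd_send_heartbeat, #state{hb_ref = undefined} = State) ->
    Ref = send_heartbeat(State),
    {noreply, State#state{hb_ref = Ref}};
%% If the previous heartbeat is still unanswered when the next tick
%% arrives, the coordinator gives up. When the connection was closed
%% right after the heartbeat went out, no response can ever arrive, so
%% this clause is eventually hit and the process exits with hb_timeout.
handle_info(lo_cmd_send_heartbeat, State) ->
    {stop, hb_timeout, State}.
```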

This error consistently happens once an hour in all of our Elixir deployments that use brod. It appears that Amazon MSK closes the Kafka connection from the broker side every hour, and for some reason always right after the client sends a heartbeat request. I do not know why this happens, but regardless, the server has the right to close the connection and the application should be able to handle that without producing error noise.

This commit fixes the race condition. Now, when the connection goes down, we remove the reference to the heartbeat request that was last sent out. With the reference removed, the coordinator no longer expects a response to that heartbeat. Should the connection be re-established, the coordinator starts sending new heartbeat requests as usual.
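
In terms of the sketch above, the fix amounts to clearing the pending heartbeat reference in the connection-down handler (again only an illustrative sketch, not the actual diff):

```erlang
%% When the broker connection goes down, also forget the in-flight
%% heartbeat reference so the coordinator stops waiting for a reply that
%% can never arrive. Heartbeats resume normally once the connection is
%% re-established.
handle_info({'DOWN', _Mref, process, Conn, _Reason},
            #state{connection = Conn} = State) ->
    {noreply, State#state{connection = undefined,
                          hb_ref = undefined}}.
```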

I tested the solution on my own computer by putting a custom TCP proxy in front of Kafka that let me terminate connections and introduce additional latency. With this setup, I was able to verify that the previous version produced the same errors we saw in production and that with the changes they no longer showed up.

These are the errors that showed up in our logs:

```
Process #PID<0.19777.11> terminating
** (exit) :hb_timeout
    (stdlib 4.2) gen_server.erl:1241: :gen_server.handle_common_reply/8
    (stdlib 4.2) proc_lib.erl:240: :proc_lib.init_p_do_apply/3
Initial Call: :brod_group_coordinator.init/1
Ancestors: [#PID<0.19775.11>, CallRouter.Supervisor, #PID<0.4065.0>]
Neighbours:
    #PID<0.6845.12>
        Initial Call: :kpro_connection.init/4
        Current Call: :kpro_connection.loop/2
        Ancestors: [#PID<0.19777.11>, #PID<0.19775.11>, CallRouter.Supervisor, #PID<0.4065.0>]
```

```
GenServer #PID<0.1262.11> terminating
** (stop) :hb_timeout
Last message: :lo_cmd_send_heartbeat
```

@laurglia laurglia force-pushed the fix_heartbeat_timeout branch from 4e0cca3 to bf92860 on May 8, 2024 10:59
@laurglia
Contributor Author

laurglia commented May 8, 2024

@zmstone could you take a look at this please?

@laurglia
Contributor Author

@zmstone could you also merge this? And what is the release process here? Could we create a new release with this bug fix immediately?

@zmstone zmstone merged commit 255c8a9 into kafka4beam:master May 16, 2024
11 checks passed
@zmstone
Contributor

zmstone commented May 16, 2024

Hi @laurglia, thank you for the PR. This fix is included in the 3.18.0 release.

@laurglia laurglia deleted the fix_heartbeat_timeout branch May 16, 2024 12:16