Fix coordinator process exiting due to heartbeat race #578

Merged
merged 1 commit into kafka4beam:master from fix_heartbeat_timeout on May 16, 2024

Conversation

laurglia
Contributor

@laurglia laurglia commented May 8, 2024

Brod group coordinator periodically sends heartbeats to the Kafka broker. If it does not receive a response to a heartbeat request within the configured timeout, it exits with reason `hb_timeout`.

There was a race condition where the connection to the Kafka broker was closed after a heartbeat was sent out but before the heartbeat response was received. When this happened, brod still expected a response to the heartbeat; since the connection had been closed, that response never came and the process exited with `hb_timeout`.
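
Roughly, the heartbeat handling can be pictured like this (a minimal illustrative sketch, not brod's actual code; `send_heartbeat/1` and the state fields are made up for the example):

```erlang
%% Illustrative sketch only, not brod's actual code.
-record(state, {connection, hb_ref}).

%% Periodic tick: if there is no heartbeat in flight, send one and
%% remember its request reference.
handle_info(lo_cmd_send_heartbeat, #state{hb_ref = undefined} = State) ->
    Ref = send_heartbeat(State),
    {noreply, State#state{hb_ref = Ref}};
%% If the previous heartbeat is still unanswered when the next tick
%% arrives, the coordinator gives up. When the connection was closed
%% right after the heartbeat went out, no response can ever arrive, so
%% this clause is eventually hit and the process exits with hb_timeout.
handle_info(lo_cmd_send_heartbeat, State) ->
    {stop, hb_timeout, State}.
```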

This error consistently happens once an hour in all of our Elixir deployments that use brod. It appears that Amazon MSK closes the Kafka connection from the broker side every hour, and for some reason always right after the client sends a heartbeat request. I do not know why this happens, but regardless, the server has the right to close the connection and the application should be able to handle that without producing error noise.

This commit fixes the race condition. Now, when the connection goes down, we remove the reference to the heartbeat request that was last sent out. With the reference removed, the coordinator no longer expects a response to that heartbeat. Should the connection be re-established, the coordinator starts sending new heartbeat requests as usual.
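
In terms of the sketch above, the fix amounts to clearing the pending heartbeat reference in the connection-down handler (again only an illustrative sketch, not the actual diff):

```erlang
%% When the broker connection goes down, also forget the in-flight
%% heartbeat reference so the coordinator stops waiting for a reply that
%% can never arrive. Heartbeats resume normally once the connection is
%% re-established.
handle_info({'DOWN', _Mref, process, Conn, _Reason},
            #state{connection = Conn} = State) ->
    {noreply, State#state{connection = undefined,
                          hb_ref = undefined}}.
```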

I tested the solution on my own computer by putting a custom TCP proxy in front of Kafka that let me terminate connections and introduce additional latency. With this setup, I was able to verify that the previous version produced the same errors we saw in production and that with the changes they no longer showed up.

These are the errors that showed up in our logs:

```
Process #PID<0.19777.11> terminating
** (exit) :hb_timeout
    (stdlib 4.2) gen_server.erl:1241: :gen_server.handle_common_reply/8
    (stdlib 4.2) proc_lib.erl:240: :proc_lib.init_p_do_apply/3
Initial Call: :brod_group_coordinator.init/1
Ancestors: [#PID<0.19775.11>, CallRouter.Supervisor, #PID<0.4065.0>]
Neighbours:
    #PID<0.6845.12>
        Initial Call: :kpro_connection.init/4
        Current Call: :kpro_connection.loop/2
        Ancestors: [#PID<0.19777.11>, #PID<0.19775.11>, CallRouter.Supervisor, #PID<0.4065.0>]
```

```
GenServer #PID<0.1262.11> terminating
** (stop) :hb_timeout
Last message: :lo_cmd_send_heartbeat
```

@laurglia laurglia force-pushed the fix_heartbeat_timeout branch from 4e0cca3 to bf92860 on May 8, 2024 10:59
@laurglia
Contributor Author

laurglia commented May 8, 2024

@zmstone could you take a look at this please?

@laurglia
Contributor Author

@zmstone could you also merge this? And what is the release process here? Could we create a new release with this bug fix immediately?

@zmstone zmstone merged commit 255c8a9 into kafka4beam:master May 16, 2024
11 checks passed
@zmstone
Contributor

zmstone commented May 16, 2024

Hi @laurglia, thank you for the PR. This fix is included in the 3.18.0 release.

@laurglia laurglia deleted the fix_heartbeat_timeout branch May 16, 2024 12:16