
rabbit_feature_flags: Retry after `erpc:call()` fails with `noconnection` #8411

Merged

dumbbell merged 2 commits into main from retry-feature-flags-rpcs on Jun 6, 2023

Conversation

dumbbell (Member)

Why

There could be a transient network issue. Let's give a few more chances to perform the requested RPC call.

How

We retry until the given timeout is reached, if any.

To honor that timeout, we measure the time taken by the RPC call itself. We also sleep between retries. Before each retry, the timeout is reduced by the total of the time taken by the RPC call and the sleep.
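
As a rough illustration, here is a minimal sketch of that loop, not the actual rabbit_feature_flags code: the module and function names and the 500 ms pause between retries are assumptions for the example. The `infinity` case is handled separately, as described in the V2 note of the commit message further down.

```erlang
-module(rpc_retry_sketch).
-export([call_with_retries/5]).

-define(SLEEP_MS, 500). %% assumed pause between retries

call_with_retries(Node, Mod, Fun, Args, Timeout)
  when is_integer(Timeout), Timeout > 0 ->
    T0 = erlang:monotonic_time(millisecond),
    try
        erpc:call(Node, Mod, Fun, Args, Timeout)
    catch
        error:{erpc, noconnection} ->
            timer:sleep(?SLEEP_MS),
            %% Deduct the failed call's duration plus the sleep from the
            %% remaining budget before retrying.
            Elapsed = erlang:monotonic_time(millisecond) - T0,
            call_with_retries(Node, Mod, Fun, Args, Timeout - Elapsed)
    end;
call_with_retries(_Node, _Mod, _Fun, _Args, _Timeout) ->
    %% Budget exhausted: give up with a timeout-like error.
    error({erpc, timeout}).
```

Using `erlang:monotonic_time/1` rather than wall-clock time keeps the budget arithmetic immune to system clock adjustments.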

References #8346.

dumbbell added this to the 3.13.0 milestone on May 30, 2023
dumbbell self-assigned this on May 30, 2023
michaelklishin (Member) commented May 30, 2023

In other places (reconnection to peers/schema DB replicas after a restart, for instance), we retry N times every M milliseconds up until the total timeout T is reached. This can work, too, but the above implementation is very easy to reason about.

dumbbell (Member, Author)

By "the above implementation", do you mean, "retry N times every M milliseconds"?

I agree, it's easier to read such code. I remember we use it in several places. My problem with that is it may not honor the timeout given as argument if the main job of the function may take a long time. I mean, in the context of this RPC call, if the call itself takes 10 seconds to return noconnection for whatever reason, if this time isn't taken into account, the retries could extend the given timeout significantly.

Measuring the RPC call time allows to honor the given timeout at the cost of a more complicated code for sure.
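
For illustration, with assumed numbers: say the timeout argument is T = 60 seconds and each `erpc:call()` spends 10 seconds before failing with `noconnection`. A fixed "retry N = 10 times every M = 1 second" loop that ignores the calls' duration would run for about 10 × 10 + 9 × 1 = 109 seconds, nearly double the requested timeout; deducting the measured call and sleep time from the remaining budget instead caps the loop at roughly T.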

dumbbell force-pushed the retry-feature-flags-rpcs branch from 71dd368 to 38fa28d on May 30, 2023 13:41
dumbbell force-pushed the retry-feature-flags-rpcs branch from 38fa28d to 9f8b176 on May 30, 2023 16:56
mergify bot added the bazel label on Jun 1, 2023
dumbbell force-pushed the retry-feature-flags-rpcs branch 2 times, most recently from 56ad90c to deb4c74 on June 5, 2023 14:36
dumbbell changed the title to rabbit_feature_flags: Retry after `erpc:call()` fails with `noconnection` on Jun 5, 2023
lukebakken (Collaborator) left a comment


I still can't reproduce the issue using various tricks, but after reviewing this code, I can confirm it will retry as expected. Thanks @dumbbell

dumbbell added 2 commits on June 6, 2023 09:40
[Why]
There could be a transient network issue. Let's give a few more chances
to perform the requested RPC call.

[How]
We retry until the given timeout is reached, if any.

To honor that timeout, we measure the time taken by the RPC call itself.
We also sleep between retries. Before each retry, the timeout is reduced
by the total of the time taken by the RPC call and the sleep.

References #8346.

V2: Treat `infinity` timeout differently. In this case, we never retry
    following a `noconnection` error. The reason is that this timeout is
    used specifically for callbacks executed remotely. We don't know how
    long they take (for instance if there is a lot of data to migrate).
    We don't want an infinite retry loop either, so in this case, we
    don't retry.
[Why]
This is unused in this source file.
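
Expressed against the illustrative sketch in the description above, the V2 rule amounts to one extra clause, placed before the integer-timeout clause so it matches first:

```erlang
%% V2: an unbounded timeout leaves no budget to deduct, so make a single
%% attempt and let a `noconnection' error propagate to the caller.
call_with_retries(Node, Mod, Fun, Args, infinity) ->
    erpc:call(Node, Mod, Fun, Args, infinity);
```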
dumbbell force-pushed the retry-feature-flags-rpcs branch from deb4c74 to f09544b on June 6, 2023 07:41
dumbbell marked this pull request as ready for review on June 6, 2023 10:58
dumbbell merged commit 7f06a08 into main on Jun 6, 2023
dumbbell deleted the retry-feature-flags-rpcs branch on June 6, 2023 10:59
dumbbell added a commit that referenced this pull request Jun 6, 2023
rabbit_feature_flags: Retry after `erpc:call()` fails with `noconnection` (backport #8411)
dumbbell added a commit that referenced this pull request Jun 6, 2023
rabbit_feature_flags: Retry after `erpc:call()` fails with `noconnection` (backport #8411) (backport #8491)
dumbbell added a commit that referenced this pull request Jun 13, 2023
…l_discovery_with_a_subset_of_nodes_coming_online`

[Why]
Now that feature flags compatibility is tested first, before the
Mnesia-specific checks, when a peer is not started yet, the feature
flags check lasts the entire timeout, i.e. one minute. This retry
mechanism was added to feature flags in #8411.

Thus, instead of 20 seconds, the testcase now takes 10 minutes (10
retries of one minute each).
dumbbell added a commit that referenced this pull request Jun 20, 2024
…onnection`"

This reverts commit 8749c60.

[Why]
The patch was supposed to solve an issue that we didn't understand and
that was likely a network/DNS problem outside of RabbitMQ. We know it
didn't solve that issue because it was reported again 6 months after the
initial pull request (#8411).

What we are sure of, however, is that it significantly increased
RabbitMQ's testing time, because the code loops for 10+ minutes if the
remote node is not running.

The retry in the feature flags subsystem was not the right place either:
the `noconnection` error is visible there only because that subsystem
runs early during RabbitMQ startup, and retrying there won't magically
solve a network issue.

There are two ways to create a cluster:
1. peer discovery and this subsystem takes care of retries if necessary
   and appropriate
2. manually using the CLI, in which case the user is responsible for
   starting RabbitMQ nodes and clustering them

Let's revert it until the root cause is really understood.