
rabbit_feature_flags: Retry after `erpc:call()` fails with `noconnection` #8411

Merged

dumbbell merged 2 commits into main from retry-feature-flags-rpcs on Jun 6, 2023

Conversation

dumbbell (Member)

Why

There could be a transient network issue. Let's give a few more chances to perform the requested RPC call.

How

We retry until the given timeout is reached, if any.

To honor that timeout, we measure the time taken by the RPC call itself. We also sleep between retries. Before each retry, the timeout is reduced by the total of the time taken by the RPC call and the sleep.
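
As a rough illustration, here is a minimal sketch of that loop, not the actual rabbit_feature_flags code: the module and function names and the 500 ms pause between retries are assumptions for the example. The `infinity` case is handled separately, as described in the V2 note of the commit message further down.

```erlang
-module(rpc_retry_sketch).
-export([call_with_retries/5]).

-define(SLEEP_MS, 500). %% assumed pause between retries

call_with_retries(Node, Mod, Fun, Args, Timeout)
  when is_integer(Timeout), Timeout > 0 ->
    T0 = erlang:monotonic_time(millisecond),
    try
        erpc:call(Node, Mod, Fun, Args, Timeout)
    catch
        error:{erpc, noconnection} ->
            timer:sleep(?SLEEP_MS),
            %% Deduct the failed call's duration plus the sleep from the
            %% remaining budget before retrying.
            Elapsed = erlang:monotonic_time(millisecond) - T0,
            call_with_retries(Node, Mod, Fun, Args, Timeout - Elapsed)
    end;
call_with_retries(_Node, _Mod, _Fun, _Args, _Timeout) ->
    %% Budget exhausted: give up with a timeout-like error.
    error({erpc, timeout}).
```

Using `erlang:monotonic_time/1` rather than wall-clock time keeps the budget arithmetic immune to system clock adjustments.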

References #8346.

dumbbell added this to the 3.13.0 milestone on May 30, 2023
dumbbell self-assigned this on May 30, 2023
michaelklishin (Member) commented May 30, 2023

In other places (reconnection to peers/schema DB replicas after a restart, for instance), we retry N times every M milliseconds up until the total timeout T is reached. This can work, too, but the above implementation is very easy to reason about.

dumbbell (Member, Author)

By "the above implementation", do you mean, "retry N times every M milliseconds"?

I agree, it's easier to read such code. I remember we use it in several places. My problem with that is it may not honor the timeout given as argument if the main job of the function may take a long time. I mean, in the context of this RPC call, if the call itself takes 10 seconds to return noconnection for whatever reason, if this time isn't taken into account, the retries could extend the given timeout significantly.

Measuring the RPC call time allows to honor the given timeout at the cost of a more complicated code for sure.
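
For illustration, with assumed numbers: say the timeout argument is T = 60 seconds and each `erpc:call()` spends 10 seconds before failing with `noconnection`. A fixed "retry N = 10 times every M = 1 second" loop that ignores the calls' duration would run for about 10 × 10 + 9 × 1 = 109 seconds, nearly double the requested timeout; deducting the measured call and sleep time from the remaining budget instead caps the loop at roughly T.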

dumbbell force-pushed the retry-feature-flags-rpcs branch from 71dd368 to 38fa28d on May 30, 2023 13:41
dumbbell force-pushed the retry-feature-flags-rpcs branch from 38fa28d to 9f8b176 on May 30, 2023 16:56
mergify bot added the bazel label on Jun 1, 2023
dumbbell force-pushed the retry-feature-flags-rpcs branch 2 times, most recently from 56ad90c to deb4c74 on June 5, 2023 14:36
dumbbell changed the title to rabbit_feature_flags: Retry after `erpc:call()` fails with `noconnection` on Jun 5, 2023
lukebakken (Collaborator) left a comment


I still can't reproduce the issue using various tricks, but after reviewing this code, I can confirm it will retry as expected. Thanks @dumbbell

dumbbell added 2 commits on June 6, 2023 09:40
[Why]
There could be a transient network issue. Let's give a few more chances
to perform the requested RPC call.

[How]
We retry until the given timeout is reached, if any.

To honor that timeout, we measure the time taken by the RPC call itself.
We also sleep between retries. Before each retry, the timeout is reduced
by the total of the time taken by the RPC call and the sleep.

References #8346.

V2: Treat `infinity` timeout differently. In this case, we never retry
    following a `noconnection` error. The reason is that this timeout is
    used specifically for callbacks executed remotely. We don't know how
    long they take (for instance if there is a lot of data to migrate).
    We don't want an infinite retry loop either, so in this case, we
    don't retry.
[Why]
This is unused in this source file.
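
Expressed against the illustrative sketch in the description above, the V2 rule amounts to one extra clause, placed before the integer-timeout clause so it matches first:

```erlang
%% V2: an unbounded timeout leaves no budget to deduct, so make a single
%% attempt and let a `noconnection' error propagate to the caller.
call_with_retries(Node, Mod, Fun, Args, infinity) ->
    erpc:call(Node, Mod, Fun, Args, infinity);
```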
dumbbell force-pushed the retry-feature-flags-rpcs branch from deb4c74 to f09544b on June 6, 2023 07:41
dumbbell marked this pull request as ready for review on June 6, 2023 10:58
dumbbell merged commit 7f06a08 into main on Jun 6, 2023
dumbbell deleted the retry-feature-flags-rpcs branch on June 6, 2023 10:59
dumbbell added a commit that referenced this pull request Jun 6, 2023
rabbit_feature_flags: Retry after `erpc:call()` fails with `noconnection` (backport #8411)
dumbbell added a commit that referenced this pull request Jun 6, 2023
rabbit_feature_flags: Retry after `erpc:call()` fails with `noconnection` (backport #8411) (backport #8491)
dumbbell added a commit that referenced this pull request Jun 13, 2023
…l_discovery_with_a_subset_of_nodes_coming_online`

[Why]
Now that feature flags compatibility is tested first, before the
Mnesia-specific checks, when a peer is not started yet, the feature
flags check lasts the entire timeout, i.e. one minute. This retry
mechanism was added to feature flags in #8411.

Thus, instead of 20 seconds, the testcase now takes 10 minutes (10
retries of one minute each).
dumbbell added a commit that referenced this pull request Jun 20, 2024
…onnection`"

This reverts commit 8749c60.

[Why]
The patch was supposed to solve an issue that we didn't understand and
that was likely a network/DNS problem outside of RabbitMQ. We know it
didn't solve that issue because it was reported again 6 months after the
initial pull request (#8411).

What we are sure of, however, is that it significantly increased
RabbitMQ's testing time, because the code loops for 10+ minutes if the
remote node is not running.

The retry in the feature flags subsystem was not the right place either:
the `noconnection` error is visible there only because that subsystem
runs early during RabbitMQ startup, and retrying there won't magically
solve a network issue.

There are two ways to create a cluster:
1. peer discovery and this subsystem takes care of retries if necessary
   and appropriate
2. manually using the CLI, in which case the user is responsible for
   starting RabbitMQ nodes and clustering them

Let's revert it until the root cause is really understood.