rabbit_feature_flags: Retry after `erpc:call()` fails with `noconnection` #8411
Conversation
In other places (reconnection to peers/schema DB replicas after a restart, for instance), we retry N times every M milliseconds until the total timeout T is reached. This can work too, but the above implementation is very easy to reason about.
By "the above implementation", do you mean, "retry N times every M milliseconds"? I agree, it's easier to read such code. I remember we use it in several places. My problem with that is it may not honor the timeout given as argument if the main job of the function may take a long time. I mean, in the context of this RPC call, if the call itself takes 10 seconds to return Measuring the RPC call time allows to honor the given timeout at the cost of a more complicated code for sure. |
I still can't reproduce the issue using various tricks, but having reviewed this code, I can confirm it will retry as expected. Thanks @dumbbell
[Why] There could be a transient network issue. Let's give a few more chances to perform the requested RPC call.

[How] We retry until the given timeout is reached, if any. To honor that timeout, we measure the time taken by the RPC call itself. We also sleep between retries. Before each retry, the timeout is reduced by the total of the time taken by the RPC call and the sleep.

References #8346.

V2: Treat `infinity` timeout differently. In this case, we never retry following a `noconnection` error. The reason is that this timeout is used specifically for callbacks executed remotely. We don't know how long they take (for instance, if there is a lot of data to migrate). We don't want an infinite retry loop either, so in this case, we don't retry.
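As a rough illustration of the V2 behavior described in this commit message, here is a sketch with hypothetical names, reusing the `rpc_call_with_retry/6` sketch from the discussion above. An integer timeout goes through the time-accounting retry loop; `infinity` is passed straight through, so a `noconnection` error propagates without any retry:

```erlang
%% Sketch (hypothetical names): dispatch on the timeout value.
call_remote(Node, Mod, Fun, Args, infinity) ->
    %% `infinity` is used for remote callbacks of unknown duration:
    %% never retry, to avoid an infinite retry loop.
    erpc:call(Node, Mod, Fun, Args, infinity);
call_remote(Node, Mod, Fun, Args, Timeout) when is_integer(Timeout) ->
    %% Finite timeout: retry on `noconnection` while honoring the
    %% remaining budget (sleep duration chosen arbitrarily here).
    rpc_call_with_retry(Node, Mod, Fun, Args, Timeout, _SleepMs = 1000).
```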
[Why] This is unused in this source file.
rabbit_feature_flags: Retry after `erpc:call()` fails with `noconnection` (backport #8411)
…l_discovery_with_a_subset_of_nodes_coming_online` [Why] Now that feature flags compatibility is tested first, before the Mnesia-specific checks, the feature flags check lasts the entire timeout (one minute) when a peer is not started yet. This retry mechanism was added to feature flags in #8411. Thus, instead of 20 seconds, the testcase now takes 10 minutes (10 retries of one minute each).
…onnection`" This reverts commit 8749c60.

[Why] The patch was supposed to solve an issue that we didn't understand and that was likely a network/DNS problem outside of RabbitMQ. We know it didn't solve that issue because it was reported again 6 months after the initial pull request (#8411). What we are sure of, however, is that it increased RabbitMQ's testing time significantly, because the code loops for 10+ minutes if the remote node is not running.

The retry in the feature flags subsystem was not the right place either. The `noconnection` error is visible there because the subsystem runs early during RabbitMQ startup, but retrying there won't magically solve a network issue. There are two ways to create a cluster:
1. peer discovery, where the peer discovery subsystem takes care of retries if necessary and appropriate;
2. manually using the CLI, in which case the user is responsible for starting RabbitMQ nodes and clustering them.

Let's revert it until the root cause is really understood. (cherry picked from commit d0c13b4)
Why
There could be a transient network issue. Let's give a few more chances to perform the requested RPC call.
How
We retry until the given timeout is reached, if any.
To honor that timeout, we measure the time taken by the RPC call itself. We also sleep between retries. Before each retry, the remaining timeout is reduced by the time spent in the RPC call plus the sleep.
References #8346.