
Feature flags detection sometimes triggers erpc,noconnection #8346

Closed

lukebakken opened this issue May 25, 2023 · 9 comments

@lukebakken
Collaborator

lukebakken commented May 25, 2023

Describe the bug

  • Start a RabbitMQ cluster
  • Restart a node
Logs

2023-05-24 01:39:55.227067-07:00 [error] <0.231.0> Feature flags: on node `rabbit@rabbit2`:
2023-05-24 01:39:55.227067-07:00 [error] <0.231.0> Feature flags:   exception error: {erpc,noconnection}
2023-05-24 01:39:55.227067-07:00 [error] <0.231.0> Feature flags:     in function  erpc:call/5 (erpc.erl, line 710)
2023-05-24 01:39:55.227067-07:00 [error] <0.231.0> Feature flags:     in call from rabbit_ff_controller:rpc_call/5 (rabbit_ff_controller.erl, line 1123)
2023-05-24 01:39:55.227067-07:00 [error] <0.231.0> Feature flags:     in call from lists:foreach_1/2 (lists.erl, line 1442)
2023-05-24 01:39:55.227067-07:00 [error] <0.231.0> Feature flags:     in call from rabbit_feature_flags:check_node_compatibility_v1/2 (rabbit_feature_flags.erl, line 1599)
2023-05-24 01:39:55.227067-07:00 [error] <0.231.0> Feature flags:     in call from rabbit_mnesia:check_rabbit_consistency/2 (rabbit_mnesia.erl, line 1017)
2023-05-24 01:39:55.227067-07:00 [error] <0.231.0> Feature flags:     in call from rabbit_mnesia:check_consistency/5 (rabbit_mnesia.erl, line 948)
2023-05-24 01:39:55.227067-07:00 [error] <0.231.0> Feature flags:     in call from rabbit_mnesia:check_cluster_consistency/2 (rabbit_mnesia.erl, line 746)
2023-05-24 01:39:55.227067-07:00 [error] <0.231.0> Feature flags:     in call from lists:foldl/3 (lists.erl, line 1350)
2023-05-24 01:39:55.227067-07:00 [error] <0.231.0> 
2023-05-24 01:39:55.243345-07:00 [error] <0.277.0> Mnesia(rabbit@rabbit3): ** ERROR ** Mnesia on rabbit@rabbit3 could not connect to node(s) [rabbit@rabbit2]

Reproduction steps

See above.

Expected behavior

No erpc error: either the call is retried, or it is not attempted until disterl is definitely up and running.
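
For illustration, here is a minimal sketch of the second alternative (do not issue the RPC until the peer is reachable over disterl), using the standard `net_adm:ping/1`; the module, function name, and polling interval are made up:

```erlang
%% Sketch only: poll until the remote node answers a distribution ping,
%% giving up once the timeout budget (in milliseconds) is exhausted.
-module(wait_for_node_sketch).
-export([wait_for_node/2]).

wait_for_node(_Node, Timeout) when Timeout =< 0 ->
    {error, noconnection};
wait_for_node(Node, Timeout) ->
    case net_adm:ping(Node) of
        %% pong means the distribution connection is established.
        pong -> ok;
        pang ->
            timer:sleep(500),
            wait_for_node(Node, Timeout - 500)
    end.
```

A caller would do something like `wait_for_node('rabbit@rabbit2', 60000)` before issuing the feature-flags RPC.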

Additional context

Observed in the following situations:

@michaelklishin
Member

I think the expected behavior should be "the operation is retried N times" :)

dumbbell added a commit that referenced this issue May 30, 2023
dumbbell added a commit that referenced this issue May 30, 2023
[Why]
There could be a transient network issue. Let's give a few more chances
to perform the requested RPC call.

[How]
We retry until the given timeout is reached, if any.

To honor that timeout, we measure the time taken by the RPC call itself.
We also sleep between retries. Before each retry, the timeout is reduced
by the total of the time taken by the RPC call and the sleep.

References #8346.

V2: Treat `infinity` timeout differently. In this case, we never retry
    following a `noconnection` error. The reason is that this timeout is
    used specifically for callbacks executed remotely. We don't know how
    long they take (for instance if there is a lot of data to migrate).
    We don't want an infinite retry loop either, so in this case, we
    don't retry.
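
For illustration, here is a minimal sketch of the retry loop described in that commit message, assuming a hypothetical wrapper around `erpc:call/5` (module, function, and constant names below are made up; this is not the actual `rabbit_ff_controller` code):

```erlang
%% Sketch only: retry an erpc call on {erpc,noconnection} until the timeout
%% budget is exhausted. With an `infinity' timeout we never retry, to avoid
%% an endless loop around remote callbacks of unknown duration.
-module(ff_rpc_retry_sketch).
-export([rpc_call/5]).

-define(RETRY_SLEEP_MS, 1000).

rpc_call(Node, Mod, Fun, Args, infinity) ->
    erpc:call(Node, Mod, Fun, Args, infinity);
rpc_call(Node, Mod, Fun, Args, Timeout) when is_integer(Timeout), Timeout > 0 ->
    T0 = erlang:monotonic_time(millisecond),
    try
        erpc:call(Node, Mod, Fun, Args, Timeout)
    catch
        error:{erpc, noconnection} ->
            %% Sleep, then retry with the timeout reduced by the time spent
            %% on the failed call plus the sleep.
            timer:sleep(?RETRY_SLEEP_MS),
            Elapsed = erlang:monotonic_time(millisecond) - T0,
            rpc_call(Node, Mod, Fun, Args, Timeout - Elapsed)
    end;
rpc_call(_Node, _Mod, _Fun, _Args, _Timeout) ->
    %% Budget exhausted: give up with the same error the caller would see.
    error({erpc, noconnection}).
```

The real change lives in `rabbit_ff_controller:rpc_call/5`; this sketch only mirrors the behaviour outlined above.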
dumbbell added a commit that referenced this issue May 30, 2023
dumbbell added a commit that referenced this issue Jun 1, 2023
dumbbell added a commit that referenced this issue Jun 5, 2023
dumbbell added a commit that referenced this issue Jun 6, 2023
mergify bot pushed a commit that referenced this issue Jun 6, 2023
mergify bot pushed a commit that referenced this issue Jun 6, 2023
dumbbell added a commit that referenced this issue Jun 6, 2023
dumbbell added a commit that referenced this issue Jun 6, 2023
@kepakiano

We stumbled over this through user error in #10100 and, as requested, here are the step-by-step instructions to get the same error message. Bear in mind, though, that this happened to me only because I forgot the "rabbit@" prefix when calling join_cluster:

$ docker network create test_network
1947438e01b9cced503ba3044be1afb1f5a6225fb64d265257b3547b947cad64
$ docker run -d --network test_network --name rabbit1 --privileged -v $(pwd)/cookie:/var/lib/rabbitmq/.erlang.cookie pivotalrabbitmq/rabbitmq:main-otp-max-bazel
b29a66ec3350cb7ee60975d3a1b8c0bd7918313f30833be76a113d0ea0c78590
$ docker container ls
CONTAINER ID        IMAGE                                         COMMAND                  CREATED             STATUS              PORTS                                                                                                                      NAMES
b29a66ec3350        pivotalrabbitmq/rabbitmq:main-otp-max-bazel   "docker-entrypoint.s…"   38 seconds ago      Up 36 seconds       1883/tcp, 4369/tcp, 5551-5552/tcp, 5671-5672/tcp, 8883/tcp, 15670-15676/tcp, 15691-15692/tcp, 25672/tcp, 61613-61614/tcp   rabbit1
$ docker exec -it b2 /bin/bash
root@b29a66ec3350:/# rabbitmqctl join_cluster this_node_does_not_exist
Clustering node rabbit@b29a66ec3350 with this_node_does_not_exist

13:03:53.487 [error] Feature flags: error while running:
Feature flags:   rabbit_ff_controller:running_nodes[]
Feature flags: on node `this_node_does_not_exist@b29a66ec3350`:
Feature flags:   exception error: {erpc,noconnection}
Feature flags:     in function  erpc:call/5 (erpc.erl, line 710)
Feature flags:     in call from rabbit_ff_controller:rpc_call/5 (rabbit_ff_controller.erl, line 1377)
Feature flags:     in call from rabbit_ff_controller:list_nodes_clustered_with/1 (rabbit_ff_controller.erl, line 477)
Feature flags:     in call from rabbit_ff_controller:check_node_compatibility_task/2 (rabbit_ff_controller.erl, line 389)
Feature flags:     in call from rabbit_db_cluster:can_join/1 (rabbit_db_cluster.erl, line 65)
Feature flags:     in call from rabbit_db_cluster:join/2 (rabbit_db_cluster.erl, line 97)
Feature flags:     in call from erpc:execute_call/4 (erpc.erl, line 589)

Error:
{:aborted_feature_flags_compat_check, {:error, {:erpc, :noconnection}}}
root@b29a66ec3350:/# 
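
For reference, and assuming the usual rabbitmqctl syntax, the intended invocation passes the full Erlang node name of an existing node, e.g. `rabbitmqctl join_cluster rabbit@<hostname-of-seed-node>`; the transcript above uses a non-existent name on purpose, purely to reproduce the `{erpc,noconnection}` error.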

@michaelklishin
Member

It's not clear to me from this log what exactly logs this message: the node or the shell where rabbitmqctl join_cluster this_node_does_not_exist is executed?

In any case, join_cluster should bail early if it cannot contact the node it is asked to join.

@CarvalhoRod

I don't know if you checked the log on the node that is running when you try to connect, but it's worth checking.

What may be wrong is your /var/lib/rabbitmq/.erlang.cookie: it has to have the same value on all nodes in the cluster.

@michaelklishin
Member

@CarvalhoRod thank you for chiming in, but this is RabbitMQ 101, and @lukebakken is a core team engineer. You can be sure such basics were accounted for.

That said, with #8411 this can probably be closed. If we get more details or observe failure scenarios that are specific to the code and not the setup, we can always file a new issue.

michaelklishin added this to the 3.13.7 milestone Aug 27, 2024
@michaelklishin
Member

Setting the milestone to 3.13.7 because that's the most recent 3.13.x release at the time of writing.

@michaelklishin
Member

Note that the relevant PR was reverted in #11507; I will unset the milestone to reduce confusion.

michaelklishin removed this from the 3.13.7 milestone Sep 12, 2024
@Hussain-f

Hussain-f commented Dec 2, 2024

@lukebakken @michaelklishin just to advise that I am experiencing the exact same issue when trying to cluster on AWS. The Erlang cookie is the same. Both instances/nodes are in the same VPC but in different AZs. Security groups are all set up correctly. Running version 4.0.4.

Node 1: (screenshot)

Node 2: (screenshot)

I have redacted the Erlang cookie except for the last two letters to show this has been accounted for.

Both nodes are up and running before node 2 runs join_cluster to join node 1. That is, systemctl start rabbitmq-server followed by rabbitmqctl start_app both run fine.

I'm not sure what I'm missing.

@michaelklishin
Member

@Hussain-f our team does not appreciate it when issues or PRs are used for questions. Start a discussion; discussions have been around for a few years now.

We cannot suggest much based on the logs of the connecting node alone. Logs from all nodes must be collected and inspected; when a node with a mismatching shared secret connects, the connection target will log a message after refusing the connection.

rabbitmq locked as resolved and limited conversation to collaborators Dec 2, 2024