
Feature flags detection sometimes triggers erpc,noconnection #8346

Closed

lukebakken opened this issue May 25, 2023 · 9 comments

@lukebakken
Collaborator

lukebakken commented May 25, 2023

Describe the bug

  • Start a RabbitMQ cluster
  • Restart a node
Logs

2023-05-24 01:39:55.227067-07:00 [error] <0.231.0> Feature flags: on node `rabbit@rabbit2`:
2023-05-24 01:39:55.227067-07:00 [error] <0.231.0> Feature flags:   exception error: {erpc,noconnection}
2023-05-24 01:39:55.227067-07:00 [error] <0.231.0> Feature flags:     in function  erpc:call/5 (erpc.erl, line 710)
2023-05-24 01:39:55.227067-07:00 [error] <0.231.0> Feature flags:     in call from rabbit_ff_controller:rpc_call/5 (rabbit_ff_controller.erl, line 1123)
2023-05-24 01:39:55.227067-07:00 [error] <0.231.0> Feature flags:     in call from lists:foreach_1/2 (lists.erl, line 1442)
2023-05-24 01:39:55.227067-07:00 [error] <0.231.0> Feature flags:     in call from rabbit_feature_flags:check_node_compatibility_v1/2 (rabbit_feature_flags.erl, line 1599)
2023-05-24 01:39:55.227067-07:00 [error] <0.231.0> Feature flags:     in call from rabbit_mnesia:check_rabbit_consistency/2 (rabbit_mnesia.erl, line 1017)
2023-05-24 01:39:55.227067-07:00 [error] <0.231.0> Feature flags:     in call from rabbit_mnesia:check_consistency/5 (rabbit_mnesia.erl, line 948)
2023-05-24 01:39:55.227067-07:00 [error] <0.231.0> Feature flags:     in call from rabbit_mnesia:check_cluster_consistency/2 (rabbit_mnesia.erl, line 746)
2023-05-24 01:39:55.227067-07:00 [error] <0.231.0> Feature flags:     in call from lists:foldl/3 (lists.erl, line 1350)
2023-05-24 01:39:55.227067-07:00 [error] <0.231.0> 
2023-05-24 01:39:55.243345-07:00 [error] <0.277.0> Mnesia(rabbit@rabbit3): ** ERROR ** Mnesia on rabbit@rabbit3 could not connect to node(s) [rabbit@rabbit2]

Reproduction steps

See above.

Expected behavior

No erpc error: either the call is retried, or it is not attempted until disterl is definitely up and running.
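
For illustration, here is a minimal sketch of the second alternative (do not issue the RPC until the peer is reachable over disterl), using the standard `net_adm:ping/1`; the module, function name, and polling interval are made up:

```erlang
%% Sketch only: poll until the remote node answers a distribution ping,
%% giving up once the timeout budget (in milliseconds) is exhausted.
-module(wait_for_node_sketch).
-export([wait_for_node/2]).

wait_for_node(_Node, Timeout) when Timeout =< 0 ->
    {error, noconnection};
wait_for_node(Node, Timeout) ->
    case net_adm:ping(Node) of
        %% pong means the distribution connection is established.
        pong -> ok;
        pang ->
            timer:sleep(500),
            wait_for_node(Node, Timeout - 500)
    end.
```

A caller would do something like `wait_for_node('rabbit@rabbit2', 60000)` before issuing the feature-flags RPC.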

Additional context

Observed in the following situations:

@michaelklishin
Member

I think the expected behavior should be "the operation is retried N times" :)

dumbbell added a commit that referenced this issue May 30, 2023
dumbbell added a commit that referenced this issue May 30, 2023
[Why]
There could be a transient network issue. Let's give a few more chances
to perform the requested RPC call.

[How]
We retry until the given timeout is reached, if any.

To honor that timeout, we measure the time taken by the RPC call itself.
We also sleep between retries. Before each retry, the timeout is reduced
by the total of the time taken by the RPC call and the sleep.

References #8346.

V2: Treat `infinity` timeout differently. In this case, we never retry
    following a `noconnection` error. The reason is that this timeout is
    used specifically for callbacks executed remotely. We don't know how
    long they take (for instance if there is a lot of data to migrate).
    We don't want an infinite retry loop either, so in this case, we
    don't retry.
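
For illustration, here is a minimal sketch of the retry loop described in that commit message, assuming a hypothetical wrapper around `erpc:call/5` (module, function, and constant names below are made up; this is not the actual `rabbit_ff_controller` code):

```erlang
%% Sketch only: retry an erpc call on {erpc,noconnection} until the timeout
%% budget is exhausted. With an `infinity' timeout we never retry, to avoid
%% an endless loop around remote callbacks of unknown duration.
-module(ff_rpc_retry_sketch).
-export([rpc_call/5]).

-define(RETRY_SLEEP_MS, 1000).

rpc_call(Node, Mod, Fun, Args, infinity) ->
    erpc:call(Node, Mod, Fun, Args, infinity);
rpc_call(Node, Mod, Fun, Args, Timeout) when is_integer(Timeout), Timeout > 0 ->
    T0 = erlang:monotonic_time(millisecond),
    try
        erpc:call(Node, Mod, Fun, Args, Timeout)
    catch
        error:{erpc, noconnection} ->
            %% Sleep, then retry with the timeout reduced by the time spent
            %% on the failed call plus the sleep.
            timer:sleep(?RETRY_SLEEP_MS),
            Elapsed = erlang:monotonic_time(millisecond) - T0,
            rpc_call(Node, Mod, Fun, Args, Timeout - Elapsed)
    end;
rpc_call(_Node, _Mod, _Fun, _Args, _Timeout) ->
    %% Budget exhausted: give up with the same error the caller would see.
    error({erpc, noconnection}).
```

The real change lives in `rabbit_ff_controller:rpc_call/5`; this sketch only mirrors the behaviour outlined above.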
dumbbell added a commit that referenced this issue May 30, 2023
dumbbell added a commit that referenced this issue Jun 1, 2023
dumbbell added a commit that referenced this issue Jun 5, 2023
dumbbell added a commit that referenced this issue Jun 6, 2023
mergify bot pushed a commit that referenced this issue Jun 6, 2023
mergify bot pushed a commit that referenced this issue Jun 6, 2023
dumbbell added a commit that referenced this issue Jun 6, 2023
dumbbell added a commit that referenced this issue Jun 6, 2023
@kepakiano

We stumbled over this through user error in #10100 and, as requested, here are the step-by-step instructions to get the same error message. Bear in mind, though, that this happened to me only because I forgot the "rabbit@" prefix when calling join_cluster:

$ docker network create test_network
1947438e01b9cced503ba3044be1afb1f5a6225fb64d265257b3547b947cad64
$ docker run -d --network test_network --name rabbit1 --privileged -v $(pwd)/cookie:/var/lib/rabbitmq/.erlang.cookie pivotalrabbitmq/rabbitmq:main-otp-max-bazel
b29a66ec3350cb7ee60975d3a1b8c0bd7918313f30833be76a113d0ea0c78590
$ docker container ls
CONTAINER ID        IMAGE                                         COMMAND                  CREATED             STATUS              PORTS                                                                                                                      NAMES
b29a66ec3350        pivotalrabbitmq/rabbitmq:main-otp-max-bazel   "docker-entrypoint.s…"   38 seconds ago      Up 36 seconds       1883/tcp, 4369/tcp, 5551-5552/tcp, 5671-5672/tcp, 8883/tcp, 15670-15676/tcp, 15691-15692/tcp, 25672/tcp, 61613-61614/tcp   rabbit1
$ docker exec -it b2 /bin/bash
root@b29a66ec3350:/# rabbitmqctl join_cluster this_node_does_not_exist
Clustering node rabbit@b29a66ec3350 with this_node_does_not_exist

13:03:53.487 [error] Feature flags: error while running:
Feature flags:   rabbit_ff_controller:running_nodes[]
Feature flags: on node `this_node_does_not_exist@b29a66ec3350`:
Feature flags:   exception error: {erpc,noconnection}
Feature flags:     in function  erpc:call/5 (erpc.erl, line 710)
Feature flags:     in call from rabbit_ff_controller:rpc_call/5 (rabbit_ff_controller.erl, line 1377)
Feature flags:     in call from rabbit_ff_controller:list_nodes_clustered_with/1 (rabbit_ff_controller.erl, line 477)
Feature flags:     in call from rabbit_ff_controller:check_node_compatibility_task/2 (rabbit_ff_controller.erl, line 389)
Feature flags:     in call from rabbit_db_cluster:can_join/1 (rabbit_db_cluster.erl, line 65)
Feature flags:     in call from rabbit_db_cluster:join/2 (rabbit_db_cluster.erl, line 97)
Feature flags:     in call from erpc:execute_call/4 (erpc.erl, line 589)

Error:
{:aborted_feature_flags_compat_check, {:error, {:erpc, :noconnection}}}
root@b29a66ec3350:/# 
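
For reference, and assuming the usual rabbitmqctl syntax, the intended invocation passes the full Erlang node name of an existing node, e.g. `rabbitmqctl join_cluster rabbit@<hostname-of-seed-node>`; the transcript above uses a non-existent name on purpose, purely to reproduce the `{erpc,noconnection}` error.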

@michaelklishin
Member

It's not clear to me from this log what exactly logs this message: the node or the shell where rabbitmqctl join_cluster this_node_does_not_exist is executed?

In any case, join_cluster should bail early if it cannot contact the node it is asked to join.

@CarvalhoRod

I don't know if you checked the log on the node that is running when you try to connect, but it's worth checking.

What may be wrong is your /var/lib/rabbitmq/.erlang.cookie: it has to have the same value on all nodes in the cluster.

@michaelklishin
Member

@CarvalhoRod thank you for chiming in, but this is RabbitMQ 101, and @lukebakken is a core team engineer. You can be sure such basics were accounted for.

That said, with #8411 this can probably be closed. If we get more details or observe failure scenarios that are specific to the code and not the setup, we can always file a new issue.

michaelklishin added this to the 3.13.7 milestone Aug 27, 2024
@michaelklishin
Member

Setting the milestone to 3.13.7 because that's the most recent 3.13.x release at the time of writing.

@michaelklishin
Member

Note that the relevant PR was reverted in #11507; I will unset the milestone to reduce confusion.

michaelklishin removed this from the 3.13.7 milestone Sep 12, 2024
@Hussain-f

Hussain-f commented Dec 2, 2024

@lukebakken @michaelklishin just to advise that I am experiencing the exact same issue when trying to cluster on AWS. The Erlang cookie is the same. Both instances/nodes are in the same VPC but in different AZs. Security groups are all set up correctly. Running version 4.0.4.

Node 1: (screenshot)

Node 2: (screenshot)

I have redacted the Erlang cookie except for the last two letters to show this has been accounted for.

Both nodes are up and running before node 2 runs join_cluster to join node 1. That is, systemctl start rabbitmq-server followed by rabbitmqctl start_app both run fine.

I'm not sure what I'm missing.

@michaelklishin
Member

@Hussain-f our team does not appreciate it when issues or PRs are used for questions. Start a discussion; discussions have been around for a few years now.

We cannot suggest much based on the logs of the connecting node alone. Logs from all nodes must be collected and inspected; when a node with a mismatching shared secret connects, the connection target will log a message after refusing the connection.

rabbitmq locked as resolved and limited conversation to collaborators Dec 2, 2024