RabbitMQ does not auto-recover from pause_minority due to a feature flags issue #8114

Closed
lukebakken opened this issue May 5, 2023 · 9 comments

@lukebakken
Collaborator

Describe the bug

See the README here:

https://github.com/lukebakken/rabbitmq-server-8113

Once network connectivity is restored, the paused node fails with {erpc,noconnection} instead of rejoining the cluster.

Reproduction steps

https://github.com/lukebakken/rabbitmq-server-8113#reproduction-steps

Expected behavior

The paused node should automatically come back up when network connectivity is restored.

Additional context

RabbitMQ 3.11.15, Erlang 25.3.1 (current docker image).

@lukebakken lukebakken added the bug label May 5, 2023
@lukebakken lukebakken self-assigned this May 5, 2023
@michaelklishin
Member

Interesting that it runs into

Error during startup: {error,incompatible_feature_flags}

upon boot.

@dcorbacho
Contributor

Interesting that it runs into

Error during startup: {error,incompatible_feature_flags}

upon boot.

Syncing feature flags catches the error and returns it tagged like that. It's not the most descriptive tag; error_syncing_feature_flags would probably be more accurate nowadays, since feature flags are becoming more complex and can fail for many reasons. See: https://github.com/rabbitmq/rabbitmq-server/blob/main/deps/rabbit/src/rabbit_mnesia.erl#L678

@dumbbell
Member

@lukebakken shared the entire log files in our internal Slack, I'm attaching them here.
[email protected]
[email protected]
[email protected]

I tried to reproduce this as explained but haven't been able to so far. The logs above are from RabbitMQ 3.11.15, but RabbitMQ 3.11.16 was picked up when I executed the script.

There is no evidence in the logs above that this is something related to feature flags: the rmq0 Erlang node failed to open a connection to both rmq1 and rmq2 when it wanted to restart:

2023-05-05 22:57:05.477001+00:00 [warning] <0.2275.0> 'global' at '[email protected]' failed to connect to '[email protected]'
2023-05-05 22:57:05.477001+00:00 [warning] <0.2275.0> 
2023-05-05 22:57:05.495009+00:00 [notice] <0.44.0> Application mnesia exited with reason: stopped
2023-05-05 22:57:05.496061+00:00 [notice] <0.2315.0> Feature flags: checking nodes `[email protected]` and `[email protected]` compatibility...
2023-05-05 22:57:19.522172+00:00 [error] <0.2363.0> Feature flags: error while running:
2023-05-05 22:57:19.522172+00:00 [error] <0.2363.0> Feature flags:   rabbit_ff_registry:inventory[]
2023-05-05 22:57:19.522172+00:00 [error] <0.2363.0> Feature flags: on node `[email protected]`:
2023-05-05 22:57:19.522172+00:00 [error] <0.2363.0> Feature flags:   exception error: {erpc,noconnection}

We can see that global couldn't contact rmq1 and rabbit_ff_controller couldn't contact rmq2.
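
For anyone reading their own node logs the same way, here is a minimal sketch of a filter that pulls out only the connection-failure lines. The function name `connection_failures` is a hypothetical helper, and the two fixed strings it matches are simply the ones visible in the excerpt above; other failure messages may exist that it would miss.

```shell
#!/bin/sh
# Hypothetical helper (not part of RabbitMQ): print the lines of a node log
# that mention a failed inter-node connection. The two patterns are taken
# from the log excerpt in this comment and are matched as fixed strings.
connection_failures() {
    logfile="$1"
    grep -F -e 'failed to connect to' -e '{erpc,noconnection}' "$logfile"
}

# Example usage (the path is an assumption; pass whatever log file you have):
#   connection_failures /var/log/rabbitmq/node.log
```

If this prints lines for every peer, the problem is more likely name resolution or networking than feature flags themselves.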

I will do some "research" to see if I can use RabbitMQ 3.11.15 instead.

@dumbbell
Member

So far, I was never able to reproduce the error :-/

@michaelklishin
Member

@mkuratczyk this is the issue I've mentioned during the sync-up today

@lukebakken
Collaborator Author

@michaelklishin @dumbbell @mkuratczyk this appears to be fixed in 3.11.16. I'm re-testing with 3.11.15

@lukebakken
Collaborator Author

It took two tries, but I did reproduce it with 3.11.15. The other day I could repro several times in a row, which is why I opened this issue.

I'm going to keep re-trying with 3.11.16 for a bit and will re-open this issue if I can repro.

@lukebakken
Collaborator Author

Yep, I just reproduced with 3.11.16. Same set of reproduction steps.

@lukebakken
Collaborator Author

User error! When re-connecting, docker network connect knows nothing about aliases defined via docker compose. You must re-connect using this command:

docker network connect --alias=rmq0.local rabbitnet rabbitmq-server-8113-rmq0-1
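
A rough sketch of the whole partition/heal cycle with the alias restored, in case it helps anyone repeating the test. The network, container, and alias names are the ones from the command above. The function names and the `DOCKER` override are hypothetical, and using `docker network disconnect` to create the partition is an assumption about the reproduction steps, not something confirmed here.

```shell
#!/bin/sh
# Sketch only. DOCKER can be overridden (e.g. for a dry run); it defaults
# to the real docker CLI.
DOCKER="${DOCKER:-docker}"

# Simulate the partition by detaching the container from the network
# (assumption: this is how the partition is created).
partition_node() {
    network="$1"; container="$2"
    "$DOCKER" network disconnect "$network" "$container"
}

# Heal the partition and restore the alias that docker compose had defined,
# since a plain "docker network connect" does not bring it back.
heal_node() {
    network="$1"; container="$2"; net_alias="$3"
    "$DOCKER" network connect --alias="$net_alias" "$network" "$container"
}

# Print the aliases the container currently has on the given network.
show_aliases() {
    network="$1"; container="$2"
    "$DOCKER" inspect \
        --format "{{json (index .NetworkSettings.Networks \"$network\").Aliases}}" \
        "$container"
}

# Example usage (not run here):
#   partition_node rabbitnet rabbitmq-server-8113-rmq0-1
#   heal_node      rabbitnet rabbitmq-server-8113-rmq0-1 rmq0.local
#   show_aliases   rabbitnet rabbitmq-server-8113-rmq0-1
```

Checking `show_aliases` after healing should make it obvious whether the hostname the other nodes use is actually attached again.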

@lukebakken lukebakken closed this as not planned May 22, 2023
4 participants