Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent node monitor crashing from doing multiple node disconnects in unstable networks #9663

Merged
merged 3 commits into from
Oct 9, 2023
Merged

Prevent node monitor crashing from doing multiple node disconnects in unstable networks #9663

merged 3 commits into from
Oct 9, 2023

Conversation

Ayanda-D
Copy link
Contributor

@Ayanda-D Ayanda-D commented Oct 9, 2023

Proposed Changes

When nodes are already disconnected on erlang dist layer, we want to avoid repeated explicit disconnects
by rabbit_node_monitor when detecting a partial partition. Doing so once until node reconnects is enough.
Repeated disconnections (e.g. up | down messages queued in node monitor's mailbox in unstable networks)
has been seen to cause application controller timeouts and node monitor terminations, which in most cases
fails to recover indefinitely. See crash below. Partial partition disconnect from rabbit@host-309 followed by
{partial_partition_disconnect, 'rabbit@host-309'} request.

This PR also adds announcement of partition handling strategy on the node monitor start-up log (no need to
check environment for this).

2023-09-28 17:01:58.233862+00:00 [error] <0.8214.2> Partial partition disconnect from rabbit@host-309
2023-09-28 17:01:59.460923+00:00 [error] <0.8003.2> Mnesia('rabbit@host-478'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'rabbit@host-309'}
2023-09-28 17:01:59.460923+00:00 [error] <0.8003.2> 
2023-09-28 17:02:03.235095+00:00 [error] <0.8214.2> ** Generic server rabbit_node_monitor terminating
2023-09-28 17:02:03.235095+00:00 [error] <0.8214.2> ** Last message in was {'$gen_cast',
2023-09-28 17:02:03.235095+00:00 [error] <0.8214.2>                            {partial_partition_disconnect,
2023-09-28 17:02:03.235095+00:00 [error] <0.8214.2>                                'rabbit@host-309'}}
2023-09-28 17:02:03.235095+00:00 [error] <0.8214.2> ** When Server state == {state,
2023-09-28 17:02:03.235095+00:00 [error] <0.8214.2>                             {state,
2023-09-28 17:02:03.235095+00:00 [error] <0.8214.2>                                 #{{rabbit,
2023-09-28 17:02:03.235095+00:00 [error] <0.8214.2>                                       'rabbit@host-309'} =>
2023-09-28 17:02:03.235095+00:00 [error] <0.8214.2>                                       #Ref<0.3730966478.3434872835.32113>,
2023-09-28 17:02:03.235095+00:00 [error] <0.8214.2>                                   {rabbit,
2023-09-28 17:02:03.235095+00:00 [error] <0.8214.2>                                       'rabbit@host-704'} =>
2023-09-28 17:02:03.235095+00:00 [error] <0.8214.2>                                       #Ref<0.3730966478.3434872835.32374>,
2023-09-28 17:02:03.235095+00:00 [error] <0.8214.2>                                   {rabbit,
2023-09-28 17:02:03.235095+00:00 [error] <0.8214.2>                                       'rabbit@host-547'} =>
2023-09-28 17:02:03.235095+00:00 [error] <0.8214.2>                                       #Ref<0.3730966478.3435134977.25732>},
2023-09-28 17:02:03.235095+00:00 [error] <0.8214.2>                                 erlang},
2023-09-28 17:02:03.235095+00:00 [error] <0.8214.2>                             ['rabbit@host-309'],
2023-09-28 17:02:03.235095+00:00 [error] <0.8214.2>                             {state,#{},erlang},
2023-09-28 17:02:03.235095+00:00 [error] <0.8214.2>                             undefined,
2023-09-28 17:02:03.235095+00:00 [error] <0.8214.2>                             {erlang,#Ref<0.3730966478.3434610692.108563>},
2023-09-28 17:02:03.235095+00:00 [error] <0.8214.2>                             {restarting,<0.9514.2>},
2023-09-28 17:02:03.235095+00:00 [error] <0.8214.2>                             1695912560,
2023-09-28 17:02:03.235095+00:00 [error] <0.8214.2>                             #{'rabbit@host-309' =>
2023-09-28 17:02:03.235095+00:00 [error] <0.8214.2>                                   1695910801,
2023-09-28 17:02:03.235095+00:00 [error] <0.8214.2>                               'rabbit@host-704' =>
2023-09-28 17:02:03.235095+00:00 [error] <0.8214.2>                                   1695827984,
2023-09-28 17:02:03.235095+00:00 [error] <0.8214.2>                               'rabbit@host-547' =>
2023-09-28 17:02:03.235095+00:00 [error] <0.8214.2>                                   1695916511}}
2023-09-28 17:02:03.235095+00:00 [error] <0.8214.2> ** Reason for termination ==
2023-09-28 17:02:03.235095+00:00 [error] <0.8214.2> ** {{timeout,{gen_server,call,
2023-09-28 17:02:03.235095+00:00 [error] <0.8214.2>                          [application_controller,
2023-09-28 17:02:03.235095+00:00 [error] <0.8214.2>                           {set_env,kernel,dist_auto_connect,never,[]}]}},
2023-09-28 17:02:03.235095+00:00 [error] <0.8214.2>     [{gen_server,call,2,[{file,"gen_server.erl"},{line,239}]},
2023-09-28 17:02:03.235095+00:00 [error] <0.8214.2>      {rabbit_node_monitor,disconnect,1,
2023-09-28 17:02:03.235095+00:00 [error] <0.8214.2>                           [{file,"rabbit_node_monitor.erl"},{line,871}]},
2023-09-28 17:02:03.235095+00:00 [error] <0.8214.2>      {rabbit_node_monitor,handle_cast,2,
2023-09-28 17:02:03.235095+00:00 [error] <0.8214.2>                           [{file,"rabbit_node_monitor.erl"},{line,524}]},
2023-09-28 17:02:03.235095+00:00 [error] <0.8214.2>      {gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,695}]},
2023-09-28 17:02:03.235095+00:00 [error] <0.8214.2>      {gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,771}]},
2023-09-28 17:02:03.235095+00:00 [error] <0.8214.2>      {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}
2023-09-28 17:02:03.235095+00:00 [error] <0.8214.2> 
2023-09-28 17:02:03.235839+00:00 [error] <0.8214.2>   crasher:
2023-09-28 17:02:03.235839+00:00 [error] <0.8214.2>     initial call: rabbit_node_monitor:init/1
2023-09-28 17:02:03.235839+00:00 [error] <0.8214.2>     pid: <0.8214.2>
2023-09-28 17:02:03.235839+00:00 [error] <0.8214.2>     registered_name: rabbit_node_monitor
2023-09-28 17:02:03.235839+00:00 [error] <0.8214.2>     exception exit: {timeout,
2023-09-28 17:02:03.235839+00:00 [error] <0.8214.2>                         {gen_server,call,
2023-09-28 17:02:03.235839+00:00 [error] <0.8214.2>                             [application_controller,
2023-09-28 17:02:03.235839+00:00 [error] <0.8214.2>                              {set_env,kernel,dist_auto_connect,never,[]}]}}
2023-09-28 17:02:03.235839+00:00 [error] <0.8214.2>       in function  gen_server:call/2 (gen_server.erl, line 239)
2023-09-28 17:02:03.235839+00:00 [error] <0.8214.2>       in call from rabbit_node_monitor:disconnect/1 (rabbit_node_monitor.erl, line 871)
2023-09-28 17:02:03.235839+00:00 [error] <0.8214.2>       in call from rabbit_node_monitor:handle_cast/2 (rabbit_node_monitor.erl, line 524)
2023-09-28 17:02:03.235839+00:00 [error] <0.8214.2>       in call from gen_server:try_dispatch/4 (gen_server.erl, line 695)
2023-09-28 17:02:03.235839+00:00 [error] <0.8214.2>       in call from gen_server:handle_msg/6 (gen_server.erl, line 771)
2023-09-28 17:02:03.235839+00:00 [error] <0.8214.2>     ancestors: [rabbit_node_monitor_sup,rabbit_sup,<0.7969.2>]
2023-09-28 17:02:03.235839+00:00 [error] <0.8214.2>     message_queue_len: 6
2023-09-28 17:02:03.235839+00:00 [error] <0.8214.2>     messages: [{'DOWN',#Ref<0.3730966478.3434872835.32113>,process,
2023-09-28 17:02:03.235839+00:00 [error] <0.8214.2>                       {rabbit,'rabbit@host-309'},
2023-09-28 17:02:03.235839+00:00 [error] <0.8214.2>                       noconnection},
2023-09-28 17:02:03.235839+00:00 [error] <0.8214.2>                   {nodedown,'rabbit@host-309',
2023-09-28 17:02:03.235839+00:00 [error] <0.8214.2>                       [{nodedown_reason,connection_closed}]},
2023-09-28 17:02:03.235839+00:00 [error] <0.8214.2>                   {mnesia_system_event,
2023-09-28 17:02:03.235839+00:00 [error] <0.8214.2>                       {mnesia_down,'rabbit@host-309'}},
2023-09-28 17:02:03.235839+00:00 [error] <0.8214.2>                   {nodeup,'rabbit@host-309',[]},
2023-09-28 17:02:03.235839+00:00 [error] <0.8214.2>                   {mnesia_system_event,
2023-09-28 17:02:03.235839+00:00 [error] <0.8214.2>                       {inconsistent_database,running_partitioned_network,
2023-09-28 17:02:03.235839+00:00 [error] <0.8214.2>                           'rabbit@host-309'}},
2023-09-28 17:02:03.235839+00:00 [error] <0.8214.2>                   {'$gen_cast',
2023-09-28 17:02:03.235839+00:00 [error] <0.8214.2>                       {check_partial_partition,
2023-09-28 17:02:03.235839+00:00 [error] <0.8214.2>                           'rabbit@host-309',
2023-09-28 17:02:03.235839+00:00 [error] <0.8214.2>                           'rabbit@host-704',1695910801,
2023-09-28 17:02:03.235839+00:00 [error] <0.8214.2>                           1695912560,1695827984}}]
2023-09-28 17:02:03.235839+00:00 [error] <0.8214.2>     links: [<0.8213.2>,<0.9514.2>,<0.8007.2>]
2023-09-28 17:02:03.235839+00:00 [error] <0.8214.2>     dictionary: [{{xtype_to_module,topic},rabbit_exchange_type_topic}]
2023-09-28 17:02:03.235839+00:00 [error] <0.8214.2>     trap_exit: true
2023-09-28 17:02:03.235839+00:00 [error] <0.8214.2>     status: running
2023-09-28 17:02:03.235839+00:00 [error] <0.8214.2>     heap_size: 17731
2023-09-28 17:02:03.235839+00:00 [error] <0.8214.2>     stack_size: 29
2023-09-28 17:02:03.235839+00:00 [error] <0.8214.2>     reductions: 125675
2023-09-28 17:02:03.235839+00:00 [error] <0.8214.2>   neighbours:
2023-09-28 17:02:03.235839+00:00 [error] <0.8214.2> 
2023-09-28 17:02:03.236854+00:00 [error] <0.8213.2>     supervisor: {local,rabbit_node_monitor_sup}
2023-09-28 17:02:03.236854+00:00 [error] <0.8213.2>     errorContext: child_terminated
2023-09-28 17:02:03.236854+00:00 [error] <0.8213.2>     reason: {timeout,
2023-09-28 17:02:03.236854+00:00 [error] <0.8213.2>                 {gen_server,call,
2023-09-28 17:02:03.236854+00:00 [error] <0.8213.2>                     [application_controller,
2023-09-28 17:02:03.236854+00:00 [error] <0.8213.2>                      {set_env,kernel,dist_auto_connect,never,[]}]}}
2023-09-28 17:02:03.236854+00:00 [error] <0.8213.2>     offender: [{pid,<0.8214.2>},
2023-09-28 17:02:03.236854+00:00 [error] <0.8213.2>                {id,rabbit_node_monitor},
2023-09-28 17:02:03.236854+00:00 [error] <0.8213.2>                {mfargs,{rabbit_node_monitor,start_link,[]}},
2023-09-28 17:02:03.236854+00:00 [error] <0.8213.2>                {restart_type,transient},
2023-09-28 17:02:03.236854+00:00 [error] <0.8213.2>                {shutdown,300000},
2023-09-28 17:02:03.236854+00:00 [error] <0.8213.2>                {child_type,worker}]
2023-09-28 17:02:03.242198+00:00 [info] <0.9679.2> Starting rabbit_node_monitor, might be partitioned from ['rabbit@host-309']
2023-09-28 17:02:08.243026+00:00 [error] <0.8213.2>     supervisor: {local,rabbit_node_monitor_sup}
2023-09-28 17:02:08.243026+00:00 [error] <0.8213.2>     errorContext: start_error
2023-09-28 17:02:08.243026+00:00 [error] <0.8213.2>     reason: {timeout,
2023-09-28 17:02:08.243026+00:00 [error] <0.8213.2>                 {gen_server,call,
2023-09-28 17:02:08.243026+00:00 [error] <0.8213.2>                     [application_controller,
2023-09-28 17:02:08.243026+00:00 [error] <0.8213.2>                      {unset_env,rabbit,rabbit_autoheal_state_after_restart,
2023-09-28 17:02:08.243026+00:00 [error] <0.8213.2>                          []}]}}
2023-09-28 17:02:08.243026+00:00 [error] <0.8213.2>     offender: [{pid,<0.8214.2>},
2023-09-28 17:02:08.243026+00:00 [error] <0.8213.2>                {id,rabbit_node_monitor},
2023-09-28 17:02:08.243026+00:00 [error] <0.8213.2>                {mfargs,{rabbit_node_monitor,start_link,[]}},
2023-09-28 17:02:08.243026+00:00 [error] <0.8213.2>                {restart_type,transient},
2023-09-28 17:02:08.243026+00:00 [error] <0.8213.2>                {shutdown,300000},
2023-09-28 17:02:08.243026+00:00 [error] <0.8213.2>                {child_type,worker}]
2023-09-28 17:02:08.243259+00:00 [error] <0.9679.2>   crasher:
2023-09-28 17:02:08.243259+00:00 [error] <0.9679.2>     initial call: rabbit_node_monitor:init/1
2023-09-28 17:02:08.243259+00:00 [error] <0.9679.2>     pid: <0.9679.2>
2023-09-28 17:02:08.243259+00:00 [error] <0.9679.2>     registered_name: []
2023-09-28 17:02:08.243259+00:00 [error] <0.9679.2>     exception exit: {timeout,
2023-09-28 17:02:08.243259+00:00 [error] <0.9679.2>                         {gen_server,call,
2023-09-28 17:02:08.243259+00:00 [error] <0.9679.2>                             [application_controller,
2023-09-28 17:02:08.243259+00:00 [error] <0.9679.2>                              {unset_env,rabbit,
2023-09-28 17:02:08.243259+00:00 [error] <0.9679.2>                                  rabbit_autoheal_state_after_restart,[]}]}}
2023-09-28 17:02:08.243259+00:00 [error] <0.9679.2>       in function  gen_server:call/2 (gen_server.erl, line 239)
2023-09-28 17:02:08.243259+00:00 [error] <0.9679.2>       in call from rabbit_autoheal:init/0 (rabbit_autoheal.erl, line 117)
2023-09-28 17:02:08.243259+00:00 [error] <0.9679.2>       in call from rabbit_node_monitor:init/1 (rabbit_node_monitor.erl, line 371)
2023-09-28 17:02:08.243259+00:00 [error] <0.9679.2>       in call from gen_server:init_it/2 (gen_server.erl, line 423)
2023-09-28 17:02:08.243259+00:00 [error] <0.9679.2>       in call from gen_server:init_it/6 (gen_server.erl, line 390)
2023-09-28 17:02:08.243259+00:00 [error] <0.9679.2>     ancestors: [rabbit_node_monitor_sup,rabbit_sup,<0.7969.2>]
2023-09-28 17:02:08.243259+00:00 [error] <0.9679.2>     message_queue_len: 0
2023-09-28 17:02:08.243259+00:00 [error] <0.9679.2>     messages: []
2023-09-28 17:02:08.243259+00:00 [error] <0.9679.2>     links: [<0.8213.2>,<0.8007.2>]
2023-09-28 17:02:08.243259+00:00 [error] <0.9679.2>     dictionary: []
2023-09-28 17:02:08.243259+00:00 [error] <0.9679.2>     trap_exit: true
2023-09-28 17:02:08.243259+00:00 [error] <0.9679.2>     status: running
2023-09-28 17:02:08.243259+00:00 [error] <0.9679.2>     heap_size: 2586
2023-09-28 17:02:08.243259+00:00 [error] <0.9679.2>     stack_size: 29
2023-09-28 17:02:08.243259+00:00 [error] <0.9679.2>     reductions: 2721
2023-09-28 17:02:08.243259+00:00 [error] <0.9679.2>   neighbours:
2023-09-28 17:02:08.243259+00:00 [error] <0.9679.2> 

Types of Changes

What types of changes does your code introduce to this project?
Put an x in the boxes that apply

  • Bug fix (non-breaking change which fixes issue #NNNN)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause an observable behavior change in existing systems)
  • Documentation improvements (corrections, new content, etc)
  • Cosmetic change (whitespace, formatting, etc)
  • Build system and/or CI

Checklist

Put an x in the boxes that apply.
You can also fill these out after creating the PR.
If you're unsure about any of them, don't hesitate to ask on the mailing list.
We're here to help!
This is simply a reminder of what we are going to look for before merging your code.

  • I have read the CONTRIBUTING.md document
  • I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
  • I have added tests that prove my fix is effective or that my feature works
  • All tests pass locally with my changes
  • If relevant, I have added necessary documentation to https://github.com/rabbitmq/rabbitmq-website
  • If relevant, I have added this change to the first version(s) in release-notes that I expect to introduce it

Further Comments

If this is a relatively large or complex change, kick off the discussion by explaining why you chose the solution you did and what alternatives you considered, etc.

when nodes are already disconnected, thus preventing
application controller timeouts and node monitor terminations,
which in some cases fails to receover completely.
@michaelklishin michaelklishin merged commit e6339bb into rabbitmq:main Oct 9, 2023
michaelklishin added a commit that referenced this pull request Oct 9, 2023
Prevent node monitor crashing from doing multiple node disconnects in unstable networks (backport #9663)
michaelklishin added a commit that referenced this pull request Oct 9, 2023
Prevent node monitor crashing from doing multiple node disconnects in unstable networks (backport #9663) (backport #9667)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants