
Explore: classic peer discovery without randomised startup delay #689

Closed
ansd wants to merge 2 commits from the node-0-forms-cluster branch

Conversation

ansd (Member) commented May 12, 2021

Relates to #662.

Enforce that pod 0 forms the cluster.

Pros:

  • no randomised startup delay => always a single node (pod 0) forms the cluster
  • no sophisticated locking mechanism

Cons:

  • If pod 0 never starts, the cluster won't be created. I think that's okay: for new clusters, all nodes should be up and running anyway.
  • Sporadically, there are container restarts because nodes crash with the schema_integrity_check_failed error below. Eventually the cluster gets created successfully and all pods become ready. Solution: have only one node initialize itself at a time.
2021-05-12 15:34:02.786 [debug] <0.67.0> Supervisor kernel_safe_sup started pg_local:start_link() at pid <0.1065.0>
2021-05-12 15:34:02.876 [info] <0.44.0> Application mnesia exited with reason: stopped
2021-05-12 15:34:02.877 [error] <0.273.0>
2021-05-12 15:34:02.877 [info] <0.44.0> Application mnesia exited with reason: stopped
2021-05-12 15:34:02.877 [error] <0.273.0> BOOT FAILED
2021-05-12 15:34:02.877 [error] <0.273.0> ===========
2021-05-12 15:34:02.877 [error] <0.273.0> Error during startup: {error,
2021-05-12 15:34:02.877 [error] <0.273.0>                           {schema_integrity_check_failed,
2021-05-12 15:34:02.877 [error] <0.273.0>                               [{table_attributes_mismatch,rabbit_user,
2021-05-12 15:34:02.878 [error] <0.273.0>                                    [username,password_hash,tags,
2021-05-12 15:34:02.878 [error] <0.273.0>                                     hashing_algorithm],
2021-05-12 15:34:02.878 [error] <0.273.0>                                    [username,password_hash,tags,
2021-05-12 15:34:02.878 [error] <0.273.0>                                     hashing_algorithm,limits]},
2021-05-12 15:34:02.878 [error] <0.273.0>                                {table_attributes_mismatch,rabbit_vhost,
2021-05-12 15:34:02.878 [error] <0.273.0>                                    [virtual_host,limits],
2021-05-12 15:34:02.879 [error] <0.273.0>                                    [virtual_host,limits,metadata]}]}}
2021-05-12 15:34:02.879 [error] <0.273.0>
2021-05-12 15:34:03.880 [debug] <0.273.0> Set stop reason to: {error,{schema_integrity_check_failed,[{table_attributes_mismatch,rabbit_user,[username,password_hash,tags,hashing_algorithm],[username,password_hash,tags,hashing_algorithm,limits]},{table_attributes_mismatch,rabbit_vhost,[virtual_host,limits],[virtual_host,limits,metadata]}]}}
2021-05-12 15:34:03.880 [debug] <0.273.0> Change boot state to `stopped`
2021-05-12 15:34:03.881 [debug] <0.44.0> Running rabbit_prelaunch:shutdown_func() as part of `kernel` shutdown
2021-05-12 15:34:03.881 [info] <0.272.0> [{initial_call,{application_master,init,['Argument__1','Argument__2','Argument__3','Argument__4']}},{pid,<0.272.0>},{registered_name,[]},{error_info,{exit,{{schema_integrity_check_failed,[{table_attributes_mismatch,rabbit_user,[username,password_hash,tags,hashing_algorithm],[username,password_hash,tags,hashing_algorithm,limits]},{table_attributes_mismatch,rabbit_vhost,[virtual_host,limits],[virtual_host,limits,metadata]}]},{rabbit,start,[normal,[]]}},[{application_master,init,4,[{file,"application_master.erl"},{line,138}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}},{ancestors,[<0.271.0>]},{message_queue_len,1},{messages,[{'EXIT',<0.273.0>,normal}]},{links,[<0.271.0>,<0.44.0>]},{dictionary,[]},{trap_exit,true},{status,running},{heap_size,376},{stack_size,28},{reductions,620}], []
2021-05-12 15:34:03.881 [debug] <0.44.0> Deleting PID file: /var/lib/rabbitmq/mnesia/[email protected]
2021-05-12 15:34:03.881 [error] <0.272.0> CRASH REPORT Process <0.272.0> with 0 neighbours exited with reason: {{schema_integrity_check_failed,[{table_attributes_mismatch,rabbit_user,[username,password_hash,tags,hashing_algorithm],[username,password_hash,tags,hashing_algorithm,limits]},{table_attributes_mismatch,rabbit_vhost,[virtual_host,limits],[virtual_host,limits,metadata]}]},{rabbit,start,[normal,[]]}} in application_master:init/4 line 138

This commit fixes #662.

Use classic peer discovery instead of rabbit_peer_discovery_k8s plugin.
For RabbitMQ clusters deployed by the RabbitMQ cluster operator, there
is no need for dynamic peer discovery since cluster members are known
at deploy time.

Randomised startup delays suffer from sporadic cluster formation issues as
observed in #662 because two nodes might choose to form a new cluster at
roughly the same time. The more nodes in the cluster and the smaller the
randomised startup delay range, the higher the chances of multiple nodes
creating a new cluster.

This commit uses neither randomised startup delays nor sophisticated locking mechanisms.
It enforces that pod 0 always creates the cluster.
As soon as pod 0 has created the cluster, the other nodes join.

Removing randomised startup delays also decreases overall time until
all nodes are ready.
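
For illustration, here is a minimal sketch of what the rendered classic peer discovery configuration could look like in rabbitmq.conf for a hypothetical 3-node cluster (node names are made up; the operator generates the actual values):

# sketch only: classic config peer discovery with all cluster members listed
cluster_formation.peer_discovery_backend = classic_config
cluster_formation.classic_config.nodes.1 = rabbit@myrabbit-server-0.myrabbit-nodes.default
cluster_formation.classic_config.nodes.2 = rabbit@myrabbit-server-1.myrabbit-nodes.default
cluster_formation.classic_config.nodes.3 = rabbit@myrabbit-server-2.myrabbit-nodes.default

With such a config, pod 0 (myrabbit-server-0 here) is the node expected to initialise the cluster, and the other pods join it.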
ansd added a commit to rabbitmq/rabbitmq-server that referenced this pull request Jun 2, 2021
On initial cluster formation, only one node in a multi node cluster
should initialize the Mnesia database schema (i.e. form the cluster).
To ensure this when nodes start up in parallel,
RabbitMQ peer discovery backends have used
either locks or randomized startup delays.

Locks work great: When a node holds the lock, it either starts a new
blank node (if there is no other node in the cluster), or it joins
an existing node. This makes it impossible to have two nodes forming
the cluster at the same time.
Consul and etcd peer discovery backends use locks. The lock is acquired
in the Consul and etcd infrastructure, respectively.

For other peer discovery backends (classic, DNS, AWS), randomized
startup delays were used. They work well enough in most cases.
However, in rabbitmq/cluster-operator#662 we
observed that in 1% - 10% of cases (the more nodes or the
smaller the randomized startup delay range, the higher the chances), two
nodes decide to form the cluster. That's bad since it ends up as a
single Erlang cluster but two RabbitMQ clusters. Even worse, no
obvious alert is triggered and no error message is logged.

To solve this issue, one could increase the randomized startup delay
range from e.g. 0m - 1m to 0m - 3m. However, this makes initial cluster
formation very slow since it will take up to 3 minutes until
every node is ready. In rare cases, we still end up with two nodes
forming the cluster.

Another way to solve the problem is to name a dedicated node to be the
seed node (forming the cluster). This was explored in
rabbitmq/cluster-operator#689 and works well.
Two minor downsides to this approach are: 1. If the seed node never
becomes available, the whole cluster won't be formed (which is okay),
and 2. it doesn't integrate with existing dynamic peer discovery backends
(e.g. K8s, AWS) since nodes are not yet known at deploy time.

In this commit, we take a better approach: We remove randomized startup
delays altogether. We replace them with locks. However, instead of
implementing our own lock in an external system (e.g. in K8s),
we re-use Erlang's locking mechanism, global:set_lock/3.

global:set_lock/3 has some convenient properties:
1. It accepts a list of nodes to set the lock on.
2. The nodes in that list connect to each other (i.e. create an Erlang
cluster).
3. The method is synchronous with a timeout (number of retries). It
blocks until the lock becomes available.
4. If a process that holds a lock dies, or the node goes down, the lock
held by the process is deleted.

The list of nodes passed to global:set_lock/3 corresponds to the nodes
the peer discovery backend discovers (lists).
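
For illustration only, here is a minimal sketch (module and function names are hypothetical, not the actual rabbit_peer_discovery code) of wrapping the join-or-bootstrap decision in such a lock:

%% Sketch only; names are hypothetical.
-module(peer_discovery_lock_sketch).
-export([run_locked/2]).

%% Nodes is the list returned by the peer discovery backend,
%% e.g. ['rabbit@pod-0', 'rabbit@pod-1', 'rabbit@pod-2'].
%% Fun is the join-or-bootstrap logic to run while holding the lock.
run_locked(Nodes, Fun) ->
    LockId = {rabbitmq_cluster_formation, node()},  %% {ResourceId, LockRequesterId}
    Retries = 60,                                   %% retries before giving up
    case global:set_lock(LockId, Nodes, Retries) of
        true ->
            try
                Fun()
            after
                global:del_lock(LockId, Nodes)
            end;
        false ->
            {error, could_not_acquire_lock}
    end.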

Two special cases worth mentioning:

1. That list can be all desired nodes in the cluster
(e.g. in classic peer discovery where nodes are known at
deploy time) while only a subset of nodes is available.
In that case, global:set_lock/3 still sets the lock without
blocking until all nodes can be connected to. This is good since
nodes might start sequentially (non-parallel); see the snippet after this list.

2. In dynamic peer discovery backends (e.g. K8s, AWS), this
list can be just a subset of the desired nodes since nodes might not start up
in parallel. That's also not a problem as long as the following
requirement is met: "The peer discovery backend does not list two disjoint
sets of nodes (on different nodes) at the same time."
For example, in a 2-node cluster, the peer discovery backend must not
list only node 1 on node 1 and only node 2 on node 2.
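
To illustrate special case 1, a hypothetical call made on pod 0 while the other pods have not started yet (node names are made up):

%% Only 'rabbit@pod-0' is reachable; pod-1 and pod-2 are still starting.
Nodes = ['rabbit@pod-0', 'rabbit@pod-1', 'rabbit@pod-2'].
%% The lock is still acquired: set_lock/3 does not wait for the
%% unreachable nodes to become connectable.
true = global:set_lock({rabbitmq_cluster_formation, node()}, Nodes, 60).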

Existing peer discovery backends fulfil that requirement because the
resource the nodes are discovered from is global.
For example, in K8s, once node 1 is part of the Endpoints object, it
will be returned on both node 1 and node 2.
Likewise, in AWS, once node 1 started, the described list of instances
with a specific tag will include node 1 when the AWS peer discovery backend
runs on node 1 or node 2.

Removing randomized startup delays also makes cluster formation
considerably faster (up to 1 minute faster if that was the
upper bound in the range).
ansd added a commit to rabbitmq/rabbitmq-server that referenced this pull request Jun 3, 2021
ansd (Member, Author) commented Jun 7, 2021

Closing this PR in favor of rabbitmq/rabbitmq-server#3075.

@ansd ansd closed this Jun 7, 2021
@ansd ansd deleted the node-0-forms-cluster branch June 8, 2021 07:59
Zerpet added a commit that referenced this pull request Jan 28, 2022
This will be used in the pipeline to publish to additional registries.

[#689]

Signed-off-by: Aitor Perez Cedres <[email protected]>