Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Zen2] Calculate optimal cluster configuration #33924

Merged

Conversation

DaveCTurner
Copy link
Contributor

We wish to commit a cluster state update after having received a response from
more than half of the master-eligible nodes in the cluster. This is optimal:
requiring either more or fewer votes than half harms resilience. For instance
if we have three master nodes then, we want to be able to commit a cluster
state after receiving responses from any two nodes; requiring responses from
all three is clearly not resilient to the failure of any node, and if we could
commit an update after a response from just one node then that node would be
required for every commit, which is also not resilient.

However, this means we must adjust the configuration (the set of voting nodes
in the cluster) whenever a master-eligible node joins or leaves. The
calculation of the best configuration for the cluster is the job of the
Reconfigurator, introduced here.

We wish to commit a cluster state update after having received a response from
more than half of the master-eligible nodes in the cluster.  This is optimal:
requiring either more or fewer votes than half harms resilience. For instance
if we have three master nodes then, we want to be able to commit a cluster
state after receiving responses from any two nodes; requiring responses from
all three is clearly not resilient to the failure of any node, and if we could
commit an update after a response from just one node then that node would be
required for every commit, which is also not resilient.

However, this means we must adjust the configuration (the set of voting nodes
in the cluster) whenever a master-eligible node joins or leaves. The
calculation of the best configuration for the cluster is the job of the
Reconfigurator, introduced here.
@DaveCTurner DaveCTurner added >enhancement v7.0.0 :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. team-discuss labels Sep 20, 2018
@DaveCTurner DaveCTurner requested a review from ywelsch September 20, 2018 21:20
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

* nodes required to process a cluster state update.
*/
public static final Setting<Integer> MINIMUM_VOTING_MASTER_NODES_SETTING =
Setting.intSetting("cluster.minimum_voting_master_nodes", 1, 1, Property.NodeScope, Property.Dynamic);
Copy link
Contributor Author

@DaveCTurner DaveCTurner Sep 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've marked this as team-discuss to contemplate this setting. In the PoC the equivalent setting had a different meaning (it was 2*[cluster.minimum_voting_master_nodes]-1) and only really made sense as an odd number. In writing docs like the above Javadoc I've found it easier to describe this:

the size of the smallest set of master nodes required to process a cluster state update.

I also feel that the most sensible name for this setting would be something like cluster.minimum_master_nodes if only we could ignore what that name means today. 🙈

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense to define it like that (so that it represents the level of fault-tolerance instead of something that projects majorities). Regarding the name, I want to take the weekend to think more about it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we should call it cluster.global_safety_factor and use non-numeric values, e.g.,

  • CARELESS (maps to 1)
  • HEALTHY (maps to 2)
  • PARANOID (maps to 3)
    😸

Copy link
Contributor Author

@DaveCTurner DaveCTurner Sep 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming the constants is a good idea; I'd be surprised if anyone ever wants a value >3 here. I'd also like to explore the idea of the default being 2 and not 1: at first I thought that this'd break clusters with fewer than 3 nodes but on reflection I can't find any obvious problems.

In-place upgrades (rolling or otherwise) seem unaffected by this setting. The trickiest case I can think of is a rolling migration for a 1- or 2-node cluster, which is not something that I've heard much about. A migration of a 1-node cluster already requires special handling (i.e. explicit retirement of the old node) so I think changing this setting is no big deal here. Migrating a 2-node cluster one-node-at-a-time will work if cluster.global_safety_factor: 2, but the resulting configuration will be three nodes (the two new nodes as well as one of the old nodes) whereas with cluster.global_safety_factor: 1 it would just be a single new node, which is slightly more resilient. Not that I think we should particularly care about resilience in 2-node clusters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One situation where the difference might matter is with a one-node cluster to which you (accidentally or deliberately) join a second node and then remove it again. With cluster.global_safety_factor: 2 the resulting configuration is both nodes, so shutting the second node down will lose a quorum. With cluster.global_safety_factor: 1 no reconfiguration will take place and the cluster will carry on working.

@ywelsch ywelsch mentioned this pull request Sep 20, 2018
61 tasks
@DaveCTurner DaveCTurner changed the title Calculate optimal cluster configuration [Zen2] Calculate optimal cluster configuration Sep 20, 2018
ywelsch
ywelsch previously approved these changes Sep 21, 2018
* nodes required to process a cluster state update.
*/
public static final Setting<Integer> MINIMUM_VOTING_MASTER_NODES_SETTING =
Setting.intSetting("cluster.minimum_voting_master_nodes", 1, 1, Property.NodeScope, Property.Dynamic);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense to define it like that (so that it represents the level of fault-tolerance instead of something that projects majorities). Regarding the name, I want to take the weekend to think more about it

* nodes required to process a cluster state update.
*/
public static final Setting<Integer> MINIMUM_VOTING_MASTER_NODES_SETTING =
Setting.intSetting("cluster.minimum_voting_master_nodes", 1, 1, Property.NodeScope, Property.Dynamic);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we should call it cluster.global_safety_factor and use non-numeric values, e.g.,

  • CARELESS (maps to 1)
  • HEALTHY (maps to 2)
  • PARANOID (maps to 3)
    😸

nonRetiredLiveNotInConfigIds.removeAll(retiredNodeIds);

final int targetSize = Math.max(roundDownToOdd(nonRetiredLiveInConfigIds.size() + nonRetiredLiveNotInConfigIds.size()),
2 * minVotingMasterNodes - 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a lot going on in this formula. Can you maybe factor some of the things out into dedicated local variables to make it clearer what's going on?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I added more variables and explanation.

@DaveCTurner DaveCTurner dismissed ywelsch’s stale review October 3, 2018 10:13

Significant changes made since this review

@DaveCTurner
Copy link
Contributor Author

@ywelsch I think this encapsulates all the changes we agreed on, please take another look.

@DaveCTurner DaveCTurner requested a review from ywelsch October 17, 2018 07:57
@DaveCTurner
Copy link
Contributor Author

@ywelsch this is good for another look

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left a comment about the default and some smaller comments about the tests

}
}
public static final Setting<Integer> CLUSTER_MASTER_NODES_FAILURE_TOLERANCE =
Setting.intSetting("cluster.master_nodes_failure_tolerance", 0, 0, Property.NodeScope, Property.Dynamic);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should the default be 1 here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's important because we decided to set this at bootstrapping time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it's not important, maybe a safer default is nicer :) I'm fine leaving as is for now. We can revisit after the bootstrapping.

public static final Setting<Integer> CLUSTER_MASTER_NODES_FAILURE_TOLERANCE =
Setting.intSetting("cluster.master_nodes_failure_tolerance", 0, 0, Property.NodeScope, Property.Dynamic);

private int masterNodesFailureTolerance;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs to be made volatile

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes.

check(nodes("a", "b", "c", "d"), conf("a", "b", "c"), masterNodesFailureTolerance, conf("a", "b", "c"));
check(nodes("a", "b", "c", "d", "e"), conf("a", "b", "c"), masterNodesFailureTolerance, conf("a", "b", "c", "d", "e"));
check(nodes("a", "b"), conf("a", "b", "e"), masterNodesFailureTolerance,
masterNodesFailureTolerance == 1 ? conf("a", "b", "e") : conf("a"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit weird to read these tests as we handle the config.getNodeIds().size() < 2 * masterNodesFailureTolerance + 1) case within check. It's something that you have to constantly keep in mind while reading these tests.
I would prefer to test that separately and only have checks that actually lead to the desired target configuration.
This will result in a bit more duplication, but I think that's fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I reworked the tests to avoid the invalid cases better.


// If the safety level was never reached then retirement can take place
check(nodes("a", "b"), retired("a"), conf("a"), 1, conf("b"));
check(nodes("a", "b"), retired("a"), conf("b"), 1, conf("b"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these 2 tests don't make sense anymore? We never reconfigure that way. One more reason not to hide the config.getNodeIds().size() < 2 * masterNodesFailureTolerance + 1) case within check

@DaveCTurner DaveCTurner requested a review from ywelsch October 17, 2018 11:14
Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@DaveCTurner DaveCTurner merged commit e13ce66 into elastic:zen2 Oct 18, 2018
@DaveCTurner DaveCTurner deleted the 2018-09-20-calculate-optimal-configuration branch October 18, 2018 12:19
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Oct 18, 2018
As master-eligible nodes join or leave the cluster we should give them votes or
take them away, in order to maintain the optimal level of fault-tolerance in
the system. elastic#33924 introduced the `Reconfigurator` to calculate the optimal
configuration of the cluster, and in this change we add the plumbing needed to
actually perform the reconfigurations needed as the cluster grows or shrinks.
DaveCTurner added a commit that referenced this pull request Oct 19, 2018
As master-eligible nodes join or leave the cluster we should give them votes or
take them away, in order to maintain the optimal level of fault-tolerance in
the system. #33924 introduced the `Reconfigurator` to calculate the optimal
configuration of the cluster, and in this change we add the plumbing needed to
actually perform the reconfigurations needed as the cluster grows or shrinks.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >enhancement v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants