[Zen2] Gather votes from all nodes #34335
Pinging @elastic/es-distributed
@@ -753,6 +766,11 @@ public String toString() {
private final AckListener ackListener;
private final ActionListener<Void> publishListener;

// We may not have accepted our own state before receiving a join from another node, causing its join to be rejected (we cannot
// safely accept a join whose last-accepted term/version is ahead of ours), so store them up and process them at the end.
// TODO this is unpleasant, is there a better way?
Maybe it's not so bad. WDYT?
as reconfiguration (which cares about the joins) can only happen in the next cluster state update (and is only triggered at the end of this publication), I think this is ok.
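For readers following the thread, the "store them up and process them at the end" idea can be sketched roughly as below. This is a minimal illustration, not the code in this PR: the class `DeferredJoinsSketch`, its methods, and the choice to handle late joins immediately once the publication has completed are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: buffers joins that arrive during a publication and
// hands them to a handler only once the publication has completed.
final class DeferredJoinsSketch<J> {
    private final List<J> receivedJoins = new ArrayList<>();
    private final Consumer<J> joinHandler;
    private boolean completed;

    DeferredJoinsSketch(Consumer<J> joinHandler) {
        this.joinHandler = joinHandler;
    }

    // Called when a join arrives. While the publication is in flight the join
    // is only stored, since the local node may not have accepted its own state yet.
    void onJoin(J join) {
        if (completed) {
            joinHandler.accept(join); // assumption: late joins are handled directly
        } else {
            receivedJoins.add(join);
        }
    }

    // Called once at the end of the publication: process the stored joins in arrival order.
    void onPublicationComplete() {
        completed = true;
        receivedJoins.forEach(joinHandler);
        receivedJoins.clear();
    }
}
```

Deferring is safe here for the reason given above: nothing consumes the joins until the next cluster state update.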
Another failure at ~800 iterations; this specific case is fixed in 71a642d but I think there's a related failure in which the |
LGTM
if (sourceNode.equals(getLocalNode())) {
    preVoteCollector.update(getPreVoteResponse(), getLocalNode());
} else {
    becomeFollower("handlePublishRequest", sourceNode); // updates preVoteCollector
maybe change comment to "also updates preVoteCollector" to make it clearer that that is not the only purpose (or maybe I just misinterpreted this)
Today we accept that some nodes may vote for the wrong master in an election.
This is mostly fine because they eventually join the correct master, but the
lack of a vote from every follower may prevent a future desirable
reconfiguration from taking place.
The solution is to hold another election in a yet-higher term in order to
collect a complete set of votes. Elections are somewhat disruptive so we should
think carefully about when this election should take place. One option is to
wait as late as possible (on the grounds that it might not ever be necessary).
This unfortunately makes it harder to predict how an
apparently-smoothly-running cluster will react to nodes leaving and joining.
Instead we prefer to perform the election as soon as possible in the leader's
term, adding "votes from all followers" to the invariants that we expect to
hold in a stable cluster. The start of a leader's term is already a somewhat
disrupted time for the cluster, so performing another election at this point
does not materially change the cluster's behaviour.
This change implements the logic needed to trigger a new election in order to
satisfy this extra stabilisation condition.
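
As a rough illustration of the stabilisation condition described above, the check "has every follower voted in the leader's term?" could look something like the following. This is a sketch under stated assumptions, not the implementation in this change: `VoteGatheringSketch`, its method names, the use of plain strings for node identifiers, and the "highest term seen plus one" rule for picking the new term are all hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; names are illustrative and not taken from the Elasticsearch source.
final class VoteGatheringSketch {

    // Followers from which the leader has not received a vote in its current term.
    static Set<String> missingVotes(Set<String> followers, Set<String> votesReceived) {
        Set<String> missing = new HashSet<>(followers);
        missing.removeAll(votesReceived);
        return missing;
    }

    // The extra invariant: a stable leader holds votes from all of its followers.
    // If any vote is missing, another election is needed to collect a complete set.
    static boolean needsNewElection(Set<String> followers, Set<String> votesReceived) {
        return !missingVotes(followers, votesReceived).isEmpty();
    }

    // Assumption: the new election uses a term above every term seen so far.
    static long nextElectionTerm(long currentTerm, long maxTermSeen) {
        return Math.max(currentTerm, maxTermSeen) + 1;
    }

    public static void main(String[] args) {
        Set<String> followers = Set.of("node-1", "node-2", "node-3");
        Set<String> votes = Set.of("node-1", "node-3");
        System.out.println("missing votes: " + missingVotes(followers, votes));
        System.out.println("new election needed: " + needsNewElection(followers, votes));
    }
}
```

Running the check at the start of the leader's term, rather than lazily, matches the preference stated above for doing the extra election while the cluster is already somewhat disrupted.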