Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Zen2: Fail fast on disconnects #34503

Merged
merged 11 commits into from
Oct 22, 2018
Merged

Zen2: Fail fast on disconnects #34503

merged 11 commits into from
Oct 22, 2018

Conversation

ywelsch
Copy link
Contributor

@ywelsch ywelsch commented Oct 16, 2018

Integrates the failure detectors with the Connection lifecycle, to fail nodes as soon as:

  • a leader detects one of his followers disconnecting.
  • a follower detects its leader disconnecting.

@ywelsch ywelsch added >enhancement v7.0.0 :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. labels Oct 16, 2018
@ywelsch ywelsch requested a review from DaveCTurner October 16, 2018 07:25
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked through the tests and suggested a few changes.

checkScheduler.handleWakeUp();
return checkScheduler;
public void updateLeader(@Nullable final DiscoveryNode leader) {
assert leader == null || transportService.getLocalNode().equals(leader) == false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leader == null is redundant?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right

+ DEFAULT_DELAY_VARIABILITY
// then wait for a new election
boolean followersGetDisconnectEvent = randomBoolean();
if (followersGetDisconnectEvent) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd prefer two tests to test these two paths.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in 4518722

final FollowersChecker followersChecker = new FollowersChecker(settings, transportService, fcr -> {
assert false : fcr;
}, (node, reason) -> {
assertTrue(nodeFailed.compareAndSet(false, true));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assert that reason is what we expect too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in f196a2a, which also now uses a proper reason.

@@ -266,7 +319,7 @@ public String toString() {

final FollowersChecker followersChecker = new FollowersChecker(settings, transportService, fcr -> {
assert false : fcr;
}, node -> {
}, (node, reason) -> {
assertTrue(nodeFailed.compareAndSet(false, true));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assert that reason is what we expect too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in f196a2a, which also now uses a proper reason.

return randomFrom(getAllNodesExcept(clusterNodes));
}

List<ClusterNode> getAllNodesExcept(ClusterNode... clusterNodes) {
Set<String> forbiddenIds = Arrays.stream(clusterNodes).map(ClusterNode::getId).collect(Collectors.toSet());
List<ClusterNode> acceptableNodes
= this.clusterNodes.stream().filter(n -> forbiddenIds.contains(n.getId()) == false).collect(Collectors.toList());
assert acceptableNodes.isEmpty() == false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this assertion should be in getAnyNodeExcept() - it's ok to return an empty list here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok 6589027

follower0.onDisconnectEventFrom(leader);
follower1.onDisconnectEventFrom(leader);
cluster.runFor(DEFAULT_DELAY_VARIABILITY // disconnect is scheduled
+ DEFAULT_ELECTION_DELAY, "elect new leader");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be a stabilise() instead? If not, could there be a comment saying why?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we want the 2 followers to complete an election and elect a leader among themselves before "healing" the leader (and prevent the leader from preventing that election), so that the publishing of the subsequent value will fail. I've added a comment.

+ DEFAULT_ELECTION_DELAY, "elect new leader");
leader.heal();
AckCollector ackCollector = leader.submitValue(randomLong());
cluster.runFor(DEFAULT_DELAY_VARIABILITY, "start publishing");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This runFor() is immediately followed by a stabilise(). Can we just stabilise? If not, could there be a comment saying why?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, we can fold this into stabilize (see cb123e5)

+ DEFAULT_DELAY_VARIABILITY
));
boolean leaderGetsDisconnectEvent = randomBoolean();
if (leaderGetsDisconnectEvent) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer two tests to test these two paths.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in 4518722

@ywelsch ywelsch mentioned this pull request Oct 16, 2018
61 tasks
@ywelsch ywelsch requested a review from DaveCTurner October 16, 2018 15:37
@ywelsch
Copy link
Contributor Author

ywelsch commented Oct 18, 2018

This is ready for another look @DaveCTurner

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Running this on my CI now, but I think we can shorten some timeouts - see comment.

@@ -398,7 +420,6 @@ public void testLeaderDisconnectionDetectedQuickly() {

cluster.stabilise(Math.max(
// Each follower may have just sent a leader check, which receives no response
// TODO not necessary if notified of disconnection
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This TODO indicates that the timeout we calculate below can now be made shorter since we don't have to wait for an in-flight check to timeout. Same applies to the other TODOs like it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wait, there are now tests of both cases, don't mind me.

@ywelsch ywelsch merged commit 6d6ac74 into elastic:zen2 Oct 22, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >enhancement v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants