Zen2: Fail fast on disconnects #34503

ywelsch · 2018-10-16T07:25:47Z

Integrates the failure detectors with the Connection lifecycle, to fail nodes as soon as:

a leader detects one of its followers disconnecting.
a follower detects its leader disconnecting.
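
To illustrate the idea, here is a minimal sketch of a follower failure detector that reacts to a disconnect event directly instead of waiting for a periodic check to time out. All names (`FailFastFollowerDetector`, `onNodeDisconnected`, the `"disconnected"` reason string) are hypothetical and simplified; this is not the code in this PR or the Elasticsearch API.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.BiConsumer;

// Hypothetical, simplified stand-in for a follower failure detector.
final class FailFastFollowerDetector {
    private final Set<String> followers = new HashSet<>();
    private final BiConsumer<String, String> onNodeFailure; // receives (nodeId, reason)

    FailFastFollowerDetector(BiConsumer<String, String> onNodeFailure) {
        this.onNodeFailure = onNodeFailure;
    }

    void addFollower(String nodeId) {
        followers.add(nodeId);
    }

    // Assumed to be called by the connection layer when the connection to nodeId closes.
    void onNodeDisconnected(String nodeId) {
        // remove() returns true only for a tracked follower,
        // so each follower is failed at most once and unknown nodes are ignored.
        if (followers.remove(nodeId)) {
            onNodeFailure.accept(nodeId, "disconnected");
        }
    }
}
```

The same pattern would apply in the other direction, with a follower failing its leader when the connection to the leader closes.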

elasticmachine · 2018-10-16T07:25:48Z

Pinging @elastic/es-distributed

DaveCTurner

I looked through the tests and suggested a few changes.

DaveCTurner · 2018-10-16T12:10:12Z

server/src/main/java/org/elasticsearch/cluster/coordination/LeaderChecker.java

-        checkScheduler.handleWakeUp();
-        return checkScheduler;
+    public void updateLeader(@Nullable final DiscoveryNode leader) {
+        assert leader == null || transportService.getLocalNode().equals(leader) == false;


leader == null is redundant?
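
A small sketch of why the null check adds nothing, assuming the node class follows the usual `equals` contract (`x.equals(null)` is `false`). `DemoNode` and `NullCheckDemo` are hypothetical stand-ins, not `DiscoveryNode`.

```java
import java.util.Objects;

// Hypothetical node type whose equals() follows the standard contract.
final class DemoNode {
    private final String id;

    DemoNode(String id) {
        this.id = id;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false; // equals(null) is false
        return id.equals(((DemoNode) o).id);
    }

    @Override
    public int hashCode() {
        return Objects.hash(id);
    }
}

final class NullCheckDemo {
    // Condition as written in the diff.
    static boolean withNullCheck(DemoNode local, DemoNode leader) {
        return leader == null || local.equals(leader) == false;
    }

    // Same condition without the explicit null check.
    static boolean withoutNullCheck(DemoNode local, DemoNode leader) {
        return local.equals(leader) == false;
    }
}
```

Both methods return the same result for a null leader, for the local node, and for any other node.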

DaveCTurner · 2018-10-16T12:11:02Z

server/src/test/java/org/elasticsearch/cluster/coordination/CoordinatorTests.java

-                + DEFAULT_DELAY_VARIABILITY
-                // then wait for a new election
+        boolean followersGetDisconnectEvent = randomBoolean();
+        if (followersGetDisconnectEvent) {


I think I'd prefer two tests to test these two paths.

fixed in 4518722

DaveCTurner · 2018-10-16T12:15:52Z

server/src/test/java/org/elasticsearch/cluster/coordination/FollowersCheckerTests.java

+        final FollowersChecker followersChecker = new FollowersChecker(settings, transportService, fcr -> {
+            assert false : fcr;
+        }, (node, reason) -> {
+            assertTrue(nodeFailed.compareAndSet(false, true));


Assert that reason is what we expect too?

fixed in f196a2a, which also now uses a proper reason.

DaveCTurner · 2018-10-16T12:17:23Z

server/src/test/java/org/elasticsearch/cluster/coordination/FollowersCheckerTests.java

@@ -266,7 +319,7 @@ public String toString() {

        final FollowersChecker followersChecker = new FollowersChecker(settings, transportService, fcr -> {
            assert false : fcr;
-        }, node -> {
+        }, (node, reason) -> {
            assertTrue(nodeFailed.compareAndSet(false, true));


Assert that reason is what we expect too?

fixed in f196a2a, which also now uses a proper reason.

DaveCTurner · 2018-10-16T12:19:12Z

server/src/test/java/org/elasticsearch/cluster/coordination/CoordinatorTests.java

+            return randomFrom(getAllNodesExcept(clusterNodes));
+        }
+
+        List<ClusterNode> getAllNodesExcept(ClusterNode... clusterNodes) {
            Set<String> forbiddenIds = Arrays.stream(clusterNodes).map(ClusterNode::getId).collect(Collectors.toSet());
            List<ClusterNode> acceptableNodes
                = this.clusterNodes.stream().filter(n -> forbiddenIds.contains(n.getId()) == false).collect(Collectors.toList());
            assert acceptableNodes.isEmpty() == false;


I think this assertion should be in getAnyNodeExcept() - it's ok to return an empty list here.

ok 6589027
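
For reference, a sketch of what the suggested split could look like, using plain node-id strings instead of `ClusterNode`. The class `NodePicker` and its constructor are hypothetical; only the two method names come from the diff.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper operating on node ids rather than ClusterNode instances.
final class NodePicker {
    private final List<String> allNodeIds;
    private final Random random;

    NodePicker(List<String> allNodeIds, Random random) {
        this.allNodeIds = allNodeIds;
        this.random = random;
    }

    // May legitimately return an empty list, so no assertion here.
    List<String> getAllNodesExcept(String... excludedIds) {
        Set<String> forbiddenIds = Arrays.stream(excludedIds).collect(Collectors.toSet());
        return allNodeIds.stream()
            .filter(id -> forbiddenIds.contains(id) == false)
            .collect(Collectors.toList());
    }

    // Needs at least one candidate, so the assertion lives here.
    String getAnyNodeExcept(String... excludedIds) {
        List<String> acceptableNodes = getAllNodesExcept(excludedIds);
        assert acceptableNodes.isEmpty() == false : "no acceptable nodes";
        return acceptableNodes.get(random.nextInt(acceptableNodes.size()));
    }
}
```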

DaveCTurner · 2018-10-16T12:20:53Z

server/src/test/java/org/elasticsearch/cluster/coordination/CoordinatorTests.java

+        follower0.onDisconnectEventFrom(leader);
+        follower1.onDisconnectEventFrom(leader);
+        cluster.runFor(DEFAULT_DELAY_VARIABILITY // disconnect is scheduled
+            + DEFAULT_ELECTION_DELAY, "elect new leader");


Can this be a stabilise() instead? If not, could there be a comment saying why?

We want the two followers to complete an election and elect a leader among themselves before "healing" the old leader (so that the old leader cannot interfere with that election), which means that publishing the subsequent value will fail. I've added a comment.

DaveCTurner · 2018-10-16T12:21:57Z

server/src/test/java/org/elasticsearch/cluster/coordination/CoordinatorTests.java

+            + DEFAULT_ELECTION_DELAY, "elect new leader");
+        leader.heal();
+        AckCollector ackCollector = leader.submitValue(randomLong());
+        cluster.runFor(DEFAULT_DELAY_VARIABILITY, "start publishing");


This runFor() is immediately followed by a stabilise(). Can we just stabilise? If not, could there be a comment saying why?

yeah, we can fold this into stabilize (see cb123e5)

DaveCTurner · 2018-10-16T12:22:21Z

server/src/test/java/org/elasticsearch/cluster/coordination/CoordinatorTests.java

-                + DEFAULT_DELAY_VARIABILITY
-        ));
+        boolean leaderGetsDisconnectEvent = randomBoolean();
+        if (leaderGetsDisconnectEvent) {


I would prefer two tests to test these two paths.

fixed in 4518722

ywelsch · 2018-10-18T11:32:05Z

This is ready for another look @DaveCTurner

DaveCTurner

Running this on my CI now, but I think we can shorten some timeouts - see comment.

DaveCTurner · 2018-10-20T12:28:35Z

server/src/test/java/org/elasticsearch/cluster/coordination/CoordinatorTests.java

@@ -398,7 +420,6 @@ public void testLeaderDisconnectionDetectedQuickly() {

        cluster.stabilise(Math.max(
            // Each follower may have just sent a leader check, which receives no response
-            // TODO not necessary if notified of disconnection


This TODO indicates that the timeout we calculate below can now be made shorter, since we don't have to wait for an in-flight check to time out. The same applies to the other TODOs like it.

Oh wait, there are now tests of both cases, don't mind me.

fast disconnect

2601aba

ywelsch added the >enhancement, v7.0.0, and :Distributed Coordination/Cluster Coordination labels Oct 16, 2018

ywelsch requested a review from DaveCTurner October 16, 2018 07:25

DaveCTurner reviewed Oct 16, 2018

ywelsch added 5 commits October 16, 2018 16:54

redundant

b0df401

they pay me by line of code

4518722

add proper reason when removing nodes

f196a2a

move assertion

6589027

fold into stabilize

cb123e5

ywelsch mentioned this pull request Oct 16, 2018

A new cluster coordination layer #32006

Closed

61 tasks

add comment

8022cb9

ywelsch requested a review from DaveCTurner October 16, 2018 15:37

Merge remote-tracking branch 'elastic/zen2' into fast-disconnect

e2396ac

Merge remote-tracking branch 'elastic/zen2' into fast-disconnect

fd7b29e

DaveCTurner reviewed Oct 20, 2018

ywelsch added 2 commits October 22, 2018 10:03

wait for reconfiguration

7781a44

add comments

8d2c394

DaveCTurner approved these changes Oct 22, 2018

ywelsch merged commit 6d6ac74 into elastic:zen2 Oct 22, 2018

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019
