-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Zen2: Fail fast on disconnects #34503
Conversation
Pinging @elastic/es-distributed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked through the tests and suggested a few changes.
checkScheduler.handleWakeUp(); | ||
return checkScheduler; | ||
public void updateLeader(@Nullable final DiscoveryNode leader) { | ||
assert leader == null || transportService.getLocalNode().equals(leader) == false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
leader == null
is redundant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right
+ DEFAULT_DELAY_VARIABILITY | ||
// then wait for a new election | ||
boolean followersGetDisconnectEvent = randomBoolean(); | ||
if (followersGetDisconnectEvent) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'd prefer two tests to test these two paths.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in 4518722
final FollowersChecker followersChecker = new FollowersChecker(settings, transportService, fcr -> { | ||
assert false : fcr; | ||
}, (node, reason) -> { | ||
assertTrue(nodeFailed.compareAndSet(false, true)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assert that reason
is what we expect too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in f196a2a, which also now uses a proper reason.
@@ -266,7 +319,7 @@ public String toString() { | |||
|
|||
final FollowersChecker followersChecker = new FollowersChecker(settings, transportService, fcr -> { | |||
assert false : fcr; | |||
}, node -> { | |||
}, (node, reason) -> { | |||
assertTrue(nodeFailed.compareAndSet(false, true)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assert that reason
is what we expect too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in f196a2a, which also now uses a proper reason.
return randomFrom(getAllNodesExcept(clusterNodes)); | ||
} | ||
|
||
List<ClusterNode> getAllNodesExcept(ClusterNode... clusterNodes) { | ||
Set<String> forbiddenIds = Arrays.stream(clusterNodes).map(ClusterNode::getId).collect(Collectors.toSet()); | ||
List<ClusterNode> acceptableNodes | ||
= this.clusterNodes.stream().filter(n -> forbiddenIds.contains(n.getId()) == false).collect(Collectors.toList()); | ||
assert acceptableNodes.isEmpty() == false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this assertion should be in getAnyNodeExcept()
- it's ok to return an empty list here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok 6589027
follower0.onDisconnectEventFrom(leader); | ||
follower1.onDisconnectEventFrom(leader); | ||
cluster.runFor(DEFAULT_DELAY_VARIABILITY // disconnect is scheduled | ||
+ DEFAULT_ELECTION_DELAY, "elect new leader"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be a stabilise()
instead? If not, could there be a comment saying why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we want the 2 followers to complete an election and elect a leader among themselves before "healing" the leader (and prevent the leader from preventing that election), so that the publishing of the subsequent value will fail. I've added a comment.
+ DEFAULT_ELECTION_DELAY, "elect new leader"); | ||
leader.heal(); | ||
AckCollector ackCollector = leader.submitValue(randomLong()); | ||
cluster.runFor(DEFAULT_DELAY_VARIABILITY, "start publishing"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This runFor()
is immediately followed by a stabilise()
. Can we just stabilise? If not, could there be a comment saying why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, we can fold this into stabilize (see cb123e5)
+ DEFAULT_DELAY_VARIABILITY | ||
)); | ||
boolean leaderGetsDisconnectEvent = randomBoolean(); | ||
if (leaderGetsDisconnectEvent) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer two tests to test these two paths.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in 4518722
This is ready for another look @DaveCTurner |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Running this on my CI now, but I think we can shorten some timeouts - see comment.
@@ -398,7 +420,6 @@ public void testLeaderDisconnectionDetectedQuickly() { | |||
|
|||
cluster.stabilise(Math.max( | |||
// Each follower may have just sent a leader check, which receives no response | |||
// TODO not necessary if notified of disconnection |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This TODO indicates that the timeout we calculate below can now be made shorter since we don't have to wait for an in-flight check to timeout. Same applies to the other TODOs like it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh wait, there are now tests of both cases, don't mind me.
Integrates the failure detectors with the Connection lifecycle, to fail nodes as soon as: