Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mock connections more accurately in DisruptableMockTransport #37296

Merged
merged 9 commits into from
Jan 11, 2019

Conversation

ywelsch
Copy link
Contributor

@ywelsch ywelsch commented Jan 10, 2019

This PR moves DisruptableMockTransport to use a more accurate representation of connection management, which allows to use the full connection manager and does not require mocking out any behavior. With this, we can implement restarting nodes in CoordinatorTests.

@ywelsch ywelsch added >non-issue >test Issues or PRs that are addressing/adding tests v7.0.0 :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. labels Jan 10, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@ywelsch ywelsch mentioned this pull request Jan 10, 2019
61 tasks
Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this change, it makes a lot more sense. I pointed out a handful of places where we might want to be a bit more badly behaved and suggested a couple of naming changes.

@@ -78,6 +78,10 @@ public DiscoveryNode getTargetNode() {
return targetNode;
}

public boolean matchesTarget(DiscoveryNode matchingNode) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe targetMatches?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, mine didn't feel quite right.

});
return () -> {};
} else {
throw new ConnectTransportException(node, "node " + node + " does not exist");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is OK for now but in future (hoho) we will want this to be async and/or to timeout on an unknown node.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this depends on the connection manager becoming async. Right now there's a Future.get() waiting for us behind this call.

connectionStatus = ConnectionStatus.CONNECTED;
} else {
connectionStatus = ConnectionStatus.DISCONNECTED;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be good to test both DISCONNECTED and BLACK_HOLE here, perhaps using mostly the same value for the duration of a test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok


protected abstract void handle(DiscoveryNode sender, DiscoveryNode destination, String action, Runnable doDelivery);
protected abstract void schedule(Runnable runnable);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the name execute would be more consistent with things like ExecutorService. schedule is largely used for delayed execution (ignoring that the DeterministicTaskQueue uses scheduleNow for this).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

logger.debug("----> [runRandomly {}] rebooting [{}]", thisStep, clusterNode.getId());
clusterNode.close();
clusterNodes.forEach(cn -> cn.onNode(
() -> cn.transportService.disconnectFromNode(clusterNode.getLocalNode())).run());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should delay these disconnections. Maybe we should rarely delay them by a lot.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removing this line makes the tests fail, will need to look at why that's the case. Do you think this should be possibly delayed beyond the safety phase?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think completely removing it is unrealistic, but we may not get a disconnection event for quite some time (up to ~15 minutes by default on Linux). I do not think it should be delayed beyond the safety phase.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I've made the according changes. AFAICS the reason why we can't extend it beyond the safety phase is that PeerFinder will not start connecting to the new node as long as the transport claims for the old node with same address to be still connected.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies, the 15 minutes example wasn't supposed to be a suggestion. I think just scheduling it as a normal delayed action is sufficient, given that EXTREME_DELAY_VARIABILITY is mostly in force. I think this also means we don't need to clean it up specially, because we reduce the delay variability down to something reasonable for the end of the safety phase, and it shouldn't matter if it occurs within the first DEFAULT_DELAY_VARIABILITY of the stabilisation phase.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've pushed efa0728

@DaveCTurner
Copy link
Contributor

Also note I haven't run a soak test, my CI machine is otherwise engaged.

Copy link
Member

@original-brownbear original-brownbear left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM just one (somewhat irrelevant) suggestion and one question :)

}

private void setUp() {
mockTransport = new DisruptableMockTransport(logger) {
@Override
protected DiscoveryNode getLocalNode() {
public DiscoveryNode getLocalNode() {
Copy link
Member

@original-brownbear original-brownbear Jan 10, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that all the implementations of DisruptableMockTransport simply have a getter for some constant value for the local node as their implementation. Maybe just move that getter up into DisruptableMockTransport and pass it as constructor parameter while we're changing this anyway? (just to save a bit of noise in the concrete tests :))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@ywelsch ywelsch requested a review from DaveCTurner January 11, 2019 08:47
Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but needs a soak test to be sure.

@ywelsch ywelsch merged commit f4abf96 into elastic:master Jan 11, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >non-issue >test Issues or PRs that are addressing/adding tests v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants