-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Mock connections more accurately in DisruptableMockTransport #37296
Mock connections more accurately in DisruptableMockTransport #37296
Conversation
Pinging @elastic/es-distributed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this change, it makes a lot more sense. I pointed out a handful of places where we might want to be a bit more badly behaved and suggested a couple of naming changes.
@@ -78,6 +78,10 @@ public DiscoveryNode getTargetNode() { | |||
return targetNode; | |||
} | |||
|
|||
public boolean matchesTarget(DiscoveryNode matchingNode) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe targetMatches
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, mine didn't feel quite right.
}); | ||
return () -> {}; | ||
} else { | ||
throw new ConnectTransportException(node, "node " + node + " does not exist"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is OK for now but in future (hoho) we will want this to be async and/or to timeout on an unknown node.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this depends on the connection manager becoming async. Right now there's a Future.get()
waiting for us behind this call.
connectionStatus = ConnectionStatus.CONNECTED; | ||
} else { | ||
connectionStatus = ConnectionStatus.DISCONNECTED; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it'd be good to test both DISCONNECTED
and BLACK_HOLE
here, perhaps using mostly the same value for the duration of a test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
|
||
protected abstract void handle(DiscoveryNode sender, DiscoveryNode destination, String action, Runnable doDelivery); | ||
protected abstract void schedule(Runnable runnable); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the name execute
would be more consistent with things like ExecutorService
. schedule
is largely used for delayed execution (ignoring that the DeterministicTaskQueue
uses scheduleNow
for this).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
logger.debug("----> [runRandomly {}] rebooting [{}]", thisStep, clusterNode.getId()); | ||
clusterNode.close(); | ||
clusterNodes.forEach(cn -> cn.onNode( | ||
() -> cn.transportService.disconnectFromNode(clusterNode.getLocalNode())).run()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should delay these disconnections. Maybe we should rarely delay them by a lot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removing this line makes the tests fail, will need to look at why that's the case. Do you think this should be possibly delayed beyond the safety phase?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think completely removing it is unrealistic, but we may not get a disconnection event for quite some time (up to ~15 minutes by default on Linux). I do not think it should be delayed beyond the safety phase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I've made the according changes. AFAICS the reason why we can't extend it beyond the safety phase is that PeerFinder will not start connecting to the new node as long as the transport claims for the old node with same address to be still connected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies, the 15 minutes example wasn't supposed to be a suggestion. I think just scheduling it as a normal delayed action is sufficient, given that EXTREME_DELAY_VARIABILITY
is mostly in force. I think this also means we don't need to clean it up specially, because we reduce the delay variability down to something reasonable for the end of the safety phase, and it shouldn't matter if it occurs within the first DEFAULT_DELAY_VARIABILITY
of the stabilisation phase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've pushed efa0728
Also note I haven't run a soak test, my CI machine is otherwise engaged. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM just one (somewhat irrelevant) suggestion and one question :)
} | ||
|
||
private void setUp() { | ||
mockTransport = new DisruptableMockTransport(logger) { | ||
@Override | ||
protected DiscoveryNode getLocalNode() { | ||
public DiscoveryNode getLocalNode() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that all the implementations of DisruptableMockTransport
simply have a getter for some constant value for the local node as their implementation. Maybe just move that getter up into DisruptableMockTransport
and pass it as constructor parameter while we're changing this anyway? (just to save a bit of noise in the concrete tests :))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
server/src/test/java/org/elasticsearch/cluster/coordination/CoordinatorTests.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but needs a soak test to be sure.
This PR moves DisruptableMockTransport to use a more accurate representation of connection management, which allows to use the full connection manager and does not require mocking out any behavior. With this, we can implement restarting nodes in CoordinatorTests.