Fixes the DiscoveryWithServiceDisruptionsIT#testIndicesDeleted test #16917
Conversation
Force-pushed from 0c60e64 to a2ab8f3
@bleskes Your feedback would be appreciated. I tried taking the route of registering a cluster state listener (the code is commented out), but there was no way to use it to make assertions for the unit test while executing in the listener's thread. The current approach checks the cluster state periodically before proceeding with the node restart. I'm happy to hear your suggestions if there is a better way.
        break;
    }
    try {
        Thread.sleep(sleepTime);
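For context, a hedged sketch of what that polling approach might look like (only sleepTime and the cluster-state check appear in the diff; the loop bound and counter are illustrative, not the exact test code):

// Illustrative sketch only: poll the master's cluster state until the index
// deletion is visible, then fall through to the node restart.
long waited = 0;
final long sleepTime = 100;      // assumed poll interval in ms
final long maxWaitTime = 30_000; // assumed upper bound in ms
while (waited < maxWaitTime) {
    ClusterState state = internalCluster().clusterService(masterNode1).state();
    if (state.metaData().hasIndex("test") == false) {
        break;
    }
    try {
        Thread.sleep(sleepTime);
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        break;
    }
    waited += sleepTime;
}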
this is usually a bad sign. We shouldn't use sleep anywhere. Sometimes it's needed, but we try to provide all the utilities to make sure no one uses it explicitly. In this case we have assertBusy:
assertBusy(() -> {
    final ClusterState currState = internalCluster().clusterService(masterNode1).state();
    assertTrue("index not deleted",
        currState.metaData().hasIndex("test") == false
            && currState.status() == ClusterState.ClusterStateStatus.APPLIED);
});
if we reduce the publish timeout to 0 (which will make the test faster), we need to use the same assertBusy technique on masterNode2 to make sure it has processed the change as well.
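Concretely, a sketch of that same check against the second master-eligible node (mirroring the snippet above; masterNode2 is assumed to be the test's other master-eligible node):

// Sketch: repeat the busy-assert against the other master-eligible node, so the test
// only proceeds once masterNode2 has also applied the deletion.
assertBusy(() -> {
    final ClusterState currState = internalCluster().clusterService(masterNode2).state();
    assertTrue("index not deleted on masterNode2",
        currState.metaData().hasIndex("test") == false
            && currState.status() == ClusterState.ClusterStateStatus.APPLIED);
});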
Thanks for this tip @bleskes, I didn't realize we already had an assertBusy utility.
Thx @abeyad. I left some suggestions.
Note that this is not an issue - the change times out due to the disruption that the test adds. We report it correctly by indicating in the response that the change is not acked, but the test knows that and therefore doesn't check for it. I do have some long term plans to make sure that when the call returns it is at least guaranteed to be processed on the current master, which will make things more intuitive. I don't think we want to always wait on all the nodes (with no timeout) before returning.
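As a rough illustration of what "not acked" looks like on the caller side (assuming the standard isAcknowledged() accessor on the delete response; this is not code from the PR):

// Sketch: with the data node partitioned and a 0s timeout, the delete returns before
// every node has applied the change, so the response is reported as not acknowledged.
DeleteIndexResponse response = internalCluster().client(masterNode1).admin().indices()
        .prepareDelete(idxName).setTimeout("0s").get();
// The test intentionally does not assert on this flag; it checks the master's
// cluster state via assertBusy instead.
boolean acked = response.isAcknowledged(); // expected to be false here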
Force-pushed from a2ab8f3 to 9726f4f
That would be awesome. And I changed the commit comment to reflect what you mentioned above.
@bleskes I made the changes, except for the …
Also, do you think the feature itself (i.e. #11665) belongs in 2.3 as well? I've held off for now until the test issue has been resolved.
Force-pushed from 9726f4f to 51fe06d
I missed the fact that the commit timeout defaults to the publish timeout - I wanted to create the situation where the commit timeout stays at 30s and the publish timeout is at 0. This means that the master will continue as soon as the CS has been committed (and not wait on the isolated data node).
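A sketch of how those two timeouts could be pinned in the test's node settings (the exact setting keys and builder method are assumptions from memory and may vary by version):

// Sketch: keep the commit timeout at 30s while dropping the publish timeout to 0s,
// so the master proceeds once the cluster state is committed rather than waiting
// for the isolated data node to apply it.
Settings settings = Settings.builder()
        .put("discovery.zen.commit_timeout", "30s")   // assumed setting key
        .put("discovery.zen.publish_timeout", "0s")   // assumed setting key
        .build();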
It's a bug, so I think it should go into the 2.x branch. But there is absolutely no rush. Let's get this in and stable on master first.
Force-pushed from 51fe06d to c5af2a6
@bleskes I wasn't aware of the two different settings, thanks for that! I've included a commit timeout of 30s and a publish timeout of 0s. All tests pass.
In particular, this test ensures we don't restart the master node until we know the index deletion has taken effect on the master and the master-eligible nodes. Closes elastic#16890
Force-pushed from c5af2a6 to d09eefc
NetworkPartition networkPartition = new NetworkUnresponsivePartition(masterNode1, dataNode.get(), getRandom());
internalCluster().setDisruptionScheme(networkPartition);
networkPartition.startDisrupting();
internalCluster().client(masterNode1).admin().indices().prepareDelete("test").setTimeout("1s").get();
internalCluster().client(masterNode1).admin().indices().prepareDelete(idxName).setTimeout("0s").get();
can we comment on why we set the timeout to 0? Something along the lines of: we know this will time out due to the partition and we are going to check manually when it is applied to the master nodes only.
Done
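Presumably the added comment reads something along these lines (a sketch, not the exact wording that was committed):

// We know the delete will time out because the partitioned data node cannot ack the
// cluster state change; use a 0s timeout and verify manually (via assertBusy) that the
// deletion has been applied on the master-eligible nodes only.
internalCluster().client(masterNode1).admin().indices().prepareDelete(idxName).setTimeout("0s").get();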
LGTM. Let's give it a go!
In particular, this test ensures we don't restart the master node until
we know the index deletion has taken effect on master. This overcomes a
current known issue where a delete can return before cluster state
changes take effect.
Closes #16890