Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes the DiscoveryWithServiceDisruptionsIT#testIndicesDeleted test #16917

Closed

Conversation

abeyad
Copy link

@abeyad abeyad commented Mar 2, 2016

In particular, this test ensures we don't restart the master node until
we know the index deletion has taken effect on master. This overcomes a
current known issue where a delete can return before cluster state
changes take effect.

Closes #16890

@abeyad abeyad changed the title Fixes the DiscoveryWithServiceDisruptionsIT#testIndicesDeleted test WIP: Fixes the DiscoveryWithServiceDisruptionsIT#testIndicesDeleted test Mar 2, 2016
@abeyad abeyad force-pushed the fix-indices-deleted-discovery-test branch 5 times, most recently from 0c60e64 to a2ab8f3 Compare March 2, 2016 22:42
@abeyad
Copy link
Author

abeyad commented Mar 2, 2016

@bleskes Your feedback would be appreciated. I tried taking the route of registering a cluster state listener (the code is commented out), but there was no way to use it to make assertions for the unit test while executing in the listener's thread. The current approach check's the cluster state periodically before proceeding with the node restart. I'm happy to hear your suggestions if there is a better way.

break;
}
try {
Thread.sleep(sleepTime);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is usually a bad sign. We should use sleep anywhere. Sometimes it's needed but we try give all the utilities to make sure no one used it explicitly. In this case we have assert busy:

        assertBusy(() -> {
            final ClusterState currState = internalCluster().clusterService(masterNode1).state();
            assertTrue("index not deleted", currState.metaData().hasIndex("test") == false && currState.status() == ClusterState.ClusterStateStatus.APPLIED);
        });

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we reduce the publish timeout to 0 (which will make the test faster), we need to use the same assertBusy technique on masterNode2 to make sure it has process the change as well.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this tip @bleskes, I didn't realize we already had an assertBusy

@bleskes
Copy link
Contributor

bleskes commented Mar 3, 2016

Thx @abeyad . I left some suggestion.

This overcomes a current known issue where a delete can return before cluster state
changes take effect.

Note that this is not an issue - the change times out due to the disruption that the test add. We report correctly by indicating that the change is not acked in the response, but the test knows that and therefore doesn't check for it. I do have some long term plans to make sure that when the call returns it is at least guaranteed to be processed on the current master, which will make things more intuitive. I don't think we want to always wait on all the nodes (with no timeout) before returning.

@clintongormley clintongormley changed the title WIP: Fixes the DiscoveryWithServiceDisruptionsIT#testIndicesDeleted test Fixes the DiscoveryWithServiceDisruptionsIT#testIndicesDeleted test Mar 3, 2016
@abeyad abeyad force-pushed the fix-indices-deleted-discovery-test branch from a2ab8f3 to 9726f4f Compare March 3, 2016 16:19
@abeyad
Copy link
Author

abeyad commented Mar 3, 2016

I do have some long term plans to make sure that when the call returns it is at least guaranteed to be processed on the current master, which will make things more intuitive.

That would be awesome. And I changed the commit comment to reflect what you mentioned above.

@abeyad
Copy link
Author

abeyad commented Mar 3, 2016

@bleskes I made the changes, except for the PUBLISH_TIMEOUT_SETTING which caused a FailedToCommitClusterStateException[timed out while waiting for enough masters to ack sent cluster state. [1] left] exception.

@abeyad
Copy link
Author

abeyad commented Mar 3, 2016

Also, do you think the feature itself (i.e. #11665) belongs in 2.3 as well? I've held off for now until the test issue has been resolved.

@abeyad abeyad force-pushed the fix-indices-deleted-discovery-test branch from 9726f4f to 51fe06d Compare March 3, 2016 21:45
@bleskes
Copy link
Contributor

bleskes commented Mar 7, 2016

I made the changes, except for the PUBLISH_TIMEOUT_SETTING which caused a FailedToCommitClusterStateException[timed out while waiting for enough masters to ack sent cluster state. [1] left] exception.

I missed the fact that the commit timeout defaults to publish time out - I wanted to create the situation that the commit timeout stays 30s and publish timeout is at 0. This means that the master will continue as soon as the CS as been comitted (and not wait on the isolated data node).

@bleskes
Copy link
Contributor

bleskes commented Mar 7, 2016

Also, do you think the feature itself (i.e. #11665) belongs in 2.3 as well? I've held off for now until the test issue has been resolved.

It's a bug so I think it should go into the 2.x branch. But there is absolutely no rush. Let's get this in and stable on master first.

@abeyad abeyad force-pushed the fix-indices-deleted-discovery-test branch from 51fe06d to c5af2a6 Compare March 7, 2016 15:03
@abeyad
Copy link
Author

abeyad commented Mar 7, 2016

@bleskes I wasn't aware of the two different settings, thanks for that! I've included a commit timeout of 30s and publish timeout of 0s. All tests pass.

In particular, this test ensures we don't restart the master node until
we know the index deletion has taken effect on master and the master
eligible nodes.

Closes elastic#16890
@abeyad abeyad force-pushed the fix-indices-deleted-discovery-test branch from c5af2a6 to d09eefc Compare March 7, 2016 15:08
NetworkPartition networkPartition = new NetworkUnresponsivePartition(masterNode1, dataNode.get(), getRandom());
internalCluster().setDisruptionScheme(networkPartition);
networkPartition.startDisrupting();
internalCluster().client(masterNode1).admin().indices().prepareDelete("test").setTimeout("1s").get();
internalCluster().client(masterNode1).admin().indices().prepareDelete(idxName).setTimeout("0s").get();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we comment on why we set the timeout to 0? something a long the lines that we know this will time out due to the partition and we are going to check manually when it is applied to master nodes only.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@bleskes
Copy link
Contributor

bleskes commented Mar 9, 2016

LGTM. Let's give it a go!

@abeyad abeyad closed this in e411cbb Mar 9, 2016
@clintongormley clintongormley added the >test Issues or PRs that are addressing/adding tests label Mar 9, 2016
abeyad pushed a commit that referenced this pull request Mar 10, 2016
This commit backports commit e411cbb
from master to 2.x.

Relates #16917
@clintongormley clintongormley added :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. and removed :Cluster labels Feb 13, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. >test Issues or PRs that are addressing/adding tests v5.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants