Fixes the DiscoveryWithServiceDisruptionsIT#testIndicesDeleted test #16917
Conversation
Force-pushed from 0c60e64 to a2ab8f3
@bleskes Your feedback would be appreciated. I tried taking the route of registering a cluster state listener (the code is commented out), but there was no way to use it to make assertions for the unit test while executing in the listener's thread. The current approach checks the cluster state periodically before proceeding with the node restart. I'm happy to hear your suggestions if there is a better way.
        break;
    }
    try {
        Thread.sleep(sleepTime);
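For context, a hedged sketch of what that polling approach might look like (only sleepTime and the cluster-state check appear in the diff; the loop bound and counter are illustrative, not the exact test code):

// Illustrative sketch only: poll the master's cluster state until the index
// deletion is visible, then fall through to the node restart.
long waited = 0;
final long sleepTime = 100;      // assumed poll interval in ms
final long maxWaitTime = 30_000; // assumed upper bound in ms
while (waited < maxWaitTime) {
    ClusterState state = internalCluster().clusterService(masterNode1).state();
    if (state.metaData().hasIndex("test") == false) {
        break;
    }
    try {
        Thread.sleep(sleepTime);
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        break;
    }
    waited += sleepTime;
}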
this is usually a bad sign. We shouldn't use sleep anywhere. Sometimes it's needed, but we try to provide all the utilities to make sure no one uses it explicitly. In this case we have assertBusy:
assertBusy(() -> {
    final ClusterState currState = internalCluster().clusterService(masterNode1).state();
    assertTrue("index not deleted",
        currState.metaData().hasIndex("test") == false
            && currState.status() == ClusterState.ClusterStateStatus.APPLIED);
});
if we reduce the publish timeout to 0 (which will make the test faster), we need to use the same assertBusy technique on masterNode2 to make sure it has processed the change as well.
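Concretely, a sketch of that same check against the second master-eligible node (mirroring the snippet above; masterNode2 is assumed to be the test's other master-eligible node):

// Sketch: repeat the busy-assert against the other master-eligible node, so the test
// only proceeds once masterNode2 has also applied the deletion.
assertBusy(() -> {
    final ClusterState currState = internalCluster().clusterService(masterNode2).state();
    assertTrue("index not deleted on masterNode2",
        currState.metaData().hasIndex("test") == false
            && currState.status() == ClusterState.ClusterStateStatus.APPLIED);
});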
Thanks for this tip @bleskes, I didn't realize we already had an assertBusy utility.
Thx @abeyad. I left some suggestions.
Note that this is not an issue - the change times out due to the disruption that the test adds. We report it correctly by indicating in the response that the change is not acked, but the test knows that and therefore doesn't check for it. I do have some long term plans to make sure that when the call returns it is at least guaranteed to be processed on the current master, which will make things more intuitive. I don't think we want to always wait on all the nodes (with no timeout) before returning.
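As a rough illustration of what "not acked" looks like on the caller side (assuming the standard isAcknowledged() accessor on the delete response; this is not code from the PR):

// Sketch: with the data node partitioned and a 0s timeout, the delete returns before
// every node has applied the change, so the response is reported as not acknowledged.
DeleteIndexResponse response = internalCluster().client(masterNode1).admin().indices()
        .prepareDelete(idxName).setTimeout("0s").get();
// The test intentionally does not assert on this flag; it checks the master's
// cluster state via assertBusy instead.
boolean acked = response.isAcknowledged(); // expected to be false here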
Force-pushed from a2ab8f3 to 9726f4f
That would be awesome. And I changed the commit comment to reflect what you mentioned above.
@bleskes I made the changes, except for the …
Also, do you think the feature itself (i.e. #11665) belongs in 2.3 as well? I've held off for now until the test issue has been resolved.
Force-pushed from 9726f4f to 51fe06d
I missed the fact that the commit timeout defaults to the publish timeout - I wanted to create the situation where the commit timeout stays at 30s and the publish timeout is at 0. This means that the master will continue as soon as the CS has been committed (and not wait on the isolated data node).
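A sketch of how those two timeouts could be pinned in the test's node settings (the exact setting keys and builder method are assumptions from memory and may vary by version):

// Sketch: keep the commit timeout at 30s while dropping the publish timeout to 0s,
// so the master proceeds once the cluster state is committed rather than waiting
// for the isolated data node to apply it.
Settings settings = Settings.builder()
        .put("discovery.zen.commit_timeout", "30s")   // assumed setting key
        .put("discovery.zen.publish_timeout", "0s")   // assumed setting key
        .build();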
It's a bug, so I think it should go into the 2.x branch. But there is absolutely no rush. Let's get this in and stable on master first.
Force-pushed from 51fe06d to c5af2a6
@bleskes I wasn't aware of the two different settings, thanks for that! I've included a commit timeout of 30s and a publish timeout of 0s. All tests pass.
In particular, this test ensures we don't restart the master node until we know the index deletion has taken effect on the master and the master-eligible nodes. Closes elastic#16890
Force-pushed from c5af2a6 to d09eefc
NetworkPartition networkPartition = new NetworkUnresponsivePartition(masterNode1, dataNode.get(), getRandom());
internalCluster().setDisruptionScheme(networkPartition);
networkPartition.startDisrupting();
internalCluster().client(masterNode1).admin().indices().prepareDelete("test").setTimeout("1s").get();
internalCluster().client(masterNode1).admin().indices().prepareDelete(idxName).setTimeout("0s").get();
can we comment on why we set the timeout to 0? Something along the lines of: we know this will time out due to the partition and we are going to check manually when it is applied to the master nodes only.
Done
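Presumably the added comment reads something along these lines (a sketch, not the exact wording that was committed):

// We know the delete will time out because the partitioned data node cannot ack the
// cluster state change; use a 0s timeout and verify manually (via assertBusy) that the
// deletion has been applied on the master-eligible nodes only.
internalCluster().client(masterNode1).admin().indices().prepareDelete(idxName).setTimeout("0s").get();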
LGTM. Let's give it a go!
In particular, this test ensures we don't restart the master node until
we know the index deletion has taken effect on master. This overcomes a
current known issue where a delete can return before cluster state
changes take effect.
Closes #16890