[CI] ClusterClientIT failure #35450

romseygeek · 2018-11-12T13:08:31Z

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.5+periodic/105

1> [2018-11-12T12:23:40,510][INFO ][o.e.c.ClusterClientIT    ] [testClusterHealthYellowIndicesLevel] after test
FAILURE 0.43s | ClusterClientIT.testClusterHealthYellowIndicesLevel <<< FAILURES!
   > Throwable #1: java.lang.AssertionError: 
   > Expected: <10>
   >      but: was <11>
   > 	at __randomizedtesting.SeedInfo.seed([FBF52C08AF7A3624:6D7C7AD6072CE6B1]:0)
   > 	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
   > 	at org.elasticsearch.client.ClusterClientIT.assertYellowShards(ClusterClientIT.java:211)
   > 	at org.elasticsearch.client.ClusterClientIT.testClusterHealthYellowIndicesLevel(ClusterClientIT.java:199)
   > 	at java.lang.Thread.run(Thread.java:748)

 2> REPRODUCE WITH: ./gradlew :client:rest-high-level:integTestRunner -Dtests.seed=FBF52C08AF7A3624 -Dtests.class=org.elasticsearch.client.ClusterClientIT -Dtests.method="testClusterHealthYellowIndicesLevel" -Dtests.security.manager=true -Dtests.locale=el-CY -Dtests.timezone=Asia/Qyzylorda -Dcompiler.java=11 -Druntime.java=8

The text was updated successfully, but these errors were encountered:

elasticmachine · 2018-11-12T13:08:32Z

Pinging @elastic/es-core-infra

romseygeek · 2018-11-12T13:10:27Z

The cause seems to be a .tasks index hanging around from somewhere:

1> [2018-11-12T12:23:40,393][INFO ][o.e.c.ClusterClientIT    ] [testClusterHealthYellowIndicesLevel] Shard stats
  1> index2 4 p STARTED    0 230b 127.0.0.1 node-0
  1> index2 4 r UNASSIGNED                  
  1> index2 2 p STARTED    0   0b 127.0.0.1 node-0
  1> index2 2 r UNASSIGNED                  
  1> index2 3 p STARTED    0   0b 127.0.0.1 node-0
  1> index2 3 r UNASSIGNED                  
  1> index2 1 p STARTED    0   0b 127.0.0.1 node-0
  1> index2 1 r UNASSIGNED                  
  1> index2 0 p STARTED    0 401b 127.0.0.1 node-0
  1> index2 0 r UNASSIGNED                  
  1> index  4 p STARTED    0 230b 127.0.0.1 node-0
  1> index  4 r UNASSIGNED                  
  1> index  2 p STARTED    0   0b 127.0.0.1 node-0
  1> index  2 r UNASSIGNED                  
  1> index  3 p STARTED    0   0b 127.0.0.1 node-0
  1> index  3 r UNASSIGNED                  
  1> index  1 p STARTED    0   0b 127.0.0.1 node-0
  1> index  1 r UNASSIGNED                  
  1> index  0 p STARTED    0 230b 127.0.0.1 node-0
  1> index  0 r UNASSIGNED                  
  1> .tasks 0 p STARTED    0 230b 127.0.0.1 node-0

The test is only expecting the 10 shards it has asked for to be present in the cluster state.

romseygeek · 2018-11-12T13:21:22Z

Possibly related:

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=fedora/31/console

 > Throwable #1: java.lang.AssertionError: 
   > Expected: <0>
   >      but: was <1>
   > 	at __randomizedtesting.SeedInfo.seed([BC0FDBA62D2461C3:139D580BC0F6AABE]:0)
   > 	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
   > 	at org.elasticsearch.client.ClusterClientIT.assertNoIndices(ClusterClientIT.java:286)
   > 	at org.elasticsearch.client.ClusterClientIT.testClusterHealthGreen(ClusterClientIT.java:169)
   > 	at java.lang.Thread.run(Thread.java:748)

expected no indices, but cluster state contains one

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+periodic/129/console

FAILURE 0.14s | ClusterClientIT.testClusterHealthYellowSpecificIndex <<< FAILURES!
   > Throwable #1: java.lang.AssertionError: 
   > Expected: <50.0>
   >      but: was <60.0>
   > 	at __randomizedtesting.SeedInfo.seed([D90179B8FBC5571B:6072232686183FE4]:0)
   > 	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
   > 	at org.elasticsearch.client.ClusterClientIT.testClusterHealthYellowSpecificIndex(ClusterClientIT.java:236)
   > 	at java.lang.Thread.run(Thread.java:748)

expected 50% active shards, but was 60% (consistent with one extra active shard in the cluster state)

dimitris-athanasiou · 2018-11-16T16:51:26Z

Another hit in: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=java8,nodes=virtual&&linux/62/console

albertzaharovits · 2018-11-22T14:29:53Z

Another one:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-darwin-compatibility/74/console

./gradlew :client:rest-high-level:integTestRunner \
  -Dtests.seed=22CF8C14098629BE \
  -Dtests.class=org.elasticsearch.client.ClusterClientIT \
  -Dtests.method="testClusterHealthYellowSpecificIndex" \
  -Dtests.security.manager=true \
  -Dtests.locale=es-PE \
  -Dtests.timezone=Pacific/Pitcairn \
  -Dcompiler.java=11 \
  -Druntime.java=8

java.lang.AssertionError: 
Expected: <50.0>
     but: was <60.0>
	at __randomizedtesting.SeedInfo.seed([22CF8C14098629BE:9BBCD68A745B4141]:0)
	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
	at org.junit.Assert.assertThat(Assert.java:956)
	at org.junit.Assert.assertThat(Assert.java:923)
	at org.elasticsearch.client.ClusterClientIT.testClusterHealthYellowSpecificIndex(ClusterClientIT.java:236)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
	at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at java.lang.Thread.run(Thread.java:748)

iverase · 2018-11-26T10:24:18Z

Another case like last one in master:

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=fedora/81/console

./gradlew :client:rest-high-level:integTestRunner \
  -Dtests.seed=9EE7AAEEECBF7CE3 \
  -Dtests.class=org.elasticsearch.client.ClusterClientIT \
  -Dtests.method="testClusterHealthYellowSpecificIndex" \
  -Dtests.security.manager=true \
  -Dtests.locale=ar-YE \
  -Dtests.timezone=Europe/Prague \
  -Dcompiler.java=11 \
  -Druntime.java=8

cbuescher · 2018-11-29T10:39:35Z

Another one: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+g1gc/131/console

Doesn't reproduce locally.

./gradlew :client:rest-high-level:integTestRunner \
  -Dtests.seed=C8335A1B48161974 \
  -Dtests.class=org.elasticsearch.client.ClusterClientIT \
  -Dtests.method="testClusterHealthYellowClusterLevel" \
  -Dtests.security.manager=true \
  -Dtests.jvm.argline="-XX:-UseConcMarkSweepGC -XX:+UseG1GC" \
  -Dtests.locale=fr-CH \
  -Dtests.timezone=Asia/Irkutsk \
  -Dcompiler.java=11 \
  -Druntime.java=8

11:19:20 FAILURE 0.25s | ClusterClientIT.testClusterHealthYellowClusterLevel <<< FAILURES!
11:19:20    > Throwable #1: java.lang.AssertionError: 
11:19:20    > Expected: <2>
11:19:20    >      but: was <3>
11:19:20    > 	at __randomizedtesting.SeedInfo.seed([C8335A1B48161974:55393482F88B5C5A]:0)
11:19:20    > 	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
11:19:20    > 	at org.elasticsearch.client.ClusterClientIT.assertYellowShards(ClusterClientIT.java:207)
11:19:20    > 	at org.elasticsearch.client.ClusterClientIT.testClusterHealthYellowClusterLevel(ClusterClientIT.java:181)
11:19:20    > 	at java.lang.Thread.run(Thread.java:748)

cbuescher · 2018-11-29T10:47:32Z

I muted this test on master and 6.x with c412ecd and bf61173

cbuescher · 2018-12-17T13:22:40Z

I think we got confirmation in #35644 now that @romseygeek's suspicion in #35450 (comment) was right and there are documents from a .tasks index lingering around.

droberts195 · 2019-01-09T13:00:09Z

There seem to be a number tests in ClusterClientIT failing with a similar problem, certainly at least testClusterHealthYellowIndicesLevel and testClusterHealthYellowClusterLevel, and only testClusterHealthYellowClusterLevel is muted.

I just ran into testClusterHealthYellowIndicesLevel failing in an unrelated PR build (https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request-1/3963/consoleText) with the exact same problem as in #35450 (comment).

Tim-Brooks · 2019-01-10T19:01:08Z

Another failure of:

org.elasticsearch.client.ClusterClientIT testClusterHealthYellowIndicesLevel

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+matrix-java-periodic/ES_BUILD_JAVA=openjdk12,ES_RUNTIME_JAVA=java11,nodes=virtual&&linux/172/console

Since we've had two in the last two days it seems like we might be at the point of muting these?

cbuescher · 2019-02-04T15:51:13Z

There are still failures today in testClusterHealthYellowSpecificIndex on master, probably missed while muting:

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=centos-7/223/console
Will add muting to these as well on 6.x and master

cbuescher · 2019-02-04T16:14:32Z

Muted also testClusterHealthYellowSpecificIndex on master with 1899658

pcsanwald · 2019-02-07T21:09:17Z

I've been trying to reproduce this issue by running the test suite in a couple different ways:

run each test in a loop with tests.iter=1000
run ClusterClientIT in a bash loop to try and replicate.
same as last step, but with -Dtests.class=org.elasticsearch.client.*, because looking at the output of this failure, we get: All tests run in this JVM: [IndicesClientIT, IndicesClientDocumentationIT, MigrationClientDocumentationIT, ClusterClientIT]

thus far I've been unsuccessful in reproducing locally.

pcsanwald · 2019-02-11T23:23:21Z

Still can't replicate the failures, which do stem from a lingering .tasks index, possibly from MachingLearningIT which does create a number of tasks, but also has @After hooks in place to clean up the indices.

There are two basic approaches I can think of to take this forward:

Find out where/how the .tasks index is getting created, and make sure it gets cleaned up correctly. This involves reliably reproducing the issue, something I have failed to do so far.
Update the assertions in ClusterClientIT to assume less about the state of the cluster it's using for this test. An example here is this snippet:

createIndex("index", Settings.EMPTY);
createIndex("index2", Settings.EMPTY);
... // snip for brevity
assertThat(response.getIndices().size(), equalTo(2));

If the goal of this test is to ensure that the indices we asked to be created are indeed created, then it's better to assert on the existence of each named index, instead of just on size. an example of how this assertion can go awry is if .tasks and index2 are present, and index is missing: asserting on two indices will incorrectly pass.

I'd love some feedback on this, particularly if anyone feels strategy 2 is a bad idea. It is clearly not easy to be deterministic about the entire cluster state (or I wouldn't be investigating this issue), and therefore I think we are best served by more targeted assertions that are testing ClusterClient functionality specifically.

droberts195 · 2019-02-12T10:27:27Z

I am +1 on strategy 2. It's best if each test just asserts on exactly what it changed rather than the whole cluster state.

It is quite possible that the ML tests in this suite are responsible for the .tasks index being created in between tests after all indices have been removed, meaning the next test starts with an index existing. The ML production code cannot do anything about this, as it has no idea some other part of the system is creating a .tasks index. We were plagued with problems due to interactions between tests caused by async side effects of actions in Summer 2017. At that time it was generally caused by the .ml-notifications index, which also gets created asynchronously. The solution to that was to wait for pending tasks to complete before deleting all indices in between tests. If you look in XPackRestIT.cleanup you'll see it calls ESRestTestCase.waitForPendingTasks. This cleanup is happening after deleting higher level entities like ML jobs and datafeeds but before the base class cleanup in ESRestTestCase.cleanUpCluster that deletes all the indices. If you look in ESRestHighLevelClientTestCase you'll see it doesn't have an @After method that calls ESRestTestCase.waitForPendingTasks. To avoid interactions between tests it probably should. (Or if people disagree with this then it needs to be clearly documented that high level REST client integration tests must not assume they are starting on a completely empty cluster.)

TL;DR:

It's probably possible to better isolate high level REST client integration tests so that there is less chance of one test interfering with another by adding an @After method to ESRestHighLevelClientTestCase that calls ESRestTestCase.waitForPendingTasks
Even if this is done I am +1 on the idea of most tests only asserting on what they changed and not the entire cluster state

romseygeek added >test-failure Triaged test failures from CI :Core/Java High Level REST Client labels Nov 12, 2018

cbuescher mentioned this issue Dec 11, 2018

[CI] SearchIT.testCountAllIndicesNoQuery failed on master #35644

Closed

pcsanwald self-assigned this Jan 31, 2019

pcsanwald mentioned this issue Jan 31, 2019

mute test, as this one is failing also per #35450 #38132

Merged

pcsanwald pushed a commit that referenced this issue Feb 1, 2019

mute test, as this one is failing also per #35450 (#38132)

0d56955

cbuescher mentioned this issue Feb 4, 2019

Mute ClusterClientIT#testClusterHealthYellowSpecificIndex #38343

Merged

pcsanwald mentioned this issue Feb 13, 2019

ClusterClientIT refactor #38872

Merged

pcsanwald closed this as completed in #38872 Feb 15, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CI] ClusterClientIT failure #35450

[CI] ClusterClientIT failure #35450

romseygeek commented Nov 12, 2018

elasticmachine commented Nov 12, 2018

romseygeek commented Nov 12, 2018

romseygeek commented Nov 12, 2018

dimitris-athanasiou commented Nov 16, 2018

albertzaharovits commented Nov 22, 2018

iverase commented Nov 26, 2018

cbuescher commented Nov 29, 2018

cbuescher commented Nov 29, 2018

cbuescher commented Dec 17, 2018

droberts195 commented Jan 9, 2019

Tim-Brooks commented Jan 10, 2019

cbuescher commented Feb 4, 2019

cbuescher commented Feb 4, 2019

pcsanwald commented Feb 7, 2019

pcsanwald commented Feb 11, 2019 •

edited

Loading

droberts195 commented Feb 12, 2019

[CI] ClusterClientIT failure #35450

[CI] ClusterClientIT failure #35450

Comments

romseygeek commented Nov 12, 2018

elasticmachine commented Nov 12, 2018

romseygeek commented Nov 12, 2018

romseygeek commented Nov 12, 2018

dimitris-athanasiou commented Nov 16, 2018

albertzaharovits commented Nov 22, 2018

iverase commented Nov 26, 2018

cbuescher commented Nov 29, 2018

cbuescher commented Nov 29, 2018

cbuescher commented Dec 17, 2018

droberts195 commented Jan 9, 2019

Tim-Brooks commented Jan 10, 2019

cbuescher commented Feb 4, 2019

cbuescher commented Feb 4, 2019

pcsanwald commented Feb 7, 2019

pcsanwald commented Feb 11, 2019 • edited Loading

droberts195 commented Feb 12, 2019

pcsanwald commented Feb 11, 2019 •

edited

Loading