Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] ClusterClientIT failure #35450

Closed
romseygeek opened this issue Nov 12, 2018 · 16 comments · Fixed by #38872
Closed

[CI] ClusterClientIT failure #35450

romseygeek opened this issue Nov 12, 2018 · 16 comments · Fixed by #38872
Assignees
Labels
>test-failure Triaged test failures from CI

Comments

@romseygeek
Copy link
Contributor

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.5+periodic/105

1> [2018-11-12T12:23:40,510][INFO ][o.e.c.ClusterClientIT    ] [testClusterHealthYellowIndicesLevel] after test
FAILURE 0.43s | ClusterClientIT.testClusterHealthYellowIndicesLevel <<< FAILURES!
   > Throwable #1: java.lang.AssertionError: 
   > Expected: <10>
   >      but: was <11>
   > 	at __randomizedtesting.SeedInfo.seed([FBF52C08AF7A3624:6D7C7AD6072CE6B1]:0)
   > 	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
   > 	at org.elasticsearch.client.ClusterClientIT.assertYellowShards(ClusterClientIT.java:211)
   > 	at org.elasticsearch.client.ClusterClientIT.testClusterHealthYellowIndicesLevel(ClusterClientIT.java:199)
   > 	at java.lang.Thread.run(Thread.java:748)
 2> REPRODUCE WITH: ./gradlew :client:rest-high-level:integTestRunner -Dtests.seed=FBF52C08AF7A3624 -Dtests.class=org.elasticsearch.client.ClusterClientIT -Dtests.method="testClusterHealthYellowIndicesLevel" -Dtests.security.manager=true -Dtests.locale=el-CY -Dtests.timezone=Asia/Qyzylorda -Dcompiler.java=11 -Druntime.java=8
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra

@romseygeek
Copy link
Contributor Author

The cause seems to be a .tasks index hanging around from somewhere:

1> [2018-11-12T12:23:40,393][INFO ][o.e.c.ClusterClientIT    ] [testClusterHealthYellowIndicesLevel] Shard stats
  1> index2 4 p STARTED    0 230b 127.0.0.1 node-0
  1> index2 4 r UNASSIGNED                  
  1> index2 2 p STARTED    0   0b 127.0.0.1 node-0
  1> index2 2 r UNASSIGNED                  
  1> index2 3 p STARTED    0   0b 127.0.0.1 node-0
  1> index2 3 r UNASSIGNED                  
  1> index2 1 p STARTED    0   0b 127.0.0.1 node-0
  1> index2 1 r UNASSIGNED                  
  1> index2 0 p STARTED    0 401b 127.0.0.1 node-0
  1> index2 0 r UNASSIGNED                  
  1> index  4 p STARTED    0 230b 127.0.0.1 node-0
  1> index  4 r UNASSIGNED                  
  1> index  2 p STARTED    0   0b 127.0.0.1 node-0
  1> index  2 r UNASSIGNED                  
  1> index  3 p STARTED    0   0b 127.0.0.1 node-0
  1> index  3 r UNASSIGNED                  
  1> index  1 p STARTED    0   0b 127.0.0.1 node-0
  1> index  1 r UNASSIGNED                  
  1> index  0 p STARTED    0 230b 127.0.0.1 node-0
  1> index  0 r UNASSIGNED                  
  1> .tasks 0 p STARTED    0 230b 127.0.0.1 node-0

The test is only expecting the 10 shards it has asked for to be present in the cluster state.

@romseygeek
Copy link
Contributor Author

Possibly related:

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=fedora/31/console

 > Throwable #1: java.lang.AssertionError: 
   > Expected: <0>
   >      but: was <1>
   > 	at __randomizedtesting.SeedInfo.seed([BC0FDBA62D2461C3:139D580BC0F6AABE]:0)
   > 	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
   > 	at org.elasticsearch.client.ClusterClientIT.assertNoIndices(ClusterClientIT.java:286)
   > 	at org.elasticsearch.client.ClusterClientIT.testClusterHealthGreen(ClusterClientIT.java:169)
   > 	at java.lang.Thread.run(Thread.java:748)
  • expected no indices, but cluster state contains one

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+periodic/129/console

FAILURE 0.14s | ClusterClientIT.testClusterHealthYellowSpecificIndex <<< FAILURES!
   > Throwable #1: java.lang.AssertionError: 
   > Expected: <50.0>
   >      but: was <60.0>
   > 	at __randomizedtesting.SeedInfo.seed([D90179B8FBC5571B:6072232686183FE4]:0)
   > 	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
   > 	at org.elasticsearch.client.ClusterClientIT.testClusterHealthYellowSpecificIndex(ClusterClientIT.java:236)
   > 	at java.lang.Thread.run(Thread.java:748)
  • expected 50% active shards, but was 60% (consistent with one extra active shard in the cluster state)

@albertzaharovits
Copy link
Contributor

Another one:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-darwin-compatibility/74/console

./gradlew :client:rest-high-level:integTestRunner \
  -Dtests.seed=22CF8C14098629BE \
  -Dtests.class=org.elasticsearch.client.ClusterClientIT \
  -Dtests.method="testClusterHealthYellowSpecificIndex" \
  -Dtests.security.manager=true \
  -Dtests.locale=es-PE \
  -Dtests.timezone=Pacific/Pitcairn \
  -Dcompiler.java=11 \
  -Druntime.java=8
java.lang.AssertionError: 
Expected: <50.0>
     but: was <60.0>
	at __randomizedtesting.SeedInfo.seed([22CF8C14098629BE:9BBCD68A745B4141]:0)
	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
	at org.junit.Assert.assertThat(Assert.java:956)
	at org.junit.Assert.assertThat(Assert.java:923)
	at org.elasticsearch.client.ClusterClientIT.testClusterHealthYellowSpecificIndex(ClusterClientIT.java:236)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
	at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at java.lang.Thread.run(Thread.java:748)

@iverase
Copy link
Contributor

iverase commented Nov 26, 2018

Another case like last one in master:

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=fedora/81/console

./gradlew :client:rest-high-level:integTestRunner \
  -Dtests.seed=9EE7AAEEECBF7CE3 \
  -Dtests.class=org.elasticsearch.client.ClusterClientIT \
  -Dtests.method="testClusterHealthYellowSpecificIndex" \
  -Dtests.security.manager=true \
  -Dtests.locale=ar-YE \
  -Dtests.timezone=Europe/Prague \
  -Dcompiler.java=11 \
  -Druntime.java=8

@cbuescher
Copy link
Member

Another one: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+g1gc/131/console

Doesn't reproduce locally.

./gradlew :client:rest-high-level:integTestRunner \
  -Dtests.seed=C8335A1B48161974 \
  -Dtests.class=org.elasticsearch.client.ClusterClientIT \
  -Dtests.method="testClusterHealthYellowClusterLevel" \
  -Dtests.security.manager=true \
  -Dtests.jvm.argline="-XX:-UseConcMarkSweepGC -XX:+UseG1GC" \
  -Dtests.locale=fr-CH \
  -Dtests.timezone=Asia/Irkutsk \
  -Dcompiler.java=11 \
  -Druntime.java=8
11:19:20 FAILURE 0.25s | ClusterClientIT.testClusterHealthYellowClusterLevel <<< FAILURES!
11:19:20    > Throwable #1: java.lang.AssertionError: 
11:19:20    > Expected: <2>
11:19:20    >      but: was <3>
11:19:20    > 	at __randomizedtesting.SeedInfo.seed([C8335A1B48161974:55393482F88B5C5A]:0)
11:19:20    > 	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
11:19:20    > 	at org.elasticsearch.client.ClusterClientIT.assertYellowShards(ClusterClientIT.java:207)
11:19:20    > 	at org.elasticsearch.client.ClusterClientIT.testClusterHealthYellowClusterLevel(ClusterClientIT.java:181)
11:19:20    > 	at java.lang.Thread.run(Thread.java:748)

@cbuescher
Copy link
Member

I muted this test on master and 6.x with c412ecd and bf61173

@cbuescher
Copy link
Member

I think we got confirmation in #35644 now that @romseygeek's suspicion in #35450 (comment) was right and there are documents from a .tasks index lingering around.

@droberts195
Copy link
Contributor

There seem to be a number tests in ClusterClientIT failing with a similar problem, certainly at least testClusterHealthYellowIndicesLevel and testClusterHealthYellowClusterLevel, and only testClusterHealthYellowClusterLevel is muted.

I just ran into testClusterHealthYellowIndicesLevel failing in an unrelated PR build (https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request-1/3963/consoleText) with the exact same problem as in #35450 (comment).

@Tim-Brooks
Copy link
Contributor

Another failure of:

org.elasticsearch.client.ClusterClientIT testClusterHealthYellowIndicesLevel

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+matrix-java-periodic/ES_BUILD_JAVA=openjdk12,ES_RUNTIME_JAVA=java11,nodes=virtual&&linux/172/console

Since we've had two in the last two days it seems like we might be at the point of muting these?

@cbuescher
Copy link
Member

There are still failures today in testClusterHealthYellowSpecificIndex on master, probably missed while muting:

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=centos-7/223/console
Will add muting to these as well on 6.x and master

@cbuescher
Copy link
Member

Muted also testClusterHealthYellowSpecificIndex on master with 1899658

@pcsanwald
Copy link
Contributor

I've been trying to reproduce this issue by running the test suite in a couple different ways:

  1. run each test in a loop with tests.iter=1000
  2. run ClusterClientIT in a bash loop to try and replicate.
  3. same as last step, but with -Dtests.class=org.elasticsearch.client.*, because looking at the output of this failure, we get: All tests run in this JVM: [IndicesClientIT, IndicesClientDocumentationIT, MigrationClientDocumentationIT, ClusterClientIT]

thus far I've been unsuccessful in reproducing locally.

@pcsanwald
Copy link
Contributor

pcsanwald commented Feb 11, 2019

Still can't replicate the failures, which do stem from a lingering .tasks index, possibly from MachingLearningIT which does create a number of tasks, but also has @After hooks in place to clean up the indices.

There are two basic approaches I can think of to take this forward:

  1. Find out where/how the .tasks index is getting created, and make sure it gets cleaned up correctly. This involves reliably reproducing the issue, something I have failed to do so far.
  2. Update the assertions in ClusterClientIT to assume less about the state of the cluster it's using for this test. An example here is this snippet:
createIndex("index", Settings.EMPTY);
createIndex("index2", Settings.EMPTY);
... // snip for brevity
assertThat(response.getIndices().size(), equalTo(2));

If the goal of this test is to ensure that the indices we asked to be created are indeed created, then it's better to assert on the existence of each named index, instead of just on size. an example of how this assertion can go awry is if .tasks and index2 are present, and index is missing: asserting on two indices will incorrectly pass.

I'd love some feedback on this, particularly if anyone feels strategy 2 is a bad idea. It is clearly not easy to be deterministic about the entire cluster state (or I wouldn't be investigating this issue), and therefore I think we are best served by more targeted assertions that are testing ClusterClient functionality specifically.

@droberts195
Copy link
Contributor

I am +1 on strategy 2. It's best if each test just asserts on exactly what it changed rather than the whole cluster state.

It is quite possible that the ML tests in this suite are responsible for the .tasks index being created in between tests after all indices have been removed, meaning the next test starts with an index existing. The ML production code cannot do anything about this, as it has no idea some other part of the system is creating a .tasks index. We were plagued with problems due to interactions between tests caused by async side effects of actions in Summer 2017. At that time it was generally caused by the .ml-notifications index, which also gets created asynchronously. The solution to that was to wait for pending tasks to complete before deleting all indices in between tests. If you look in XPackRestIT.cleanup you'll see it calls ESRestTestCase.waitForPendingTasks. This cleanup is happening after deleting higher level entities like ML jobs and datafeeds but before the base class cleanup in ESRestTestCase.cleanUpCluster that deletes all the indices. If you look in ESRestHighLevelClientTestCase you'll see it doesn't have an @After method that calls ESRestTestCase.waitForPendingTasks. To avoid interactions between tests it probably should. (Or if people disagree with this then it needs to be clearly documented that high level REST client integration tests must not assume they are starting on a completely empty cluster.)

TL;DR:

  • It's probably possible to better isolate high level REST client integration tests so that there is less chance of one test interfering with another by adding an @After method to ESRestHighLevelClientTestCase that calls ESRestTestCase.waitForPendingTasks
  • Even if this is done I am +1 on the idea of most tests only asserting on what they changed and not the entire cluster state

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>test-failure Triaged test failures from CI
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants