[CI] GatewayMetaStateTests.testAtomicityWithFailures #39077

albertzaharovits · 2019-02-18T22:02:55Z

The following has failed in a PR build:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request-1/8533/console

REPRODUCE WITH: ./gradlew :server:unitTest -Dtests.seed=B2CA639652A398C6 -Dtests.class=org.elasticsearch.gateway.GatewayMetaStateTests -Dtests.method="testAtomicityWithFailures" -Dtests.security.manager=true -Dtests.locale=nn-NO -Dtests.timezone=Asia/Jakarta -Dcompiler.java=11 -Druntime.java=8
ERROR   0.30s | GatewayMetaStateTests.testAtomicityWithFailures <<< FAILURES!
   > Throwable #1: java.io.IOException: failed to find global metadata [generation: 0]
   >    at __randomizedtesting.SeedInfo.seed([B2CA639652A398C6:8F4B0E643D33025F]:0)
   >    at org.elasticsearch.gateway.MetaStateService.loadFullState(MetaStateService.java:85)
   >    at org.elasticsearch.gateway.GatewayMetaStateTests.testAtomicityWithFailures(GatewayMetaStateTests.java:427)

It reproduces!

I will follow up with the mute, and I will take a swing at it!


elasticmachine · 2019-02-18T22:02:56Z

Pinging @elastic/es-distributed

albertzaharovits · 2019-02-21T14:05:50Z

@andrershov I am going to relabel this and unassign myself.
Judging by your comment in #39117 (comment), the test is better left disabled than changed. It might change when the main code also changes.
I don't think the >test-failure label is suitable for these kinds of "known issues". For lack of a better one, I have added the >test label.

…40519)

Currently, if the Manifest write is unsuccessful (i.e. WriteStateException is thrown), we perform cleanup of newly created metadata files. However, this is wrong. Consider the following sequence (caught by CI here #39077):

- cluster global data is written **successfully**
- the associated manifest write **fails** (during the fsync, i.e. files have been written)
- deleting (reverting) the manifest files **fails**, metadata is therefore persisted
- deleting (reverting) the cluster global data is **successful**

In this case, when trying to load metadata (after node restart because of dirty WriteStateException), the following exception will happen:

```
java.io.IOException: failed to find global metadata [generation: 0]
```

because the manifest file is referencing a missing global metadata file.

This commit checks if the thrown WriteStateException is dirty and, if it is, we don't perform any cleanup, because a new Manifest file might have been created while its deletion has failed. In the future, we might add a more fine-grained check: perform the cleanup if WriteStateException is dirty but Manifest deletion is successful.

Closes #39077

…40519)

Currently, if the Manifest write is unsuccessful (i.e. WriteStateException is thrown), we perform cleanup of newly created metadata files. However, this is wrong. Consider the following sequence (caught by CI here #39077):

- cluster global data is written **successfully**
- the associated manifest write **fails** (during the fsync, i.e. files have been written)
- deleting (reverting) the manifest files **fails**, metadata is therefore persisted
- deleting (reverting) the cluster global data is **successful**

In this case, when trying to load metadata (after node restart because of dirty WriteStateException), the following exception will happen:

```
java.io.IOException: failed to find global metadata [generation: 0]
```

because the manifest file is referencing a missing global metadata file.

This commit checks if the thrown WriteStateException is dirty and, if it is, we don't perform any cleanup, because a new Manifest file might have been created while its deletion has failed. In the future, we might add a more fine-grained check: perform the cleanup if WriteStateException is dirty but Manifest deletion is successful.

Closes #39077

(cherry picked from commit 1fac569)

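The failure sequence described in the commit message can be replayed with a small in-memory model. This is only a sketch under stated assumptions, not the Elasticsearch code: `DirtyWriteSketch`, `FakeDisk`, `WriteFailure`, and the `global-N` / `manifest-N` file names are all hypothetical stand-ins, and the manifest is assumed to reference the global metadata file of the same generation.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class DirtyWriteSketch {

    /** Hypothetical stand-in for WriteStateException; "dirty" means files may have been left on disk. */
    static class WriteFailure extends IOException {
        final boolean dirty;

        WriteFailure(String message, boolean dirty) {
            super(message);
            this.dirty = dirty;
        }
    }

    /** In-memory stand-in for the node's state directory. */
    static final class FakeDisk {
        final Set<String> files = new HashSet<>();

        void writeGlobal(long generation) {
            files.add("global-" + generation);
        }

        /** Simulates a manifest write that fails during fsync and whose revert (delete) also fails. */
        void writeManifest(long generation) throws WriteFailure {
            files.add("manifest-" + generation); // the file has already reached the disk
            throw new WriteFailure("fsync failed and the manifest could not be deleted", true);
        }

        /** Loads state as a restarted node would: a persisted manifest requires its global file. */
        String load(long generation) throws IOException {
            if (files.contains("manifest-" + generation) && !files.contains("global-" + generation)) {
                throw new IOException("failed to find global metadata [generation: " + generation + "]");
            }
            return "loaded";
        }
    }

    /** Replays the sequence; returns "loaded" or the load error message. */
    static String simulate(boolean skipCleanupWhenDirty) {
        FakeDisk disk = new FakeDisk();
        long generation = 0;
        disk.writeGlobal(generation);
        try {
            disk.writeManifest(generation);
        } catch (WriteFailure e) {
            // Old behaviour: always clean up. Sketched fix: keep the files if the failure is dirty.
            if (!(skipCleanupWhenDirty && e.dirty)) {
                disk.files.remove("global-" + generation);
            }
        }
        try {
            return disk.load(generation);
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println("cleanup always:        " + simulate(false));
        System.out.println("skip cleanup if dirty: " + simulate(true));
    }
}
```

With unconditional cleanup the model is left with a manifest that points at a deleted global file, which is the state the test ran into. Skipping cleanup on a dirty failure keeps both files, so the load succeeds.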

albertzaharovits added the >test-failure and :Distributed Coordination/Allocation labels Feb 18, 2019

albertzaharovits self-assigned this Feb 18, 2019

This was referenced Feb 18, 2019

Mute GatewayMetaStateTests.testAtomicityWithFailures #39079

Merged

Fix libs:ssl-config project setup #39074

Merged

albertzaharovits mentioned this issue Feb 19, 2019

Fix test GatewayMetaStateTests.testAtomicityWithFailures #39117

Closed

albertzaharovits added >test and removed >test-failure labels Feb 21, 2019

albertzaharovits removed their assignment Feb 21, 2019

andrershov mentioned this issue Mar 27, 2019

Do not perform cleanup if Manifest write fails with dirty exception #40519

Merged

andrershov closed this as completed in #40519 Apr 1, 2019
