Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] GatewayMetaStateTests.testAtomicityWithFailures #39077

Closed
albertzaharovits opened this issue Feb 18, 2019 · 2 comments · Fixed by #40519
Closed

[CI] GatewayMetaStateTests.testAtomicityWithFailures #39077

albertzaharovits opened this issue Feb 18, 2019 · 2 comments · Fixed by #40519
Labels
:Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. >test Issues or PRs that are addressing/adding tests

Comments

@albertzaharovits
Copy link
Contributor

The following has failed in a PR build:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request-1/8533/console

REPRODUCE WITH: ./gradlew :server:unitTest -Dtests.seed=B2CA639652A398C6 -Dtests.class=org.elasticsearch.gateway.GatewayMetaStateTests -Dtests.method="testAtomicityWithFailures" -Dtests.security.manager=true -Dtests.locale=nn-NO -Dtests.timezone=Asia/Jakarta -Dcompiler.java=11 -Druntime.java=8
ERROR   0.30s | GatewayMetaStateTests.testAtomicityWithFailures <<< FAILURES!
   > Throwable #1: java.io.IOException: failed to find global metadata [generation: 0]
   >    at __randomizedtesting.SeedInfo.seed([B2CA639652A398C6:8F4B0E643D33025F]:0)
   >    at org.elasticsearch.gateway.MetaStateService.loadFullState(MetaStateService.java:85)
   >    at org.elasticsearch.gateway.GatewayMetaStateTests.testAtomicityWithFailures(GatewayMetaStateTests.java:427)

It reproduces!

I will follow up with the mute, and I will take a swing at it!

@albertzaharovits albertzaharovits added >test-failure Triaged test failures from CI :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) labels Feb 18, 2019
@albertzaharovits albertzaharovits self-assigned this Feb 18, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@albertzaharovits albertzaharovits added :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. and removed :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) labels Feb 19, 2019
@albertzaharovits
Copy link
Contributor Author

@andrershov I am going to relabel this and unassign myself.
Judging by your comment in #39117 (comment) the test is better to be left disabled rather than changed. It might change when the code in main also changes.
I don't think the >test-failure label is suitable for these types of "known issues". For a lack of a better one, I have added the >test label.

@albertzaharovits albertzaharovits added >test Issues or PRs that are addressing/adding tests and removed >test-failure Triaged test failures from CI labels Feb 21, 2019
@albertzaharovits albertzaharovits removed their assignment Feb 21, 2019
andrershov added a commit that referenced this issue Apr 1, 2019
…40519)

Currently, if Manifest write is unsuccessful (i.e. WriteStateException
is thrown) we perform cleanup of newly created metadata files.
However, this is wrong. 
Consider the following sequence (caught by CI here
#39077):

- cluster global data is written **successful**
- the associated manifest write **fails** (during the fsync, ie files
have been written)
- deleting (revert) the manifest files, **fails**, metadata is
therefore persisted
- deleting (revert) the cluster global data is **successful**

In this case, when trying to load metadata (after node restart
because of dirty WriteStateException),  the following exception will
happen
```
java.io.IOException: failed to find global metadata [generation: 0]
```
because the manifest file is referencing missing global metadata file.

This commit checks if thrown WriteStateException is dirty and if its
we don't perform any cleanup, because new Manifest file might be
created, but its deletion has failed.
In the future, we might add more fine-grained check - perform the
clean up if WriteStateException is dirty, but Manifest deletion is
successful.

Closes #39077
andrershov pushed a commit that referenced this issue Apr 1, 2019
…40519)

Currently, if Manifest write is unsuccessful (i.e. WriteStateException
is thrown) we perform cleanup of newly created metadata files.
However, this is wrong.
Consider the following sequence (caught by CI here
#39077):

- cluster global data is written **successful**
- the associated manifest write **fails** (during the fsync, ie files
have been written)
- deleting (revert) the manifest files, **fails**, metadata is
therefore persisted
- deleting (revert) the cluster global data is **successful**

In this case, when trying to load metadata (after node restart
because of dirty WriteStateException),  the following exception will
happen
```
java.io.IOException: failed to find global metadata [generation: 0]
```
because the manifest file is referencing missing global metadata file.

This commit checks if thrown WriteStateException is dirty and if its
we don't perform any cleanup, because new Manifest file might be
created, but its deletion has failed.
In the future, we might add more fine-grained check - perform the
clean up if WriteStateException is dirty, but Manifest deletion is
successful.

Closes #39077

(cherry picked from commit 1fac569)
andrershov pushed a commit that referenced this issue Apr 1, 2019
…40519)

Currently, if Manifest write is unsuccessful (i.e. WriteStateException
is thrown) we perform cleanup of newly created metadata files.
However, this is wrong.
Consider the following sequence (caught by CI here
#39077):

- cluster global data is written **successful**
- the associated manifest write **fails** (during the fsync, ie files
have been written)
- deleting (revert) the manifest files, **fails**, metadata is
therefore persisted
- deleting (revert) the cluster global data is **successful**

In this case, when trying to load metadata (after node restart
because of dirty WriteStateException),  the following exception will
happen
```
java.io.IOException: failed to find global metadata [generation: 0]
```
because the manifest file is referencing missing global metadata file.

This commit checks if thrown WriteStateException is dirty and if its
we don't perform any cleanup, because new Manifest file might be
created, but its deletion has failed.
In the future, we might add more fine-grained check - perform the
clean up if WriteStateException is dirty, but Manifest deletion is
successful.

Closes #39077

(cherry picked from commit 1fac569)
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this issue May 27, 2019
…lastic#40519)

Currently, if Manifest write is unsuccessful (i.e. WriteStateException
is thrown) we perform cleanup of newly created metadata files.
However, this is wrong. 
Consider the following sequence (caught by CI here
elastic#39077):

- cluster global data is written **successful**
- the associated manifest write **fails** (during the fsync, ie files
have been written)
- deleting (revert) the manifest files, **fails**, metadata is
therefore persisted
- deleting (revert) the cluster global data is **successful**

In this case, when trying to load metadata (after node restart
because of dirty WriteStateException),  the following exception will
happen
```
java.io.IOException: failed to find global metadata [generation: 0]
```
because the manifest file is referencing missing global metadata file.

This commit checks if thrown WriteStateException is dirty and if its
we don't perform any cleanup, because new Manifest file might be
created, but its deletion has failed.
In the future, we might add more fine-grained check - perform the
clean up if WriteStateException is dirty, but Manifest deletion is
successful.

Closes elastic#39077
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. >test Issues or PRs that are addressing/adding tests
Projects
None yet
2 participants