
Do not perform cleanup if Manifest write fails with dirty exception #40519

Merged: 2 commits into elastic:master on Apr 1, 2019

Conversation

andrershov (Contributor):

Currently, if the Manifest write is unsuccessful (i.e. a WriteStateException is thrown), we perform cleanup of the newly created metadata files. However, this is wrong.
Consider the following sequence (caught by CI here: #39077):

  • the cluster global data is written successfully
  • the associated manifest write fails (during the fsync, i.e. the files have been written)
  • deleting (reverting) the manifest files fails, so the manifest is persisted
  • deleting (reverting) the cluster global data succeeds

In this case, when trying to load the metadata (after a node restart caused by the dirty WriteStateException), the following exception occurs:

```
java.io.IOException: failed to find global metadata [generation: 0]
```

because the manifest file references a missing global metadata file.

This commit checks whether the thrown WriteStateException is dirty; if it is, we do not perform any cleanup, because a new Manifest file might have been created while its deletion failed.
In the future, we might add a more fine-grained check: perform the cleanup if the WriteStateException is dirty but the Manifest deletion was successful.
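
For illustration, here is a minimal, self-contained sketch of the decision this commit introduces, assuming a simplified, hypothetical Store abstraction. The real Elasticsearch classes (GatewayMetaState, MetaStateService, the actual WriteStateException) are richer; only the dirty-flag handling is modeled here:

```java
import java.io.IOException;

// Minimal sketch, not the actual Elasticsearch source. The Store interface
// and method names are hypothetical stand-ins for the two-phase metadata write.
public class DirtyWriteSketch {

    /** Stand-in for WriteStateException: "dirty" means the on-disk state is unknown. */
    static class WriteStateException extends IOException {
        private final boolean dirty;

        WriteStateException(boolean dirty, String message) {
            super(message);
            this.dirty = dirty;
        }

        boolean isDirty() {
            return dirty;
        }
    }

    /** Hypothetical two-phase metadata store. */
    interface Store {
        void writeGlobalMetadata() throws WriteStateException; // phase 1
        void writeManifest() throws WriteStateException;       // phase 2, the commit point
        void rollback();                                       // deletes files created above
    }

    static void writeState(Store store) throws WriteStateException {
        store.writeGlobalMetadata();
        try {
            store.writeManifest();
        } catch (WriteStateException e) {
            // Non-dirty failure: the manifest never reached disk, so reverting
            // the freshly written metadata files is safe.
            // Dirty failure (e.g. during fsync): the new manifest may already
            // exist and its deletion may fail, so deleting the metadata files
            // could leave a manifest pointing at missing global metadata.
            if (e.isDirty() == false) {
                store.rollback();
            }
            throw e;
        }
    }
}
```

The trade-off is deliberate: leaving extra metadata files behind is harmless, while deleting files that a surviving manifest references is not.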

Closes #39077

andrershov added the >bug, :Distributed Coordination/Cluster Coordination, v7.0.0, v7.2.0, and v8.0.0 labels on Mar 27, 2019
elasticmachine (Collaborator):

Pinging @elastic/es-distributed

ywelsch (Contributor) left a comment:
Also unmute testAtomicityWithFailures?
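
For context, a hypothetical sketch of what unmuting means here: Elasticsearch marks known-failing tests with LuceneTestCase's @AwaitsFix annotation pointing at the tracking issue, and unmuting is removing that annotation. The class and test body below are placeholders, with the annotation shown only in comments:

```java
// Hypothetical illustration only; the real test lives in GatewayMetaStateTests
// and its body is elided.
public class GatewayMetaStateTests {

    // Muted form, removed when unmuting:
    // @AwaitsFix(bugUrl = "https://github.com/elastic/elasticsearch/issues/39077")
    public void testAtomicityWithFailures() {
        // randomly injects write failures and verifies the persisted state
        // remains loadable afterwards
    }
}
```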

albertzaharovits (Contributor) left a comment:

LGTM

andrershov merged commit 1fac569 into elastic:master on Apr 1, 2019
andrershov pushed two commits that referenced this pull request on Apr 1, 2019, each cherry picked from commit 1fac569:
Do not perform cleanup if Manifest write fails with dirty exception (#40519)
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Apr 1, 2019
* elastic/7.0:
  [TEST] Mute WebhookHttpsIntegrationTests.testHttps
  [DOCS] Add 'time value' links to several monitor settings (elastic#40633) (elastic#40687)
  Do not perform cleanup if Manifest write fails with dirty exception (elastic#40519)
  Remove mention of soft deletes from getting started (elastic#40668)
  Fix bug in detecting use of bundled JDK on macOS
  Reindex conflicts clarification (docs) (elastic#40442)
  SQL: [Tests] Enable integration tests for fixed issues (elastic#40664)
  Add information about the default sort mode (elastic#40657)
  SQL: [Docs] Fix example for CURDATE
  SQL: [Docs] Fix doc errors regarding CURRENT_DATE. (elastic#40649)
  Clarify using time_zone and date math in range query (elastic#40655)
  Add notice for bundled jdk (elastic#40576)
  disable kerberos test until kerberos fixture is working again
  [DOCS] Use "source" instead of "inline" in ML docs (elastic#40635)
  Unmute and fix testSubParserArray (elastic#40626)
  Geo Point parse error fix (elastic#40447)
  Increase suite timeout to 30 minutes for docs tests (elastic#40521)
  Fix repository-hdfs when no docker and unnecesary fixture
  Avoid building hdfs-fixure use an image that works instead
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this pull request on May 27, 2019:
Do not perform cleanup if Manifest write fails with dirty exception (elastic#40519)
Labels
>bug, :Distributed Coordination/Cluster Coordination, v7.0.0-rc2, v7.2.0, v8.0.0-alpha1
Development

Successfully merging this pull request may close these issues.

[CI] GatewayMetaStateTests.testAtomicityWithFailures