Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix test GatewayMetaStateTests.testAtomicityWithFailures #39117

Conversation

albertzaharovits
Copy link
Contributor

The test tries to ascertain atomic writes of the manifest, cluster global data,
and index metadata, in the presence of write failures.
We don't however guarantee atomicity no matter what. Specifically rollbacks, which delete extraneous files generated during the transaction, can also fail and swallow exceptions. A failed write corroborated with a failed delete could thrash the metadata consistency.

For example: the test in #39077 , fails because:

  • cluster global data is written successful
  • the associated manifest write fails (during the fsync, ie files have been written)
  • deleting (revert) the manifest files, fails, metadata is therefore persisted
  • deleting (revert) the cluster global data is successful

The test then detects an inconsistency when it tries to read the global data pointed to in the
manifest, which has been reverted, and fails.

This fixes the test such that it does not disrupt delete operations (all deletes are successful). I have run the test multiple times so I am fairly confident atomicity is guaranteed in this case, hence the test is correctly defined.

Closes #39077

@albertzaharovits albertzaharovits added >test-failure Triaged test failures from CI v7.0.0 :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. v8.0.0 v7.2.0 labels Feb 19, 2019
@albertzaharovits albertzaharovits self-assigned this Feb 19, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@andrershov
Copy link
Contributor

@albertzaharovits thank you for your PR!

We indeed want to guarantee atomicity no matter what. And the reason why the test fails is the bug in the algorithm, delete failure can happen in real life.
In the real-life, we will get dirty WriteStateException, which will lead to node restart. And we expect to read either new state or old state and rewrite it during the node startup. See GatewayMetaState.upgradeMetaData.
In this method, we're calling loadFullState and we expect it to return either the old state or the new state, but we don't expect it to throw IOException.
The idea of removing files in case of unsuccessful write was added as an after-thought and complicates reasoning about the algorithm and this is already the second bug that we catch (the first one is #37542).
I think we need to close the PR, leave the test muted and I will work on changes to the algorithm and TLA+ spec for it to prove its correctness.

@andrershov andrershov closed this Feb 20, 2019
@albertzaharovits
Copy link
Contributor Author

Understood, thanks for the detailed explanation, @andrershov !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. >test-failure Triaged test failures from CI v7.0.0-rc2 v7.2.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CI] GatewayMetaStateTests.testAtomicityWithFailures
4 participants