
[Zen2] Add storage-layer disruptions to CoordinatorTests #34347

Merged
merged 28 commits into elastic:zen2 from DaveCTurner:2018-10-08-disrupt-storage on Oct 13, 2018

Conversation

DaveCTurner
Contributor

Today we assume the storage layer operates perfectly in CoordinatorTests, which
means we are not testing that the system's invariants are preserved if the
storage layer fails for some reason. This change injects (rare) storage-layer
failures during the safety phase to cover these cases.
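
As a rough sketch (the class and field names here are assumptions, not the exact test harness code), the failures can be injected by wrapping a node's in-memory persisted state in a delegate whose writes rarely throw while disruption is enabled; the review comments below quote the corresponding fragments from the actual change.

class DisruptableMockPersistedState extends InMemoryPersistedState {
    boolean disruptStorage; // toggled on during the safety phase

    DisruptableMockPersistedState(long currentTerm, ClusterState acceptedState) {
        super(currentTerm, acceptedState);
    }

    private void possiblyFail(String description) {
        if (disruptStorage && rarely()) {
            // simulate the storage layer failing this write
            throw new CoordinationStateRejectedException("simulated IO exception [" + description + ']');
        }
    }

    @Override
    public void setLastAcceptedState(ClusterState clusterState) {
        possiblyFail("writing cluster state version " + clusterState.version());
        super.setLastAcceptedState(clusterState);
    }

    // setCurrentTerm(long) is wrapped in the same way
}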

The hack to work around lag detection had some issues:
- it always called runFor(), even if no lag was detected
- it looked at the last-accepted state not the last-applied state, so missed
  some lag situations.

This fixes these issues.
Today we inject the initial configuration of the cluster (i.e. the set of
voting nodes) at startup. In reality we must support injecting the initial
configuration after startup too. This commit adds low-level support for doing
so as safely as possible.
@DaveCTurner DaveCTurner added >enhancement v7.0.0 :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. labels Oct 8, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@DaveCTurner
Contributor Author

Opening this so it doesn't get lost. The pertinent commit is d6d1ee4, but I've left it based on top of other recent PRs to avoid conflicts once they are merged. No review expected yet; I'll follow up when the dependent PRs are merged.

@DaveCTurner DaveCTurner requested a review from ywelsch October 8, 2018 15:52
@DaveCTurner
Contributor Author

This is ready for review now.

@DaveCTurner
Contributor Author

@elasticmachine test this please

Contributor

@ywelsch ywelsch left a comment


I've left two smaller comments. Adding this disruption kind to runRandomly makes sense though.


private void possiblyFail(String description) {
    if (disruptStorage && rarely()) {
        throw new CoordinationStateRejectedException("simulated IO exception [" + description + ']');
    }
}
Contributor


Can we throw other exception types here? Can you add a TODO that we should check whether the same needs to be done for IOException?
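
For illustration, a hedged variation of possiblyFail() along these lines (using randomBoolean() from the test framework and java.io.UncheckedIOException; an assumption, not necessarily what was committed) could also simulate an unchecked I/O failure:

private void possiblyFail(String description) {
    if (disruptStorage && rarely()) {
        // sometimes simulate an I/O failure rather than an explicit rejection
        if (randomBoolean()) {
            throw new UncheckedIOException(new IOException("simulated IO exception [" + description + ']'));
        } else {
            throw new CoordinationStateRejectedException("simulated IO exception [" + description + ']');
        }
    }
}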

@Override
public void setCurrentTerm(long currentTerm) {
    possiblyFail("writing term of " + currentTerm);
    super.setCurrentTerm(currentTerm);
}
Contributor


Maybe inject the failure both before and after executing the actual action.
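
A sketch of what that suggestion would look like (illustrative only): inject a possible failure on both sides of the delegated write, so a thrown exception no longer implies the write did not take effect.

@Override
public void setCurrentTerm(long currentTerm) {
    possiblyFail("before writing term of " + currentTerm);
    super.setCurrentTerm(currentTerm);
    possiblyFail("after writing term of " + currentTerm);
}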

Contributor Author


Hmm, I tried this and it found a lot of places where we assume an exception means the write was unsuccessful. We can fix these, but I'm not sure this is the right thing to do. We plan on finishing each write with a rename-and-fsync-the-directory step. A failure before the fsync (including during the rename) means we didn't change state; I'm unsure how we should interpret a failed fsync.

/cc @andrershov re #33958
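
For reference, a rough sketch of the write protocol described above (method and file names are illustrative, assuming NIO file APIs and Lucene's IOUtils.fsync): write to a temporary file, atomically rename it into place, then fsync the directory. A failure before the directory fsync leaves the previous state visible; a failure of the fsync itself leaves the outcome ambiguous, which is the case under discussion.

void writeState(Path dir, byte[] stateBytes) throws IOException {
    final Path tmpFile = dir.resolve("state.tmp");
    final Path targetFile = dir.resolve("state.bin");
    Files.write(tmpFile, stateBytes);                                // may fail: old state still in place
    Files.move(tmpFile, targetFile, StandardCopyOption.ATOMIC_MOVE); // may fail: old state still in place
    IOUtils.fsync(dir, true);                                        // may fail: durability of the rename is unknown
}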

@DaveCTurner DaveCTurner merged commit 8b9fa55 into elastic:zen2 Oct 13, 2018
@DaveCTurner DaveCTurner deleted the 2018-10-08-disrupt-storage branch October 13, 2018 13:24
@jimczi jimczi added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019