
[Zen2] Add storage-layer disruptions to CoordinatorTests #34347

Merged
merged 28 commits into elastic:zen2 from DaveCTurner:2018-10-08-disrupt-storage on Oct 13, 2018

Conversation

DaveCTurner
Contributor

Today we assume the storage layer operates perfectly in CoordinatorTests, which
means we are not testing that the system's invariants are preserved if the
storage layer fails for some reason. This change injects (rare) storage-layer
failures during the safety phase to cover these cases.
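
As a rough sketch (the class and field names here are assumptions, not the exact test harness code), the failures can be injected by wrapping a node's in-memory persisted state in a delegate whose writes rarely throw while disruption is enabled; the review comments below quote the corresponding fragments from the actual change.

class DisruptableMockPersistedState extends InMemoryPersistedState {
    boolean disruptStorage; // toggled on during the safety phase

    DisruptableMockPersistedState(long currentTerm, ClusterState acceptedState) {
        super(currentTerm, acceptedState);
    }

    private void possiblyFail(String description) {
        if (disruptStorage && rarely()) {
            // simulate the storage layer failing this write
            throw new CoordinationStateRejectedException("simulated IO exception [" + description + ']');
        }
    }

    @Override
    public void setLastAcceptedState(ClusterState clusterState) {
        possiblyFail("writing cluster state version " + clusterState.version());
        super.setLastAcceptedState(clusterState);
    }

    // setCurrentTerm(long) is wrapped in the same way
}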

The hack to work around lag detection had some issues:
- it always called runFor(), even if no lag was detected
- it looked at the last-accepted state not the last-applied state, so missed
  some lag situations.

This fixes these issues.
Today we inject the initial configuration of the cluster (i.e. the set of
voting nodes) at startup. In reality we must support injecting the initial
configuration after startup too. This commit adds low-level support for doing
so as safely as possible.
@DaveCTurner DaveCTurner added >enhancement v7.0.0 :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. labels Oct 8, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@DaveCTurner
Contributor Author

Opening this so it doesn't get lost. The pertinent commit is d6d1ee4, but I've left it based on top of other recent PRs to avoid conflicts once they are merged. No review expected yet; I'll follow up when the dependent PRs are merged.

@DaveCTurner DaveCTurner requested a review from ywelsch October 8, 2018 15:52
@DaveCTurner
Contributor Author

This is ready for review now.

@DaveCTurner
Contributor Author

@elasticmachine test this please

Contributor

@ywelsch ywelsch left a comment


I've left two smaller comments. Adding this disruption kind to runRandomly makes sense though.


private void possiblyFail(String description) {
    if (disruptStorage && rarely()) {
        throw new CoordinationStateRejectedException("simulated IO exception [" + description + ']');
    }
}
Contributor


Can we throw other exception types here? Can you add a TODO that we should check whether the same needs to be done for IOException?
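
For illustration, a hedged variation of possiblyFail() along these lines (using randomBoolean() from the test framework and java.io.UncheckedIOException; an assumption, not necessarily what was committed) could also simulate an unchecked I/O failure:

private void possiblyFail(String description) {
    if (disruptStorage && rarely()) {
        // sometimes simulate an I/O failure rather than an explicit rejection
        if (randomBoolean()) {
            throw new UncheckedIOException(new IOException("simulated IO exception [" + description + ']'));
        } else {
            throw new CoordinationStateRejectedException("simulated IO exception [" + description + ']');
        }
    }
}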

@Override
public void setCurrentTerm(long currentTerm) {
    possiblyFail("writing term of " + currentTerm);
    super.setCurrentTerm(currentTerm);
}
Contributor


Maybe inject the failure both before and after executing the actual action.
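
A sketch of what that suggestion would look like (illustrative only): inject a possible failure on both sides of the delegated write, so a thrown exception no longer implies the write did not take effect.

@Override
public void setCurrentTerm(long currentTerm) {
    possiblyFail("before writing term of " + currentTerm);
    super.setCurrentTerm(currentTerm);
    possiblyFail("after writing term of " + currentTerm);
}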

Contributor Author


Hmm, I tried this and it found a lot of places where we assume an exception means the write was unsuccessful. We can fix these, but I'm not sure this is the right thing to do. We plan on finishing each write with a rename-and-fsync-the-directory step. A failure before the fsync (including during the rename) means we didn't change state; I'm unsure how we should interpret a failed fsync.

/cc @andrershov re #33958
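
For reference, a rough sketch of the write protocol described above (method and file names are illustrative, assuming NIO file APIs and Lucene's IOUtils.fsync): write to a temporary file, atomically rename it into place, then fsync the directory. A failure before the directory fsync leaves the previous state visible; a failure of the fsync itself leaves the outcome ambiguous, which is the case under discussion.

void writeState(Path dir, byte[] stateBytes) throws IOException {
    final Path tmpFile = dir.resolve("state.tmp");
    final Path targetFile = dir.resolve("state.bin");
    Files.write(tmpFile, stateBytes);                                // may fail: old state still in place
    Files.move(tmpFile, targetFile, StandardCopyOption.ATOMIC_MOVE); // may fail: old state still in place
    IOUtils.fsync(dir, true);                                        // may fail: durability of the rename is unknown
}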

@DaveCTurner DaveCTurner merged commit 8b9fa55 into elastic:zen2 Oct 13, 2018
@DaveCTurner DaveCTurner deleted the 2018-10-08-disrupt-storage branch October 13, 2018 13:24
@jimczi jimczi added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019