-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Zen2] Add storage-layer disruptions to CoordinatorTests #34347
[Zen2] Add storage-layer disruptions to CoordinatorTests #34347
Conversation
The hack to work around lag detection had some issues: - it always called runFor(), even if no lag was detected - it looked at the last-accepted state not the last-applied state, so missed some lag situations. This fixes these issues.
Today we inject the initial configuration of the cluster (i.e. the set of voting nodes) at startup. In reality we must support injecting the initial configuration after startup too. This commit adds low-level support for doing so as safely as possible.
Today we assume the storage layer operates perfectly in CoordinatorTests, which means we are not testing that the system's invariants are preserved if the storage layer fails for some reason. This change injects (rare) storage-layer failures during the safety phase to cover these cases.
Pinging @elastic/es-distributed |
Opening this so it doesn't get lost. The pertinent commit is d6d1ee4, but I left it based on top of other recent PRs to avoid conflicts after they are merged. No review expected yet, I'll follow up when dependent PRs are merged. |
This is ready for review now. |
@elasticmachine test this please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left two smaller comments. Adding this disruption kind to runRandomly makes sense though.
|
||
private void possiblyFail(String description) { | ||
if (disruptStorage && rarely()) { | ||
throw new CoordinationStateRejectedException("simulated IO exception [" + description + ']'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we throw other exception types here? Can you add TODO that we should check whether the same needs to be done for IOException?
@Override | ||
public void setCurrentTerm(long currentTerm) { | ||
possiblyFail("writing term of " + currentTerm); | ||
super.setCurrentTerm(currentTerm); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe inject failure both before or after executing the actual action.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I tried this and it's found a lot of places where we assume an exception means the write was unsuccessful. We can fix these, but I am not sure this is the right thing to do. We plan on finishing each write with a rename-and-fsync-the-directory. A failure before the fsync (including during the rename) means we didn't change state. I'm unsure how we should interpret a failed fsync.
/cc @andrershov re #33958
Today we assume the storage layer operates perfectly in CoordinatorTests, which
means we are not testing that the system's invariants are preserved if the
storage layer fails for some reason. This change injects (rare) storage-layer
failures during the safety phase to cover these cases.