Add failpoint for nospace on puts #16018
base: main
tests/robustness/failpoints.go
Outdated
member := clus.Procs[rand.Int()%len(clus.Procs)]
for member.IsRunning() {
	lg.Info("Setting up gofailpoint", zap.String("failpoint", f.Name()))
	err := member.Failpoints().Setup(ctx, f.Name(), "return")
Over time, this might render the cluster entirely unusable. What's the best way to turn this off for a given member? After x minutes? Only if there's still quorum?
A robustness test iteration (create cluster, run traffic, inject failpoint, delete cluster) usually takes about 5s, and a minute at most.
Seems I was out of the loop for too long. Are the longer nightly linearizability tests not a thing anymore?
They are still a thing, but they run 100 iterations of about 5s each.
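The "only if there's still quorum?" idea from this thread could be sketched roughly as below. This is a minimal illustration, not code from the PR: `quorum` and `canInject` are hypothetical helpers, and whether quorum is even the right criterion for a no-space failpoint is exactly what the thread leaves open.

```go
package main

import "fmt"

// quorum returns the minimum number of members that must stay healthy
// for a cluster of the given size to keep a majority.
func quorum(clusterSize int) int {
	return clusterSize/2 + 1
}

// canInject reports whether one more member can be put into the failing
// state without the remaining healthy members dropping below quorum.
// Hypothetical helper, not part of the etcd test framework.
func canInject(clusterSize, alreadyFailing int) bool {
	healthyAfter := clusterSize - alreadyFailing - 1
	return healthyAfter >= quorum(clusterSize)
}

func main() {
	fmt.Println(canInject(3, 0)) // one failing member out of three keeps quorum
	fmt.Println(canInject(3, 1)) // a second failing member would lose it
}
```

With iterations that last seconds rather than hours, such a guard may be unnecessary, which is the point made in the replies above.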
tests/robustness/failpoints.go
Outdated
if err != nil {
	panic(err)
}
if v.LessThan(version.V3_6) {
Is this sensible? I believe this should be easy to backport to 3.5 and 3.4, however.
This would not be needed if you used the code from goPanicFailpoint that checks the list of failpoints exposed by gofail.
added and reused, thanks
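To illustrate why a listing check can replace the version check: if the binary reports which failpoints it was built with, the test can skip members that lack the new one, regardless of version. The sketch below assumes the listing arrives as text with one `name=value` entry per line. `parseFailpointList`, `isAvailable`, and the sample names are hypothetical; this is not the goPanicFailpoint code itself.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseFailpointList turns a "name=value" listing, one entry per line,
// into a set of failpoint names. The listing format is an assumption.
func parseFailpointList(listing string) map[string]struct{} {
	names := make(map[string]struct{})
	sc := bufio.NewScanner(strings.NewReader(listing))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		// Everything before the first "=" is the failpoint name.
		name, _, _ := strings.Cut(line, "=")
		names[name] = struct{}{}
	}
	return names
}

// isAvailable reports whether the named failpoint appears in the listing.
func isAvailable(listing, failpoint string) bool {
	_, ok := parseFailpointList(listing)[failpoint]
	return ok
}

func main() {
	// Sample listing with made-up contents for illustration only.
	listing := "sampleFailpointA=\nsampleFailpointB=return\n"
	fmt.Println(isAvailable(listing, "sampleFailpointB"))
	fmt.Println(isAvailable(listing, "hypotheticalNoSpace"))
}
```

A check like this also keeps working after a backport to 3.5 or 3.4, since it depends on what the binary exposes rather than on its version number.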
tests/robustness/failpoints.go
Outdated
lg.Info("Setting up gofailpoint", zap.String("failpoint", f.Name()))
err := member.Failpoints().Setup(ctx, f.Name(), "return")
if err != nil {
	lg.Info("goFailpoint setup failed", zap.String("failpoint", f.Name()), zap.Error(err))
We need to assert that the failpoint was executed at least once. Please follow #14729 on how to do that.
done, it's much easier than sleep testing
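One way to picture an "executed at least once" assertion, without claiming this is what #14729 prescribes: read an execution count before the traffic runs, read it again afterwards, and fail if it did not increase. Everything named below (`executionCounter`, `fakeCounter`, `assertTriggered`, `hypotheticalNoSpace`) is made up for the sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// executionCounter is a hypothetical stand-in for whatever reports how
// often a failpoint has fired on a member.
type executionCounter interface {
	Count(failpoint string) (int, error)
}

// fakeCounter is an in-memory implementation used only for this sketch.
type fakeCounter map[string]int

func (f fakeCounter) Count(name string) (int, error) {
	n, ok := f[name]
	if !ok {
		return 0, errors.New("unknown failpoint: " + name)
	}
	return n, nil
}

// assertTriggered returns an error unless the count moved past `before`.
func assertTriggered(c executionCounter, name string, before int) error {
	after, err := c.Count(name)
	if err != nil {
		return err
	}
	if after <= before {
		return fmt.Errorf("failpoint %q was not executed (count stayed at %d)", name, after)
	}
	return nil
}

func main() {
	c := fakeCounter{"hypotheticalNoSpace": 0}
	before, _ := c.Count("hypotheticalNoSpace")
	c["hypotheticalNoSpace"]++ // simulate traffic hitting the failpoint
	fmt.Println(assertTriggered(c, "hypotheticalNoSpace", before))
}
```

Compared with sleeping and hoping the failpoint fired, a counter comparison gives a deterministic pass or fail signal.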
Please give me more context on why you want to introduce this failpoint. I want to make sure that we not only code it, but also address the overall issues with no-space alarms.
I was briefly seeing a panic after playing around with the authbackend implementation paths, which led me to run the linearizability test for longer while dropping out a node at random with this alarm. I'll add a more specific test once I've figured out how to properly reproduce this. (Setting to draft in the meantime.)
Adding a new flag to retain e2e etcd process logs after stop and saving next to the visualized model. Spun out of etcd-io#16018 where I used it for easier local debugging on model violations. Signed-off-by: Thomas Jungblut <[email protected]>
Adding a new flag to retain e2e etcd process logs after stop and saving next to the visualized model. Spun out of etcd-io#16018 where I used it for easier local debugging on model violations. Fixes etcd-io#15079 partially. Signed-off-by: Thomas Jungblut <[email protected]>
Adding a set of functions which retain e2e etcd process logs after stop and saving next to the visualized model during robustness tests. Spun out of etcd-io#16018 where I used it for easier local debugging on model violations. Fixes etcd-io#15079 partially. Signed-off-by: Thomas Jungblut <[email protected]>
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
Discussed during sig-etcd triage meeting. @tjungblu do you have capacity to resolve conflicts and finish this off?
Sure, do you guys want to keep the failpoint? I can remove the remainder.
a030901 to d23cc9a
Rebased, updated, and removed the remainder of the unrelated changes.
This CR introduces a new failpoint that will trigger a member to report no space. Signed-off-by: Thomas Jungblut <[email protected]>
d23cc9a to 2213f06
@tjungblu: The following tests failed, say
This CR introduces a new failpoint that will trigger a member to report no space.
This might need some more legwork; it's the first time I'm adding a failpoint here.