-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kvprober: rate limit the planner #61275
Conversation
73d136e
to
c6430b4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Random note: I missed earlier that we are using time.Ticker
here, I think you want to use timeutil.NewTimer
and after reading from t.C
you want to do t.Read = true
. Shouldn't change anything, but it's cleaner.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @joshimhoff, @jreut, and @knz)
pkg/kv/kvprober/planner.go, line 91 at r1 (raw file):
.probe
pkg/kv/kvprober/settings.go, line 78 at r1 (raw file):
}) var plannerRateLimit = settings.RegisterDurationSetting(
Hmm. I know I was on board with this approach, but now that I see it it seems less simple than I anticipated. For one, with the cluster setting, we now have to wonder what the behavior is when you set it to a value smaller than the probing interval (I think it will just bounce a few probes every time it needs to do planning and pollute the metrics, so not too bad?)
I think the right thing here could be to avoid the cluster setting and to "hard"-code it as 0.5 times the probing interval times the prefetch size.
So if you're fetching up to 100 descs during planning, and you want to run a step every 2s, if you ever hit an error while fetching descs you would back off 0.5 * 100 * 2s = 100s. Or, in other words, the load on meta2 that results from a failure is at most 2x the load it would get if probing were to occur normally. As a result the meta ranges are protected and we don't have to introduce another knob that could be tuned incorrectly. We could even just drop the random factor 0.5
that I threw in there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @joshimhoff, @jreut, @knz, and @tbg)
pkg/kv/kvprober/settings.go, line 78 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Hmm. I know I was on board with this approach, but now that I see it it seems less simple than I anticipated. For one, with the cluster setting, we now have to wonder what the behavior is when you set it to a value smaller than the probing interval (I think it will just bounce a few probes every time it needs to do planning and pollute the metrics, so not too bad?)
I think the right thing here could be to avoid the cluster setting and to "hard"-code it as 0.5 times the probing interval times the prefetch size.
So if you're fetching up to 100 descs during planning, and you want to run a step every 2s, if you ever hit an error while fetching descs you would back off 0.5 * 100 * 2s = 100s. Or, in other words, the load on meta2 that results from a failure is at most 2x the load it would get if probing were to occur normally. As a result the meta ranges are protected and we don't have to introduce another knob that could be tuned incorrectly. We could even just drop the random factor0.5
that I threw in there.
I think I like that better too. Thanks for the idea. Gonna try it out.
c6430b4
to
c293123
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PTAL!
I'm gonna add moving to timeutil.NewTicket
to my list of maintenance work to get done post 21.1 branch being cut. It's not totally clear to me what you mean re: t.Read
and I want to get one other change into 21.1 ideally (add structured logging) and also do some e2e manual testing.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jreut, @knz, and @tbg)
pkg/kv/kvprober/planner.go, line 91 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
.probe
Done.
pkg/kv/kvprober/settings.go, line 78 at r1 (raw file):
Previously, joshimhoff (Josh Imhoff) wrote…
I think I like that better too. Thanks for the idea. Gonna try it out.
Done. I like it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! I added two commits to your branch. You should be able to merge at will.
bors d+
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jreut and @knz)
✌️ joshimhoff can now approve this pull request. To approve and merge a pull request, simply reply with |
TTTF and two commits, Tobias! |
2140436
to
8c35466
Compare
This commit introduces a planning rate limit. This protects CRDB from planning executing too often, due to either issues with CRDB (e.g. meta2 unavailability) or bugs in kvprober. When planning does execute, kvprober scans kv.prober.planner.num_steps_to_plan_at_once rows worth of meta2 and unmarshalls the resulting range descriptors. Release justification: Auxiliary system that is off by default. Release note: None.
Release justification: bug fixes and low-risk updates to new functionality Release note: None
A nonzero limit has the potential to cause issues, albeit theoretical ones, if the clock doesn't advance even 1ns between two operations. This is mostly hypothetical and with zero we still have issues if the clock jumps backwards, but still. Release justification: non-production code changes Release note: None
8c35466
to
56c54a4
Compare
bors r+ |
Build succeeded: |
#61255
kvprober: rate limit the planner
This commit introduces a planning rate limit. This protects CRDB from
planning executing too often, due to either issues with CRDB (e.g.
meta2 unavailability) or bugs in kvprober. When planning does execute,
kvprober scans kv.prober.planner.num_steps_to_plan_at_once rows worth
of meta2 and unmarshalls the resulting range descriptors.
Release justification: Auxiliary system that is off by default.
Release note: None.