Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvprober: rate limit the planner #61275

Merged
merged 3 commits into from
Mar 4, 2021

Conversation

joshimhoff
Copy link
Collaborator

@joshimhoff joshimhoff commented Mar 1, 2021

#61255

kvprober: rate limit the planner

This commit introduces a planning rate limit. This protects CRDB from
planning executing too often, due to either issues with CRDB (e.g.
meta2 unavailability) or bugs in kvprober. When planning does execute,
kvprober scans kv.prober.planner.num_steps_to_plan_at_once rows worth
of meta2 and unmarshalls the resulting range descriptors.

Release justification: Auxiliary system that is off by default.
Release note: None.

@joshimhoff joshimhoff requested review from knz, jreut and tbg March 1, 2021 18:34
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@joshimhoff joshimhoff force-pushed the kv_prober_meta2_protect branch 5 times, most recently from 73d136e to c6430b4 Compare March 2, 2021 00:51
@tbg tbg changed the title [small] kvprober: rate limit the planner via a cluster setting kvprober: rate limit the planner via a cluster setting Mar 2, 2021
Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Random note: I missed earlier that we are using time.Ticker here, I think you want to use timeutil.NewTimer and after reading from t.C you want to do t.Read = true. Shouldn't change anything, but it's cleaner.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @joshimhoff, @jreut, and @knz)


pkg/kv/kvprober/planner.go, line 91 at r1 (raw file):

.probe

pkg/kv/kvprober/settings.go, line 78 at r1 (raw file):

	})

var plannerRateLimit = settings.RegisterDurationSetting(

Hmm. I know I was on board with this approach, but now that I see it it seems less simple than I anticipated. For one, with the cluster setting, we now have to wonder what the behavior is when you set it to a value smaller than the probing interval (I think it will just bounce a few probes every time it needs to do planning and pollute the metrics, so not too bad?)

I think the right thing here could be to avoid the cluster setting and to "hard"-code it as 0.5 times the probing interval times the prefetch size.
So if you're fetching up to 100 descs during planning, and you want to run a step every 2s, if you ever hit an error while fetching descs you would back off 0.5 * 100 * 2s = 100s. Or, in other words, the load on meta2 that results from a failure is at most 2x the load it would get if probing were to occur normally. As a result the meta ranges are protected and we don't have to introduce another knob that could be tuned incorrectly. We could even just drop the random factor 0.5 that I threw in there.

@joshimhoff joshimhoff requested a review from tbg March 2, 2021 14:15
Copy link
Collaborator Author

@joshimhoff joshimhoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @joshimhoff, @jreut, @knz, and @tbg)


pkg/kv/kvprober/settings.go, line 78 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Hmm. I know I was on board with this approach, but now that I see it it seems less simple than I anticipated. For one, with the cluster setting, we now have to wonder what the behavior is when you set it to a value smaller than the probing interval (I think it will just bounce a few probes every time it needs to do planning and pollute the metrics, so not too bad?)

I think the right thing here could be to avoid the cluster setting and to "hard"-code it as 0.5 times the probing interval times the prefetch size.
So if you're fetching up to 100 descs during planning, and you want to run a step every 2s, if you ever hit an error while fetching descs you would back off 0.5 * 100 * 2s = 100s. Or, in other words, the load on meta2 that results from a failure is at most 2x the load it would get if probing were to occur normally. As a result the meta ranges are protected and we don't have to introduce another knob that could be tuned incorrectly. We could even just drop the random factor 0.5 that I threw in there.

I think I like that better too. Thanks for the idea. Gonna try it out.

@joshimhoff joshimhoff changed the title kvprober: rate limit the planner via a cluster setting kvprober: rate limit the planner Mar 2, 2021
@joshimhoff joshimhoff force-pushed the kv_prober_meta2_protect branch from c6430b4 to c293123 Compare March 2, 2021 15:12
Copy link
Collaborator Author

@joshimhoff joshimhoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PTAL!

I'm gonna add moving to timeutil.NewTicket to my list of maintenance work to get done post 21.1 branch being cut. It's not totally clear to me what you mean re: t.Read and I want to get one other change into 21.1 ideally (add structured logging) and also do some e2e manual testing.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jreut, @knz, and @tbg)


pkg/kv/kvprober/planner.go, line 91 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…
.probe

Done.


pkg/kv/kvprober/settings.go, line 78 at r1 (raw file):

Previously, joshimhoff (Josh Imhoff) wrote…

I think I like that better too. Thanks for the idea. Gonna try it out.

Done. I like it.

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I added two commits to your branch. You should be able to merge at will.

bors d+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jreut and @knz)

@craig
Copy link
Contributor

craig bot commented Mar 3, 2021

✌️ joshimhoff can now approve this pull request. To approve and merge a pull request, simply reply with bors r+. More detailed instructions are available here.

@joshimhoff
Copy link
Collaborator Author

TTTF and two commits, Tobias!

@joshimhoff joshimhoff force-pushed the kv_prober_meta2_protect branch 3 times, most recently from 2140436 to 8c35466 Compare March 4, 2021 15:54
joshimhoff and others added 3 commits March 4, 2021 12:01
This commit introduces a planning rate limit. This protects CRDB from
planning executing too often, due to either issues with CRDB (e.g.
meta2 unavailability) or bugs in kvprober. When planning does execute,
kvprober scans kv.prober.planner.num_steps_to_plan_at_once rows worth
of meta2 and unmarshalls the resulting range descriptors.

Release justification: Auxiliary system that is off by default.
Release note: None.
Release justification: bug fixes and low-risk updates to new functionality
Release note: None
A nonzero limit has the potential to cause issues, albeit theoretical
ones, if the clock doesn't advance even 1ns between two operations.
This is mostly hypothetical and with zero we still have issues if the
clock jumps backwards, but still.

Release justification: non-production code changes
Release note: None
@joshimhoff joshimhoff force-pushed the kv_prober_meta2_protect branch from 8c35466 to 56c54a4 Compare March 4, 2021 17:01
@joshimhoff
Copy link
Collaborator Author

bors r+

@craig
Copy link
Contributor

craig bot commented Mar 4, 2021

Build succeeded:

@craig craig bot merged commit c0d8340 into cockroachdb:master Mar 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants