metrics for cache size / prune pauses #4401
Comments
Regarding prometheus, I think this should first be added on its own (if we think it is needed), see #1544. This would be the default Go metrics plus figuring out how to configure the listening addresses/auth etc. (and the binary size increase). Then the custom counters can follow. We already have a debug handler address for access to the Go profilers/tracers, as well as expvar (https://pkg.go.dev/expvar) if there are more properties that should be added to it.

Regarding prune, if there is a test case that shows that prune has an unreasonable impact on a running build, then prune can be updated to not do that. This shouldn't be very hard. Prune time itself does not show slowness, as it only matters if something is waiting on it to continue. There is no reason we need to keep a long lock for the whole prune (and we don't for …).
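For the expvar route mentioned above, a minimal sketch (not buildkit's actual wiring; the variable names and listen address are made up) of publishing extra properties on a debug listener. Importing expvar also publishes "memstats" and "cmdline" by default, so basic Go runtime stats come along on the same endpoint:

```go
package main

import (
	"expvar"
	"net/http"
)

// Hypothetical counters for cache/prune state.
var (
	cacheRecords = expvar.NewInt("cache_records")
	prunePauseNs = expvar.NewInt("cache_prune_pause_ns_total")
)

func main() {
	mux := http.NewServeMux()
	// expvar.Handler() serves a JSON dump of every published variable.
	mux.Handle("/debug/vars", expvar.Handler())
	http.ListenAndServe("127.0.0.1:6060", mux)
}
```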
Sure! Here's the test case I wrote:
Prune speed is linear with the number of items, so 100 items leads to a 1.5s pause, 1000 items leads to a 15s pause, etc.

Ya, I think a good place to start would be to add some metrics around how many cache records there are and how many records are being pruned in a batch, so that our monitoring systems can tell us how often this happens and whether it's worth optimizing further. I could certainly make expvar work for metric export, though promhttp is definitely more widespread.
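For concreteness, a rough, self-contained sketch of that kind of instrumentation; cacheRecord, pruneBatch, maxPruneBatch and the expvar names are all placeholders (not buildkit internals), and the ~15ms-per-record sleep only simulates the linear cost described above:

```go
package cachemetrics

import (
	"expvar"
	"time"
)

type cacheRecord struct{ id string }

const maxPruneBatch = 100 // illustrative cap on records pruned per batch

var (
	recordsTotal  = expvar.NewInt("cache_records")
	recordsPruned = expvar.NewInt("cache_records_pruned_total")
	pruneBatches  = expvar.NewInt("cache_prune_batches_total")
	prunePauseNs  = expvar.NewInt("cache_prune_pause_ns_total")
)

// pruneBatch stands in for the real per-record work done while the lock is held.
func pruneBatch(batch []cacheRecord) {
	time.Sleep(time.Duration(len(batch)*15) * time.Millisecond)
}

// pruneAll prunes in bounded batches and bumps the counters around each one,
// so a scraper can see how many records exist and how big the prunes are.
func pruneAll(records []cacheRecord) {
	recordsTotal.Set(int64(len(records)))
	for len(records) > 0 {
		batch := records
		if len(batch) > maxPruneBatch {
			batch = batch[:maxPruneBatch]
		}
		start := time.Now()
		pruneBatch(batch)
		prunePauseNs.Add(time.Since(start).Nanoseconds())
		pruneBatches.Add(1)
		recordsPruned.Add(int64(len(batch)))
		recordsTotal.Add(-int64(len(batch)))
		records = records[len(batch):]
	}
}
```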
So what do you suggest for the max batch size?
Added a limit for this case in #4413. But I don't really have a reproducible case for build stalls, as basic batching already happens. What I did (this is without the #4413 patch):
Prune is a bit slower: the build runs slower, but there are no big pauses. The biggest pause is 1.2 sec.
Just a note:
I think this is covered in the histogram case. Histograms will usually include the raw count and sum, which should match this requirement. I'll make sure of that, but is using those parameters fine for this?
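For reference, a small sketch assuming the prometheus/client_golang library; the metric name and buckets are made up. The point is that every Prometheus histogram automatically exports `<name>_count` and `<name>_sum` series alongside the buckets, which is where the raw count and sum would come from:

```go
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical pause histogram; the exported series would be
// buildkit_cache_prune_pause_seconds_bucket, ..._sum and ..._count.
var prunePause = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "buildkit_cache_prune_pause_seconds",
	Help:    "Time the cache lock was held by a single prune batch.",
	Buckets: prometheus.ExponentialBuckets(0.01, 2, 12), // 10ms .. ~20s
})

func main() {
	prometheus.MustRegister(prunePause)

	// Around a (simulated) prune batch:
	start := time.Now()
	time.Sleep(50 * time.Millisecond) // stand-in for pruning while holding the lock
	prunePause.Observe(time.Since(start).Seconds())

	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe("127.0.0.1:9000", nil)
}
```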
Why
We've seen issues where buildkit will sometimes pause for a long time while the cache manager is doing a prune().
It's fairly straightforward to reproduce this in isolation. A test that creates 1,000 cache entries and then garbage collects them will hold the cache lock for about 15s on my machine. But we don't have good visibility into how frequent these pauses are in a live system.
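As a rough illustration of that reproduction (not the actual test; the cache manager below is a stand-in that simulates ~15ms of work per record, which roughly matches the numbers quoted earlier in the thread):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// manager is a stand-in for a cache manager that prunes under a single lock.
type manager struct {
	mu      sync.Mutex
	records map[int]struct{}
}

// pruneOne simulates the per-record work (DB delete, ref checks, ...).
func (m *manager) pruneOne(id int) {
	time.Sleep(15 * time.Millisecond) // roughly 1,000 records -> ~15s
	delete(m.records, id)
}

func main() {
	m := &manager{records: map[int]struct{}{}}
	const n = 1000
	for i := 0; i < n; i++ {
		m.records[i] = struct{}{}
	}

	start := time.Now()
	m.mu.Lock() // one lock for the whole prune: nothing else can use the cache meanwhile
	for id := range m.records {
		m.pruneOne(id)
	}
	m.mu.Unlock()
	fmt.Printf("pruned %d records, lock held for %s\n", n, time.Since(start))
}
```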
Feature request
Can we add a prometheus handler that exposes some metrics about cache / prune behavior?
A good place to start would be to take some inspiration from the pause metrics that Go itself uses for its GC (PauseTotalNs and PauseNs in https://pkg.go.dev/runtime#MemStats).
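For example, a sketch of what pause accounting modeled on those MemStats fields might look like (the type and field names are illustrative, not an existing buildkit API):

```go
package prunestats

import (
	"sync"
	"time"
)

// PruneStats mirrors the shape of Go's GC pause accounting in runtime.MemStats:
// a cumulative total plus a circular buffer of recent pause durations.
type PruneStats struct {
	mu           sync.Mutex
	PauseTotalNs uint64      // like MemStats.PauseTotalNs, but for prune pauses
	PauseNs      [256]uint64 // like MemStats.PauseNs: most recent pauses
	NumPrune     uint32      // like MemStats.NumGC
}

// TimePrune runs a prune function and records how long it paused the cache.
func (s *PruneStats) TimePrune(prune func()) {
	start := time.Now()
	prune()
	pause := uint64(time.Since(start).Nanoseconds())

	s.mu.Lock()
	defer s.mu.Unlock()
	s.PauseTotalNs += pause
	s.PauseNs[s.NumPrune%uint32(len(s.PauseNs))] = pause
	s.NumPrune++
}
```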