-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
v1.12.0 exposes some high cardinality metrics #967
Comments
Thanks for reporting. It seems like we need to think more about default bucket configuration. |
cc @bwplotka |
Yeah, that's rather extreme for default metrics. Perhaps those could be opt-in at that granularity. I would also like to change that before releasing the final Prometheus v2.33. |
Yeah, the histograms in FWIW I think it would be nice keep the |
Out of curiosity, what kind of problems do you expect? |
If these metrics came with a newly introduced package, it might be fine. Users could do the resource planning for it or decide to drop them on ingestion. However, the problem is that these default Go metrics are in a lot of existing binaries, and all of them will suddenly expose a significantly increased number of metrics after merely bumping the dependencies to the current minor release. This might easily push many users beyond their cardinality budget. |
Not to derail this too hard, but just for my understanding: what does this cardinality budget look like? Memory? Throughput? |
Each unique time series eats ~4KB on average in our production environments. The diff I posted has more than 1k lines. Adding 1k extra metrics adds ~4MB of Prometheus memory per instance for every Go service scraped by Prometheus (plus some memory for exposing those metrics in those services). 4MB might not be much but it adds up quickly. |
Thanks, that's good to know. I meant to say this earlier, but I'll send a PR to reduce the bucket granularity. I'm thinking something like a function that iterates over the buckets and merges buckets within either a factor of 2 or a factor of 10 (depending on the metric unit), making the buckets exponential. Does this sound reasonable? Opt-in to higher granularity can be future work; an exponential distribution is already pretty useful. |
Is 136 too much? |
@prymitive Sure. I can also reduce the impact of this further if there's a way to alias metrics (i.e. just say "here's metric name, but it's really just this other metric, so use that data instead") since some existing metrics are re-exported under new names. I can also drop the new names for just those metrics altogether, but that would make for a somewhat worse user experience. Currently, the names of the metrics are generated from the |
For sure. The current state of things is not great. I didn't anticipate that each histogram bucket would be treated as an individual metric, and I also forgot quite how big I made these histograms. 😅 Going back over the list, it looks like at a bare minimum (without removing "duplicate" metrics -- I'd rather not because the new forms of them are more orthogonal and clearer; the divisions and groupings of memory in 2 of those new histograms are time histograms. They're high dynamic range histograms which are a combination of exponentially and linearly bucketed. Collapsing those down to be exponentially bucketed, we get 45 buckets each. We can then square the base of the exponential series and cut that down to 23, and then cut out a lot of the higher buckets (the range goes all the way up to O(days), we can just catch everything above 1 second into its own bucket for all of the existing ones). That gets us down to about 15. For the other two histograms, we can just do power-of-two exponential bucketing which collapses the 67 or so buckets down to 13 or 14. To be clear, this code would be generic, so any new runtime/metrics histograms metrics with unit "bytes" or "seconds" would immediately be minimized. I can add a test that ensures new metrics to be audited via the generator do not accidentally create a whole bunch of bloat. Another alternative is a So, in total, that's 83 additional metrics assuming we just shrink the histograms as I suggested above. According to @prymitive's back-of-envelope math that's an added on-server bloat of about 400 KiB for this release. Like I said earlier we can Side-note: the reasoning behind these huge histograms is it doesn't cost much (< 10 KiB I'll guess?) to actually manage these in the runtime, and it's better for the runtime to expose more granular data than less granular data. It's totally reasonable for downstream users to adjust that granularity. |
First of all, let's not remove old metrics. There are lots of dashboards out there would be broken. And @mknyszek as you suggested let's try to aggressively reduce the number of the buckets. With the a reason number of cardinality increase and better communication, this should be doable.
Yes, tests are essential for this case. And also as @beorn7 suggested, maybe we should go in a direction to make these additional costly metrics opt-in rather than enabling them implicitly until the major version bump. Or as it already discussed we can start with really conservative bucket distribution for all of them and increase the granularity with configuration rather than having an additional collector. Personally I would prefer this approach more. Otherwise, it would take longer times to discover these new useful metrics. What do you think? @bwplotka |
Er, very sorry. I was definitely not suggesting that. I recognize how important that is. And I can see how what I wrote that reads that way, but I meant that we can cut out new metrics that already exist in some form.
I have a WIP implementation of the reduced bucketing that means 10 exponential buckets for 2 of the histograms, and 13 for the other 2. That makes for a total of |
Thanks, reducing those histograms is definitely the first step. What if we make it so a user has 3 options:
Potentially also 4th one? without MemStats? |
Also a general note: I love high-res histograms, but Prometheus doesn't really do those yet. We are working on it, but it will take months before they are generally available (at which point we should definitely offer those awesome metrics for users of the new high-res histograms at fine granularity). |
@bwplotka Yeah, that works. My apologies for not just doing this up-front, but understanding the constraints now I get where you're coming from. Looking back over the comment history, if the biggest concern here is that 71 metrics being added for a minor release is too much, then I'm fine with only exporting the old metrics, backed by MemStats, and then flipping to the added 71 metrics on the major release (is that something that's happening relatively soon?). Finally, how should I expose the options? A new
(This is not the exact code I'd write, just a mock-up to express the idea.) @beorn7 Cool! Looking forward to that. Honestly though I wasn't necessarily expecting the high granularity histograms to always be used in production anyway; this whole issue is basically just because I forgot how big they were, otherwise I would've done the bucket minimization up-front. |
Yeah, I think this deserves a timely bugfix release. Together with #969. |
Sent #974 to mitigate the immediate problem. Waiting on advice on the best way to expose multiple options so users may opt out of the additional 79 values introduced in this release (I was off by a few because I forgot about the sum and count dimensions of histograms). |
Sorry for the lag @mknyszek and thanks for the quick reaction! Let's aim to release this week.
I would propose adding variadic option pattern so from |
Given how big #974 has gotten, I'll do the options as a separate change. Sorry for the delay this week, I'm on a rotation. |
Thank you very much for fixing this quickly. @kakkoyun are you planning to cut 1.12.1 any time soon? It would help me to cut the final Prometheus v2.33.0 release (the rc.1 has client_golang 1.12.0 with the race condition and the excessive amount of Go metrics). |
Yes, I will do it today Björn. And ping you when it’s out.
On Sat 29. Jan 2022 at 00:41, Björn Rabenstein ***@***.***> wrote:
Thank you very much for fixing this quickly.
@kakkoyun <https://github.com/kakkoyun> are you planning to cut 1.12.1
any time soon? It would help me to cut the final Prometheus v2.33.0 release
(the rc.1 has client_golang 1.12.0 with the race condition and the
excessive amount of Go metrics).
—
Reply to this email directly, view it on GitHub
<#967 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAEC7AKHQWBIP3DCVZ7I3NLUYMESLANCNFSM5MJVQSFA>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
--
Kemal Akkoyun
|
Thanks for fixing excessive bucket number count.
Isn't |
It is valid (or in other words: Prometheus histograms do support buckets for negative observations). But in this case, the observed GC pauses should never last a negative amount of time (that would mean the pause ended before it started? ;) . I assume that's just a floating point precision issue, and the bucket boundary should perhaps be adjusted (to zero?). @mknyszek you will know more about the internals here. |
@beorn7 is correct. It's the And also apologies for not getting around to the opt-out API yet, I've gotten distracted with other things. |
No worries @mknyszek and thanks for helping! I created another issue for opt-out API, (#983), so hopefully, users like @prymitive can fall back to the old behaviour if needed. Also help wanted if anyone else want to contribute this! 🤗 |
… metrics by default. Fixes #967 Signed-off-by: Bartlomiej Plotka <[email protected]>
Hi All. With new optionality, I would propose reverting the addition (by default) of new metrics. Please voice your opinion here: #1033 |
… metrics by default. (#1033) Fixes #967 Signed-off-by: Bartlomiej Plotka <[email protected]>
… metrics by default. (#1033) Fixes #967 Signed-off-by: Bartlomiej Plotka <[email protected]>
v1.12.2 removes some high cardinality metrics introduced in v1.12.0. See prometheus/client_golang#967 for context.
v1.12.2 removes some high cardinality metrics introduced in v1.12.0. See prometheus/client_golang#967 for context. Signed-off-by: Ben Reedy <[email protected]>
* Cut v1.12.0 Signed-off-by: Kemal Akkoyun <[email protected]> * Bump the day Signed-off-by: Kemal Akkoyun <[email protected]> * Make the Go 1.17 collector thread-safe (#969) * Use simpler locking in the Go 1.17 collector (#975) A previous PR made it so that the Go 1.17 collector locked only around uses of rmSampleBuf, but really that means that Metric values may be sent over the channel containing some values from future metrics.Read calls. While generally-speaking this isn't a problem, we lose any consistency guarantees provided by the runtime/metrics package. Also, that optimization to not just lock around all of Collect was premature. Truthfully, Collect is called relatively infrequently, and its critical path is fairly fast (10s of µs). To prove it, this change also adds a benchmark. name old time/op new time/op delta GoCollector-16 43.7µs ± 2% 43.2µs ± 2% ~ (p=0.190 n=9+9) Note that because the benchmark is single-threaded it actually looks like it might be getting *slightly* faster, because all those Collect calls for the Metrics are direct calls instead of interface calls. Signed-off-by: Michael Anthony Knyszek <[email protected]> * API client: make http reads more efficient (#976) Replace `io.ReadAll` with `bytes.Buffer.ReadFrom`. Both need to resize a buffer until they have finished reading; the former increases by 1.25x each time while the latter uses 2x. Also added a benchmark to demonstrate the benefit: name old time/op new time/op delta Client/4KB-8 35.9µs ± 4% 35.3µs ± 3% ~ (p=0.310 n=5+5) Client/50KB-8 83.1µs ± 8% 69.5µs ± 1% -16.37% (p=0.008 n=5+5) Client/1000KB-8 891µs ± 6% 750µs ± 0% -15.83% (p=0.016 n=5+4) Client/2000KB-8 1.74ms ± 2% 1.35ms ± 1% -22.72% (p=0.008 n=5+5) name old alloc/op new alloc/op delta Client/4KB-8 20.2kB ± 0% 20.4kB ± 0% +1.26% (p=0.008 n=5+5) Client/50KB-8 218kB ± 0% 136kB ± 0% -37.65% (p=0.008 n=5+5) Client/1000KB-8 5.88MB ± 0% 2.11MB ± 0% -64.10% (p=0.008 n=5+5) Client/2000KB-8 11.7MB ± 0% 4.2MB ± 0% -63.93% (p=0.008 n=5+5) name old allocs/op new allocs/op delta Client/4KB-8 75.0 ± 0% 72.0 ± 0% -4.00% (p=0.008 n=5+5) Client/50KB-8 109 ± 0% 98 ± 0% -10.09% (p=0.079 n=4+5) Client/1000KB-8 617 ± 0% 593 ± 0% -3.89% (p=0.008 n=5+5) Client/2000KB-8 1.13k ± 0% 1.09k ± 0% -3.27% (p=0.008 n=5+5) Signed-off-by: Bryan Boreham <[email protected]> * Reduce granularity of histogram buckets for Go 1.17 collector (#974) The Go runtime/metrics package currently exports extremely granular histograms. Exponentially bucket any histogram with unit "seconds" or "bytes" instead to dramatically reduce the number of buckets, and thus the number of metrics. This change also adds a test to check for expected cardinality to prevent cardinality surprises in the future. Signed-off-by: Michael Anthony Knyszek <[email protected]> * Cut v1.12.1 (#978) * Cut v1.12.1 Signed-off-by: Kemal Akkoyun <[email protected]> * Apply review suggestions Signed-off-by: Kemal Akkoyun <[email protected]> * Fix deprecated `NewBuildInfoCollector` API Update `examples/random/main.go`: `prometheus.NewBuildInfoCollector` is deprecated. Use `collectors.NewBuildInfoCollector` instead. Signed-off-by: alissa-tung <[email protected]> * gocollector: Added options to Go Collector for changing the (#1031) * Renamed files. Signed-off-by: Bartlomiej Plotka <[email protected]> * gocollector: Added options to Go Collector for diffetent collections. Fixes #983 Also: * fixed TestMemStatsEquivalence, it was noop before (: * Removed gc_cpu_fraction metric completely, since it's not working completely for Go1.17+ Signed-off-by: Bartlomiej Plotka <[email protected]> * gocollector: Reverted client_golang v1.12 addition of runtime/metrics metrics by default. (#1033) Fixes #967 Signed-off-by: Bartlomiej Plotka <[email protected]> * prometheus: Fix convention violating names for generated collector metrics (#1048) * Fix convention violating names for generated collector metrics Signed-off-by: Kemal Akkoyun <[email protected]> * Add new Go collector example Signed-off-by: Kemal Akkoyun <[email protected]> * Remove -Inf buckets from go collector histograms (#1049) * Remove -Inf buckets from go collector histograms Signed-off-by: Kemal Akkoyun <[email protected]> * Update prometheus/collectors/go_collector_latest_test.go Co-authored-by: Bartlomiej Plotka <[email protected]> Signed-off-by: Kemal Akkoyun <[email protected]> * Simplify Signed-off-by: Kemal Akkoyun <[email protected]> Co-authored-by: Bartlomiej Plotka <[email protected]> * Cut v1.12.2 (#1052) * Cut v1.12.2 Signed-off-by: Kemal Akkoyun <[email protected]> * Apply suggestions Signed-off-by: Kemal Akkoyun <[email protected]> * Update CHANGELOG.md Co-authored-by: Bartlomiej Plotka <[email protected]> Signed-off-by: Kemal Akkoyun <[email protected]> Co-authored-by: Bartlomiej Plotka <[email protected]> Co-authored-by: Kemal Akkoyun <[email protected]> Co-authored-by: Michael Knyszek <[email protected]> Co-authored-by: Bryan Boreham <[email protected]> Co-authored-by: Kemal Akkoyun <[email protected]> Co-authored-by: alissa-tung <[email protected]> Co-authored-by: Bartlomiej Plotka <[email protected]>
… metrics by default. (#1033) Fixes prometheus/client_golang#967 Signed-off-by: Bartlomiej Plotka <[email protected]>
… metrics by default. (#1033) Fixes prometheus/client_golang#967 Signed-off-by: Bartlomiej Plotka <[email protected]>
* Cut v1.12.0 Signed-off-by: Kemal Akkoyun <[email protected]> * Bump the day Signed-off-by: Kemal Akkoyun <[email protected]> * Make the Go 1.17 collector thread-safe (#969) * Use simpler locking in the Go 1.17 collector (#975) A previous PR made it so that the Go 1.17 collector locked only around uses of rmSampleBuf, but really that means that Metric values may be sent over the channel containing some values from future metrics.Read calls. While generally-speaking this isn't a problem, we lose any consistency guarantees provided by the runtime/metrics package. Also, that optimization to not just lock around all of Collect was premature. Truthfully, Collect is called relatively infrequently, and its critical path is fairly fast (10s of µs). To prove it, this change also adds a benchmark. name old time/op new time/op delta GoCollector-16 43.7µs ± 2% 43.2µs ± 2% ~ (p=0.190 n=9+9) Note that because the benchmark is single-threaded it actually looks like it might be getting *slightly* faster, because all those Collect calls for the Metrics are direct calls instead of interface calls. Signed-off-by: Michael Anthony Knyszek <[email protected]> * API client: make http reads more efficient (#976) Replace `io.ReadAll` with `bytes.Buffer.ReadFrom`. Both need to resize a buffer until they have finished reading; the former increases by 1.25x each time while the latter uses 2x. Also added a benchmark to demonstrate the benefit: name old time/op new time/op delta Client/4KB-8 35.9µs ± 4% 35.3µs ± 3% ~ (p=0.310 n=5+5) Client/50KB-8 83.1µs ± 8% 69.5µs ± 1% -16.37% (p=0.008 n=5+5) Client/1000KB-8 891µs ± 6% 750µs ± 0% -15.83% (p=0.016 n=5+4) Client/2000KB-8 1.74ms ± 2% 1.35ms ± 1% -22.72% (p=0.008 n=5+5) name old alloc/op new alloc/op delta Client/4KB-8 20.2kB ± 0% 20.4kB ± 0% +1.26% (p=0.008 n=5+5) Client/50KB-8 218kB ± 0% 136kB ± 0% -37.65% (p=0.008 n=5+5) Client/1000KB-8 5.88MB ± 0% 2.11MB ± 0% -64.10% (p=0.008 n=5+5) Client/2000KB-8 11.7MB ± 0% 4.2MB ± 0% -63.93% (p=0.008 n=5+5) name old allocs/op new allocs/op delta Client/4KB-8 75.0 ± 0% 72.0 ± 0% -4.00% (p=0.008 n=5+5) Client/50KB-8 109 ± 0% 98 ± 0% -10.09% (p=0.079 n=4+5) Client/1000KB-8 617 ± 0% 593 ± 0% -3.89% (p=0.008 n=5+5) Client/2000KB-8 1.13k ± 0% 1.09k ± 0% -3.27% (p=0.008 n=5+5) Signed-off-by: Bryan Boreham <[email protected]> * Reduce granularity of histogram buckets for Go 1.17 collector (#974) The Go runtime/metrics package currently exports extremely granular histograms. Exponentially bucket any histogram with unit "seconds" or "bytes" instead to dramatically reduce the number of buckets, and thus the number of metrics. This change also adds a test to check for expected cardinality to prevent cardinality surprises in the future. Signed-off-by: Michael Anthony Knyszek <[email protected]> * Cut v1.12.1 (#978) * Cut v1.12.1 Signed-off-by: Kemal Akkoyun <[email protected]> * Apply review suggestions Signed-off-by: Kemal Akkoyun <[email protected]> * Fix deprecated `NewBuildInfoCollector` API Update `examples/random/main.go`: `prometheus.NewBuildInfoCollector` is deprecated. Use `collectors.NewBuildInfoCollector` instead. Signed-off-by: alissa-tung <[email protected]> * gocollector: Added options to Go Collector for changing the (#1031) * Renamed files. Signed-off-by: Bartlomiej Plotka <[email protected]> * gocollector: Added options to Go Collector for diffetent collections. Fixes prometheus/client_golang#983 Also: * fixed TestMemStatsEquivalence, it was noop before (: * Removed gc_cpu_fraction metric completely, since it's not working completely for Go1.17+ Signed-off-by: Bartlomiej Plotka <[email protected]> * gocollector: Reverted client_golang v1.12 addition of runtime/metrics metrics by default. (#1033) Fixes prometheus/client_golang#967 Signed-off-by: Bartlomiej Plotka <[email protected]> * prometheus: Fix convention violating names for generated collector metrics (#1048) * Fix convention violating names for generated collector metrics Signed-off-by: Kemal Akkoyun <[email protected]> * Add new Go collector example Signed-off-by: Kemal Akkoyun <[email protected]> * Remove -Inf buckets from go collector histograms (#1049) * Remove -Inf buckets from go collector histograms Signed-off-by: Kemal Akkoyun <[email protected]> * Update prometheus/collectors/go_collector_latest_test.go Co-authored-by: Bartlomiej Plotka <[email protected]> Signed-off-by: Kemal Akkoyun <[email protected]> * Simplify Signed-off-by: Kemal Akkoyun <[email protected]> Co-authored-by: Bartlomiej Plotka <[email protected]> * Cut v1.12.2 (#1052) * Cut v1.12.2 Signed-off-by: Kemal Akkoyun <[email protected]> * Apply suggestions Signed-off-by: Kemal Akkoyun <[email protected]> * Update CHANGELOG.md Co-authored-by: Bartlomiej Plotka <[email protected]> Signed-off-by: Kemal Akkoyun <[email protected]> Co-authored-by: Bartlomiej Plotka <[email protected]> Co-authored-by: Kemal Akkoyun <[email protected]> Co-authored-by: Michael Knyszek <[email protected]> Co-authored-by: Bryan Boreham <[email protected]> Co-authored-by: Kemal Akkoyun <[email protected]> Co-authored-by: alissa-tung <[email protected]> Co-authored-by: Bartlomiej Plotka <[email protected]>
v1.12.2 removes some high cardinality metrics introduced in v1.12.0. See prometheus/client_golang#967 for context. Signed-off-by: Ben Reedy <[email protected]>
v1.12.2 removes some high cardinality metrics introduced in v1.12.0. See prometheus/client_golang#967 for context. Signed-off-by: Ben Reedy <[email protected]>
v1.12.2 removes some high cardinality metrics introduced in v1.12.0. See prometheus/client_golang#967 for context. Signed-off-by: Ben Reedy <[email protected]>
v1.12.2 removes some high cardinality metrics introduced in v1.12.0. See prometheus/client_golang#967 for context. Signed-off-by: Ben Reedy <[email protected]>
It seems that v1.12.0 includes some extra high cardinality metrics, most likely autogenerated by Go itself thanks to #955.
Here's a full diff of metrics between 1.11 and 1.12 - https://gist.github.com/prymitive/3d16b79a8187448bf6e0d61db7336ca0
Having so many extra metrics exposed by default might cause problems for people with lots of Go services.
The text was updated successfully, but these errors were encountered: