Allow explicit metric registration. Fixes #11732 #27966
Conversation
}
opts.SummaryDefinitions = []promsink.SummaryDefinition{
	{
		Name: []string{"preexisting_summary"},
What happens if one of the preexisting names clashes with one of the incoming names? Is that worth testing and/or documenting?
Good Q. I think the answer is last-write-wins: definitions are inserted into a sync.Map keyed by the hash of the metric name and constant labels, so if those collide the second registration wins.
We could de-dupe or panic or something here, as it's a bug to have multiple definitions, but it also doesn't do any harm, since by definition both registrations are identical!
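A minimal sketch of that behaviour (illustrative only, not the sink's actual code; in the real sink the key is a hash of the flattened metric name plus constant labels):

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var defs sync.Map

	// Stand-in for the hash of metric name + constant labels.
	key := "preexisting_summary"

	defs.Store(key, "first definition")
	defs.Store(key, "second definition") // same key: last write wins

	v, _ := defs.Load(key)
	fmt.Println(v) // prints "second definition"
}
```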
A more subtle question is what happens if you register a metric with the same name under different types. I think the answer is still last-write-wins, but now it will affect the actual value of the metric. That said, even before this change it was possible to give two different metrics the same name, and doing so already produces broken and unpredictable metrics behaviour with no warning, so I don't think this makes anything worse.
@raskchanky do you think it's important to add tests for this case? We don't currently have tests or runtime checks that code doesn't duplicate metric names in general, which would have a similar impact, so I feel like a test here wouldn't really tell us much, but I'm open to adding something if I'm missing something.
Up to you really. I didn't realize that nothing here was changing any existing behavior, so if it's been this way since the beginning w/o tests, and we haven't run into any metrics related issues because of it, then we're probably fine to continue that way.
Background
For years we've had many reports of "missing metrics" caused by long-standing behaviour in go-metrics: we only output Prometheus metrics after they've been emitted at least once, and then only for a fixed retention time. This is not what Prometheus or the ecosystem around it expects, since metrics are generally defined explicitly in software and then consistently reported. This breaks dashboards and forces ugly workarounds.
One of many such issues is #11732. I came across this in the same week that I'd had a conversation with another Vault developer working on an unrelated customer observability issue that stems from the same problem.
In Consul a few years ago we came up with a reasonable workaround for this for Prometheus. It's still not ideal in many ways, not least that metric definitions for libraries like Raft still need to be enumerated in the application itself, as we've not standardised a way to define these for all users of go-metrics.
This approach involves statically defining metrics in code during init and then passing these definitions to the Prometheus sink if it is configured. The `PrometheusSink` will then always output the defined metrics even if they've not been recorded yet, and will not expire them after its retention period.

This approach and code layout have been discussed internally already. The primary reason for adding the new registry to the SDK is so that a few internal (built-in) plugins that don't share the same code base and can't import the `vault` module could potentially use it.

Note: this mechanism will not work for third-party or externally managed plugins, since they are separate processes and can't hook into Vault before its metrics sinks are set up. Supporting metrics from external plugins would be a much larger project involving extensions to the plugin API to allow metrics to be transported to Vault for reporting. We've not heard from plugin developers internally or externally that this is an active need, so it's not included here.
Result
To demonstrate usage and test this, I fixed #11732 specifically by adding the three metrics defined in ha.go. This can be verified by starting a dev server: those metrics are immediately visible in the output despite a step-down (or leadership failure) never having occurred.
TODO only if you're a HashiCorp employee
- Backport labels: if this PR is getting backported to N-2, use the new style `backport/ent/x.x.x+ent` labels instead of the old style `backport/x.x.x` labels; otherwise use the normal `backport/x.x.x` label (there should be only 1).
- ENT breakage: if this PR changes the signature of a public function, even if that change is in a CE file, double check that applying the patch for this PR to the ENT repo and running tests doesn't break any tests. Sometimes ENT-only tests rely on public functions in CE files.
- Jira: if this change has an associated Jira, reference it in the PR description, commit message, or branch name.
- ENT PR: if this change has an associated ENT PR, link it in the description. Also, make sure the changelog is in this PR, not in your ENT PR.