Prometheus support on v1/sys/metrics endpoint #5308
Conversation
@jefferai, sorry for the ping, but do you think this is enough for #5223?
I have not looked at the code, but one thing that was surfaced to me by a team member is that this uses an unauthenticated endpoint. The endpoint must be authenticated.
Prometheus does not support passing arbitrary headers (hence no X-Vault-Token). After searching, I found that a PR has been made on Consul to be compliant with RFC 6750 and allow Authorization: Bearer headers to be used in place of X-Consul-Token.
Makes sense to me; @chrishoffman @briankassouf thoughts?
Note for the team: if we support this we need to be sure to blacklist Authorization from being a passthrough request header.
Sounds good to me too
Hello again, small update: #5397 has been merged, so I guess it could be used. However, when we discussed this at HashiConf with @jefferai, you mentioned that metrics information is sensitive and should not be open in the wild, and we talked about maybe adding it behind a separate listener with CIDR limitations. Would that still be OK? It would ease the integration with third-party monitoring tools.
I think if it's behind a token, additional CIDR restrictions aren't necessary, especially given you can pop CIDR restrictions on tokens as well. @briankassouf agree?
Yep, now that it is behind a token we should be okay to allow these to be exported through the API
The point of binding to another listener was to avoid leaking potentially sensitive information (since one may consider Vault's metrics sensitive) while still being usable by a regular monitoring system without having to authenticate. It's not impossible for Prometheus to hit the endpoint with a Vault token, but it would complicate existing setups for all users, as Prometheus has no logic of its own to renew a provided token. What do you think @jefferai @briankassouf?
Prometheus may not have a built-in way to renew a provided token, but it has the bearer_token_file config option, so something external like a cron job can handle renewal. Why not proceed with the existing PR under the assumption that the token is required, and use another issue for the tokenless CIDR-restricted listener idea? That way we can at least get the work you've already done merged, rather than have it blocked by a more contentious extra feature. I tried applying the changes on the tip of master and was able to monitor Vault using Prometheus with bearer token auth. As far as I'm concerned it looks almost ready to go. Thanks for doing this, by the way; I love Prometheus and look forward to being able to monitor Vault with it!
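To illustrate what this enables, here is a rough sketch of a Prometheus scrape config using bearer_token_file against the authenticated endpoint; the hostname, port, and token file path are placeholders, not part of this PR:

```yaml
scrape_configs:
  - job_name: vault
    metrics_path: /v1/sys/metrics
    params:
      format: ['prometheus']
    scheme: https
    # The token file is kept fresh by something external
    # (e.g. a cron job or a templating helper).
    bearer_token_file: /etc/prometheus/vault-token
    static_configs:
      - targets: ['vault.example.com:8200']
```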
This is awesome! This feature has been on my personal wishlist for a while. I'm glad this is coming in soon. |
I think we should leave this behind token authentication. There currently isn't a good way to make sure an endpoint is only accessed via one of many server listener configs. Additionally, as @ncabatoff said, there are many helper libraries that can keep a token renewed for you (see consul-template). Alternatively, you could create a token with a very long TTL and only give it ACL permissions for the metrics endpoint. The main change still required in this PR is to move the endpoint into the logical system backend. This will move it into an authenticated section of Vault's API. As it stands right now the endpoint is still unauthenticated. See
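To sketch the long-TTL token approach mentioned above (the exact path should be verified against the final endpoint), a minimal policy could look like:

```hcl
# Read-only access to just the metrics endpoint.
path "sys/metrics" {
  capabilities = ["read"]
}
```

A long-TTL or periodic token carrying only this policy could then be what Prometheus' bearer_token_file points at.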
Yeah, at HashiConf I was saying that putting it behind a token was a better idea if possible, which it seems it is.
Fine by me :)
OK, it was nastier than I imagined to put it in the Sys backend, but it works as expected. I added a passthrough header for "Accept" in the Sys backend to support the OpenMetrics format, as requested in the comments.
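For anyone following along, the two trigger paths under discussion would look roughly like this (host and token are placeholders; behavior should be confirmed against the final diff):

```sh
# Explicit format parameter:
curl -H "X-Vault-Token: $VAULT_TOKEN" \
  "https://vault.example.com:8200/v1/sys/metrics?format=prometheus"

# Accept-header passthrough (plus the RFC 6750 bearer support from #5397):
curl -H "Authorization: Bearer $VAULT_TOKEN" \
  -H "Accept: application/openmetrics-text" \
  "https://vault.example.com:8200/v1/sys/metrics"
```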
I was asking for this based on comparison with promhttp.HandlerFor(), but it looks like they've stopped setting content length in recent versions. I retract my request. |
vault/logical_system.go
```go
acceptHeaders := req.Headers["Accept"]
if format == "prometheus" || (len(acceptHeaders) > 0 && strings.HasPrefix(acceptHeaders[0], "application/openmetrics-text")) {
	if !b.Core.prometheusEnabled {
```
What's the reason for having this guard? It's an ACL'd call so presumably if you're allowed to access it you should be allowed to get that data in whatever format you want.
I used the same logic as the other telemetry backends: it is disabled by default unless you specify a valid retention time in the telemetry config.
Do you feel we should add a default value and enable it for everybody? @jefferai @ncabatoff
Right, but then unlike the other types you're plumbing a value all the way through core. Prometheus is just a format; I don't really see a reason to do this. If someone has access to the metrics, it seems like they ought to be able to fetch them in whatever format they want.
Fair point, will modify to set a default value
We're going to 1.0.3 a bit earlier than expected so we're moving this to the 1.1 beta, which should be in a couple of weeks.
command/server.go
```go
	Expiration: telConfig.PrometheusRetentionTime,
}

sink, err := prometheus.NewPrometheusSinkFrom(prometheusOpts)
```
Wouldn't you only want to do this if the retention time is not zero? That way it'd gate it based on configuration existing, like the other types.
Sorry, I may have misunderstood your previous comment then :(
You mentioned it might be better if users have all formats available with no gates, so right now specifying a 0 value for the Prometheus retention returns an error, and the default is a 24h retention.
Would you prefer that the feature still be disabled if an explicit 0 is used?
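For clarity, here is a minimal sketch of the gating being discussed, assuming the armon/go-metrics FanoutSink plumbing used by the other sinks; the helper name is hypothetical and not part of this PR:

```go
package telemetry

import (
	"time"

	metrics "github.com/armon/go-metrics"
	prometheussink "github.com/armon/go-metrics/prometheus"
)

// configurePrometheusSink only creates the Prometheus sink when a non-zero
// retention time has been configured, mirroring how the other telemetry
// sinks are enabled only when their config values are present.
func configurePrometheusSink(retention time.Duration, fanout metrics.FanoutSink) (metrics.FanoutSink, error) {
	if retention == 0 {
		// No prometheus_retention_time configured: leave Prometheus disabled.
		return fanout, nil
	}
	sink, err := prometheussink.NewPrometheusSinkFrom(prometheussink.PrometheusOpts{
		Expiration: retention,
	})
	if err != nil {
		return nil, err
	}
	return append(fanout, sink), nil
}
```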
Looks great! Thanks for all the hard work on this!
Prometheus metrics were added as part of the Vault v1.1.0 release in PR #5308, but no documentation was created. Adds the telemetry configuration docs and the API docs.
@uepoch Are metrics shared in cluster mode or should one scrape all cluster instances?
Metrics are internal to each process and not shared, so one should scrape all cluster instances. In OSS, most metrics on standby nodes will be irrelevant (notable exceptions being HA and seal related metrics). They should still be scraped; otherwise, if the active node fails and a standby takes over, your metrics will cease to be relevant. Similarly, if a standby stops responding to scrape requests, you know your HA solution has become less highly available.
@ncabatoff I was not able to find any HA/Seal metrics in the output. Am I missing something?
Seal metrics were only just added in #6478. Unfortunately, due to the way we write Prometheus metrics they're not persistent, so if nothing has happened recently in a given domain you won't see any metrics for that domain. The HA summary metrics I'm thinking of include vault_core_step_down, vault_core_leadership_lost, and vault_core_leadership_setup_failed. Try killing your active node (in a test env!) and see what happens.
Add support for Prometheus and InMemSink as described in #5223.
This adds support for a Prometheus sink in the telemetry config, enabled by specifying a prometheus_retention_time key in the telemetry configuration, similar to Consul or Nomad (see the example stanza below).
To pass the information from the config and server command to the HTTP handler, it expands HandlerProps to carry the in-memory sink and the optional Prometheus retention time.
Also made the choice not to forward anything to the leader; each node has its own sinks.
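For reference, a minimal telemetry stanza enabling the sink might look like the following; the value shown matches the 24h default discussed above and is only illustrative:

```hcl
telemetry {
  # A non-zero retention enables the Prometheus sink; metrics expire from
  # the in-memory registry after this window.
  prometheus_retention_time = "24h"
}
```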