metrics: secrets engines should be in labels, not metric names (prometheus) #9068

Open

lrstanley opened this issue May 22, 2020 · 10 comments

@lrstanley

lrstanley commented May 22, 2020

Is your feature request related to a problem? Please describe.

Prometheus metrics are output with the following format today (snippet pulled from live cluster):

vault_route_read_<engine>_{quantile="0.5"} 0.031690001487731934
vault_route_read_<engine>_{quantile="0.9"} 0.05686499923467636
vault_route_read_<engine>_{quantile="0.99"} 0.15240700542926788
vault_route_read_<engine>__sum 568405.0729739927
vault_route_read_<engine>__count 1.2668209e+07

Couple of issues:

  • Unneeded trailing underscores in metric names.
  • Unneeded double underscores in metric names.
  • Values defaulting to NaN (not shown above) make querying difficult: you have to add > 0 to every query that hits a metric defaulting to NaN. These should default to 0, as counter resets (i.e. cluster restarts) are natively supported in Prometheus.
  • The main issue: secrets engines should not be in the metric names.
    • This makes it very difficult to do any kind of querying/dashboarding when you're not using static secrets engines (we have a few hundred secrets engines per cluster, all of which are dynamically created, and have dynamically created policies).
    • It means that to query things like "total reads", you have to use Prometheus __name__=~"vault_route_read_.+" label queries. However, this does not work in many situations (e.g. increase() queries). Using __name__ is also heavily discouraged:
      • PromQL bug with query_range and metrics name change prometheus/prometheus#4562 (comment)

        Using regexes on metric names is very strongly discouraged, outside of debugging Prometheus performance issues.

      • https://www.robustperception.io/whats-in-a-__name__

        Finally I'd like to talk about the most common reason users run into this. Users are trying to perform math across multiple metrics using an expression like rate({name=~"pool_.*_total"}[5m]), where there's a regex on the metric name. This is an anti-pattern, as you're meant to know your metric names a-priori. The only time you should use a regex against metric names is when doing performance debugging or metric exploration.

        The correct way to handle this is to use a label to distinguish the different pools (or whatever it is you're trying to distinguish), rather than encoding them inside the metric name. If it's not possible to do this in the code, you can workaround using metric_relabel_configs with replace actions to a) extract out the label and then b) set the metric name to a single value.

Describe the solution you'd like

  • Remove uses of unneeded underscore characters.
  • Default metrics to 0 rather than NaN.
  • Move secrets engines, and similar info currently encoded in metric names, into labels that allow proper querying, e.g.:
vault_route_read{engine="some-engine", namespace="some-namespace", quantile="0.5"} 0.031690001487731934
vault_route_read{engine="some-engine", namespace="some-namespace", quantile="0.9"} 0.05686499923467636
vault_route_read{engine="some-engine", namespace="some-namespace", quantile="0.99"} 0.15240700542926788
vault_route_read_sum{engine="some-engine", namespace="some-namespace"} 568405.0729739927
vault_route_read_count{engine="some-engine", namespace="some-namespace"} 1.2668209e+07

With the Go libraries for Prometheus, I suspect this would actually be easier to implement than the current approach.
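
For illustration, here is a minimal sketch of what label-based route metrics could look like with prometheus/client_golang. The metric name vault_route_read_seconds, the label set, and the wiring below are assumptions for the example, not Vault's actual implementation (Vault emits its telemetry through go-metrics):

package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// One summary family for all read routes; the engine and namespace become
// labels instead of being flattened into the metric name.
var routeRead = prometheus.NewSummaryVec(
	prometheus.SummaryOpts{
		Name:       "vault_route_read_seconds",
		Help:       "Latency of read operations against a mounted secrets engine.",
		Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
	},
	[]string{"engine", "namespace"},
)

func main() {
	prometheus.MustRegister(routeRead)

	// Creating the child series up front exports it immediately with
	// _sum=0 and _count=0 instead of leaving it absent (or NaN).
	routeRead.WithLabelValues("some-engine", "some-namespace")

	// In Vault, this observation would wrap the actual route handler.
	start := time.Now()
	time.Sleep(10 * time.Millisecond)
	routeRead.WithLabelValues("some-engine", "some-namespace").Observe(time.Since(start).Seconds())

	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9100", nil)
}

Scraping this yields the shape shown above (a single vault_route_read_seconds family with engine/namespace/quantile labels plus _sum and _count), which can be queried with ordinary label matchers instead of __name__ regexes.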

Describe alternatives you've considered

The only way of getting around this is to add metric_relabel_configs definitions on the Prometheus side. This would have to be done for each "type" of metric the Vault cluster emits. It's an anti-pattern, though, and shouldn't be necessary.

I'd submit a PR for these things but I simply don't have the time at the moment.

@wernerb

wernerb commented Oct 15, 2020

Any progress on a PR for this, or a workaround with metric_relabel_configs?

@aphorise
Contributor

aphorise commented Nov 2, 2020

Ditto - experiencing this as well. A solution is still being sought, and any indication that one is forthcoming would be much appreciated... the current metric naming isn't very practical, nor does it follow the common conventions of most other systems.

@azelezni

azelezni commented Apr 21, 2021

This does the job:

metric_relabel_configs:
  # Relabel regexes are fully anchored, so each rule must match the whole
  # metric name; the lazy (.+?) keeps the trailing underscore out of the
  # mountpoint label.
  - source_labels: [ __name__ ]
    regex: 'vault_route_([^_]+)_(.+?)_(_sum|_count)?'
    replacement: '${2}'
    target_label: mountpoint
  - source_labels: [ __name__ ]
    regex: 'vault_route_([^_]+)_(.+?)_(_sum|_count)?'
    replacement: '${1}'
    target_label: action
  # Collapse every per-mount metric name into vault_route, vault_route_sum,
  # or vault_route_count.
  - source_labels: [ __name__ ]
    regex: 'vault_route_([^_]+)_(.+?)_(_sum|_count)?'
    replacement: 'vault_route${3}'
    target_label: __name__

@aphorise
Contributor

@lrstanley - does the solution from @azelezni suffice? If so, perhaps we can get it documented in a related PR and then close this request. WDYT?

@lrstanley
Author

As described in the initial issue body, I don't think it's sufficient.

Describe alternatives you've considered

The only way of getting around this is to add metric_relabel_configs definitions on the Prometheus side. This would have to be done for each "type" of metric the Vault cluster emits. It's an anti-pattern, though, and shouldn't be necessary.

Requiring users to use relabel configs would also solve only one of the issues, the other being:

Values defaulting to NaN (not shown above) make querying difficult: you have to add > 0 to every query that hits a metric defaulting to NaN. These should default to 0, as counter resets (i.e. cluster restarts) are natively supported in Prometheus.

In short, Vault's current metrics implementation (for Prometheus metrics, at least) doesn't follow standards/best practices and should be revisited.
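
To illustrate the "default to 0" point, here is a minimal sketch with prometheus/client_golang (again, the metric and label names are made up for the example): a counter child created at startup is exported as 0 from the first scrape, so the series is always present and increase()/rate() can handle restarts as ordinary counter resets.

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
)

func main() {
	requests := prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "vault_route_requests_total",
			Help: "Requests handled, per secrets engine.",
		},
		[]string{"engine"},
	)
	prometheus.MustRegister(requests)

	// Touching the child at startup makes the series exist at 0 instead of
	// being absent (or NaN) until the first request is observed.
	requests.WithLabelValues("some-engine")

	fmt.Println(testutil.ToFloat64(requests.WithLabelValues("some-engine"))) // 0
	requests.WithLabelValues("some-engine").Inc()
	fmt.Println(testutil.ToFloat64(requests.WithLabelValues("some-engine"))) // 1
}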

@aphorise
Contributor

@lrstanley I'm curious whether you have a suggestion for how it could be better, one you might be able to share via a linked PR?

Any correction here would naturally be a disruptive change for existing Prom consumers, but I suspect that if a proposed solution were already in place and tested (maybe with a feature flag to enable it later), the community would have ample time to benefit from standard best practices.

@lrstanley
Author

@lrstanley I'm curious whether you have a suggestion for how it could be better, one you might be able to share via a linked PR?

Any correction here would naturally be a disruptive change for existing Prom consumers, but I suspect that if a proposed solution were already in place and tested (maybe with a feature flag to enable it later), the community would have ample time to benefit from standard best practices.

Most of this is covered in the issue description. I don't have bandwidth at the moment to look at creating a PR.

@Thor77

Thor77 commented Apr 17, 2023

#12741 fixes this for audit device metrics and I just created #20190 to fix this for request and rollback metrics.

Thor77 added a commit to Thor77/vault that referenced this issue May 12, 2023
@heatherezell
Contributor

Hi folks! Is this still an issue in newer versions of Vault? Please let me know so I can bubble it up accordingly. Thanks!

@Thor77

Thor77 commented Mar 26, 2024

@hsimon-hashicorp yes, I mentioned two PRs above that would fix this.
