metrics: secrets engines should be in labels, not metric names (prometheus) #9068

Open

lrstanley opened this issue May 22, 2020 · 10 comments

@lrstanley

lrstanley commented May 22, 2020

Is your feature request related to a problem? Please describe.

Prometheus metrics are output with the following format today (snippet pulled from live cluster):

vault_route_read_<engine>_{quantile="0.5"} 0.031690001487731934
vault_route_read_<engine>_{quantile="0.9"} 0.05686499923467636
vault_route_read_<engine>_{quantile="0.99"} 0.15240700542926788
vault_route_read_<engine>__sum 568405.0729739927
vault_route_read_<engine>__count 1.2668209e+07

Couple of issues:

  • Unneeded trailing underscores in metric names.
  • Unneeded double underscores in metric names.
  • Values defaulting to NaN (not shown above) make querying difficult: you have to add > 0 to every query that hits a metric defaulting to NaN. These should default to 0, as counter resets (i.e. cluster restarts) are natively supported in Prometheus.
  • The main issue: secrets engines should not be in the metric names.
    • This makes it very difficult to do any kind of querying/dashboarding when you're not using static secrets engines (we have a few hundred secrets engines per cluster, all of which are dynamically created, and have dynamically created policies).
    • It means that to query things like "total reads", you have to use Prometheus __name__=~"vault_route_read_.+" label queries. However, this does not work in many situations (e.g. increase() queries). Using __name__ is also heavily discouraged:
      • PromQL bug with query_range and metrics name change prometheus/prometheus#4562 (comment)

        Using regexes on metric names is very strongly discouraged, outside of debugging Prometheus performance issues.

      • https://www.robustperception.io/whats-in-a-__name__

        Finally I'd like to talk about the most common reason users run into this. Users are trying to perform math across multiple metrics using an expression like rate({name=~"pool_.*_total"}[5m]), where there's a regex on the metric name. This is an anti-pattern, as you're meant to know your metric names a-priori. The only time you should use a regex against metric names is when doing performance debugging or metric exploration.

        The correct way to handle this is to use a label to distinguish the different pools (or whatever it is you're trying to distinguish), rather than encoding them inside the metric name. If it's not possible to do this in the code, you can workaround using metric_relabel_configs with replace actions to a) extract out the label and then b) set the metric name to a single value.

Describe the solution you'd like

  • Remove uses of unneeded underscore characters.
  • Default metrics to 0 rather than NaN.
  • Move secrets engines, and similar info currently encoded in metric names, into labels that allow proper querying, e.g.:
vault_route_read{engine="some-engine", namespace="some-namespace", quantile="0.5"} 0.031690001487731934
vault_route_read{engine="some-engine", namespace="some-namespace", quantile="0.9"} 0.05686499923467636
vault_route_read{engine="some-engine", namespace="some-namespace", quantile="0.99"} 0.15240700542926788
vault_route_read_sum{engine="some-engine", namespace="some-namespace"} 568405.0729739927
vault_route_read_count{engine="some-engine", namespace="some-namespace"} 1.2668209e+07

With the Go libraries for Prometheus, I suspect this would actually be easier to implement than the current approach.
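
For illustration, here is a minimal sketch of what label-based route metrics could look like with prometheus/client_golang. The metric name vault_route_read_seconds, the label set, and the wiring below are assumptions for the example, not Vault's actual implementation (Vault emits its telemetry through go-metrics):

package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// One summary family for all read routes; the engine and namespace become
// labels instead of being flattened into the metric name.
var routeRead = prometheus.NewSummaryVec(
	prometheus.SummaryOpts{
		Name:       "vault_route_read_seconds",
		Help:       "Latency of read operations against a mounted secrets engine.",
		Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
	},
	[]string{"engine", "namespace"},
)

func main() {
	prometheus.MustRegister(routeRead)

	// Creating the child series up front exports it immediately with
	// _sum=0 and _count=0 instead of leaving it absent (or NaN).
	routeRead.WithLabelValues("some-engine", "some-namespace")

	// In Vault, this observation would wrap the actual route handler.
	start := time.Now()
	time.Sleep(10 * time.Millisecond)
	routeRead.WithLabelValues("some-engine", "some-namespace").Observe(time.Since(start).Seconds())

	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9100", nil)
}

Scraping this yields the shape shown above (a single vault_route_read_seconds family with engine/namespace/quantile labels plus _sum and _count), which can be queried with ordinary label matchers instead of __name__ regexes.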

Describe alternatives you've considered

The only way of getting around this is to add metric_relabel_configs definitions on the Prometheus side. This would have to be done for each "type" of metric the Vault cluster emits. It's an anti-pattern, though, and shouldn't be necessary.

I'd submit a PR for these things but I simply don't have the time at the moment.

@wernerb

wernerb commented Oct 15, 2020

Any progress on a PR for this, or a workaround with metric_relabel_configs?

@aphorise
Contributor

aphorise commented Nov 2, 2020

Ditto - experiencing this as well. A solution is still being sought, and any indication that one is forthcoming would be much appreciated... the current metric naming isn't very practical, nor does it follow the common conventions of most other systems.

@azelezni

azelezni commented Apr 21, 2021

This does the job:

metric_relabel_configs:
  # Relabel regexes are fully anchored, so each rule must match the whole
  # metric name; the lazy (.+?) keeps the trailing underscore out of the
  # mountpoint label.
  - source_labels: [ __name__ ]
    regex: 'vault_route_([^_]+)_(.+?)_(_sum|_count)?'
    replacement: '${2}'
    target_label: mountpoint
  - source_labels: [ __name__ ]
    regex: 'vault_route_([^_]+)_(.+?)_(_sum|_count)?'
    replacement: '${1}'
    target_label: action
  # Collapse every per-mount metric name into vault_route, vault_route_sum,
  # or vault_route_count.
  - source_labels: [ __name__ ]
    regex: 'vault_route_([^_]+)_(.+?)_(_sum|_count)?'
    replacement: 'vault_route${3}'
    target_label: __name__

@aphorise
Contributor

@lrstanley - does the solution from @azelezni suffice? If so, perhaps we can get it documented in a related PR and then close this request. WDYT?

@lrstanley
Author

As described in the initial issue body, I don't think it's sufficient.

Describe alternatives you've considered

The only way of getting around this is to add metric_relabel_configs definitions on the Prometheus side. This would have to be done for each "type" of metric the Vault cluster emits. It's an anti-pattern, though, and shouldn't be necessary.

Requiring users to use relabel configs would also solve only one of the issues, the other being:

Values defaulting to NaN (not shown above) make querying difficult: you have to add > 0 to every query that hits a metric defaulting to NaN. These should default to 0, as counter resets (i.e. cluster restarts) are natively supported in Prometheus.

In short, Vault's current metrics implementation (for Prometheus metrics, at least) doesn't follow standards/best practices and should be revisited.
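
To illustrate the "default to 0" point, here is a minimal sketch with prometheus/client_golang (again, the metric and label names are made up for the example): a counter child created at startup is exported as 0 from the first scrape, so the series is always present and increase()/rate() can handle restarts as ordinary counter resets.

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
)

func main() {
	requests := prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "vault_route_requests_total",
			Help: "Requests handled, per secrets engine.",
		},
		[]string{"engine"},
	)
	prometheus.MustRegister(requests)

	// Touching the child at startup makes the series exist at 0 instead of
	// being absent (or NaN) until the first request is observed.
	requests.WithLabelValues("some-engine")

	fmt.Println(testutil.ToFloat64(requests.WithLabelValues("some-engine"))) // 0
	requests.WithLabelValues("some-engine").Inc()
	fmt.Println(testutil.ToFloat64(requests.WithLabelValues("some-engine"))) // 1
}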

@aphorise
Contributor

@lrstanley I'm curious whether you have a suggestion for how it could be better, one you might be able to share via a linked PR?

Any correction here would naturally be a disruptive change for existing Prom consumers, but I suspect that if a proposed solution were already in place and tested (maybe with a feature flag to enable it later), the community would have ample time to benefit from standard best practices.

@lrstanley
Author

@lrstanley I'm curious whether you have a suggestion for how it could be better, one you might be able to share via a linked PR?

Any correction here would naturally be a disruptive change for existing Prom consumers, but I suspect that if a proposed solution were already in place and tested (maybe with a feature flag to enable it later), the community would have ample time to benefit from standard best practices.

Most of this is covered in the issue description. I don't have bandwidth at the moment to look at creating a PR.

@Thor77

Thor77 commented Apr 17, 2023

#12741 fixes this for audit device metrics and I just created #20190 to fix this for request and rollback metrics.

Thor77 added a commit to Thor77/vault that referenced this issue May 12, 2023
@heatherezell
Contributor

Hi folks! Is this still an issue in newer versions of Vault? Please let me know so I can bubble it up accordingly. Thanks!

@Thor77

Thor77 commented Mar 26, 2024

@hsimon-hashicorp yes, I mentioned two PRs above that would fix this.
