Allow explicit metric registration. Fixes #11732 #27966

banks · 2024-08-05T13:27:48Z

Background

For years we've had many reports of "missing metrics" caused by a long-standing limitation in go-metrics: Prometheus metrics are only output after they've been emitted once, and then only for a fixed retention time. This is not what Prometheus or the ecosystem around it expects, since metrics are generally defined explicitly in software and then reported consistently. The result is broken dashboards and ugly workarounds.

One (of many) such issues is #11732. I came across this in the same week that I'd had a conversation with another Vault developer working on an unrelated customer observability issue that stems from the same problem.

In Consul a few years ago we came up with a reasonable workaround for this in Prometheus. It's still not ideal in many ways, not least that metric definitions for libraries like Raft still need to be enumerated in the application itself, as we've not standardised a way to define these for all users of go-metrics.

This approach involves statically defining metrics in code during init and then passing these definitions to the Prometheus sink if it is configured. The PrometheusSink will then always output the defined metrics, even if they've not been recorded yet, and will not expire them after its retention period.

This approach and code layout have been discussed internally already. The primary reason for adding the new registry to the SDK is so that a few internal (built-in) plugins that don't share the same code base and can't import the vault module could potentially use it.

Note: this mechanism will not work for third-party or externally managed plugins, since they are separate processes and can't hook into Vault before its metrics sinks are set up. Supporting metrics from external plugins would be a much larger project involving extensions to the plugin API to allow metrics to be transported to Vault for reporting. We've not heard from plugin developers, internally or externally, that this is an active need, so it's not included here.

Result

To demonstrate usage and test this, I fixed #11732 specifically by adding the three metrics defined in ha.go. This can be seen by starting a dev server and immediately being able to see those metrics in the output, despite a step down (or leadership failure) never having occurred:

❯ curl -sH "X-Vault-Token: $VAULT_TOKEN" "127.0.0.1:8200/v1/sys/metrics?format=prometheus" | rg step_down
# HELP core_step_down Time required to step down cluster leadership
# TYPE core_step_down summary
core_step_down{quantile="0.5"} NaN
core_step_down{quantile="0.9"} NaN
core_step_down{quantile="0.99"} NaN
core_step_down_sum 0
core_step_down_count 0

github-actions · 2024-08-05T13:30:52Z

CI Results:
All Go tests succeeded! ✅

github-actions · 2024-08-05T13:31:19Z

Build Results:
All builds succeeded! ✅

raskchanky · 2024-08-05T21:41:34Z

sdk/helper/metricregistry/metricregistry_test.go

+	}
+	opts.SummaryDefinitions = []promsink.SummaryDefinition{
+		{
+			Name: []string{"preexisting_summary"},


What happens if one of the preexisting names clashes with one of the incoming names? Is that worth testing and/or documenting?

Good Q. I think the answer is last-write-wins based on:

https://github.com/hashicorp/go-metrics/blob/0ec744010d013e2ce8d0d71c69a9c43bfe523efc/prometheus/prometheus.go#L209-L221

They are inserted into a sync.Map keyed by the hash of the metric name and constant labels so if those collide the second registration wins.

We could de-dupe or panic or something here as it's a bug to have multiple definitions, but it also doesn't do any harm since by definition both registrations are identical!

A more subtle question is what happens if you register a metric with the same name under different types - I think the answer is still last write wins but now it will impact the actual value of the metric. That said, even before this change it's possible to name two different metrics the same way and you'll get broken and unpredictable metrics behaviour already from doing so with no warning, so I don't think this makes anything any worse...

@raskchanky do you think it's important to add tests for this case? We don't currently have tests or runtime checks that code doesn't duplicate metric names in general, which would have a similar impact, so I feel like it's not really going to tell us much. I'm open to adding something if I'm missing something, though.

Up to you really. I didn't realize that nothing here was changing any existing behavior, so if it's been this way since the beginning w/o tests, and we haven't run into any metrics related issues because of it, then we're probably fine to continue that way.

vault/ha.go

Register ha timing metrics. Fixes #11732

baa77c0

banks added this to the 1.18.0-rc milestone Aug 5, 2024

github-actions bot added the hashicorp-contributed-pr label Aug 5, 2024

Add CHANGELOG

d42f01f

banks added 2 commits August 5, 2024 14:31

Fix copywrite headers

c90d3a1

Relicence SDK files after move

14a8e91

banks mentioned this pull request Aug 5, 2024

Clarify audit log failure telemetry docs. #27969

Merged

6 tasks

raskchanky reviewed Aug 5, 2024

raskchanky approved these changes Aug 7, 2024

banks commented Aug 30, 2024


banks added 2 commits August 30, 2024 15:37

Update vault/ha.go

9ab3f9b

Merge branch 'main' into f/metrics-definitions

24e7faa

banks enabled auto-merge (squash) August 30, 2024 14:38

vercel bot deployed to Preview August 30, 2024 14:45 View deployment

banks merged commit bb5f658 into main Aug 30, 2024
83 checks passed

banks deleted the f/metrics-definitions branch August 30, 2024 14:54
