Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow explicit metric registration. Fixes #11732 #27966

Merged
merged 6 commits into from
Aug 30, 2024
Merged

Conversation

banks
Copy link
Member

@banks banks commented Aug 5, 2024

Background

For years we've had many issues with "missing metrics" caused by long-standing issues in go-metrics where we only output Prometheus metrics after they've been emitted once and then only for a fixed retention time. This is not what Prometheus or the ecosystem around it expects as generally metrics are explicitly defined in software and then consistently reported. This breaks dashboards and forces ugly workarounds.

One (of many) such issue is #11732. I came accross this in the same week that I'd had a conversation with another Vault developer working on an unrelated customer observability issue that stems from the same problem.

In Consul a few years ago we came up with a reasonable workaround here for Prometheus. It's not ideal still in many ways, not least that metrics definitions for libraries like Raft still need to be enumerated in the application itself as we've not standardised a way to define these for all users of go metrics.

This approach involves statically defining metrics in code during init and then passing these definitions to the Prometheus Sink if it is configured. The PrometheusSink will then always output the defined metrics even if they've not been recorded yet and will not expire them after it's retention period.

This approach and code layout has been discussed internally already. The primary reason for adding the new registry to the SDK is so that a few internal (built in) plugins that don't share the same code base and can't import the vault module could potentially use them.

Note: this mechanism will not work for third-party or externally managed plugins since they are separate processes and can't hook into Vault before its metrics sinks are setup. Supporting metrics from external plugins would be a much larger project involving extensions to the plugin API to allow metrics to be transported to Vault for reporting. We've not heard from plugin developers internally or externally that this is an active need so it's not included here.

Result

To demonstrate usage and test this I fixed #11732 specifically by adding the three metrics defined in ha.go. This can be seen by starting a dev server and immediately being able to see those metrics in output despite a step down (or leadership failure) never having occured:

❯ curl -sH "X-Vault-Token: $VAULT_TOKEN" "127.0.0.1:8200/v1/sys/metrics?format=prometheus" | rg step_down
# HELP core_step_down Time required to step down cluster leadership
# TYPE core_step_down summary
core_step_down{quantile="0.5"} NaN
core_step_down{quantile="0.9"} NaN
core_step_down{quantile="0.99"} NaN
core_step_down_sum 0
core_step_down_count 0

TODO only if you're a HashiCorp employee

  • Labels: If this PR is the CE portion of an ENT change, and that ENT change is
    getting backported to N-2, use the new style backport/ent/x.x.x+ent labels
    instead of the old style backport/x.x.x labels.
  • Labels: If this PR is a CE only change, it can only be backported to N, so use
    the normal backport/x.x.x label (there should be only 1).
  • ENT Breakage: If this PR either 1) removes a public function OR 2) changes the signature
    of a public function, even if that change is in a CE file, double check that
    applying the patch for this PR to the ENT repo and running tests doesn't
    break any tests. Sometimes ENT only tests rely on public functions in CE
    files.
  • Jira: If this change has an associated Jira, it's referenced either
    in the PR description, commit message, or branch name.
  • RFC: If this change has an associated RFC, please link it in the description.
  • ENT PR: If this change has an associated ENT PR, please link it in the
    description. Also, make sure the changelog is in this PR, not in your ENT PR.

@banks banks added this to the 1.18.0-rc milestone Aug 5, 2024
@github-actions github-actions bot added the hashicorp-contributed-pr If the PR is HashiCorp (i.e. not-community) contributed label Aug 5, 2024
Copy link

github-actions bot commented Aug 5, 2024

CI Results:
All Go tests succeeded! ✅

Copy link

github-actions bot commented Aug 5, 2024

Build Results:
All builds succeeded! ✅

}
opts.SummaryDefinitions = []promsink.SummaryDefinition{
{
Name: []string{"preexisting_summary"},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if one of the preexisting names clashes with one of the incoming names? Is that worth testing and/or documenting?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good Q. I think the answer is last-write-wins based on:

https://github.com/hashicorp/go-metrics/blob/0ec744010d013e2ce8d0d71c69a9c43bfe523efc/prometheus/prometheus.go#L209-L221

They are inserted into a sync.Map keyed by the hash of the metric name and constant labels so if those collide the second registration wins.

We could de-dupe or panic or something here as it's a bug to have multiple definitions, but it also doesn't do any harm since by definition both registrations are identical!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A more subtle question is what happens if you register a metric with the same name under different types - I think the answer is still last write wins but now it will impact the actual value of the metric. That said, even before this change it's possible to name two different metrics the same way and you'll get broken and unpredictable metrics behaviour already from doing so with no warning, so I don't think this makes anything any worse...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@raskchanky do you think it's important to add tests for this case? We don't have tests or runtime checks that code doesn't duplicate metric names in general now which would have a similar impact so I feel like it's not going to really tell us much but open to adding something if I'm missing something?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Up to you really. I didn't realize that nothing here was changing any existing behavior, so if it's been this way since the beginning w/o tests, and we haven't run into any metrics related issues because of it, then we're probably fine to continue that way.

vault/ha.go Outdated Show resolved Hide resolved
@banks banks enabled auto-merge (squash) August 30, 2024 14:38
@banks banks merged commit bb5f658 into main Aug 30, 2024
83 checks passed
@banks banks deleted the f/metrics-definitions branch August 30, 2024 14:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hashicorp-contributed-pr If the PR is HashiCorp (i.e. not-community) contributed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Missing core leadership metrics
2 participants