Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reloader: Expose metrics to give info about last operation result/time #4594

Merged
merged 1 commit into from
Aug 25, 2021

Conversation

philipgough
Copy link
Contributor

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

In order to facilitate alerting on errors in the config reloader, this PR exposes four gauges which provide information on the last operation for both reloads and config applies.

These could be used similarly to prometheus_config_last_reload_success_timestamp_seconds and prometheus_config_last_reload_successful exposed by Prometheus.

@philipgough philipgough force-pushed the mon-1085 branch 2 times, most recently from 415f0b9 to a9080ee Compare August 23, 2021 16:14
CHANGELOG.md Outdated Show resolved Hide resolved
pkg/reloader/reloader.go Outdated Show resolved Hide resolved
@philipgough philipgough force-pushed the mon-1085 branch 3 times, most recently from ab4b835 to 8090288 Compare August 24, 2021 07:11
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, some suggestions first (:

@@ -154,6 +158,18 @@ func New(logger log.Logger, reg prometheus.Registerer, o *Options) *Reloader {
Help: "Total number of reload requests that failed.",
},
),
lastReloadSuccess: promauto.With(reg).NewGauge(
prometheus.GaugeOpts{
Name: "reloader_last_reload_successful",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why querying counter of all reloads does not give us that? 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bwplotka - I don't understand this sorry. We maintain a counter for total reloads and failed reloads. How would I determine from those if the latest retry was a success?

Copy link
Contributor Author

@philipgough philipgough Aug 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just looking at the mixin for bad config alert for Prometheus it achieves it using similar https://github.com/prometheus/prometheus/blob/main/documentation/prometheus-mixin/alerts.libsonnet#L8

Still unsure how we would guard against false alarm by determining without determining if the very last attempt was a success.

For example say I have a 90% failure rate on reloads, I might alert on that, although I wouldn't care if I knew the last reload attempt actually forced the reload. How would we achieve the same guard with the timestamp?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right - we can do that with increase(total[5m]) - increase(fails[5m]) > 0 no?

But happy to stick to standard if we are familiar with it

),
lastReloadSuccessTimestamp: promauto.With(reg).NewGauge(
prometheus.GaugeOpts{
Name: "reloader_last_reload_success_timestamp_seconds",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This, plus counter of reloads should give us enough info, so no need for reloader_last_reload_successful maybe? (Let's avoid too many ways of doing the same thing)

pkg/reloader/reloader.go Outdated Show resolved Hide resolved
pkg/reloader/reloader.go Outdated Show resolved Hide resolved
pkg/reloader/reloader.go Outdated Show resolved Hide resolved
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a fan of duplicated info here, but happy to stick to standard if we want.

@bwplotka bwplotka merged commit 38a1bc6 into thanos-io:main Aug 25, 2021
@philipgough philipgough deleted the mon-1085 branch August 25, 2021 18:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants