-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
reloader: Expose metrics to give info about last operation result/time #4594
Conversation
415f0b9
to
a9080ee
Compare
ab4b835
to
8090288
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, some suggestions first (:
@@ -154,6 +158,18 @@ func New(logger log.Logger, reg prometheus.Registerer, o *Options) *Reloader { | |||
Help: "Total number of reload requests that failed.", | |||
}, | |||
), | |||
lastReloadSuccess: promauto.With(reg).NewGauge( | |||
prometheus.GaugeOpts{ | |||
Name: "reloader_last_reload_successful", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why querying counter of all reloads does not give us that? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bwplotka - I don't understand this sorry. We maintain a counter for total reloads and failed reloads. How would I determine from those if the latest retry was a success?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just looking at the mixin for bad config alert for Prometheus it achieves it using similar https://github.com/prometheus/prometheus/blob/main/documentation/prometheus-mixin/alerts.libsonnet#L8
Still unsure how we would guard against false alarm by determining without determining if the very last attempt was a success.
For example say I have a 90% failure rate on reloads, I might alert on that, although I wouldn't care if I knew the last reload attempt actually forced the reload. How would we achieve the same guard with the timestamp?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right - we can do that with increase(total[5m]) - increase(fails[5m]) > 0
no?
But happy to stick to standard if we are familiar with it
), | ||
lastReloadSuccessTimestamp: promauto.With(reg).NewGauge( | ||
prometheus.GaugeOpts{ | ||
Name: "reloader_last_reload_success_timestamp_seconds", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This, plus counter of reloads should give us enough info, so no need for reloader_last_reload_successful
maybe? (Let's avoid too many ways of doing the same thing)
Signed-off-by: Philip Gough <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a fan of duplicated info here, but happy to stick to standard if we want.
Changes
In order to facilitate alerting on errors in the config reloader, this PR exposes four gauges which provide information on the last operation for both reloads and config applies.
These could be used similarly to
prometheus_config_last_reload_success_timestamp_seconds
andprometheus_config_last_reload_successful
exposed by Prometheus.