Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add counter to track alerts dropped outside of time_intervals #3565

Merged

Conversation

tjhop
Copy link
Contributor

@tjhop tjhop commented Oct 20, 2023

Addresses: #3512

This adds a new counter metric alertmanager_alerts_supressed_total that is incremented by len(alerts) when an alert is suppressed for being outside of a time_interval, ie inside of a mute_time_intervals or outside of an active_time_intervals.

@tjhop
Copy link
Contributor Author

tjhop commented Oct 20, 2023

@gotjosh this is an attempt to concisely shim in the new counter discussed in #3512, please lemme know your thoughts.

Of note, I wonder if the naming of this new metric (alertmanager_alerts_suppressed_total) will cause confusion with the existing alertmanager_alerts{state="suppressed"} metric that gets exposed. Reading through the source indicates the existing metric only reflects alerts suppressed through silences or inhibitions, and this new metric cares only about alerts suppressed for violating a given time interval. It would seem they're complimentary, but I think the metric names used are close enough that people may not understand the distinction

Copy link
Member

@gotjosh gotjosh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for your contribution, the logic overall looks great but please look at my comments!

notify/notify.go Outdated
@@ -251,6 +251,7 @@ type Metrics struct {
numTotalFailedNotifications *prometheus.CounterVec
numNotificationRequestsTotal *prometheus.CounterVec
numNotificationRequestsFailedTotal *prometheus.CounterVec
numAlertsSuppressedTotal prometheus.Counter
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for not catching this earlier on, but yes, what we want here is numNotificationSuppressedTotal it's a bit confusing but we have strict nomenclature at a metric level and notification is not the same thing as an alert.

Alerts are what the Alertmanager receives and as you correctly figured out we already reflect those with the suppressed label on alertmanager_alerts.

What we want here is alertmanager_notifications_supressed_total

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's also an argument on wether we should use a label to specify if this was suppressed by a an active_interval or a muted_interval but let's leave this out for now and see wether this is useful on its own as is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I like that distinction much better, I'll move towards this phrasing.

There's also an argument on wether we should use a label to specify if this was suppressed by a an active_interval or a muted_interval but let's leave this out for now and see wether this is useful on its own as is.

Agreed, and sounds good to me 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed in e15292f

notify/notify.go Outdated Show resolved Hide resolved
notify/notify.go Outdated
@@ -902,6 +909,7 @@ func (tms TimeMuteStage) Exec(ctx context.Context, l log.Logger, alerts ...*type

// If the current time is inside a mute time, all alerts are removed from the pipeline.
if muted {
tms.metrics.numAlertsSuppressedTotal.Add(float64(len(alerts)))
level.Debug(l).Log("msg", "Notifications not sent, route is within mute time")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
level.Debug(l).Log("msg", "Notifications not sent, route is within mute time")
level.Debug(l).Log("msg", "Notifications not sent, route is within mute time", "alerts", len(alerts))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we're here let's try and log the number of alerts we suppressed for traceability.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added in e15292f

notify/notify.go Outdated
@@ -939,6 +947,7 @@ func (tas TimeActiveStage) Exec(ctx context.Context, l log.Logger, alerts ...*ty

// If the current time is not inside an active time, all alerts are removed from the pipeline
if !muted {
tas.metrics.numAlertsSuppressedTotal.Add(float64(len(alerts)))
level.Debug(l).Log("msg", "Notifications not sent, route is not within active time")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
level.Debug(l).Log("msg", "Notifications not sent, route is not within active time")
level.Debug(l).Log("msg", "Notifications not sent, route is not within active time", "alerts", len(alerts))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added in e15292f

tjhop added a commit to tjhop/alertmanager that referenced this pull request Oct 20, 2023
@tjhop
Copy link
Contributor Author

tjhop commented Oct 20, 2023

@gotjosh ready for round 2 when time permits

tjhop added a commit to tjhop/alertmanager that referenced this pull request Oct 26, 2023
@tjhop tjhop force-pushed the feat/metric-for-time-interval-suppressed-alerts branch from e15292f to 2562373 Compare October 26, 2023 04:32
@tjhop
Copy link
Contributor Author

tjhop commented Oct 26, 2023

Accidentally forgot to sign the last commit. Amended and force pushed to address

notify/notify.go Outdated
@@ -284,6 +285,11 @@ func NewMetrics(r prometheus.Registerer, ff featurecontrol.Flagger) *Metrics {
Name: "notification_requests_failed_total",
Help: "The total number of failed notification requests.",
}, labels),
numNotificationSuppressedTotal: prometheus.NewCounter(prometheus.CounterOpts{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we decide to follow the other comment, we could add a reason label here. Then, users can split up the individual rates based on what did the suppression (was it silenced, or part of a mute time? etc).

It may be worth adding this even if we don't follow my other comment - one reason for active times, and another for mute times, in case users are using both simultaneously.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like I said in response to the other comment, I can add the metrics to the other stages as well. And agreed, I think a reason label is much more obviously valuable now

notify/notify.go Outdated
tas := NewTimeActiveStage(intervener)
tms := NewTimeMuteStage(intervener)
tas := NewTimeActiveStage(intervener, pb.metrics)
tms := NewTimeMuteStage(intervener, pb.metrics)
ss := NewMuteStage(silencer)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have we considered adding silenced and inhibited notifications as part of this metric? I.e. we could also pass the metrics into NewMuteStage here for silences and on line 390 for inhibitions, and increment based on muted in that stage as well.

I ask because we didn't specifically tie the metric to mutes (it's not called num_notifications_muted_total) and then users could track suppressions across all different methods.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not opposed to that 👍

Addresses: prometheus#3512

This adds a new counter metric `alertmanager_alerts_supressed_total`
that is incremented by `len(alerts)` when an alert is suppressed for
being outside of a time_interval, ie inside of a mute_time_intervals or
outside of an active_time_intervals.

Signed-off-by: TJ Hoplock <[email protected]>
@tjhop tjhop force-pushed the feat/metric-for-time-interval-suppressed-alerts branch from 3169b8d to 143c833 Compare November 28, 2023 06:12
@tjhop
Copy link
Contributor Author

tjhop commented Nov 28, 2023

Rebased to handle merge conflict 👍

- fixed metric count check to properly check the diff between
  input/output notifications from the suppression to compare to suppression
metric, was previously inverted to compare to how many notifications it
suppressed.
- stopped using `Reset()` to compare collection counts between the
  multiple stages that are executed in `TestMuteStageWithSilences()`.
the intent was to compare a clean metric collection after each stage
execution, but the final stage where all silences are lifted results in
no metric being created in the test, causing `prom_testutil.ToFloat64()`
to panic. changed to separate vars to check counts between each stage,
with care to consider prior counts.

Signed-off-by: TJ Hoplock <[email protected]>
@tjhop
Copy link
Contributor Author

tjhop commented Dec 5, 2023

@alexweav does the current change set more resemble what you had in mind? Lemme know your thoughts

Copy link
Member

@gotjosh gotjosh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thank you very much for your contribution!

@gotjosh gotjosh merged commit f00025d into prometheus:main Feb 13, 2024
11 checks passed
gotjosh pushed a commit to grafana/mimir that referenced this pull request Feb 15, 2024
th0th pushed a commit to th0th/alertmanager that referenced this pull request Mar 23, 2024
…rometheus#3565)

* feat: add counter to track alerts dropped outside of time_intervals

Addresses: prometheus#3512

This adds a new counter metric `alertmanager_alerts_supressed_total`
that is incremented by `len(alerts)` when an alert is suppressed for
being outside of a time_interval, ie inside of a mute_time_intervals or
outside of an active_time_intervals.

Signed-off-by: TJ Hoplock <[email protected]>

* test: add time interval suppression metric checks for notify

Signed-off-by: TJ Hoplock <[email protected]>

* test: fix failure message log values in notifier

Signed-off-by: TJ Hoplock <[email protected]>

* ref: address PR feedback for prometheus#3565

Signed-off-by: TJ Hoplock <[email protected]>

* fix: track suppressed notifications metric for inhibit/silence

Based on PR feedback:

https://github.com/prometheus/alertmanager/pull/3565/files#r1393068026

Signed-off-by: TJ Hoplock <[email protected]>

* fix: broken notifier tests

- fixed metric count check to properly check the diff between
  input/output notifications from the suppression to compare to suppression
metric, was previously inverted to compare to how many notifications it
suppressed.
- stopped using `Reset()` to compare collection counts between the
  multiple stages that are executed in `TestMuteStageWithSilences()`.
the intent was to compare a clean metric collection after each stage
execution, but the final stage where all silences are lifted results in
no metric being created in the test, causing `prom_testutil.ToFloat64()`
to panic. changed to separate vars to check counts between each stage,
with care to consider prior counts.

Signed-off-by: TJ Hoplock <[email protected]>

* rename metric and add constants

Signed-off-by: gotjosh <[email protected]>

---------

Signed-off-by: TJ Hoplock <[email protected]>
Signed-off-by: gotjosh <[email protected]>
Co-authored-by: gotjosh <[email protected]>
Signed-off-by: Gokhan Sari <[email protected]>
SuperQ added a commit that referenced this pull request Oct 16, 2024
* [CHANGE] Deprecate and remove api/v1/ #2970
* [CHANGE] Remove unused feature flags #3676
* [CHANGE] Newlines in smtp password file are now ignored #3681
* [CHANGE] Change compat metrics to counters #3686
* [CHANGE] Do not register compat metrics in amtool #3713
* [CHANGE] Remove metrics from compat package #3714
* [CHANGE] Mark muted alerts #3793
* [FEATURE] Add metric for inhibit rules #3681
* [FEATURE] Support UTF-8 label matchers #3453, #3507, #3523, #3483, #3567, #3568, #3569, #3571, #3595, #3604, #3619, #3658, #3659, #3662, #3668, 3572
* [FEATURE] Add counter to track alerts dropped outside of time_intervals #3565
* [FEATURE] Add date and tz functions to templates #3812
* [FEATURE] Add limits for silences #3852
* [FEATURE] Add time helpers for templates #3863
* [FEATURE] Add auto GOMAXPROCS #3837
* [FEATURE] Add auto GOMEMLIMIT #3895
* [FEATURE] Add Jira receiver integration #3590
* [ENHANCEMENT] Add the receiver name to notification metrics #3045
* [ENHANCEMENT] Add the route ID to uuid #3372
* [ENHANCEMENT] Add duration to the notify success message #3559
* [ENHANCEMENT] Implement webhook_url_file for discord and msteams #3555
* [ENHANCEMENT] Add debug logs for muted alerts #3558
* [ENHANCEMENT] API: Allow the Silences API to use their own 400 response #3610
* [ENHANCEMENT] Add summary to msteams notification #3616
* [ENHANCEMENT] Add context reasons to notifications failed counter #3631
* [ENHANCEMENT] Add optional native histogram support to latency metrics #3737
* [ENHANCEMENT] Enable setting ThreadId for Telegram notifications #3638
* [ENHANCEMENT] Allow webex roomID from template #3801
* [BUGFIX] Add missing integrations to notify metrics #3480
* [BUGFIX] Add missing ttl in pushhover #3474
* [BUGFIX] Fix scheme required for webhook url in amtool #3409
* [BUGFIX] Remove duplicate integration from metrics #3516
* [BUGFIX] Reflect Discord's max length message limits #3597
* [BUGFIX] Fix nil error in warn logs about incompatible matchers #3683
* [BUGFIX] Fix a small number of inconsistencies in compat package logging #3718
* [BUGFIX] Fix log line in featurecontrol #3719
* [BUGFIX] Fix panic in acceptance tests #3592
* [BUGFIX] Fix flaky test TestClusterJoinAndReconnect/TestTLSConnection #3722
* [BUGFIX] Fix crash on errors when url_file is used #3800
* [BUGFIX] Fix race condition in dispatch.go #3826
* [BUGFIX] Fix race conditions in the memory alerts store #3648
* [BUGFIX] Hide config.SecretURL when the URL is incorrect. #3887
* [BUGFIX] Fix invalid silence causes incomplete updates #3898
* [BUGFIX] Fix leaking of Silences matcherCache entries #3930
* [BUGFIX] Close SMTP submission correctly to handle errors #4006

Signed-off-by: SuperQ <[email protected]>
@SuperQ SuperQ mentioned this pull request Oct 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants