Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alert metric reports different results to what the user sees via API #2943

Merged
merged 2 commits into from
Jun 16, 2022

Conversation

gotjosh
Copy link
Member

@gotjosh gotjosh commented Jun 15, 2022

Fixes #1439 and #2619 (which are duplicates)

The previous metric is not technically reporting incorrect results as the alerts are still around and will be re-used if that same alert (equal fingerprint) is received before it is GCed. Therefore, I have kept the old metric under a new name alertmanager_marked_alerts and repurposed the current metric to match what the user sees in the UI.

I have tested this (and you can too) with the following:

  1. First, send a payload to the Alertmanager with different alerts to cover all the cases (make sure you edit the endsAt for AlwaysFiringExpired, AlwaysFiringExpiringSoon and others as necessary) - you want one alert to be expired on reception and one that eventually expires to see how the metrics behave.
## Request
curl -X "POST" "http://localhost:9093/api/v2/alerts" \
     -H 'Content-Type: text/plain; charset=utf-8' \
     -d $'[
  {
    "labels": {
      "alertname": "AlwaysFiringExpiringSoon",
      "cluster": "us-central1",
      "namespace": "app2"
    },
    "generatorURL": "http://url.com",
    "endsAt": "2022-06-10 12:55:40"
  },
  {
    "labels": {
      "alertname": "AlwaysFiringExpired",
      "cluster": "us-central1",
      "namespace": "app2"
    },
    "generatorURL": "http://url.com",
    "endsAt": "2022-06-10 10:40:40"
  },
  {
    "labels": {
      "alertname": "AlwaysFiringToBeSilenced",
      "cluster": "us-central1",
      "namespace": "app2"
    },
    "generatorURL": "http://url.com",
    "endsAt": "2023-06-11 10:40:40"
  },
  {
    "labels": {
      "alertname": "AlwaysFiringActive",
      "cluster": "us-central1",
      "namespace": "app2"
    },
    "generatorURL": "http://url.com",
    "endsAt": "2023-06-11 10:40:40"
  }
]'
  1. Make sure you set up silence for AlwaysFiringToBeSilenced.
  2. In the UI, you should only see 1 alert after the silence and both expired alerts have disappeared: AlwaysFiringActive.
  3. With this, the metrics are now accurate. Reporting 4 for the marker metrics but only 2 (one suppressed and 1 active for the new metric)

Marker
Screenshot 2022-06-15 at 19 54 26

New alertmanager_alerts
Screenshot 2022-06-15 at 19 54 07

@gotjosh gotjosh force-pushed the fix-alert-count-metrics branch from 977ca67 to a0d866c Compare June 15, 2022 12:57
@@ -33,8 +34,10 @@ const alertChannelLength = 200
type Alerts struct {
cancel context.CancelFunc

alerts *store.Alerts
Copy link
Member Author

@gotjosh gotjosh Jun 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved alerts because mtx is not protecting it and this was a bit confusing at first. alerts has its own mutex.

return prometheus.NewGaugeFunc(
prometheus.GaugeOpts{
Name: "alertmanager_alerts",
Help: "How many alerts by state.",
Name: "alertmanager_marked_alerts",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kept the old metric intact functionality-wise, just added more label to complete the story.

@gotjosh gotjosh force-pushed the fix-alert-count-metrics branch from c85ebe0 to 143cf12 Compare June 15, 2022 18:50
@gotjosh
Copy link
Member Author

gotjosh commented Jun 15, 2022

Needs #2944

@gotjosh gotjosh marked this pull request as ready for review June 15, 2022 18:56
func (a *Alerts) count(state types.AlertState) int {
var count int
for _, alert := range a.alerts.List() {
if alert.Resolved() {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the big change, we now skip resolved alerts.

@roidelapluie
Copy link
Member

This LGTM.

gotjosh added 2 commits June 16, 2022 08:43
Fixes prometheus#1439 and prometheus#2619.

The previous metric is not _technically_ reporting incorrect results as the alerts _are_ still around and will be re-used if that same alert (equal fingerprint) is received before it is GCed. Therefore, I have kept the old metric under a new name `alertmanager_marked_alerts` and repurpose the current metric to match what the user sees in the UI.

Signed-off-by: gotjosh <[email protected]>
Signed-off-by: gotjosh <[email protected]>
@gotjosh gotjosh force-pushed the fix-alert-count-metrics branch from 143cf12 to 90f0c5b Compare June 16, 2022 07:43
@gotjosh
Copy link
Member Author

gotjosh commented Jun 16, 2022

@roidelapluie @simonpasquier I have just rebased - this should be all good.

@roidelapluie
Copy link
Member

$ curl -s 127.0.0.1:9093/api/v2/alerts|jq '. | length'
481
$ curl -s 127.0.0.1:9093/metrics|grep 'alertmanager\(_marked\)\?_alerts'
# HELP alertmanager_alerts How many alerts by state.
# TYPE alertmanager_alerts gauge
alertmanager_alerts{state="active"} 481
alertmanager_alerts{state="suppressed"} 0
alertmanager_alerts{state="unprocessed"} 0
# HELP alertmanager_alerts_invalid_total The total number of received alerts that were invalid.
# TYPE alertmanager_alerts_invalid_total counter
alertmanager_alerts_invalid_total{version="v1"} 0
alertmanager_alerts_invalid_total{version="v2"} 0
# HELP alertmanager_alerts_received_total The total number of received alerts.
# TYPE alertmanager_alerts_received_total counter
alertmanager_alerts_received_total{status="firing",version="v1"} 0
alertmanager_alerts_received_total{status="firing",version="v2"} 5475
alertmanager_alerts_received_total{status="resolved",version="v1"} 0
alertmanager_alerts_received_total{status="resolved",version="v2"} 18
# HELP alertmanager_marked_alerts How many alerts by state are currently marked in the Alertmanager regardless of their expiry.
# TYPE alertmanager_marked_alerts gauge
alertmanager_marked_alerts{state="active"} 489
alertmanager_marked_alerts{state="suppressed"} 0
alertmanager_marked_alerts{state="unprocessed"} 0

@roidelapluie roidelapluie merged commit 805e505 into prometheus:main Jun 16, 2022
@roidelapluie
Copy link
Member

Thanks!

@roidelapluie
Copy link
Member

More results, at the 500 alerts mark

500


alertmanager_alerts{state="active"} 500
alertmanager_alerts{state="suppressed"} 0
alertmanager_alerts{state="unprocessed"} 0
alertmanager_marked_alerts{state="active"} 534
alertmanager_marked_alerts{state="suppressed"} 0
alertmanager_marked_alerts{state="unprocessed"} 0

LGTM, I will stop my test.

@gotjosh gotjosh deleted the fix-alert-count-metrics branch June 16, 2022 12:51
qinxx108 pushed a commit to qinxx108/alertmanager that referenced this pull request Dec 13, 2022
…rometheus#2943)

* Alert metric reports different results to what the user sees via API

Fixes prometheus#1439 and prometheus#2619.

The previous metric is not _technically_ reporting incorrect results as the alerts _are_ still around and will be re-used if that same alert (equal fingerprint) is received before it is GCed. Therefore, I have kept the old metric under a new name `alertmanager_marked_alerts` and repurpose the current metric to match what the user sees in the UI.

Signed-off-by: gotjosh <[email protected]>
Signed-off-by: Yijie Qin <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Metric alertmanager_alerts reports "active" alerts when there are none
2 participants