Resolved alert sent without sending corresponding active alert #878

Closed
sandersaares opened this issue Jun 21, 2017 · 19 comments

sandersaares commented Jun 21, 2017

I have alertmanager sending notifications to Slack. Today it sent a "resolved" notification to Slack without ever sending an "active" notification. This is contrary to my expectations: I expect "resolved" to be sent only if an "active" was sent first.

Here is the log:

time="2017-06-21T09:29:15Z" level=info msg="Starting alertmanager (version=0.7.1, branch=master, revision=ab4138299b94c78dc554ea96e2ab28d04b048059)" source="main.go:109"
time="2017-06-21T09:29:16Z" level=info msg="Build context (go=go1.8.3, user=root@97e9539a4c3f, date=20170609-15:31:09)" source="main.go:110"
time="2017-06-21T09:29:16Z" level=info msg="Loading configuration file" file="/etc/alertmanager/config.yml" source="main.go:234"
time="2017-06-21T09:29:16Z" level=info msg="Listening on :9093" source="main.go:308"
time="2017-06-21T09:34:37Z" level=debug msg="Received alert" alert=web_endpoint_is_down[5fe91a8][active] component=dispatcher source="dispatch.go:183"
time="2017-06-21T09:35:37Z" level=debug msg="Received alert" alert=web_endpoint_is_down[5fe91a8][resolved] component=dispatcher source="dispatch.go:183"
time="2017-06-21T09:35:37Z" level=debug msg="flushing [web_endpoint_is_down[5fe91a8][resolved]]" aggrGroup={}:{alertname="web_endpoint_is_down"} source="dispatch.go:425"
time="2017-06-21T09:36:37Z" level=debug msg="Received alert" alert=web_endpoint_is_down[5fe91a8][resolved] component=dispatcher source="dispatch.go:183"
time="2017-06-21T09:36:37Z" level=debug msg="flushing [web_endpoint_is_down[5fe91a8][resolved]]" aggrGroup={}:{alertname="web_endpoint_is_down"} source="dispatch.go:425"

Configuration:

route:
  group_by: ['alertname', 'service']
  group_wait: 1m
  group_interval: 15m
  repeat_interval: 6h
  receiver: 'axinom-slack'

receivers:
- name: 'axinom-slack'
  slack_configs:
  - api_url: "xxx"
    channel: "#service-status"
    send_resolved: true
    text: "Active alerts: {{ range .Alerts }}{{if eq .Status \"firing\" }}{{ .Labels.job }} {{ end }}{{ end }}\n{{ range .CommonAnnotations.SortedPairs }}{{ .Name }} = {{ .Value }}\n{{ end }}"

Prometheus main configuration:

global:
  scrape_interval: 1m
  scrape_timeout: 30s

scrape_configs:
  - job_name: 'xxx'
    metrics_path: /probe
    params:
      module: ['basic_http_get']
      target: ['xxx']
    static_configs:
      - targets: ['yyy:9115']

Prometheus alerts configuration:

ALERT web_endpoint_is_down
    IF probe_success < 1.0
    FOR 3m

fabxc (Contributor) commented Jun 22, 2017

What version of AM are you using?

sandersaares (Author) commented:

version=0.7.1, branch=master, revision=ab4138299b94c78dc554ea96e2ab28d04b048059

DonMartin76 commented Jul 7, 2017

We also get this, with the following alert:

ALERT HighCPUUsage
  IF (100 * (1 - avg by(instance)(irate(node_cpu{mode='idle'}[30s])))) > 90
  LABELS {
    severity = "Notice",
  }
  ANNOTATIONS {
    summary = "Instance {{ $labels.instance }} CPU usage is dangerously high",
    description = "This device's cpu usage has exceeded the threshold with a value of {{ $value }}."
  }

For this one we get "RESOLVED" notifications once in a while, even though we never got the initial firing ones.

We're also on 0.7.1, running in Kubernetes 1.5.7, Prometheus 1.7.1.

domleb commented Sep 7, 2017

We're seeing this too on v0.7.0.

agaoglu commented Oct 25, 2017

Also happens in our setup with v0.9.1

alkalinecoffee commented Nov 1, 2017

Question for the group: does this only happen with a particular alert? So far I'm only seeing it with an alert such as:

ALERT StatusCode500IncreaseProd
  IF sum(increase(http_request_duration_seconds_count{env="prod", status_code="500"}[2m])) by (instance, app, dc, env, stack) / sum(increase(http_request_duration_seconds_count{env="prod"}[2m])) by (instance, app, dc, env, stack) > 0.02
  LABELS { severity = "critical" }
  ANNOTATIONS {
    summary = "500 status codes detected",
    description = "Increase in 500s (> 2% total traffic) observed over last 2m",
    runbook = "StatusCode500Increase"
  }

Other alerts seem to be fine, so I'm wondering if there is a correlation in the actual alert here.

Running 0.9.1 in mesh mode, receiving alerts from multiple Prometheus instances.


Edit: Never mind, I'm seeing this on a few different alerts.

cepefernando commented:

I'm also having this issue. My Alertmanager config file is:

global:
  resolve_timeout: 5m

route:
  repeat_interval: 1m
  group_interval: 30s
  group_wait: 0s
  group_by: [ instance ]
  receiver: 'slack'

receivers:
- name: 'slack'
  slack_configs:
  - send_resolved: true
    username: 'Alert'
    icon_emoji: 'fire'
    channel: 'masaje-criollo'
    api_url: 'https://hooks.slack.com/services/...'

And my alert.rules:
ALERT Microservices
  IF probe_success{job="microservices"} == 0
  LABELS { severity = "critical" }
  ANNOTATIONS {
    summary = "{{ $labels.instance }} is down",
    description = "{{ $labels.instance }} of job {{ $labels.job }} is under high load."
  }

alkalinecoffee commented Nov 17, 2017

Seeing the same on the latest 0.10.0.

Can we get some input from the alertmanager folks on what we can do to help push this along? It seems that most of our alerts now are being sent as resolved without the accompanying firing alert.

log.level=debug doesn't appear to output anything helpful here.

josedonizetti (Contributor) commented:

@alkalinecoffee Do you have a snippet of the log from when this problem happened?

alkalinecoffee commented Nov 17, 2017

@josedonizetti

level=info ts=2017-11-20T21:01:11.802785124Z caller=main.go:155 msg="Starting Alertmanager" version="(version=0.11.0, branch=HEAD, revision=30dd0426c08b6479d9a26259ea5efd63bc1ee273)"
level=info ts=2017-11-20T21:01:11.802858747Z caller=main.go:156 build_context="(go=go1.9.2, user=root@3e103e3fc918, date=20171116-17:45:26)"
level=info ts=2017-11-20T21:01:11.803691405Z caller=main.go:293 msg="Loading configuration file" file=simple.yml
level=info ts=2017-11-20T21:01:11.808676357Z caller=main.go:368 msg=Listening address=:9093
level=debug ts=2017-11-20T21:01:27.807339191Z caller=dispatch.go:188 component=dispatcher msg="Received alert" alert=TestAlert[a3e06bb][active]
level=debug ts=2017-11-20T21:01:37.807322937Z caller=dispatch.go:188 component=dispatcher msg="Received alert" alert=TestAlert[a3e06bb][resolved]
level=debug ts=2017-11-20T21:01:37.807447281Z caller=dispatch.go:430 component=dispatcher aggrGroup="{}:{alertname=\"TestAlert\"}" msg=Flushing alerts=[TestAlert[a3e06bb][active]]
level=debug ts=2017-11-20T21:01:37.807689108Z caller=dispatch.go:430 component=dispatcher aggrGroup="{}:{alertname=\"TestAlert\"}" msg=Flushing alerts=[TestAlert[a3e06bb][resolved]]

I do think I found a pattern here.

I think what's happening is that the alert fires, and before the firing notification is actually sent out, the resolved message comes in very soon after (under a second, almost simultaneously), and that is what gets sent out instead. This matches what's seen in the original post here.

So for example, I have an alert for high memory usage over a 3m period, and an application that sees memory spikes for about 3 minutes at a time. My alertmanager config also has group_wait: 2m. So by the time the alert threshold (3m) plus group_wait (2m) have elapsed, the alert has already resolved, before the firing message is ever actually sent out. I hope this makes sense.

It seems to me that if the alert fires and is resolved between the 3rd and 5th minute, the firing alert is never sent out (which makes sense), but the resolved alert may be sent out (which doesn't make sense).
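
To make the timing concrete, here is a rough timeline for that scenario (using only the 3m threshold and group_wait: 2m from the example above; the exact evaluation points are approximate):

t=0m    the memory condition becomes true and Prometheus marks the alert as pending
t=3m    the FOR period elapses; Prometheus sends the alert to alertmanager as firing, and alertmanager creates the group and starts group_wait
t≈3m    the spike ends; on a later evaluation Prometheus sends the same alert as resolved
t=5m    group_wait elapses and the group is flushed, but by now only the resolved state is left, so that is the notification that goes out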

I'm able to pretty reliably replicate the issue and will work on a setup for others to easily replicate for further debugging.

sandersaares (Author) commented:

I can confirm that, as far as I remember, my experience fits the pattern described in the comment above.

nsaud01 commented Jan 11, 2018

@alkalinecoffee Any progress or results on this? I'm also getting duplicate resolved alerts with no active firing ones. I'm on Alertmanager v0.12.0. Here's my config:

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
  receiver: slack

alkalinecoffee commented:

I haven't been able to duplicate it reliably locally, and I haven't looked at it since before the holidays.

From what I can tell, the firing and resolved alerts arrive immediately one after the other, and alertmanager appears to honor only the resolved one.

We've since adjusted our alerts to use longer time thresholds, so we just don't see this issue much anymore. It'll be more prevalent if your alerts have low thresholds and fire often.

The important thing is that alerts don't appear to be outright dropped; they fire and then resolve just as quickly.

simonpasquier (Member) commented:

I can explain why this happens sometimes.

Assuming this AlertManager configuration with a single route:

[...]
route:
  group_by: [job]
  group_wait: 10s
  group_interval: 10s
  receiver: 'web.hook'
[...]

Here is the workflow:

  • Prometheus fires an alert with labels {alertname="HighCPU",instance="foo",job="node"} (alertA).
  • After group_wait, AlertManager notifies the receiver and adds an entry to the notification log (nflog). The nflog entry stores the label sets of the firing/resolved alerts as hash values; the hashes are computed at the Dedup stage (here).
  • After a while, Prometheus resolves the alert.
  • A bit later, AlertManager notifies the receiver and updates the corresponding entry in the nflog.
  • Then Prometheus fires the same alert, but for another instance: {alertname="HighCPU",instance="bar",job="node"} (alertB).
  • Before group_wait has elapsed, Prometheus resolves the alert.
  • When AlertManager checks whether a notification should be sent, it compares the current firing/resolved alerts with the hashes from the nflog entry. Since alertB hasn't been notified yet, the nflog entry still references the label set of alertA, so AlertManager notifies the receiver.

Now if the configuration is changed to include the instance label:

route:
  group_by: [job, instance]
  group_wait: 10s
  group_interval: 10s
  receiver: 'web.hook'

The problem doesn't happen anymore, since alertA and alertB no longer fall into the same aggregation group.
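
To make that comparison concrete, here is a heavily simplified, self-contained Go sketch of the decision described above. It is illustrative only, not Alertmanager's actual code, and the names (nflogEntry, hashLabels, needsNotify) are invented for the example:

// Simplified illustration of the nflog comparison described above (not real
// Alertmanager code): the notification log remembers hashes of the label sets
// that were last notified as firing/resolved, and a flush triggers a new
// notification whenever the current alerts aren't covered by those hashes.
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

type alert struct {
	labels   map[string]string
	resolved bool
}

// hashLabels computes a stable hash over a sorted label set, standing in for
// the per-alert hashes computed at the Dedup stage.
func hashLabels(labels map[string]string) uint64 {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	h := fnv.New64a()
	for _, k := range keys {
		h.Write([]byte(k + "\x00" + labels[k] + "\x00"))
	}
	return h.Sum64()
}

// nflogEntry records which label-set hashes were last notified for the group.
type nflogEntry struct {
	firing   map[uint64]bool
	resolved map[uint64]bool
}

// needsNotify returns true if any current alert's hash is missing from the
// recorded entry, i.e. the receiver hasn't been told about it yet.
func needsNotify(entry nflogEntry, current []alert) bool {
	for _, a := range current {
		h := hashLabels(a.labels)
		if a.resolved && !entry.resolved[h] {
			return true
		}
		if !a.resolved && !entry.firing[h] {
			return true
		}
	}
	return false
}

func main() {
	// alertA fired and resolved earlier, so its hash is already in the nflog entry.
	alertA := map[string]string{"alertname": "HighCPU", "instance": "foo", "job": "node"}
	entry := nflogEntry{
		firing:   map[uint64]bool{hashLabels(alertA): true},
		resolved: map[uint64]bool{hashLabels(alertA): true},
	}

	// alertB fires and resolves within group_wait, so only its resolved state
	// reaches this check. Its hash isn't recorded, so a "resolved" notification
	// goes out even though a "firing" one never did.
	alertB := alert{labels: map[string]string{"alertname": "HighCPU", "instance": "bar", "job": "node"}, resolved: true}
	fmt.Println(needsNotify(entry, []alert{alertB})) // prints: true
}

In this sketch the entry only knows about alertA's hash, so the check for alertB's resolved state returns true and a notification is sent, mirroring the spurious "resolved" message reported in this issue.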

nsaud01 commented Jan 12, 2018

@simonpasquier That's interesting. I'll change my group_by value to include instance and see if that fixes the issue. Thanks!

alkalinecoffee commented:

I understand it's a workaround, but be aware that adding instance to group_by will cause multiple notifications to be sent (one per instance) when something goes wrong.

nsaud01 commented Jan 12, 2018

Well that's no good. But it also makes sense the more I think about it. @simonpasquier Is that your experience with that workaround?

simonpasquier (Member) commented:

@nsaud01 @alkalinecoffee Yes, if you want to group CPU alerts for all instances into a single notification, then you can't use the workaround...

simonpasquier (Member) commented:

I'll try to send a PR this week.
