Resolved alert sent without sending corresponding active alert #878

Closed
sandersaares opened this issue Jun 21, 2017 · 19 comments

sandersaares commented Jun 21, 2017

I have alertmanager sending notifications to Slack. Today it sent a "resolved" notification to Slack without ever sending an "active" notification. This is contrary to my expectations: I expect "resolved" to be sent only if an "active" was sent first.

Here is the log:

time="2017-06-21T09:29:15Z" level=info msg="Starting alertmanager (version=0.7.1, branch=master, revision=ab4138299b94c78dc554ea96e2ab28d04b048059)" source="main.go:109"
time="2017-06-21T09:29:16Z" level=info msg="Build context (go=go1.8.3, user=root@97e9539a4c3f, date=20170609-15:31:09)" source="main.go:110"
time="2017-06-21T09:29:16Z" level=info msg="Loading configuration file" file="/etc/alertmanager/config.yml" source="main.go:234"
time="2017-06-21T09:29:16Z" level=info msg="Listening on :9093" source="main.go:308"
time="2017-06-21T09:34:37Z" level=debug msg="Received alert" alert=web_endpoint_is_down[5fe91a8][active] component=dispatcher source="dispatch.go:183"
time="2017-06-21T09:35:37Z" level=debug msg="Received alert" alert=web_endpoint_is_down[5fe91a8][resolved] component=dispatcher source="dispatch.go:183"
time="2017-06-21T09:35:37Z" level=debug msg="flushing [web_endpoint_is_down[5fe91a8][resolved]]" aggrGroup={}:{alertname="web_endpoint_is_down"} source="dispatch.go:425"
time="2017-06-21T09:36:37Z" level=debug msg="Received alert" alert=web_endpoint_is_down[5fe91a8][resolved] component=dispatcher source="dispatch.go:183"
time="2017-06-21T09:36:37Z" level=debug msg="flushing [web_endpoint_is_down[5fe91a8][resolved]]" aggrGroup={}:{alertname="web_endpoint_is_down"} source="dispatch.go:425"

Configuration:

route:
  group_by: ['alertname', 'service']
  group_wait: 1m
  group_interval: 15m
  repeat_interval: 6h
  receiver: 'axinom-slack'

receivers:
- name: 'axinom-slack'
  slack_configs:
  - api_url: "xxx"
    channel: "#service-status"
    send_resolved: true
    text: "Active alerts: {{ range .Alerts }}{{if eq .Status \"firing\" }}{{ .Labels.job }} {{ end }}{{ end }}\n{{ range .CommonAnnotations.SortedPairs }}{{ .Name }} = {{ .Value }}\n{{ end }}"

Prometheus main configuration:

global:
  scrape_interval: 1m
  scrape_timeout: 30s

scrape_configs:
  - job_name: 'xxx'
    metrics_path: /probe
    params:
      module: ['basic_http_get']
      target: ['xxx']
    static_configs:
      - targets: ['yyy:9115']

Prometheus alerts configuration:

ALERT web_endpoint_is_down
    IF probe_success < 1.0
    FOR 3m

fabxc (Contributor) commented Jun 22, 2017

What version of AM are you using?

sandersaares (Author) commented:

version=0.7.1, branch=master, revision=ab4138299b94c78dc554ea96e2ab28d04b048059

DonMartin76 commented Jul 7, 2017

We also get this, with the following alert:

ALERT HighCPUUsage
  IF (100 * (1 - avg by(instance)(irate(node_cpu{mode='idle'}[30s])))) > 90
  LABELS {
    severity = "Notice",
  }
  ANNOTATIONS {
    summary = "Instance {{ $labels.instance }} CPU usage is dangerously high",
    description = "This device's cpu usage has exceeded the threshold with a value of {{ $value }}."
  }

For this one we get "RESOLVED" notifications once in a while, even though we never got the initial firing ones.

We're also on 0.7.1, running in Kubernetes 1.5.7, Prometheus 1.7.1.

domleb commented Sep 7, 2017

We're seeing this too on v0.7.0.

agaoglu commented Oct 25, 2017

Also happens in our setup with v0.9.1

alkalinecoffee commented Nov 1, 2017

Question for the group: does this only happen with a particular alert? So far I'm only seeing it with an alert such as:

ALERT StatusCode500IncreaseProd
  IF sum(increase(http_request_duration_seconds_count{env="prod", status_code="500"}[2m])) by (instance, app, dc, env, stack) / sum(increase(http_request_duration_seconds_count{env="prod"}[2m])) by (instance, app, dc, env, stack) > 0.02
  LABELS { severity = "critical" }
  ANNOTATIONS {
    summary = "500 status codes detected",
    description = "Increase in 500s (> 2% total traffic) observed over last 2m",
    runbook = "StatusCode500Increase"
  }

Other alerts seem to be fine, so I'm wondering if there is a correlation in the actual alert here.

Running 0.9.1 in mesh mode, receiving alerts from multiple Prometheus instances.


Edit: Never mind, I'm seeing this on a few different alerts.

cepefernando commented:

I'm also having this issue. My Alertmanager config file is:

global:
  resolve_timeout: 5m

route:
  repeat_interval: 1m
  group_interval: 30s
  group_wait: 0s
  group_by: [ instance ]
  receiver: 'slack'

receivers:
- name: 'slack'
  slack_configs:
  - send_resolved: true
    username: 'Alert'
    icon_emoji: 'fire'
    channel: 'masaje-criollo'
    api_url: 'https://hooks.slack.com/services/...'

And my alert.rules:
ALERT Microservices
  IF probe_success{job="microservices"} == 0
  LABELS { severity = "critical" }
  ANNOTATIONS {
    summary = "{{ $labels.instance }} is down",
    description = "{{ $labels.instance }} of job {{ $labels.job }} is under high load."
  }

alkalinecoffee commented Nov 17, 2017

Seeing the same on the latest 0.10.0.

Can we get some input from the alertmanager folks on what we can do to help push this along? It seems that most of our alerts now are being sent as resolved without the accompanying firing alert.

log.level=debug doesn't appear to output anything helpful here.

josedonizetti (Contributor) commented:

@alkalinecoffee Do you have a snippet of the log from when this problem happened?

alkalinecoffee commented Nov 17, 2017

@josedonizetti

level=info ts=2017-11-20T21:01:11.802785124Z caller=main.go:155 msg="Starting Alertmanager" version="(version=0.11.0, branch=HEAD, revision=30dd0426c08b6479d9a26259ea5efd63bc1ee273)"
level=info ts=2017-11-20T21:01:11.802858747Z caller=main.go:156 build_context="(go=go1.9.2, user=root@3e103e3fc918, date=20171116-17:45:26)"
level=info ts=2017-11-20T21:01:11.803691405Z caller=main.go:293 msg="Loading configuration file" file=simple.yml
level=info ts=2017-11-20T21:01:11.808676357Z caller=main.go:368 msg=Listening address=:9093
level=debug ts=2017-11-20T21:01:27.807339191Z caller=dispatch.go:188 component=dispatcher msg="Received alert" alert=TestAlert[a3e06bb][active]
level=debug ts=2017-11-20T21:01:37.807322937Z caller=dispatch.go:188 component=dispatcher msg="Received alert" alert=TestAlert[a3e06bb][resolved]
level=debug ts=2017-11-20T21:01:37.807447281Z caller=dispatch.go:430 component=dispatcher aggrGroup="{}:{alertname=\"TestAlert\"}" msg=Flushing alerts=[TestAlert[a3e06bb][active]]
level=debug ts=2017-11-20T21:01:37.807689108Z caller=dispatch.go:430 component=dispatcher aggrGroup="{}:{alertname=\"TestAlert\"}" msg=Flushing alerts=[TestAlert[a3e06bb][resolved]]

I do think I found a pattern here.

I think what's happening is that the alert fires, and before the firing notification is actually sent out, the resolved message comes in very soon after (under a second, almost simultaneously), and that is what gets sent out instead. This matches what's seen in the original post here.

So for example, I have an alert for high memory usage over a 3m period, and an application that sees memory spikes for about 3 minutes at a time. My alertmanager config also has group_wait: 2m. So by the time the alert threshold (3m) plus group_wait (2m) have elapsed, the alert has already resolved, before the firing message is ever actually sent out. I hope this makes sense.

It seems to me that if the alert fires and is resolved between the 3rd and 5th minute, the firing alert is never sent out (which makes sense), but the resolved alert may be sent out (which doesn't make sense).
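
To make the timing concrete, here is a rough timeline for that scenario (using only the 3m threshold and group_wait: 2m from the example above; the exact evaluation points are approximate):

t=0m    the memory condition becomes true and Prometheus marks the alert as pending
t=3m    the FOR period elapses; Prometheus sends the alert to alertmanager as firing, and alertmanager creates the group and starts group_wait
t≈3m    the spike ends; on a later evaluation Prometheus sends the same alert as resolved
t=5m    group_wait elapses and the group is flushed, but by now only the resolved state is left, so that is the notification that goes out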

I'm able to pretty reliably replicate the issue and will work on a setup for others to easily replicate for further debugging.

sandersaares (Author) commented:

I can confirm that, as far as I remember, my experience fits the pattern described in the comment above.

nsaud01 commented Jan 11, 2018

@alkalinecoffee Any progress or results on this? I'm also getting duplicate resolved alerts with no active firing ones. I'm on Alertmanager v0.12.0. Here's my config:

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
  receiver: slack

alkalinecoffee commented:

I haven't been able to duplicate it reliably locally, and I haven't looked at it since before the holidays.

From what I can tell, the firing and resolved alerts arrive immediately one after the other, and alertmanager appears to honor only the resolved one.

We've since adjusted our alerts to use longer time thresholds, so we just don't see this issue much anymore. It'll be more prevalent if your alerts have low thresholds and fire often.

The important thing is that alerts don't appear to be outright dropped; they fire and then resolve just as quickly.

simonpasquier (Member) commented:

I can explain why this happens sometimes.

Assuming this AlertManager configuration with a single route:

[...]
route:
  group_by: [job]
  group_wait: 10s
  group_interval: 10s
  receiver: 'web.hook'
[...]

Here is the workflow:

  • Prometheus fires an alert with labels {alertname="HighCPU",instance="foo",job="node"} (alertA).
  • After group_wait, AlertManager notifies the receiver and adds an entry to the notification log (nflog). The nflog entry stores the label sets of the firing/resolved alerts as hash values; the hashes are computed at the Dedup stage (here).
  • After a while, Prometheus resolves the alert.
  • A bit later, AlertManager notifies the receiver and updates the corresponding entry in the nflog.
  • Then Prometheus fires the same alert, but for another instance: {alertname="HighCPU",instance="bar",job="node"} (alertB).
  • Before group_wait has elapsed, Prometheus resolves the alert.
  • When AlertManager checks whether a notification should be sent, it compares the current firing/resolved alerts with the hashes from the nflog entry. Since alertB hasn't been notified yet, the nflog entry still references the label set of alertA, so AlertManager notifies the receiver.

Now if the configuration is changed to include the instance label:

route:
  group_by: [job, instance]
  group_wait: 10s
  group_interval: 10s
  receiver: 'web.hook'

The problem doesn't happen anymore, since alertA and alertB no longer fall into the same aggregation group.
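
To make that comparison concrete, here is a heavily simplified, self-contained Go sketch of the decision described above. It is illustrative only, not Alertmanager's actual code, and the names (nflogEntry, hashLabels, needsNotify) are invented for the example:

// Simplified illustration of the nflog comparison described above (not real
// Alertmanager code): the notification log remembers hashes of the label sets
// that were last notified as firing/resolved, and a flush triggers a new
// notification whenever the current alerts aren't covered by those hashes.
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

type alert struct {
	labels   map[string]string
	resolved bool
}

// hashLabels computes a stable hash over a sorted label set, standing in for
// the per-alert hashes computed at the Dedup stage.
func hashLabels(labels map[string]string) uint64 {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	h := fnv.New64a()
	for _, k := range keys {
		h.Write([]byte(k + "\x00" + labels[k] + "\x00"))
	}
	return h.Sum64()
}

// nflogEntry records which label-set hashes were last notified for the group.
type nflogEntry struct {
	firing   map[uint64]bool
	resolved map[uint64]bool
}

// needsNotify returns true if any current alert's hash is missing from the
// recorded entry, i.e. the receiver hasn't been told about it yet.
func needsNotify(entry nflogEntry, current []alert) bool {
	for _, a := range current {
		h := hashLabels(a.labels)
		if a.resolved && !entry.resolved[h] {
			return true
		}
		if !a.resolved && !entry.firing[h] {
			return true
		}
	}
	return false
}

func main() {
	// alertA fired and resolved earlier, so its hash is already in the nflog entry.
	alertA := map[string]string{"alertname": "HighCPU", "instance": "foo", "job": "node"}
	entry := nflogEntry{
		firing:   map[uint64]bool{hashLabels(alertA): true},
		resolved: map[uint64]bool{hashLabels(alertA): true},
	}

	// alertB fires and resolves within group_wait, so only its resolved state
	// reaches this check. Its hash isn't recorded, so a "resolved" notification
	// goes out even though a "firing" one never did.
	alertB := alert{labels: map[string]string{"alertname": "HighCPU", "instance": "bar", "job": "node"}, resolved: true}
	fmt.Println(needsNotify(entry, []alert{alertB})) // prints: true
}

In this sketch the entry only knows about alertA's hash, so the check for alertB's resolved state returns true and a notification is sent, mirroring the spurious "resolved" message reported in this issue.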

nsaud01 commented Jan 12, 2018

@simonpasquier That's interesting. I'll change my group_by value to include instance and see if that fixes the issue. Thanks!

alkalinecoffee commented:

I understand it's a workaround, but be aware that adding instance to group_by will cause multiple notifications to be sent (one per instance) when something goes wrong.

nsaud01 commented Jan 12, 2018

Well that's no good. But it also makes sense the more I think about it. @simonpasquier Is that your experience with that workaround?

simonpasquier (Member) commented:

@nsaud01 @alkalinecoffee Yes, if you want to group CPU alerts for all instances into a single notification, then you can't use the workaround...

simonpasquier (Member) commented:

I'll try to send a PR this week.
