Duplicate notifications after upgrading to Alertmanager 0.15 #1550

Open
tanji opened this issue Sep 14, 2018 · 62 comments
@tanji

tanji commented Sep 14, 2018

After upgrading from Alertmanager 0.13 to 0.15.2 in a cluster of two members, we've started receiving duplicate notifications in Slack. It worked flawlessly with 0.13. Oddly, the two notifications arrive at practically the same time; they don't seem to be more than a couple of seconds apart.

  • System information:

Linux pmm-server 3.16.0-4-amd64 #1 SMP Debian 3.16.39-1 (2016-12-30) x86_64 GNU/Linux

Both instances using ntp.

  • Alertmanager version:

alertmanager, version 0.15.2 (branch: HEAD, revision: d19fae3)
build user: root@3101e5b68a55
build date: 20180814-10:53:39
go version: go1.10.3

Cluster status reports up:

Status
Uptime:
2018-09-09T19:03:01.726517546Z
Cluster Status
Name:
01CPZVEFADF9GE2G9F2CTZZZQ6
Status:
ready
Peers:
Name: 01CPZV0HDRQY5M5TW6FDS31MKS
Address: :9094
Name: 01CPZVEFADF9GE2G9F2CTZZZQ6
Address: :9094

  • Prometheus version:

Irrelevant

  • Alertmanager configuration file:
global:
  resolve_timeout: 5m
  http_config: {}
  smtp_hello: localhost
  smtp_require_tls: true
  slack_api_url: <secret>
  pagerduty_url: https://events.pagerduty.com/v2/enqueue
  hipchat_api_url: https://api.hipchat.com/
  opsgenie_api_url: https://api.opsgenie.com/
  wechat_api_url: https://qyapi.weixin.qq.com/cgi-bin/
  victorops_api_url: https://alert.victorops.com/integrations/generic/20131114/alert/
route:
  receiver: ops-slack
  group_by:
  - alertname
  - group
  routes:
  - receiver: ops-pager
    match:
      alertname: MySQLDown
  - receiver: ops-pager
    match:
      offline: critical
    continue: true
    routes:
    - receiver: ops-slack
      match:
        offline: critical
  - receiver: ops-pager
    match:
      pager: "yes"
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 30m
receivers:
- name: ops-slack
  slack_configs:
  - send_resolved: true
    http_config: {}
    api_url: <secret>
    channel: alerts
    username: prometheus
    color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
    title: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing
      | len }}{{ end }}] {{ .CommonAnnotations.summary }}'
    title_link: '{{ template "slack.default.titlelink" . }}'
    pretext: '{{ template "slack.default.pretext" . }}'
    text: |-
      {{ range .Alerts }}
        *Alert:* {{ .Annotations.summary }} - *{{ .Labels.severity | toUpper }}* on {{ .Labels.instance }}
        *Description:* {{ .Annotations.description }}
        *Details:*
        {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
        {{ end }}
      {{ end }}
    footer: '{{ template "slack.default.footer" . }}'
    fallback: '{{ template "slack.default.fallback" . }}'
    icon_emoji: '{{ template "slack.default.iconemoji" . }}'
    icon_url: http://cdn.rancher.com/wp-content/uploads/2015/05/27094511/prometheus-logo-square.png
  • Logs:
    no errors to speak of
level=info ts=2018-09-06T11:10:16.242620478Z caller=main.go:174 msg="Starting Alertmanager" version="(version=0.15.2, branch=HEAD, revision=d19fae3bae451940b8470abb680cfdd59bfa7cfa)"
level=info ts=2018-09-06T11:10:16.242654842Z caller=main.go:175 build_context="(go=go1.10.3, user=root@3101e5b68a55, date=20180814-10:53:39)"
level=info ts=2018-09-06T11:10:16.313588161Z caller=main.go:322 msg="Loading configuration file" file=/etc/alertmanager/config.yml
level=info ts=2018-09-06T11:10:16.313610447Z caller=cluster.go:570 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2018-09-06T11:10:16.315607053Z caller=main.go:398 msg=Listening address=:9093
level=info ts=2018-09-06T11:10:18.313944578Z caller=cluster.go:595 component=cluster msg="gossip not settled" polls=0 before=0 now=2 elapsed=2.000297466s
level=info ts=2018-09-06T11:10:22.314297199Z caller=cluster.go:595 component=cluster msg="gossip not settled" polls=2 before=2 now=1 elapsed=6.000647448s
level=info ts=2018-09-06T11:10:30.315059802Z caller=cluster.go:587 component=cluster msg="gossip settled; proceeding" elapsed=14.001414171s
level=info ts=2018-09-09T18:55:23.930653016Z caller=main.go:426 msg="Received SIGTERM, exiting gracefully..."
level=info ts=2018-09-09T18:55:25.067197275Z caller=main.go:174 msg="Starting Alertmanager" version="(version=0.15.2, branch=HEAD, revision=d19fae3bae451940b8470abb680cfdd59bfa7cfa)"
level=info ts=2018-09-09T18:55:25.067233709Z caller=main.go:175 build_context="(go=go1.10.3, user=root@3101e5b68a55, date=20180814-10:53:39)"
level=info ts=2018-09-09T18:55:25.128486689Z caller=main.go:322 msg="Loading configuration file" file=/etc/alertmanager/config.yml
level=info ts=2018-09-09T18:55:25.128488742Z caller=cluster.go:570 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2018-09-09T18:55:25.131985874Z caller=main.go:398 msg=Listening address=:9093
level=info ts=2018-09-09T18:55:27.128662897Z caller=cluster.go:595 component=cluster msg="gossip not settled" polls=0 before=0 now=3 elapsed=2.000096829s
level=info ts=2018-09-09T18:55:31.128969079Z caller=cluster.go:595 component=cluster msg="gossip not settled" polls=2 before=3 now=2 elapsed=6.000402722s
level=info ts=2018-09-09T18:55:33.129130021Z caller=cluster.go:595 component=cluster msg="gossip not settled" polls=3 before=2 now=1 elapsed=8.000564176s
level=info ts=2018-09-09T18:55:37.129427658Z caller=cluster.go:595 component=cluster msg="gossip not settled" polls=5 before=1 now=2 elapsed=12.000855483s
level=info ts=2018-09-09T18:55:45.130073007Z caller=cluster.go:587 component=cluster msg="gossip settled; proceeding" elapsed=20.001506309s
@stuartnelson3
Contributor

Are these double-notifications happening consistently?

There's no consensus between alertmanagers -- if they receive the initial alert from a Prometheus server at different times, the alert groups created in the different alertmanagers might be out of sync by, e.g., a single evaluation interval. If your evaluation interval is 15s and --cluster.peer-timeout=15s (the default), they could end up sending their notifications at exactly the same time.
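
For illustration, a minimal sketch of the two settings in play (the values below are assumptions for this example, not recommendations):

# prometheus.yml -- assuming the default rule evaluation interval
global:
  evaluation_interval: 15s

# alertmanager startup flag, e.g. in a docker-compose `command:` list;
# keeping the peer timeout above the skew between the peers' alert groups
# gives the second peer time to see that the first one already notified
command:
  - --cluster.peer-timeout=30s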

@tanji
Author

tanji commented Sep 18, 2018

Yes, they're quite consistent. What do you mean by evaluation interval? Is it tunable?
Do you recommend increasing the peer timeout?

@stuartnelson3
Contributor

Your logs are indicating some weird behavior:

level=info ts=2018-09-09T18:55:27.128662897Z caller=cluster.go:595 component=cluster msg="gossip not settled" polls=0 before=0 now=3 elapsed=2.000096829s
level=info ts=2018-09-09T18:55:31.128969079Z caller=cluster.go:595 component=cluster msg="gossip not settled" polls=2 before=3 now=2 elapsed=6.000402722s
level=info ts=2018-09-09T18:55:33.129130021Z caller=cluster.go:595 component=cluster msg="gossip not settled" polls=3 before=2 now=1 elapsed=8.000564176s
level=info ts=2018-09-09T18:55:37.129427658Z caller=cluster.go:595 component=cluster msg="gossip not settled" polls=5 before=1 now=2 elapsed=12.000855483s

now indicates the number of peers in the cluster. In the first two seconds your instance connects to two other instances (now=3, three members in total); at 6 seconds there are only two instances in the cluster; at 8 seconds there is just the single node; and then it returns to 2 nodes. Something appears to be off about your setup -- you state there are only 2 nodes, but the logs show 3 at one point, and the connection between them appears to be a bit unstable.

How/where are these alertmanagers deployed?

@tanji
Author

tanji commented Sep 18, 2018

That only happens when starting Alertmanager; no messages appear after this. The Alertmanagers are deployed in the cloud and are fairly close to each other. Of note, we deploy with a Docker image. Here are the configs and startup parameters:

  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - 9093:9093
      - 9094:9094
    volumes:
      - alertmanager_data:/alertmanager
      - /etc/alertmanager:/etc/alertmanager
    restart: always
    command:
      - --config.file=/etc/alertmanager/config.yml
      - --storage.path=/alertmanager
      - --web.external-url=http://pmm-server-ec2:9093
      - --cluster.peer=pmm-server:9094
      - --cluster.advertise-address=(external IP of the EC2 VM)

Second AM

    image: prom/alertmanager:latest
    ports:
      - 127.0.0.1:9093:9093
      - 9094:9094
    volumes:
      - alertmanager_data:/alertmanager
      - /etc/alertmanager:/etc/alertmanager
    restart: always
    command:
      - --config.file=/etc/alertmanager/config.yml
      - --storage.path=/alertmanager
      - --web.external-url=https://alertmanager
      - --cluster.peer=pmm-server-ec2:9094
      - --cluster.advertise-address=(external IP of the server):9094

Nothing rocket science; this setup (with the old mesh protocol) worked without duplicates until I upgraded.

Re. the 3 nodes: could it be that it's also trying to connect to itself?

@stuartnelson3
Contributor

The connection logs are only written during start-up; they aren't logged if the connection flaps.
In the initial cluster connection code, a resolved IP address that equals the instance's own IP address is removed from the initial list of instances to connect to (so I'm still curious about this 3 thing).

Can you check the following alertmanager metrics that your prometheus should be scraping?

alertmanager_peer_position - each node should have a single, stable value
alertmanager_cluster_members - this shouldn't be flapping between different values
alertmanager_cluster_failed_peers - ideally this should be zero, or VERY briefly a non-zero number
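
If these aren't scraped yet, a minimal sketch of a Prometheus scrape job for them (the target names am1/am2 are placeholders for your two Alertmanagers, which expose metrics on their web port):

scrape_configs:
  - job_name: alertmanager
    static_configs:
      - targets: ['am1:9093', 'am2:9093']  # placeholder hostnames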

@tanji
Author

tanji commented Sep 18, 2018

We don't scrape those; I'll fix that and look at the metrics.

@tanji
Author

tanji commented Sep 18, 2018

There's something wrong indeed. One node is OK and always sees the other one;
the other AM sees its peer flapping all the time and the cluster size going from 2 to 1. Is it possible to print more debug output?

@stuartnelson3
Contributor

--log.level=debug will output more logs
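
For example, added to the docker-compose command list already shown above:

command:
  - --log.level=debug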

@tanji
Author

tanji commented Sep 18, 2018

OK, I found it: the 2nd node wasn't announcing itself on the correct address; it used the Amazon internal IP instead of the external one :( It should work better now. Of note, I'm getting these errors:

alertmanager_1  | level=debug ts=2018-09-18T12:01:34.998491475Z caller=cluster.go:287 component=cluster memberlist="2018/09/18 12:01:34 [WARN] memberlist: Was able to connect to 01CQP8SVW787P0JEVVFEEM33SG but other probes failed, network may be misconfigured\n"

Is ICMP necessary as part of the protocol? I can enable it on AWS; it's disabled by default.

@tanji
Author

tanji commented Sep 18, 2018

The ping thingy doesn't seem to play well with Docker networks:

alertmanager_1  | level=debug ts=2018-09-18T12:52:01.276047845Z caller=cluster.go:287 component=cluster memberlist="2018/09/18 12:52:01 [WARN] memberlist: Got ping for unexpected node 01CQP8VVB33XSGRCWM3S7EJGN7 from=172.18.0.1:48699\n"

That node advertises itself on the external IP, so you shouldn't consider this an unexpected ping if the source is the docker network gateway IP

@stuartnelson3
Contributor

Is ICMP necessary as part of the protocol?

I believe only UDP and TCP are used.

That node advertises itself on the external IP, so you shouldn't consider this an unexpected ping if the source is the docker network gateway IP

The connection is made using the resolved address from --cluster.peer; if an unrecognized IP address "checks in", the underlying library, memberlist, doesn't like that -- the node has to join the cluster first.

@mxinden
Member

mxinden commented Sep 19, 2018

Are the two machines in the same VPC? Do they advertise themselves via the external IP, but communicate via the internal IP?

@tanji
Author

tanji commented Sep 19, 2018

OK, after changing it we're still seeing duplicates, though for some reason they now happen at larger intervals.
@mxinden the 1st machine is in AWS, the 2nd machine is at a bare-metal provider; they communicate over the internet (without problems previously, as I noted).

@mxinden
Member

mxinden commented Sep 19, 2018

@tanji are the clustering metrics mentioned above still flaky, or stable? In the latter case, do you have access to the notification payloads, and can you post them here?

@stuartnelson3
Contributor

The primary form of gossip between the nodes is done over UDP, which might be getting lost between datacenters.

@tanji
Author

tanji commented Sep 19, 2018

Yes, the metrics have been stable.
What do you mean by notification payloads?

@apsega

apsega commented Sep 21, 2018

I have the same issue: after upgrading 2 Alertmanagers to version 0.15.2, we're receiving duplicate alerts.

Notable config:

group_wait: 30s
--cluster.peer-timeout=1m

Tuning cluster.peer-timeout values to 15s, 30s or 1m does not help in any way.

Debug log shows this:

caller=cluster.go:287 component=cluster memberlist="2018/09/21 07:21:03 [INFO] memberlist: Marking 01CQXFW2PX58MBA1KVDFHTAACN as failed, suspect timeout reached (0 peer confirmations)\n"
caller=delegate.go:215 component=cluster received=NotifyLeave node=01CQXFW2PX58MBA1KVDFHTAACN addr=xxx.xx.x.xx:9094
caller=cluster.go:439 component=cluster msg="peer left" peer=01CQXFW2PX58MBA1KVDFHTAACN
caller=cluster.go:287 component=cluster memberlist="2018/09/21 07:21:04 [DEBUG] memberlist: Initiating push/pull sync with: xx.x.xx.xx:9094\n"
caller=cluster.go:389 component=cluster msg=reconnect result=success peer= addr=xx.x.xx.xx:9094

I wonder if this is related to the Alertmanagers running in Docker containers with the flag --cluster.listen-address=0.0.0.0:9094; --cluster.peer= is set to the IP addresses of the machines the containers run on, but Alertmanager shows Docker-internal IPs. Prior to the upgrade, everything was fine.

Some Graphs:
[screenshot]

[screenshot]

@apsega

apsega commented Sep 21, 2018

Seems like tuning --cluster.probe-timeout up to 10s does not help.

@mxinden
Member

mxinden commented Sep 21, 2018

What do you mean by notification payloads?

@tanji sorry for not introducing the terms first. We generally refer to an alert as the request sent by Prometheus to Alertmanager, and a notification as the request sent by Alertmanager to e.g. Slack. Do you have access to the payload of two duplicate notifications sent by Alertmanager to Slack?


@apsega which Alertmanager version were you running before? v0.15.1?

@apsega

apsega commented Sep 21, 2018

@mxinden actually very old release, something like v0.8.x

@tanji
Author

tanji commented Sep 21, 2018

@mxinden do you mean the JSON payload? Unfortunately I am not sure how to access it. Is it logged anywhere?

On the text side, the outputs are identical.

@apsega

apsega commented Sep 21, 2018

Seems like downgrading to v0.14.0 solves the issue. I tried downgrading to v0.15.1 and v0.15.0 with no luck, so the issue occurs from v0.15.0 onwards.

@stuartnelson3
Contributor

@apsega your cluster isn't stable, hence the duplicate messages. Once the cluster stops flapping, the duplicates should stop.

I would guess this has something to do with your setup running in Docker containers.

@apsega

apsega commented Sep 21, 2018

Well, downgrading to v0.14.0 made it stable:

[screenshot]

@simonpasquier
Member

@apsega 0.14 and 0.15 use different libraries for clustering, which probably explains why the behaviors are different. You can try --log.level=debug to get more details, but again, your question would be better answered on the Prometheus users mailing list than here.

@tanji
Author

tanji commented Mar 5, 2019

This is still an issue in 2019. Can you let me know how to access the payloads?

@PedroMSantosD

Hi, just for confirmation,
I'm getting duplicate alerts on sending to HipChat on version

alertmanager-0.16.2-1.el7.centos.x86_64

running two nodes in two separate datacenters, which are apart by (ICMP stats):

rtt min/avg/max/mdev = 57.887/91.449/392.915/100.489 ms

My firewall only allows TCP connections between the AMs.

Do Alertmanagers use BOTH UDP and TCP to signal each other, or will TCP suffice?

@simonpasquier
Member

As noted in the README.md:

Important: Both UDP and TCP are needed in alertmanager 0.15 and higher for the cluster to work.

@PedroMSantosD

Thanks!

@rnachire

Hi,
we are facing the same issue with the latest releases as well (v0.17.0, 0.16.2 and 0.15.2).

All firing alerts are delivered three times to Slack. Snapshots are attached below.

[screenshot]

[screenshot]

But in Alertmanager each of them appears only once.

level=debug ts=2019-06-24T06:47:12.043122299Z caller=cluster.go:654 component=cluster msg="gossip looks settled" elapsed=4.002890094s
level=debug ts=2019-06-24T06:47:14.043559452Z caller=cluster.go:654 component=cluster msg="gossip looks settled" elapsed=6.003325616s
level=debug ts=2019-06-24T06:47:16.043900893Z caller=cluster.go:654 component=cluster msg="gossip looks settled" elapsed=8.003677238s
level=info ts=2019-06-24T06:47:18.044573091Z caller=cluster.go:649 component=cluster msg="gossip settled; proceeding" elapsed=10.004343151s
level=debug ts=2019-06-24T06:47:21.370795639Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=PodNotReady[52153f9][active]
level=debug ts=2019-06-24T06:47:21.371171307Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=PodNotReady[f155826][active]
level=debug ts=2019-06-24T06:47:21.372144161Z caller=dispatch.go:343 component=dispatcher aggrGroup="{}:{alertname=\"PodNotReady\", kubernetes_node=\"sanity-worker-3\", node=\"sanity-worker-1\", pod=\"glowroot-55888ccb49-75mkd\"}" msg=flushing alerts=[PodNotReady[f155826][active]]
level=debug ts=2019-06-24T06:47:21.372130267Z caller=dispatch.go:343 component=dispatcher aggrGroup="{}:{alertname=\"PodNotReady\", kubernetes_node=\"sanity-worker-3\", node=\"sanity-master\", pod=\"infra-log-forwarder-4xh8r\"}" msg=flushing alerts=[PodNotReady[52153f9][active]]
level=debug ts=2019-06-24T06:47:53.870328657Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=NodeHighMemory[ac84e40][active]
level=debug ts=2019-06-24T06:47:53.870711408Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=NodeHighMemory[95a314b][active]
level=debug ts=2019-06-24T06:47:53.87093018Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=NodeHighMemory[5a20967][active]
level=debug ts=2019-06-24T06:47:53.871072516Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=NodeHighMemory[e21ff42][active]
level=debug ts=2019-06-24T06:47:53.871208813Z caller=dispatch.go:343 component=dispatcher aggrGroup="{}:{alertname=\"NodeHighMemory\", kubernetes_node=\"sanity-worker-1\"}" msg=flushing alerts=[NodeHighMemory[ac84e40][active]]
level=debug ts=2019-06-24T06:47:53.871781535Z caller=dispatch.go:343 component=dispatcher aggrGroup="{}:{alertname=\"NodeHighMemory\", kubernetes_node=\"sanity-worker-4\"}" msg=flushing alerts=[NodeHighMemory[e21ff42][active]]
level=debug ts=2019-06-24T06:47:53.872117671Z caller=dispatch.go:343 component=dispatcher aggrGroup="{}:{alertname=\"NodeHighMemory\", kubernetes_node=\"sanity-worker-2\"}" msg=flushing alerts=[NodeHighMemory[95a314b][active]]
level=debug ts=2019-06-24T06:47:53.872447398Z caller=dispatch.go:343 component=dispatcher aggrGroup="{}:{alertname=\"NodeHighMemory\", kubernetes_node=\"sanity-worker-3\"}" msg=flushing alerts=[NodeHighMemory[5a20967][active]]
level=debug ts=2019-06-24T06:47:53.884892698Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=NodeHighCPU[89e6117][active]
level=debug ts=2019-06-24T06:47:53.885712288Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=NodeHighCPU[5b8c4ba][active]
level=debug ts=2019-06-24T06:47:53.885826035Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=NodeHighCPU[a70d82b][active]
level=debug ts=2019-06-24T06:47:53.885902372Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=NodeHighCPU[b8aba53][active]
level=debug ts=2019-06-24T06:47:53.885969583Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=NodeHighCPU[dcb06f4][active]
level=debug ts=2019-06-24T06:47:53.889850981Z caller=dispatch.go:343 component=dispatcher aggrGroup="{}:{alertname=\"NodeHighCPU\"}" msg=flushing alerts="[NodeHighCPU[b8aba53][active] NodeHighCPU[dcb06f4][active] NodeHighCPU[5b8c4ba][active] NodeHighCPU[a70d82b][active] NodeHighCPU[89e6117][active]]"
level=debug ts=2019-06-24T06:48:21.386528811Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=ContainerNotUp[36ba4f8][active]
level=debug ts=2019-06-24T06:48:21.386850104Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=ContainerNotUp[835d84a][active]
level=debug ts=2019-06-24T06:48:21.387210042Z caller=dispatch.go:343 component=dispatcher aggrGroup="{}:{alertname=\"ContainerNotUp\", container=\"glowroot\", pod=\"glowroot-55888ccb49-75mkd\"}" msg=flushing alerts=[ContainerNotUp[36ba4f8][active]]
level=debug ts=2019-06-24T06:48:21.387250465Z caller=dispatch.go:343 component=dispatcher aggrGroup="{}:{alertname=\"ContainerNotUp\", container=\"infra-log-forwarder\", pod=\"infra-log-forwarder-4xh8r\"}" msg=flushing alerts=[ContainerNotUp[835d84a][active]]




level=debug ts=2019-06-24T06:49:21.366908755Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=PodNotReady[f155826][active]
level=debug ts=2019-06-24T06:49:21.367154283Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=PodNotReady[52153f9][active]
level=debug ts=2019-06-24T06:49:53.86967788Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=NodeHighMemory[5a20967][active]
level=debug ts=2019-06-24T06:49:53.869880881Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=NodeHighMemory[e21ff42][active]
level=debug ts=2019-06-24T06:49:53.870030144Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=NodeHighMemory[ac84e40][active]
level=debug ts=2019-06-24T06:49:53.870142657Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=NodeHighMemory[95a314b][active]
level=debug ts=2019-06-24T06:49:53.883311217Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=NodeHighCPU[dcb06f4][active]
level=debug ts=2019-06-24T06:49:53.883562925Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=NodeHighCPU[89e6117][active]
level=debug ts=2019-06-24T06:49:53.883691331Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=NodeHighCPU[5b8c4ba][active]
level=debug ts=2019-06-24T06:49:53.883783857Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=NodeHighCPU[a70d82b][active]
level=debug ts=2019-06-24T06:49:53.884001877Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=NodeHighCPU[b8aba53][active]
level=debug ts=2019-06-24T06:50:21.386374426Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=ContainerNotUp[36ba4f8][active]
level=debug ts=2019-06-24T06:50:21.386703419Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=ContainerNotUp[835d84a][active]

In the Alertmanager UI:
[screenshot]

Our Alertmanager config is below.

The Alertmanager deployment YAML is attached:

test2.txt

Template used for Slack:

apiVersion: v1
data:
  alertmanager.yml: |
    global:
      slack_api_url: https://hooks.slack.com/services/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
    receivers:
    - name: default-receiver
      slack_configs:
      - channel: '#prom-alerts'
      - text: |-
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }} - {{ .Labels.severity }}
          Description: {{ .Annotations.description }}
          Runbook: <{{ .Annotations.runbook }}|:spiral_note_pad:>
          Details:
          {{ range .Labels.SortedPairs }} • {{ .Name }}: {{ .Value }}
          {{ end }}
          {{ end }}
      - send_resolved: true
    route:
      group_by: ['alertname','kubernetes_node','pod','node','container']
      group_interval: 5m
      group_wait: 10s
      receiver: default-receiver
      repeat_interval: 3h

Please let us know if you need any other info to debug further. Thanks in advance.

Regards,
Rajesh

@stuartnelson3
Contributor

As noted in the README, both UDP and TCP ports need to be open for HA mode:

https://github.com/prometheus/alertmanager#high-availability

From looking at the deployment, I only see a TCP endpoint being opened. If you open a UDP port and configure the AMs with this, the duplicate messages should go away.
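
For instance, a minimal sketch of a headless Kubernetes Service exposing the cluster port over both protocols (the Service name and the app: alertmanager label are assumptions; adjust to your deployment):

apiVersion: v1
kind: Service
metadata:
  name: alertmanager-cluster   # placeholder name
spec:
  clusterIP: None              # headless, so peers can resolve each other directly
  selector:
    app: alertmanager          # assumed pod label
  ports:
    - name: cluster-tcp
      port: 9094
      targetPort: 9094
      protocol: TCP
    - name: cluster-udp
      port: 9094
      targetPort: 9094
      protocol: UDP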

For further support, please write to the prometheus-users mailing list, since this seems to be a usage question and not a bug.

@rnachire

Thanks. Can you please help us set up UDP and TCP in the above YAML?

@stuartnelson3
Contributor

I recommend looking at the Kubernetes documentation.

@PNRxA

PNRxA commented Jun 25, 2019

FWIW I've been following this as I've been having the same issues.

I have UDP and TCP open; I made sure both were connectable with Ncat.

Today I added the flags:
--cluster.listen-address=
--cluster.advertise-address=

This has fixed the issue from what I can see.

[screenshot]

@0x63lv

0x63lv commented Jun 28, 2019

Had the same issue with 2 Alertmanagers running as docker containers on separate hosts.

The changes that apparently resolved the issue for us:

  • Allowed the UDP port for cluster communication, as recently added to the documentation
  • Specified --cluster.advertise-address=<host IP address>:9094 in the Alertmanager launch command. Otherwise it picks up Docker-internal IP addresses (172.*), which apparently does not play well with Alertmanager clustering.

[screenshot]

@MattPOlson

We tried this change and it still doesn't work. The issue is that this function is used to obtain the IP: sockaddr.GetPrivateIP, which returns the first public or private IP address on the default interface. sockaddr.GetInterfaceIP("eth0") is a better choice; it returns the IP of the interface passed in. We made the change in a forked repo and it's working better for us so far.

@tiwarishrijan

Facing the same issue:

Version:
[screenshot]

Issue:
[screenshot]

Configuration:
[screenshot]

Rule:
[screenshot]

Prometheus config:
[screenshot]

@simonpasquier
Member

The issue is this function is being used to obtain the IP, sockaddr.GetPrivateIP which returns the first public or private IP address on the default interface. This function is a better choice, sockaddr.GetInterfaceIP("eth0"), it returns the IP of the interface passed in. We made the change in a forked repo and it's working better for us so far.

@MattPOlson would you like to submit a PR with your change? Given all the issues reported, I think it would make sense to have this option available.

cc @stuartnelson3 @mxinden

@brian-brazil
Contributor

How would we determine the interface to pass in?

@simonpasquier
Member

That would be another flag.

@brian-brazil
Contributor

If you're passing that as a flag, couldn't you pass the IP?

On newer kernel versions, interface names are kinda random.

@MattPOlson

Sure, I can submit a PR with the change; I just need to spend a little time getting it buttoned up completely. I did add a flag that defines the name of the interface to be used. As far as I can tell, in a Docker Swarm environment it will always be eth0. Passing in the IP won't work because we don't know what it is until the container is spun up. I do have a question about this piece of code.

type getPrivateIPFunc func() (string, error)

// This is overridden in unit tests to mock the sockaddr.GetPrivateIP function.
var getPrivateAddress getPrivateIPFunc = sockaddr.GetPrivateIP

privateIP, err := getPrivateAddress()

What's the purpose of creating that type and var to use later? Why not just call the function directly like this

privateIP, err := sockaddr.GetPrivateIP()

@brian-brazil
Contributor

Passing in the IP won't work because we don't know what it is until the container is spun up.

I don't see how that's a problem, this is something that'll be known within the container.

@MattPOlson

Passing in the IP won't work because we don't know what it is until the container is spun up.

I don't see how that's a problem, this is something that'll be known within the container.

Because that flag is passed in via the Docker command. For example, here is the compose file we use to create the Alertmanager Docker service.

services:
  alertmanager_1:
    image: bdalertmanager:1.0.6
    ports:
      - 9093:9093
    environment:
      - SLACK_URL=${SLACK_URL:-https://hooks.slack.com/services/TOKEN}
      - SLACK_CHANNEL=${SLACK_CHANNEL:-general}
      - SLACK_USER=${SLACK_USER:-alertmanager}
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
      - '--web.external-url=https://alertmanager.${URLSLUG:-dev}.local'
      - '--cluster.peer=alertmanager_2:9094'
      - '--log.level=debug'
      - '--cluster.probe-timeout=15s'
      - '--cluster.probe-interval=30s'
      - '--cluster.peer-timeout=30s'
      - '--cluster.listen-address=0.0.0.0:9094'
      - '--cluster.advertise-address=:9094'
      - '--cluster.advertise-address-interface=eth0'
    volumes:
      - ./alertmanager/conf/dev:/etc/alertmanager

I can't put an IP address in the file because I don't know what it is; it won't be defined until the container is created, and then it's too late to set it because Alertmanager is already running.

@mxinden
Member

mxinden commented Jul 6, 2019

@MattPOlson instead of introducing new flags, would retrieving the interface IP right before startup and templating it into the --cluster.advertise-address flag be an option?

IP=$(ip -6 addr show eth0 | grep -oP 'SOME_REGEX')
alertmanager [...] --cluster.advertise-address=${IP}:9094

@vtolstov

vtolstov commented Jul 6, 2019

Why not create an option to pass a network CIDR? Then when the container starts, you can check all addresses and use the one from the specified block.
This would also be useful with auto-configuration like DHCP.

@MattPOlson

@MattPOlson instead of introducing new flags, would retrieving the interface IP right before startup and templating it into the --cluster.advertise-address flag be an option?

IP=$(ip -6 addr show eth0 | grep -oP 'SOME_REGEX')
alertmanager [...] --cluster.advertise-address=${IP}:9094

We played around with that a little bit but couldn't get the command substitution to work with the Docker command. Also, that would mean hard-coding 'eth0'; if we're doing that, sockaddr.GetInterfaceIP("eth0") would be simpler.

Why not create an option to pass a network CIDR? Then when the container starts, you can check all addresses and use the one from the specified block.
This would also be useful with auto-configuration like DHCP.

That's an interesting idea; we'd just need to be sure it returns the right IP address, and the same one, every time. I like specifying the interface since it will always return the same, correct IP.

@woozhijun

I was having the same problem on Docker Swarm, with Alertmanager version 0.18.

[DEBUG] memberlist: Failed to join 172.17.0.3: dial tcp 172.17.0.3:8001: connect: connection refused\n

@jmb12686

jmb12686 commented Nov 2, 2019

Same issue here as well, running 2 Alertmanager containers in Docker Swarm. I'm attempting to set them up in HA; however, the advertise address is unknown until the container starts up. Any movement on the PR mentioned above?

@kovalyukm

In my case, the issue was that the ports for HA were not exposed. Hope this is helpful for somebody.

@pascal-hofmann
Contributor

When using docker it's important to:

  • pass --cluster.advertise-address so the cluster nodes know their real IP
  • expose port 9094 via both tcp and udp.

When using docker-compose this can be done like this:

…
    ports:
      - 9094:9094/tcp
      - 9094:9094/udp
…

Doing both fixed the issue with duplicate notifications for me.
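
Putting the thread's advice together, a hedged sketch of one node's docker-compose service (the peer hostname am2 and the advertise IP 203.0.113.10 are placeholders for values the other node can actually reach):

  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - 9093:9093
      - 9094:9094/tcp
      - 9094:9094/udp
    command:
      - --config.file=/etc/alertmanager/config.yml
      - --storage.path=/alertmanager
      - --cluster.listen-address=0.0.0.0:9094
      - --cluster.advertise-address=203.0.113.10:9094   # placeholder: an address reachable by the peer
      - --cluster.peer=am2:9094                          # placeholder peer hostname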

@darkl0rd

darkl0rd commented Oct 31, 2021

Same issue here as well, running 2 alertmanagers containers in Docker Swarm. I'm attempting to set them up in HA, however advertise-address is unknown until the container starts up. Any movement on the PR mentioned above?

I'm running Alertmanager in Docker Swarm; the two Alertmanagers communicate over a dedicated overlay network (no restrictions on the overlay network, all ports open). However, they both also have an additional network attached to expose them through my load balancer.

As per earlier posts, it's impossible to know the IP when AM starts up, so I have "--cluster.advertise-address=:9094" set instead, which seemingly results in Alertmanager using a 172.x address (the Docker host range) rather than the IPs of the overlay network. This leads to at least one of the AMs continuously flapping, resulting in duplicate alerts.

Is there a good solution for this? Or is the current workaround to fetch the IP during startup and set it as the advertise address? Or to pin them to a host, create a port mapping for TCP and UDP, and use the host address?
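
For that last option, a hedged sketch of a host-mode port mapping in a compose v3 stack file (only the ports section is shown; the rest of the service definition is unchanged):

    ports:
      - target: 9094
        published: 9094
        protocol: tcp
        mode: host
      - target: 9094
        published: 9094
        protocol: udp
        mode: host

With mode: host the container publishes directly on the node it runs on, bypassing the routing mesh, so --cluster.advertise-address can use that node's address.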

@Gaozizhong

          args:
            - --config.file=/etc/alertmanager-config/omc-alertmanager.yaml
            - --storage.path=/alertmanager
            - --cluster.listen-address=0.0.0.0:9094
            - --cluster.advertise-address=$(POD_IP):9094
            - --cluster.peer=omc-alertmanager-0.omc-alertmanager-service.omc:9094
            - --cluster.peer=omc-alertmanager-1.omc-alertmanager-service.omc:9094
            - --cluster.peer-timeout=180s
            - --log.level=info
          env:
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP

When deploying on Kubernetes we saw the cluster flapping; in practice, configuring cluster.advertise-address this way resolved it.
