Duplicate notifications after upgrading to Alertmanager 0.15 #1550
Comments
Are these double-notifications happening consistently? There's no consensus between alertmanagers -- if they receive the initial alert at different times from a prometheus server, the created alert groups in the different alertmanagers might be out of sync by e.g. a single evaluation interval. If your evaluation interval=15s, and the |
Yes they're quite consistent. What do you mean by evaluation interval? Is this tunable? |
Your logs are indicating some weird behavior:
How/where are these alertmanagers deployed? |
That just happens when starting alertmanager; no messages appear after this. Those alertmanagers are deployed in the cloud and are pretty close to each other. Of note, we use a docker image to deploy. Here are the configs and startup parameters:
Second AM
Nothing really rocket science, this setup (with the old mesh protocol) worked without duplicates until I upgraded. Re. 3 nodes could it be that it's also trying to connect to itself? |
The connection logs are only written during start-up, they aren't logged if the connection flaps. Can you check the following alertmanager metrics that your prometheus should be scraping?
|
We don't scrape those, I'll fix this and will look at metrics. |
There's something wrong indeed, one node is ok, always sees the other one |
|
OK, I found it: the 2nd node wasn't announcing itself on the correct address. It used the Amazon internal IP instead of the external one :( It should work better now. Of note, I'm getting those errors:
Is ICMP necessary as part of the protocol? I can enable it on AWS, it's disabled by default |
The ping thingy doesn't seem to play well with Docker networks:
That node advertises itself on the external IP, so you shouldn't consider this an unexpected ping if the source is the docker network gateway IP |
I believe only UDP and TCP are used.
The connection is made using the resolved address from |
Are the two machines in the same VPC? Do they advertise themselves via the external IP, but communicate via the internal IP? |
OK, after changing it, we're still having duplicate issues. For some reason they happen at larger intervals now. |
@tanji are the clustering metrics mentioned above still flaky, or stable? In the latter case, do you have access to the notification payloads and can post them here? |
The primary form of gossip between the nodes is done over UDP, which might be getting lost between datacenters. |
Yes, the metrics have been stable. |
I have the same issue, that after upgrading 2 AlertManagers to 0.15.2 version, we're receiving duplicate alerts. Notable config:
Tuning Debug log shows this:
I wonder if this is related to the AlertManagers running in Docker containers with flag |
Seems like tuning |
@tanji sorry for not introducing the terms first. We generally refer to an alert as the request sent by Prometheus to Alertmanager and a notification as the request sent by Alertmanager to e.g. Slack. Do you have access to the payload of two duplicate notifications that Alertmanager sent to Slack? @apsega which Alertmanager version were you running before? v0.15.1? |
@mxinden actually very old release, something like v0.8.x |
@mxinden do you mean the JSON payload? Unfortunately I am not sure how to access it. Is it logged anywhere? On the text side the outputs are strictly similar. |
Seems like downgrading to |
@apsega your cluster isn't stable, hence the duplicate messages. once your cluster stops flapping it should stop sending the duplicate messages. I would guess that this is definitely something to do with your setup running in docker containers. |
@apsega 0.14 and 0.15 use different libraries for clustering, which probably explains why the behaviors are different. You can try with |
This is still an issue in 2019, can you let me know how to access the payloads? |
Hi, just for confirmation,
running two nodes on two separate datacenters which are apart by (ICMP stats)
My firewall only allows TCP connections between the AMs; Do alertmanagers use BOTH, UDP and TCP protocols for signalling each other? or will TCP suffice? |
As noted in the README.md:
|
Thanks! |
Hi, all firing alerts are being delivered to Slack three times. A snapshot is attached. In Alertmanager, however, each of them appears only once.
We have the following configs on Alertmanager: Attached is the deployment YAML for Alertmanager: Template used for Slack:
Please let us know if you need any other info to debug further. Thanks in advance. Regards, |
As noted in the README, both UDP and TCP ports need to be open for HA mode: https://github.com/prometheus/alertmanager#high-availability From looking at the deployment, I only see a TCP endpoint being opened. If you open a UDP port and configure the AMs with this, the duplicate messages should go away. For further support, please write to the users mailing list, [email protected], since this seems to be a usage question and not a bug. |
Thanks. Can you please help us set up UDP and TCP in the above YAML? |
I recommend looking at the kubernetes documentation |
We tried this change and it still doesn't work. The issue is this function is being used to obtain the IP, sockaddr.GetPrivateIP which returns the first public or private IP address on the default interface. This function is a better choice, sockaddr.GetInterfaceIP("eth0"), it returns the IP of the interface passed in. We made the change in a forked repo and it's working better for us so far. |
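For readers who want to see what an interface-based lookup involves, here is a minimal sketch using only Go's standard `net` package. It is not the change made in the fork mentioned above and it does not use the sockaddr library; the helper names `firstIPv4` and `interfaceIPv4` are hypothetical, and `eth0` and port `9094` are assumptions taken from this thread.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// firstIPv4 returns the first IPv4 address found in a list of
// "address/prefix" strings, the form net.Interface.Addrs yields on Linux.
func firstIPv4(addrs []string) (string, error) {
	for _, a := range addrs {
		ip, _, err := net.ParseCIDR(a)
		if err != nil {
			continue // skip entries that are not in CIDR form
		}
		if v4 := ip.To4(); v4 != nil {
			return v4.String(), nil
		}
	}
	return "", errors.New("no IPv4 address found")
}

// interfaceIPv4 looks up the named interface and returns its first IPv4 address.
func interfaceIPv4(name string) (string, error) {
	iface, err := net.InterfaceByName(name)
	if err != nil {
		return "", err
	}
	addrs, err := iface.Addrs()
	if err != nil {
		return "", err
	}
	strs := make([]string, 0, len(addrs))
	for _, a := range addrs {
		strs = append(strs, a.String())
	}
	return firstIPv4(strs)
}

func main() {
	// "eth0" is an assumption; the interface name depends on the environment.
	ip, err := interfaceIPv4("eth0")
	if err != nil {
		fmt.Println("could not determine address:", err)
		return
	}
	// 9094 is the cluster port used elsewhere in this thread.
	fmt.Println(net.JoinHostPort(ip, "9094"))
}
```

The printed host:port pair is the kind of value one would pass to `--cluster.advertise-address`. Naming the interface explicitly avoids depending on whichever address a "default interface" heuristic happens to select.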
@MattPOlson would you like to submit a PR with your change? Given all the issues reported, I think it would make sense to have this option available. |
How would we determine the interface to pass in? |
That would be another flag. |
If you're passing that as a flag, couldn't you pass the IP? On newer kernel versions, interface names are kinda random. |
Sure, I can submit a PR with the change, just need to spend a little time getting it buttoned up completely. I did add a flag that defines the name of the interface to be used. As far as I can tell in a Docker Swarm environment it will always be eth0. Passing in the IP won't work because we don't know what it is until the container is spun up. I do have a question around this piece of code.
What's the purpose of creating that type and var to use later? Why not just call the function directly like this
|
I don't see how that's a problem, this is something that'll be known within the container. |
Because that flag is passed in using docker command. For example here is the docker file we use to create the AlertManager Docker Service.
I can't put an IP address in the file because I don't know what it is. It won't be defined until the container is created, and then it's too late to set it because AlertManager is already running. |
@MattPOlson instead of introducing new flags, would retrieving the interface IP right before startup and templating it into the
IP=$(ip -6 addr show eth0 | grep -oP 'SOME_REGEX')
alertmanager [...] --cluster.advertise-address=${IP}:9094 |
Why not create an option to pass a network as a CIDR? Then, when the container starts, you can check all addresses and use the one from the specified block. |
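The CIDR idea can be sketched in a few lines of Go with the standard `net` package. This is only an illustration of the selection logic, not an existing Alertmanager option; the helper name `pickFromCIDR` and the example addresses (a Docker bridge address and an overlay address) are hypothetical.

```go
package main

import (
	"fmt"
	"net"
)

// pickFromCIDR returns the first candidate address that falls inside cidr.
func pickFromCIDR(cidr string, candidates []string) (string, error) {
	_, block, err := net.ParseCIDR(cidr)
	if err != nil {
		return "", fmt.Errorf("invalid CIDR %q: %w", cidr, err)
	}
	for _, c := range candidates {
		ip := net.ParseIP(c)
		if ip != nil && block.Contains(ip) {
			return ip.String(), nil
		}
	}
	return "", fmt.Errorf("no address in %s", cidr)
}

func main() {
	// Hypothetical addresses: one on a Docker bridge, one on an overlay network.
	candidates := []string{"172.18.0.3", "10.0.1.5"}
	ip, err := pickFromCIDR("10.0.1.0/24", candidates)
	if err != nil {
		fmt.Println(err)
		return
	}
	// 9094 is the cluster port used elsewhere in this thread.
	fmt.Println(net.JoinHostPort(ip, "9094")) // prints 10.0.1.5:9094
}
```

In a real container the candidate list would come from enumerating the local interfaces at startup. Selecting by network block rather than by interface name sidesteps the concern that interface names can vary between hosts.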
We played around with that a little bit but couldn't get the Command Substitution to work with Docker command. Also that would mean hard coding 'eth0' , if doing that this would be simpler sockaddr.GetInterfaceIP("eth0").
That's an interesting idea, just need to be sure that it returns the right and same IP address every time. I like specifying the interface since it will always return the same and correct IP. |
I was having the same problem with Docker Swarm, and the Alertmanager version is 0.18.
|
Same issue here as well, running 2 alertmanagers containers in Docker Swarm. I'm attempting |
In my case, the issue was that the ports for HA were not exposed. Hope this is helpful for somebody. |
When using docker it's important to:
When using
Doing both fixed the issue with duplicate notifications for me. |
I'm running alertmanager in Docker Swarm, the two alertmanagers communicate over a dedicated Overlay network (no restrictions on the overlay network, all ports open). However, they both also have an additional network attached to expose them through my load balancer. As per earlier posts, it's impossible to know the IP when AM starts up, so I have "--cluster.advertise-address=:9094" set instead, which seemingly results in alertmanager using the 172.x (docker host range) instead, rather than the IP's of the overlay network. This leads to (at least) one of the AM's continuously flapping, resulting in the duplicate alerts. Is there a good solution to deal with this? Or is the current workaround to fetch the IP during startup and setting it as advertise-address? Or pinning them to a host, creating a port mapping for TCP and UDP and using the host address? |
When deploying on k8s, the cluster was flapping. In my testing, configuring cluster.advertise-address like this resolves it
After upgrading from Alertmanager 0.13 to 0.15.2 in a cluster of two members we've started receiving double notifications in slack. It used to work flawlessly with 0.13. Weirdly we're receiving the 2 notifications exactly at the same time, they don't seem to be apart by more than a couple of secs.
Linux pmm-server 3.16.0-4-amd64 #1 SMP Debian 3.16.39-1 (2016-12-30) x86_64 GNU/Linux
Both instances using ntp.
alertmanager, version 0.15.2 (branch: HEAD, revision: d19fae3)
build user: root@3101e5b68a55
build date: 20180814-10:53:39
go version: go1.10.3
Cluster status reports up:
Status
Uptime:
2018-09-09T19:03:01.726517546Z
Cluster Status
Name:
01CPZVEFADF9GE2G9F2CTZZZQ6
Status:
ready
Peers:
Name: 01CPZV0HDRQY5M5TW6FDS31MKS
Address: :9094
Name: 01CPZVEFADF9GE2G9F2CTZZZQ6
Address: :9094
Irrelevant
no errors to speak of