
Inhibited alerts are being sent as resolved #891

Closed
sciffer opened this issue Jun 28, 2017 · 12 comments

@sciffer
Contributor

sciffer commented Jun 28, 2017

Alertmanager 0.7.1, Prometheus 1.6.3.
Every alert gets sent from 2 Prometheus collectors (redundant collectors). Inhibit rules are in place to make sure only 1 of those alerts will get fired.
When the alerts fire, only 1 of them gets sent (which is what we expect), but when the alert is resolved, both alerts are sent as resolved - this is wrong.
So far we have noticed this behaviour with email notifications; I guess PagerDuty will filter and complain about resolved alerts that never got triggered (as it has context and state).
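
For illustration, an inhibit rule for such a setup could look roughly like the following sketch, assuming a hypothetical label 'prometheus' that distinguishes the two collectors (the actual rules and label names may differ):

inhibit_rules:
- source_match:
    prometheus: 'prom-1'            # hypothetical label value of the first collector
  target_match:
    prometheus: 'prom-2'            # alerts carrying the second collector's label are muted
  equal: ['alertname', 'instance']  # only inhibit when these labels match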

@mxinden
Member

mxinden commented Jun 30, 2017

@sciffer Thanks for reporting! Could you supply your Prometheus and Alertmanager config as well as the alert so we can try to reproduce this?

@gfliker

gfliker commented Jun 30, 2017

Hi @mxinden, thanks for responding.

I have attached our Alertmanager YAML file (with the keys and auth parts removed where relevant):
alertmanager.txt

The relevant Prometheus alerting config is pasted below.
It discovers 4 Alertmanagers, but only 2 of them have receivers configured.
The 2 active ones are in a mesh setup.

The reason we need the inhibition rules (which work great) is that for each team we have two Prometheus servers doing the same scraping, and we want only one alert to be fired.

Please let us know if more information is needed

Prometheus alerting configs:
alerting:
  alertmanagers:
  - timeout: 10s
    consul_sd_configs:
    - server: 'localhost:8500'
      datacenter: 'nydc1'
      services:
      - alertmanager
    relabel_configs:
    - source_labels: ['__meta_consul_node']
      regex: '^alert-.*'
      action: keep
  - timeout: 10s
    consul_sd_configs:
    - server: 'localhost:8500'
      datacenter: 'chidc2'
      services:
      - alertmanager
    relabel_configs:
    - source_labels: ['__meta_consul_node']
      regex: '^alert-.*'
      action: keep
  - timeout: 10s
    consul_sd_configs:
    - server: 'localhost:8500'
      datacenter: 'sadc1'
      services:
      - alertmanager
    relabel_configs:
    - source_labels: ['__meta_consul_node']
      regex: '^alert-.*'
      action: keep
  - timeout: 10s
    consul_sd_configs:
    - server: 'localhost:8500'
      datacenter: 'il'
      services:
      - alertmanager
    relabel_configs:
    - source_labels: ['__meta_consul_node']
      regex: '^alert-.*'
      action: keep

@sandersaares

Possibly related: #878

@mxinden mxinden added this to the v0.8 milestone Jul 6, 2017
@gfliker

gfliker commented Jul 13, 2017

Any updates on this issue?

@mxinden
Member

mxinden commented Jul 16, 2017

@gfliker @sciffer I am sorry that this is taking so long. I am able to reproduce your issue.

First of all, a general question: why are you using inhibition rules to de-duplicate the alerts sent by two identical Prometheus servers for HA? If I am not misunderstanding your use case, this can be done with the default behaviour of Alertmanager (see the FAQ). Whenever two alerts with the same label set come in, they are automatically de-duplicated into one alert.

Given the inhibition logic, my guess is:

  1. AlertA from Prometheus1 -> Alertmanager1: the alert is sent to HipChat or another integration.
  2. AlertA from Prometheus2 -> Alertmanager1: the alert is inhibited by the inhibition rules, since AlertA was already sent before, so it is not sent.
  3. AlertA1 stops firing.
  4. AlertA2 stops firing.
  5. AlertA from Prometheus1 is sent as resolved by Alertmanager1.
  6. AlertA from Prometheus2 is checked for alerts that would inhibit it. AlertA from Prometheus1 is already marked as resolved, so there are none, and AlertA from Prometheus2 is sent as resolved as well. This results in two resolved notifications.

This should not happen with the default de-duplication logic of Alertmanager.

Is it possible for you to use the default HA de-duplication logic and remove the re-labeling on the Prometheus side? I hope I am not missing something here.
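
As a rough sketch of what that means on the Prometheus side (hypothetical label names, not taken from your attached config): both replicas have to emit alerts with identical label sets, so no per-replica identifier may end up on the alerts, e.g.:

global:
  external_labels:
    datacenter: 'nydc1'        # identical on both replicas: fine
    # prometheus: 'prom-1'     # a per-replica value here would defeat the de-duplication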

@gfliker

gfliker commented Jul 18, 2017

Thanks @mxinden for following up on this.

I guess that we can use alert_relabel_configs to drop/change the Prometheus instance label; that makes sense.
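
For reference, a minimal sketch of such an alert_relabel_configs entry in prometheus.yml, assuming the replica-identifying label is called 'prometheus_replica' (the actual label name may differ):

alerting:
  alert_relabel_configs:
  - regex: 'prometheus_replica'   # hypothetical name of the per-replica label
    action: labeldrop             # drop it before alerts reach Alertmanager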

FYI, using inhibition was working fine for the last 6 months. I'm guessing something changed in 0.7.1.

Anyway, we can close this issue since we will be switching to the alert_relabel_configs approach.

Many thanks @mxinden

@mxinden
Member

mxinden commented Jul 18, 2017

FYI, using inhibition was working fine for the last 6 months. I'm guessing something changed in 0.7.1.

@gfliker Oh, I don't think this is an intended change. I hope I get time to look into this further.

@brancz Can you confirm my suggestion to use Alertmanager's default de-duplication instead of inhibition rules to handle alerts duplicated by an HA Prometheus setup?

@brancz
Member

brancz commented Jul 18, 2017

This should be covered by the default way the Alertmanager works.

De-duplication happens based on the values of the labels used by the routes leading to the routing tree's leaves, so dropping/changing the Prometheus instance labels should not have an effect unless the instance labels are part of this tree.

Therefore I'm assuming there is either a bug in the de-duplication logic, or the mesh is not properly initialized.

@gfliker Could you give us the output of the /api/v1/status endpoint for both of the HA instances? It should give us more information on the status of the mesh network. (@mxinden, correct me if that's the wrong endpoint.)

@brancz
Member

brancz commented Jul 19, 2017

@mxinden and I just went through this in person and we think we figured out what is happening.

Inhibition works not by looking at which notifications have been sent, but at which alerts are currently firing. Therefore it is not the right mechanism to perform this de-duplication. The sequence of events @mxinden described above seems to match my suspicion.

So in terms of what we suggest you do:

  • Remove the inhibition rules used for de-duplication (this should happen through Alertmanager's normal de-duplication).
  • If we understand correctly, you are adding additional labels to your alerts in Prometheus that identify the Prometheus instance. These should be removed as well (could you also specify whether these are external labels or are added via alert relabelling rules?).

If this doesn't work for you, you should also see multiple alerts firing and multiple alert notifications; if that is the case, we should investigate further, and please then share your full Prometheus configuration as well (anonymized, of course).

Let us know how it works out; we're happy to help, @gfliker! 🙂

@mxinden
Member

mxinden commented Jul 27, 2017

@gfliker I am closing here. Please reopen in case you are still facing any issues.

@mxinden mxinden closed this as completed Jul 27, 2017
@gfliker

gfliker commented Jul 28, 2017

Hi @mxinden @brancz ,

Just to follow this through, we have migrated to the "alert relabelling rules" workflow and all is working as expected.

Thanks guys.

@mxinden
Member

mxinden commented Jul 28, 2017

This is great. Thanks for the feedback. Please feel free to reach out again if you are facing any further issues.

hh pushed a commit to ii/alertmanager that referenced this issue Apr 14, 2018
Fix code style check in "all" make target