Inhibit Rules should be able to consider different labels in their equal statement #2254

Open
debugloop opened this issue May 12, 2020 · 11 comments · May be fixed by #3525
Comments

@debugloop

While the title basically says it all, I will try to back this up with a concrete example. Imagine a setup where, in addition to all routers being targets for some type of blackbox_exporter-style metrics, an additional source of data is used to generate JSON target lists for file_sd.

In my example, this additional source could be a router configuration backup, which gives us a definitive list of link addresses configured on any interface of any router, as well as any configured metadata for each interface. It is trivial to build a JSON file containing the targets (all link addresses configured locally) with the relevant labels:

  • the router this address is configured on
  • the interface name/description
  • the remote hostname, for instance parsed from the description

These metrics could end up looking like this (10.0.0.0/8 are link addresses, 192.168.0.0/16 are loopbacks):

# job: ping-router-loopback
probe_success{instance="192.168.0.1", hostname="r1"} 0
probe_success{instance="192.168.0.2", hostname="r2"} 1

# job: ping-router-interface
probe_success{instance="10.0.0.1", hostname="r2", interface="Te0/7/0/12", remote="r1"} 0
probe_success{instance="10.0.0.2", hostname="r1", interface="Te0/2/0/1", remote="r2"} 0

Alerting in the most obvious way would create alerts corresponding to the job names, for instance a RouterDown alert with the expression probe_success{job="ping-router-loopback"} == 0. I would obviously want the following inhibition rule:

- source_match:
    alertname: RouterDown
  target_match:
    alertname: InterfaceDown
  equal:
  - hostname

This would inhibit the alert informing me that an interface is down on a router which is already alerting as down itself. I would, however, like to go one step further with an inhibit rule such as one of the following, as it comes as no surprise that any interface adjacent to the downed router will go down in turn, even though it is on another router/in another region/whatever.

# option A
- source_match:
    alertname: RouterDown
  target_match:
    alertname: InterfaceDown
  equal:
  - source_label: hostname
    target_label: remote

# option B, which would make the original `equal` kind of unnecessary
# by using `hostname: $hostname` for instance
- source_match:
    alertname: RouterDown
  target_match:
    alertname: InterfaceDown
    remote: "$hostname"

While the proposed syntax variants are just a general idea, I feel that this should be possible in some way which does not involve hacking around with the underlying alert expressions.

@brian-brazil
Contributor

If you wish to do something intricate like this, why not adjust your labels using alert_relabel_configs?
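(For reference, alert_relabel_configs live under the alerting section of prometheus.yml and are applied to alert labels before alerts are sent to Alertmanager. A minimal sketch, assuming a single Alertmanager at alertmanager:9093 and with placeholder label names:)

alerting:
  alert_relabel_configs:
    # copy one label into another before the alert leaves Prometheus
    - source_labels: [some_label]
      target_label: another_label
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']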

@debugloop
Author

Yes, we thought about something along these lines:

# copy hostname into a shared helper label (only when the label is present)
- source_labels: [hostname]
  regex: '(.+)'
  target_label: inhibit_marker
# copy remote into the same helper label (only when the label is present)
- source_labels: [remote]
  regex: '(.+)'
  target_label: inhibit_marker

However, this leaves people wondering what the meaning of this label on the alert is. I cannot map one label onto the other, as both labels are needed, hence the need for a helper label such as inhibit_marker (or something similar, suggestions welcome). Maybe I am overlooking a relabeling possibility, though.
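For completeness, the inhibit rule pairing with this workaround would then presumably only need to match on the helper label (a sketch, using the hypothetical inhibit_marker label from above):

- source_match:
    alertname: RouterDown
  target_match:
    alertname: InterfaceDown
  equal:
  - inhibit_marker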

It is also somewhat error-prone, as I don't think I can limit the action to a specific type of alert (maybe by using a chain of labelmap rules and __-prefixed labels? I haven't checked yet).
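Something like this might work to limit it to one alert type, by including alertname in the source_labels (just a sketch, untested):

- source_labels: [alertname, hostname]
  regex: 'RouterDown;(.+)'
  target_label: inhibit_marker
  replacement: '$1'
- source_labels: [alertname, remote]
  regex: 'InterfaceDown;(.+)'
  target_label: inhibit_marker
  replacement: '$1'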

I feel checking equality of two different labels is a very natural feature for inhibit rules, independent of the intricacy of my example. At least, the alternative using relabeling should be noted in the docs beside the equal statement.

@brian-brazil
Contributor

Label names are meant to have one specific meaning. If you find yourself trying to match label name A with label name B via any means - be that PromQL or inhibition - that implies that something may not be quite right with your label taxonomy.
Inhibition is also meant more for very blunt prevention of alerts, such as when an entire datacenter network has gone kaput. Trying to use it for something this fine-grained would hint to me that there's an attempt at cause-based rather than symptom-based alerting.

@debugloop
Author

Both labels have one specific meaning. On a RouterDown alert, the hostname identifies the router (as the instance labels are used for IPv4 and IPv6). On an InterfaceDown alert, the hostname does the same, but additionally the remote serves as a marker of dependency. While it is true that having the option might invite cause-based alerts, my example is strictly symptom-based, I think.

It really is not so intricate either; it is basically the same as the very blunt prevention you've described. If an entire datacenter network goes kaput, the links leading there will go down and will thus trigger alerts I intend to manage using the Alertmanager.

In my opinion, having Prometheus sort this out using the relabel_configs described above just creates an additional place to configure alert management, outside of the Alertmanager. In addition, it seems somewhat hacky to me to create labels just for matching on, when there could be a mechanism that achieves the same result without crutches while being succinct, to the point, and in the software you'd expect it in.

@brian-brazil
Contributor

A router or interface going down is a cause, not a symptom. A symptom would be users no longer being able to get to the website behind the router.

@debugloop
Author

We are a service provider, keeping customer interfaces online is the only symptom.

The cause could be:

  • BGP misconfiguration of a customer announcement
  • power outage on the customer's site
  • a fiber cut
  • our router going down altogether <-- I want to inhibit the tens or even hundreds of alerts per router
  • a million other things

@brian-brazil
Contributor

As explained above, what you want is already possible with existing features. If you try to make cause-based alerting work, you have to expect it to take extra work.

@debugloop
Author

I understand your reasoning in not wanting to add unnecessary features and can accept you saying so, although I find these existing solutions cumbersome for the reasons outlined above.

At the same time, I'm not sure you considered the actual points I was making, because you keep trying to erode the validity of my use case instead of addressing said points. Could you explain to me what you think a symptom is in a service provider setting, if not the availability of a customer interface? Why do you think suppressing potentially tens to hundreds of InterfaceDown (and thus CustomerDown) alerts when their upstream router is down is too intricate an inhibition rule?

@krzee

krzee commented Oct 26, 2022

I would like this too. Here's my use case:
I have a global full mesh with custom metrics of the last keepalive across the mesh. instance is the node reporting the metric and peer is the node on the other side of the connection. I have an alert for when the same peer has >5 instances alerting on it. When that fires, I want to inhibit the individual alerts whose instance matches the inhibited peer. Right now I can inhibit the single alerts on that peer, but I would also like to inhibit any alert whose instance is already the subject of the peer >5 alert, i.e. the instance label would have to equal peer for this inhibit.
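With the option A syntax proposed above, that would presumably read something like this (the alert names here are made up for illustration):

- source_match:
    alertname: PeerDownAcrossMesh
  target_match:
    alertname: KeepaliveMissing
  equal:
  - source_label: peer
    target_label: instance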

@genofire

genofire commented Jul 14, 2023

No solution for this problem? ;(

I have alerts for all of the following and would like to see only the one for r1:

probe_success{instance="10.0.0.1", hostname="r1"}
probe_success{instance="10.0.1.1", hostname="r2", remote="r1"}
probe_success{instance="10.0.2.1", hostname="r3", remote="r2"}

@fatpat

fatpat commented Jul 16, 2024

👍

We have the same kind of use case; it would be far easier than the workaround we have to use.

I'd like to be able to compare different labels, but also to match only part of a label using a regex.

Something like this (the syntax is just an idea):

   - source_matchers:
       - alertname="ServerDown"
       - host=~"(.+)"
     target_matchers:
       - alertname="kafka_stream_not_consuming"
       - kafka_stream=~"my_stream_with_a_name_including_(.+)_a_host_identifier"
     equality:
       - source: host
         target: kafka_stream_$1
