Inhibit Rules should be able to consider different labels in their equal statement #2254
Comments
If you wish to do something intricate like this, why not adjust your labels using `alert_relabel_configs`?
Yes, we thought about something along these lines. However, this leaves people wondering what the meaning of this label on the alert is. I cannot map one label to the other, as both labels are needed, hence the need for a helper label. It is also somewhat error prone, as I don't think I can limit the action to a specific type of alert (maybe by using a chain of labelmap rules). I feel that checking equality of two different labels is a very natural feature for inhibit rules, independent of the intricacy of my example. At the least, the alternative using relabeling should be noted in the inhibit rules documentation.
Label names are meant to have one specific meaning; if you find yourself trying to match label name A with label name B via any means, be that PromQL or inhibition, that implies that something may not be quite right with your label taxonomy.
Both labels have one specific meaning on a RouterDown alert. It really is not so intricate either; it is basically the same as the very blunt prevention you've described. If an entire datacenter network goes kaputt, the links leading there will go down and will thus trigger alerts I intend to manage using the Alertmanager. In my opinion, having Prometheus sort this out using the relabel configs described above just creates an additional place of configuration for alert management, which is not within the Alertmanager. In addition, it seems somewhat hacky to me to create labels just for matching on, when there could be a mechanism that easily achieves the same result without crutches while being succinct, to the point, and in the software you'd expect it in.
A router or interface going down is a cause, not a symptom. A symptom would be users no longer being able to get to the website behind the router.
We are a service provider; keeping customer interfaces online is the only symptom. The cause could be any of a number of things.
As explained above, what you want is already possible with existing features. If you try to make cause-based alerting work, you have to expect it to take extra work.
I understand your reasoning in not wanting to add unnecessary features and can accept you saying so, although I find these existing solutions cumbersome for the reasons outlined above. At the same time, I'm not sure you considered the actual points I was making, because you keep trying to erode the validity of my use case instead of addressing said points. Could you explain to me what you think a symptom is in a service provider setting, if not the availability of a customer interface? Why do you think suppressing potentially tens to hundreds of InterfaceDown (and thus CustomerDown) alerts when their upstream router is down is too intricate an inhibition rule?
I would like this too. Here's my use case:
No solution for this problem? ;( I have alerts for several routers and would like to see only r1:
👍 We have the same kind of use case; it would be far easier than the workaround we have to use. I'd like to be able to compare different labels, but also part of a label using a regex. Something like (idea of syntax as-is):

```yaml
- source_matchers:
    - alertname="ServerDown"
    - host=~"(.+)"
  target_matchers:
    - alertname="kafka_stream_not_consuming"
    - kafka_stream=~"my_stream_with_a_name_including_(.+)_a_host_identifier"
  equality:
    - source: host
      target: kafka_stream_$1
```
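To make the proposed semantics concrete, here is a small Python sketch of what the capture-group comparison could mean. This is purely illustrative pseudocode for the idea, not Alertmanager code; the function name and data shapes are assumptions:

```python
import re

def cross_label_match(source_alert, target_alert, source_label,
                      target_label, target_pattern):
    """Hypothetical check: extract a capture group from the target alert's
    label value and compare it against the source alert's label value."""
    m = re.fullmatch(target_pattern, target_alert.get(target_label, ""))
    if not m or not m.groups():
        return False
    return m.group(1) == source_alert.get(source_label)

source = {"alertname": "ServerDown", "host": "web01"}
target = {"alertname": "kafka_stream_not_consuming",
          "kafka_stream": "my_stream_with_a_name_including_web01_a_host_identifier"}

print(cross_label_match(source, target, "host", "kafka_stream",
                        r"my_stream_with_a_name_including_(.+)_a_host_identifier"))
# → True: under the proposed rule, the target alert would be inhibited
```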
While the title basically says it all, I will try to back this up with a concrete example. Imagine a setup where, in addition to all routers being a target for some type of `blackbox_exporter`-style metrics, an additional source of data is used to generate JSON target lists for `file_sd`. In my example, this additional source could be a router configuration backup, which gives us a definitive list of link addresses configured on any interface of any router, as well as any configured metadata for each interface. It is trivial to build a JSON file containing the targets (all link addresses configured locally) with the relevant labels:
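A minimal sketch of what such a generated `file_sd` target file could look like; the router names, addresses, and label names here are assumptions for illustration, not taken from the original setup:

```json
[
  {
    "targets": ["10.0.0.1", "10.0.0.5"],
    "labels": {
      "router": "r1",
      "region": "dc1",
      "job": "ping-router-interface"
    }
  }
]
```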
These metrics could end up looking like this (`10.0.0.0/8` addresses are link addresses, `192.168.0.0/16` addresses are loopbacks):

Alerting in the most obvious way would create alerts similar to the job names, for instance a `RouterDown` alert with the expression `probe_success{job="ping-router-loopback"} == 0`
. I would obviously want the following inhibition rule:

This would inhibit the alert informing me that an interface is down on a router which is already being alerted as down itself. I would, however, like to go one step further with an inhibit rule such as the following one, as it comes as no surprise that any interface adjacent to the downed router will go down in turn, even though it is on another router, in another region, or whatever.
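The two rules described above might be sketched as follows. The first uses the Alertmanager's existing `equal` field; the second uses a hypothetical cross-label syntax that does not exist today. Alert and label names are assumptions:

```yaml
inhibit_rules:
  # Existing functionality: inhibit InterfaceDown alerts on a router
  # that is already alerting as RouterDown.
  - source_matchers:
      - alertname="RouterDown"
    target_matchers:
      - alertname="InterfaceDown"
    equal: [router]

  # Hypothetical syntax for the feature requested here: compare the
  # source alert's `router` label against the target alert's
  # `remote_router` label. NOT valid Alertmanager configuration.
  - source_matchers:
      - alertname="RouterDown"
    target_matchers:
      - alertname="InterfaceDown"
    equal_labels:
      - source: router
        target: remote_router
```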
While the proposed syntax variants are just a general idea, I feel that this should be possible in some way which does not involve hacking around with the underlying alert expressions.