Fix resolve notifications #703
Conversation
Semantically seems fine. Some code comments.
nflog/nflogpb/nflog.proto
Outdated
bool resolved = 4;
// Timestamp of the succeeding notification.
google.protobuf.Timestamp timestamp = 5;
// FiringAlerts list of hashes of firing alerts at last notification.
"is a list... at the last notification time"
@@ -0,0 +1,30 @@
package nflogpb
License header missing
Given that this is just 3 small functions I'd probably not even make it a separate file.
Only reason I'm making this a separate file is so it doesn't get overridden when regenerating from the proto.
nflog/nflogpb/set.go
Outdated
@@ -0,0 +1,30 @@
package nflogpb

func (m *Entry) IsFiringSubset(subset map[uint64]struct{}) bool { |
Exported method needs docs.
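For reference, the subset check such a method performs can be sketched as follows. This is an illustrative standalone version, not the PR's actual implementation; the helper name and the direction of the check (whether `sub` is contained in `set`) are assumptions:

```go
package main

import "fmt"

// isSubset reports whether every element of sub is also present in set.
// Hash sets are represented as map[uint64]struct{}, as in the PR.
func isSubset(set, sub map[uint64]struct{}) bool {
	for h := range sub {
		if _, ok := set[h]; !ok {
			return false
		}
	}
	return true
}

func main() {
	set := map[uint64]struct{}{1: {}, 2: {}, 3: {}}
	fmt.Println(isSubset(set, map[uint64]struct{}{2: {}, 3: {}})) // true
	fmt.Println(isSubset(set, map[uint64]struct{}{4: {}}))        // false
}
```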
for _, labelName := range labelNames {
	b = append(b, string(labelName)...)
	b = append(b, sep)
Let's make sep a const within this function body as it's not used elsewhere.
done
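The resulting pattern might look like the sketch below: the separator byte is scoped as a const inside the hashing function, and names and values are concatenated before hashing. FNV-1a from the standard library stands in for whatever hash function the code actually uses; the function name and signature are illustrative assumptions:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashLabels concatenates label names and values with a separator byte
// and hashes the result. The separator is a const local to the function,
// per the review comment, since it is not used elsewhere.
func hashLabels(labels map[string]string, names []string) uint64 {
	const sep = '\xff' // cannot appear inside a valid label name or value
	b := make([]byte, 0, 1024)
	for _, name := range names {
		b = append(b, name...)
		b = append(b, sep)
		b = append(b, labels[name]...)
		b = append(b, sep)
	}
	h := fnv.New64a()
	h.Write(b)
	return h.Sum64()
}

func main() {
	labels := map[string]string{"alertname": "HighLoad", "job": "node"}
	fmt.Println(hashLabels(labels, []string{"alertname", "job"}))
}
```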
notify/notify.go
Outdated
xsum[i] ^= b
}
func hashAlert(a *types.Alert) uint64 {
	b := make([]byte, 0, 1024)
Can we get that one from a sync.Pool? Prometheus doesn't have it because mostly it does shortcut around the hash. (We probably still want to add it there eventually as well though as the QE hashes a lot.)
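The sync.Pool suggestion would look roughly like this sketch: the scratch buffer is fetched from a pool, reset to zero length, and returned after use, so repeated hash calls stop allocating. The wrapper function is a hypothetical illustration, not code from the PR:

```go
package main

import (
	"fmt"
	"sync"
)

// bufPool hands out reusable 1024-byte scratch buffers so that each
// hashAlert-style call does not allocate a fresh slice.
var bufPool = sync.Pool{
	New: func() interface{} { return make([]byte, 0, 1024) },
}

// withBuffer borrows a buffer, lets fill write into it, returns the
// written length, and puts the buffer back for reuse.
func withBuffer(fill func(b []byte) []byte) int {
	b := bufPool.Get().([]byte)[:0] // reset length, keep capacity
	b = fill(b)
	n := len(b)
	bufPool.Put(b) // return buffer to the pool
	return n
}

func main() {
	n := withBuffer(func(b []byte) []byte {
		return append(b, "alertname=Foo"...)
	})
	fmt.Println(n) // 13
}
```

Note that pooled buffers must always be length-reset before reuse, otherwise data from a previous alert would leak into the next hash.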
// If we haven't notified about the alert group before, notify right away
// unless we only have resolved alerts.
if entry == nil {
	return !resolved, nil
	return ((len(firing) > 0) || (n.sendResolved && len(resolved) > 0)), nil
Wondering here. I remember providing explicit logic about not sending resolved notifications for alerts we never notified about as firing in the first place.
But maybe just keep it as it is here, when in doubt this might be better, in case some other AM notified something as firing but couldn't propagate the entry via the mesh and then died.
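The condition from the diff above can be read in isolation as a small decision function: notify when anything is firing, or when resolved notifications are enabled and something resolved. The function and its scalar parameters are a sketch standing in for the real pipeline state:

```go
package main

import "fmt"

// needsNotify mirrors the no-previous-entry case from the diff:
// send a notification if any alerts are firing, or if resolved
// notifications are enabled and some alerts have resolved.
func needsNotify(firing, resolved int, sendResolved bool) bool {
	return firing > 0 || (sendResolved && resolved > 0)
}

func main() {
	fmt.Println(needsNotify(2, 0, false)) // firing alerts: notify
	fmt.Println(needsNotify(0, 1, false)) // only resolved, sendResolved off: skip
	fmt.Println(needsNotify(0, 1, true))  // only resolved, sendResolved on: notify
}
```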
notify/notify.go
Outdated
ctx = WithNotificationHash(ctx, hash) |
This context entry is completely gone now. The related function and const should be removed as well.
But we are redoing the whole hashing we have done before. Wouldn't it make sense to pass the hashes through the context to the SetNotifies stage?
notify/notify_test.go
Outdated
hash: []byte{2, 3, 4},
res: true,
entry: &nflogpb.Entry{FiringAlerts: []uint64{1, 2, 3}},
firingAlerts: map[uint64]struct{}{2: struct{}{}, 3: struct{}{}, 4: struct{}{}},
Add a small constructor function (inlined in the test function or elsewhere) to construct those sets? Probably cannot hurt overall to make an alertHashSet helper type hiding the awkwardness of map[uint64]struct{}.
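Such a helper type might look like the following sketch; the constructor and method names are assumptions for illustration:

```go
package main

import "fmt"

// alertHashSet hides the awkwardness of map[uint64]struct{} in tests,
// as suggested in the review.
type alertHashSet map[uint64]struct{}

// newAlertHashSet builds a set from a list of alert hashes.
func newAlertHashSet(hashes ...uint64) alertHashSet {
	s := make(alertHashSet, len(hashes))
	for _, h := range hashes {
		s[h] = struct{}{}
	}
	return s
}

// has reports whether the hash is in the set.
func (s alertHashSet) has(h uint64) bool {
	_, ok := s[h]
	return ok
}

func main() {
	s := newAlertHashSet(2, 3, 4)
	fmt.Println(s.has(3), s.has(1)) // true false
}
```

With this helper, the test fixture above shrinks to `firingAlerts: newAlertHashSet(2, 3, 4)`.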
notify/notify_test.go
Outdated
s.nflog = &testNflog{
	qerr: nflog.ErrNotFound,
	qres: []*nflogpb.Entry{
		{
			FiringAlerts: []uint64{uint64(0), uint64(1), uint64(2)},
The explicit uint64()s here should not be necessary.
Building a hash over an entire set of alerts causes problems, because the hash differs on any change, whereas we only want to send notifications if an alert or its state has changed. Therefore this introduces a list of alerts that are active and a list of alerts that are resolved. If the currently active alerts of a group are a subset of the ones that have been notified about before, they are deduplicated. Resolved notifications work the same way, with a separate list of resolved notifications that have already been sent.
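The deduplication rule described above can be sketched as a pair of subset checks over hash sets. Function names are illustrative, not the PR's actual identifiers:

```go
package main

import "fmt"

type set = map[uint64]struct{}

// subset reports whether every element of sub is in super.
func subset(sub, super set) bool {
	for h := range sub {
		if _, ok := super[h]; !ok {
			return false
		}
	}
	return true
}

// dedup mirrors the rule from the PR description: suppress the
// notification when the currently firing alerts are a subset of those
// recorded at the last notification, and likewise for resolved alerts.
func dedup(curFiring, prevFiring, curResolved, prevResolved set) bool {
	return subset(curFiring, prevFiring) && subset(curResolved, prevResolved)
}

func main() {
	prevFiring := set{1: {}, 2: {}}
	fmt.Println(dedup(set{1: {}}, prevFiring, set{}, set{})) // true: nothing new
	fmt.Println(dedup(set{3: {}}, prevFiring, set{}, set{})) // false: new firing alert
}
```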
Applied all your comments @fabxc. Care for a re-review? 🙂
Is this planned to be released/backported to 0.5.x please?
Currently pondering whether to release it now or wait a bit more to try to get in a first version of the new UI. I suppose you prefer the fix sooner? In general, nothing speaks against two releases in short proximity.
I vote for getting the fix out, and then we just make another release when the UI is ready. Last release was back in November, and there have been many additions since then.
Looks like this PR is included in 0.6.0, though not mentioned in the release description or the CHANGELOG. Thanks!
Is it actually included in the 0.6.0 release? I'm seeing an infinitely firing "resolved" alert on 0.6.1?
@damomurf can you describe the scenario more precisely? Those things are hard to debug without a detailed timeline of when alerts were flowing in and what the exact notification behavior was. A reproducible scenario is ideal, of course.
Sure @fabxc, I appreciate it's difficult to solve without being able to reproduce. Currently I have an "alert resolved" message being fired to Slack every three hours. The actual alert was resolved days ago, and if I visit the current alerts in either Alertmanager or Prometheus, they're both empty.
The same message has been regularly sent for several days now. I even stopped Alertmanager, deleted its data directory and restarted, to no avail. Using Alertmanager 0.6.1 and Prometheus 1.6.1.
I assume 3 hours is your configured repeat interval. Restarting the AMs should solve it (but is obviously not a viable solution). I will try to reproduce it later, but it sounds like some edge case went wrong and it could be hard to hit manually again.
That's correct, 3 hours is the timeout. This is a single instance of Alertmanager, if that helps?
This PR introduces a fundamental change to the protobufs gossiped, because the previous optimization we implemented for the 0.5 release did not work as expected. Instead we now gossip the entire list of previously firing and resolved alerts: if the currently firing alerts are a subset of the alerts that were firing at the last notification, we can deduplicate. It works the same way for resolved notifications; if the currently resolved alerts are a subset of those that have already been notified about, they are deduplicated. Consequently, if the currently firing or resolved alerts are not a subset of the previous sets, they need to be notified about (after the group_interval). This means the payload that is gossiped increases significantly; however, I decided against a premature optimization and would rather benchmark this implementation if a load problem actually comes up.
I wanted to get this out for people to test, who have been struggling with this issue for a while. As this is quite a fundamental change and touches most of the critical parts of the notification pipeline, I would like to test this a bit before having it ready to merge. /cc @ddewaele @tsboj17 @mattbostock
Fixes #611 and #598
@fabxc @stuartnelson3 @mxinden