Fix resolve notifications #703

Merged: 1 commit, Apr 17, 2017

Conversation

@brancz (Member) commented Apr 11, 2017

This PR introduces a fundamental change to the protobufs gossiped, because the optimization we implemented for the 0.5 release did not turn out to work as expected. Instead we now gossip the entire list of previously firing and resolved alerts, so if the currently firing alerts are a subset of the alerts that were firing at the last notification, we can deduplicate. Resolved notifications work the same way: if the currently resolved alerts are a subset of those that have already been notified about, they are deduplicated. Consequently, if the currently firing or resolved alerts are not a subset of the previous sets, they need to be notified about (after the group_interval). This means the payload that is gossiped grows significantly; however, I decided against premature optimization and would rather benchmark this implementation if a load problem actually comes up.
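To make the dedup rule concrete, here is a minimal sketch (hypothetical names, not the exact code in this PR) of how the subset checks could drive the decision to notify, assuming the nflog entry records the hashes of the alerts that were firing and resolved at the last notification. IsFiringSubset is the method added in this PR; IsResolvedSubset is assumed to exist as its counterpart.

// needsUpdate reports whether the group has to be notified about again, given
// the sets of currently firing and resolved alert hashes and the nflog entry
// from the last notification. Hypothetical sketch only.
func needsUpdate(entry *nflogpb.Entry, firing, resolved map[uint64]struct{}) bool {
	if entry == nil {
		// Never notified about this group before: notify if anything is firing.
		return len(firing) > 0
	}
	// If either current set contains an alert hash that was not part of the
	// last notification, a new notification is due (after group_interval).
	return !entry.IsFiringSubset(firing) || !entry.IsResolvedSubset(resolved)
}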

I wanted to get this out for the people who have been struggling with this issue for a while to test. As this is quite a fundamental change and touches most of the critical parts of the notification pipeline, I would like to test it a bit more before considering it ready to merge. /cc @ddewaele @tsboj17 @mattbostock

Fixes #611 and #598

@fabxc @stuartnelson3 @mxinden

@fabxc (Contributor) left a comment

Semantically seems fine. Some code comments.

bool resolved = 4;
// Timestamp of the succeeding notification.
google.protobuf.Timestamp timestamp = 5;
// FiringAlerts list of hashes of firing alerts at last notification.
Contributor:

"is a list... at the last notification time"

@@ -0,0 +1,30 @@
package nflogpb
Contributor:

License header missing

Contributor:

Given that this is just 3 small functions I'd probably not even make it a separate file.

Member Author:

The only reason I'm making this a separate file is so it doesn't get overwritten when regenerating from the proto.

@@ -0,0 +1,30 @@
package nflogpb

func (m *Entry) IsFiringSubset(subset map[uint64]struct{}) bool {
Contributor:

Exported method needs docs.
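For illustration, a documented version might read as follows (only a sketch; the doc comment wording is the editor's, and the body shows one possible implementation given that Entry.FiringAlerts is a []uint64 of alert hashes):

// IsFiringSubset returns whether the given set of alert hashes is a subset of
// the firing alerts recorded in this entry at the last notification.
func (m *Entry) IsFiringSubset(subset map[uint64]struct{}) bool {
	firing := make(map[uint64]struct{}, len(m.FiringAlerts))
	for _, h := range m.FiringAlerts {
		firing[h] = struct{}{}
	}
	for h := range subset {
		if _, ok := firing[h]; !ok {
			return false
		}
	}
	return true
}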


for _, labelName := range labelNames {
b = append(b, string(labelName)...)
b = append(b, sep)
Contributor:

Let's make sep a const within this function body as it's not used elsewhere.

Member Author:

done
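The resulting shape would be roughly this (a sketch with a simplified, hypothetical function; the actual value of sep in the PR is not shown here and is assumed to be a label separator byte):

func hashLabelNames(names []string) []byte {
	// sep is only used inside this function, so it is declared as a local const.
	const sep = '\xff'

	b := make([]byte, 0, 1024)
	for _, name := range names {
		b = append(b, name...)
		b = append(b, sep)
	}
	return b
}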

notify/notify.go Outdated
xsum[i] ^= b
}
func hashAlert(a *types.Alert) uint64 {
b := make([]byte, 0, 1024)
Contributor:

Can we get that one from a sync.Pool? Prometheus doesn't have one because it mostly shortcuts around the hash. (We probably still want to add it there eventually as well, though, as the QE hashes a lot.)
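A pooled scratch buffer could look roughly like this (a sketch with hypothetical names; the real hashAlert serializes the alert's labels, and the hash function used in the PR may differ):

import (
	"hash/fnv"
	"sync"
)

// hashBuffers pools the scratch buffers so hashing does not allocate 1kB per call.
var hashBuffers = sync.Pool{
	New: func() interface{} { return make([]byte, 0, 1024) },
}

func hashParts(parts []string) uint64 {
	b := hashBuffers.Get().([]byte)[:0]
	for _, p := range parts {
		b = append(b, p...)
		b = append(b, 0xff) // separator
	}
	h := fnv.New64a() // placeholder hash; the PR's choice may differ
	h.Write(b)
	hashBuffers.Put(b) // return the (possibly grown) buffer to the pool
	return h.Sum64()
}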

// If we haven't notified about the alert group before, notify right away
// unless we only have resolved alerts.
if entry == nil {
return !resolved, nil
return ((len(firing) > 0) || (n.sendResolved && len(resolved) > 0)), nil
Contributor:

Wondering here. I remember adding explicit logic so that we don't send resolved notifications for alerts we never notified about as firing in the first place.

But maybe just keep it as it is here; it might be the better choice in case of doubt, e.g. if some other AM notified something as firing but couldn't propagate the entry via the mesh and then died.

notify/notify.go Outdated

ctx = WithNotificationHash(ctx, hash)
Contributor:

This context entry is completely gone now. The related function and const should be removed as well.

But we are redoing the whole hashing we have done before. Wouldn't it make sense to pass the hashes through the context to the SetNotifies stage?
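If the hashes were passed via the context instead, helpers along these lines could work (purely a sketch with hypothetical names, assuming the standard context package; the PR may resolve this differently):

import "context"

type ctxKey int

const keyFiringAlerts ctxKey = iota

// withFiringAlerts stores the hashes of the currently firing alerts in the
// context so that a later stage (such as the one calling SetNotifies) can
// reuse them instead of rehashing.
func withFiringAlerts(ctx context.Context, firing []uint64) context.Context {
	return context.WithValue(ctx, keyFiringAlerts, firing)
}

// firingAlertsFromContext retrieves the hashes stored by withFiringAlerts.
func firingAlertsFromContext(ctx context.Context) ([]uint64, bool) {
	v, ok := ctx.Value(keyFiringAlerts).([]uint64)
	return v, ok
}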

hash: []byte{2, 3, 4},
res: true,
entry: &nflogpb.Entry{FiringAlerts: []uint64{1, 2, 3}},
firingAlerts: map[uint64]struct{}{2: struct{}{}, 3: struct{}{}, 4: struct{}{}},
Contributor:

Add a small constructor function (inlined in the test function or elsewhere) to construct those sets? It probably can't hurt to add an alertHashSet helper type hiding the awkwardness of map[uint64]struct{}.
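Such a helper could be as small as this (sketch):

// alertHashSet hides the awkward map[uint64]struct{} representation of a set
// of alert hashes used in these tests.
type alertHashSet map[uint64]struct{}

// newAlertHashSet builds a set from the given alert hashes.
func newAlertHashSet(hashes ...uint64) alertHashSet {
	s := make(alertHashSet, len(hashes))
	for _, h := range hashes {
		s[h] = struct{}{}
	}
	return s
}

The table entry above would then read firingAlerts: newAlertHashSet(2, 3, 4).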

s.nflog = &testNflog{
qerr: nflog.ErrNotFound,
qres: []*nflogpb.Entry{
{
FiringAlerts: []uint64{uint64(0), uint64(1), uint64(2)},
Contributor:

The explicit uint64()s here should not be necessary.
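That is, because the slice's element type is already uint64, untyped integer literals convert implicitly:

FiringAlerts: []uint64{0, 1, 2},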

Building a hash over an entire set of alerts causes problems, because the hash differs on any change, whereas we only want to send notifications if an alert and its state have changed. Therefore this introduces a list of alerts that are active and a list of alerts that are resolved. If the currently active alerts of a group are a subset of the ones that have been notified about before, they are deduplicated. Resolved notifications work the same way, with a separate list of resolved notifications that have already been sent.
@brancz (Member Author) commented Apr 13, 2017

Applied all your comments @fabxc. Care for a re-review? 🙂

@mattbostock (Contributor):
Is this planned to be released/backported to 0.5.x please?

@fabxc (Contributor) commented Apr 24, 2017

Currently pondering whether to release it now or wait a bit longer to try to get a first version of the new UI in. I suppose you prefer the fix sooner?

In general, there's nothing against doing two releases in close succession.

@stuartnelson3 (Contributor):
I vote for getting the fix out, and then we just make another release when the UI is ready. Last release was back in November, and there have been many additions since then.

@mattbostock (Contributor) commented Apr 24, 2017

@fabxc Sooner rather than later would be appreciated as this bug has the potential to create a lot of noise.

#717 (solving #609) is also an important fix for us, though I think the current implementation may preclude backporting it.

@mattbostock (Contributor) commented Apr 26, 2017

Looks like this PR is included in 0.6.0, though it isn't mentioned in the release description or the CHANGELOG:
https://github.com/prometheus/alertmanager/releases/tag/v0.6.0

Thanks!

@damomurf commented May 5, 2017

Is it actually included in the 0.6.0 release? I'm seeing a "resolved" notification that keeps firing indefinitely on 0.6.1.

@fabxc (Contributor) commented May 5, 2017

@damomurf can you describe the scenario more precisely? Those things are hard to debug without a detailed timeline of when alerts were flowing in and what the exact notification behavior was.

A reproducible scenario is ideal, of course.

@damomurf commented May 5, 2017

Sure @fabxc, I appreciate it's difficult to solve without being able to reproduce.

So currently I have an "alert resolved" message being fired to Slack every three hours. The actual alert was resolved days ago, and if I check the current alerts in either Alertmanager or Prometheus, both are empty.

[RESOLVED] ElasticSearchClusterStatus account (/monitoring/es-exporter hostnane:port marathon-discovery-aws mesos-monitor)

The same message has been regularly sent for several days now. I even stopped Alertmanager, deleted its data directory and restarted, to no avail. Using Alertmanager 0.6.1 and Prometheus 1.6.1.

@fabxc (Contributor) commented May 5, 2017

I assume 3 hours is your configured repeat interval. Restarting the AMs should solve it (though that's obviously not a viable solution). I'll try to reproduce it later, but it sounds like some edge case went wrong and it could be hard to hit manually again.

@damomurf commented May 5, 2017

That's correct, 3 hours is the repeat interval. This is a single instance of Alertmanager, if that helps.

Development

Successfully merging this pull request may close these issues.

Alertmanager sends the resolved messages continuously