Fix resolve notifications #703

Merged: 1 commit, Apr 17, 2017

Conversation

@brancz (Member) commented Apr 11, 2017

This PR introduces a fundamental change to the protobufs gossiped, because the optimization we implemented for the 0.5 release did not turn out to work as expected. Instead we now gossip the entire list of previously firing and resolved alerts, so if the currently firing alerts are a subset of the alerts that were firing at the last notification, we can deduplicate. Resolved notifications work the same way: if the currently resolved alerts are a subset of those that have already been notified about, they are deduplicated. Consequently, if the currently firing or resolved alerts are not a subset of the previous sets, they need to be notified about (after the group_interval). This means the payload that is gossiped grows significantly; however, I decided against premature optimization and would rather benchmark this implementation if a load problem actually comes up.
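To make the dedup rule concrete, here is a minimal sketch (hypothetical names, not the exact code in this PR) of how the subset checks could drive the decision to notify, assuming the nflog entry records the hashes of the alerts that were firing and resolved at the last notification. IsFiringSubset is the method added in this PR; IsResolvedSubset is assumed to exist as its counterpart.

// needsUpdate reports whether the group has to be notified about again, given
// the sets of currently firing and resolved alert hashes and the nflog entry
// from the last notification. Hypothetical sketch only.
func needsUpdate(entry *nflogpb.Entry, firing, resolved map[uint64]struct{}) bool {
	if entry == nil {
		// Never notified about this group before: notify if anything is firing.
		return len(firing) > 0
	}
	// If either current set contains an alert hash that was not part of the
	// last notification, a new notification is due (after group_interval).
	return !entry.IsFiringSubset(firing) || !entry.IsResolvedSubset(resolved)
}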

I wanted to get this out for the people who have been struggling with this issue for a while to test. As this is quite a fundamental change and touches most of the critical parts of the notification pipeline, I would like to test it a bit more before considering it ready to merge. /cc @ddewaele @tsboj17 @mattbostock

Fixes #611 and #598

@fabxc @stuartnelson3 @mxinden

@fabxc (Contributor) left a comment

Semantically seems fine. Some code comments.

bool resolved = 4;
// Timestamp of the succeeding notification.
google.protobuf.Timestamp timestamp = 5;
// FiringAlerts list of hashes of firing alerts at last notification.
Contributor:

"is a list... at the last notification time"

@@ -0,0 +1,30 @@
package nflogpb
Contributor:

License header missing

Contributor:

Given that this is just 3 small functions I'd probably not even make it a separate file.

Member Author:

The only reason I'm making this a separate file is so it doesn't get overwritten when regenerating from the proto.

@@ -0,0 +1,30 @@
package nflogpb

func (m *Entry) IsFiringSubset(subset map[uint64]struct{}) bool {
Contributor:

Exported method needs docs.
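For illustration, a documented version might read as follows (only a sketch; the doc comment wording is the editor's, and the body shows one possible implementation given that Entry.FiringAlerts is a []uint64 of alert hashes):

// IsFiringSubset returns whether the given set of alert hashes is a subset of
// the firing alerts recorded in this entry at the last notification.
func (m *Entry) IsFiringSubset(subset map[uint64]struct{}) bool {
	firing := make(map[uint64]struct{}, len(m.FiringAlerts))
	for _, h := range m.FiringAlerts {
		firing[h] = struct{}{}
	}
	for h := range subset {
		if _, ok := firing[h]; !ok {
			return false
		}
	}
	return true
}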


for _, labelName := range labelNames {
b = append(b, string(labelName)...)
b = append(b, sep)
Contributor:

Let's make sep a const within this function body as it's not used elsewhere.

Member Author:

done
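The resulting shape would be roughly this (a sketch with a simplified, hypothetical function; the actual value of sep in the PR is not shown here and is assumed to be a label separator byte):

func hashLabelNames(names []string) []byte {
	// sep is only used inside this function, so it is declared as a local const.
	const sep = '\xff'

	b := make([]byte, 0, 1024)
	for _, name := range names {
		b = append(b, name...)
		b = append(b, sep)
	}
	return b
}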

notify/notify.go Outdated
xsum[i] ^= b
}
func hashAlert(a *types.Alert) uint64 {
b := make([]byte, 0, 1024)
Contributor:

Can we get that one from a sync.Pool? Prometheus doesn't have one because it mostly shortcuts around the hash. (We probably still want to add it there eventually as well, though, as the QE hashes a lot.)
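A pooled scratch buffer could look roughly like this (a sketch with hypothetical names; the real hashAlert serializes the alert's labels, and the hash function used in the PR may differ):

import (
	"hash/fnv"
	"sync"
)

// hashBuffers pools the scratch buffers so hashing does not allocate 1kB per call.
var hashBuffers = sync.Pool{
	New: func() interface{} { return make([]byte, 0, 1024) },
}

func hashParts(parts []string) uint64 {
	b := hashBuffers.Get().([]byte)[:0]
	for _, p := range parts {
		b = append(b, p...)
		b = append(b, 0xff) // separator
	}
	h := fnv.New64a() // placeholder hash; the PR's choice may differ
	h.Write(b)
	hashBuffers.Put(b) // return the (possibly grown) buffer to the pool
	return h.Sum64()
}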

// If we haven't notified about the alert group before, notify right away
// unless we only have resolved alerts.
if entry == nil {
return !resolved, nil
return ((len(firing) > 0) || (n.sendResolved && len(resolved) > 0)), nil
Contributor:

Wondering here. I remember adding explicit logic so that we don't send resolved notifications for alerts we never notified about as firing in the first place.

But maybe just keep it as it is here; it might be the better choice in case of doubt, e.g. if some other AM notified something as firing but couldn't propagate the entry via the mesh and then died.

notify/notify.go Outdated

ctx = WithNotificationHash(ctx, hash)
Contributor:

This context entry is completely gone now. The related function and const should be removed as well.

But we are redoing the whole hashing we have done before. Wouldn't it make sense to pass the hashes through the context to the SetNotifies stage?
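If the hashes were passed via the context instead, helpers along these lines could work (purely a sketch with hypothetical names, assuming the standard context package; the PR may resolve this differently):

import "context"

type ctxKey int

const keyFiringAlerts ctxKey = iota

// withFiringAlerts stores the hashes of the currently firing alerts in the
// context so that a later stage (such as the one calling SetNotifies) can
// reuse them instead of rehashing.
func withFiringAlerts(ctx context.Context, firing []uint64) context.Context {
	return context.WithValue(ctx, keyFiringAlerts, firing)
}

// firingAlertsFromContext retrieves the hashes stored by withFiringAlerts.
func firingAlertsFromContext(ctx context.Context) ([]uint64, bool) {
	v, ok := ctx.Value(keyFiringAlerts).([]uint64)
	return v, ok
}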

hash: []byte{2, 3, 4},
res: true,
entry: &nflogpb.Entry{FiringAlerts: []uint64{1, 2, 3}},
firingAlerts: map[uint64]struct{}{2: struct{}{}, 3: struct{}{}, 4: struct{}{}},
Contributor:

Add a small constructor function (inlined in the test function or elsewhere) to construct those sets? It probably can't hurt to add an alertHashSet helper type hiding the awkwardness of map[uint64]struct{}.
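Such a helper could be as small as this (sketch):

// alertHashSet hides the awkward map[uint64]struct{} representation of a set
// of alert hashes used in these tests.
type alertHashSet map[uint64]struct{}

// newAlertHashSet builds a set from the given alert hashes.
func newAlertHashSet(hashes ...uint64) alertHashSet {
	s := make(alertHashSet, len(hashes))
	for _, h := range hashes {
		s[h] = struct{}{}
	}
	return s
}

The table entry above would then read firingAlerts: newAlertHashSet(2, 3, 4).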

s.nflog = &testNflog{
qerr: nflog.ErrNotFound,
qres: []*nflogpb.Entry{
{
FiringAlerts: []uint64{uint64(0), uint64(1), uint64(2)},
Contributor:

The explicit uint64()s here should not be necessary.
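That is, because the slice's element type is already uint64, untyped integer literals convert implicitly:

FiringAlerts: []uint64{0, 1, 2},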

Building a hash over an entire set of alerts causes problems, because the hash differs on any change, whereas we only want to send notifications if an alert and its state have changed. Therefore this introduces a list of alerts that are active and a list of alerts that are resolved. If the currently active alerts of a group are a subset of the ones that have been notified about before, they are deduplicated. Resolved notifications work the same way, with a separate list of resolved notifications that have already been sent.
@brancz (Member Author) commented Apr 13, 2017

Applied all your comments @fabxc. Care for a re-review? 🙂

@mattbostock (Contributor):
Is this planned to be released/backported to 0.5.x please?

@fabxc (Contributor) commented Apr 24, 2017

Currently pondering whether to release it now or wait a bit longer to try to get a first version of the new UI in. I suppose you prefer the fix sooner?

In general, there's nothing against doing two releases in close succession.

@stuartnelson3 (Contributor):
I vote for getting the fix out, and then we just make another release when the UI is ready. Last release was back in November, and there have been many additions since then.

@mattbostock (Contributor) commented Apr 24, 2017

@fabxc Sooner rather than later would be appreciated as this bug has the potential to create a lot of noise.

#717 (solving #609) is also an important fix for us, though I think the current implementation may preclude backporting it.

@mattbostock (Contributor) commented Apr 26, 2017

Looks like this PR is included in 0.6.0, though it isn't mentioned in the release description or the CHANGELOG:
https://github.com/prometheus/alertmanager/releases/tag/v0.6.0

Thanks!

@damomurf commented May 5, 2017

Is it actually included in the 0.6.0 release? I'm seeing a "resolved" notification that keeps firing indefinitely on 0.6.1.

@fabxc (Contributor) commented May 5, 2017

@damomurf can you describe the scenario more precisely? Those things are hard to debug without a detailed timeline of when alerts were flowing in and what the exact notification behavior was.

A reproducible scenario is ideal, of course.

@damomurf commented May 5, 2017

Sure @fabxc, I appreciate it's difficult to solve without being able to reproduce.

So currently I have an "alert resolved" message being fired to Slack every three hours. The actual alert was resolved days ago, and if I check the current alerts in either Alertmanager or Prometheus, both are empty.

[RESOLVED] ElasticSearchClusterStatus account (/monitoring/es-exporter hostnane:port marathon-discovery-aws mesos-monitor)

The same message has been regularly sent for several days now. I even stopped Alertmanager, deleted its data directory and restarted, to no avail. Using Alertmanager 0.6.1 and Prometheus 1.6.1.

@fabxc (Contributor) commented May 5, 2017

I assume 3 hours is your configured repeat interval. Restarting the AMs should solve it (though that's obviously not a viable solution). I'll try to reproduce it later, but it sounds like some edge case went wrong and it could be hard to hit manually again.

@damomurf commented May 5, 2017

That's correct, 3 hours is the repeat interval. This is a single instance of Alertmanager, if that helps.

Development

Successfully merging this pull request may close these issues.

Alertmanager sends the resolved messages continuously