Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ignore expired silences OnGossip #999

Conversation

josedonizetti
Copy link
Contributor

This will fix the bug of resync deleted silences due to the state of other peers. More context on #996

@stuartnelson3
Copy link
Contributor

Thanks for the PR! I will have to think about this, as it opens up new issues: some people like to configure their data retention time high so that they can view and recreate silences from expired ones. By not gossiping expired silences, there will be inconsistency between the alertmanagers in a mesh.

@josedonizetti
Copy link
Contributor Author

josedonizetti commented Sep 21, 2017

@stuartnelson3

This isn't about ignoring gossiped expired silences, those will still exist in all HA, no inconsistency on this part. The PR ignores only GCed expired silences. I've started to test this after seen this issue #996.

The test was done on current master.

What I did:

  1. decrease the data retention to a smaller window (5 min)
  2. decrease the GC ticker to a smaller window (3 min)
  3. start 3 instances of HA
  4. send a few alerts
  5. silence those alerts
  6. expire those silences
  7. wait to see if the GC would clean them

What I saw:

The GC would run on a specific instance 'A' and would clean the expired silences. But just a few seconds later the expired silences would reappear in 'A' due to a gossiped of another instance 'B' or 'C', where the GC hadn't run.

Let me know if there more information I can gather to help out on this issue.

@beorn7
Copy link
Member

beorn7 commented Sep 26, 2017

@stuartnelson3 are you still up to reviewing this?

The not-happening GC is hurting us quite a lot by now, so I'm tempted to work on this myself, but I don't want to re-do work already done or work redundantly on something that is already being worked on behind the scenes.

@stuartnelson3
Copy link
Contributor

Feel free to grab it, I was forced to have a weekend to myself and haven't touched it :)

Copy link
Contributor

@stuartnelson3 stuartnelson3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay!

Thank you very much for the PR, and for clarifying to me how it's working :) I made a couple small requests on the comments to ease reading for myself and future readers, but once those are changed it'll be merged.

@@ -831,8 +832,8 @@ func (gd *gossipData) Merge(other mesh.GossipData) mesh.GossipData {
return gd
}

// mergeDelta behaves like Merge but returns a gossipData only
// containing things that have changed.
// mergeDelta behaves like Merge but ignore expired silences, and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/s/ignore/ignores/

@@ -843,6 +844,11 @@ func (gd *gossipData) mergeDelta(od *gossipData) *gossipData {
defer gd.mtx.Unlock()

for id, s := range od.data {
// Only merge silence if not expired
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you expand on this comment a bit? I know I'm going to come back to it in the future and need to refresh myself. Something along the lines of:

If a gossiped silence is expired, skip it. For a given silence duration exceeding a few minutes, active silences will have already been gossiped. Once the active silence is gossiped, its expiration should happen more or less simultaneously on the different alertmanager nodes. Preventing the gossiping of expired silences allows them to be GC'd, and doesn't affect consistency across the mesh.

This will fix the bug of resync deleted silences
due to the state of other peers.
@josedonizetti josedonizetti force-pushed the ignore-expired-silences-on-gossip branch from 3ac3929 to 097a4b7 Compare September 27, 2017 15:47
Copy link
Contributor

@stuartnelson3 stuartnelson3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@stuartnelson3 stuartnelson3 merged commit 9449bd1 into prometheus:master Sep 28, 2017
@josedonizetti josedonizetti deleted the ignore-expired-silences-on-gossip branch September 28, 2017 14:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants