-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ignore expired silences OnGossip #999
Ignore expired silences OnGossip #999
Conversation
Thanks for the PR! I will have to think about this, as it opens up new issues: some people like to configure their data retention time high so that they can view and recreate silences from expired ones. By not gossiping expired silences, there will be inconsistency between the alertmanagers in a mesh. |
This isn't about ignoring gossiped expired silences, those will still exist in all HA, no inconsistency on this part. The PR ignores only GCed expired silences. I've started to test this after seen this issue #996. The test was done on current master. What I did:
What I saw: The GC would run on a specific instance 'A' and would clean the expired silences. But just a few seconds later the expired silences would reappear in 'A' due to a gossiped of another instance 'B' or 'C', where the GC hadn't run. Let me know if there more information I can gather to help out on this issue. |
@stuartnelson3 are you still up to reviewing this? The not-happening GC is hurting us quite a lot by now, so I'm tempted to work on this myself, but I don't want to re-do work already done or work redundantly on something that is already being worked on behind the scenes. |
Feel free to grab it, I was forced to have a weekend to myself and haven't touched it :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay!
Thank you very much for the PR, and for clarifying to me how it's working :) I made a couple small requests on the comments to ease reading for myself and future readers, but once those are changed it'll be merged.
silence/silence.go
Outdated
@@ -831,8 +832,8 @@ func (gd *gossipData) Merge(other mesh.GossipData) mesh.GossipData { | |||
return gd | |||
} | |||
|
|||
// mergeDelta behaves like Merge but returns a gossipData only | |||
// containing things that have changed. | |||
// mergeDelta behaves like Merge but ignore expired silences, and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/s/ignore/ignores/
silence/silence.go
Outdated
@@ -843,6 +844,11 @@ func (gd *gossipData) mergeDelta(od *gossipData) *gossipData { | |||
defer gd.mtx.Unlock() | |||
|
|||
for id, s := range od.data { | |||
// Only merge silence if not expired |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you expand on this comment a bit? I know I'm going to come back to it in the future and need to refresh myself. Something along the lines of:
If a gossiped silence is expired, skip it. For a given silence duration exceeding a few minutes, active silences will have already been gossiped. Once the active silence is gossiped, its expiration should happen more or less simultaneously on the different alertmanager nodes. Preventing the gossiping of expired silences allows them to be GC'd, and doesn't affect consistency across the mesh.
This will fix the bug of resync deleted silences due to the state of other peers.
3ac3929
to
097a4b7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
This will fix the bug of resync deleted silences due to the state of other peers. More context on #996