Fix repeated resolved alerts #3006
Conversation
force-pushed from 3d8fadc to 9fdc96b
	continue
}
if a.Resolved() && got.UpdatedAt == a.UpdatedAt {
-	if err := ag.alerts.Delete(fp); err != nil {
+	if err := ag.alerts.DeleteIfResolved(a.Fingerprint()); err != nil {
We need an atomic operation to ensure no race between Get and Delete.
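For illustration, a minimal sketch of what such an atomic operation could look like on the store, assuming DeleteIfResolved keeps the existing c map and mutex of store.Alerts (a sketch, not necessarily this PR's actual implementation):

// DeleteIfResolved removes the alert only if it is still resolved.
// The lookup, the resolved check, and the delete all happen under the
// same lock, so a concurrent Set cannot slip in between them.
func (a *Alerts) DeleteIfResolved(fp model.Fingerprint) error {
	a.Lock()
	defer a.Unlock()

	if alert, ok := a.c[fp]; ok && alert.Resolved() {
		delete(a.c, fp)
	}
	return nil
}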
I think this might be on purpose and written for simplicity. Alertmanager has an at-least-once promise; it's better to send one extra alert than to miss one.
@roidelapluie Yes, Alertmanager has an at-least-once promise. I think that is because Alertmanager retries until the notification succeeds and only then removes the resolved alerts. But in this case we are sending repeated (orphan) resolved alerts, and there are race conditions in the code, so I'd consider this a bug.
#2474 could be the case.
force-pushed from cefa32d to bb8a37d
@roidelapluie @simonpasquier @w0rm Please take a detailed look.
After reviewing the Alertmanager code and history, I kinda agree that there's room for improvement. As far as I can tell, while #2040 attempted to fix a race between the aggregation group and the alert store (e.g. a resolved alert could be removed before being notified) by disabling garbage collection in the alerts store, it introduced a side effect that increased the likelihood of receiving several notifications with the same resolved alert. Any change to the current logic would need careful attention though, IMHO. Ideally, a unit test highlighting the current issue in the dispatcher would be a great addition.
force-pushed from a070082 to 4cd86dd
@simonpasquier I have added test cases for this issue; the main branch will fail the test.
force-pushed from 4cd86dd to af2f895
Ping, anyone?
force-pushed from af2f895 to 1d5a033
force-pushed from 1d5a033 to f45256e
@damnever Hi! Apologies for asking, but could you please share some debug logs here? I'm presumably having the same problem.
@bo-er I do not have any logs that reproduce the original issue, since we are currently using this patch in our production environment. However, I can provide some logs that were generated by this patch.
@simonpasquier, could you please take another look at this? We have been running this in our production environment for over a year. I am confident that it has not introduced any side effects. It has resolved the issues I previously mentioned.
Signed-off-by: Xiaochao Dong (@damnever) <[email protected]>
force-pushed from f45256e to 25323ef
// Here we simulate the case when the alertmanager receives resolved alerts
// right after the aggrGroup is deleted.
t.Run("repeated alerts after aggrGroup deleted", func(t *testing.T) {
Hi! 👋 Josh has asked if I can help take a look at this issue. I'm looking at this test here but I'm not sure I understand what is being tested and how this is showing one of the reported race conditions (I understand there are at least two?).
Here you are adding an expired alert to the store and waiting for a flush. Your test expects no flush to occur, so when it does occur the test fails. What I don't understand is how this relates to the race condition. It is expected that when Alertmanager receives a resolved alert, it still creates an aggregation group and flushes it.
If I recall correctly, this kind of test is hard to write. This is just an attempt to demonstrate that the main branch will fail.
This PR includes fixes for certain race conditions, but its primary focus is addressing repeated resolved alerts. The race condition is only a minor part of the fix; I don't know whether you agree with this.
The race condition is clearly not tested here; the test doesn't mention it. However, I will do my best to describe the race condition in plain text: it could result in normal alerts being deleted if the same alert becomes active again and is received at the same time as the deletion of the alert.
OK, the condition in this test may produce repeated resolved (orphan) alerts if the previous firing alert fails to be sent to the Alertmanager due to network conditions or something similar, which I have described before. We reject storing resolved alerts if the corresponding firing alert is not found; it may have been lost due to network conditions.
the race condition could result in normal alerts being deleted if the same alert becomes active again and is received at the same time as the deletion of the alert.
OK, I think I understand the issue!
Goroutine #1: You have a Resolved alert that is about to be deleted because the if statement is true:
if a.Resolved() && got.UpdatedAt == a.UpdatedAt {
...
}
Goroutine #2: Between execution of the condition and execution of the code in the if statement, a firing alert is put into the store via func (ag *aggrGroup) insert(alert *types.Alert), changing the alert back from resolved to firing.
Goroutine #1: The code in the if statement resumes execution and deletes the alert from the store. However, the alert has changed since the condition was executed, and got no longer refers to the alert in the store. The firing alert is deleted.
if err := ag.alerts.Delete(fp); err != nil {
...
}
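A tiny standalone illustration of that window, driving the store directly rather than through the dispatcher (a sketch to show the check-then-act gap, not code from this PR):

// got is read under the store lock, but the lock is released before the
// delete, so another goroutine can Set a firing alert in between.
got, err := ag.alerts.Get(fp)
if err == nil && got.Resolved() {
	// <-- goroutine #2 calls ag.alerts.Set(firingAlert) right here
	if err := ag.alerts.Delete(fp); err != nil { // deletes the now-firing alert
		level.Error(ag.logger).Log("msg", "error on delete alert", "err", err)
	}
}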
We reject storing resolved alerts if the corresponding firing alert is not found, it may have been lost due to network conditions.
I don't think this fix works though. It has a bug where the Alertmanager does not accept resolved alerts following a crash or restart. Alertmanager does not write its state to disk, so there is no firing alert when the resolved alert is received.
Then, I would prefer to avoid receiving an orphan resolved alert, and I think that might be why we have an option called resolved_timeout.
What do you mean by orphaned alert?
In this case, AlertManager could deliver a resolved alert without a preceding firing alert, so I refer to it as an 'orphan resolved alert'.
I think the resolved alert should not arrive too late, so we could handle the restart scenario efficiently by comparing the process start time with the creation time of the aggregation group.
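A rough sketch of that idea, with made-up names (processStart, groupCreated, and the grace window are assumptions for illustration, not existing Alertmanager fields):

// acceptOrphanResolved reports whether a resolved alert without a known
// firing alert should still be accepted: only when the aggregation group
// was created shortly after the process started, i.e. the firing alert
// may simply have been lost in a restart.
func acceptOrphanResolved(processStart, groupCreated time.Time, grace time.Duration) bool {
	return groupCreated.Sub(processStart) < grace
}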
I think that will be difficult to avoid for a number of reasons. Let's still fix the race condition though.
OK! I am able to reproduce the issue with a unit test.
The test creates and starts a dispatcher, inserts a firing alert into the store, and waits for the alert to be flushed. It then inserts a resolved alert and waits for it to be flushed too.
Lastly, it inserts the same firing alert that it inserted at the start of the test and waits for that to be flushed. However, the flush never happens because the firing alert has been deleted.
=== RUN TestDispatcherRaceFiringAlertDeleted
level=debug component=dispatcher msg="Received alert" alert=[3fff2c2][active]
level=debug component=dispatcher aggrGroup={}:{} msg=flushing alerts=[[3fff2c2][active]]
level=debug component=dispatcher msg="Received alert" alert=[3fff2c2][resolved]
level=debug component=dispatcher aggrGroup={}:{} msg=flushing alerts=[[3fff2c2][resolved]]
level=debug component=dispatcher msg="Received alert" alert=[3fff2c2][active]
level=debug component=dispatcher aggrGroup={}:{} msg="Deleted alert" alert=[3fff2c2][active]
dispatch_test.go:693:
--- FAIL: TestDispatcherRaceFiringAlertDeleted (3.00s)
FAIL
FAIL github.com/prometheus/alertmanager/dispatch 3.260s
FAIL
To make the race condition easier to observe, I added a 1ms sleep to the dispatcher between the call to Get(fp) and Delete(fp):
diff --git a/dispatch/dispatch.go b/dispatch/dispatch.go
index 104471e3..c2669496 100644
--- a/dispatch/dispatch.go
+++ b/dispatch/dispatch.go
@@ -526,10 +526,16 @@ func (ag *aggrGroup) flush(notify func(...*types.Alert) bool) {
level.Error(ag.logger).Log("msg", "failed to get alert", "err", err, "alert", a.String())
continue
}
+ // Insert a 1ms sleep. This allows the scheduler to resume the test
+ // between ag.alerts.Get(fp) and ag.alerts.Delete(fp) during which
+ // it should replace the resolved alert with a firing alert.
+ <-time.After(time.Millisecond)
if a.Resolved() && got.UpdatedAt == a.UpdatedAt {
- if err := ag.alerts.Delete(fp); err != nil {
+ deleted, err := ag.alerts.Delete(fp)
+ if err != nil {
level.Error(ag.logger).Log("msg", "error on delete alert", "err", err, "alert", a.String())
}
+ level.Info(ag.logger).Log("msg", "Deleted alert", "alert", deleted)
}
}
}
I also updated the store to return the deleted alert:
diff --git a/store/store.go b/store/store.go
index 9b30542f..2d5e4666 100644
--- a/store/store.go
+++ b/store/store.go
@@ -115,12 +115,12 @@ func (a *Alerts) Set(alert *types.Alert) error {
}
// Delete removes the Alert with the matching fingerprint from the store.
-func (a *Alerts) Delete(fp model.Fingerprint) error {
+func (a *Alerts) Delete(fp model.Fingerprint) (*types.Alert, error) {
a.Lock()
defer a.Unlock()
-
+ tmp := a.c[fp]
delete(a.c, fp)
- return nil
+ return tmp, nil
}
We can see that the firing alert is deleted:
level=debug component=dispatcher aggrGroup={}:{} msg=flushing alerts=[[3fff2c2][resolved]]
level=debug component=dispatcher msg="Received alert" alert=[3fff2c2][active]
level=debug component=dispatcher aggrGroup={}:{} msg="Deleted alert" alert=[3fff2c2][active]
However, what should happen is that no alert is deleted, because the alert in the store was updated during the flush.
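One way to express that as an atomic operation on the store (a sketch; the name DeleteIfUnchanged is hypothetical and not the store's current API): pass in the alert observed during the flush and only delete it if the stored copy still has the same UpdatedAt.

// DeleteIfUnchanged removes the alert only if it has not been updated
// since it was read during the flush; otherwise the newer (possibly
// firing) alert is left in the store.
func (a *Alerts) DeleteIfUnchanged(alert *types.Alert) error {
	a.Lock()
	defer a.Unlock()

	fp := alert.Fingerprint()
	if got, ok := a.c[fp]; ok && got.UpdatedAt == alert.UpdatedAt {
		delete(a.c, fp)
	}
	return nil
}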
	case <-notifier.C:
	}

alert.UpdatedAt = alert.Alert.EndsAt.Add(time.Second)
This is the other case I don't understand. You are changing the UpdatedAt time, putting the alert, and expecting it to be deleted. However, if the UpdatedAt time is incremented during a flush, the alert will not be deleted, and it will be flushed again. You won't get a second resolved notification though, because the notification log will show that a resolved notification was sent. Your test skips that check though. Have I misunderstood?
Alright, once again, the test case definitely requires improvement. I believe I should have copied the resolved alert and modified the start time to suit this case:
Prometheus/XXRuler failed to send some active alerts but succeeded in sending all resolved alerts:
alert-a: T1(active:notify-ok), T2(resolved:notify-ok), T3(active:notify-failed), T4(resolved:notify-ok).
We reject storing resolved alerts if the alert has already been resolved; the corresponding firing alert may have been lost due to network conditions.
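A minimal sketch of that guard on the store, assuming the same c map and mutex as store.Alerts (the method name is made up here and is not the current store API):

// setIfNotRepeatedResolved stores the alert unless it would only repeat an
// already-resolved alert, which is how a lost firing alert could otherwise
// turn into a second resolved notification.
func (a *Alerts) setIfNotRepeatedResolved(alert *types.Alert) error {
	a.Lock()
	defer a.Unlock()

	fp := alert.Fingerprint()
	if got, ok := a.c[fp]; ok && got.Resolved() && alert.Resolved() {
		// Already resolved in the store; drop the repeated resolved alert.
		return nil
	}
	a.c[fp] = alert
	return nil
}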
If the notification log prevents the sending of repeated resolved alerts, then this is not the case.
Ok, I believe the notification log has its own garbage collection that operates independently. This case will occur when the timing is right, and this is the case we should handle.
If the notification log prevents the sending of repeated resolved alerts, then this is not the case.
That is correct! The notification log will contain an entry for the resolved alert as it was the last successful notification sent. Entries in the notification log are deleted after the retention period (5 days). I'm still not sure I understand the bug though?
Should we close this PR then?
Let's wait for @damnever. I think we fixed the issue, but I just want to confirm with them.
It's been too long and I may have forgotten some of the details. Let's close this.
Assumptions:
alert-a: T1(active:notify-ok), T2(resolved:notify-ok), T3(active:notify-failed), T4(resolved:notify-ok).
Reproduce:
1. alert-a[resolved] is received during aggrGroup.flush.
2. The flush fails to delete alert-a[resolved] since the alert has been changed.
3. alert-b[firing] is received during group_interval, then both alert-a[resolved] and alert-b[firing] will be sent out.
4. So we receive alert-a[resolved] multiple times.
There is another case that may trigger this issue: if the timing is right, the repeated alert-a[resolved] is received right after the aggrGroup gets deleted.