
Fix repeated resolved alerts #3006

Closed

Conversation

@damnever (Contributor) commented Jul 15, 2022

Assumptions:

  1. Prometheus/XXRuler failed to send some active alerts but succeeded in sending all resolved alerts:
    alert-a: T1(active:notify-ok), T2(resolved:notify-ok), T3(active:notify-failed), T4(resolved:notify-ok).
  2. Or the Alertmanager may be behind a proxy with a retry policy that resends requests.

Reproduce:

  1. A repeated alert-a[resolved] is received during aggrGroup.flush.
  2. We cannot delete alert-a[resolved] since the alert has been changed.
  3. We receive a new alert-b[firing] during group_interval; then both alert-a[resolved] and alert-b[firing] are sent out.
  4. Users may receive alert-a[resolved] multiple times.

There is another case that can trigger this issue: if the timing is right, the repeated alert-a[resolved] is received right after the aggrGroup has been deleted.
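
For reference, here is a simplified sketch of the delete-on-flush path under discussion (reconstructed from the dispatch/dispatch.go diff quoted later in this thread; error messages shortened):

// After a successful notification, aggrGroup.flush tries to delete the
// resolved alerts it just sent.
for fp, a := range alerts {
	// Read the current state of the alert from the store.
	got, err := ag.alerts.Get(fp)
	if err != nil {
		level.Error(ag.logger).Log("msg", "failed to get alert", "err", err, "alert", a.String())
		continue
	}
	// Delete the alert only if it is resolved and has not been updated since
	// the flush copy was taken. Get and Delete are two separate store calls,
	// so a concurrent insert can land between them.
	if a.Resolved() && got.UpdatedAt == a.UpdatedAt {
		if err := ag.alerts.Delete(fp); err != nil {
			level.Error(ag.logger).Log("msg", "error on delete alert", "err", err, "alert", a.String())
		}
	}
}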

@damnever force-pushed the fix/cortexrepeatedresolvedalerts branch 4 times, most recently from 3d8fadc to 9fdc96b on July 16, 2022 02:55
 			continue
 		}
 		if a.Resolved() && got.UpdatedAt == a.UpdatedAt {
-			if err := ag.alerts.Delete(fp); err != nil {
+			if err := ag.alerts.DeleteIfResolved(a.Fingerprint()); err != nil {
Contributor Author

We need an atomic operation to ensure no race between Get and Delete.
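
For illustration, such an atomic helper on the alert store could look roughly like this (the DeleteIfResolved name comes from the diff above; the body here is a sketch of the idea, not necessarily the PR's exact implementation):

// DeleteIfResolved removes the alert with the given fingerprint only if it is
// still resolved at the time of deletion. The lookup and the delete happen
// under the same lock, so a concurrent insert cannot slip in between them.
func (a *Alerts) DeleteIfResolved(fp model.Fingerprint) error {
	a.Lock()
	defer a.Unlock()

	alert, ok := a.c[fp]
	if !ok {
		// Already gone; nothing to do.
		return nil
	}
	if alert.Resolved() {
		delete(a.c, fp)
	}
	return nil
}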

@roidelapluie (Member)

I think this might be on purpose and written for simplicity. Alertmanager has an at-least-once promise; it's better to send one extra alert than to miss one.

@damnever (Contributor, Author) commented Jul 18, 2022

@roidelapluie Yes, Alertmanager has an at-least-once promise. I think that is because Alertmanager retries until the notification succeeds and only then removes the resolved alerts.

But in this case we are sending repeated (orphan) resolved alerts, and there are race conditions in the code, so I'd consider this a BUG rather than a "feature".

@damnever changed the title from "Avoid repeated resolved alerts" to "Fix repeated resolved alerts" on Jul 18, 2022
@damnever (Contributor, Author)

#2474 could be an instance of this issue.

@damnever force-pushed the fix/cortexrepeatedresolvedalerts branch 2 times, most recently from cefa32d to bb8a37d on July 19, 2022 15:12
@damnever (Contributor, Author)

@roidelapluie @simonpasquier @w0rm Please take a detailed look.

@simonpasquier (Member)

After reviewing the Alertmanager code and history, I kinda agree that there's room for improvement. As far as I can tell, while #2040 attempted to fix a race between the aggregation group and the alert store (e.g. a resolved alert could be removed before being notified) by disabling garbage collection in the alerts store, it introduced a side effect that increased the likelihood of receiving several notifications with the same resolved alert.

Any change in the current logic would need careful attention though IMHO. Ideally a unit test highlighting the current issue in dispatch.aggrGroup would be a good starting point.

@damnever force-pushed the fix/cortexrepeatedresolvedalerts branch 2 times, most recently from a070082 to 4cd86dd on October 8, 2022 07:46
@damnever (Contributor, Author) commented Oct 8, 2022

@simonpasquier I have added test cases for this issue; the main branch fails them:

gotest -v -race -run TestDispatcherReceiveAndNotifyRepeatedResolvedAlerts
=== RUN   TestDispatcherReceiveAndNotifyRepeatedResolvedAlerts
=== RUN   TestDispatcherReceiveAndNotifyRepeatedResolvedAlerts/repeated_alerts_after_aggrGroup_deleted
    dispatch_test.go:743: unexpected repeated resolved alerts
=== RUN   TestDispatcherReceiveAndNotifyRepeatedResolvedAlerts/repeated_alerts_after_aggrGroup_flush
    dispatch_test.go:773: unexpected repeated resolved alerts
--- FAIL: TestDispatcherReceiveAndNotifyRepeatedResolvedAlerts (12.01s)
    --- FAIL: TestDispatcherReceiveAndNotifyRepeatedResolvedAlerts/repeated_alerts_after_aggrGroup_deleted (0.00s)
    --- FAIL: TestDispatcherReceiveAndNotifyRepeatedResolvedAlerts/repeated_alerts_after_aggrGroup_flush (12.00s)
FAIL
exit status 1
FAIL    github.com/prometheus/alertmanager/dispatch     13.180s

@damnever force-pushed the fix/cortexrepeatedresolvedalerts branch from 4cd86dd to af2f895 on October 8, 2022 07:48
@damnever (Contributor, Author) commented Nov 3, 2022

Ping someone..

@damnever force-pushed the fix/cortexrepeatedresolvedalerts branch from af2f895 to 1d5a033 on December 8, 2022 09:08
@damnever force-pushed the fix/cortexrepeatedresolvedalerts branch from 1d5a033 to f45256e on March 28, 2023 09:15
@bo-er commented Apr 14, 2023

@damnever Hi! Apologies for asking, but could you please share some debug logs here? I seem to be having the same problem.

@damnever (Contributor, Author)

@bo-er I do not have any logs to reproduce the original issue since we are currently using this patch in our production environment. However, I can provide some logs that were generated by this patch:

... msg="ignore resolved alert since there is no corresponding record in the store"

@damnever (Contributor, Author) commented Dec 6, 2023

@simonpasquier, could you please take another look at this? We have been running this in our production environment for over a year. I am confident that it has not introduced any side effects. It has resolved the issues I previously mentioned.

Signed-off-by: Xiaochao Dong (@damnever) <[email protected]>
@damnever force-pushed the fix/cortexrepeatedresolvedalerts branch from f45256e to 25323ef on December 27, 2023 02:51

// Here we simulate the case when the alertmanager receives resolved alerts
// right after the aggrGroup is deleted.
t.Run("repeated alerts after aggrGroup deleted", func(t *testing.T) {
Contributor

Hi! 👋 Josh has asked if I can help take a look at this issue. I'm looking at this test here but I'm not sure I understand what is being tested and how this is showing one of the reported race conditions (I understand there are at least two?).

Here you are adding an expired alert to the store and waiting for a flush. Your test expects no flush to occur, so when it does occur the test fails. What I don't understand is how this relates to the race condition? It is expected that when Alertmanager receives a resolved alert it still creates an aggregation group and flushes it.

@damnever (Contributor, Author) May 1, 2024

If I recall correctly, this kind of test is hard to write. This is just an attempt to demonstrate that the main branch will fail.

This PR also fixes certain race conditions, but its primary focus is addressing repeated resolved alerts. The race condition is only a minor part of the fix. I don't know if you agree with this.

The race condition is clearly not tested here; the test doesn't mention it. However, I will do my best to describe it in plain text: the race condition could result in normal alerts being deleted if the same alert becomes active again and is received at the same time as the deletion of the alert.

OK, the condition in this test may produce repeated resolved (orphan) alerts if the previous firing alert fails to reach the Alertmanager due to network conditions or something similar, which I described before. We reject storing resolved alerts if the corresponding firing alert is not found; it may have been lost due to network conditions.
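
A sketch of what that rejection could look like in the dispatcher (insertIfKnown is an illustrative name, not code from this PR; the log message is the one the patch emits, quoted earlier in this thread):

// insertIfKnown stores a resolved alert only if a record with the same
// fingerprint already exists in the store; otherwise the resolved alert is
// treated as an orphan and dropped.
func (ag *aggrGroup) insertIfKnown(alert *types.Alert) {
	if alert.Resolved() {
		if _, err := ag.alerts.Get(alert.Fingerprint()); err != nil {
			level.Debug(ag.logger).Log("msg", "ignore resolved alert since there is no corresponding record in the store")
			return
		}
	}
	ag.insert(alert)
}

In a real implementation the existence check would need to happen atomically with the insert, for the same Get/Delete race reasons discussed above.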

Contributor

the race condition could result in normal alerts being deleted if the same alert becomes active again and is received at the same time as the deletion of the alert.

OK, I think I understand the issue!

Goroutine #1: You have a Resolved alert that is about to be deleted because the if statement is true:

if a.Resolved() && got.UpdatedAt == a.UpdatedAt {
	...
}

Goroutine #2: Between execution of the condition and execution of the code in the if statement, a firing alert is put into the store via func (ag *aggrGroup) insert(alert *types.Alert), changing the alert back from resolved to firing.

Goroutine #1: The code in the if statement resumes execution and deletes the alert from the store. However, the alert has changed since the condition was executed, and got no longer refers to the alert in the store. The firing alert is deleted.

if err := ag.alerts.Delete(fp); err != nil {
	...
}

We reject storing resolved alerts if the corresponding firing alert is not found; it may have been lost due to network conditions.

I don't think this fix works though. It has a bug where the Alertmanager does not accept resolved alerts following a crash or restart. Alertmanager does not write its alert state to disk, so there is no firing alert in the store when the resolved alert is received.

@damnever (Contributor, Author) May 1, 2024

Then I would prefer to avoid receiving an orphan resolved alert, and I think that might be why we have an option called resolve_timeout.

Contributor

What do you mean by orphaned alert?

Contributor Author

In this case, Alertmanager could deliver a resolved alert without a preceding firing alert, so I refer to it as an 'orphan resolved alert'.

Contributor Author

I think the resolved alert should not arrive too late, so we can handle the restart scenario efficiently by comparing the process start time with the creation time of the aggregation group.
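
As a rough illustration of that idea (entirely hypothetical; neither the function nor the grace window exists in Alertmanager), an orphan resolved alert would only be accepted when its aggregation group was created shortly after the process started, since the preceding firing alert may have been lost in the restart:

// acceptOrphanResolved reports whether a resolved alert without a matching
// firing record should still be accepted: only when the aggregation group was
// created within a grace window after process start, i.e. the firing alert
// may have been lost in a crash or restart. Assumes the standard time package.
func acceptOrphanResolved(groupCreatedAt, processStart time.Time, grace time.Duration) bool {
	return groupCreatedAt.Sub(processStart) <= grace
}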

Contributor

I think that will be difficult to avoid for a number of reasons. Let's still fix the race condition though.

Contributor

OK! I am able to reproduce the issue with a unit test.

The test creates and starts a dispatcher, inserts a firing alert into the store, and waits for the alert to be flushed. It then inserts a resolved alert and waits for it to be flushed too.

Finally, it inserts the same firing alert that it inserted at the start of the test and waits for that to be flushed too. However, the flush never happens because the firing alert has been deleted.

=== RUN   TestDispatcherRaceFiringAlertDeleted
level=debug component=dispatcher msg="Received alert" alert=[3fff2c2][active]
level=debug component=dispatcher aggrGroup={}:{} msg=flushing alerts=[[3fff2c2][active]]
level=debug component=dispatcher msg="Received alert" alert=[3fff2c2][resolved]
level=debug component=dispatcher aggrGroup={}:{} msg=flushing alerts=[[3fff2c2][resolved]]
level=debug component=dispatcher msg="Received alert" alert=[3fff2c2][active]
level=debug component=dispatcher aggrGroup={}:{} msg="Deleted alert" alert=[3fff2c2][active]
    dispatch_test.go:693:
--- FAIL: TestDispatcherRaceFiringAlertDeleted (3.00s)
FAIL
FAIL	github.com/prometheus/alertmanager/dispatch	3.260s
FAIL

To make the race condition easier to observe I added a 1ms sleep to the dispatcher between the call to Get(fp) and Delete(fp):

diff --git a/dispatch/dispatch.go b/dispatch/dispatch.go
index 104471e3..c2669496 100644
--- a/dispatch/dispatch.go
+++ b/dispatch/dispatch.go
@@ -526,10 +526,16 @@ func (ag *aggrGroup) flush(notify func(...*types.Alert) bool) {
                                level.Error(ag.logger).Log("msg", "failed to get alert", "err", err, "alert", a.String())
                                continue
                        }
+                       // Insert a 1ms sleep. This allows the scheduler to resume the test
+                       // between ag.alerts.Get(fp) and ag.alerts.Delete(fp) during which
+                       // it should replace the resolved alert with a firing alert.
+                       <-time.After(time.Millisecond)
                        if a.Resolved() && got.UpdatedAt == a.UpdatedAt {
-                               if err := ag.alerts.Delete(fp); err != nil {
+                               deleted, err := ag.alerts.Delete(fp)
+                               if err != nil {
                                        level.Error(ag.logger).Log("msg", "error on delete alert", "err", err, "alert", a.String())
                                }
+                               level.Info(ag.logger).Log("msg", "Deleted alert", "alert", deleted)
                        }
                }
        }

I also updated the store to return the deleted alert:

diff --git a/store/store.go b/store/store.go
index 9b30542f..2d5e4666 100644
--- a/store/store.go
+++ b/store/store.go
@@ -115,12 +115,12 @@ func (a *Alerts) Set(alert *types.Alert) error {
 }

 // Delete removes the Alert with the matching fingerprint from the store.
-func (a *Alerts) Delete(fp model.Fingerprint) error {
+func (a *Alerts) Delete(fp model.Fingerprint) (*types.Alert, error) {
        a.Lock()
        defer a.Unlock()
-
+       tmp := a.c[fp]
        delete(a.c, fp)
-       return nil
+       return tmp, nil
 }

We can see that the firing alert is deleted:

level=debug component=dispatcher aggrGroup={}:{} msg=flushing alerts=[[3fff2c2][resolved]]
level=debug component=dispatcher msg="Received alert" alert=[3fff2c2][active]
level=debug component=dispatcher aggrGroup={}:{} msg="Deleted alert" alert=[3fff2c2][active]

However, what should happen is that no alert is deleted, as the alert in the store was updated during the flush.
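
One way to get that behaviour would be to move the UpdatedAt comparison into the store so the check and the delete share a single lock (a hypothetical sketch, assuming the same Alerts struct as the diff above; not necessarily the fix that was eventually merged):

// DeleteIfNotModified removes the alert only if its UpdatedAt still matches
// the copy taken at flush time. Because the comparison and the delete run
// under one lock, an update that lands during the flush prevents the delete.
func (a *Alerts) DeleteIfNotModified(fp model.Fingerprint, updatedAt time.Time) error {
	a.Lock()
	defer a.Unlock()

	alert, ok := a.c[fp]
	if !ok {
		return nil
	}
	if alert.UpdatedAt.Equal(updatedAt) {
		delete(a.c, fp)
	}
	return nil
}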

case <-notifier.C:
}

alert.UpdatedAt = alert.Alert.EndsAt.Add(time.Second)
Contributor

This is the other case I don't understand. You are changing the UpdatedAt time, putting the alert, and expecting it to be deleted. However, if the UpdatedAt time is incremented during a flush, the alert will not be deleted, and it will be flushed again. You won't get a second resolved notification, though, because the notification log will show that a resolved notification was already sent. Your test skips that check. Have I misunderstood?

@damnever (Contributor, Author) May 1, 2024

Alright, once again, the test case definitely needs improvement. I believe I should have copied the resolved alert and modified its start time to suit this case:

Prometheus/XXRuler failed to send some active alerts but succeeded in sending all resolved alerts:
alert-a: T1(active:notify-ok), T2(resolved:notify-ok), T3(active:notify-failed), T4(resolved:notify-ok).

We reject storing a resolved alert if the alert in the store has already been resolved. This may be because the corresponding firing alert was lost due to network conditions.

Contributor Author

If the notification log prevents the sending of repeated resolved alerts, then this is not the case.

@damnever (Contributor, Author) May 1, 2024

OK, I believe the notification log has its own garbage collection that operates independently. This case will occur when the timing is right, and this is the case we should handle.

Contributor

If the notification log prevents the sending of repeated resolved alerts, then this is not the case.

That is correct! The notification log will contain an entry for the resolved alert as it was the last successful notification sent. Entries in the notification log are deleted after the retention period (5 days). I'm still not sure I understand the bug though?

@grobinson-grafana (Contributor)

Thanks! I believe the issue is fixed with #3826. I will take a look at #3648 this week too!

@gotjosh (Member) commented May 7, 2024

Should we close this PR then?

@grobinson-grafana (Contributor)

Should we close this PR then?

Let's wait for @damnever. I think we fixed the issue, but I just want to confirm with them.

@damnever (Contributor, Author) commented May 8, 2024

I may have forgotten some of the details as it's been too long. Let's close this.

@damnever closed this on May 8, 2024