
Fix repeated resolved alerts #3006

Closed

Conversation

@damnever (Contributor) commented Jul 15, 2022

Assumptions:

  1. Prometheus/XXRuler failed to send some active alerts but succeeded in sending all resolved alerts:
    alert-a: T1(active:notify-ok), T2(resolved:notify-ok), T3(active:notify-failed), T4(resolved:notify-ok).
  2. Or the Alertmanager may be behind a proxy with a retry policy that resends requests.

Reproduce:

  1. A repeated alert-a[resolved] is received during aggrGroup.flush.
  2. We cannot delete alert-a[resolved] since the alert has been changed.
  3. We receive a new alert-b[firing] during group_interval; then both alert-a[resolved] and alert-b[firing] are sent out.
  4. Users may receive alert-a[resolved] multiple times.

There is another case that can trigger this issue: if the timing is right, the repeated alert-a[resolved] is received right after the aggrGroup has been deleted.
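
For reference, here is a simplified sketch of the delete-on-flush path under discussion (reconstructed from the dispatch/dispatch.go diff quoted later in this thread; error messages shortened):

// After a successful notification, aggrGroup.flush tries to delete the
// resolved alerts it just sent.
for fp, a := range alerts {
	// Read the current state of the alert from the store.
	got, err := ag.alerts.Get(fp)
	if err != nil {
		level.Error(ag.logger).Log("msg", "failed to get alert", "err", err, "alert", a.String())
		continue
	}
	// Delete the alert only if it is resolved and has not been updated since
	// the flush copy was taken. Get and Delete are two separate store calls,
	// so a concurrent insert can land between them.
	if a.Resolved() && got.UpdatedAt == a.UpdatedAt {
		if err := ag.alerts.Delete(fp); err != nil {
			level.Error(ag.logger).Log("msg", "error on delete alert", "err", err, "alert", a.String())
		}
	}
}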

@damnever force-pushed the fix/cortexrepeatedresolvedalerts branch 4 times, most recently from 3d8fadc to 9fdc96b on July 16, 2022 02:55
 			continue
 		}
 		if a.Resolved() && got.UpdatedAt == a.UpdatedAt {
-			if err := ag.alerts.Delete(fp); err != nil {
+			if err := ag.alerts.DeleteIfResolved(a.Fingerprint()); err != nil {
Contributor Author

We need an atomic operation to ensure no race between Get and Delete.
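
For illustration, such an atomic helper on the alert store could look roughly like this (the DeleteIfResolved name comes from the diff above; the body here is a sketch of the idea, not necessarily the PR's exact implementation):

// DeleteIfResolved removes the alert with the given fingerprint only if it is
// still resolved at the time of deletion. The lookup and the delete happen
// under the same lock, so a concurrent insert cannot slip in between them.
func (a *Alerts) DeleteIfResolved(fp model.Fingerprint) error {
	a.Lock()
	defer a.Unlock()

	alert, ok := a.c[fp]
	if !ok {
		// Already gone; nothing to do.
		return nil
	}
	if alert.Resolved() {
		delete(a.c, fp)
	}
	return nil
}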

@roidelapluie (Member)

I think this might be on purpose and written for simplicity. Alertmanager has an at-least-once promise; it's better to send one extra alert than to miss one.

@damnever (Contributor, Author) commented Jul 18, 2022

@roidelapluie Yes, Alertmanager has an at-least-once promise. I think that is because Alertmanager retries until the notification succeeds and only then removes the resolved alerts.

But in this case we are sending repeated (orphan) resolved alerts, and there are race conditions in the code, so I'd consider this a BUG rather than a "feature".

@damnever changed the title from "Avoid repeated resolved alerts" to "Fix repeated resolved alerts" on Jul 18, 2022
@damnever (Contributor, Author)

#2474 could be an instance of this issue.

@damnever force-pushed the fix/cortexrepeatedresolvedalerts branch 2 times, most recently from cefa32d to bb8a37d on July 19, 2022 15:12
@damnever (Contributor, Author)

@roidelapluie @simonpasquier @w0rm Please take a detailed look.

@simonpasquier (Member)

After reviewing the Alertmanager code and history, I kinda agree that there's room for improvement. As far as I can tell, while #2040 attempted to fix a race between the aggregation group and the alert store (e.g. a resolved alert could be removed before being notified) by disabling garbage collection in the alerts store, it introduced a side effect that increased the likelihood of receiving several notifications with the same resolved alert.

Any change in the current logic would need careful attention though IMHO. Ideally a unit test highlighting the current issue in dispatch.aggrGroup would be a good starting point.

@damnever force-pushed the fix/cortexrepeatedresolvedalerts branch 2 times, most recently from a070082 to 4cd86dd on October 8, 2022 07:46
@damnever (Contributor, Author) commented Oct 8, 2022

@simonpasquier I have added test cases for this issue; the main branch fails them:

gotest -v -race -run TestDispatcherReceiveAndNotifyRepeatedResolvedAlerts
=== RUN   TestDispatcherReceiveAndNotifyRepeatedResolvedAlerts
=== RUN   TestDispatcherReceiveAndNotifyRepeatedResolvedAlerts/repeated_alerts_after_aggrGroup_deleted
    dispatch_test.go:743: unexpected repeated resolved alerts
=== RUN   TestDispatcherReceiveAndNotifyRepeatedResolvedAlerts/repeated_alerts_after_aggrGroup_flush
    dispatch_test.go:773: unexpected repeated resolved alerts
--- FAIL: TestDispatcherReceiveAndNotifyRepeatedResolvedAlerts (12.01s)
    --- FAIL: TestDispatcherReceiveAndNotifyRepeatedResolvedAlerts/repeated_alerts_after_aggrGroup_deleted (0.00s)
    --- FAIL: TestDispatcherReceiveAndNotifyRepeatedResolvedAlerts/repeated_alerts_after_aggrGroup_flush (12.00s)
FAIL
exit status 1
FAIL    github.com/prometheus/alertmanager/dispatch     13.180s

@damnever force-pushed the fix/cortexrepeatedresolvedalerts branch from 4cd86dd to af2f895 on October 8, 2022 07:48
@damnever (Contributor, Author) commented Nov 3, 2022

Ping someone..

@damnever force-pushed the fix/cortexrepeatedresolvedalerts branch from af2f895 to 1d5a033 on December 8, 2022 09:08
@damnever force-pushed the fix/cortexrepeatedresolvedalerts branch from 1d5a033 to f45256e on March 28, 2023 09:15
@bo-er commented Apr 14, 2023

@damnever Hi! Apologies for asking, but could you please share some debug logs here? I seem to be having the same problem.

@damnever (Contributor, Author)

@bo-er I do not have any logs to reproduce the original issue since we are currently using this patch in our production environment. However, I can provide some logs that were generated by this patch:

... msg="ignore resolved alert since there is no corresponding record in the store"

@damnever (Contributor, Author) commented Dec 6, 2023

@simonpasquier, could you please take another look at this? We have been running this in our production environment for over a year. I am confident that it has not introduced any side effects. It has resolved the issues I previously mentioned.

Signed-off-by: Xiaochao Dong (@damnever) <[email protected]>
@damnever force-pushed the fix/cortexrepeatedresolvedalerts branch from f45256e to 25323ef on December 27, 2023 02:51

// Here we simulate the case when the alertmanager receives resolved alerts
// right after the aggrGroup is deleted.
t.Run("repeated alerts after aggrGroup deleted", func(t *testing.T) {
Contributor

Hi! 👋 Josh has asked if I can help take a look at this issue. I'm looking at this test here but I'm not sure I understand what is being tested and how this is showing one of the reported race conditions (I understand there are at least two?).

Here you are adding an expired alert to the store and waiting for a flush. Your test expects no flush to occur, so when it does occur the test fails. What I don't understand is how this relates to the race condition? It is expected that when Alertmanager receives a resolved alert it still creates an aggregation group and flushes it.

@damnever (Contributor, Author) May 1, 2024

If I recall correctly, this kind of test is hard to write. This is just an attempt to demonstrate that the main branch will fail.

This PR also fixes certain race conditions, but its primary focus is addressing repeated resolved alerts. The race condition is only a minor part of the fix. I don't know if you agree with this.

The race condition is clearly not tested here; the test doesn't mention it. However, I will do my best to describe it in plain text: the race condition could result in normal alerts being deleted if the same alert becomes active again and is received at the same time as the deletion of the alert.

OK, the condition in this test may produce repeated resolved (orphan) alerts if the previous firing alert fails to reach the Alertmanager due to network conditions or something similar, which I described before. We reject storing resolved alerts if the corresponding firing alert is not found; it may have been lost due to network conditions.
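
A sketch of what that rejection could look like in the dispatcher (insertIfKnown is an illustrative name, not code from this PR; the log message is the one the patch emits, quoted earlier in this thread):

// insertIfKnown stores a resolved alert only if a record with the same
// fingerprint already exists in the store; otherwise the resolved alert is
// treated as an orphan and dropped.
func (ag *aggrGroup) insertIfKnown(alert *types.Alert) {
	if alert.Resolved() {
		if _, err := ag.alerts.Get(alert.Fingerprint()); err != nil {
			level.Debug(ag.logger).Log("msg", "ignore resolved alert since there is no corresponding record in the store")
			return
		}
	}
	ag.insert(alert)
}

In a real implementation the existence check would need to happen atomically with the insert, for the same Get/Delete race reasons discussed above.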

Contributor

the race condition could result in normal alerts being deleted if the same alert becomes active again and is received at the same time as the deletion of the alert.

OK, I think I understand the issue!

Goroutine #1: You have a Resolved alert that is about to be deleted because the if statement is true:

if a.Resolved() && got.UpdatedAt == a.UpdatedAt {
	...
}

Goroutine #2: Between execution of the condition and execution of the code in the if statement, a firing alert is put into the store via func (ag *aggrGroup) insert(alert *types.Alert), changing the alert back from resolved to firing.

Goroutine #1: The code in the if statement resumes execution and deletes the alert from the store. However, the alert has changed since the condition was executed, and got no longer refers to the alert in the store. The firing alert is deleted.

if err := ag.alerts.Delete(fp); err != nil {
	...
}

We reject storing resolved alerts if the corresponding firing alert is not found; it may have been lost due to network conditions.

I don't think this fix works though. It has a bug where the Alertmanager does not accept resolved alerts following a crash or restart. Alertmanager does not write its alert state to disk, so there is no firing alert in the store when the resolved alert is received.

@damnever (Contributor, Author) May 1, 2024

Then I would prefer to avoid receiving an orphan resolved alert, and I think that might be why we have an option called resolve_timeout.

Contributor

What do you mean by orphaned alert?

Contributor Author

In this case, Alertmanager could deliver a resolved alert without a preceding firing alert, so I refer to it as an 'orphan resolved alert'.

Contributor Author

I think the resolved alert should not arrive too late, so we can handle the restart scenario efficiently by comparing the process start time with the creation time of the aggregation group.
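
As a rough illustration of that idea (entirely hypothetical; neither the function nor the grace window exists in Alertmanager), an orphan resolved alert would only be accepted when its aggregation group was created shortly after the process started, since the preceding firing alert may have been lost in the restart:

// acceptOrphanResolved reports whether a resolved alert without a matching
// firing record should still be accepted: only when the aggregation group was
// created within a grace window after process start, i.e. the firing alert
// may have been lost in a crash or restart. Assumes the standard time package.
func acceptOrphanResolved(groupCreatedAt, processStart time.Time, grace time.Duration) bool {
	return groupCreatedAt.Sub(processStart) <= grace
}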

Contributor

I think that will be difficult to avoid for a number of reasons. Let's still fix the race condition though.

Contributor

OK! I am able to reproduce the issue with a unit test.

The test creates and starts a dispatcher, inserts a firing alert into the store, and waits for the alert to be flushed. It then inserts a resolved alert and waits for it to be flushed too.

Finally, it inserts the same firing alert that it inserted at the start of the test and waits for that to be flushed too. However, the flush never happens because the firing alert has been deleted.

=== RUN   TestDispatcherRaceFiringAlertDeleted
level=debug component=dispatcher msg="Received alert" alert=[3fff2c2][active]
level=debug component=dispatcher aggrGroup={}:{} msg=flushing alerts=[[3fff2c2][active]]
level=debug component=dispatcher msg="Received alert" alert=[3fff2c2][resolved]
level=debug component=dispatcher aggrGroup={}:{} msg=flushing alerts=[[3fff2c2][resolved]]
level=debug component=dispatcher msg="Received alert" alert=[3fff2c2][active]
level=debug component=dispatcher aggrGroup={}:{} msg="Deleted alert" alert=[3fff2c2][active]
    dispatch_test.go:693:
--- FAIL: TestDispatcherRaceFiringAlertDeleted (3.00s)
FAIL
FAIL	github.com/prometheus/alertmanager/dispatch	3.260s
FAIL

To make the race condition easier to observe I added a 1ms sleep to the dispatcher between the call to Get(fp) and Delete(fp):

diff --git a/dispatch/dispatch.go b/dispatch/dispatch.go
index 104471e3..c2669496 100644
--- a/dispatch/dispatch.go
+++ b/dispatch/dispatch.go
@@ -526,10 +526,16 @@ func (ag *aggrGroup) flush(notify func(...*types.Alert) bool) {
                                level.Error(ag.logger).Log("msg", "failed to get alert", "err", err, "alert", a.String())
                                continue
                        }
+                       // Insert a 1ms sleep. This allows the scheduler to resume the test
+                       // between ag.alerts.Get(fp) and ag.alerts.Delete(fp) during which
+                       // it should replace the resolved alert with a firing alert.
+                       <-time.After(time.Millisecond)
                        if a.Resolved() && got.UpdatedAt == a.UpdatedAt {
-                               if err := ag.alerts.Delete(fp); err != nil {
+                               deleted, err := ag.alerts.Delete(fp)
+                               if err != nil {
                                        level.Error(ag.logger).Log("msg", "error on delete alert", "err", err, "alert", a.String())
                                }
+                               level.Info(ag.logger).Log("msg", "Deleted alert", "alert", deleted)
                        }
                }
        }

I also updated the store to return the deleted alert:

diff --git a/store/store.go b/store/store.go
index 9b30542f..2d5e4666 100644
--- a/store/store.go
+++ b/store/store.go
@@ -115,12 +115,12 @@ func (a *Alerts) Set(alert *types.Alert) error {
 }

 // Delete removes the Alert with the matching fingerprint from the store.
-func (a *Alerts) Delete(fp model.Fingerprint) error {
+func (a *Alerts) Delete(fp model.Fingerprint) (*types.Alert, error) {
        a.Lock()
        defer a.Unlock()
-
+       tmp := a.c[fp]
        delete(a.c, fp)
-       return nil
+       return tmp, nil
 }

We can see that the firing alert is deleted:

level=debug component=dispatcher aggrGroup={}:{} msg=flushing alerts=[[3fff2c2][resolved]]
level=debug component=dispatcher msg="Received alert" alert=[3fff2c2][active]
level=debug component=dispatcher aggrGroup={}:{} msg="Deleted alert" alert=[3fff2c2][active]

However, what should happen is that no alert is deleted, as the alert in the store was updated during the flush.
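
One way to get that behaviour would be to move the UpdatedAt comparison into the store so the check and the delete share a single lock (a hypothetical sketch, assuming the same Alerts struct as the diff above; not necessarily the fix that was eventually merged):

// DeleteIfNotModified removes the alert only if its UpdatedAt still matches
// the copy taken at flush time. Because the comparison and the delete run
// under one lock, an update that lands during the flush prevents the delete.
func (a *Alerts) DeleteIfNotModified(fp model.Fingerprint, updatedAt time.Time) error {
	a.Lock()
	defer a.Unlock()

	alert, ok := a.c[fp]
	if !ok {
		return nil
	}
	if alert.UpdatedAt.Equal(updatedAt) {
		delete(a.c, fp)
	}
	return nil
}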

case <-notifier.C:
}

alert.UpdatedAt = alert.Alert.EndsAt.Add(time.Second)
Contributor

This is the other case I don't understand. You are changing the UpdatedAt time, putting the alert, and expecting it to be deleted. However, if the UpdatedAt time is incremented during a flush, the alert will not be deleted, and it will be flushed again. You won't get a second resolved notification, though, because the notification log will show that a resolved notification was already sent. Your test skips that check. Have I misunderstood?

@damnever (Contributor, Author) May 1, 2024

Alright, once again, the test case definitely needs improvement. I believe I should have copied the resolved alert and modified its start time to suit this case:

Prometheus/XXRuler failed to send some active alerts but succeeded in sending all resolved alerts:
alert-a: T1(active:notify-ok), T2(resolved:notify-ok), T3(active:notify-failed), T4(resolved:notify-ok).

We reject storing a resolved alert if the alert in the store has already been resolved. This may be because the corresponding firing alert was lost due to network conditions.

Contributor Author

If the notification log prevents the sending of repeated resolved alerts, then this is not the case.

@damnever (Contributor, Author) May 1, 2024

OK, I believe the notification log has its own garbage collection that operates independently. This case will occur when the timing is right, and this is the case we should handle.

Contributor

If the notification log prevents the sending of repeated resolved alerts, then this is not the case.

That is correct! The notification log will contain an entry for the resolved alert as it was the last successful notification sent. Entries in the notification log are deleted after the retention period (5 days). I'm still not sure I understand the bug though?

@grobinson-grafana (Contributor)

Thanks! I believe the issue is fixed with #3826. I will take a look at #3648 this week too!

@gotjosh (Member) commented May 7, 2024

Should we close this PR then?

@grobinson-grafana (Contributor)

Should we close this PR then?

Let's wait for @damnever. I think we fixed the issue, but I just want to confirm with them.

@damnever (Contributor, Author) commented May 8, 2024

I may have forgotten some of the details as it's been too long. Let's close this.

@damnever closed this on May 8, 2024