bugfix: fix leaking of Silences matcherCache entries #3930

Merged
merged 7 commits into from
Aug 21, 2024

Conversation

Contributor

@Spaceman1701 Spaceman1701 commented Jul 31, 2024

There's a small memory leak from the matcherCache when a silence is updated in place by this branch of Silences.Set:

if ok && canUpdate(prev, sil, now) {
	sil.UpdatedAt = now
	msil := s.toMeshSilence(sil)
	if err := s.checkSizeLimits(msil); err != nil {
		return err
	}
	return s.setSilence(msil, now)
}

Silences.Set will always create a new silence instance. If canUpdate is true, the new instance will replace the old one in the silences state. However, the matcherCache is keyed by the pointer to the instance, so the old entry in the matcherCache ends up dangling. This means that both the compiled matchers and the old silence itself are leaked. The Silences.GC run doesn't take care of this because it never searches for dangling references in the matcherCache.
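To make the failure mode concrete, here is a minimal, self-contained sketch of a pointer-keyed cache. It is not the Alertmanager code: the silence struct, ptrCache, and leakedEntries are hypothetical stand-ins for pb.Silence, matcherCache, and the Set/Query flow, and plain strings stand in for compiled matchers.

```go
package main

import "fmt"

// silence is a hypothetical stand-in for pb.Silence.
type silence struct {
	ID       string
	Matchers []string
}

// ptrCache mimics a matcher cache keyed by the silence pointer.
type ptrCache map[*silence][]string

// get returns the cached matchers for s, filling the cache on first use.
func (c ptrCache) get(s *silence) []string {
	if m, ok := c[s]; ok {
		return m
	}
	c[s] = append([]string(nil), s.Matchers...)
	return c[s]
}

// leakedEntries applies the given number of in-place updates to one silence,
// querying after each, and counts cache keys that no longer point at a
// silence held in the state.
func leakedEntries(updates int) int {
	state := map[string]*silence{}
	cache := ptrCache{}

	state["1"] = &silence{ID: "1", Matchers: []string{`foo="bar"`}}
	cache.get(state["1"]) // a query populates the cache

	for i := 0; i < updates; i++ {
		// The update stores a brand-new instance under the same ID.
		state["1"] = &silence{ID: "1", Matchers: []string{`foo="bar"`}}
		cache.get(state["1"]) // the next query adds a second key
	}

	live := map[*silence]bool{}
	for _, s := range state {
		live[s] = true
	}
	leaked := 0
	for p := range cache {
		if !live[p] {
			leaked++
		}
	}
	return leaked
}

func main() {
	fmt.Println(leakedEntries(3)) // prints 3
}
```

Each update leaves one more key behind, and because the cache still references the old instance, neither it nor its matchers can be reclaimed.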

We've observed this issue in the real world running a slightly modified version of 0.26.0. In this PR, I've added a new test (TestSilenceGCOverTime) which fails when the matcherCache leaks entries. This test still fails when run against the latest code on main:

--- FAIL: TestSilenceGCOverTime (0.01s)
    --- FAIL: TestSilenceGCOverTime/silence_update_does_not_leak_state (0.00s)
        silence_test.go:218:
            	Error Trace:	/home/ehunter/nfs-de/oss/forks/alertmanager/silence/silence_test.go:218
            	Error:      	Not equal:
            	            	expected: 1
            	            	actual  : 2
            	Test:       	TestSilenceGCOverTime/silence_update_does_not_leak_state
            	Messages:   	there are extra entries in the matcher cache
FAIL
FAIL	github.com/prometheus/alertmanager/silence	0.040s
FAIL

There are a few ways to fix this, but I've chosen to modify matcherCache to use the silence UUID as the cache key instead of the pointer to the silence instance. I think this is the best fix because it removes the fragile assumption that the pb.Silence pointer will never change over the lifecycle of a silence and replaces it with the existing assumption that the silence's matchers will not change over the lifecycle of a silence. To state this a different way: this fix removes a required invariant and does not add any new ones.
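A rough sketch of the ID-keyed approach, under the same simplifications: silence, idCache, and run are hypothetical names, strings stand in for compiled matchers, and deleting the cache entry by ID during GC is an assumption of this sketch rather than a quote of the real implementation.

```go
package main

import "fmt"

// silence is a hypothetical stand-in for pb.Silence.
type silence struct {
	ID       string
	Matchers []string
}

// idCache mimics a matcher cache keyed by the silence ID instead of a pointer.
type idCache map[string][]string

// get returns the cached matchers for s, filling the cache on first use.
func (c idCache) get(s *silence) []string {
	if m, ok := c[s.ID]; ok {
		return m
	}
	c[s.ID] = append([]string(nil), s.Matchers...)
	return c[s.ID]
}

// run applies the given number of in-place updates to one silence, querying
// after each, then removes the silence as a GC pass might. It returns the
// cache size before and after that removal.
func run(updates int) (beforeGC, afterGC int) {
	state := map[string]*silence{}
	cache := idCache{}

	state["1"] = &silence{ID: "1", Matchers: []string{`foo="bar"`}}
	cache.get(state["1"])

	for i := 0; i < updates; i++ {
		// A new instance replaces the old one, but the ID (the key) is stable.
		state["1"] = &silence{ID: "1", Matchers: []string{`foo="bar"`}}
		cache.get(state["1"])
	}
	beforeGC = len(cache)

	// Assumed GC behaviour: drop state and cache entries by the same ID.
	delete(state, "1")
	delete(cache, "1")
	afterGC = len(cache)
	return beforeGC, afterGC
}

func main() {
	before, after := run(3)
	fmt.Println(before, after) // prints 1 0
}
```

The cache can only hold stale matchers here if matchers could change under a fixed ID, which is the invariant the description says already exists.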

This fix causes the new test cases to pass and has been running in our environment for a while without any problems.

I'm not 100% sure, but I suspect this is the root cause of #2659

@grobinson-grafana
Contributor

Hello! 👋 Thank you for opening this PR. I haven't had time to do an in-depth review, but my initial impression is that this is fantastic work! Thank you for tracking this down and also creating a fix.

I agree with the decision to change the key from a pointer to the UUID, and it is my understanding that you cannot change the matchers of a silence without creating a new UUID, therefore it should never be possible to read stale matchers from the cache.

I'll take an in-depth look later this week.

}}
}

cases := map[string]testCase{
Contributor

It's more idiomatic to write test cases like this

diff --git a/silence/silence_test.go b/silence/silence_test.go
index c290522a..7a1bf335 100644
--- a/silence/silence_test.go
+++ b/silence/silence_test.go
@@ -114,11 +114,6 @@ func TestSilenceGCOverTime(t *testing.T) {
                s                    *pb.Silence
                expectPresentAfterGc bool
        }
-       type testCase struct {
-               initialState    []silenceEntry
-               updates         []silenceEntry
-               expectedGCCount int
-       }

        c := clock.NewMock()
        now := c.Now().UTC()
@@ -133,35 +128,39 @@ func TestSilenceGCOverTime(t *testing.T) {
                        }}
        }

-       cases := map[string]testCase{
-               "gc does not clean active silences": {
-                       initialState: []silenceEntry{
-                               {s: newSilence("1", now), expectPresentAfterGc: false},
-                               {s: newSilence("2", now.Add(-time.Second)), expectPresentAfterGc: false},
-                               {s: newSilence("3", now.Add(time.Second)), expectPresentAfterGc: true},
-                       },
+       cases := []struct {
+               name            string
+               initialState    []silenceEntry
+               updates         []silenceEntry
+               expectedGCCount int
+       }{{
+               name: "gc does not clean active silences",
+               initialState: []silenceEntry{
+                       {s: newSilence("1", now), expectPresentAfterGc: false},
+                       {s: newSilence("2", now.Add(-time.Second)), expectPresentAfterGc: false},
+                       {s: newSilence("3", now.Add(time.Second)), expectPresentAfterGc: true},
                },
-               "silences added with Set are handled correctly": {
-                       initialState: []silenceEntry{
-                               {s: newSilence("1", now), expectPresentAfterGc: false},
-                       },
-                       updates: []silenceEntry{
-                               {s: newSilence("", now.Add(time.Second)), expectPresentAfterGc: true},
-                               {s: newSilence("", now.Add(-time.Second)), expectPresentAfterGc: false},
-                       },
+       }, {
+               name: "silences added with Set are handled correctly",
+               initialState: []silenceEntry{
+                       {s: newSilence("1", now), expectPresentAfterGc: false},
                },
-               "silence update does not leak state": {
-                       initialState: []silenceEntry{
-                               {s: newSilence("1", now), expectPresentAfterGc: false},
-                       },
-                       updates: []silenceEntry{
-                               {s: newSilence("1", now.Add(time.Second)), expectPresentAfterGc: true},
-                       },
+               updates: []silenceEntry{
+                       {s: newSilence("", now.Add(time.Second)), expectPresentAfterGc: true},
+                       {s: newSilence("", now.Add(-time.Second)), expectPresentAfterGc: false},
                },
-       }
+       }, {
+               name: "silence update does not leak state",
+               initialState: []silenceEntry{
+                       {s: newSilence("1", now), expectPresentAfterGc: false},
+               },
+               updates: []silenceEntry{
+                       {s: newSilence("1", now.Add(time.Second)), expectPresentAfterGc: true},
+               },
+       }}

-       for name, tc := range cases {
-               t.Run(name, func(t *testing.T) {
+       for _, tc := range cases {
+               t.Run(tc.name, func(t *testing.T) {
                        silences, err := New(Options{})
                        silClock := clock.NewMock()
                        silences.clock = silClock

// simulate this silences being seen in a query
silences.mc.Get(silences.st[sil.s.Id].Silence)
}
silClock.Add(-time.Second)
Contributor

I'm not sure I understand the significance of rewinding the clock, although I can see that if I comment this out the test fails. How does rewinding the clock help us test GC behavior?

Contributor Author

@Spaceman1701 Spaceman1701 Aug 6, 2024

Thanks for pointing this out! I did this because adding a silence with Set that expires before the clock's now is a no-op. Typically, this case is handled before we reach Set. However, rewinding the clock is a really unclear way to handle this.

Instead, I should've started with the clock 2 seconds behind and then incremented the clock forward by one second after the initialState is applied and then again after updates are applied. That has the exact same behavior, but is much more expressive to the reader.
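The clock handling described above can be sketched with a toy store. Everything here is hypothetical (fakeClock, store, and simulate are not Alertmanager types, and the real test uses a mock clock from a library); the only behaviour borrowed from the discussion is that setting an already-expired entry is a no-op.

```go
package main

import (
	"fmt"
	"time"
)

// fakeClock is a hypothetical minimal mock clock.
type fakeClock struct{ now time.Time }

func (c *fakeClock) Now() time.Time      { return c.now }
func (c *fakeClock) Add(d time.Duration) { c.now = c.now.Add(d) }

// store is a toy stand-in for the silence state, tracking only end times.
type store struct {
	clock  *fakeClock
	endsAt map[string]time.Time
}

// set ignores entries that have already ended, mirroring the no-op above.
func (s *store) set(id string, end time.Time) {
	if !end.After(s.clock.Now()) {
		return
	}
	s.endsAt[id] = end
}

// gc removes entries that have ended and reports how many were removed.
func (s *store) gc() int {
	n := 0
	for id, end := range s.endsAt {
		if !end.After(s.clock.Now()) {
			delete(s.endsAt, id)
			n++
		}
	}
	return n
}

// simulate starts the clock behindSeconds before a base time, adds three
// entries, moves the clock forward to the base time, and runs gc.
func simulate(behindSeconds int) (stored, collected int) {
	base := time.Date(2024, 8, 6, 0, 0, 0, 0, time.UTC)
	behind := time.Duration(behindSeconds) * time.Second
	c := &fakeClock{now: base.Add(-behind)}
	s := &store{clock: c, endsAt: map[string]time.Time{}}

	s.set("1", base)
	s.set("2", base.Add(-time.Second))
	s.set("3", base.Add(time.Second))

	c.Add(behind) // only ever move time forward
	return len(s.endsAt), s.gc()
}

func main() {
	fmt.Println(simulate(2)) // prints 3 2
	fmt.Println(simulate(0)) // prints 1 0
}
```

Starting two seconds behind lets all three entries be stored before two of them expire; starting at the base time drops the expired ones at set time, so GC has nothing to collect.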

Contributor Author

@Spaceman1701 Spaceman1701 left a comment

Thanks for the review! I will make all the requested stylistic changes and the change to how the test clock is handled.

Contributor

@grobinson-grafana grobinson-grafana left a comment

I think the code is fine, but the tests are still very hard to follow and understand. I'm going to make an attempt at refactoring them further in a local branch.

func TestSilenceGCOverTime(t *testing.T) {
type silenceEntry struct {
s *pb.Silence
expectPresentAfterGc bool
Contributor

I think this is more complicated than it needs to be. Much easier if we invert the bool.

Suggested change:
- expectPresentAfterGc bool
+ expectGC bool

silences.clock = silClock

// Set time into the past so that silences will be updated
// before they're endsAt
Contributor

Suggested change:
- // before they're endsAt
+ // before their endsAt

name string
initialState []silenceEntry
updates []silenceEntry
expectedGCCount int
Contributor

expectedGCCount is not used?

Contributor Author

it is used on line 194

Contributor

No that's a different variable with the same name.

},
},
{
name: "silences added with Set are handled correctly",
Contributor

"handled correctly"

What does this mean?

Contributor Author

Ah this case name isn't great - in this case "handled correctly" just means "all the invariants we're testing remain satisfied"

Contributor

"all the invariants we're testing remain satisfied"

Yeah but what are the invariants being tested? I'm sort of coming at this from the perspective of someone who hasn't reviewed this PR but needs to understand the tests. It's so hard to understand what is being tested here from looking at the test case:

{
	name: "silences added with Set are handled correctly",
	initialState: []silenceEntry{
		{s: newSilence("1", now), expectPresentAfterGc: false},
	},
	updates: []silenceEntry{
		{s: newSilence("", now.Add(time.Second)), expectPresentAfterGc: true},
		{s: newSilence("", now.Add(-time.Second)), expectPresentAfterGc: false},
	},
},

@grobinson-grafana
Contributor

This is how I would make the tests simpler. Let me know if I'm missing a test case.

func TestSilenceGCOverTime(t *testing.T) {
	t.Run("GC does not remove active silences", func(t *testing.T) {
		s, err := New(Options{})
		require.NoError(t, err)
		s.clock = clock.NewMock()
		now := s.nowUTC()
		s.st = state{
			"1": &pb.MeshSilence{Silence: &pb.Silence{Id: "1"}, ExpiresAt: now},
			"2": &pb.MeshSilence{Silence: &pb.Silence{Id: "2"}, ExpiresAt: now.Add(-time.Second)},
			"3": &pb.MeshSilence{Silence: &pb.Silence{Id: "3"}, ExpiresAt: now.Add(time.Second)},
		}
		want := state{
			"3": &pb.MeshSilence{Silence: &pb.Silence{Id: "3"}, ExpiresAt: now.Add(time.Second)},
		}
		n, err := s.GC()
		require.NoError(t, err)
		require.Equal(t, 2, n)
		require.Equal(t, want, s.st)
	})

	// This test checks for a memory leak that occurred in the matcher cache when
	// updating an existing silence.
	t.Run("Updating an existing silence does not leak cache entries", func(t *testing.T) {
		s, err := New(Options{})
		require.NoError(t, err)
		clock := clock.NewMock()
		s.clock = clock
		sil1 := &pb.Silence{
			Id: "1",
			Matchers: []*pb.Matcher{{
				Type:    pb.Matcher_EQUAL,
				Name:    "foo",
				Pattern: "bar",
			}},
			StartsAt: clock.Now(),
			EndsAt:   clock.Now().Add(time.Minute),
		}
		s.st["1"] = &pb.MeshSilence{Silence: sil1, ExpiresAt: clock.Now().Add(time.Minute)}
		// Need to query the silence to populate the matcher cache.
		s.Query(QMatches(model.LabelSet{"foo": "bar"}))
		require.Len(t, s.mc, 1)
		// must clone sil1 before updating it.
		sil2 := cloneSilence(sil1)
		require.NoError(t, s.Set(sil2))
		// The memory leak occurred because updating a silence would add a new
		// entry in the matcher cache even though no new silence was created.
		// This check asserts that this no longer happens.
		require.Len(t, s.st, 1)
		require.Len(t, s.mc, 1)
		// Move time forward and both silence and cache entry should be garbage
		// collected.
		clock.Add(time.Minute)
		n, err := s.GC()
		require.NoError(t, err)
		require.Equal(t, 1, n)
		require.Len(t, s.st, 0)
		require.Len(t, s.mc, 0)
	})
}

@Spaceman1701
Contributor Author

Spaceman1701 commented Aug 19, 2024

This is how I would make the tests simpler. Let me know if I'm missing a test case.

This looks good to me, but it does make it a bit harder to add new cases in the future. There's no test here which actually validates that the GC runs as expected with silences added via the normal Set method, but this might not be important enough to worry about.

The GC does not remove active silences case is also missing a validation that the matcher cache length is correct after all the operations.

Would you like me to replace my test in this PR with this new implementation?

@grobinson-grafana
Contributor

This looks good to me, but it does make it a bit harder to add new cases in the future.

I think writing these specific tests as table-driven tests makes them more difficult to understand. There are lots of cases where table-driven tests do make a lot of sense, but I don't think this is one of those. For example, my comment here highlights what I mean.

There's no test here which actually validates that the GC runs as expected with silences added via the normal Set method, but this might not be important enough to worry about.

We can fix that 👍

The GC does not remove active silences case is also missing a validation that the matcher cache length is correct after all the operations.

👍

Would you like me to replace my test in this PR with this new implementation?

Yes please! Let me first add a new comment that has the updated tests including your feedback 👍

@Spaceman1701
Contributor Author

Spaceman1701 commented Aug 19, 2024

I think writing these specific tests as table-driven tests makes them more difficult to understand. There are lots of cases where table-driven tests do make a lot of sense, but I don't think this is one of those. For example, my comment #3930 (comment) highlights what I mean.

Sure, I think that's fair enough. The way you've rewritten them does seem more readable to me as well. Regardless, I'm very happy to conform to whatever is conventional for Alertmanager.

Yes please! Let me first add a new comment that has the updated tests including your feedback 👍

Alright, great. Thanks!

@grobinson-grafana
Contributor

func TestSilenceGCOverTime(t *testing.T) {
	t.Run("GC does not remove active silences", func(t *testing.T) {
		s, err := New(Options{})
		require.NoError(t, err)
		s.clock = clock.NewMock()
		now := s.nowUTC()
		s.st = state{
			"1": &pb.MeshSilence{Silence: &pb.Silence{Id: "1"}, ExpiresAt: now},
			"2": &pb.MeshSilence{Silence: &pb.Silence{Id: "2"}, ExpiresAt: now.Add(-time.Second)},
			"3": &pb.MeshSilence{Silence: &pb.Silence{Id: "3"}, ExpiresAt: now.Add(time.Second)},
		}
		want := state{
			"3": &pb.MeshSilence{Silence: &pb.Silence{Id: "3"}, ExpiresAt: now.Add(time.Second)},
		}
		n, err := s.GC()
		require.NoError(t, err)
		require.Equal(t, 2, n)
		require.Equal(t, want, s.st)
	})

	t.Run("GC does not leak cache entries", func(t *testing.T) {
		s, err := New(Options{})
		require.NoError(t, err)
		clock := clock.NewMock()
		s.clock = clock
		sil1 := &pb.Silence{
			Matchers: []*pb.Matcher{{
				Type:    pb.Matcher_EQUAL,
				Name:    "foo",
				Pattern: "bar",
			}},
			StartsAt: clock.Now(),
			EndsAt:   clock.Now().Add(time.Minute),
		}
		require.NoError(t, s.Set(sil1))
		// Need to query the silence to populate the matcher cache.
		s.Query(QMatches(model.LabelSet{"foo": "bar"}))
		require.Len(t, s.st, 1)
		require.Len(t, s.mc, 1)
		// Move time forward and both silence and cache entry should be garbage
		// collected.
		clock.Add(time.Minute)
		n, err := s.GC()
		require.NoError(t, err)
		require.Equal(t, 1, n)
		require.Len(t, s.st, 0)
		require.Len(t, s.mc, 0)
	})

	t.Run("replacing a silence does not leak cache entries", func(t *testing.T) {
		s, err := New(Options{})
		require.NoError(t, err)
		clock := clock.NewMock()
		s.clock = clock
		sil1 := &pb.Silence{
			Matchers: []*pb.Matcher{{
				Type:    pb.Matcher_EQUAL,
				Name:    "foo",
				Pattern: "bar",
			}},
			StartsAt: clock.Now(),
			EndsAt:   clock.Now().Add(time.Minute),
		}
		require.NoError(t, s.Set(sil1))
		// Need to query the silence to populate the matcher cache.
		s.Query(QMatches(model.LabelSet{"foo": "bar"}))
		require.Len(t, s.st, 1)
		require.Len(t, s.mc, 1)
		// must clone sil1 before replacing it.
		sil2 := cloneSilence(sil1)
		sil2.Matchers = []*pb.Matcher{{
			Type:    pb.Matcher_EQUAL,
			Name:    "bar",
			Pattern: "baz",
		}}
		require.NoError(t, s.Set(sil2))
		// Need to query the silence to populate the matcher cache.
		s.Query(QMatches(model.LabelSet{"bar": "baz"}))
		require.Len(t, s.st, 2)
		require.Len(t, s.mc, 2)
		// Move time forward and both silence and cache entry should be garbage
		// collected.
		clock.Add(time.Minute)
		n, err := s.GC()
		require.NoError(t, err)
		require.Equal(t, 2, n)
		require.Len(t, s.st, 0)
		require.Len(t, s.mc, 0)
	})

	// This test checks for a memory leak that occurred in the matcher cache when
	// updating an existing silence.
	t.Run("updating a silence does not leak cache entries", func(t *testing.T) {
		s, err := New(Options{})
		require.NoError(t, err)
		clock := clock.NewMock()
		s.clock = clock
		sil1 := &pb.Silence{
			Id: "1",
			Matchers: []*pb.Matcher{{
				Type:    pb.Matcher_EQUAL,
				Name:    "foo",
				Pattern: "bar",
			}},
			StartsAt: clock.Now(),
			EndsAt:   clock.Now().Add(time.Minute),
		}
		s.st["1"] = &pb.MeshSilence{Silence: sil1, ExpiresAt: clock.Now().Add(time.Minute)}
		// Need to query the silence to populate the matcher cache.
		s.Query(QMatches(model.LabelSet{"foo": "bar"}))
		require.Len(t, s.mc, 1)
		// must clone sil1 before updating it.
		sil2 := cloneSilence(sil1)
		require.NoError(t, s.Set(sil2))
		// The memory leak occurred because updating a silence would add a new
		// entry in the matcher cache even though no new silence was created.
		// This check asserts that this no longer happens.
		require.Len(t, s.st, 1)
		require.Len(t, s.mc, 1)
		// Move time forward and both silence and cache entry should be garbage
		// collected.
		clock.Add(time.Minute)
		n, err := s.GC()
		require.NoError(t, err)
		require.Equal(t, 1, n)
		require.Len(t, s.st, 0)
		require.Len(t, s.mc, 0)
	})
}

// The memory leak occurred because updating a silence would add a new
// entry in the matcher cache even though no new silence was created.
// This check asserts that this no longer happens.
s.Query(QMatches(model.LabelSet{"foo": "bar"}))
Contributor Author

I added this line because the memory leak only occurs if the matcher cache is populated because of a query.

Contributor

@grobinson-grafana grobinson-grafana left a comment

@gotjosh @simonpasquier I approve this fix. Please take a look so we can get it into the next release! Thanks 💯

@grobinson-grafana
Contributor

Sorry, there are a couple of lint failures where we need to use require.Empty. Can you make those fixes so lint passes?

@grobinson-grafana
Contributor

Can you also rebase onto the main branch? It looks like the frontend tests are failing, which is unrelated to your changes.

@Spaceman1701
Contributor Author

Can you also rebase onto the main branch? It looks like the frontend tests are failing, which is unrelated to your changes.

It looks like that test is still broken - is it possible that it's broken on main? As far as I can tell, the diff between this branch and main doesn't include any frontend files.

@grobinson-grafana
Contributor

Can you also rebase onto the main branch? It looks like the frontend tests are failing, which is unrelated to your changes.

It looks like that test is still broken - is it possible that it's broken on main? As far as I can tell, the diff between this branch and main doesn't include any frontend files.

Looks like all PRs are broken due to frontend tests. We will fix it 👍 You'll need to rebase a second time once it's fixed.

@gotjosh gotjosh merged commit 3e6356b into prometheus:main Aug 21, 2024
11 checks passed
@gotjosh
Member

gotjosh commented Aug 21, 2024

Thank you very much for your contribution and @grobinson-grafana for reviewing.

SuperQ added a commit that referenced this pull request Oct 16, 2024
* [CHANGE] Deprecate and remove api/v1/ #2970
* [CHANGE] Remove unused feature flags #3676
* [CHANGE] Newlines in smtp password file are now ignored #3681
* [CHANGE] Change compat metrics to counters #3686
* [CHANGE] Do not register compat metrics in amtool #3713
* [CHANGE] Remove metrics from compat package #3714
* [CHANGE] Mark muted alerts #3793
* [FEATURE] Add metric for inhibit rules #3681
* [FEATURE] Support UTF-8 label matchers #3453, #3507, #3523, #3483, #3567, #3568, #3569, #3571, #3595, #3604, #3619, #3658, #3659, #3662, #3668, 3572
* [FEATURE] Add counter to track alerts dropped outside of time_intervals #3565
* [FEATURE] Add date and tz functions to templates #3812
* [FEATURE] Add limits for silences #3852
* [FEATURE] Add time helpers for templates #3863
* [FEATURE] Add auto GOMAXPROCS #3837
* [FEATURE] Add auto GOMEMLIMIT #3895
* [FEATURE] Add Jira receiver integration #3590
* [ENHANCEMENT] Add the receiver name to notification metrics #3045
* [ENHANCEMENT] Add the route ID to uuid #3372
* [ENHANCEMENT] Add duration to the notify success message #3559
* [ENHANCEMENT] Implement webhook_url_file for discord and msteams #3555
* [ENHANCEMENT] Add debug logs for muted alerts #3558
* [ENHANCEMENT] API: Allow the Silences API to use their own 400 response #3610
* [ENHANCEMENT] Add summary to msteams notification #3616
* [ENHANCEMENT] Add context reasons to notifications failed counter #3631
* [ENHANCEMENT] Add optional native histogram support to latency metrics #3737
* [ENHANCEMENT] Enable setting ThreadId for Telegram notifications #3638
* [ENHANCEMENT] Allow webex roomID from template #3801
* [BUGFIX] Add missing integrations to notify metrics #3480
* [BUGFIX] Add missing ttl in pushhover #3474
* [BUGFIX] Fix scheme required for webhook url in amtool #3409
* [BUGFIX] Remove duplicate integration from metrics #3516
* [BUGFIX] Reflect Discord's max length message limits #3597
* [BUGFIX] Fix nil error in warn logs about incompatible matchers #3683
* [BUGFIX] Fix a small number of inconsistencies in compat package logging #3718
* [BUGFIX] Fix log line in featurecontrol #3719
* [BUGFIX] Fix panic in acceptance tests #3592
* [BUGFIX] Fix flaky test TestClusterJoinAndReconnect/TestTLSConnection #3722
* [BUGFIX] Fix crash on errors when url_file is used #3800
* [BUGFIX] Fix race condition in dispatch.go #3826
* [BUGFIX] Fix race conditions in the memory alerts store #3648
* [BUGFIX] Hide config.SecretURL when the URL is incorrect. #3887
* [BUGFIX] Fix invalid silence causes incomplete updates #3898
* [BUGFIX] Fix leaking of Silences matcherCache entries #3930
* [BUGFIX] Close SMTP submission correctly to handle errors #4006

Signed-off-by: SuperQ <[email protected]>
@SuperQ SuperQ mentioned this pull request Oct 16, 2024
gotjosh added a commit that referenced this pull request Oct 24, 2024
* Release v0.28.0-rc.0

* [CHANGE] Templating errors in the SNS integration now return an error. #3531 #3879
* [FEATURE] Add a new Microsoft Teams integration based on Flows #4024
* [FEATURE] Add a new Rocket.Chat integration #3600
* [FEATURE] Add a new Jira integration #3590 #3931
* [FEATURE] Add support for `GOMEMLIMIT`, enable it via the feature flag `--enable-feature=auto-gomemlimit`. #3895
* [FEATURE] Add support for `GOMAXPROCS`, enable it via the feature flag `--enable-feature=auto-gomaxprocs`. #3837
* [FEATURE] Add support for limits of silences including the maximum number of active and pending silences, and the maximum size per silence (in bytes). You can use the flags `--silences.max-silences` and `--silences.max-silence-size-bytes` to set them accordingly #3852 #3862 #3866 #3885 #3886 #3877
* [FEATURE] Muted alerts now show whether they are suppressed or not in both the `/api/v2/alerts` endpoint and the Alertmanager UI. #3793 #3797 #3792
* [ENHANCEMENT] Add support for `content`, `username` and `avatar_url` in the Discord integration. `content` and `username` also support templating. #4007
* [ENHANCEMENT] Only invalidate the silences cache if a new silence is created or an existing silence replaced - should improve latency on both `GET api/v2/alerts` and `POST api/v2/alerts` API endpoint. #3961
* [ENHANCEMENT] Add image source label to Dockerfile. To get changelogs shown when using Renovate #4062
* [ENHANCEMENT] Build using go 1.23 #4071
* [ENHANCEMENT] Support setting a global SMTP TLS configuration. #3732
* [ENHANCEMENT] The setting `room_id` in the WebEx integration can now be templated to allow for dynamic room IDs. #3801
* [ENHANCEMENT] Enable setting `message_thread_id` for the Telegram integration. #3638
* [ENHANCEMENT] Support the `since` and `humanizeDuration` functions to templates. This means users can now format time to more human-readable text. #3863
* [ENHANCEMENT] Support the `date` and `tz` functions to templates. This means users can now format time in a specified format and also change the timezone to their specific locale. #3812
* [ENHANCEMENT] Latency metrics now support native histograms. #3737
* [BUGFIX] Fix the SMTP integration not correctly closing an SMTP submission, which may lead to unsuccessful dispatches being marked as successful. #4006
* [BUGFIX]  The `ParseMode` option is now set explicitly in the Telegram integration. If we don't HTML tags had not been parsed by default. #4027
* [BUGFIX] Fix a memory leak that was caused by updates silences continuously. #3930
* [BUGFIX] Fix hiding secret URLs when the URL is incorrect. #3887
* [BUGFIX] Fix a race condition in the alerts - it was more of a hypothetical race condition that could have occurred in the alert reception pipeline. #3648
* [BUGFIX] Fix a race condition in the alert delivery pipeline that would cause a firing alert that was delivered earlier to be deleted from the aggregation group when instead it should have been delivered again. #3826
* [BUGFIX] Fix version in APIv1 deprecation notice. #3815
* [BUGFIX] Fix crash errors when using `url_file` in the Webhook integration. #3800
* [BUGFIX] fix `Route.ID()` returns conflicting IDs. #3803
* [BUGFIX] Fix deadlock on the alerts memory store. #3715
* [BUGFIX] Fix `amtool template render` when using the default values. #3725
* [BUGFIX] Fix `webhook_url_file` for both the Discord and Microsoft Teams integrations. #3728 #3745

---------

Signed-off-by: SuperQ <[email protected]>
Signed-off-by: gotjosh <[email protected]>
Co-authored-by: gotjosh <[email protected]>
SuperQ added a commit that referenced this pull request Dec 19, 2024
* [CHANGE] Templating errors in the SNS integration now return an error. #3531 #3879
* [CHANGE] Adopt log/slog, drop go-kit/log #4089
* [FEATURE] Add a new Microsoft Teams integration based on Flows #4024
* [FEATURE] Add a new Rocket.Chat integration #3600
* [FEATURE] Add a new Jira integration #3590 #3931
* [FEATURE] Add support for `GOMEMLIMIT`, enable it via the feature flag `--enable-feature=auto-gomemlimit`. #3895
* [FEATURE] Add support for `GOMAXPROCS`, enable it via the feature flag `--enable-feature=auto-gomaxprocs`. #3837
* [FEATURE] Add support for limits of silences including the maximum number of active and pending silences, and the maximum size per silence (in bytes). You can use the flags `--silences.max-silences` and `--silences.max-silence-size-bytes` to set them accordingly #3852 #3862 #3866 #3885 #3886 #3877
* [FEATURE] Muted alerts now show whether they are suppressed or not in both the `/api/v2/alerts` endpoint and the Alertmanager UI. #3793 #3797 #3792
* [ENHANCEMENT] Add support for `content`, `username` and `avatar_url` in the Discord integration. `content` and `username` also support templating. #4007
* [ENHANCEMENT] Only invalidate the silences cache if a new silence is created or an existing silence replaced - should improve latency on both `GET api/v2/alerts` and `POST api/v2/alerts` API endpoint. #3961
* [ENHANCEMENT] Add image source label to Dockerfile. To get changelogs shown when using Renovate #4062
* [ENHANCEMENT] Build using go 1.23 #4071
* [ENHANCEMENT] Support setting a global SMTP TLS configuration. #3732
* [ENHANCEMENT] The setting `room_id` in the WebEx integration can now be templated to allow for dynamic room IDs. #3801
* [ENHANCEMENT] Enable setting `message_thread_id` for the Telegram integration. #3638
* [ENHANCEMENT] Support the `since` and `humanizeDuration` functions in templates. This means users can now format time as more human-readable text. #3863
* [ENHANCEMENT] Support the `date` and `tz` functions in templates. This means users can now format time in a specified format and also change the timezone to their specific locale. #3812
* [ENHANCEMENT] Latency metrics now support native histograms. #3737
* [ENHANCEMENT] Add timeout option for webhook notifier. #4137
* [BUGFIX] Fix the SMTP integration not correctly closing an SMTP submission, which may lead to unsuccessful dispatches being marked as successful. #4006
* [BUGFIX] The `ParseMode` option is now set explicitly in the Telegram integration. Without it, HTML tags were not parsed by default. #4027
* [BUGFIX] Fix a memory leak that was caused by updating silences continuously. #3930
* [BUGFIX] Fix hiding secret URLs when the URL is incorrect. #3887
* [BUGFIX] Fix a largely hypothetical race condition that could have occurred in the alert reception pipeline. #3648
* [BUGFIX] Fix a race condition in the alert delivery pipeline that would cause a firing alert that was delivered earlier to be deleted from the aggregation group when instead it should have been delivered again. #3826
* [BUGFIX] Fix version in APIv1 deprecation notice. #3815
* [BUGFIX] Fix crash errors when using `url_file` in the Webhook integration. #3800
* [BUGFIX] Fix `Route.ID()` returning conflicting IDs. #3803
* [BUGFIX] Fix deadlock on the alerts memory store. #3715
* [BUGFIX] Fix `amtool template render` when using the default values. #3725
* [BUGFIX] Fix `webhook_url_file` for both the Discord and Microsoft Teams integrations. #3728 #3745
* [BUGFIX] Fix wechat api link #4084
* [BUGFIX] Fix build info metric #4166

Signed-off-by: SuperQ <[email protected]>
@SuperQ SuperQ mentioned this pull request Dec 19, 2024
SuperQ added a commit that referenced this pull request Jan 15, 2025
* [CHANGE] Templating errors in the SNS integration now return an error. #3531 #3879
* [CHANGE] Adopt log/slog, drop go-kit/log #4089
* [FEATURE] Add a new Microsoft Teams integration based on Flows #4024
* [FEATURE] Add a new Rocket.Chat integration #3600
* [FEATURE] Add a new Jira integration #3590 #3931
* [FEATURE] Add support for `GOMEMLIMIT`, enable it via the feature flag `--enable-feature=auto-gomemlimit`. #3895
* [FEATURE] Add support for `GOMAXPROCS`, enable it via the feature flag `--enable-feature=auto-gomaxprocs`. #3837
* [FEATURE] Add support for limits of silences including the maximum number of active and pending silences, and the maximum size per silence (in bytes). You can use the flags `--silences.max-silences` and `--silences.max-silence-size-bytes` to set them accordingly #3852 #3862 #3866 #3885 #3886 #3877
* [FEATURE] Muted alerts now show whether they are suppressed or not in both the `/api/v2/alerts` endpoint and the Alertmanager UI. #3793 #3797 #3792
* [ENHANCEMENT] Add support for `content`, `username` and `avatar_url` in the Discord integration. `content` and `username` also support templating. #4007
* [ENHANCEMENT] Only invalidate the silences cache if a new silence is created or an existing silence is replaced. This should improve latency on both the `GET api/v2/alerts` and `POST api/v2/alerts` API endpoints. #3961
* [ENHANCEMENT] Add image source label to Dockerfile so that changelogs are shown when using Renovate. #4062
* [ENHANCEMENT] Build using go 1.23 #4071
* [ENHANCEMENT] Support setting a global SMTP TLS configuration. #3732
* [ENHANCEMENT] The setting `room_id` in the WebEx integration can now be templated to allow for dynamic room IDs. #3801
* [ENHANCEMENT] Enable setting `message_thread_id` for the Telegram integration. #3638
* [ENHANCEMENT] Support the `since` and `humanizeDuration` functions in templates. This means users can now format time as more human-readable text. #3863
* [ENHANCEMENT] Support the `date` and `tz` functions in templates. This means users can now format time in a specified format and also change the timezone to their specific locale. #3812
* [ENHANCEMENT] Latency metrics now support native histograms. #3737
* [ENHANCEMENT] Add full width to adaptive card for msteamsv2 #4135
* [ENHANCEMENT] Add timeout option for webhook notifier. #4137
* [ENHANCEMENT] Update config to allow showing secret values when marshaled #4158
* [ENHANCEMENT] Enable templating for Jira project and issue_type #4159
* [BUGFIX] Fix the SMTP integration not correctly closing an SMTP submission, which may lead to unsuccessful dispatches being marked as successful. #4006
* [BUGFIX] The `ParseMode` option is now set explicitly in the Telegram integration. Without it, HTML tags were not parsed by default. #4027
* [BUGFIX] Fix a memory leak that was caused by updating silences continuously. #3930
* [BUGFIX] Fix hiding secret URLs when the URL is incorrect. #3887
* [BUGFIX] Fix a largely hypothetical race condition that could have occurred in the alert reception pipeline. #3648
* [BUGFIX] Fix a race condition in the alert delivery pipeline that would cause a firing alert that was delivered earlier to be deleted from the aggregation group when instead it should have been delivered again. #3826
* [BUGFIX] Fix version in APIv1 deprecation notice. #3815
* [BUGFIX] Fix crash errors when using `url_file` in the Webhook integration. #3800
* [BUGFIX] Fix `Route.ID()` returning conflicting IDs. #3803
* [BUGFIX] Fix deadlock on the alerts memory store. #3715
* [BUGFIX] Fix `amtool template render` when using the default values. #3725
* [BUGFIX] Fix `webhook_url_file` for both the Discord and Microsoft Teams integrations. #3728 #3745
* [BUGFIX] Fix wechat api link #4084
* [BUGFIX] Fix build info metric #4166
* [BUGFIX] Fix UTF-8 not allowed in Equal field for inhibition rules #4177

Signed-off-by: SuperQ <[email protected]>