Resetting alert levels #740

minhdanh · 2016-07-22T04:46:07Z

In regard to this issue #614
This pull request should enable lowering alert levels using lamda expressions.
Only WARN and CRIT levels are supported, INFO level is not.

I'm still trying to come up with some tests for this.

Rebased/mergable
Tests pass
CHANGELOG.md updated
Sign CLA (if not already signed)

minhdanh · 2016-07-22T12:16:01Z

This is what I think of, however I believe it needs more improvements. Will make some more changes in the next days : )

nathanielc · 2016-07-22T15:31:06Z

@minhdanh Looks like a good start. FYI #732 has been merged so you can rebase your PR to see only your changes.

nathanielc · 2016-07-22T15:31:52Z

alert.go

@@ -116,6 +116,8 @@ type AlertNode struct {
 	critsTriggered  *expvar.Int

 	bufPool sync.Pool
+
+	level_resets []stateful.Expression


Go is an opinionated language and prefers camelCase names. Can you change this to levelResets?

Yup. I've changed the variables to camelCase format. : )

minhdanh · 2016-07-29T09:36:17Z

@nathanielc I've just pushed a test with its sample data. Please have a look at them to see it's the expected behavior of this feature.
There's one thing confuses me, as you said:
Meaning the determineLevel would check for any reset expressions for the current level first and only evaluate other expressions if the reset expression is false.
Why would it only evaluate other expressions if the reset expression is false? I mean why not true?
For example I have the following rules in a TICK script:

.info(lambda: "value" > 40)
.infoReset(lambda: "value" < 10)
.warn(lambda: "value" > 60)
.warnReset(lambda: "value" < 20)
.crit(lambda: "value" > 80)
.critReset(lambda: "value" < 30)

Say first we have a data point with a value of 70, that will cause the alert level to be WARNING. Next if we have a data point with a value of 15, the reset expression of the WARNING level will be evaluated. It would return true (15 < 20 = true). Here we'll want the alert to escape the WARNING state so that the other expressions can be evaluated.
Or let's say we have a value of 40 instead of 15. It would cause the reset evaluation to return false (40 < 20 = false). At this point wouldn't we want the determineLevel function to return the current level which is WARNING?

Let's examine another data point with a value of 85 instead of 15 in the above case. Now the reset expression of the WARNING level would return false (85 < 20 = false). Here if we return the current level (WARNING), it will be incorrect as the level now should be CRITICAL, meaning the crit expression should be evaluated as well. So I think in case a reset expression returns false, we would check if the most highest level possible is greater than the current level. If it is, we return that level instead of the current level. If it's not, we just return the current level.

nathanielc · 2016-08-01T19:56:21Z

@minhdanh Thanks for the detailed example. I think we are on the same page. Your example makes sense to me, (I probably inverted the logic earlier when I was talking about the true/false value of the reset expression)

To make sure we are on the same page here are a few more examples:

.info(lambda: "value" > 40)
.infoReset(lambda: "value" < 10)
.warn(lambda: "value" > 60)
.warnReset(lambda: "value" < 20)
.crit(lambda: "value" > 80)
.critReset(lambda: "value" < 30)

Given the above criteria the following data points create the following alert levels.

45 40 30 9
In In In OK

45 61 30 19
In Wn Wn OK

45 61 81 29
In Wn Cr OK

Now for a slightly different example

.info(lambda: "value" > 40)
.infoReset(lambda: "value" < 30)
.warn(lambda: "value" > 60)
.warnReset(lambda: "value" < 50)
.crit(lambda: "value" > 80)
.critReset(lambda: "value" < 70)

45 40 30 29
In In In OK

45 61 49 29
In Wn In OK

45 61 81 69 50 41 25
In Wn Cr Wn Wn In OK

This is how I would expect it to work.

minhdanh · 2016-08-02T12:23:07Z

@nathanielc I updated the test and its sample data with the values in your first example. I could make the test pass on my machine. Should I write another test for the latter example, or just add another tick script and call the same functions in TestStreamer_Alert_WithReset function?

nathanielc · 2016-08-02T14:47:57Z

@minhdanh I'd prefer each test to test a single aspect of the behavior, so creating another test is fine.

I took a quick look at the tests and they look good.

minhdanh · 2016-08-03T04:13:39Z

integrations/streamer_test.go

@@ -4110,7 +4110,7 @@ stream
 	}
 }

-func TestStream_Alert_WithReset(t *testing.T) {
+func TestStream_Alert_WithReset_0(t *testing.T) {
 	requestCount := int32(0)


@nathanielc I'm not sure how to name two test functions of the same feature so I split them into TestStream_Alert_WithReset_0 and TestStream_Alert_WithReset_1. What do you think?

minhdanh · 2016-08-04T06:28:27Z

@nathanielc I've implemented your new way. All the tests pass except TestStream_AlertDuration.
I tried to track the problem down but I could not go further. It seems to be a problem with evaluating the normal expressions (f54b0a2#diff-d02710009bb4a30f396667696f963627R620)

Here is the output for that test:

[edge:task_master:testStreamer|TestStream_AlertDuration->stream] 2016/08/04 13:16:51 D! closing c: 12 e: 10
[edge:task_master:testStreamer|write_points->stream] 2016/08/04 13:16:51 D! closing c: 0 e: 0
[edge:TestStream_AlertDuration|stream->stream0] 2016/08/04 13:16:51 D! closing c: 12 e: 12
[edge:TestStream_AlertDuration|stream0->from1] 2016/08/04 13:16:51 D! closing c: 12 e: 1
[edge:TestStream_AlertDuration|from1->alert2] 2016/08/04 13:16:51 D! closing c: 12 e: 0
[TestStream_AlertDuration:alert2] 2016/08/04 13:16:51 D! CRITICAL alert triggered id:kapacitor/cpu/serverA msg:kapacitor/cpu/serverA is CRITICAL data:&{cpu map[host:serverA type:idle] [time value] [[1971-01-01 00:00:00 +0000 UTC 9]] <nil>}
[TestStream_AlertDuration:alert2] 2016/08/04 13:16:51 D! OK alert triggered id:kapacitor/cpu/serverA msg:kapacitor/cpu/serverA is OK data:&{cpu map[type:idle host:serverA] [time value] [[1971-01-01 00:00:02 +0000 UTC 8]] <nil>}
[TestStream_AlertDuration:alert2] 2016/08/04 13:16:51 D! WARNING alert triggered id:kapacitor/cpu/serverA msg:kapacitor/cpu/serverA is WARNING data:&{cpu map[host:serverA type:idle] [time value] [[1971-01-01 00:00:03 +0000 UTC 8]] <nil>}
[TestStream_AlertDuration:alert2] 2016/08/04 13:16:51 D! OK alert triggered id:kapacitor/cpu/serverA msg:kapacitor/cpu/serverA is OK data:&{cpu map[type:idle host:serverA] [time value] [[1971-01-01 00:00:04 +0000 UTC 6]] <nil>}
[TestStream_AlertDuration:alert2] 2016/08/04 13:16:51 D! WARNING alert triggered id:kapacitor/cpu/serverA msg:kapacitor/cpu/serverA is WARNING data:&{cpu map[host:serverA type:idle] [time value] [[1971-01-01 00:00:05 +0000 UTC 8]] <nil>}
[TestStream_AlertDuration:alert2] 2016/08/04 13:16:51 D! OK alert triggered id:kapacitor/cpu/serverA msg:kapacitor/cpu/serverA is OK data:&{cpu map[host:serverA type:idle] [time value] [[1971-01-01 00:00:08 +0000 UTC 3]] <nil>}
[edge:TestStream_AlertDuration|alert2->http_out3] 2016/08/04 13:16:51 D! closing c: 6 e: 5
[httpd] ::1 - - [04/Aug/2016:13:16:51 +0700] "GET /kapacitor/v1/tasks/TestStream_AlertDuration/TestStream_AlertDuration HTTP/1.1" 200 144 "-" "Go-http-client/1.1" 0880f51a-5a0b-11e6-8088-000000000000 1584
[task_master:testStreamer] 2016/08/04 13:16:51 I! Stopped task: TestStream_AlertDuration
[task_master:testStreamer] 2016/08/04 13:16:51 I! closed
--- FAIL: TestStream_AlertDuration (0.01s)
        streamer_test.go:6512: digraph TestStream_AlertDuration {
                stream0 -> from1;
                from1 -> alert2;
                alert2 -> http_out3;
                }
        streamer_test.go:4980: unexpected alert data for request: 2
                got {kapacitor/cpu/serverA kapacitor/cpu/serverA is OK details 1971-01-01 00:00:02 +0000 UTC 2s OK {0 [] [] <nil>} {{{   map[]}  map[]  {0 0 <nil>}} }}
                exp {kapacitor/cpu/serverA kapacitor/cpu/serverA is WARNING details 1971-01-01 00:00:02 +0000 UTC 2s WARNING {0 [] [] <nil>} {{{   map[]}  map[]  {0 0 <nil>}} }}
        streamer_test.go:4980: unexpected alert data for request: 3
                got {kapacitor/cpu/serverA kapacitor/cpu/serverA is WARNING details 1971-01-01 00:00:03 +0000 UTC 0 WARNING {0 [] [] <nil>} {{{   map[]}  map[]  {0 0 <nil>}} }}
                exp {kapacitor/cpu/serverA kapacitor/cpu/serverA is OK details 1971-01-01 00:00:04 +0000 UTC 4s OK {0 [] [] <nil>} {{{   map[]}  map[]  {0 0 <nil>}} }}
        streamer_test.go:4980: unexpected alert data for request: 4
                got {kapacitor/cpu/serverA kapacitor/cpu/serverA is OK details 1971-01-01 00:00:04 +0000 UTC 1s OK {0 [] [] <nil>} {{{   map[]}  map[]  {0 0 <nil>}} }}
                exp {kapacitor/cpu/serverA kapacitor/cpu/serverA is WARNING details 1971-01-01 00:00:05 +0000 UTC 0 WARNING {0 [] [] <nil>} {{{   map[]}  map[]  {0 0 <nil>}} }}
        streamer_test.go:4980: unexpected alert data for request: 5
                got {kapacitor/cpu/serverA kapacitor/cpu/serverA is WARNING details 1971-01-01 00:00:05 +0000 UTC 0 WARNING {0 [] [] <nil>} {{{   map[]}  map[]  {0 0 <nil>}} }}
                exp {kapacitor/cpu/serverA kapacitor/cpu/serverA is OK details 1971-01-01 00:00:08 +0000 UTC 3s OK {0 [] [] <nil>} {{{   map[]}  map[]  {0 0 <nil>}} }}
        streamer_test.go:4980: unexpected alert data for request: 6
                got {kapacitor/cpu/serverA kapacitor/cpu/serverA is OK details 1971-01-01 00:00:08 +0000 UTC 3s OK {0 [] [] <nil>} {{{   map[]}  map[]  {0 0 <nil>}} }}
                exp {   0001-01-01 00:00:00 +0000 UTC 0 OK {0 [] [] <nil>} {{{   map[]}  map[]  {0 0 <nil>}} }}
        streamer_test.go:5023: got 6 exp 5

nathanielc · 2016-08-04T16:06:33Z

alert.go

+		} else if !pass {
+			return currentLevel
+		} else {
+			if newLevel, err := a.findFirstMatchLevel(currentLevel-1, OKAlert, now, fields, tags); err == nil {


try this. The change is that if the reset expression does not exist then we need to check the rest of the normal expressions below or equal to our current level.

if higherLevel, err := a.findFirstMatchLevel(CritAlert, currentLevel-1, now, fields, tags); err == nil { return higherLevel } if rse := a.levelResets[currentLevel]; rse != nil { if pass, err := EvalPredicate(rse, a.lrScopePools[currentLevel], now, fields, tags); err != nil { a.logger.Printf("E! error evaluating reset expression for current level %v: %s", currentLevel, err) } else if !pass { return currentLevel } // fall through to check the rest of the levels. } if newLevel, err := a.findFirstMatchLevel(currentLevel, OKAlert, now, fields, tags); err == nil { return newLevel } return OKAlert

Also there is some error handling missing for the last bit.

This works like a charm. Thank you so much 💃

minhdanh · 2016-08-05T03:57:49Z

@nathanielc I've just applied the fixes and updated CHANGELOG.md. Is there anything else need to be done?

nathanielc · 2016-08-05T14:59:27Z

@minhdanh OK, I have re-reviewed the entire PR. I just have two small recommendations:

Can you add the example from the CHANGELOG here as well? That way it will part of the official docs.
The logic for the findFirstMatchLevel is a little confusing since it uses the error return to indicate that it didn't find a match. I would be fine with either: adding a bool return value to indicate a match was found, or returning the stop level passed in and then comparing against that in the caller. The error return should only be set if an unrecoverable error occurs.

Thanks a ton!

minhdanh · 2016-08-08T03:48:09Z

@nathanielc I changed the error type to bool type to indicate that if a match was found. The function findFirstMatchLevel doesn't return an unrecoverable error so I think this makes sense. As now no error returned so I don't do error handling in the callers (which makes the code a little shorter :D). What do you think?

nathanielc · 2016-08-08T15:15:46Z

@minhdanh LGTM, thanks for all the work!

minhdanh · 2016-08-08T15:29:16Z

My pleasure. I couldn't make it without your help. Thank you very much.

melor · 2016-08-08T17:17:10Z

Good stuff! Thanks @minhdanh and @nathanielc !

nathanielc reviewed Jul 22, 2016
View reviewed changes

minhdanh reviewed Aug 3, 2016
View reviewed changes

nathanielc reviewed Aug 4, 2016
View reviewed changes

Minh Danh added 13 commits August 5, 2016 10:31

Implement alert level reset

73473c1

Fix logic issue

66b486c

Fix syntax error

1b6442d

Rename variables

e9e2074

Add test TestStream_Alert_WithReset

67a7778

Update sample data

64994e8

Rename test and add new test case

a5f74c1

Implement currentLevel for determineLevel function

28290f2

Implement new way for evaluating reset expressions

eb1e9c6

Minor refactor with getting current level

c391f99

Check the rest of the levels when reset expression not exist

f254d50

Handle errors when checking for first match level

0b9bf42

Update CHANGELOG.md with alert reset expressions feature

006001c

minhdanh changed the title ~~[WIP] Resetting alert levels~~ Resetting alert levels Aug 5, 2016

Minh Danh added 2 commits August 8, 2016 10:34

Update docs with alert reset expresions example

9a3fbcd

Return a bool instead of error type to indicate if a match was found

7bdcde0

nathanielc merged commit 7bdcde0 into influxdata:master Aug 8, 2016

minhdanh deleted the feature/alert_level_reset branch August 9, 2016 02:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Resetting alert levels #740

Resetting alert levels #740

minhdanh commented Jul 22, 2016 •

edited

Loading

minhdanh commented Jul 22, 2016

nathanielc commented Jul 22, 2016

nathanielc Jul 22, 2016

minhdanh Jul 26, 2016

minhdanh commented Jul 29, 2016 •

edited

Loading

nathanielc commented Aug 1, 2016

minhdanh commented Aug 2, 2016

nathanielc commented Aug 2, 2016

minhdanh Aug 3, 2016

minhdanh commented Aug 4, 2016

nathanielc Aug 4, 2016 •

edited

Loading

minhdanh Aug 5, 2016

minhdanh commented Aug 5, 2016

nathanielc commented Aug 5, 2016

minhdanh commented Aug 8, 2016

nathanielc commented Aug 8, 2016

minhdanh commented Aug 8, 2016 •

edited

Loading

melor commented Aug 8, 2016

Resetting alert levels #740

Resetting alert levels #740

Conversation

minhdanh commented Jul 22, 2016 • edited Loading

minhdanh commented Jul 22, 2016

nathanielc commented Jul 22, 2016

nathanielc Jul 22, 2016

Choose a reason for hiding this comment

minhdanh Jul 26, 2016

Choose a reason for hiding this comment

minhdanh commented Jul 29, 2016 • edited Loading

nathanielc commented Aug 1, 2016

minhdanh commented Aug 2, 2016

nathanielc commented Aug 2, 2016

minhdanh Aug 3, 2016

Choose a reason for hiding this comment

minhdanh commented Aug 4, 2016

nathanielc Aug 4, 2016 • edited Loading

Choose a reason for hiding this comment

minhdanh Aug 5, 2016

Choose a reason for hiding this comment

minhdanh commented Aug 5, 2016

nathanielc commented Aug 5, 2016

minhdanh commented Aug 8, 2016

nathanielc commented Aug 8, 2016

minhdanh commented Aug 8, 2016 • edited Loading

melor commented Aug 8, 2016

minhdanh commented Jul 22, 2016 •

edited

Loading

minhdanh commented Jul 29, 2016 •

edited

Loading

nathanielc Aug 4, 2016 •

edited

Loading

minhdanh commented Aug 8, 2016 •

edited

Loading