Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resetting alert levels #740

Merged
merged 15 commits into from
Aug 8, 2016
Merged

Resetting alert levels #740

merged 15 commits into from
Aug 8, 2016

Conversation

minhdanh
Copy link
Contributor

@minhdanh minhdanh commented Jul 22, 2016

In regard to this issue #614
This pull request should enable lowering alert levels using lamda expressions.
Only WARN and CRIT levels are supported, INFO level is not.

I'm still trying to come up with some tests for this.

  • Rebased/mergable
  • Tests pass
  • CHANGELOG.md updated
  • Sign CLA (if not already signed)

@minhdanh
Copy link
Contributor Author

This is what I think of, however I believe it needs more improvements. Will make some more changes in the next days : )

@nathanielc
Copy link
Contributor

@minhdanh Looks like a good start. FYI #732 has been merged so you can rebase your PR to see only your changes.

@@ -116,6 +116,8 @@ type AlertNode struct {
critsTriggered *expvar.Int

bufPool sync.Pool

level_resets []stateful.Expression
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Go is an opinionated language and prefers camelCase names. Can you change this to levelResets?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup. I've changed the variables to camelCase format. : )

@minhdanh
Copy link
Contributor Author

minhdanh commented Jul 29, 2016

@nathanielc I've just pushed a test with its sample data. Please have a look at them to see it's the expected behavior of this feature.
There's one thing confuses me, as you said:
Meaning the determineLevel would check for any reset expressions for the current level first and only evaluate other expressions if the reset expression is false.
Why would it only evaluate other expressions if the reset expression is false? I mean why not true?
For example I have the following rules in a TICK script:

.info(lambda: "value" > 40)
.infoReset(lambda: "value" < 10)
.warn(lambda: "value" > 60)
.warnReset(lambda: "value" < 20)
.crit(lambda: "value" > 80)
.critReset(lambda: "value" < 30)

Say first we have a data point with a value of 70, that will cause the alert level to be WARNING. Next if we have a data point with a value of 15, the reset expression of the WARNING level will be evaluated. It would return true (15 < 20 = true). Here we'll want the alert to escape the WARNING state so that the other expressions can be evaluated.
Or let's say we have a value of 40 instead of 15. It would cause the reset evaluation to return false (40 < 20 = false). At this point wouldn't we want the determineLevel function to return the current level which is WARNING?

Let's examine another data point with a value of 85 instead of 15 in the above case. Now the reset expression of the WARNING level would return false (85 < 20 = false). Here if we return the current level (WARNING), it will be incorrect as the level now should be CRITICAL, meaning the crit expression should be evaluated as well. So I think in case a reset expression returns false, we would check if the most highest level possible is greater than the current level. If it is, we return that level instead of the current level. If it's not, we just return the current level.

@nathanielc
Copy link
Contributor

@minhdanh Thanks for the detailed example. I think we are on the same page. Your example makes sense to me, (I probably inverted the logic earlier when I was talking about the true/false value of the reset expression)

To make sure we are on the same page here are a few more examples:

.info(lambda: "value" > 40)
.infoReset(lambda: "value" < 10)
.warn(lambda: "value" > 60)
.warnReset(lambda: "value" < 20)
.crit(lambda: "value" > 80)
.critReset(lambda: "value" < 30)

Given the above criteria the following data points create the following alert levels.

45 40 30 9
In In In OK
45 61 30 19
In Wn Wn OK
45 61 81 29
In Wn Cr OK

Now for a slightly different example

.info(lambda: "value" > 40)
.infoReset(lambda: "value" < 30)
.warn(lambda: "value" > 60)
.warnReset(lambda: "value" < 50)
.crit(lambda: "value" > 80)
.critReset(lambda: "value" < 70)
45 40 30 29
In In In OK
45 61 49 29
In Wn In OK
45 61 81 69 50 41 25
In Wn Cr Wn Wn In OK

This is how I would expect it to work.

@minhdanh
Copy link
Contributor Author

minhdanh commented Aug 2, 2016

@nathanielc I updated the test and its sample data with the values in your first example. I could make the test pass on my machine. Should I write another test for the latter example, or just add another tick script and call the same functions in TestStreamer_Alert_WithReset function?

@nathanielc
Copy link
Contributor

@minhdanh I'd prefer each test to test a single aspect of the behavior, so creating another test is fine.

I took a quick look at the tests and they look good.

@@ -4110,7 +4110,7 @@ stream
}
}

func TestStream_Alert_WithReset(t *testing.T) {
func TestStream_Alert_WithReset_0(t *testing.T) {
requestCount := int32(0)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nathanielc I'm not sure how to name two test functions of the same feature so I split them into TestStream_Alert_WithReset_0 and TestStream_Alert_WithReset_1. What do you think?

@minhdanh
Copy link
Contributor Author

minhdanh commented Aug 4, 2016

@nathanielc I've implemented your new way. All the tests pass except TestStream_AlertDuration.
I tried to track the problem down but I could not go further. It seems to be a problem with evaluating the normal expressions (f54b0a2#diff-d02710009bb4a30f396667696f963627R620)

Here is the output for that test:

[edge:task_master:testStreamer|TestStream_AlertDuration->stream] 2016/08/04 13:16:51 D! closing c: 12 e: 10
[edge:task_master:testStreamer|write_points->stream] 2016/08/04 13:16:51 D! closing c: 0 e: 0
[edge:TestStream_AlertDuration|stream->stream0] 2016/08/04 13:16:51 D! closing c: 12 e: 12
[edge:TestStream_AlertDuration|stream0->from1] 2016/08/04 13:16:51 D! closing c: 12 e: 1
[edge:TestStream_AlertDuration|from1->alert2] 2016/08/04 13:16:51 D! closing c: 12 e: 0
[TestStream_AlertDuration:alert2] 2016/08/04 13:16:51 D! CRITICAL alert triggered id:kapacitor/cpu/serverA msg:kapacitor/cpu/serverA is CRITICAL data:&{cpu map[host:serverA type:idle] [time value] [[1971-01-01 00:00:00 +0000 UTC 9]] <nil>}
[TestStream_AlertDuration:alert2] 2016/08/04 13:16:51 D! OK alert triggered id:kapacitor/cpu/serverA msg:kapacitor/cpu/serverA is OK data:&{cpu map[type:idle host:serverA] [time value] [[1971-01-01 00:00:02 +0000 UTC 8]] <nil>}
[TestStream_AlertDuration:alert2] 2016/08/04 13:16:51 D! WARNING alert triggered id:kapacitor/cpu/serverA msg:kapacitor/cpu/serverA is WARNING data:&{cpu map[host:serverA type:idle] [time value] [[1971-01-01 00:00:03 +0000 UTC 8]] <nil>}
[TestStream_AlertDuration:alert2] 2016/08/04 13:16:51 D! OK alert triggered id:kapacitor/cpu/serverA msg:kapacitor/cpu/serverA is OK data:&{cpu map[type:idle host:serverA] [time value] [[1971-01-01 00:00:04 +0000 UTC 6]] <nil>}
[TestStream_AlertDuration:alert2] 2016/08/04 13:16:51 D! WARNING alert triggered id:kapacitor/cpu/serverA msg:kapacitor/cpu/serverA is WARNING data:&{cpu map[host:serverA type:idle] [time value] [[1971-01-01 00:00:05 +0000 UTC 8]] <nil>}
[TestStream_AlertDuration:alert2] 2016/08/04 13:16:51 D! OK alert triggered id:kapacitor/cpu/serverA msg:kapacitor/cpu/serverA is OK data:&{cpu map[host:serverA type:idle] [time value] [[1971-01-01 00:00:08 +0000 UTC 3]] <nil>}
[edge:TestStream_AlertDuration|alert2->http_out3] 2016/08/04 13:16:51 D! closing c: 6 e: 5
[httpd] ::1 - - [04/Aug/2016:13:16:51 +0700] "GET /kapacitor/v1/tasks/TestStream_AlertDuration/TestStream_AlertDuration HTTP/1.1" 200 144 "-" "Go-http-client/1.1" 0880f51a-5a0b-11e6-8088-000000000000 1584
[task_master:testStreamer] 2016/08/04 13:16:51 I! Stopped task: TestStream_AlertDuration
[task_master:testStreamer] 2016/08/04 13:16:51 I! closed
--- FAIL: TestStream_AlertDuration (0.01s)
        streamer_test.go:6512: digraph TestStream_AlertDuration {
                stream0 -> from1;
                from1 -> alert2;
                alert2 -> http_out3;
                }
        streamer_test.go:4980: unexpected alert data for request: 2
                got {kapacitor/cpu/serverA kapacitor/cpu/serverA is OK details 1971-01-01 00:00:02 +0000 UTC 2s OK {0 [] [] <nil>} {{{   map[]}  map[]  {0 0 <nil>}} }}
                exp {kapacitor/cpu/serverA kapacitor/cpu/serverA is WARNING details 1971-01-01 00:00:02 +0000 UTC 2s WARNING {0 [] [] <nil>} {{{   map[]}  map[]  {0 0 <nil>}} }}
        streamer_test.go:4980: unexpected alert data for request: 3
                got {kapacitor/cpu/serverA kapacitor/cpu/serverA is WARNING details 1971-01-01 00:00:03 +0000 UTC 0 WARNING {0 [] [] <nil>} {{{   map[]}  map[]  {0 0 <nil>}} }}
                exp {kapacitor/cpu/serverA kapacitor/cpu/serverA is OK details 1971-01-01 00:00:04 +0000 UTC 4s OK {0 [] [] <nil>} {{{   map[]}  map[]  {0 0 <nil>}} }}
        streamer_test.go:4980: unexpected alert data for request: 4
                got {kapacitor/cpu/serverA kapacitor/cpu/serverA is OK details 1971-01-01 00:00:04 +0000 UTC 1s OK {0 [] [] <nil>} {{{   map[]}  map[]  {0 0 <nil>}} }}
                exp {kapacitor/cpu/serverA kapacitor/cpu/serverA is WARNING details 1971-01-01 00:00:05 +0000 UTC 0 WARNING {0 [] [] <nil>} {{{   map[]}  map[]  {0 0 <nil>}} }}
        streamer_test.go:4980: unexpected alert data for request: 5
                got {kapacitor/cpu/serverA kapacitor/cpu/serverA is WARNING details 1971-01-01 00:00:05 +0000 UTC 0 WARNING {0 [] [] <nil>} {{{   map[]}  map[]  {0 0 <nil>}} }}
                exp {kapacitor/cpu/serverA kapacitor/cpu/serverA is OK details 1971-01-01 00:00:08 +0000 UTC 3s OK {0 [] [] <nil>} {{{   map[]}  map[]  {0 0 <nil>}} }}
        streamer_test.go:4980: unexpected alert data for request: 6
                got {kapacitor/cpu/serverA kapacitor/cpu/serverA is OK details 1971-01-01 00:00:08 +0000 UTC 3s OK {0 [] [] <nil>} {{{   map[]}  map[]  {0 0 <nil>}} }}
                exp {   0001-01-01 00:00:00 +0000 UTC 0 OK {0 [] [] <nil>} {{{   map[]}  map[]  {0 0 <nil>}} }}
        streamer_test.go:5023: got 6 exp 5

} else if !pass {
return currentLevel
} else {
if newLevel, err := a.findFirstMatchLevel(currentLevel-1, OKAlert, now, fields, tags); err == nil {
Copy link
Contributor

@nathanielc nathanielc Aug 4, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try this. The change is that if the reset expression does not exist then we need to check the rest of the normal expressions below or equal to our current level.

    if higherLevel, err := a.findFirstMatchLevel(CritAlert, currentLevel-1, now, fields, tags); err == nil {
        return higherLevel
    }
    if rse := a.levelResets[currentLevel]; rse != nil {
        if pass, err := EvalPredicate(rse, a.lrScopePools[currentLevel], now, fields, tags); err != nil {
            a.logger.Printf("E! error evaluating reset expression for current level %v: %s", currentLevel, err)
        } else if !pass {
            return currentLevel
        }
        // fall through to check the rest of the levels.
    }
    if newLevel, err := a.findFirstMatchLevel(currentLevel, OKAlert, now, fields, tags); err == nil {
        return newLevel
    }
    return OKAlert

Also there is some error handling missing for the last bit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works like a charm. Thank you so much 💃

@minhdanh minhdanh changed the title [WIP] Resetting alert levels Resetting alert levels Aug 5, 2016
@minhdanh
Copy link
Contributor Author

minhdanh commented Aug 5, 2016

@nathanielc I've just applied the fixes and updated CHANGELOG.md. Is there anything else need to be done?

@nathanielc
Copy link
Contributor

@minhdanh OK, I have re-reviewed the entire PR. I just have two small recommendations:

  1. Can you add the example from the CHANGELOG here as well? That way it will part of the official docs.
  2. The logic for the findFirstMatchLevel is a little confusing since it uses the error return to indicate that it didn't find a match. I would be fine with either: adding a bool return value to indicate a match was found, or returning the stop level passed in and then comparing against that in the caller. The error return should only be set if an unrecoverable error occurs.

Thanks a ton!

@minhdanh
Copy link
Contributor Author

minhdanh commented Aug 8, 2016

@nathanielc I changed the error type to bool type to indicate that if a match was found. The function findFirstMatchLevel doesn't return an unrecoverable error so I think this makes sense. As now no error returned so I don't do error handling in the callers (which makes the code a little shorter :D). What do you think?

@nathanielc
Copy link
Contributor

@minhdanh LGTM, thanks for all the work!

@nathanielc nathanielc merged commit 7bdcde0 into influxdata:master Aug 8, 2016
@minhdanh
Copy link
Contributor Author

minhdanh commented Aug 8, 2016

My pleasure. I couldn't make it without your help. Thank you very much.

@melor
Copy link

melor commented Aug 8, 2016

Good stuff! Thanks @minhdanh and @nathanielc !

@minhdanh minhdanh deleted the feature/alert_level_reset branch August 9, 2016 02:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants