
alerts: threshold values for lowering alert levels #614

Closed
melor opened this issue Jun 4, 2016 · 7 comments

melor commented Jun 4, 2016

Currently

An alert is generated at every alert level change, both when going up (info -> warn -> critical) and when coming down (critical -> warn -> info -> ok).

Problem

When monitoring, for example, free disk space, the value often hovers just above and below the threshold, causing a warn -> ok -> warn -> ok -> warn ... cycle. Flapping() percentages introduce delays and are not optimal for every purpose.

Problem example:

stream
    |from()
        .measurement('disk')
    |alert()
        // values 20.0, 19.99, 20.0, 19.99, ... each cause an alert level change
        .warn(lambda: "free" < 20)

Proposed solution

Provide optional, separate threshold functions for resetting a higher severity level to a lower one.

Example:

stream
    |from()
        .measurement('disk')
    |alert()
        .warn(lambda: "free" < 20)
        .warn_reset(lambda: "free" > 25)
        .crit(lambda: "free" < 10)
        .crit_reset(lambda: "free" > 15)

In this example, once the value goes below 20 and triggers the initial WARNING alert, it is allowed to fluctuate between 10 and 25 without the alert level changing from warning.
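
Viewed another way, each reset expression adds hysteresis to its level: a level is raised as soon as its trigger condition matches, but it is only lowered once the matching reset condition fires. A minimal Go sketch of that decision logic, using the thresholds from the example above (all names and types here are illustrative, not Kapacitor code):

    package main

    import "fmt"

    // Level is an illustrative alert severity, ordered from OK upward.
    type Level int

    const (
        OK Level = iota
        Warning
        Critical
    )

    // nextLevel applies the proposed hysteresis: a level is entered as soon as
    // its trigger threshold is crossed, but it is only left once the matching
    // reset threshold is crossed. Function name and shape are hypothetical.
    func nextLevel(current Level, free float64) Level {
        switch {
        case free < 10: // .crit(lambda: "free" < 10)
            return Critical
        case current == Critical && free <= 15: // .crit_reset(lambda: "free" > 15) not yet satisfied
            return Critical
        case free < 20: // .warn(lambda: "free" < 20)
            return Warning
        case current >= Warning && free <= 25: // .warn_reset(lambda: "free" > 25) not yet satisfied
            return Warning
        default:
            return OK
        }
    }

    func main() {
        level := OK
        for _, free := range []float64{30, 19.99, 20.0, 19.99, 26, 9, 14, 16} {
            level = nextLevel(level, free)
            fmt.Printf("free=%.2f -> level=%d\n", free, level)
        }
    }

Without the two reset cases, the 19.99/20.0 sequence toggles between Warning and OK on every point; with them, the level stays at Warning until the value climbs above 25.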

@minhdanh (Contributor)

I'm experiencing the same problem. This would be a very useful feature to implement, to avoid continuous alerts when the value hovers above and below the threshold.

@minhdanh (Contributor)

@rossmcdonald I would like to make a PR for this feature. Can you tell me generally where I should begin? Which files are involved?

melor (Author) commented Jul 21, 2016

Perhaps the syntax could simply be an optional extra argument to the alert level functions, e.g. .warn(ALERT_CRITERIA, [RESET_CRITERIA]).

Example:

stream
    |from()
        .measurement('disk')
    |alert()
        .warn(lambda: "free" < 20, lambda: "free" > 25)
        .crit(lambda: "free" < 10, lambda: "free" > 15)

@nathanielc (Contributor)

@melor I think using warn_reset is more explicit and cleaner to implement internally.

@minhdanh Thanks for stepping up.

There is a function determineLevel that is responsible for determining the alert level of a data point. It probably needs another arg for the current level...

It can be found here:
https://github.com/influxdata/kapacitor/blob/master/alert.go#L554

Then, in the pipeline/alert.go file, you will need to add fields for the *Reset versions on the Alert node:
https://github.com/influxdata/kapacitor/blob/master/pipeline/alert.go#L204

Finally, you will need to parse the ast.LambdaNode expressions into stateful expressions, which is done starting here for the normal expressions: https://github.com/influxdata/kapacitor/blob/master/alert.go#L313
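
For orientation, the shape of that change might look roughly like the sketch below. It is self-contained and uses plain Go function values as stand-ins for Kapacitor's compiled stateful expressions; the real signatures and field names in alert.go will differ:

    package main

    import "fmt"

    // levelExpr pairs a trigger condition with an optional reset condition.
    // Both are hypothetical stand-ins for compiled stateful expressions.
    type levelExpr struct {
        trigger func(value float64) bool // e.g. compiled from .warn(...)
        reset   func(value float64) bool // e.g. compiled from .warn_reset(...)
    }

    type level int

    const (
        okLevel level = iota
        infoLevel
        warnLevel
        critLevel
    )

    // determineLevel sketches how the existing function could take the current
    // level into account: walk levels from most to least severe, return the
    // first whose trigger matches, and hold any previously reached level whose
    // reset condition has not fired yet.
    func determineLevel(current level, value float64, exprs map[level]levelExpr) level {
        for l := critLevel; l > okLevel; l-- {
            le, ok := exprs[l]
            if !ok {
                continue
            }
            if le.trigger(value) {
                return l
            }
            if current >= l && le.reset != nil && !le.reset(value) {
                return l
            }
        }
        return okLevel
    }

    func main() {
        exprs := map[level]levelExpr{
            warnLevel: {
                trigger: func(v float64) bool { return v < 20 },
                reset:   func(v float64) bool { return v > 25 },
            },
            critLevel: {
                trigger: func(v float64) bool { return v < 10 },
                reset:   func(v float64) bool { return v > 15 },
            },
        }
        cur := okLevel
        for _, v := range []float64{19, 21, 26, 9, 12, 18} {
            cur = determineLevel(cur, v, exprs)
            fmt.Printf("value=%v -> level=%d\n", v, cur)
        }
    }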

@nathanielc (Contributor)

@minhdanh Also, I forgot to mention: you should probably base your changes off #732, since it makes a small change to that code and is about to be merged.

minhdanh (Contributor) commented Jul 22, 2016

Thank you for your detailed reply, @nathanielc.
I actually already made some changes to the files you mentioned above. Now I'm thinking about writing some tests for this feature. I'm going to submit a pull request, and it would be great if you could give me some advice on it.

@minhdanh minhdanh mentioned this issue Jul 22, 2016
@nathanielc (Contributor)

@minhdanh The tests can be found in the integrations/ package. Have a look first at TestStream_Alert. All tests basically replay data from a file found in integrations/data and then run a TICKscript against that data. In the case of alert tests, a test HTTP server is created and the alert is configured to post any events to it. You can then verify that the alert triggered the correct events. A good place to start would be to copy TestStream_Alert into a new test TestStream_Alert_WithReset, then copy the TestStream_Alert.srpl file and modify it to exercise the reset conditions.
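
As a standalone illustration of that verification mechanism (standard library only; the real tests use the helpers in the integrations/ package, and the payload fields here are assumptions, since Kapacitor's actual alert payload has more fields):

    package main

    import (
        "encoding/json"
        "fmt"
        "net/http"
        "net/http/httptest"
        "strings"
        "sync"
    )

    // alertEvent holds only the fields a test would typically assert on.
    type alertEvent struct {
        ID      string `json:"id"`
        Level   string `json:"level"`
        Message string `json:"message"`
    }

    func main() {
        var (
            mu     sync.Mutex
            events []alertEvent
        )
        // Test HTTP server that records every alert POSTed to it, so the test
        // can assert on the sequence of levels afterwards (e.g. no
        // WARNING -> OK -> WARNING flapping once reset expressions are in place).
        ts := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            var ev alertEvent
            if err := json.NewDecoder(r.Body).Decode(&ev); err != nil {
                http.Error(w, err.Error(), http.StatusBadRequest)
                return
            }
            mu.Lock()
            events = append(events, ev)
            mu.Unlock()
        }))
        defer ts.Close()

        // In the real integration test, ts.URL is wired into the TICKscript's
        // alert handler and the recorded .srpl data is replayed through
        // Kapacitor. Here one event is POSTed by hand to show the round trip.
        body := `{"id":"disk","level":"WARNING","message":"free below 20"}`
        resp, err := http.Post(ts.URL, "application/json", strings.NewReader(body))
        if err != nil {
            panic(err)
        }
        resp.Body.Close()

        mu.Lock()
        defer mu.Unlock()
        fmt.Printf("received %d event(s), first level: %s\n", len(events), events[0].Level)
    }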
