
alerts: threshold values for lowering alert levels #614

Closed
melor opened this issue Jun 4, 2016 · 7 comments

melor commented Jun 4, 2016

Currently

An alert is generated at every alert level change, both when going up (info -> warn -> critical) and when coming down (critical -> warn -> info -> ok).

Problem

When monitoring, for example, free disk space, the value often hovers just above and below the threshold, causing a warn -> ok -> warn -> ok -> warn ... cycle. Flapping() percentages introduce delays and are not optimal for every purpose.

Problem example:

stream
    |from()
        .measurement('disk')
    |alert()
        // values 20.0, 19.99, 20.0, 19.99, ... each cause an alert level change
        .warn(lambda: "free" < 20)

Proposed solution

Provide optional, separate threshold functions for resetting a higher severity level to a lower one.

Example:

stream
    |from()
        .measurement('disk')
    |alert()
        .warn(lambda: "free" < 20)
        .warn_reset(lambda: "free" > 25)
        .crit(lambda: "free" < 10)
        .crit_reset(lambda: "free" > 15)

In this example, once the value goes below 20 and triggers the initial WARNING alert, it is allowed to fluctuate between 10 and 25 without the alert level changing from warning.
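
Viewed another way, each reset expression adds hysteresis to its level: a level is raised as soon as its trigger condition matches, but it is only lowered once the matching reset condition fires. A minimal Go sketch of that decision logic, using the thresholds from the example above (all names and types here are illustrative, not Kapacitor code):

    package main

    import "fmt"

    // Level is an illustrative alert severity, ordered from OK upward.
    type Level int

    const (
        OK Level = iota
        Warning
        Critical
    )

    // nextLevel applies the proposed hysteresis: a level is entered as soon as
    // its trigger threshold is crossed, but it is only left once the matching
    // reset threshold is crossed. Function name and shape are hypothetical.
    func nextLevel(current Level, free float64) Level {
        switch {
        case free < 10: // .crit(lambda: "free" < 10)
            return Critical
        case current == Critical && free <= 15: // .crit_reset(lambda: "free" > 15) not yet satisfied
            return Critical
        case free < 20: // .warn(lambda: "free" < 20)
            return Warning
        case current >= Warning && free <= 25: // .warn_reset(lambda: "free" > 25) not yet satisfied
            return Warning
        default:
            return OK
        }
    }

    func main() {
        level := OK
        for _, free := range []float64{30, 19.99, 20.0, 19.99, 26, 9, 14, 16} {
            level = nextLevel(level, free)
            fmt.Printf("free=%.2f -> level=%d\n", free, level)
        }
    }

Without the two reset cases, the 19.99/20.0 sequence toggles between Warning and OK on every point; with them, the level stays at Warning until the value climbs above 25.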

@minhdanh (Contributor)

I'm experiencing the same problem. This would be a very useful feature to implement, to avoid continuous alerts when the value hovers above and below the threshold.

@minhdanh (Contributor)

@rossmcdonald I would like to make a PR for this feature. Can you tell me generally where I should begin? Which files are involved?

melor (Author) commented Jul 21, 2016

Perhaps the syntax could simply be an optional extra argument to the alert level functions, e.g. .warn(ALERT_CRITERIA, [RESET_CRITERIA]).

Example:

stream
    |from()
        .measurement('disk')
    |alert()
        .warn(lambda: "free" < 20, lambda: "free" > 25)
        .crit(lambda: "free" < 10, lambda: "free" > 15)

@nathanielc (Contributor)

@melor I think using warn_reset is more explicit and cleaner to implement internally.

@minhdanh Thanks for stepping up.

There is a function determineLevel that is responsible for determining the alert level of a data point. It probably needs another arg for the current level...

It can be found here:
https://github.com/influxdata/kapacitor/blob/master/alert.go#L554

Then, in the pipeline/alert.go file, you will need to add fields for the *Reset versions on the Alert node:
https://github.com/influxdata/kapacitor/blob/master/pipeline/alert.go#L204

Finally, you will need to parse the ast.LambdaNode expressions into stateful expressions, which is done starting here for the normal expressions: https://github.com/influxdata/kapacitor/blob/master/alert.go#L313
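
For orientation, the shape of that change might look roughly like the sketch below. It is self-contained and uses plain Go function values as stand-ins for Kapacitor's compiled stateful expressions; the real signatures and field names in alert.go will differ:

    package main

    import "fmt"

    // levelExpr pairs a trigger condition with an optional reset condition.
    // Both are hypothetical stand-ins for compiled stateful expressions.
    type levelExpr struct {
        trigger func(value float64) bool // e.g. compiled from .warn(...)
        reset   func(value float64) bool // e.g. compiled from .warn_reset(...)
    }

    type level int

    const (
        okLevel level = iota
        infoLevel
        warnLevel
        critLevel
    )

    // determineLevel sketches how the existing function could take the current
    // level into account: walk levels from most to least severe, return the
    // first whose trigger matches, and hold any previously reached level whose
    // reset condition has not fired yet.
    func determineLevel(current level, value float64, exprs map[level]levelExpr) level {
        for l := critLevel; l > okLevel; l-- {
            le, ok := exprs[l]
            if !ok {
                continue
            }
            if le.trigger(value) {
                return l
            }
            if current >= l && le.reset != nil && !le.reset(value) {
                return l
            }
        }
        return okLevel
    }

    func main() {
        exprs := map[level]levelExpr{
            warnLevel: {
                trigger: func(v float64) bool { return v < 20 },
                reset:   func(v float64) bool { return v > 25 },
            },
            critLevel: {
                trigger: func(v float64) bool { return v < 10 },
                reset:   func(v float64) bool { return v > 15 },
            },
        }
        cur := okLevel
        for _, v := range []float64{19, 21, 26, 9, 12, 18} {
            cur = determineLevel(cur, v, exprs)
            fmt.Printf("value=%v -> level=%d\n", v, cur)
        }
    }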

@nathanielc (Contributor)

@minhdanh Also, I forgot to mention: you should probably base your changes off #732, since it makes a small change to that code and is about to be merged.

minhdanh (Contributor) commented Jul 22, 2016

Thank you for your detailed reply, @nathanielc.
I actually already made some changes to the files you mentioned above. Now I'm thinking about writing some tests for this feature. I'm going to submit a pull request, and it would be great if you could give me some advice on it.

@minhdanh minhdanh mentioned this issue Jul 22, 2016
@nathanielc (Contributor)

@minhdanh The tests can be found in the integrations/ package. Have a look first at TestStream_Alert. All tests basically replay data from a file found in integrations/data and then run a TICKscript against that data. In the case of alert tests, a test HTTP server is created and the alert is configured to post any events to it. You can then verify that the alert triggered the correct events. A good place to start would be to copy TestStream_Alert into a new test TestStream_Alert_WithReset, then copy the TestStream_Alert.srpl file and modify it to exercise the reset conditions.
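
As a standalone illustration of that verification mechanism (standard library only; the real tests use the helpers in the integrations/ package, and the payload fields here are assumptions, since Kapacitor's actual alert payload has more fields):

    package main

    import (
        "encoding/json"
        "fmt"
        "net/http"
        "net/http/httptest"
        "strings"
        "sync"
    )

    // alertEvent holds only the fields a test would typically assert on.
    type alertEvent struct {
        ID      string `json:"id"`
        Level   string `json:"level"`
        Message string `json:"message"`
    }

    func main() {
        var (
            mu     sync.Mutex
            events []alertEvent
        )
        // Test HTTP server that records every alert POSTed to it, so the test
        // can assert on the sequence of levels afterwards (e.g. no
        // WARNING -> OK -> WARNING flapping once reset expressions are in place).
        ts := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            var ev alertEvent
            if err := json.NewDecoder(r.Body).Decode(&ev); err != nil {
                http.Error(w, err.Error(), http.StatusBadRequest)
                return
            }
            mu.Lock()
            events = append(events, ev)
            mu.Unlock()
        }))
        defer ts.Close()

        // In the real integration test, ts.URL is wired into the TICKscript's
        // alert handler and the recorded .srpl data is replayed through
        // Kapacitor. Here one event is POSTed by hand to show the round trip.
        body := `{"id":"disk","level":"WARNING","message":"free below 20"}`
        resp, err := http.Post(ts.URL, "application/json", strings.NewReader(body))
        if err != nil {
            panic(err)
        }
        resp.Body.Close()

        mu.Lock()
        defer mu.Unlock()
        fmt.Printf("received %d event(s), first level: %s\n", len(events), events[0].Level)
    }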
