Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change Alert Levels to be independent instead of subsets. #298

Closed
yosiat opened this issue Mar 9, 2016 · 9 comments
Closed

Change Alert Levels to be independent instead of subsets. #298

yosiat opened this issue Mar 9, 2016 · 9 comments
Milestone

Comments

@yosiat
Copy link
Contributor

yosiat commented Mar 9, 2016

Hi,

I have the next tick script:

stream
    .from().measurement('window_data')
    .alert()
      .id('WindowAlert')
      .info(lambda: "value" < 10)
      .warn(lambda: "value" > 10)
      .crit(lambda: "value" > 200)
      .log('/Volumes/Data/influx/alerts.log')

I am injecting data using this script:

require 'influxdb'

database  = 'kapacitor_example'
name      = 'window_data'
precision = 's'
retention = 'default'

influxb = InfluxDB::Client.new database, host: "localhost", port: 9092


time = Time.now.to_i
value = 0

points = []
10.times do
  value = value + 5
  time = time + 10

  data = {
    series: name,
    values: { value: value },
    timestamp: time
  }

  puts value
  points << data
end

puts "Inserting #{points.length} points"
influxb.write_points(points, precision, retention)

And I am getting two alerts triggered OK and INFO, and I expected WARN to jump as well.
I debugged alert source code and found some issues with code (and I can submit pull request):

  • The levels are examined from OK to Critical, which in my opinion is wrong, it should be from Critical to OK - For example: If we inject the value 250 only warning (>10) will jump and not critical (>200)
  • I saw that once some levels is passed this is level isn't returned, for example we have value of 250 and critical passed it we won't return Critical we will return Info (because this the last level - see my previous point)

I modified the code to be like this:

func (a *AlertNode) determineLevel(now time.Time, fields models.Fields, tags map[string]string) (level AlertLevel) {
    for i := len(a.levels) - 1; i >= 0; i-- {
        l := i
        se := a.levels[i]
        if se == nil {
            continue
        }

        if pass, err := EvalPredicate(se, now, fields, tags); pass {
            return AlertLevel(l)
        } else if err != nil {
            a.logger.Println("E! error evaluating expression:", err)
            return
        }
    }


    return OKAlert
}

@nathanielc What do you think? do you want PR?

@yosiat yosiat changed the title Alert is not triggered Wrong level of alert is determined Mar 9, 2016
@nathanielc
Copy link
Contributor

@yosiat
You have a value that exists on a number line and you have two break points that define three regions of the number line. So to define the three regions you can use:

stream
    .from().measurement('window_data')
    .alert()
      .id('WindowAlert')
      .info(lambda: TRUE)
      .warn(lambda: "value" > 10)
      .crit(lambda: "value" > 200)
      .log('/Volumes/Data/influx/alerts.log')

And the highest matching level will be chosen. The advantage of assuming that each level is a subset of the previous level is that you can break out early and not evaluate every condition.

From the docs It is assumed that each successive level filters a subset of the previous level. As a result, the filter will only be applied if a data point passed the previous level.

While it is possible for the problem to be computed in the other direction since expressions can be stateful the current behavior made the most since as each higher level will only be evaluated on the data set the passed the previous level.

@yosiat
Copy link
Contributor Author

yosiat commented Mar 9, 2016

@nathanielc
Ok, I understand but Kapacitor checks the alerts as a subset for each other (Info > Warn > Crit)
But if we are using separate conditions for each level than the only alert that triggered is info.

Let's take an example: we have stream of "number_of_exceptions" which increments when the application server has an alert while processing, and we want to check that the exceptions isn't above some threshold.

stream
    .from().measurement('number_of_exceptions')
    .alert()
      .id('NumberOfExceptions')
      .info(lambda: "value" < 1)
      .warn(lambda: "value" == 1)
      .crit(lambda: "value" > 1)
      .log('/Volumes/Data/influx/alerts.log')

For this case we will have info, in that case my alert levels aren't successive and I don't know how to create alert for this.

And of course there is the opposite case - I want to check if I have some data and I want to warn if it's 1 and send critical value is less than 1 ( => no data)

@yosiat
Copy link
Contributor Author

yosiat commented Mar 11, 2016

@nathanielc I finally solved this by separating the alert to multiple alerts like this:

var data = stream
    .from().measurement('number_of_exceptions')

data.alert().id('NumberOfExceptions').info(lambda: "value" < 1).log('/Volumes/Data/influx/alerts.log')
data.alert().id('NumberOfExceptions').warn(lambda: "value" == 1).log('/Volumes/Data/influx/alerts.log')
data.alert().id('NumberOfExceptions').crit(lambda: "value" > 1).log('/Volumes/Data/influx/alerts.log')

What do you think?

@nathanielc
Copy link
Contributor

Wouldn't this work too?

stream
    .from().measurement('number_of_exceptions')
    .alert()
      .id('NumberOfExceptions')
      .info(lambda: TRUE)
      .warn(lambda: "value" >= 1)
      .crit(lambda: "value" > 1)
      .log('/Volumes/Data/influx/alerts.log')

@yosiat
Copy link
Contributor Author

yosiat commented Mar 11, 2016

@nathanielc This would work, but my tick script are auto generated (converter from legacy system), so this is the best solution I have.

@nathanielc
Copy link
Contributor

Ok, I am thinking we might change this to be a mix of our implementations. Since it can be confusing that lower levels need to be a super set of higher levels what if we evaluate every level every time? That way its simple and you always know that each point will be evaluated for all levels. It will be slower but we are taking a max of four levels and simplicity is probably more important than a slight optimization. Thoughts?

@yosiat
Copy link
Contributor Author

yosiat commented Mar 11, 2016

@nathanielc I agree, I think we need first measure this - to know how much does it hearts on performance and then decide upon this, but I prefer simplicity over performance.

@nathanielc nathanielc modified the milestones: v0.14, v0.13 Apr 5, 2016
@nathanielc nathanielc changed the title Wrong level of alert is determined Change Alert Levels to be independent instead of subsets. Apr 5, 2016
@nathanielc nathanielc modified the milestones: v0.14, v0.13 May 5, 2016
@nathanielc
Copy link
Contributor

For now we are going to leave the behavior as is, since there hasn't been a strong push to change it. We can revisit later if need be.

@nathanielc
Copy link
Contributor

OK, I keep running into issues where the subset logic is annoying and requires that you work around it. I am going to reopen this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants