
Make creating an alert based on event count easy #137

Closed
nathanielc opened this issue Jan 7, 2016 · 8 comments

@nathanielc
Contributor

Currently, creating an alert on event count seems like it should be a simple task, for example:

stream().from()...
   .window()
       .period(10s)
       .every(10s)
   .mapReduce(influxql.count('value'))
   .alert()
      .crit(lambda: "count" == 0)

The above script will not work. The reason is that Kapacitor's notion of time is driven by the incoming data: once the event stream stops, so does time, and the rest of the pipeline is never executed. As a result the alert logic is never evaluated and no alert is ever triggered.

Possible solutions:

  • Use a separate task and the internal Kapacitor stats about throughput to achieve this.
    For example:

    stream()
    .from()
        .measurement('_kapacitor')
        .where(lambda: "task" == 'my_task' AND "parent" == 'stream0')
    .alert()
      .crit(lambda: "collected" == 0)
    

    This works, but requires two tasks: one to receive the data and another to check its throughput.

  • Add the ability to use real clock time on a window node, so that when real time passes it emits a batch even if the batch is empty or partial, e.g.:

stream().from()...
   .window()
       .period(10s)
       .every(10s)
        .realtime() // use real time so that even if the data stops, the empty/partial window is still emitted.
   .mapReduce(influxql.count('value'))
   .alert()
      .crit(lambda: "count" == 0)

Simple and optional. Using the real-time clock opens the door to races when processing the data, so it would generally not be recommended, but for the specific use case of counting it could work. To be clear, these are not Go data races but races in how Kapacitor processes the data: a data point can arrive too late, after its window has already been emitted, and Kapacitor has to drop it rather than process it.
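Concretely, the first option means maintaining two separate TICKscripts. A minimal sketch, where the task name my_task, the measurement name, and the file names are all illustrative:

```
// my_task.tick -- the primary task that receives and processes the data
stream
    .from()
        .measurement('events')
    // ... normal processing and alerting ...

// my_task_watchdog.tick -- a second task, defined separately, that alerts
// when my_task's stream node stops collecting points
stream
    .from()
        .measurement('_kapacitor')
        .where(lambda: "task" == 'my_task' AND "parent" == 'stream0')
    .alert()
        .crit(lambda: "collected" == 0)
```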

@nathanielc
Contributor Author

For option 2, the behavior would be to use the data flow to track time, but emit the window on a real-time interval. Data that arrives too late would need to be dropped since time has already passed, but the drops would be measurable and we could report the number of dropped points.

There might be some challenges with using realtime and replays but we should be able to work something out.

@nathanielc
Contributor Author

Another option could be to expose the internal throughput numbers directly to the running task.

Each node could have a stats() method that emits the internal stats of the node on a specified realtime interval. For example:

var src = stream.from().measurement('cpu')

// Process the cpu data, looking for spikes
src
   .window()
       .period(10s)
       .every(10s)
   .mapReduce(influxql.mean('value'))
   .alert()
      .crit(lambda: "mean" > 70)

// Also grab the stats from the 'src' node and alert on its throughput
src
    // emit the internal stats of the node every 10s of real time.
   .stats(10s)
   .alert()
       .crit(lambda: "collected" == 0)

This opens up more interesting applications, since it essentially allows you to metaprogram your tasks.
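For instance, the same mechanism could catch the opposite failure mode, a source that suddenly floods Kapacitor with points. A sketch with a purely hypothetical threshold:

```
// Alert if the 'src' node collects suspiciously many points in 10s,
// e.g. a runaway client flooding the stream.
src
    .stats(10s)
    .alert()
        .crit(lambda: "collected" > 100000)
```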

@nathanielc
Contributor Author

@pauldix Thoughts on some of the ideas for making a dead man's switch easy to build?

@rossmcdonald
Contributor

I personally like the .stats() method the most. It's simple to explain, and could enable some interesting alerting functionality (service discovery, receiving too many points, etc.).

Would stats() be affected by the same data race as realtime()?

@gunnaraasen
Contributor

I also like the stats() proposal. It feels more intuitive than realtime().

@nathanielc
Contributor Author

Yes, but in a different way that makes it harder to shoot yourself in the foot (i.e. no data points need be dropped).

For example, given that you are sending a data point to Kapacitor once per second, here are the two options:

Realtime:

stream.from()...
    .window()
      .period(10s)
      .every(10s)
      .realtime()
    .mapReduce(influxql.count('value'))
    .alert()
      .crit(lambda: "count" != 10)

With the realtime approach this alert is prone to many false positives if the data arrives slightly ahead or behind. If a data point arrives too late it needs to be dropped since time has moved on without it.

Here is the stats approach:

var src = stream.from()...
    .window()
      .period(10s)
      .every(10s)
    .mapReduce(influxql.count('value')) // this count will always be 10 since it is based on the data's time.

// This count is susceptible to the data race but since the normal stream is unaffected no data needs to be dropped.
src.stats(10s).alert().crit(lambda: "collected" != 10)

The stats alert in this case can still produce lots of false positives if the data arrives too early or too late, but since this is now a separate stream from the normal data, no data points need be dropped. This means it is possible for the times of the two streams to differ, but since you are using them separately this shouldn't be a problem. Now, if you try to join some of the stats data with data from the 'real' stream you could cause issues, but that is a different problem around timeout writes to streams.

Obviously, in practice you would compare against a threshold rather than an exact value, e.g.:

src.stats(10s).alert().crit(lambda: "collected" < 5)
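Since the alert node supports multiple levels, a sketch with both a warning and a critical threshold (the values here are illustrative) could be:

```
src
    .stats(10s)
    .alert()
        .warn(lambda: "collected" < 8)  // throughput degraded
        .crit(lambda: "collected" < 5)  // throughput severely degraded
```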

@pauldix
Member

pauldix commented Jan 7, 2016

+1 to the stats approach, although it feels super verbose for defining something as simple as a dead man's switch.

@nathanielc
Contributor Author

If all you want is a dead man's switch and you don't care about the data itself, it looks like this:

stream
  .stats(10s)
  .alert()
     .crit(lambda: "collected" == 0)

We could also add a special method for just the dead man's switch use case:

stream
  .deadmans(10s) // syntactic sugar to do the same as above

Then it could leverage the global alert config for how to handle the alert. This can always be added later if we think it adds enough value. For now I think I'll move forward with the stats approach.
