
Make creating an alert based on event count easy #137

Closed
nathanielc opened this issue Jan 7, 2016 · 8 comments

@nathanielc
Contributor

Currently, creating an alert on event count seems like it should be a simple task, for example:

stream().from()...
   .window()
       .period(10s)
       .every(10s)
   .mapReduce(influxql.count('value'))
   .alert()
      .crit(lambda: "count" == 0)

The above script will not work. The reason is that Kapacitor's notion of time is driven by the incoming data: once the event stream stops, so does time, and the rest of the pipeline is never executed. As a result the alert logic is never evaluated and no alert is ever triggered.

Possible solutions:

  • Use a separate task and the internal Kapacitor stats about throughput to achieve this.
    For example:

    stream()
    .from()
        .measurement('_kapacitor')
        .where(lambda: "task" == 'my_task' AND "parent" == 'stream0')
    .alert()
      .crit(lambda: "collected" == 0)
    

    This works, but requires two tasks: one to receive the data and another to check its throughput.

  • Add the ability to use real clock time on a window node, so that when real time passes it emits a batch even if the batch is empty or partial, e.g.:

stream().from()...
   .window()
       .period(10s)
       .every(10s)
        .realtime() // use real time so that even if the data stops, the empty/partial window is still emitted.
   .mapReduce(influxql.count('value'))
   .alert()
      .crit(lambda: "count" == 0)

Simple and optional. Using the real-time clock opens the door to races when processing the data, so it would generally not be recommended, but for the specific use case of counting it could work. To be clear, these are not Go data races but races in how Kapacitor processes the data: a data point can arrive too late, after its window has already been emitted, and Kapacitor has to drop it rather than process it.
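Concretely, the first option means maintaining two separate TICKscripts. A minimal sketch, where the task name my_task, the measurement name, and the file names are all illustrative:

```
// my_task.tick -- the primary task that receives and processes the data
stream
    .from()
        .measurement('events')
    // ... normal processing and alerting ...

// my_task_watchdog.tick -- a second task, defined separately, that alerts
// when my_task's stream node stops collecting points
stream
    .from()
        .measurement('_kapacitor')
        .where(lambda: "task" == 'my_task' AND "parent" == 'stream0')
    .alert()
        .crit(lambda: "collected" == 0)
```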

@nathanielc
Contributor Author

For option 2, the behavior would be to use the data flow to track time, but emit the window on a real-time interval. Data that arrives too late would need to be dropped since time has already passed, but the drops would be measurable and we could report the number of dropped points.

There might be some challenges with using realtime and replays but we should be able to work something out.

@nathanielc
Contributor Author

Another option could be to expose the internal throughput numbers directly to the running task.

Each node could have a stats() method that emits the internal stats of the node on a specified realtime interval. For example:

var src = stream.from().measurement('cpu')

// Process the cpu data, looking for spikes
src
   .window()
       .period(10s)
       .every(10s)
   .mapReduce(influxql.mean('value'))
   .alert()
      .crit(lambda: "mean" > 70)

// Also grab the stats from the 'src' node and alert on its throughput
src
    // emit the internal stats of the node every 10s of real time.
   .stats(10s)
   .alert()
       .crit(lambda: "collected" == 0)

This opens up more interesting applications, since it essentially allows you to metaprogram your tasks.
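For instance, the same mechanism could catch the opposite failure mode, a source that suddenly floods Kapacitor with points. A sketch with a purely hypothetical threshold:

```
// Alert if the 'src' node collects suspiciously many points in 10s,
// e.g. a runaway client flooding the stream.
src
    .stats(10s)
    .alert()
        .crit(lambda: "collected" > 100000)
```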

@nathanielc
Contributor Author

@pauldix Thoughts on some of the ideas for making a dead man's switch easy to build?

@rossmcdonald
Contributor

I personally like the .stats() method the most. It's simple to explain, and could enable some interesting alerting functionality (service discovery, receiving too many points, etc.).

Would stats() be affected by the same data race as realtime()?

@gunnaraasen
Contributor

I also like the stats() proposal. It feels more intuitive than realtime().

@nathanielc
Contributor Author

Yes, but in a different way that makes it harder to shoot yourself in the foot (i.e. no data points need be dropped).

For example, given that you are sending a data point to Kapacitor once per second, here are the two options:

Realtime:

stream.from()...
    .window()
      .period(10s)
      .every(10s)
      .realtime()
    .mapReduce(influxql.count('value'))
    .alert()
      .crit(lambda: "count" != 10)

With the realtime approach this alert is prone to many false positives if the data arrives slightly ahead or behind. If a data point arrives too late it needs to be dropped since time has moved on without it.

Here is the stats approach:

var src = stream.from()...
    .window()
      .period(10s)
      .every(10s)
    .mapReduce(influxql.count('value')) // this count will always be 10 since it is based on the data's time.

// This count is susceptible to the data race but since the normal stream is unaffected no data needs to be dropped.
src.stats(10s).alert().crit(lambda: "collected" != 10)

The stats alert in this case can still produce lots of false positives if the data arrives too early or too late, but since this is now a separate stream from the normal data, no data points need be dropped. This means it is possible for the times of the two streams to differ, but since you are using them separately this shouldn't be a problem. Now, if you try to join some of the stats data with data from the 'real' stream you could cause issues, but that is a different problem around timeout writes to streams.

Obviously, in practice you would compare against a threshold rather than an exact value, e.g.:

src.stats(10s).alert().crit(lambda: "collected" < 5)
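Since the alert node supports multiple levels, a sketch with both a warning and a critical threshold (the values here are illustrative) could be:

```
src
    .stats(10s)
    .alert()
        .warn(lambda: "collected" < 8)  // throughput degraded
        .crit(lambda: "collected" < 5)  // throughput severely degraded
```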

@pauldix
Member

pauldix commented Jan 7, 2016

+1 to the stats approach, although it feels super verbose for defining something as simple as a dead man's switch.

@nathanielc
Contributor Author

If all you want is a dead man's switch and you don't care about the data itself, it looks like this:

stream
  .stats(10s)
  .alert()
     .crit(lambda: "collected" == 0)

We could also add a special method for just the dead man's switch use case:

stream
  .deadmans(10s) // syntactic sugar to do the same as above

Then it could leverage the global alert config for how to handle the alert. This can always be added later if we think it adds enough value. For now I think I'll move forward with the stats approach.
