should be able to pause/mute alerts with criteria #722

beckettsean · 2016-07-18T22:10:31Z

A good use case example is system monitoring on a server that's undergoing scheduled maintenance. I want the deadman alerts to keep firing for other servers, but not for the one I'm rebooting right now.

Obviously this needs to be doable without a restart, and ideally without pausing the entire TICKscript. Other hosts should continue to fire deadman alerts while host=host-a-1 is paused.

phemmer · 2016-07-28T03:12:32Z

Just spent the last few days (and several issues: #750, #752, #755, #756) trying to get a maintenance mode working. I've tried numerous things, and each gets very close to working, but is stopped by one tiny but significant gotcha.

Most of the scenarios have been some derivative of the following script:

var maintlock = stream|from().measurement('maintlock').groupBy('host')
var data = stream
    |from()
        .measurement('disk').groupBy('host','path')
    |join(maintlock)
        .as('disk','maintlock')
        .on('host')
        .tolerance(24h)
    |where(lambda: "maintlock.count" == 0)
data
    |alert()
        .crit(lambda: "disk.used_percent" >= 90)
data
    |alert()
        .warn(lambda: "disk.used_percent" >= 80 AND "disk.used_percent" < 90)

The idea is that the maintlock measurement contains a count field which indicates the number of maintenance locks on a host. 0 means that no locks are held and the host is not under maintenance.
In our setup, the maintlock data points are only inserted into InfluxDB when the count changes, so the last value could be days old. It would be possible, though, to store the last value on the host and feed it into Telegraf repeatedly.

The most pervasive issue with all the solutions I've tried is that the maintlock data points come in much less frequently than the other metrics, and the join essentially blocks until it has a data point from both the disk and maintlock measurements within the same tolerance window. But if you then end up with 2 maintlock data points within the same tolerance window, the disk data point gets duplicated, once for each of the maintlock data points. And if you try to use last() on the maintlock measurement, you end up 1 data point behind (last() keeps a buffer of 1 metric).
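To make the duplication concrete, here is a minimal Python sketch of a tolerance-bucketed join. It is a simplified model for illustration only, not Kapacitor's implementation: the function name `bucket_join`, the tuple layout, and the use of floor division to assign buckets are all assumptions.

```python
from collections import defaultdict

DAY = 24 * 3600  # tolerance in seconds, mirroring .tolerance(24h)


def bucket_join(disk_points, lock_points, tolerance):
    """Pair each disk point with every lock point in the same tolerance bucket.

    Points are (timestamp_seconds, value) tuples. Buckets are assigned by
    floor division, which is an assumption of this model.
    """
    locks_by_bucket = defaultdict(list)
    for t, count in lock_points:
        locks_by_bucket[t // tolerance].append(count)

    joined = []
    for t, used_percent in disk_points:
        # A disk point with no lock point in its bucket produces nothing,
        # which models the join "blocking" on the slower stream.
        for count in locks_by_bucket.get(t // tolerance, []):
            joined.append((t, used_percent, count))
    return joined


# Two maintlock points land in the same 24h bucket as one disk point.
locks = [(100, 1), (900, 0)]
disk = [(1000, 85.0)]
rows = bucket_join(disk, locks, DAY)

# A disk point in the next bucket has no maintlock partner at all.
unmatched = bucket_join([(DAY + 5, 85.0)], locks, DAY)
```

In this model the single disk point at t=1000 comes out twice, once with count 1 and once with count 0, so the downstream where() filter sees contradictory rows for the same measurement.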

Thus, in order for this to work properly, Kapacitor needs a way to get the single most recent maintlock value as of the time each disk data point comes in.
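The behaviour being asked for is essentially an "as-of" lookup. The following Python sketch shows the intended semantics under stated assumptions; it is not an existing Kapacitor feature, and the names `LockHistory`, `as_of`, and `filter_unlocked` are hypothetical. It also assumes that a host with no maintlock record is treated as not under maintenance.

```python
import bisect


class LockHistory:
    """Per-host history of maintlock counts, kept sorted by timestamp."""

    def __init__(self):
        self._times = {}   # host -> sorted list of timestamps
        self._counts = {}  # host -> counts, parallel to _times

    def add(self, host, t, count):
        times = self._times.setdefault(host, [])
        counts = self._counts.setdefault(host, [])
        i = bisect.bisect_right(times, t)
        times.insert(i, t)
        counts.insert(i, count)

    def as_of(self, host, t, default=0):
        """Return the latest count recorded at or before time t.

        default=0 (no lock held) is an assumption for hosts or times
        with no earlier maintlock record.
        """
        times = self._times.get(host, [])
        i = bisect.bisect_right(times, t)
        return self._counts[host][i - 1] if i else default


def filter_unlocked(disk_points, locks):
    """Keep each disk point at most once, only if its host is unlocked."""
    return [p for p in disk_points if locks.as_of(p["host"], p["time"]) == 0]


locks = LockHistory()
locks.add("host-a-1", 100, 1)  # maintenance starts
locks.add("host-a-1", 500, 0)  # maintenance ends

disk = [
    {"host": "host-a-1", "time": 200, "used_percent": 95.0},  # during maintenance
    {"host": "host-a-1", "time": 600, "used_percent": 95.0},  # after maintenance
    {"host": "host-b-1", "time": 200, "used_percent": 95.0},  # never locked
]
kept = filter_unlocked(disk, locks)
```

Each disk point is evaluated against exactly one maintlock value, so no matter how old or how closely spaced the maintlock points are, nothing is duplicated and nothing waits on the slower stream.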

avdhoot · 2017-02-17T10:04:28Z

+1

jcmcken · 2018-01-10T16:10:56Z

We're also looking for something like this. We've contemplated tagging every measurement with maintenance=false or maintenance=true and filtering in the TICKscript, but it seems really inelegant and wasteful.

It looks like Kapacitor 1.4 has sideloading capabilities, which could be useful for this purpose. Of course, if you're using Enterprise Kapacitor with multiple nodes, this is annoying to use and requires some jury-rigging. If sideloading could read from something like Redis or an HTTP endpoint, that would make this a much better approach.

nathanielc mentioned this issue Jul 27, 2016

Fix fill for join on and batches #756

Merged

3 tasks

nathanielc added the enhancement label Aug 24, 2016

nathanielc added this to the Unplanned milestone Aug 24, 2016

nathanielc added new-feature and removed enhancement labels Aug 31, 2016

jcmcken mentioned this issue Jan 10, 2018

Feature Request: New Sideloading Sources #1751

Open
