should be able to pause/mute alerts with criteria #722

Open
beckettsean opened this issue Jul 18, 2016 · 3 comments

@beckettsean
Contributor

A good use case example is system monitoring on a server that's undergoing scheduled maintenance. I want the deadman alerts to keep firing for other servers, but not for the one I'm rebooting right now.

Obviously this needs to be doable without a restart, and ideally without pausing the entire TICKscript. Other hosts should continue to fire deadman alerts while host=host-a-1 is paused.

@phemmer

phemmer commented Jul 28, 2016

Just spent the last few days (and several issues: #750, #752, #755, #756) trying to get a maintenance mode working. I've tried numerous things, and each gets very close to working, but is stopped by one tiny but significant gotcha.

Most of the scenarios have been some derivative of the following script:

var maintlock = stream
    |from()
        .measurement('maintlock')
        .groupBy('host')
var data = stream
    |from()
        .measurement('disk')
        .groupBy('host', 'path')
    |join(maintlock)
        .as('disk','maintlock')
        .on('host')
        .tolerance(24h)
    |where(lambda: "maintlock.count" == 0)
data
    |alert()
        .crit(lambda: "disk.used_percent" >= 90)
data
    |alert()
        .warn(lambda: "disk.used_percent" >= 80 AND "disk.used_percent" < 90)

The idea is that the maintlock measurement contains a count field which indicates the number of maintenance locks on a host. 0 means that no locks are held and the host is not under maintenance.
In our setup, the maintlock data points are only inserted into InfluxDB when the count changes, so the last value could be days old. It would be possible, though, to store the last value on the host and feed it into Telegraf over and over.

The most pervasive issue with all the solutions I've tried is that the maintlock data points come in much less frequently than the other metrics, and the join essentially blocks until it has a data point from both the disk and maintlock measurements within the same tolerance window. But if 2 maintlock data points fall within the same tolerance window, the disk data point gets duplicated, once for each maintlock data point. And if you try to use last() on the maintlock measurement, you end up 1 data point behind (last() keeps a buffer of 1 metric).
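
The duplication can be illustrated with a toy model in Python. This is only a sketch of the behaviour as described in this comment, not Kapacitor's actual join implementation; `window_join` and the sample timestamps are hypothetical names and values made up for illustration.

```python
from datetime import datetime, timedelta

def window_join(disk_points, lock_points, tolerance):
    """Toy model: pair each disk point with EVERY lock point within `tolerance`.

    Assumption: this mimics the duplication described above; it is not how
    Kapacitor implements join().
    """
    joined = []
    for d_time, used_percent in disk_points:
        for l_time, count in lock_points:
            if abs(d_time - l_time) <= tolerance:
                joined.append((d_time, used_percent, count))
    return joined

t0 = datetime(2016, 7, 28, 12, 0)
# One disk sample, taken after the lock was taken and then released.
disk = [(t0 + timedelta(hours=2), 91.0)]
# Two maintlock samples inside the same 24h tolerance: count 1, then count 0.
locks = [(t0, 1), (t0 + timedelta(hours=1), 0)]

rows = window_join(disk, locks, timedelta(hours=24))
for row in rows:
    print(row)
```

In this model the single disk sample comes out twice, once with the stale count of 1 and once with the current count of 0, so a downstream `count == 0` filter still lets one copy through based on which lock points happen to share the window.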

Thus, for this to work properly, Kapacitor needs a way to get the single most recent maintlock value as of the time the disk measurement came in.
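
What is being asked for here is essentially an "as-of" lookup. The following Python sketch shows the intended semantics only; it is not an existing Kapacitor feature, and `latest_at_or_before`, `should_alert`, the integer timestamps, and the default of 0 locks are all assumptions made for illustration.

```python
import bisect

def latest_at_or_before(times, values, t, default=None):
    """Return the value whose timestamp is the latest one <= t.

    `times` must be sorted ascending and aligned with `values`.
    """
    i = bisect.bisect_right(times, t)
    return values[i - 1] if i > 0 else default

# Hypothetical maintlock history for one host: (timestamp, lock count).
lock_times = [100, 200, 5000]
lock_counts = [1, 0, 2]

def should_alert(disk_time, used_percent, threshold=90):
    # Assumption: a host with no maintlock point yet is treated as unlocked.
    count = latest_at_or_before(lock_times, lock_counts, disk_time, default=0)
    return count == 0 and used_percent >= threshold

print(should_alert(150, 95))   # lock held at t=150
print(should_alert(300, 95))   # lock released at t=200
```

Each disk point is matched against exactly one maintlock value, no matter how old that value is or how many other maintlock points exist, which avoids both the blocking and the duplication.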

@avdhoot

avdhoot commented Feb 17, 2017

+1

@jcmcken

jcmcken commented Jan 10, 2018

We're also looking for something like this. We've contemplated tagging every measurement with maintenance=false or maintenance=true and filtering in the TICKscript, but it seems really inelegant and wasteful.

It looks like Kapacitor 1.4 has sideloading capabilities, which could be useful for this purpose. Of course, if you're using Enterprise Kapacitor with multiple nodes, this is annoying to use and requires some jury-rigging. If sideloading could read from something like Redis or an HTTP endpoint, that would make this a much better approach.
