Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

deadman batch query firing at wrong times? #667

Closed
gargooza opened this issue Jun 22, 2016 · 1 comment
Closed

deadman batch query firing at wrong times? #667

gargooza opened this issue Jun 22, 2016 · 1 comment
Labels

Comments

@gargooza
Copy link

gargooza commented Jun 22, 2016

My deadman task seems to be firing at the wrong times and I can't make sense of the "emitted" statistic it is reporting.

I have a deadman task which I creatively called deadmanTask which looks like this (via show task):

ID: deadmanTask
Error:
Template:
Type: batch
Status: enabled
Executng: true
Created: 22 Jun 16 14:15 EDT
Modified: 22 Jun 16 15:03 EDT
LastEnabled: 22 Jun 16 15:03 EDT
Databases Retenton Policies: ["sq"."autogen"]
TICKscript:
var b = batch
    |query('''SELECT count("num") FROM "sq"."autogen"."cp"''')
        .period(2m)
        .every(2m)

b
    |deadman(1.0, 2m)
        .id('deadmanTask')
        .message('{{ .ID }} is {{ .Level }}: {{ index .Fields "emitted" }} (< 1) points seen in last 2m')
        .stateChangesOnly()
        .log('/tmp/deadmanTask.log')
        .mode(644)

DOT:
digraph deadmanTask {
graph [throughput="0.00 batches/s"];

query1 [avg_exec_time_ns="3.579178ms" batches_queried="5" connect_errors="0" points_queried="5" query_errors="0" ];
query1 -> noop3 [processed="5"];

noop3 [avg_exec_time_ns="0" ];

stats2 [avg_exec_time_ns="0" ];
stats2 -> derivative4 [processed="8"];

derivative4 [avg_exec_time_ns="0" ];
derivative4 -> alert5 [processed="7"];

alert5 [alerts_triggered="4" avg_exec_time_ns="2.888137ms" crits_triggered="2" infos_triggered="0" oks_triggered="2" warns_triggered="0" ];
}

First of all, this task will only trigger deadman alerts when I enable it to run live. Replays never fire deadman alerts even if I replay them over a range of time that fired alerts in real time. But that's another issue I can file later. Right now I just need everything to make sense and do the right thing. And so far the output does not seem to make sense nor does the deadman seem to do the right thing.

Here are logs for the four triggered alerts:

{"id":"deadmanTask","message":"deadmanTask is CRITICAL: 0.9999999976916667 (\u003c 1) points seen in last 2m","details":"{\u0026#34;Name\u0026#34;:\u0026#34;stats\u0026#34;,\u0026#34;TaskName\u0026#34;:\u0026#34;deadmanTask\u0026#34;,\u0026#34;Group\u0026#34;:\u0026#34;nil\u0026#34;,\u0026#34;Tags\u0026#34;:null,\u0026#34;ID\u0026#34;:\u0026#34;deadmanTa
sk\u0026#34;,\u0026#34;Fields\u0026#34;:{\u0026#34;emitted\u0026#34;:0.9999999976916667},\u0026#34;Level\u0026#34;:\u0026#34;CRITICAL\u0026#34;,\u0026#34;Time\u0026#34;:\u0026#34;2016-06-22T19:05:57.013555631Z\u0026#34;,\u0026#34;Message\u0026#34;:\u0026#34;deadmanTask is CRITICAL: 0.9999999976916667 (\\u003c 1) points seen in last 2m\u0026#34;}\n","time
":"2016-06-22T19:05:57.013555631Z","duration":0,"level":"CRITICAL","data":{"series":[{"name":"stats","columns":["time","emitted"],"values":[["2016-06-22T19:05:57.013555631Z",0.9999999976916667]]}]}}

{"id":"deadmanTask","message":"deadmanTask is OK: 1.0000000904916748 (\u003c 1) points seen in last 2m","details":"{\u0026#34;Name\u0026#34;:\u0026#34;stats\u0026#34;,\u0026#34;TaskName\u0026#34;:\u0026#34;deadmanTask\u0026#34;,\u0026#34;Group\u0026#34;:\u0026#34;nil\u0026#34;,\u0026#34;Tags\u0026#34;:null,\u0026#34;ID\u0026#34;:\u0026#34;deadmanTask\u00
26#34;,\u0026#34;Fields\u0026#34;:{\u0026#34;emitted\u0026#34;:1.0000000904916748},\u0026#34;Level\u0026#34;:\u0026#34;OK\u0026#34;,\u0026#34;Time\u0026#34;:\u0026#34;2016-06-22T19:07:57.013555908Z\u0026#34;,\u0026#34;Message\u0026#34;:\u0026#34;deadmanTask is OK: 1.0000000904916748 (\\u003c 1) points seen in last 2m\u0026#34;}\n","time":"2016-06-22T19:0
7:57.013555908Z","duration":120000000277,"level":"OK","data":{"series":[{"name":"stats","columns":["time","emitted"],"values":[["2016-06-22T19:07:57.013555908Z",1.0000000904916748]]}]}}

{"id":"deadmanTask","message":"deadmanTask is CRITICAL: 0 (\u003c 1) points seen in last 2m","details":"{\u0026#34;Name\u0026#34;:\u0026#34;stats\u0026#34;,\u0026#34;TaskName\u0026#34;:\u0026#34;deadmanTask\u0026#34;,\u0026#34;Group\u0026#34;:\u0026#34;nil\u0026#34;,\u0026#34;Tags\u0026#34;:null,\u0026#34;ID\u0026#34;:\u0026#34;deadmanTask\u0026#34;,\u00
26#34;Fields\u0026#34;:{\u0026#34;emitted\u0026#34;:0},\u0026#34;Level\u0026#34;:\u0026#34;CRITICAL\u0026#34;,\u0026#34;Time\u0026#34;:\u0026#34;2016-06-22T19:09:57.013545049Z\u0026#34;,\u0026#34;Message\u0026#34;:\u0026#34;deadmanTask is CRITICAL: 0 (\\u003c 1) points seen in last 2m\u0026#34;}\n","time":"2016-06-22T19:09:57.013545049Z","duration":0,"le
vel":"CRITICAL","data":{"series":[{"name":"stats","columns":["time","emitted"],"values":[["2016-06-22T19:09:57.013545049Z",0]]}]}}

{"id":"deadmanTask","message":"deadmanTask is OK: 1.0000000052666667 (\u003c 1) points seen in last 2m","details":"{\u0026#34;Name\u0026#34;:\u0026#34;stats\u0026#34;,\u0026#34;TaskName\u0026#34;:\u0026#34;deadmanTask\u0026#34;,\u0026#34;Group\u0026#34;:\u0026#34;nil\u0026#34;,\u0026#34;Tags\u0026#34;:null,\u0026#34;ID\u0026#34;:\u0026#34;deadmanTask\u00
26#34;,\u0026#34;Fields\u0026#34;:{\u0026#34;emitted\u0026#34;:1.0000000052666667},\u0026#34;Level\u0026#34;:\u0026#34;OK\u0026#34;,\u0026#34;Time\u0026#34;:\u0026#34;2016-06-22T19:17:57.013558697Z\u0026#34;,\u0026#34;Message\u0026#34;:\u0026#34;deadmanTask is OK: 1.0000000052666667 (\\u003c 1) points seen in last 2m\u0026#34;}\n","time":"2016-06-22T19:1
7:57.013558697Z","duration":480000013648,"level":"OK","data":{"series":[{"name":"stats","columns":["time","emitted"],"values":[["2016-06-22T19:17:57.013558697Z",1.0000000052666667]]}]}}

Note that the first alert was CRITICAL and that seems wrong. The deadman should only trigger if there are fewer than 1.0 data points in the most recent two minute window. Based on the query results below, there was exactly one data point in that two minute window. The deadman alert should not have been triggered.

Given that the alert was in a CRITICAL state -- even though I don't think it should have been -- the next OK alert seems correct given that there were three data points in that 2 minute window.

The last two alerts (CRITICAL and OK) seem correct. But in ALL cases I can't figure out why the "emitted" value is such a weird number (0.9999999976916667, 1.0000000904916748, 0, and 1.0000000052666667)

The four influx queries and their results which led to these alerts are pasted below:

> SELECT count(num) FROM sq.autogen.cp WHERE time >= '2016-06-22T19:03:57.013562119Z' AND time < '2016-06-22T19:05:57.013562119Z'
name: cp
------------------
time                count
2016-06-22T19:03:57.013562119Z  1

> SELECT count(num) FROM sq.autogen.cp WHERE time >= '2016-06-22T19:05:57.013561054Z' AND time < '2016-06-22T19:07:57.013561054Z'
name: cp
------------------
time                count
2016-06-22T19:05:57.013561054Z  3

> SELECT count(num) FROM sq.autogen.cp WHERE time >= '2016-06-22T19:07:57.013553713Z' AND time < '2016-06-22T19:09:57.013553713Z'
> SELECT count(num) FROM sq.autogen.cp WHERE time >= '2016-06-22T19:15:57.013566185Z' AND time < '2016-06-22T19:17:57.013566185Z'
name: cp
------------------
time                count
2016-06-22T19:15:57.013566185Z  1

>

Help!

@nathanielc
Copy link
Contributor

@gargooza

Replays never fire deadman alerts even if I replay them over a range of time that fired alerts in real time. But that's another issue I can file later.

To make this work the replay needs to use a real time clock. kapacitor replay -recording XXX -task XXX -real-clock

To address the issue of the odd numbers. The emitted field is really the derivative aka rate of points received. So if you have two points arrive at slightly less than 2m apart then the emitted value is slightly greater that 1. The converse is true, if the points arrive at slightly greater than 2m apart then the value will be less than 1. Since the stats are emitted on a sleep loop, it possible for the times on the points to vary by small amounts.

For now I would recommend using a slightly larger time window for the deadman than the expected rate so that you can avoid these edge cases. In your case using |deadman(1.0 3m) should prevent these edge cases.

But this is obviously not ideal. I think it makes sense to round the times from the stats points to the time interval specified so that the sleep loop delay doesn't effect the actual data timestamps.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants