kapacitor kept restarting with multiple stateDuration (kapacitor 1.3.0~beta2) #1369

Closed
wshi5985 opened this issue May 10, 2017 · 12 comments

wshi5985 commented May 10, 2017

When we use one stateDuration node in the data stream it works fine, but if we check multiple events' stateDuration on the same data stream, Kapacitor keeps restarting.

$ grep "Kapacitor starting" /logs/kapacitor/kapacitor.log |tail
[run] 2017/05/09 13:05:24 I! Kapacitor starting, version 1.3.0~beta2, branch master, commit 46de6ebb7b517b6127e5122396552148e1b8e2f8
[run] 2017/05/09 13:17:25 I! Kapacitor starting, version 1.3.0~beta2, branch master, commit 46de6ebb7b517b6127e5122396552148e1b8e2f8
[run] 2017/05/09 13:32:26 I! Kapacitor starting, version 1.3.0~beta2, branch master, commit 46de6ebb7b517b6127e5122396552148e1b8e2f8
[run] 2017/05/09 13:57:27 I! Kapacitor starting, version 1.3.0~beta2, branch master, commit 46de6ebb7b517b6127e5122396552148e1b8e2f8
[run] 2017/05/09 14:16:22 I! Kapacitor starting, version 1.3.0~beta2, branch master, commit 46de6ebb7b517b6127e5122396552148e1b8e2f8

var response_data = stream
    |from()
        .database('test')
        .measurement(measurement)
        .groupBy('domain','monitor_name')
    |window()
        .period(5m)
        .every(1m)

response_data
    |stateDuration(lambda: int("result_status") != 1 )
        .unit(1m)
        .as('resultstatus_events_duration')
    |stateDuration(lambda: int("httpcode_status") != 1 )
        .unit(1m)
        .as('httpcode_events_duration')
    |stateDuration(lambda: float("response_time") > float("responsetime_threshold") )
        .unit(1m)
        .as('responsetime_events_duration')

phemmer commented May 10, 2017

That's really odd that it just exits without displaying any error. If you run it by hand, does it display anything different? Does dmesg show anything about the kernel killing the process?

nathanielc commented May 10, 2017

@wshi5985 I think I have been able to reproduce the issue. Running the task above with about 1000 writes per second, I can see a steady increase in Kapacitor's RAM usage. My guess is that the Linux OOM killer is killing the Kapacitor process.

I am looking into what causes the increased RAM usage. Can you confirm that you are also seeing a steady increase in RAM usage while the task is running?

nathanielc commented May 10, 2017

I may have spoken too soon. After letting the test run for several hours, the RAM usage has leveled out and is constant. It took longer than expected to reach steady state, but it has.

@wshi5985 Maybe there isn't enough RAM available for it to reach steady state? Or this could be totally unrelated to a RAM issue. Any more details you can provide would be much appreciated.

wshi5985 commented May 10, 2017

I checked the memory usage: we have 4G of memory and only half of it is used. I don't see an obvious increase in memory usage.

When it crashed every 15-20 min, the config was a little different. I changed to the config above, and now it crashes every few hours.

This was the config when it crashed every 15-20 min:
var response_data = stream
    |from()
        .database('iacpl')
        .measurement(measurement)
        .groupBy('domain','monitor_name')
    |window()
        .period(5m)
        .every(1m)

response_data
    |stateDuration(lambda: int("result_status") != 1 )
        .unit(1m)
        .as('resultstatus_events_duration')
    |httpOut('resultstatus_events_duration')
    |alert()
        .crit(lambda: "resultstatus_events_duration" > 2 )
        .alerta()
            .message('vip {{ index .Tags "domain" }} monitor {{ index .Tags "monitor_name" }} http result error, result matched: {{ index .Fields "result_matched" }} result error: {{ index .Fields "result_error" }}' + string(arp))
            .event('vip {{ index .Tags "domain" }} monitor {{ index .Tags "monitor_name" }} http result error')
            .resource(resource)
            .value('ERROR')

response_data
    |stateDuration(lambda: int("httpcode_status") != 1 )
        .unit(1m)
        .as('httpcode_events_duration')
    |httpOut('httpcode_events_duration')
    |alert()
        .crit(lambda: "httpcode_events_duration" > 2 )
        .alerta()
            .message('vip {{ index .Tags "domain" }} monitor {{ index .Tags "monitor_name" }} http code error, http code: {{ index .Fields "http_code" }} http error: {{ index .Fields "http_error" }}' + string(arp))
            .event('vip {{ index .Tags "domain" }} monitor {{ index .Tags "monitor_name" }} http code error')
            .resource(resource)
            .value('ERROR')

response_data
    |stateDuration(lambda: float("response_time") > float("responsetime_threshold") )
        .unit(1m)
        .as('responsetime_events_duration')
    |httpOut('responsetime_events_duration')
    |alert()
        .crit(lambda: "responsetime_events_duration" > 3 )
        .alerta()
            .message('vip {{ index .Tags "domain" }} monitor {{ index .Tags "monitor_name" }} response time high {{ index .Fields "response_time" }}s > threshold {{ index .Fields "responsetime_threshold" }}s' + string(arp))
            .event('vip {{ index .Tags "domain" }} monitor {{ index .Tags "monitor_name" }} response time high')
            .resource(resource)
            .value('ERROR')

@wshi5985

And this is the config that crashes every few hours:

var response_data = stream
    |from()
        .database('iacpl')
        .measurement(measurement)
        .groupBy('domain','monitor_name')
    |window()
        .period(5m)
        .every(1m)

response_data
    |stateDuration(lambda: int("result_status") != 1 )
        .unit(1m)
        .as('resultstatus_events_duration')
    |stateDuration(lambda: int("httpcode_status") != 1 )
        .unit(1m)
        .as('httpcode_events_duration')
    |stateDuration(lambda: float("response_time") > float("responsetime_threshold") )
        .unit(1m)
        .as('responsetime_events_duration')
    |httpOut('response_data')

response_data
    |alert()
        .crit(lambda: "resultstatus_events_duration" > 2 )
        .alerta()
            .message('vip {{ index .Tags "domain" }} monitor {{ index .Tags "monitor_name" }} http result error, result matched: {{ index .Fields "result_matched" }} result error: {{ index .Fields "result_error" }}' + string(arp))
            .event('vip {{ index .Tags "domain" }} monitor {{ index .Tags "monitor_name" }} http result error')
            .resource(resource)
            .value('ERROR')

response_data
    |alert()
        .crit(lambda: "httpcode_events_duration" > 2 )
        .alerta()
            .message('vip {{ index .Tags "domain" }} monitor {{ index .Tags "monitor_name" }} http code error, http code: {{ index .Fields "http_code" }} http error: {{ index .Fields "http_error" }}' + string(arp))
            .event('vip {{ index .Tags "domain" }} monitor {{ index .Tags "monitor_name" }} http code error')
            .resource(resource)
            .value('ERROR')

response_data
    |alert()
        .crit(lambda: "responsetime_events_duration" > 2 )
        .alerta()
            .message('vip {{ index .Tags "domain" }} monitor {{ index .Tags "monitor_name" }} response time high {{ index .Fields "response_time" }}s > threshold {{ index .Fields "responsetime_threshold" }}s' + string(arp))
            .event('vip {{ index .Tags "domain" }} monitor {{ index .Tags "monitor_name" }} response time high')
            .resource(resource)
            .value('ERROR')

@wshi5985

The only errors I see in the logs are a lot of lines like this:

[vip_test:alert9] 2017/05/10 14:03:28 E! error evaluating expression for level CRITICAL: no field or tag exists for responsetime_events_duration
[vip_test:alert9] 2017/05/10 14:03:28 E! error evaluating expression for level CRITICAL: no field or tag exists for responsetime_events_duration
[vip_test:alert9] 2017/05/10 14:03:28 E! error evaluating expression for level CRITICAL: no field or tag exists for responsetime_events_duration

[vip_test:alert8] 2017/05/09 22:58:49 E! error evaluating expression for level CRITICAL: no field or tag exists for httpcode_events_duration
[vip_test:alert8] 2017/05/09 22:58:49 E! error evaluating expression for level CRITICAL: no field or tag exists for httpcode_events_duration
[vip_test:alert8] 2017/05/09 22:58:49 E! error evaluating expression for level CRITICAL: no field or tag exists for httpcode_events_duration

[vip_test:alert7] 2017/05/09 22:40:48 E! error evaluating expression for level CRITICAL: no field or tag exists for resultstatus_events_duration
[vip_test:alert7] 2017/05/09 22:40:48 E! error evaluating expression for level CRITICAL: no field or tag exists for resultstatus_events_duration
[vip_test:alert7] 2017/05/09 22:40:48 E! error evaluating expression for level CRITICAL: no field or tag exists for resultstatus_events_duration
[vip_test:alert7] 2017/05/09 22:40:48 E! error evaluating expression for level CRITICAL: no field or tag exists for resultstatus_events_duration

@wshi5985

Any update on this issue? Thanks.
Kapacitor on our testing server is still crashing every few hours.

[run] 2017/05/14 19:49:42 I! Kapacitor starting, version 1.3.0~beta2, branch master, commit 46de6eb
[run] 2017/05/14 19:59:42 I! Kapacitor starting, version 1.3.0~beta2, branch master, commit 46de6eb
[run] 2017/05/14 23:25:50 I! Kapacitor starting, version 1.3.0~beta2, branch master, commit 46de6eb
[run] 2017/05/14 23:32:50 I! Kapacitor starting, version 1.3.0~beta2, branch master, commit 46de6eb
[run] 2017/05/15 03:56:00 I! Kapacitor starting, version 1.3.0~beta2, branch master, commit 46de6eb
[run] 2017/05/15 07:17:09 I! Kapacitor starting, version 1.3.0~beta2, branch master, commit 46de6eb

dsalbert commented May 16, 2017

Trying to reproduce this issue on the current code base (v1.3.0-rc2), after a couple of minutes I got a crash, fatal error: concurrent map writes, in state_tracking.go (a minimal sketch of this failure class follows the trace below).

Version: compiled from master branch (last commit: 74fc18b)
go version go1.8.1 linux/amd64

Configuration that I've used for this:

var response_data = stream
    |from()
        .database('db')
        .measurement('monitor')
        .groupBy('domain','monitor_name')
    |window()
        .period(1m)
        .every(30s)

response_data
    |stateDuration(lambda: int("result_status") != 1 )
        .unit(10s)
        .as('resultstatus_events_duration')

response_data
    |stateDuration(lambda: int("httpcode_status") != 1 )
        .unit(10s)
        .as('httpcode_events_duration')

response_data
    |stateDuration(lambda: float("response_time") > float("responsetime_threshold") )
        .unit(10s)
        .as('responsetime_events_duration')

kapacitord output:

fatal error: concurrent map writes

goroutine 482 [running]:
runtime.throw(0x207ddfe, 0x15)
	/home/dankan/Work/golang/go/src/runtime/panic.go:596 +0x95 fp=0xc42117db28 sp=0xc42117db08
runtime.mapassign(0x1dbd700, 0xc42123d920, 0xc4200d9c70, 0x0)
	/home/dankan/Work/golang/go/src/runtime/hashmap.go:499 +0x667 fp=0xc42117dbc8 sp=0xc42117db28
github.com/influxdata/kapacitor.(*StateTrackingNode).runStateTracking(0xc4200d9b80, 0x0, 0x0, 0x0, 0xc4205ca778, 0xc4205bbcc0)
	/home/dankan/src/github.com/influxdata/kapacitor/state_tracking.go:130 +0xcd2 fp=0xc42117df38 sp=0xc42117dbc8
github.com/influxdata/kapacitor.(*StateTrackingNode).(github.com/influxdata/kapacitor.runStateTracking)-fm(0x0, 0x0, 0x0, 0xc4205ca7a0, 0xc420f8e230)
	/home/dankan/src/github.com/influxdata/kapacitor/state_tracking.go:179 +0x48 fp=0xc42117df78 sp=0xc42117df38
github.com/influxdata/kapacitor.(*node).start.func1(0xc4200d9b80, 0x0, 0x0, 0x0)
	/home/dankan/src/github.com/influxdata/kapacitor/node.go:140 +0x8e fp=0xc42117dfc0 sp=0xc42117df78
runtime.goexit()
	/home/dankan/Work/golang/go/src/runtime/asm_amd64.s:2197 +0x1 fp=0xc42117dfc8 sp=0xc42117dfc0
created by github.com/influxdata/kapacitor.(*node).start
	/home/dankan/src/github.com/influxdata/kapacitor/node.go:141 +0x5d

goroutine 1 [chan receive, 9 minutes]:
main.(*Main).Run(0xc420885f40, 0xc420010250, 0x2, 0x2, 0x40651c, 0xc42006e058)
	/home/dankan/src/github.com/influxdata/kapacitor/cmd/kapacitord/main.go:96 +0x6ce
main.main()
	/home/dankan/src/github.com/influxdata/kapacitor/cmd/kapacitord/main.go:41 +0x1f1

goroutine 17 [syscall, 9 minutes, locked to thread]:
runtime.goexit()
	/home/dankan/Work/golang/go/src/runtime/asm_amd64.s:2197 +0x1

goroutine 5 [syscall, 9 minutes]:
os/signal.signal_recv(0x0)
	/home/dankan/Work/golang/go/src/runtime/sigqueue.go:116 +0x104
os/signal.loop()
	/home/dankan/Work/golang/go/src/os/signal/signal_unix.go:22 +0x22
created by os/signal.init.1
	/home/dankan/Work/golang/go/src/os/signal/signal_unix.go:28 +0x41

goroutine 50 [chan receive]:
github.com/influxdata/kapacitor/vendor/github.com/golang/glog.(*loggingT).flushDaemon(0x2f84640)
	/home/dankan/src/github.com/influxdata/kapacitor/vendor/github.com/golang/glog/glog.go:882 +0x7a
created by github.com/influxdata/kapacitor/vendor/github.com/golang/glog.init.1
	/home/dankan/src/github.com/influxdata/kapacitor/vendor/github.com/golang/glog/glog.go:410 +0x21d

goroutine 68 [runnable]:
github.com/influxdata/kapacitor.(*Edge).NextPoint(0xc42019a100, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/home/dankan/src/github.com/influxdata/kapacitor/edge.go:156 +0x242
github.com/influxdata/kapacitor.(*TaskMaster).runForking(0xc4204d6b40, 0xc42019a100)
	/home/dankan/src/github.com/influxdata/kapacitor/task_master.go:575 +0xda
github.com/influxdata/kapacitor.(*TaskMaster).stream.func1(0xc4204d6b40, 0xc42019a100)
	/home/dankan/src/github.com/influxdata/kapacitor/task_master.go:569 +0x64
created by github.com/influxdata/kapacitor.(*TaskMaster).stream
	/home/dankan/src/github.com/influxdata/kapacitor/task_master.go:570 +0x1b7

goroutine 69 [chan receive]:
github.com/influxdata/kapacitor/services/influxdb.(*influxdbCluster).watchSubs.func1(0xc420015d80, 0xc42028c1e0)
	/home/dankan/src/github.com/influxdata/kapacitor/services/influxdb/service.go:657 +0x67
created by github.com/influxdata/kapacitor/services/influxdb.(*influxdbCluster).watchSubs
	/home/dankan/src/github.com/influxdata/kapacitor/services/influxdb/service.go:660 +0xad

goroutine 486 [chan receive, 7 minutes]:
github.com/influxdata/kapacitor.(*node).Wait(0xc4200d9e00, 0x0, 0x0)
	/home/dankan/src/github.com/influxdata/kapacitor/node.go:162 +0xe5
github.com/influxdata/kapacitor.(*ExecutingTask).Wait.func1(0x2f36d20, 0xc4200d9e00, 0x0, 0x0)
	/home/dankan/src/github.com/influxdata/kapacitor/task.go:316 +0x31
github.com/influxdata/kapacitor.(*ExecutingTask).rwalk(0xc42010ee10, 0x20ecb28, 0x100010000, 0xc4206ab7c8)
	/home/dankan/src/github.com/influxdata/kapacitor/task.go:147 +0x64
github.com/influxdata/kapacitor.(*ExecutingTask).Wait(0xc42010ee10, 0x30424395, 0x2f842c0)
	/home/dankan/src/github.com/influxdata/kapacitor/task.go:317 +0x37
github.com/influxdata/kapacitor/services/task_store.(*Service).startTask.func1(0xc42010ee10, 0xc4203bac80, 0xc4204d6b40, 0xc4207fd8c0)
	/home/dankan/src/github.com/influxdata/kapacitor/services/task_store/service.go:1875 +0x40
created by github.com/influxdata/kapacitor/services/task_store.(*Service).startTask
	/home/dankan/src/github.com/influxdata/kapacitor/services/task_store/service.go:1889 +0x18e

goroutine 138 [select]:
github.com/influxdata/kapacitor/services/smtp.(*Service).runMailer(0xc420278fa0)
	/home/dankan/src/github.com/influxdata/kapacitor/services/smtp/service.go:138 +0x7b9
github.com/influxdata/kapacitor/services/smtp.(*Service).Open.func1(0xc420278fa0)
	/home/dankan/src/github.com/influxdata/kapacitor/services/smtp/service.go:53 +0x57
created by github.com/influxdata/kapacitor/services/smtp.(*Service).Open
	/home/dankan/src/github.com/influxdata/kapacitor/services/smtp/service.go:54 +0x174

goroutine 383 [select]:
github.com/influxdata/kapacitor.(*Edge).NextPoint(0xc420773d80, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/home/dankan/src/github.com/influxdata/kapacitor/edge.go:156 +0x242
github.com/influxdata/kapacitor.(*FromNode).runStream(0xc421611760, 0x0, 0x0, 0x0, 0xc4204b3f78, 0xc42079a380)
	/home/dankan/src/github.com/influxdata/kapacitor/stream.go:81 +0x392
github.com/influxdata/kapacitor.(*FromNode).(github.com/influxdata/kapacitor.runStream)-fm(0x0, 0x0, 0x0, 0xc4204b3fa0, 0x10)
	/home/dankan/src/github.com/influxdata/kapacitor/stream.go:61 +0x48
github.com/influxdata/kapacitor.(*node).start.func1(0xc421611760, 0x0, 0x0, 0x0)
	/home/dankan/src/github.com/influxdata/kapacitor/node.go:140 +0x8e
created by github.com/influxdata/kapacitor.(*node).start
	/home/dankan/src/github.com/influxdata/kapacitor/node.go:141 +0x5d

goroutine 127 [IO wait]:
net.runtime_pollWait(0x7ff06607e9e0, 0x72, 0x9)
	/home/dankan/Work/golang/go/src/runtime/netpoll.go:164 +0x59
net.(*pollDesc).wait(0xc42035ad18, 0x72, 0x2f12dc0, 0x2f057c8)
	/home/dankan/Work/golang/go/src/net/fd_poll_runtime.go:75 +0x38
net.(*pollDesc).waitRead(0xc42035ad18, 0xc420438000, 0x1000)
	/home/dankan/Work/golang/go/src/net/fd_poll_runtime.go:80 +0x34
net.(*netFD).Read(0xc42035acb0, 0xc420438000, 0x1000, 0x1000, 0x0, 0x2f12dc0, 0x2f057c8)
	/home/dankan/Work/golang/go/src/net/fd_unix.go:250 +0x1b7
net.(*conn).Read(0xc420f0a1e8, 0xc420438000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/home/dankan/Work/golang/go/src/net/net.go:181 +0x70
net/http.(*connReader).Read(0xc42029dc00, 0xc420438000, 0x1000, 0x1000, 0x0, 0xc420707bb0, 0x5c8a48)
	/home/dankan/Work/golang/go/src/net/http/server.go:754 +0x140
bufio.(*Reader).fill(0xc42078a060)
	/home/dankan/Work/golang/go/src/bufio/bufio.go:97 +0x117
bufio.(*Reader).Peek(0xc42078a060, 0x4, 0x0, 0x0, 0x0, 0x0, 0xc420707c10)
	/home/dankan/Work/golang/go/src/bufio/bufio.go:129 +0x67
net/http.(*conn).readRequest(0xc420160640, 0x2f24680, 0xc42029dbc0, 0x0, 0x0, 0x0)
	/home/dankan/Work/golang/go/src/net/http/server.go:931 +0xe91
net/http.(*conn).serve(0xc420160640, 0x2f24680, 0xc42029dbc0)
	/home/dankan/Work/golang/go/src/net/http/server.go:1763 +0x49a
created by net/http.(*Server).Serve
	/home/dankan/Work/golang/go/src/net/http/server.go:2668 +0x2ce

goroutine 382 [select]:
github.com/influxdata/kapacitor.(*Edge).NextPoint(0xc4216bc080, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/home/dankan/src/github.com/influxdata/kapacitor/edge.go:156 +0x242
github.com/influxdata/kapacitor.(*StreamNode).runSourceStream(0xc42001a5a0, 0x0, 0x0, 0x0, 0xc4206a8f78, 0xc42026a000)
	/home/dankan/src/github.com/influxdata/kapacitor/stream.go:29 +0x190
github.com/influxdata/kapacitor.(*StreamNode).(github.com/influxdata/kapacitor.runSourceStream)-fm(0x0, 0x0, 0x0, 0xc4206a8fa0, 0xc4200d6536)
	/home/dankan/src/github.com/influxdata/kapacitor/stream.go:24 +0x48
github.com/influxdata/kapacitor.(*node).start.func1(0xc42001a5a0, 0x0, 0x0, 0x0)
	/home/dankan/src/github.com/influxdata/kapacitor/node.go:140 +0x8e
created by github.com/influxdata/kapacitor.(*node).start
	/home/dankan/src/github.com/influxdata/kapacitor/node.go:141 +0x5d

goroutine 384 [runnable]:
github.com/influxdata/kapacitor.(*Edge).CollectBatch(0xc4216bc000, 0xc4212d25a0, 0x14, 0xed0acf01f, 0x0, 0x0, 0xc421180910, 0x49, 0x0, 0xc421650d50, ...)
	/home/dankan/src/github.com/influxdata/kapacitor/edge.go:193 +0x1ff
github.com/influxdata/kapacitor.(*WindowNode).runWindow(0xc42001aa50, 0x0, 0x0, 0x0, 0xc420027f78, 0xc42029c180)
	/home/dankan/src/github.com/influxdata/kapacitor/window.go:95 +0x478
github.com/influxdata/kapacitor.(*WindowNode).(github.com/influxdata/kapacitor.runWindow)-fm(0x0, 0x0, 0x0, 0xc420027fa0, 0xc42014de00)
	/home/dankan/src/github.com/influxdata/kapacitor/window.go:26 +0x48
github.com/influxdata/kapacitor.(*node).start.func1(0xc42001aa50, 0x0, 0x0, 0x0)
	/home/dankan/src/github.com/influxdata/kapacitor/node.go:140 +0x8e
created by github.com/influxdata/kapacitor.(*node).start
	/home/dankan/src/github.com/influxdata/kapacitor/node.go:141 +0x5d

goroutine 385 [runnable]:
github.com/influxdata/kapacitor.(*Edge).NextBatch(0xc420773e80, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/home/dankan/src/github.com/influxdata/kapacitor/edge.go:168 +0x264
github.com/influxdata/kapacitor.(*StateTrackingNode).runStateTracking(0xc4200d83c0, 0x0, 0x0, 0x0, 0xc420027778, 0xc420477380)
	/home/dankan/src/github.com/influxdata/kapacitor/state_tracking.go:109 +0xe42
github.com/influxdata/kapacitor.(*StateTrackingNode).(github.com/influxdata/kapacitor.runStateTracking)-fm(0x0, 0x0, 0x0, 0xc4200277a0, 0xc420e4c300)
	/home/dankan/src/github.com/influxdata/kapacitor/state_tracking.go:179 +0x48
github.com/influxdata/kapacitor.(*node).start.func1(0xc4200d83c0, 0x0, 0x0, 0x0)
	/home/dankan/src/github.com/influxdata/kapacitor/node.go:140 +0x8e
created by github.com/influxdata/kapacitor.(*node).start
	/home/dankan/src/github.com/influxdata/kapacitor/node.go:141 +0x5d

goroutine 177 [select]:
github.com/influxdata/kapacitor.(*Edge).NextPoint(0xc42074f180, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/home/dankan/src/github.com/influxdata/kapacitor/edge.go:156 +0x242
github.com/influxdata/kapacitor.(*TaskMaster).runForking(0xc4204d6b40, 0xc42074f180)
	/home/dankan/src/github.com/influxdata/kapacitor/task_master.go:575 +0xda
github.com/influxdata/kapacitor.(*TaskMaster).stream.func1(0xc4204d6b40, 0xc42074f180)
	/home/dankan/src/github.com/influxdata/kapacitor/task_master.go:569 +0x64
created by github.com/influxdata/kapacitor.(*TaskMaster).stream
	/home/dankan/src/github.com/influxdata/kapacitor/task_master.go:570 +0x1b7

goroutine 178 [select]:
github.com/influxdata/kapacitor/services/stats.(*Service).sendStats(0xc4201dfb00)
	/home/dankan/src/github.com/influxdata/kapacitor/services/stats/service.go:106 +0x176
created by github.com/influxdata/kapacitor/services/stats.(*Service).Open
	/home/dankan/src/github.com/influxdata/kapacitor/services/stats/service.go:82 +0x17a

goroutine 179 [select, 9 minutes]:
github.com/influxdata/kapacitor/services/scraper.(*Service).scrape(0xc4201dfb90)
	/home/dankan/src/github.com/influxdata/kapacitor/services/scraper/service.go:119 +0x1f8
created by github.com/influxdata/kapacitor/services/scraper.(*Service).Open
	/home/dankan/src/github.com/influxdata/kapacitor/services/scraper/service.go:71 +0x121

goroutine 181 [select]:
github.com/influxdata/kapacitor/services/httpd.(*Service).manage(0xc4200a5810)
	/home/dankan/src/github.com/influxdata/kapacitor/services/httpd/service.go:187 +0x6f8
created by github.com/influxdata/kapacitor/services/httpd.(*Service).Open
	/home/dankan/src/github.com/influxdata/kapacitor/services/httpd/service.go:122 +0x747

goroutine 182 [IO wait]:
net.runtime_pollWait(0x7ff06607eb60, 0x72, 0x2f12dc0)
	/home/dankan/Work/golang/go/src/runtime/netpoll.go:164 +0x59
net.(*pollDesc).wait(0xc42035b258, 0x72, 0x2f057c8, 0xc420f4a040)
	/home/dankan/Work/golang/go/src/net/fd_poll_runtime.go:75 +0x38
net.(*pollDesc).waitRead(0xc42035b258, 0xffffffffffffffff, 0x0)
	/home/dankan/Work/golang/go/src/net/fd_poll_runtime.go:80 +0x34
net.(*netFD).accept(0xc42035b1f0, 0x0, 0x2f0ffc0, 0xc420f4a040)
	/home/dankan/Work/golang/go/src/net/fd_unix.go:430 +0x1e5
net.(*TCPListener).accept(0xc4206c3138, 0xc42085de60, 0x6d692e, 0x457e10)
	/home/dankan/Work/golang/go/src/net/tcpsock_posix.go:136 +0x2e
net.(*TCPListener).Accept(0xc4206c3138, 0x20f3370, 0xc420210be0, 0x2f24740, 0xc4203e2210)
	/home/dankan/Work/golang/go/src/net/tcpsock.go:228 +0x49
net/http.(*Server).Serve(0xc4201e68f0, 0x2f22440, 0xc4206c3138, 0x0, 0x0)
	/home/dankan/Work/golang/go/src/net/http/server.go:2643 +0x228
github.com/influxdata/kapacitor/services/httpd.(*Service).serve(0xc4200a5810)
	/home/dankan/src/github.com/influxdata/kapacitor/services/httpd/service.go:241 +0x88
created by github.com/influxdata/kapacitor/services/httpd.(*Service).Open
	/home/dankan/src/github.com/influxdata/kapacitor/services/httpd/service.go:125 +0x78a

goroutine 183 [chan receive, 9 minutes]:
github.com/influxdata/kapacitor/server.(*Server).watchServices(0xc4204eb200)
	/home/dankan/src/github.com/influxdata/kapacitor/server/server.go:837 +0x64
created by github.com/influxdata/kapacitor/server.(*Server).Open
	/home/dankan/src/github.com/influxdata/kapacitor/server/server.go:796 +0xe7

goroutine 184 [chan receive, 9 minutes]:
github.com/influxdata/kapacitor/server.(*Server).watchConfigUpdates(0xc4204eb200)
	/home/dankan/src/github.com/influxdata/kapacitor/server/server.go:843 +0x9f
created by github.com/influxdata/kapacitor/server.(*Server).Open
	/home/dankan/src/github.com/influxdata/kapacitor/server/server.go:797 +0x109

goroutine 185 [select, 9 minutes]:
github.com/influxdata/kapacitor/cmd/kapacitord/run.(*Command).monitorServerErrors(0xc42010fcb0)
	/home/dankan/src/github.com/influxdata/kapacitor/cmd/kapacitord/run/command.go:153 +0x1ee
created by github.com/influxdata/kapacitor/cmd/kapacitord/run.(*Command).Run
	/home/dankan/src/github.com/influxdata/kapacitor/cmd/kapacitord/run/command.go:133 +0xd22

goroutine 186 [select, 9 minutes, locked to thread]:
runtime.gopark(0x20f39f0, 0x0, 0x20620eb, 0x6, 0x18, 0x2)
	/home/dankan/Work/golang/go/src/runtime/proc.go:271 +0x13a
runtime.selectgoImpl(0xc42003cf50, 0x0, 0x18)
	/home/dankan/Work/golang/go/src/runtime/select.go:423 +0x1364
runtime.selectgo(0xc42003cf50)
	/home/dankan/Work/golang/go/src/runtime/select.go:238 +0x1c
runtime.ensureSigM.func1()
	/home/dankan/Work/golang/go/src/runtime/signal_unix.go:434 +0x2dd
runtime.goexit()
	/home/dankan/Work/golang/go/src/runtime/asm_amd64.s:2197 +0x1

goroutine 1890 [IO wait]:
net.runtime_pollWait(0x7ff06607eaa0, 0x72, 0x8)
	/home/dankan/Work/golang/go/src/runtime/netpoll.go:164 +0x59
net.(*pollDesc).wait(0xc4201912c8, 0x72, 0x2f12dc0, 0x2f057c8)
	/home/dankan/Work/golang/go/src/net/fd_poll_runtime.go:75 +0x38
net.(*pollDesc).waitRead(0xc4201912c8, 0xc420f1abd1, 0x1)
	/home/dankan/Work/golang/go/src/net/fd_poll_runtime.go:80 +0x34
net.(*netFD).Read(0xc420191260, 0xc420f1abd1, 0x1, 0x1, 0x0, 0x2f12dc0, 0x2f057c8)
	/home/dankan/Work/golang/go/src/net/fd_unix.go:250 +0x1b7
net.(*conn).Read(0xc420781c08, 0xc420f1abd1, 0x1, 0x1, 0x0, 0x0, 0x0)
	/home/dankan/Work/golang/go/src/net/net.go:181 +0x70
net/http.(*connReader).backgroundRead(0xc420f1abc0)
	/home/dankan/Work/golang/go/src/net/http/server.go:656 +0x58
created by net/http.(*connReader).startBackgroundRead
	/home/dankan/Work/golang/go/src/net/http/server.go:652 +0xdf

goroutine 89 [runnable]:
github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.(*point).unmarshalBinary(0xc4212d0d80, 0xc210)
	/home/dankan/src/github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models/points.go:1497 +0x2b8
github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.(*point).Fields(0xc4212d0d80, 0x4)
	/home/dankan/src/github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models/points.go:1365 +0x34
github.com/influxdata/kapacitor.(*TaskMaster).WritePoints(0xc4204d6b40, 0xc4212e601c, 0x5, 0xc4212e6032, 0x7, 0x3, 0xc420076a00, 0x64, 0x65, 0x0, ...)
	/home/dankan/src/github.com/influxdata/kapacitor/task_master.go:658 +0x1b6
github.com/influxdata/kapacitor/services/httpd.(*Handler).serveWriteLine(0xc42019a180, 0x2f22880, 0xc42162c8c0, 0xc42000ae00, 0xc420e58000, 0xabbb, 0xfe00, 0x206736c, 0xa, 0x1, ...)
	/home/dankan/src/github.com/influxdata/kapacitor/services/httpd/handler.go:493 +0x3c1
github.com/influxdata/kapacitor/services/httpd.(*Handler).serveWrite(0xc42019a180, 0x2f22880, 0xc42162c8c0, 0xc42000ae00, 0x206736c, 0xa, 0x1, 0x2fa4258, 0x0, 0x0, ...)
	/home/dankan/src/github.com/influxdata/kapacitor/services/httpd/handler.go:452 +0x3e0
github.com/influxdata/kapacitor/services/httpd.(*Handler).(github.com/influxdata/kapacitor/services/httpd.serveWrite)-fm(0x2f22880, 0xc42162c8c0, 0xc42000ae00, 0x206736c, 0xa, 0x1, 0x2fa4258, 0x0, 0x0, 0xc42047b860)
	/home/dankan/src/github.com/influxdata/kapacitor/services/httpd/handler.go:169 +0x75
github.com/influxdata/kapacitor/services/httpd.authorizeForward.func1(0x2f22880, 0xc42162c8c0, 0xc42000ae00, 0x206736c, 0xa, 0x1, 0x2fa4258, 0x0, 0x0, 0xc42047b860)
	/home/dankan/src/github.com/influxdata/kapacitor/services/httpd/handler.go:725 +0x116
github.com/influxdata/kapacitor/services/httpd.authenticate.func1(0x2f22880, 0xc42162c8c0, 0xc42000ae00)
	/home/dankan/src/github.com/influxdata/kapacitor/services/httpd/handler.go:579 +0xa07
net/http.HandlerFunc.ServeHTTP(0xc420408460, 0x2f22880, 0xc42162c8c0, 0xc42000ae00)
	/home/dankan/Work/golang/go/src/net/http/server.go:1942 +0x44
github.com/influxdata/kapacitor/services/httpd.jsonContent.func1(0x2f22880, 0xc42162c8c0, 0xc42000ae00)
	/home/dankan/src/github.com/influxdata/kapacitor/services/httpd/handler.go:818 +0xb1
net/http.HandlerFunc.ServeHTTP(0xc420408480, 0x2f22880, 0xc42162c8c0, 0xc42000ae00)
	/home/dankan/Work/golang/go/src/net/http/server.go:1942 +0x44
github.com/influxdata/kapacitor/services/httpd.gzipFilter.func1(0x2f1a580, 0xc42162c560, 0xc42000ae00)
	/home/dankan/src/github.com/influxdata/kapacitor/services/httpd/handler.go:811 +0x1da
net/http.HandlerFunc.ServeHTTP(0xc4204084c0, 0x2f1a580, 0xc42162c560, 0xc42000ae00)
	/home/dankan/Work/golang/go/src/net/http/server.go:1942 +0x44
github.com/influxdata/kapacitor/services/httpd.versionHeader.func1(0x2f1a580, 0xc42162c560, 0xc42000ae00)
	/home/dankan/src/github.com/influxdata/kapacitor/services/httpd/handler.go:827 +0xbc
net/http.HandlerFunc.ServeHTTP(0xc4204084e0, 0x2f1a580, 0xc42162c560, 0xc42000ae00)
	/home/dankan/Work/golang/go/src/net/http/server.go:1942 +0x44
github.com/influxdata/kapacitor/services/httpd.cors.func1(0x2f1a580, 0xc42162c560, 0xc42000ae00)
	/home/dankan/src/github.com/influxdata/kapacitor/services/httpd/handler.go:860 +0xee
net/http.HandlerFunc.ServeHTTP(0xc420408520, 0x2f1a580, 0xc42162c560, 0xc42000ae00)
	/home/dankan/Work/golang/go/src/net/http/server.go:1942 +0x44
github.com/influxdata/kapacitor/services/httpd.requestID.func1(0x2f1a580, 0xc42162c560, 0xc42000ae00)
	/home/dankan/src/github.com/influxdata/kapacitor/services/httpd/handler.go:870 +0x138
net/http.HandlerFunc.ServeHTTP(0xc420408540, 0x2f1a580, 0xc42162c560, 0xc42000ae00)
	/home/dankan/Work/golang/go/src/net/http/server.go:1942 +0x44
github.com/influxdata/kapacitor/services/httpd.recovery.func1(0x2f22640, 0xc4200e6620, 0xc42000ae00)
	/home/dankan/src/github.com/influxdata/kapacitor/services/httpd/handler.go:887 +0xfe
net/http.HandlerFunc.ServeHTTP(0xc420408560, 0x2f22640, 0xc4200e6620, 0xc42000ae00)
	/home/dankan/Work/golang/go/src/net/http/server.go:1942 +0x44
github.com/influxdata/kapacitor/services/httpd.(*ServeMux).ServeHTTP(0xc420417140, 0x2f22640, 0xc4200e6620, 0xc42000ae00)
	/home/dankan/src/github.com/influxdata/kapacitor/services/httpd/mux.go:163 +0x130
github.com/influxdata/kapacitor/services/httpd.(*Handler).ServeHTTP(0xc42019a180, 0x2f22640, 0xc4200e6620, 0xc42000ae00)
	/home/dankan/src/github.com/influxdata/kapacitor/services/httpd/handler.go:365 +0xcf
net/http.serverHandler.ServeHTTP(0xc4201e68f0, 0x2f22640, 0xc4200e6620, 0xc42000ae00)
	/home/dankan/Work/golang/go/src/net/http/server.go:2568 +0x92
net/http.(*conn).serve(0xc4202119a0, 0x2f24680, 0xc420f1ab80)
	/home/dankan/Work/golang/go/src/net/http/server.go:1825 +0x612
created by net/http.(*Server).Serve
	/home/dankan/Work/golang/go/src/net/http/server.go:2668 +0x2ce

goroutine 484 [select]:
github.com/influxdata/kapacitor.(*ExecutingTask).runSnapshotter(0xc42010ee10)
	/home/dankan/src/github.com/influxdata/kapacitor/task.go:547 +0x596
created by github.com/influxdata/kapacitor.(*ExecutingTask).start
	/home/dankan/src/github.com/influxdata/kapacitor/task.go:222 +0x1af

goroutine 219 [IO wait]:
net.runtime_pollWait(0x7ff06607ec20, 0x72, 0x6)
	/home/dankan/Work/golang/go/src/runtime/netpoll.go:164 +0x59
net.(*pollDesc).wait(0xc42035abc8, 0x72, 0x2f12dc0, 0x2f057c8)
	/home/dankan/Work/golang/go/src/net/fd_poll_runtime.go:75 +0x38
net.(*pollDesc).waitRead(0xc42035abc8, 0xc42031b000, 0x1000)
	/home/dankan/Work/golang/go/src/net/fd_poll_runtime.go:80 +0x34
net.(*netFD).Read(0xc42035ab60, 0xc42031b000, 0x1000, 0x1000, 0x0, 0x2f12dc0, 0x2f057c8)
	/home/dankan/Work/golang/go/src/net/fd_unix.go:250 +0x1b7
net.(*conn).Read(0xc420f0a1e0, 0xc42031b000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/home/dankan/Work/golang/go/src/net/net.go:181 +0x70
net/http.(*connReader).Read(0xc420f1a980, 0xc42031b000, 0x1000, 0x1000, 0x0, 0xc420887bb0, 0x5c8a48)
	/home/dankan/Work/golang/go/src/net/http/server.go:754 +0x140
bufio.(*Reader).fill(0xc42078aae0)
	/home/dankan/Work/golang/go/src/bufio/bufio.go:97 +0x117
bufio.(*Reader).Peek(0xc42078aae0, 0x4, 0x0, 0x0, 0x0, 0x0, 0xc420887c10)
	/home/dankan/Work/golang/go/src/bufio/bufio.go:129 +0x67
net/http.(*conn).readRequest(0xc420160500, 0x2f24680, 0xc420f1a940, 0x0, 0x0, 0x0)
	/home/dankan/Work/golang/go/src/net/http/server.go:931 +0xe91
net/http.(*conn).serve(0xc420160500, 0x2f24680, 0xc420f1a940)
	/home/dankan/Work/golang/go/src/net/http/server.go:1763 +0x49a
created by net/http.(*Server).Serve
	/home/dankan/Work/golang/go/src/net/http/server.go:2668 +0x2ce

goroutine 483 [runnable]:
github.com/influxdata/kapacitor.(*Edge).NextBatch(0xc4216bc000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/home/dankan/src/github.com/influxdata/kapacitor/edge.go:168 +0x264
github.com/influxdata/kapacitor.(*StateTrackingNode).runStateTracking(0xc4200d9e00, 0x0, 0x0, 0x0, 0xc4206aff78, 0xc420014640)
	/home/dankan/src/github.com/influxdata/kapacitor/state_tracking.go:109 +0xe42
github.com/influxdata/kapacitor.(*StateTrackingNode).(github.com/influxdata/kapacitor.runStateTracking)-fm(0x0, 0x0, 0x0, 0xc4206affa0, 0xc421100276)
	/home/dankan/src/github.com/influxdata/kapacitor/state_tracking.go:179 +0x48
github.com/influxdata/kapacitor.(*node).start.func1(0xc4200d9e00, 0x0, 0x0, 0x0)
	/home/dankan/src/github.com/influxdata/kapacitor/node.go:140 +0x8e
created by github.com/influxdata/kapacitor.(*node).start
	/home/dankan/src/github.com/influxdata/kapacitor/node.go:141 +0x5d

goroutine 485 [select]:
github.com/influxdata/kapacitor.(*ExecutingTask).calcThroughput(0xc42010ee10)
	/home/dankan/src/github.com/influxdata/kapacitor/task.go:423 +0x2da
created by github.com/influxdata/kapacitor.(*ExecutingTask).start
	/home/dankan/src/github.com/influxdata/kapacitor/task.go:226 +0x155
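
For anyone else landing here, below is a minimal, self-contained Go sketch of the failure class in the trace above. It is an illustration under stated assumptions only, not Kapacitor's actual state_tracking.go code: the tracker type, its fields, and the mutex fix are made up. The point is just that a plain map assigned from several goroutines trips Go's concurrent-write detector and aborts the whole process, which matches how kapacitord is dying, and that guarding the map (or giving it a single owner goroutine) is the usual remedy.

// Minimal sketch of the "fatal error: concurrent map writes" failure class.
// NOT Kapacitor's real code; types and names are invented for illustration.
package main

import (
	"fmt"
	"sync"
)

// tracker mimics a per-group state map (group tags -> duration) that is
// updated from multiple goroutines.
type tracker struct {
	mu sync.Mutex // remove this lock and the writes below race, aborting
	// the whole process with "fatal error: concurrent map writes"
	states map[string]int64
}

func (t *tracker) update(group string, d int64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.states[group] = d // without the lock, this is the crashing mapassign
}

func main() {
	t := &tracker{states: make(map[string]int64)}

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ { // four concurrent writers
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 100000; j++ {
				t.update("domain=example.com,monitor_name=m1", int64(j))
			}
		}()
	}
	wg.Wait()
	fmt.Println("finished without a concurrent map write panic:", t.states)
}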

Thanks!

wshi5985 commented May 18, 2017

With the binary from #1380, I haven't seen a crash for 24 hours so far, but stateDuration is not fully functioning now.

From httpOut I can see that the response time is always above the threshold, but it does not trigger any alert.

var response_data = stream
    |from()
        .database('iacpl')
        .measurement(measurement)
        .groupBy('domain','monitor_name')
    |window()
        .period(4m)
        .every(1m)

response_data
    |stateDuration(lambda: int("result_status") != 1 )
        .unit(1m)
        .as('resultstatus_events_duration')
    |stateDuration(lambda: int("httpcode_status") != 1 )
        .unit(1m)
        .as('httpcode_events_duration')
    |stateDuration(lambda: float("response_time") > float("responsetime_threshold") )
        .unit(1m)
        .as('responsetime_events_duration')
    |httpOut('response_data')

response_data
    |alert()
        .crit(lambda: "resultstatus_events_duration" > 2 )
        .alerta()
            .message('vip {{ index .Tags "domain" }} monitor {{ index .Tags "monitor_name" }} http result error, result matched: {{ index .Fields "result_matched" }} result error: {{ index .Fields "result_error" }}' + string(arp))
            .event('vip {{ index .Tags "domain" }} monitor {{ index .Tags "monitor_name" }} http result error')
            .resource(resource)
            .value('ERROR')

response_data
    |alert()
        .crit(lambda: "httpcode_events_duration" > 2 )
        .alerta()
            .message('vip {{ index .Tags "domain" }} monitor {{ index .Tags "monitor_name" }} http code error, http code: {{ index .Fields "http_code" }} http error: {{ index .Fields "http_error" }}' + string(arp))
            .event('vip {{ index .Tags "domain" }} monitor {{ index .Tags "monitor_name" }} http code error')
            .resource(resource)
            .value('ERROR')

response_data
    |alert()
        .crit(lambda: "responsetime_events_duration" > 1 )
        .alerta()
            .message('vip {{ index .Tags "domain" }} monitor {{ index .Tags "monitor_name" }} response time high {{ index .Fields "response_time" }}s > threshold {{ index .Fields "responsetime_threshold" }}s' + string(arp))
            .event('vip {{ index .Tags "domain" }} monitor {{ index .Tags "monitor_name" }} response time high')
            .resource(resource)
            .value('ERROR')

$ kapacitor show monitor_test


DOT:
digraph vip_test {
graph [throughput="0.00 points/s"];

stream0 [avg_exec_time_ns="0s" errors="0" working_cardinality="0" ];
stream0 -> from1 [processed="163488"];

from1 [avg_exec_time_ns="2.772µs" errors="0" working_cardinality="0" ];
from1 -> window2 [processed="163488"];

window2 [avg_exec_time_ns="16.578µs" errors="0" working_cardinality="111" ];
window2 -> alert9 [processed="102184"];
window2 -> alert8 [processed="102184"];
window2 -> alert7 [processed="102184"];
window2 -> state_duration3 [processed="102184"];

alert9 [alerts_triggered="0" avg_exec_time_ns="8.884657ms" crits_triggered="0" errors="476091" infos_triggered="0" oks_triggered="0" warns_triggered="0" working_cardinality="111" ];

alert8 [alerts_triggered="0" avg_exec_time_ns="112.503µs" crits_triggered="0" errors="476091" infos_triggered="0" oks_triggered="0" warns_triggered="0" working_cardinality="111" ];

alert7 [alerts_triggered="0" avg_exec_time_ns="112.464µs" crits_triggered="0" errors="476091" infos_triggered="0" oks_triggered="0" warns_triggered="0" working_cardinality="111" ];

state_duration3 [avg_exec_time_ns="265.641µs" errors="0" working_cardinality="111" ];
state_duration3 -> state_duration4 [processed="102184"];

state_duration4 [avg_exec_time_ns="134.759µs" errors="0" working_cardinality="111" ];
state_duration4 -> state_duration5 [processed="102184"];

state_duration5 [avg_exec_time_ns="106.235µs" errors="0" working_cardinality="111" ];
state_duration5 -> http_out6 [processed="102184"];

http_out6 [avg_exec_time_ns="77.221µs" errors="0" working_cardinality="111" ];
}

nathanielc commented May 18, 2017

@wshi5985 Thanks for testing out the PR in #1380! As for the current issue, I think it's a simple typo in the TICKscript: the alert nodes read from response_data, the raw window output, so the *_events_duration fields produced by the stateDuration chain never reach them. The alerts need to consume the end of that chain instead.

Try this edit:

var response_data = stream
    |from()
        .database('iacpl')
        .measurement(measurement)
        .groupBy('domain','monitor_name')
    |window()
        .period(4m)
        .every(1m)

var state_data = response_data
    |stateDuration(lambda: int("result_status") != 1 )
        .unit(1m)
        .as('resultstatus_events_duration')
    |stateDuration(lambda: int("httpcode_status") != 1 )
        .unit(1m)
        .as('httpcode_events_duration')
    |stateDuration(lambda: float("response_time") > float("responsetime_threshold") )
        .unit(1m)
        .as('responsetime_events_duration')
    |httpOut('response_data')

state_data
    |alert()
        .crit(lambda: "resultstatus_events_duration" > 2 )
        .alerta()
            .message('vip {{ index .Tags "domain" }} monitor {{ index .Tags "monitor_name" }} http result error, result matched: {{ index .Fields "result_matched" }} result error: {{ index .Fields "result_error" }}' + string(arp))
            .event('vip {{ index .Tags "domain" }} monitor {{ index .Tags "monitor_name" }} http result error')
            .resource(resource)
            .value('ERROR')

state_data
    |alert()
        .crit(lambda: "httpcode_events_duration" > 2 )
        .alerta()
            .message('vip {{ index .Tags "domain" }} monitor {{ index .Tags "monitor_name" }} http code error, http code: {{ index .Fields "http_code" }} http error: {{ index .Fields "http_error" }}' + string(arp))
            .event('vip {{ index .Tags "domain" }} monitor {{ index .Tags "monitor_name" }} http code error')
            .resource(resource)
            .value('ERROR')

state_data
    |alert()
        .crit(lambda: "responsetime_events_duration" > 1 )
        .alerta()
            .message('vip {{ index .Tags "domain" }} monitor {{ index .Tags "monitor_name" }} response time high {{ index .Fields "response_time" }}s > threshold {{ index .Fields "responsetime_threshold" }}s' + string(arp))
            .event('vip {{ index .Tags "domain" }} monitor {{ index .Tags "monitor_name" }} response time high')
            .resource(resource)
            .value('ERROR')

@wshi5985

Thanks, @nathanielc, it has started alerting now.

@nathanielc

@wshi5985 Thanks for the detailed reports. Looks like we have this fixed.
