[Heartbeat] Fix incorrect 'Up' status for all mode #11895
Conversation
Pinging @elastic/uptime
LGTM
Did not run it locally yet.
heartbeat/eventext/eventext.go
Outdated
@@ -30,3 +30,22 @@ func MergeEventFields(e *beat.Event, merge common.MapStr) {
		e.Fields = merge
	}
}

// EventCalledMetaKey is the full path to the key marking an event as cancelled.
const EventCancelledMetaKey = "meta.__hb_evt_cancel__"
Nit: Leave out the `meta.` prefix, as otherwise if you use `Put` with a `.` in the key it will create a tree, and then the `Get` has to traverse that tree. Not having this would make the checks more efficient, assuming this happens a lot of the time.
Yeah, it is unfortunate that this is more expensive, but otherwise it winds up in `event.Fields`, which is even hackier. I'm not really worried about the perf here since that should be very fast, even though it isn't ideal. WDYT?
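To illustrate the trade-off being discussed (a standalone sketch, not code from this PR): with `common.MapStr`, a dotted key makes `Put` build a nested map that `GetValue` must then traverse, while a flat key is a single map entry.

```go
// Illustrative sketch only: dotted vs. flat keys in common.MapStr.
package main

import (
	"fmt"

	"github.com/elastic/beats/libbeat/common"
)

func main() {
	nested := common.MapStr{}
	// A dotted key makes Put create a nested structure:
	// {"meta": {"__hb_evt_cancel__": true}}
	nested.Put("meta.__hb_evt_cancel__", true)

	flat := common.MapStr{}
	// A flat key is a single top-level entry: {"__hb_evt_cancel__": true}
	flat.Put("__hb_evt_cancel__", true)

	// GetValue on the dotted key has to walk the nested maps;
	// the flat key is one map lookup.
	v1, _ := nested.GetValue("meta.__hb_evt_cancel__")
	v2, _ := flat.GetValue("__hb_evt_cancel__")
	fmt.Println(v1, v2) // true true
}
```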
	logp.Err("Job %v failed with: ", err)
}

if event.Fields != nil && !eventext.IsEventCancelled(event) {
Previously there was an `event != nil` check here. I assume that because of :138 this cannot happen?
That is correct. My IDE actually flagged that as redundant.
`event` is being passed as a pointer to `job`, which could nil it, but that's something we won't ever do as far as I'm concerned.
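For context, here is a minimal sketch of the cancellation mechanism being discussed, assuming the flag lives in `event.Meta`. `EventCancelledMetaKey` and `IsEventCancelled` are the identifiers referenced in the diff; `CancelEvent` is a hypothetical helper name here, and the real `heartbeat/eventext` code may differ in detail.

```go
// Minimal sketch of the cancellation-flag approach; not the exact eventext code.
package eventext

import (
	"github.com/elastic/beats/libbeat/beat"
	"github.com/elastic/beats/libbeat/common"
)

// EventCancelledMetaKey is the @metadata key marking an event as cancelled.
const EventCancelledMetaKey = "__hb_evt_cancel__"

// CancelEvent (hypothetical name) marks the event so it can be dropped before publishing.
func CancelEvent(event *beat.Event) {
	if event == nil {
		return
	}
	if event.Meta == nil {
		event.Meta = common.MapStr{}
	}
	event.Meta.Put(EventCancelledMetaKey, true)
}

// IsEventCancelled reports whether the event carries the cancellation flag.
func IsEventCancelled(event *beat.Event) bool {
	if event == nil || event.Meta == nil {
		return false
	}
	v, err := event.Meta.GetValue(EventCancelledMetaKey)
	b, ok := v.(bool)
	return err == nil && ok && b
}
```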
@andrewvc I'm getting a panic running the target scenario. It seems that `MapStr.DeepUpdate` is not thread-safe and that goroutines 114 and 115 are writing concurrently.
Good catch @odacremolbap. I wasn't seeing the segfault consistently. I don't think we really introduced the bug here; we just sort of exposed it. After a lot of head scratching I found the root cause in dd9fe1168, where we re-used maps across continuations. Tested this by running:

heartbeat.monitors:
- type: http
  urls: ["http://elastic.co:9200"]
  mode: all
  schedule: '@every 10s'
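A rough sketch of how that kind of race can be avoided (illustrative only, not the actual change in 237b8a8, and `mergeIntoEvent` is a made-up helper): give each continuation its own copy of the shared map before merging, so `DeepUpdate` never runs against a map another goroutine is also writing.

```go
// Illustrative sketch; mergeIntoEvent is a hypothetical helper, not beats code.
package sketch

import (
	"github.com/elastic/beats/libbeat/beat"
	"github.com/elastic/beats/libbeat/common"
)

// mergeIntoEvent merges shared fields into an event without sharing the
// underlying maps between concurrently running continuation jobs.
func mergeIntoEvent(event *beat.Event, shared common.MapStr) {
	if event.Fields == nil {
		event.Fields = common.MapStr{}
	}
	// Clone deep-copies the map, so this goroutine's DeepUpdate cannot race
	// with another continuation that received the same shared map.
	event.Fields.DeepUpdate(shared.Clone())
}
```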
heartbeat/eventext/eventext.go
Outdated
	}
}

// EventCalledMetaKey is the path to the @metadata key marking an event as cancelled.
comment on exported const EventCancelledMetaKey should be of the form "EventCancelledMetaKey ..."
@ruflin @odacremolbap I've made a final push here to fix the concurrency in a more general way by just cloning in 237b8a8. We could do it better in the future with #11941, which avoids some of this complexity.
This reverts commit dd9fe116838bbbb5f60a2cc0921f1f5c7d071ce1.
Skimmed through it again and it looks good (with a few nits), but it would be good if @odacremolbap could take a look again.
CI failures look related.
Tests are now all passing. Can I get a final LGTM here @ruflin?
LGTM
@andrewvc since this is a 6.6 to 6.7 regression, will this be backported to 6.x? Mostly asking at this point since I don't know how this is decided. On an unrelated note, I also wanted to thank you for your work on this issue.
This fixes #11737, where setting `mode: all` now spawns multiple sub-tasks, but the parent task still runs as a job, even though it tests very little and will always succeed so long as the DNS query does. It is essentially a parent job testing that the DNS resolver works.

From a certain strict POV the behavior before this patch is correct. We executed a discrete job (checking DNS) and it worked successfully. A second part of the job, actually hitting the endpoints, failed. The total job (as noted in the summary) did fail, but a sub-part did succeed. That said, this is too complex and a bad UX, so this patch makes sense. Maybe in the future we will be under different constraints.

The fix here involved marking that job's emitted event as cancelled using job metadata. I'd discussed using `event.private` for that with @ruflin, but looking at the code, metadata seemed more appropriate. I'd love to hear your $0.02 @ruflin if you think private is more appropriate. Metricbeat wraps all events in its own metadata struct, but I don't think we're there yet in heartbeat in terms of a major refactor being justified.

Testing this manually is easy: just point at a domain with multiple A records, like `elastic.co`. Truly integration testing it programmatically is hard without setting up a DNS resolver. The compromise I reached here is unit tests for the individual bits of logic.

Fixes #11737
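To make the description concrete, here is a simplified sketch of the shape of the fix (the `job` type and the function names are invented for illustration and are not heartbeat's real job API): the parent job only resolves DNS, cancels its own event, and fans out one continuation per resolved IP, so only the per-IP checks contribute up/down status.

```go
// Simplified illustration; not heartbeat's actual job API.
package sketch

import (
	"net"

	"github.com/elastic/beats/libbeat/beat"
)

// job stands in for heartbeat's continuation-style job type.
type job func(event *beat.Event) ([]job, error)

// resolveAllJob resolves the host, cancels its own (DNS-only) event, and
// returns one continuation job per resolved IP.
func resolveAllJob(host string, pingIP func(ip string) job, cancel func(*beat.Event)) job {
	return func(event *beat.Event) ([]job, error) {
		ips, err := net.LookupHost(host)
		if err != nil {
			return nil, err
		}
		// The parent event only proves that DNS resolution works; cancel it
		// so it is never published as an 'up' check result.
		cancel(event)

		conts := make([]job, 0, len(ips))
		for _, ip := range ips {
			conts = append(conts, pingIP(ip))
		}
		return conts, nil
	}
}
```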