
[Heartbeat] Fix incorrect 'Up' status for all mode #11895

Merged
merged 14 commits into elastic:master from fix-all-pings on May 1, 2019

Conversation

@andrewvc (Contributor) commented Apr 23, 2019

This fixes #11737, where setting mode: all now spawns multiple sub-tasks, but the parent task still runs as a job, even though it tests very little and will always succeed as long as the DNS query does. It is essentially a parent job testing that the DNS resolver works.

From a certain strict POV the behavior before this patch is correct. We executed a discrete job (checking DNS) and it worked successfully. A second part of the job, actually hitting the endpoints, failed. The total job (as noted in the summary) did fail, but a sub-part did succeed. That said, this is too complex and a bad UX, so this patch makes sense. Maybe in the future we will be under different constraints.

The fix here involved marking that job's emitted event as cancelled using job metadata. I'd discussed using event.private for that with @ruflin, but looking at the code, metadata seemed more appropriate. I'd love to hear your $0.02, @ruflin, if you think private is more appropriate. Metricbeat wraps all events in its own metadata struct, but I don't think we're there yet in heartbeat in terms of a major refactor being justified.
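For illustration only, a minimal, self-contained sketch of the marking idea; the Event struct and helper names below are hypothetical stand-ins, not the Heartbeat code (the real patch stores the flag under a metadata key, shown in the diff further down):

package main

import "fmt"

// Event is a stand-in for libbeat's beat.Event; only the metadata map matters here.
type Event struct {
	Meta   map[string]interface{}
	Fields map[string]interface{}
}

// cancelledKey is a hypothetical flat key; the PR itself uses a "meta."-prefixed path.
const cancelledKey = "__hb_evt_cancel__"

// CancelEvent marks an event so later pipeline stages can drop it instead of publishing it.
func CancelEvent(e *Event) {
	if e.Meta == nil {
		e.Meta = map[string]interface{}{}
	}
	e.Meta[cancelledKey] = true
}

// IsEventCancelled reports whether an event has been marked as cancelled.
func IsEventCancelled(e *Event) bool {
	if e == nil || e.Meta == nil {
		return false
	}
	v, ok := e.Meta[cancelledKey].(bool)
	return ok && v
}

func main() {
	parent := &Event{Fields: map[string]interface{}{"monitor": "http"}}
	CancelEvent(parent) // the DNS-only parent job is marked instead of reporting a misleading "up"
	fmt.Println(IsEventCancelled(parent)) // true
}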

Testing this manually is easy: just point at a domain with multiple A records, like elastic.co. Truly integration testing it programmatically is hard without setting up a DNS resolver. The compromise I reached here is unit tests for the individual bits of logic.

Fixes #11737

@andrewvc added the "in progress" (Pull request is currently in progress.), "Heartbeat", and "Team:obs-ds-hosted-services" (Label for the Observability Hosted Services team) labels on Apr 23, 2019
@andrewvc andrewvc requested a review from a team as a code owner April 23, 2019 03:08
@andrewvc andrewvc self-assigned this Apr 23, 2019
@elasticmachine (Collaborator) commented:

Pinging @elastic/uptime

Review threads on heartbeat/eventext/eventext.go (resolved; two marked outdated).
@andrewvc andrewvc requested a review from odacremolbap April 23, 2019 21:42
@andrewvc added the "review" label and removed the "in progress" label on Apr 23, 2019
@andrewvc andrewvc requested a review from ruflin April 23, 2019 21:43
@andrewvc added the "bug" and "needs_backport" (PR is waiting to be backported to other branches.) labels on Apr 23, 2019
@andrewvc changed the title from "[Heartbeat][WIP] Initial concept for fixing 'all' pings" to "[Heartbeat] Fix incorrect 'Up' status for all mode" on Apr 23, 2019
@ruflin (Contributor) left a comment:

LGTM

Did not run it locally yet.

@@ -30,3 +30,22 @@ func MergeEventFields(e *beat.Event, merge common.MapStr) {
e.Fields = merge
}
}

// EventCalledMetaKey is the full path to the key marking an event as cancelled.
const EventCancelledMetaKey = "meta.__hb_evt_cancel__"
A reviewer (Contributor) commented:

Nit: leave out the meta. prefix; otherwise, if you use Put with a . in the key it will create a tree, and Get then has to traverse that tree. Not having it would make the checks more efficient, assuming this happens a lot.

@andrewvc (author) replied:

Yeah, it is unfortunate that this is more expensive, but otherwise it winds up in event.Fields, which is even hackier. I'm not really worried about the perf here since that should still be very fast, even though it isn't ideal. WDYT?
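For illustration (plain Go maps rather than libbeat's MapStr), the difference behind the nit: a dotted key is stored as a nested tree and takes two lookups, while a flat key takes one.

package main

import "fmt"

func main() {
	// "meta.__hb_evt_cancel__" stored as a tree: reads must first fetch "meta",
	// then look up the flag inside it.
	nested := map[string]interface{}{
		"meta": map[string]interface{}{
			"__hb_evt_cancel__": true,
		},
	}
	if meta, ok := nested["meta"].(map[string]interface{}); ok {
		fmt.Println(meta["__hb_evt_cancel__"]) // two lookups
	}

	// A flat key needs a single lookup, which is the efficiency point being made.
	flat := map[string]interface{}{
		"__hb_evt_cancel__": true,
	}
	fmt.Println(flat["__hb_evt_cancel__"]) // one lookup
}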

logp.Err("Job %v failed with: ", err)
}

if event.Fields != nil && !eventext.IsEventCancelled(event) {
A reviewer (Contributor) commented:

Previously there was an event != nil check here. I assume because of :138 this cannot happen?

@andrewvc (author) replied:

That is correct. My IDE actually flagged that as redundant.

Another reviewer (Contributor) commented:

event is being passed as a pointer to the job, which could nil it, but that's something we won't ever do as far as I am concerned.
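A condensed, self-contained sketch of the guard being discussed; the Event type and helper below are hypothetical, not the actual task.go code:

package main

import "fmt"

type Event struct {
	Meta   map[string]interface{}
	Fields map[string]interface{}
}

func isCancelled(e *Event) bool {
	v, ok := e.Meta["__hb_evt_cancel__"].(bool)
	return ok && v
}

// maybePublish mirrors the guard: events without fields, or events marked
// cancelled, never reach the publisher.
func maybePublish(e *Event, publish func(*Event)) {
	if e.Fields != nil && !isCancelled(e) {
		publish(e)
	}
}

func main() {
	published := 0
	count := func(*Event) { published++ }

	maybePublish(&Event{Fields: map[string]interface{}{"status": "up"}}, count)
	maybePublish(&Event{Fields: map[string]interface{}{}, Meta: map[string]interface{}{"__hb_evt_cancel__": true}}, count)

	fmt.Println(published) // 1: the cancelled event was dropped
}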

@odacremolbap (Contributor) commented:

@andrewvc I'm getting a panic running the target scenario.

It seems that MapStr.DeepUpdate is not thread safe and that goroutines 114 and 115 are writing concurrently.
I'm still learning this beat and its goroutine model, and will need some more time to investigate further.

2019-04-25T00:07:30.493+0200	INFO	[monitoring]	log/log.go:144	Non-zero metrics in the last 30s	{"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":13,"time":{"ms":13}},"total":{"ticks":35,"time":{"ms":35},"value":35},"user":{"ticks":22,"time":{"ms":22}}},"info":{"ephemeral_id":"df233f57-6054-47a2-bdef-786b0e68afa6","uptime":{"ms":30017}},"memstats":{"gc_next":4194304,"memory_alloc":2507000,"memory_total":4034584,"rss":18956288}},"heartbeat":{"http":{"endpoint_starts":1,"monitor_starts":1}},"libbeat":{"config":{"module":{"running":0},"reloads":1},"output":{"type":"elasticsearch"},"pipeline":{"clients":1,"events":{"active":0}}},"system":{"cpu":{"cores":12},"load":{"1":1.5054,"15":1.7524,"5":1.7104,"norm":{"1":0.1254,"15":0.146,"5":0.1425}}}}}}
2019-04-25T00:07:30.499+0200	DEBUG	[scheduler]	scheduler/scheduler.go:343	Start job 'auto-http-0X42B10F5590324BD9' at 2019-04-25 00:07:30.499282 +0200 CEST m=+30.027691338.
2019-04-25T00:07:30.499+0200	DEBUG	[scheduler]	scheduler/scheduler.go:206	Next wakeup time: 2019-04-25 00:07:40.494498 +0200 CEST
2019-04-25T00:07:30.501+0200	DEBUG	[scheduler]	scheduler/scheduler.go:248	Job 'auto-http-0X42B10F5590324BD9' returned at 2019-04-25 00:07:30.501034 +0200 CEST m=+30.029442922 (cont=2).
2019-04-25T00:07:30.501+0200	DEBUG	[scheduler]	scheduler/scheduler.go:302	start returned tasks
2019-04-25T00:07:30.501+0200	DEBUG	[scheduler]	scheduler/scheduler.go:206	Next wakeup time: 2019-04-25 00:07:40.494498 +0200 CEST
fatal error: concurrent map writes
fatal error: concurrent map writes

goroutine 115 [running]:
runtime.throw(0x4da4cb0, 0x15)
	/usr/local/go/src/runtime/panic.go:617 +0x72 fp=0xc0003156e8 sp=0xc0003156b8 pc=0x402f5d2
runtime.mapassign_faststr(0x4d1c5c0, 0xc0003e06c0, 0x4d96ceb, 0x6, 0xc0003e59e0)
	/usr/local/go/src/runtime/map_faststr.go:291 +0x3ff fp=0xc000315750 sp=0xc0003156e8 pc=0x401549f
github.com/elastic/beats/libbeat/common.MapStr.DeepUpdate(0xc0003e06c0, 0xc000156c90)
	/Users/pmercado/go/src/github.com/elastic/beats/libbeat/common/mapstr.go:75 +0x12c fp=0xc000315828 sp=0xc000315750 pc=0x43d796c
github.com/elastic/beats/libbeat/common.deepUpdateValue(0x4d1c5c0, 0xc0003e06c0, 0xc000156c90, 0x7, 0xc0003e4088)
	/Users/pmercado/go/src/github.com/elastic/beats/libbeat/common/mapstr.go:87 +0xa3 fp=0xc000315858 sp=0xc000315828 pc=0x43d7c43
github.com/elastic/beats/libbeat/common.MapStr.DeepUpdate(0xc0003e1320, 0xc000156cc0)
	/Users/pmercado/go/src/github.com/elastic/beats/libbeat/common/mapstr.go:73 +0x1ac fp=0xc000315930 sp=0xc000315858 pc=0x43d79ec
github.com/elastic/beats/heartbeat/eventext.MergeEventFields(...)
	/Users/pmercado/go/src/github.com/elastic/beats/heartbeat/eventext/eventext.go:28
github.com/elastic/beats/heartbeat/monitors/wrappers.addMonitorStatus.func1(0xc000154380, 0xc000152410, 0x8a3d74c977bf, 0x100000001, 0xc000288a10, 0x404fcb8)
	/Users/pmercado/go/src/github.com/elastic/beats/heartbeat/monitors/wrappers/monitors.go:99 +0x20e fp=0xc0003159c0 sp=0xc000315930 pc=0x440ccfe
github.com/elastic/beats/heartbeat/monitors/jobs.Wrap.func1(0xc000154380, 0x6fde67ff3, 0x58bf1c0, 0xc0004f6400, 0x78021f0, 0x0)
	/Users/pmercado/go/src/github.com/elastic/beats/heartbeat/monitors/jobs/job.go:72 +0x50 fp=0xc000315a48 sp=0xc0003159c0 pc=0x440aba0
github.com/elastic/beats/heartbeat/monitors/wrappers.addMonitorDuration.func1(0xc000154380, 0xc000152400, 0xc000288bf0, 0x203000, 0x203000, 0x0)
	/Users/pmercado/go/src/github.com/elastic/beats/heartbeat/monitors/wrappers/monitors.go:109 +0x76 fp=0xc000315af8 sp=0xc000315a48 pc=0x440ce46
github.com/elastic/beats/heartbeat/monitors/jobs.Wrap.func1(0xc000154380, 0x40, 0x48, 0xc0001500f0, 0x0, 0x0)
	/Users/pmercado/go/src/github.com/elastic/beats/heartbeat/monitors/jobs/job.go:72 +0x50 fp=0xc000315b80 sp=0xc000315af8 pc=0x440aba0
github.com/elastic/beats/heartbeat/monitors/wrappers.addMonitorMeta.func1.1(0xc000154380, 0xc0001500f0, 0x0, 0x0, 0xc0004bcd40, 0x400e3c9)
	/Users/pmercado/go/src/github.com/elastic/beats/heartbeat/monitors/wrappers/monitors.go:55 +0x95 fp=0xc000315c80 sp=0xc000315b80 pc=0x440c475
github.com/elastic/beats/heartbeat/monitors/jobs.Wrap.func1(0xc000154380, 0x78021f0, 0x0, 0x203000, 0x203000, 0x0)
	/Users/pmercado/go/src/github.com/elastic/beats/heartbeat/monitors/jobs/job.go:72 +0x50 fp=0xc000315d08 sp=0xc000315c80 pc=0x440aba0
github.com/elastic/beats/heartbeat/monitors/wrappers.makeAddSummary.func2.1(0xc000154380, 0x0, 0x0, 0x0, 0x0, 0x0)
	/Users/pmercado/go/src/github.com/elastic/beats/heartbeat/monitors/wrappers/monitors.go:154 +0x78 fp=0xc000315db0 sp=0xc000315d08 pc=0x440d328
github.com/elastic/beats/heartbeat/monitors/jobs.Wrap.func1(0xc000154380, 0xc000154380, 0x0, 0x0, 0x0, 0x0)
	/Users/pmercado/go/src/github.com/elastic/beats/heartbeat/monitors/jobs/job.go:72 +0x50 fp=0xc000315e38 sp=0xc000315db0 pc=0x440aba0
github.com/elastic/beats/heartbeat/monitors.runPublishJob(0xc00015c400, 0x4f30f00, 0xc00028f320, 0xc0004bcf98, 0x10, 0xc0004bcf90)
	/Users/pmercado/go/src/github.com/elastic/beats/heartbeat/monitors/task.go:141 +0x7c fp=0xc000315f48 sp=0xc000315e38 pc=0x441f12c
github.com/elastic/beats/heartbeat/monitors.runPublishJob.func1(0x10, 0x4de8940, 0xc00015e460)
	/Users/pmercado/go/src/github.com/elastic/beats/heartbeat/monitors/task.go:172 +0x3f fp=0xc000315f88 sp=0xc000315f48 pc=0x4420e5f
github.com/elastic/beats/heartbeat/scheduler.(*Scheduler).runTask.func1(0xc00015e460, 0xc00032c080, 0xc00015e460, 0xc00015c440)
	/Users/pmercado/go/src/github.com/elastic/beats/heartbeat/scheduler/scheduler.go:362 +0x5b fp=0xc000315fc0 sp=0xc000315f88 pc=0x43f113b
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1337 +0x1 fp=0xc000315fc8 sp=0xc000315fc0 pc=0x405f031
created by github.com/elastic/beats/heartbeat/scheduler.(*Scheduler).runTask
	/Users/pmercado/go/src/github.com/elastic/beats/heartbeat/scheduler/scheduler.go:352 +0x5c

goroutine 1 [chan receive]:
github.com/elastic/beats/heartbeat/beater.(*Heartbeat).Run(0xc0001d4000, 0xc00054e000, 0x0, 0x0)
	/Users/pmercado/go/src/github.com/elastic/beats/heartbeat/beater/heartbeat.go:116 +0x134
github.com/elastic/beats/libbeat/cmd/instance.(*Beat).launch(0xc00054e000, 0x4d99f03, 0x9, 0x4d99f03, 0x9, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/Users/pmercado/go/src/github.com/elastic/beats/libbeat/cmd/instance/beat.go:415 +0x50d
github.com/elastic/beats/libbeat/cmd/instance.Run.func1(0x4d99f03, 0x9, 0x4d99f03, 0x9, 0x0, 0x0, 0x4d99f03, 0x9, 0x4d99f03, 0x9, ...)
	/Users/pmercado/go/src/github.com/elastic/beats/libbeat/cmd/instance/beat.go:185 +0x63c
github.com/elastic/beats/libbeat/cmd/instance.Run(0x4d99f03, 0x9, 0x4d99f03, 0x9, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/Users/pmercado/go/src/github.com/elastic/beats/libbeat/cmd/instance/beat.go:186 +0xbe
github.com/elastic/beats/libbeat/cmd.genRunCmd.func1(0xc00042a000, 0xc0001500a0, 0x0, 0x5)
	/Users/pmercado/go/src/github.com/elastic/beats/libbeat/cmd/run.go:36 +0x9a
github.com/elastic/beats/vendor/github.com/spf13/cobra.(*Command).execute(0xc00042a000, 0xc0000c60d0, 0x5, 0x5, 0xc00042a000, 0xc0000c60d0)
	/Users/pmercado/go/src/github.com/elastic/beats/vendor/github.com/spf13/cobra/command.go:704 +0x289
github.com/elastic/beats/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xc00042a000, 0x4aac3aa, 0xc000269f88, 0xc0000ae058)
	/Users/pmercado/go/src/github.com/elastic/beats/vendor/github.com/spf13/cobra/command.go:785 +0x29f
github.com/elastic/beats/vendor/github.com/spf13/cobra.(*Command).Execute(...)
	/Users/pmercado/go/src/github.com/elastic/beats/vendor/github.com/spf13/cobra/command.go:738
main.main()
	/Users/pmercado/go/src/github.com/elastic/beats/heartbeat/main.go:29 +0x30

goroutine 6 [syscall]:
os/signal.signal_recv(0x0)
	/usr/local/go/src/runtime/sigqueue.go:139 +0x9f
os/signal.loop()
	/usr/local/go/src/os/signal/signal_unix.go:23 +0x22
created by os/signal.init.0
	/usr/local/go/src/os/signal/signal_unix.go:29 +0x41

goroutine 53 [select]:
github.com/elastic/beats/libbeat/publisher/queue/memqueue.(*bufferingEventLoop).run(0xc0001600a0)
	/Users/pmercado/go/src/github.com/elastic/beats/libbeat/publisher/queue/memqueue/eventloop.go:316 +0x1cf
github.com/elastic/beats/libbeat/publisher/queue/memqueue.NewBroker.func1(0xc0001c5340, 0x4f24580, 0xc0001600a0)
	/Users/pmercado/go/src/github.com/elastic/beats/libbeat/publisher/queue/memqueue/broker.go:175 +0x5d
created by github.com/elastic/beats/libbeat/publisher/queue/memqueue.NewBroker
	/Users/pmercado/go/src/github.com/elastic/beats/libbeat/publisher/queue/memqueue/broker.go:173 +0x374

goroutine 54 [select]:
github.com/elastic/beats/libbeat/publisher/queue/memqueue.(*ackLoop).run(0xc000150820)
	/Users/pmercado/go/src/github.com/elastic/beats/libbeat/publisher/queue/memqueue/ackloop.go:60 +0x107
github.com/elastic/beats/libbeat/publisher/queue/memqueue.NewBroker.func2(0xc0001c5340, 0xc000150820)
	/Users/pmercado/go/src/github.com/elastic/beats/libbeat/publisher/queue/memqueue/broker.go:179 +0x57
created by github.com/elastic/beats/libbeat/publisher/queue/memqueue.NewBroker
	/Users/pmercado/go/src/github.com/elastic/beats/libbeat/publisher/queue/memqueue/broker.go:177 +0x3a3

goroutine 55 [select]:
github.com/elastic/beats/libbeat/publisher/queue/memqueue.(*consumer).Get(0xc00015d960, 0x32, 0x0, 0x0, 0xc0004b7601, 0x1)
	/Users/pmercado/go/src/github.com/elastic/beats/libbeat/publisher/queue/memqueue/consume.go:67 +0xeb
github.com/elastic/beats/libbeat/publisher/pipeline.(*eventConsumer).loop(0xc000150910, 0x4f245c0, 0xc00015d960)
	/Users/pmercado/go/src/github.com/elastic/beats/libbeat/publisher/pipeline/consumer.go:167 +0x415
created by github.com/elastic/beats/libbeat/publisher/pipeline.newEventConsumer
	/Users/pmercado/go/src/github.com/elastic/beats/libbeat/publisher/pipeline/consumer.go:77 +0x155

goroutine 56 [select]:
github.com/elastic/beats/libbeat/publisher/pipeline.(*retryer).loop(0xc000155200)
	/Users/pmercado/go/src/github.com/elastic/beats/libbeat/publisher/pipeline/retry.go:131 +0x1ac
created by github.com/elastic/beats/libbeat/publisher/pipeline.newRetryer
	/Users/pmercado/go/src/github.com/elastic/beats/libbeat/publisher/pipeline/retry.go:86 +0x11e

goroutine 57 [chan receive]:
github.com/elastic/beats/libbeat/publisher/pipeline.(*netClientWorker).run(0xc000155240)
	/Users/pmercado/go/src/github.com/elastic/beats/libbeat/publisher/pipeline/output.go:84 +0xc4
created by github.com/elastic/beats/libbeat/publisher/pipeline.makeClientWorker
	/Users/pmercado/go/src/github.com/elastic/beats/libbeat/publisher/pipeline/output.go:48 +0xe0

goroutine 24 [select]:
github.com/elastic/beats/libbeat/monitoring/report/log.(*reporter).snapshotLoop(0xc00040a210)
	/Users/pmercado/go/src/github.com/elastic/beats/libbeat/monitoring/report/log/log.go:128 +0x3c5
github.com/elastic/beats/libbeat/monitoring/report/log.MakeReporter.func1(0xc00040a210)
	/Users/pmercado/go/src/github.com/elastic/beats/libbeat/monitoring/report/log/log.go:106 +0x53
created by github.com/elastic/beats/libbeat/monitoring/report/log.MakeReporter
	/Users/pmercado/go/src/github.com/elastic/beats/libbeat/monitoring/report/log/log.go:104 +0x159

goroutine 58 [chan receive]:
github.com/elastic/beats/libbeat/service.HandleSignals.func1(0xc00040e180, 0xc0000d4f30, 0xc000270070, 0xc0000d02c0)
	/Users/pmercado/go/src/github.com/elastic/beats/libbeat/service/service.go:49 +0x41
created by github.com/elastic/beats/libbeat/service.HandleSignals
	/Users/pmercado/go/src/github.com/elastic/beats/libbeat/service/service.go:48 +0x115

goroutine 27 [select]:
github.com/elastic/beats/libbeat/cfgfile.(*Reloader).Run(0xc0001d4070, 0x4f23b00, 0xc0000d4f10)
	/Users/pmercado/go/src/github.com/elastic/beats/libbeat/cfgfile/reload.go:175 +0x227
created by github.com/elastic/beats/heartbeat/beater.(*Heartbeat).RunReloadableMonitors
	/Users/pmercado/go/src/github.com/elastic/beats/heartbeat/beater/heartbeat.go:150 +0x92

goroutine 28 [select]:
github.com/elastic/beats/heartbeat/scheduler.(*Scheduler).run(0xc00032c080)
	/Users/pmercado/go/src/github.com/elastic/beats/heartbeat/scheduler/scheduler.go:223 +0x6f0
created by github.com/elastic/beats/heartbeat/scheduler.(*Scheduler).Start
	/Users/pmercado/go/src/github.com/elastic/beats/heartbeat/scheduler/scheduler.go:120 +0xb7

goroutine 114 [running]:
	goroutine running on other thread; stack unavailable
created by github.com/elastic/beats/heartbeat/scheduler.(*Scheduler).runTask
	/Users/pmercado/go/src/github.com/elastic/beats/heartbeat/scheduler/scheduler.go:352 +0x5c

goroutine 114 [running]:
runtime.throw(0x4da4cb0, 0x15)
	/usr/local/go/src/runtime/panic.go:617 +0x72 fp=0xc000311800 sp=0xc0003117d0 pc=0x402f5d2
runtime.mapassign_faststr(0x4d1c5c0, 0xc0003e06c0, 0x4d98dec, 0x8, 0xc0003e1650)
	/usr/local/go/src/runtime/map_faststr.go:291 +0x3ff fp=0xc000311868 sp=0xc000311800 pc=0x401549f
github.com/elastic/beats/libbeat/common.MapStr.DeepUpdate(0xc0003e06c0, 0xc0003e1680)
	/Users/pmercado/go/src/github.com/elastic/beats/libbeat/common/mapstr.go:73 +0x1f1 fp=0xc000311940 sp=0xc000311868 pc=0x43d7a31
github.com/elastic/beats/libbeat/common.deepUpdateValue(0x4d1c5c0, 0xc0003e06c0, 0xc0003e1680, 0x7, 0xc0000e5a68)
	/Users/pmercado/go/src/github.com/elastic/beats/libbeat/common/mapstr.go:87 +0xa3 fp=0xc000311970 sp=0xc000311940 pc=0x43d7c43
github.com/elastic/beats/libbeat/common.MapStr.DeepUpdate(0xc000170150, 0xc0003e16b0)
	/Users/pmercado/go/src/github.com/elastic/beats/libbeat/common/mapstr.go:73 +0x1ac fp=0xc000311a48 sp=0xc000311970 pc=0x43d79ec
github.com/elastic/beats/heartbeat/eventext.MergeEventFields(...)
	/Users/pmercado/go/src/github.com/elastic/beats/heartbeat/eventext/eventext.go:28
github.com/elastic/beats/heartbeat/monitors/wrappers.addMonitorDuration.func1(0xc000094380, 0xc000082000, 0x780c7b8, 0x203000, 0x203000, 0x780c7b8)
	/Users/pmercado/go/src/github.com/elastic/beats/heartbeat/monitors/wrappers/monitors.go:112 +0x32b fp=0xc000311af8 sp=0xc000311a48 pc=0x440d0fb
github.com/elastic/beats/heartbeat/monitors/jobs.Wrap.func1(0xc000094380, 0x40, 0x48, 0xc00015e1e0, 0x0, 0x0)
	/Users/pmercado/go/src/github.com/elastic/beats/heartbeat/monitors/jobs/job.go:72 +0x50 fp=0xc000311b80 sp=0xc000311af8 pc=0x440aba0
github.com/elastic/beats/heartbeat/monitors/wrappers.addMonitorMeta.func1.1(0xc000094380, 0xc00015e1e0, 0x780ccc8, 0x9a78068, 0xc0004bdd40, 0x400e3c9)
	/Users/pmercado/go/src/github.com/elastic/beats/heartbeat/monitors/wrappers/monitors.go:55 +0x95 fp=0xc000311c80 sp=0xc000311b80 pc=0x440c475
github.com/elastic/beats/heartbeat/monitors/jobs.Wrap.func1(0xc000094380, 0x78006d0, 0x0, 0x203000, 0x203000, 0x780ccc8)
	/Users/pmercado/go/src/github.com/elastic/beats/heartbeat/monitors/jobs/job.go:72 +0x50 fp=0xc000311d08 sp=0xc000311c80 pc=0x440aba0
github.com/elastic/beats/heartbeat/monitors/wrappers.makeAddSummary.func2.1(0xc000094380, 0x0, 0x0, 0x0, 0x0, 0x0)
	/Users/pmercado/go/src/github.com/elastic/beats/heartbeat/monitors/wrappers/monitors.go:154 +0x78 fp=0xc000311db0 sp=0xc000311d08 pc=0x440d328
github.com/elastic/beats/heartbeat/monitors/jobs.Wrap.func1(0xc000094380, 0xc000094380, 0x0, 0x0, 0x0, 0x0)
	/Users/pmercado/go/src/github.com/elastic/beats/heartbeat/monitors/jobs/job.go:72 +0x50 fp=0xc000311e38 sp=0xc000311db0 pc=0x440aba0
github.com/elastic/beats/heartbeat/monitors.runPublishJob(0xc00015c400, 0x4f30f00, 0xc00028f320, 0xc0004bdf98, 0x10, 0xc0004bdf90)
	/Users/pmercado/go/src/github.com/elastic/beats/heartbeat/monitors/task.go:141 +0x7c fp=0xc000311f48 sp=0xc000311e38 pc=0x441f12c
github.com/elastic/beats/heartbeat/monitors.runPublishJob.func1(0x10, 0x4de8940, 0xc00015e460)
	/Users/pmercado/go/src/github.com/elastic/beats/heartbeat/monitors/task.go:172 +0x3f fp=0xc000311f88 sp=0xc000311f48 pc=0x4420e5f
github.com/elastic/beats/heartbeat/scheduler.(*Scheduler).runTask.func1(0xc00015e460, 0xc00032c080, 0xc00015e460, 0xc00015c420)
	/Users/pmercado/go/src/github.com/elastic/beats/heartbeat/scheduler/scheduler.go:362 +0x5b fp=0xc000311fc0 sp=0xc000311f88 pc=0x43f113b
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1337 +0x1 fp=0xc000311fc8 sp=0xc000311fc0 pc=0x405f031
created by github.com/elastic/beats/heartbeat/scheduler.(*Scheduler).runTask
	/Users/pmercado/go/src/github.com/elastic/beats/heartbeat/scheduler/scheduler.go:352 +0x5c
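For reference, a minimal stand-alone program (plain Go maps rather than MapStr.DeepUpdate) that fails the same way once two goroutines write into one shared map:

package main

import "sync"

// deepUpdate merges src into dst in place, loosely like a simplified MapStr.DeepUpdate.
func deepUpdate(dst, src map[string]interface{}) {
	for k, v := range src {
		dst[k] = v
	}
}

func main() {
	shared := map[string]interface{}{} // one map shared by two concurrently running jobs

	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for j := 0; j < 100000; j++ {
				deepUpdate(shared, map[string]interface{}{"monitor": id}) // unsynchronized writes
			}
		}(i)
	}
	wg.Wait() // typically dies with: fatal error: concurrent map writes (or a -race report)
}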

@andrewvc (author) commented:

Good catch @odacremolbap. I wasn't seeing the segfault consistently, but running go run -race main.go -e turned up consistent races here.

I don't think we really introduced the bug here; we just sort of exposed it. After a lot of head scratching I found the root cause in dd9fe1168, where we re-used maps across continuations.

Tested this by running go run -race main.go against the following config:

heartbeat.monitors:
- type: http
  urls: ["http://elastic.co:9200"]
  mode: all
  schedule: '@every 10s'

}
}

// EventCalledMetaKey is the path to the @metadata key marking an event as cancelled.

A review comment flagged on this line:

comment on exported const EventCancelledMetaKey should be of the form "EventCancelledMetaKey ..."
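The flag here is Go's standard doc-comment convention: the comment on an exported identifier should begin with that identifier's name, for example:

// EventCancelledMetaKey is the full path to the key marking an event as cancelled.
const EventCancelledMetaKey = "meta.__hb_evt_cancel__"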

@andrewvc (author) commented:

@ruflin @odacremolbap I've made a final push here to fix the concurrency in a more general way by just cloning, in 237b8a8. We could do it better in the future with #11941, which avoids some of the complexity of MapStr.
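A sketch of the cloning idea, under the assumption that the race came from several sub-tasks sharing one underlying map; the helper below is illustrative (plain Go maps), while the actual fix is the cloning in 237b8a8:

package main

import "fmt"

// cloneMap deep-copies nested string-keyed maps so each sub-task mutates its own
// copy; libbeat's common.MapStr offers a Clone method for the same purpose.
func cloneMap(src map[string]interface{}) map[string]interface{} {
	out := make(map[string]interface{}, len(src))
	for k, v := range src {
		if sub, ok := v.(map[string]interface{}); ok {
			out[k] = cloneMap(sub)
			continue
		}
		out[k] = v
	}
	return out
}

func main() {
	base := map[string]interface{}{"monitor": map[string]interface{}{"type": "http"}}

	// Each endpoint's job merges status into its own copy, so concurrent
	// DeepUpdate-style writes no longer touch the same map.
	a := cloneMap(base)
	b := cloneMap(base)
	a["status"] = "up"
	b["status"] = "down"

	fmt.Println(a["status"], b["status"], base)
}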

@ruflin (Contributor) left a comment:

Skimmed through it again and it looks good (with a few nits), but it would be good if @odacremolbap could take a look again.

Review threads on heartbeat/monitors/task.go (resolved; marked outdated).
@andrewvc (author) commented:

@ruflin I've fixed the remaining little things in 88a6f78 and b860676.

@ruflin (Contributor) commented Apr 30, 2019:

CI failures look related.

@andrewvc (author) commented:

@ruflin Good catch. 61cdef9 should fix it (WFG).

@andrewvc (author) commented:

Tests are now all passing. Can I get a final LGTM here, @ruflin?

@ruflin (Contributor) left a comment:

LGTM

@andrewvc andrewvc merged commit 078612e into elastic:master May 1, 2019
@andrewvc andrewvc deleted the fix-all-pings branch May 1, 2019 12:42
andrewvc added a commit to andrewvc/beats that referenced this pull request May 1, 2019

(cherry picked from commit 078612e)
@andrewvc added the "v7.0.1" label and removed the "needs_backport" label on May 1, 2019
@hyksos commented May 2, 2019:

@andrewvc since this is a 6.6 to 6.7 regression, will this be backported to 6.x?

Mostly asking at this point since I don't know how this is decided.
7.x is a major, so I thought 6.x would get the fix, since it breaks a 6.x feature.

On an unrelated note, I also wanted to thank you for your work on this issue.

andrewvc added a commit that referenced this pull request May 2, 2019

(cherry picked from commit 078612e)
leweafan pushed a commit to leweafan/beats that referenced this pull request Apr 28, 2023
…lastic#12007)


(cherry picked from commit 9e80988)
Labels
bug, Heartbeat, review, Team:obs-ds-hosted-services (Label for the Observability Hosted Services team), v7.0.1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Heartbeat 6.7 regression bug; http mode:all always logs a false or incomplete UP
6 participants