[Heartbeat] Fix incorrect 'Up' status for all mode #11895
Conversation
Pinging @elastic/uptime
LGTM
Did not run it locally yet.
heartbeat/eventext/eventext.go
Outdated
@@ -30,3 +30,22 @@ func MergeEventFields(e *beat.Event, merge common.MapStr) {
		e.Fields = merge
	}
}

// EventCalledMetaKey is the full path to the key marking an event as cancelled.
const EventCancelledMetaKey = "meta.__hb_evt_cancel__"
Nit: Leave out the `meta.` prefix, as otherwise if you use `Put` with a `.` in the key it will create a tree, and then the `Get` has to traverse that tree. Not having this would make the checks more efficient, assuming this happens a lot of the time.
Yeah, it is unfortunate that this is more expensive, but otherwise it winds up in `event.Fields`, which is even hackier. I'm not really worried about the perf here since that should be very fast, even though it isn't ideal. WDYT?
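To illustrate the trade-off being discussed (a standalone sketch, not code from this PR): with `common.MapStr`, a dotted key makes `Put` build a nested map that `GetValue` must then traverse, while a flat key is a single map entry.

```go
// Illustrative sketch only: dotted vs. flat keys in common.MapStr.
package main

import (
	"fmt"

	"github.com/elastic/beats/libbeat/common"
)

func main() {
	nested := common.MapStr{}
	// A dotted key makes Put create a nested structure:
	// {"meta": {"__hb_evt_cancel__": true}}
	nested.Put("meta.__hb_evt_cancel__", true)

	flat := common.MapStr{}
	// A flat key is a single top-level entry: {"__hb_evt_cancel__": true}
	flat.Put("__hb_evt_cancel__", true)

	// GetValue on the dotted key has to walk the nested maps;
	// the flat key is one map lookup.
	v1, _ := nested.GetValue("meta.__hb_evt_cancel__")
	v2, _ := flat.GetValue("__hb_evt_cancel__")
	fmt.Println(v1, v2) // true true
}
```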
	logp.Err("Job %v failed with: ", err)
}

if event.Fields != nil && !eventext.IsEventCancelled(event) {
Previously there was an `event != nil` check here. I assume that because of :138 this cannot happen?
That is correct. My IDE actually flagged that as redundant.
`event` is being passed as a pointer to `job`, which could nil it, but that's something we won't ever do as far as I'm concerned.
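For context, here is a minimal sketch of the cancellation mechanism being discussed, assuming the flag lives in `event.Meta`. `EventCancelledMetaKey` and `IsEventCancelled` are the identifiers referenced in the diff; `CancelEvent` is a hypothetical helper name here, and the real `heartbeat/eventext` code may differ in detail.

```go
// Minimal sketch of the cancellation-flag approach; not the exact eventext code.
package eventext

import (
	"github.com/elastic/beats/libbeat/beat"
	"github.com/elastic/beats/libbeat/common"
)

// EventCancelledMetaKey is the @metadata key marking an event as cancelled.
const EventCancelledMetaKey = "__hb_evt_cancel__"

// CancelEvent (hypothetical name) marks the event so it can be dropped before publishing.
func CancelEvent(event *beat.Event) {
	if event == nil {
		return
	}
	if event.Meta == nil {
		event.Meta = common.MapStr{}
	}
	event.Meta.Put(EventCancelledMetaKey, true)
}

// IsEventCancelled reports whether the event carries the cancellation flag.
func IsEventCancelled(event *beat.Event) bool {
	if event == nil || event.Meta == nil {
		return false
	}
	v, err := event.Meta.GetValue(EventCancelledMetaKey)
	b, ok := v.(bool)
	return err == nil && ok && b
}
```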
@andrewvc I'm getting a panic running the target scenario. It seems that `MapStr.DeepUpdate` is not thread-safe and that goroutines 114 and 115 are writing concurrently.
Good catch @odacremolbap. I wasn't seeing the segfault consistently. I don't think we really introduced the bug here; we just sort of exposed it. After a lot of head scratching I found the root cause in dd9fe1168, where we re-used maps across continuations. Tested this by running:

heartbeat.monitors:
- type: http
  urls: ["http://elastic.co:9200"]
  mode: all
  schedule: '@every 10s'
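A rough sketch of how that kind of race can be avoided (illustrative only, not the actual change in 237b8a8, and `mergeIntoEvent` is a made-up helper): give each continuation its own copy of the shared map before merging, so `DeepUpdate` never runs against a map another goroutine is also writing.

```go
// Illustrative sketch; mergeIntoEvent is a hypothetical helper, not beats code.
package sketch

import (
	"github.com/elastic/beats/libbeat/beat"
	"github.com/elastic/beats/libbeat/common"
)

// mergeIntoEvent merges shared fields into an event without sharing the
// underlying maps between concurrently running continuation jobs.
func mergeIntoEvent(event *beat.Event, shared common.MapStr) {
	if event.Fields == nil {
		event.Fields = common.MapStr{}
	}
	// Clone deep-copies the map, so this goroutine's DeepUpdate cannot race
	// with another continuation that received the same shared map.
	event.Fields.DeepUpdate(shared.Clone())
}
```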
heartbeat/eventext/eventext.go
Outdated
	}
}

// EventCalledMetaKey is the path to the @metadata key marking an event as cancelled.
comment on exported const EventCancelledMetaKey should be of the form "EventCancelledMetaKey ..."
@ruflin @odacremolbap I've made a final push here to fix the concurrency in a more general way by just cloning in 237b8a8. We could do it better in the future with #11941, which avoids some of this complexity.
This reverts commit dd9fe116838bbbb5f60a2cc0921f1f5c7d071ce1.
Skimmed through it again and it looks good (with a few nits), but it would be good if @odacremolbap could take a look again.
CI failures look related.
Tests are now all passing. Can I get a final LGTM here @ruflin?
LGTM
@andrewvc since this is a 6.6 to 6.7 regression, will this be backported to 6.x? Mostly asking at this point since I don't know how this is decided. On an unrelated note, I also wanted to thank you for your work on this issue.
This fixes #11737, where setting `mode: all` now spawns multiple sub-tasks, but the parent task still runs as a job, even though it tests very little and will always succeed so long as the DNS query does. It is essentially a parent job testing that the DNS resolver works.

From a certain strict POV the behavior before this patch is correct. We executed a discrete job (checking DNS) and it worked successfully. A second part of the job, actually hitting the endpoints, failed. The total job (as noted in the summary) did fail, but a sub-part did succeed. That said, this is too complex and a bad UX, so this patch makes sense. Maybe in the future we will be under different constraints.

The fix here involved marking that job's emitted event as cancelled using job metadata. I'd discussed using `event.private` for that with @ruflin, but looking at the code, metadata seemed more appropriate. I'd love to hear your $0.02 @ruflin if you think private is more appropriate. Metricbeat wraps all events in its own metadata struct, but I don't think we're there yet in heartbeat in terms of a major refactor being justified.

Testing this manually is easy: just point at a domain with multiple A records, like `elastic.co`. Truly integration testing it programmatically is hard without setting up a DNS resolver. The compromise I reached here is unit tests for the individual bits of logic.

Fixes #11737
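To make the description concrete, here is a simplified sketch of the shape of the fix (the `job` type and the function names are invented for illustration and are not heartbeat's real job API): the parent job only resolves DNS, cancels its own event, and fans out one continuation per resolved IP, so only the per-IP checks contribute up/down status.

```go
// Simplified illustration; not heartbeat's actual job API.
package sketch

import (
	"net"

	"github.com/elastic/beats/libbeat/beat"
)

// job stands in for heartbeat's continuation-style job type.
type job func(event *beat.Event) ([]job, error)

// resolveAllJob resolves the host, cancels its own (DNS-only) event, and
// returns one continuation job per resolved IP.
func resolveAllJob(host string, pingIP func(ip string) job, cancel func(*beat.Event)) job {
	return func(event *beat.Event) ([]job, error) {
		ips, err := net.LookupHost(host)
		if err != nil {
			return nil, err
		}
		// The parent event only proves that DNS resolution works; cancel it
		// so it is never published as an 'up' check result.
		cancel(event)

		conts := make([]job, 0, len(ips))
		for _, ip := range ips {
			conts = append(conts, pingIP(ip))
		}
		return conts, nil
	}
}
```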