Heartbeat 6.7 regression bug; http mode:all always logs a false or incomplete UP #11737
This also replicates: you get 2 broken UP events with missing fields, in addition to a true, complete UP event for www.google.com and a true, complete DOWN event for samsung.com with all the required fields:

heartbeat.monitors:
- type: icmp
  schedule: '@every 5s'
  hosts:
    - samsung.com     # not pingable
    - www.google.com  # pingable
  mode: all
  ipv6: false
  timeout: 2s

Again, if you flip it to mode: any, the broken UP events with missing fields stop and you get the expected behavior.
Works OK with either mode: all or mode: any; all expected events, nothing broken or false:

heartbeat.monitors:
- type: icmp
  schedule: '@every 5s'
  hosts:
    - 8.8.8.8  # pingable
    - 2.2.2.2  # non-pingable
  mode: all
  ipv6: false
  timeout: 2s

Also works OK with either mode: all or mode: any; all expected events, nothing broken or false:

heartbeat.monitors:
- type: icmp
  schedule: '@every 5s'
  hosts:
    - www.idonotexist-forreal.com  # does not resolve
  mode: all
  ipv6: false
  timeout: 2s
Pinging @elastic/uptime

This initially reminded me of #10777, but it seems to be another issue.

Able to reproduce.
So the root issue here, it seems, is that there is some category of error that is no longer being handled correctly due to one of the refactors. All of the erroneous UP events also do not have an [...] field. The status is set to [...].
I have an initial fix here: #11895, which needs some work tomorrow to become reviewable. The root cause is that with mode: all the parent (DNS-resolution) task still runs as a job of its own and always reports success as long as the DNS query succeeds. The right solution is to keep that task, but to let jobs mark themselves as cancelled.
This fixes #11737, where setting mode: all now spawns multiple sub-tasks, but the parent task still runs as a job, even though it tests very little and will always succeed so long as the DNS query does. It is essentially a parent job testing that the DNS resolver works.

From a certain strict POV the behavior before this patch is correct. We executed a discrete job (checking DNS) and it worked successfully. A second part of the job, actually hitting the endpoints, failed. The total job (as noted in the summary) did fail, but a sub-part did succeed. That said, this is too complex and a bad UX, so this patch makes sense. Maybe in the future we will be under different constraints.

The fix here involved marking that job's emitted event as cancelled using job metadata. I'd discussed using event.private for that with @ruflin, but looking at the code, metadata seemed more appropriate. I'd love to hear your $0.02, @ruflin, if you think private is more appropriate. Metricbeat wraps all events in its own metadata struct, but I don't think we're there yet in Heartbeat in terms of a major refactor being justified.

Testing this manually is easy: just point at a domain with multiple A records, like elastic.co. Truly integration testing it programmatically is hard without setting up a DNS resolver. The compromise I reached here is unit tests for the individual bits of logic.

Fixes #11737
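To make the mechanism described above concrete, here is a minimal, self-contained Go sketch of the general idea: the parent (DNS) job's event gets flagged as cancelled through event metadata, and the publishing path drops flagged events. All names here (Event, markCancelled, isCancelled, publish, the _job_cancelled key) are hypothetical simplifications for illustration, not the actual libbeat/Heartbeat API used in the patch.

```go
package main

import "fmt"

// Event is a simplified stand-in for a published event: payload fields plus
// event metadata. The real libbeat event type is richer; this is illustration only.
type Event struct {
	Fields map[string]interface{}
	Meta   map[string]interface{}
}

// metaCancelled is a hypothetical metadata key flagging events that should be
// dropped before publishing (e.g. the parent DNS-resolution task's event).
const metaCancelled = "_job_cancelled"

// markCancelled flags an event as cancelled via its metadata.
func markCancelled(e *Event) {
	if e.Meta == nil {
		e.Meta = map[string]interface{}{}
	}
	e.Meta[metaCancelled] = true
}

// isCancelled reports whether an event was flagged as cancelled.
func isCancelled(e *Event) bool {
	cancelled, ok := e.Meta[metaCancelled].(bool)
	return ok && cancelled
}

// publish drops cancelled events and emits the rest.
func publish(events []*Event) {
	for _, e := range events {
		if isCancelled(e) {
			continue // the parent task's event never reaches the output
		}
		fmt.Println(e.Fields)
	}
}

func main() {
	// Parent task: DNS resolution succeeded, but its bare "up" event is misleading.
	parent := &Event{Fields: map[string]interface{}{"monitor.status": "up"}}
	markCancelled(parent)

	// Child task: one per resolved IP, carrying the real endpoint check result.
	child := &Event{Fields: map[string]interface{}{
		"monitor.status": "down",
		"monitor.ip":     "203.0.113.1",
	}}

	// Only the child event is printed; the cancelled parent event is dropped.
	publish([]*Event{parent, child})
}
```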
Heartbeat 6.7, when using mode: all with the type: http monitor, will generate false/broken UP events for both DOWN and UP targets, in addition to the real/expected UP/DOWN events representing the true status of the monitored target. Minimal repro case below. This behavior can't be reproduced with Heartbeat < 6.7.0 (tested with 6.7.1, 6.7.0, 6.6.2, 6.6.1, and 6.6.0).
docker run --volume="$(pwd)/heartbeat.docker.yml:/usr/share/heartbeat/heartbeat.yml:ro" docker.elastic.co/beats/heartbeat:6.7.1 --strict.perms=false -e -E output.console.enabled=true
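The mounted heartbeat.docker.yml is not included above. Based on the targets described in the events below (www.google.com as a reachable target, elastic.co:8888 as an unreachable one), a monitor along these lines should reproduce the behavior; the exact URLs, schedule, and timeout are assumptions, not the reporter's original file:

```yaml
heartbeat.monitors:
- type: http
  schedule: '@every 5s'       # assumed interval
  urls:
    - http://www.google.com   # reachable target
    - http://elastic.co:8888  # unreachable target
  mode: all
  timeout: 2s
```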
Bad UP event for google.com (reachable target): notice all the missing fields that should come with a normal UP event.
Bad UP event for elastic.co:8888 (unreachable target): notice all the missing fields that should come with a normal UP event.
Good UP event for google.com (reachable target): notice that all the expected fields for a normal UP event are there.
Good DOWN events (x4) for elastic.co:8888 (unreachable target): notice that all the expected fields for a normal DOWN event are there. There are 4 of those events, all good, since we use mode: all. Only one such event is shown here; the other 3 are for the other IPs you get when resolving elastic.co where I am (multiple A records). So you get 4 good DOWN events, 1 per IP.
[...] This continues, of course, for every interval. There are always 2 bad UP events, followed by the truth: the normal expected events, either UP or DOWN depending on the true status of the target.
Using mode: any in the monitor config, you can't replicate; all events are good and true. There is, of course, only 1 DOWN event for elastic.co:8888, since mode: any only checks one of the IPs.
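In terms of the reconstructed config sketched above, that just means changing the one setting:

```yaml
  mode: any   # resolve DNS, but check only one IP per host
```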
Using Heartbeat 6.6.2, you can't replicate.