Heartbeat 6.7 regression bug; http mode:all always logs a false or incomplete UP #11737

Closed
hyksos opened this issue Apr 10, 2019 · 7 comments · Fixed by #11895

hyksos commented Apr 10, 2019

Heartbeat 6.7, when using mode: all on a type: http monitor, generates false/broken UP events for both DOWN and UP targets, in addition to the real/expected UP/DOWN events representing the true status of the monitored target. Minimal repro case below. This behavior can't be reproduced with Heartbeat < 6.7.0 (tested with 6.7.1, 6.7.0, 6.6.2, 6.6.1, 6.6.0).

For confirmed bugs, please report:

  1. heartbeat.docker.yml content:
heartbeat.monitors:
- type: http
  schedule: '@every 5s'
  urls:
    - http://google.com #reachable target
    - http://elastic.co:8888 #unreachable target
  mode: all
  ipv6: false
  timeout: 2s
  2. Start Heartbeat:
docker run --volume="$(pwd)/heartbeat.docker.yml:/usr/share/heartbeat/heartbeat.yml:ro"   docker.elastic.co/beats/heartbeat:6.7.1   --strict.perms=false -e  -E output.console.enabled=true
  3. Results:
    Bad UP event for google.com (reachable target); notice all the missing fields that should come with a normal UP event.
{
  "@timestamp": "2019-04-10T04:39:40.159Z",
  "@metadata": {
    "beat": "heartbeat",
    "type": "doc",
    "version": "6.7.1"
  },
  "beat": {
    "name": "29e998770c99",
    "hostname": "29e998770c99",
    "version": "6.7.1"
  },
  "monitor": {
    "type": "http",
    "duration": {
      "us": 1607
    },
    "status": "up",
    "scheme": "http",
    "id": "http@http://google.com",
    "name": "http"
  },
  "http": {
    "url": "http://google.com"
  },
  "tcp": {
    "port": 80
  },
  "event": {
    "dataset": "uptime"
  },
  "host": {
    "name": "29e998770c99"
  }
}

Bad UP event for elastic.co:8888 (unreachable target); notice all the missing fields that should come with a normal UP event.

{
  "@timestamp": "2019-04-10T04:39:40.159Z",
  "@metadata": {
    "beat": "heartbeat",
    "type": "doc",
    "version": "6.7.1"
  },
  "tcp": {
    "port": 8888
  },
  "event": {
    "dataset": "uptime"
  },
  "beat": {
    "name": "29e998770c99",
    "hostname": "29e998770c99",
    "version": "6.7.1"
  },
  "host": {
    "name": "29e998770c99"
  },
  "monitor": {
    "name": "http",
    "type": "http",
    "duration": {
      "us": 1561
    },
    "status": "up",
    "scheme": "http",
    "id": "http@http://elastic.co:8888"
  },
  "http": {
    "url": "http://elastic.co:8888"
  }
}

Good UP event for google.com (reachable target); notice that all the expected fields for a normal UP event are there.

{
  "@timestamp": "2019-04-10T04:39:40.161Z",
  "@metadata": {
    "beat": "heartbeat",
    "type": "doc",
    "version": "6.7.1"
  },
  "beat": {
    "name": "29e998770c99",
    "hostname": "29e998770c99",
    "version": "6.7.1"
  },
  "host": {
    "name": "29e998770c99"
  },
  "http": {
    "rtt": {
      "content": {
        "us": 2312
      },
      "total": {
        "us": 96849
      },
      "write_request": {
        "us": 88
      },
      "response_header": {
        "us": 60115
      },
      "validate": {
        "us": 62428
      }
    },
    "response": {
      "status_code": 200
    },
    "url": "http://google.com"
  },
  "monitor": {
    "scheme": "http",
    "id": "http@http://google.com",
    "name": "http",
    "type": "http",
    "host": "google.com",
    "ip": "172.217.13.110",
    "duration": {
      "us": 96972
    },
    "status": "up"
  },
  "resolve": {
    "host": "google.com",
    "ip": "172.217.13.110",
    "rtt": {
      "us": 1601
    }
  },
  "tcp": {
    "port": 80,
    "rtt": {
      "connect": {
        "us": 1802
      }
    }
  },
  "event": {
    "dataset": "uptime"
  }
}

Good DOWN events (x4) for elastic.co:8888 (unreachable target); notice that all the expected fields for a normal DOWN event are there. There are 4 of these events, all good, since the monitor uses "mode: all". Only one such event is shown here; the other 3 are for the other IPs returned when elastic.co is resolved from my location (multiple A records). So you get 4 good DOWN events, 1 per IP.

{
  "@timestamp": "2019-04-10T04:39:40.161Z",
  "@metadata": {
    "beat": "heartbeat",
    "type": "doc",
    "version": "6.7.1"
  },
  "event": {
    "dataset": "uptime"
  },
  "beat": {
    "name": "29e998770c99",
    "hostname": "29e998770c99",
    "version": "6.7.1"
  },
  "host": {
    "name": "29e998770c99"
  },
  "tcp": {
    "port": 8888
  },
  "monitor": {
    "type": "http",
    "host": "elastic.co",
    "ip": "151.101.2.217",
    "status": "down",
    "duration": {
      "us": 2001576
    },
    "scheme": "http",
    "id": "http@http://elastic.co:8888",
    "name": "http"
  },
  "resolve": {
    "host": "elastic.co",
    "ip": "151.101.2.217",
    "rtt": {
      "us": 1533
    }
  },
  "http": {
    "url": "http://elastic.co:8888"
  },
  "error": {
    "type": "io",
    "message": "Get http://elastic.co:8888: dial tcp 151.101.2.217:8888: i/o timeout"
  }
}

[...] This continues, of course, for every interval. There are always 2 bad UP events, followed by the truth: the normal expected events, either UP or DOWN depending on the true status of the target.

Using "mode:any" in the monitor config, you can't replicate, all events are good and true. There is of course only 1 DOWN event for elastic.co:8888, since mode:any only checks one of the IP.
Using heartbeat 6.6.2 you can't replicate.

hyksos commented Apr 10, 2019

This also reproduces the issue: you get 2 broken UP events with missing fields, in addition to a true good UP event for www.google.com and a true good DOWN event for samsung.com, both with all the required fields:

heartbeat.monitors:
- type: icmp
  schedule: '@every 5s'
  hosts:
    - samsung.com # not pingable
    - www.google.com # pingable
  mode: all
  ipv6: false
  timeout: 2s

Again, if you flip it to "mode: any", the broken UP events with missing fields stop and you get the expected behavior.


hyksos commented Apr 10, 2019

Works OK with either mode: all or mode: any; all expected events, nothing broken or false.

heartbeat.monitors:
- type: icmp
  schedule: '@every 5s'
  hosts:
    - 8.8.8.8 # pingable
    - 2.2.2.2 # non-pingable
  mode: all
  ipv6: false
  timeout: 2s

Works OK with either mode: all or mode: any; all expected events, nothing broken or false.

heartbeat.monitors:
- type: icmp
  schedule: '@every 5s'
  hosts:
    - www.idonotexist-forreal.com # does not resolve
  mode: all
  ipv6: false
  timeout: 2s

@ruflin added the Heartbeat and Team:obs-ds-hosted-services labels Apr 11, 2019
@elasticmachine

Pinging @elastic/uptime

@ruflin added the bug label Apr 11, 2019

ruflin commented Apr 11, 2019

This initially reminded me of #10777, but it seems to be another issue.

@odacremolbap

Able to reproduce.
Also an issue for 7.x

@andrewvc

So, the root issue here, it seems, is that some category of error is no longer being handled correctly due to one of the refactors. None of the erroneous UP events have an error field set, either.

The status is set to down when the error is set, so my suspicion is that some error case is not being handled correctly.
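
For context, here is a minimal sketch of that derivation rule, assuming a simplified event type; monitorEvent, applyStatus, and the field names are hypothetical illustrations, not the actual Heartbeat code. The point is that the status only flips to down when an error value reaches this step, so any code path that loses the error produces an UP event with nothing else attached.

package main

import (
	"errors"
	"fmt"
)

// monitorEvent is a hypothetical, simplified stand-in for the fields
// Heartbeat attaches to a check result.
type monitorEvent struct {
	Fields map[string]interface{}
}

// applyStatus mirrors the rule described above: the status is derived
// from whether an error was recorded for the check.
func applyStatus(evt *monitorEvent, checkErr error) {
	if checkErr != nil {
		evt.Fields["monitor.status"] = "down"
		evt.Fields["error.message"] = checkErr.Error()
		return
	}
	// If the error is lost upstream, this branch runs and the event is
	// reported as "up" even though the target was never actually checked.
	evt.Fields["monitor.status"] = "up"
}

func main() {
	good := &monitorEvent{Fields: map[string]interface{}{}}
	applyStatus(good, nil)
	fmt.Println(good.Fields["monitor.status"]) // "up"

	bad := &monitorEvent{Fields: map[string]interface{}{}}
	applyStatus(bad, errors.New("dial tcp 151.101.2.217:8888: i/o timeout"))
	fmt.Println(bad.Fields["monitor.status"], bad.Fields["error.message"]) // "down" plus the error
}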

@andrewvc

I have an initial fix here: #11895, which needs some work tomorrow to become reviewable.

The root cause is that the all ping mode now spawns multiple sub-tasks, but the parent task still runs as a job, even though it tests very little, and will always succeed so long as the DNS query does.

The right solution is to keep that task, but to let jobs mark themselves as discardable so that their associated event disappears.
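
A rough sketch of that proposed mechanism, assuming a simplified event type; metaCancelKey, cancelEvent, and publish below are hypothetical illustrations rather than the code in #11895. The parent job keeps running, but it flags its own event in metadata so the publishing side drops it, while the per-IP child events go through.

package main

import "fmt"

// event is a simplified stand-in for a beat event: document fields plus
// metadata that never reaches the output.
type event struct {
	Fields map[string]interface{}
	Meta   map[string]interface{}
}

// metaCancelKey is a hypothetical metadata key a job would set to mark
// its own event as discardable.
const metaCancelKey = "__hb_event_cancelled"

// cancelEvent is what the parent DNS-resolution job would call: it keeps
// running as a job, but flags its event so no document is emitted for it.
func cancelEvent(evt *event) {
	if evt.Meta == nil {
		evt.Meta = map[string]interface{}{}
	}
	evt.Meta[metaCancelKey] = true
}

// publish drops cancelled events and forwards the rest; in Heartbeat this
// filtering would sit between the job runner and the output.
func publish(events []*event) []*event {
	var out []*event
	for _, evt := range events {
		if cancelled, _ := evt.Meta[metaCancelKey].(bool); cancelled {
			continue // the parent-task event disappears
		}
		out = append(out, evt) // per-IP child events are kept
	}
	return out
}

func main() {
	parent := &event{Fields: map[string]interface{}{"monitor.status": "up"}}
	child := &event{Fields: map[string]interface{}{"monitor.status": "down"}}
	cancelEvent(parent)
	fmt.Println(len(publish([]*event{parent, child}))) // 1: only the child survives
}

The merged commit described below uses the same idea: the parent job's emitted event is marked as cancelled through job metadata rather than through the document fields, so nothing spurious is indexed.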

andrewvc added a commit to andrewvc/beats that referenced this issue Apr 26, 2019
andrewvc added a commit that referenced this issue May 1, 2019
This fixes #11737, where setting mode: all now spawns multiple sub-tasks, but the parent task still runs as a job, even though it tests very little and will always succeed so long as the DNS query does. It is essentially a parent job testing that the DNS resolver works.

From a certain strict POV the behavior before this patch is correct. We executed a discrete job (checking DNS) and it worked successfully. A second part of the job, actually hitting the endpoints, failed. The total job (as noted in the summary) did fail, but a sub-part did succeed. That said, this is too complex and a bad UX, so this patch makes sense. Maybe in the future we will be under different constraints.

The fix here involved marking that job's emitted event as cancelled using job metadata. I'd discussed using event.private for that with @ruflin, but looking at the code metadata seemed more appropriate. I'd love to hear your $0.02 @ruflin if you think private is more appropriate. Metricbeat wraps all events in its own metadata struct, but I don't think we're there yet in heartbeat in terms of a major refactor being justified.

Testing this manually is easy, just point at a domain with multiple A records, like elastic.co. Truly integration testing it programmatically is hard without setting up a DNS resolver. The compromise I reached here is unit tests for the individual bits of logic.

Fixes #11737
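
A sketch of the kind of unit test mentioned above, assuming the same hypothetical event shape and metadata key as the earlier sketch in this thread (none of this is the actual test code in the PR):

package monitors

import "testing"

// event and metaCancelKey mirror the hypothetical helpers sketched
// earlier in this thread.
type event struct {
	Fields map[string]interface{}
	Meta   map[string]interface{}
}

const metaCancelKey = "__hb_event_cancelled"

// isCancelled reports whether a job flagged its event as discardable.
func isCancelled(evt *event) bool {
	cancelled, _ := evt.Meta[metaCancelKey].(bool)
	return cancelled
}

// TestCancelledEventIsDropped checks the filtering rule in isolation: a
// flagged parent-job event is discarded, an ordinary child event is kept.
func TestCancelledEventIsDropped(t *testing.T) {
	parent := &event{Meta: map[string]interface{}{metaCancelKey: true}}
	child := &event{Fields: map[string]interface{}{"monitor.status": "down"}}

	if !isCancelled(parent) {
		t.Error("expected the flagged parent event to be discarded")
	}
	if isCancelled(child) {
		t.Error("expected the ordinary child event to be kept")
	}
}
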
andrewvc added a commit to andrewvc/beats that referenced this issue May 1, 2019
(cherry picked from commit 078612e)
andrewvc added a commit that referenced this issue May 2, 2019
(cherry picked from commit 078612e)
leweafan pushed a commit to leweafan/beats that referenced this issue Apr 28, 2023
…lastic#12007)

(cherry picked from commit 9e80988)