Heartbeat 6.7 regression bug; http mode:all always logs a false or incomplete UP #11737

Closed
hyksos opened this issue Apr 10, 2019 · 7 comments · Fixed by #11895

hyksos commented Apr 10, 2019

Heartbeat 6.7, when using mode: all on a type: http monitor, generates false/broken UP events for both DOWN and UP targets, in addition to the real/expected UP/DOWN events representing the true status of the monitored target. Minimal repro case below. This behavior can't be reproduced with Heartbeat < 6.7.0 (tested with 6.7.1, 6.7.0, 6.6.2, 6.6.1, 6.6.0).

For confirmed bugs, please report:

  1. heartbeat.docker.yml content:
heartbeat.monitors:
- type: http
  schedule: '@every 5s'
  urls:
    - http://google.com #reachable target
    - http://elastic.co:8888 #unreachable target
  mode: all
  ipv6: false
  timeout: 2s
  2. Start Heartbeat:
docker run --volume="$(pwd)/heartbeat.docker.yml:/usr/share/heartbeat/heartbeat.yml:ro"   docker.elastic.co/beats/heartbeat:6.7.1   --strict.perms=false -e  -E output.console.enabled=true
  3. Results:
    Bad UP event for google.com (reachable target); notice all the missing fields that should come with a normal UP event.
{
  "@timestamp": "2019-04-10T04:39:40.159Z",
  "@metadata": {
    "beat": "heartbeat",
    "type": "doc",
    "version": "6.7.1"
  },
  "beat": {
    "name": "29e998770c99",
    "hostname": "29e998770c99",
    "version": "6.7.1"
  },
  "monitor": {
    "type": "http",
    "duration": {
      "us": 1607
    },
    "status": "up",
    "scheme": "http",
    "id": "http@http://google.com",
    "name": "http"
  },
  "http": {
    "url": "http://google.com"
  },
  "tcp": {
    "port": 80
  },
  "event": {
    "dataset": "uptime"
  },
  "host": {
    "name": "29e998770c99"
  }
}

Bad UP event for elastic.co:8888 (unreachable target); notice all the missing fields that should come with a normal UP event.

{
  "@timestamp": "2019-04-10T04:39:40.159Z",
  "@metadata": {
    "beat": "heartbeat",
    "type": "doc",
    "version": "6.7.1"
  },
  "tcp": {
    "port": 8888
  },
  "event": {
    "dataset": "uptime"
  },
  "beat": {
    "name": "29e998770c99",
    "hostname": "29e998770c99",
    "version": "6.7.1"
  },
  "host": {
    "name": "29e998770c99"
  },
  "monitor": {
    "name": "http",
    "type": "http",
    "duration": {
      "us": 1561
    },
    "status": "up",
    "scheme": "http",
    "id": "http@http://elastic.co:8888"
  },
  "http": {
    "url": "http://elastic.co:8888"
  }
}

Good UP event for google.com (reachable target); notice that all the expected fields for a normal UP event are there.

{
  "@timestamp": "2019-04-10T04:39:40.161Z",
  "@metadata": {
    "beat": "heartbeat",
    "type": "doc",
    "version": "6.7.1"
  },
  "beat": {
    "name": "29e998770c99",
    "hostname": "29e998770c99",
    "version": "6.7.1"
  },
  "host": {
    "name": "29e998770c99"
  },
  "http": {
    "rtt": {
      "content": {
        "us": 2312
      },
      "total": {
        "us": 96849
      },
      "write_request": {
        "us": 88
      },
      "response_header": {
        "us": 60115
      },
      "validate": {
        "us": 62428
      }
    },
    "response": {
      "status_code": 200
    },
    "url": "http://google.com"
  },
  "monitor": {
    "scheme": "http",
    "id": "http@http://google.com",
    "name": "http",
    "type": "http",
    "host": "google.com",
    "ip": "172.217.13.110",
    "duration": {
      "us": 96972
    },
    "status": "up"
  },
  "resolve": {
    "host": "google.com",
    "ip": "172.217.13.110",
    "rtt": {
      "us": 1601
    }
  },
  "tcp": {
    "port": 80,
    "rtt": {
      "connect": {
        "us": 1802
      }
    }
  },
  "event": {
    "dataset": "uptime"
  }
}

Good DOWN events (x4) for elastic.co:8888 (unreachable target); notice that all the expected fields for a normal DOWN event are there. There are 4 of these events, all good, since the monitor uses "mode: all". Only one such event is shown here; the other 3 are for the other IPs returned when elastic.co is resolved from my location (multiple A records). So you get 4 good DOWN events, 1 per IP.

{
  "@timestamp": "2019-04-10T04:39:40.161Z",
  "@metadata": {
    "beat": "heartbeat",
    "type": "doc",
    "version": "6.7.1"
  },
  "event": {
    "dataset": "uptime"
  },
  "beat": {
    "name": "29e998770c99",
    "hostname": "29e998770c99",
    "version": "6.7.1"
  },
  "host": {
    "name": "29e998770c99"
  },
  "tcp": {
    "port": 8888
  },
  "monitor": {
    "type": "http",
    "host": "elastic.co",
    "ip": "151.101.2.217",
    "status": "down",
    "duration": {
      "us": 2001576
    },
    "scheme": "http",
    "id": "http@http://elastic.co:8888",
    "name": "http"
  },
  "resolve": {
    "host": "elastic.co",
    "ip": "151.101.2.217",
    "rtt": {
      "us": 1533
    }
  },
  "http": {
    "url": "http://elastic.co:8888"
  },
  "error": {
    "type": "io",
    "message": "Get http://elastic.co:8888: dial tcp 151.101.2.217:8888: i/o timeout"
  }
}

[...] This continues, of course, for every interval. There are always 2 bad UP events, followed by the truth: the normal expected events, either UP or DOWN depending on the true status of the target.

Using "mode:any" in the monitor config, you can't replicate, all events are good and true. There is of course only 1 DOWN event for elastic.co:8888, since mode:any only checks one of the IP.
Using heartbeat 6.6.2 you can't replicate.

hyksos commented Apr 10, 2019

This also reproduces the issue: you get 2 broken UP events with missing fields, in addition to a true good UP event for www.google.com and a true good DOWN event for samsung.com, both with all the required fields:

heartbeat.monitors:
- type: icmp
  schedule: '@every 5s'
  hosts:
    - samsung.com # not pingable
    - www.google.com # pingable
  mode: all
  ipv6: false
  timeout: 2s

Again, if you flip it to "mode: any", the broken UP events with missing fields stop and you get the expected behavior.


hyksos commented Apr 10, 2019

Works OK with either mode: all or mode: any; all expected events, nothing broken or false.

heartbeat.monitors:
- type: icmp
  schedule: '@every 5s'
  hosts:
    - 8.8.8.8 # pingable
    - 2.2.2.2 # non-pingable
  mode: all
  ipv6: false
  timeout: 2s

Works OK with either mode: all or mode: any; all expected events, nothing broken or false.

heartbeat.monitors:
- type: icmp
  schedule: '@every 5s'
  hosts:
    - www.idonotexist-forreal.com # does not resolve
  mode: all
  ipv6: false
  timeout: 2s

@ruflin added the Heartbeat and Team:obs-ds-hosted-services labels Apr 11, 2019
@elasticmachine

Pinging @elastic/uptime

@ruflin added the bug label Apr 11, 2019

ruflin commented Apr 11, 2019

This initially reminded me of #10777, but it seems to be another issue.

@odacremolbap

Able to reproduce.
Also an issue for 7.x

@andrewvc

So, the root issue here, it seems, is that some category of error is no longer being handled correctly due to one of the refactors. None of the erroneous UP events have an error field set, either.

The status is set to down when the error is set, so my suspicion is that some error case is not being handled correctly.
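
For context, here is a minimal sketch of that derivation rule, assuming a simplified event type; monitorEvent, applyStatus, and the field names are hypothetical illustrations, not the actual Heartbeat code. The point is that the status only flips to down when an error value reaches this step, so any code path that loses the error produces an UP event with nothing else attached.

package main

import (
	"errors"
	"fmt"
)

// monitorEvent is a hypothetical, simplified stand-in for the fields
// Heartbeat attaches to a check result.
type monitorEvent struct {
	Fields map[string]interface{}
}

// applyStatus mirrors the rule described above: the status is derived
// from whether an error was recorded for the check.
func applyStatus(evt *monitorEvent, checkErr error) {
	if checkErr != nil {
		evt.Fields["monitor.status"] = "down"
		evt.Fields["error.message"] = checkErr.Error()
		return
	}
	// If the error is lost upstream, this branch runs and the event is
	// reported as "up" even though the target was never actually checked.
	evt.Fields["monitor.status"] = "up"
}

func main() {
	good := &monitorEvent{Fields: map[string]interface{}{}}
	applyStatus(good, nil)
	fmt.Println(good.Fields["monitor.status"]) // "up"

	bad := &monitorEvent{Fields: map[string]interface{}{}}
	applyStatus(bad, errors.New("dial tcp 151.101.2.217:8888: i/o timeout"))
	fmt.Println(bad.Fields["monitor.status"], bad.Fields["error.message"]) // "down" plus the error
}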

@andrewvc

I have an initial fix here: #11895, which needs some work tomorrow to become reviewable.

The root cause is that the all ping mode now spawns multiple sub-tasks, but the parent task still runs as a job, even though it tests very little, and will always succeed so long as the DNS query does.

The right solution is to keep that task, but to let jobs mark themselves as discardable so that their associated event disappears.
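
A rough sketch of that proposed mechanism, assuming a simplified event type; metaCancelKey, cancelEvent, and publish below are hypothetical illustrations rather than the code in #11895. The parent job keeps running, but it flags its own event in metadata so the publishing side drops it, while the per-IP child events go through.

package main

import "fmt"

// event is a simplified stand-in for a beat event: document fields plus
// metadata that never reaches the output.
type event struct {
	Fields map[string]interface{}
	Meta   map[string]interface{}
}

// metaCancelKey is a hypothetical metadata key a job would set to mark
// its own event as discardable.
const metaCancelKey = "__hb_event_cancelled"

// cancelEvent is what the parent DNS-resolution job would call: it keeps
// running as a job, but flags its event so no document is emitted for it.
func cancelEvent(evt *event) {
	if evt.Meta == nil {
		evt.Meta = map[string]interface{}{}
	}
	evt.Meta[metaCancelKey] = true
}

// publish drops cancelled events and forwards the rest; in Heartbeat this
// filtering would sit between the job runner and the output.
func publish(events []*event) []*event {
	var out []*event
	for _, evt := range events {
		if cancelled, _ := evt.Meta[metaCancelKey].(bool); cancelled {
			continue // the parent-task event disappears
		}
		out = append(out, evt) // per-IP child events are kept
	}
	return out
}

func main() {
	parent := &event{Fields: map[string]interface{}{"monitor.status": "up"}}
	child := &event{Fields: map[string]interface{}{"monitor.status": "down"}}
	cancelEvent(parent)
	fmt.Println(len(publish([]*event{parent, child}))) // 1: only the child survives
}

The merged commit described below uses the same idea: the parent job's emitted event is marked as cancelled through job metadata rather than through the document fields, so nothing spurious is indexed.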

andrewvc added a commit to andrewvc/beats that referenced this issue Apr 26, 2019
andrewvc added a commit that referenced this issue May 1, 2019
This fixes #11737, where setting mode: all now spawns multiple sub-tasks, but the parent task still runs as a job, even though it tests very little and will always succeed so long as the DNS query does. It is essentially a parent job testing that the DNS resolver works.

From a certain strict POV the behavior before this patch is correct. We executed a discrete job (checking DNS) and it worked successfully. A second part of the job, actually hitting the endpoints, failed. The total job (as noted in the summary) did fail, but a sub-part did succeed. That said, this is too complex and a bad UX, so this patch makes sense. Maybe in the future we will be under different constraints.

The fix here involved marking that job's emitted event as cancelled using job metadata. I'd discussed using event.private for that with @ruflin, but looking at the code metadata seemed more appropriate. I'd love to hear your $0.02 @ruflin if you think private is more appropriate. Metricbeat wraps all events in its own metadata struct, but I don't think we're there yet in heartbeat in terms of a major refactor being justified.

Testing this manually is easy, just point at a domain with multiple A records, like elastic.co. Truly integration testing it programmatically is hard without setting up a DNS resolver. The compromise I reached here is unit tests for the individual bits of logic.

Fixes #11737
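
A sketch of the kind of unit test mentioned above, assuming the same hypothetical event shape and metadata key as the earlier sketch in this thread (none of this is the actual test code in the PR):

package monitors

import "testing"

// event and metaCancelKey mirror the hypothetical helpers sketched
// earlier in this thread.
type event struct {
	Fields map[string]interface{}
	Meta   map[string]interface{}
}

const metaCancelKey = "__hb_event_cancelled"

// isCancelled reports whether a job flagged its event as discardable.
func isCancelled(evt *event) bool {
	cancelled, _ := evt.Meta[metaCancelKey].(bool)
	return cancelled
}

// TestCancelledEventIsDropped checks the filtering rule in isolation: a
// flagged parent-job event is discarded, an ordinary child event is kept.
func TestCancelledEventIsDropped(t *testing.T) {
	parent := &event{Meta: map[string]interface{}{metaCancelKey: true}}
	child := &event{Fields: map[string]interface{}{"monitor.status": "down"}}

	if !isCancelled(parent) {
		t.Error("expected the flagged parent event to be discarded")
	}
	if isCancelled(child) {
		t.Error("expected the ordinary child event to be kept")
	}
}
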
andrewvc added a commit to andrewvc/beats that referenced this issue May 1, 2019
(cherry picked from commit 078612e)
andrewvc added a commit that referenced this issue May 2, 2019
(cherry picked from commit 078612e)
leweafan pushed a commit to leweafan/beats that referenced this issue Apr 28, 2023
…lastic#12007)

(cherry picked from commit 9e80988)