Fix polling node when it is not ready and monitor by hostname #22666

Merged 3 commits on Dec 1, 2020

Contributor

@vjsamuel vjsamuel commented Nov 19, 2020

Enhancement

What does this PR do?

This PR ensures that, as a last resort, node autodiscover uses the node's hostname to monitor the node. It also ensures that every event autodiscover emits for a node checks the node's ready state first.

Why is it important?

The Kubernetes spec leaves it to providers to define InternalIP, ExternalIP, or HostName on the node object. It is up to us to ensure that we monitor one of them. Currently we don't use HostName if the other two are missing.

Without a ready-state check, add events cause NotReady nodes to be monitored, which is not right.
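The address fallback described above can be sketched roughly as follows. This is a minimal, standalone illustration and not the code in this PR: `NodeAddress` and `pickAddress` are hypothetical stand-ins for the Kubernetes `v1.NodeAddress` type and the provider's address helper, and the preference order between the two IP types is an assumption. Only the hostname-as-last-resort step comes from the description.

```go
package main

import "fmt"

// NodeAddress is a simplified, hypothetical stand-in for v1.NodeAddress.
type NodeAddress struct {
	Type    string
	Address string
}

// Address type strings as they appear in a node's status.
const (
	ExternalIP = "ExternalIP"
	InternalIP = "InternalIP"
	Hostname   = "Hostname"
)

// pickAddress returns the first non-empty address, trying the address types
// in order of preference. The hostname is only used when no IP is present.
// The ExternalIP-before-InternalIP order is an assumption for illustration.
func pickAddress(addresses []NodeAddress) string {
	for _, want := range []string{ExternalIP, InternalIP, Hostname} {
		for _, a := range addresses {
			if a.Type == want && a.Address != "" {
				return a.Address
			}
		}
	}
	// No usable address: the caller should skip monitoring this node.
	return ""
}

func main() {
	onlyHostname := []NodeAddress{{Type: Hostname, Address: "node-1.example.internal"}}
	fmt.Println(pickAddress(onlyHostname))
}
```

With this shape, a node that reports only a hostname is still monitored, while a node with no addresses at all yields an empty string.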

Checklist

  • [x] My code follows the style guidelines of this project
  • [x] I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • [ ] I have added tests that prove my fix is effective or that my feature works
  • [x] I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

How to test this PR locally

Related issues

@botelastic botelastic bot added the needs_team label Nov 19, 2020
@elasticmachine
Collaborator

elasticmachine commented Nov 19, 2020

💚 Build Succeeded

Build stats

  • Build Cause: Started by user Chris Mark

  • Start Time: 2020-12-01T10:13:22.042+0000

  • Duration: 71 min 33 sec

Test stats 🧪

Test Results
Failed 0
Passed 16719
Skipped 1372
Total 18091

💚 Flaky test report

Tests succeeded.

@andresrc andresrc added the Team:Platforms label Nov 19, 2020
@elasticmachine
Collaborator

Pinging @elastic/integrations-platforms (Team:Platforms)

@botelastic botelastic bot removed the needs_team label Nov 19, 2020
Member

@ChrsMark ChrsMark left a comment

Thanks @vjsamuel looks good, just left some minors. Also consider adding a unit test like

libbeat/autodiscover/providers/kubernetes/node.go (outdated, resolved)
libbeat/autodiscover/providers/kubernetes/node.go (outdated, resolved)
Member

@jsoriano jsoriano left a comment

Thanks!

@@ -168,6 +168,11 @@ func (n *node) emit(node *kubernetes.Node, flag string) {
		return
	}

	// If the node is not in ready state then dont monitor it
Member

Is that also correct for heartbeat? We might want to keep a monitor running to see when the node becomes ready. A node can become NotReady if it is powered off or has some kind of problem.

Member

Also, I think we should still emit a stop event if a node is in NotReady state before being removed.

Contributor Author

We can either do it here or move this to onAdd. What I saw is that when Beats starts up, it starts monitoring hosts that are in the NotReady state, which is probably incorrect.

Member

Ok, let's go with this. I guess that for the heartbeat case, if a node unexpectedly disappears, some monitor checks will fail before the node reaches the NotReady state.
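One way to reconcile the points raised in this thread is a guard that skips NotReady nodes for every event except stop, so a node that goes NotReady before removal still has its monitor torn down. The sketch below is an illustration under assumptions, not the merged code: `NodeCondition`, `isReady` and `shouldEmit` are hypothetical names, and the `"start"` and `"stop"` flag strings are assumed values.

```go
package main

import "fmt"

// NodeCondition is a simplified, hypothetical stand-in for v1.NodeCondition.
type NodeCondition struct {
	Type   string
	Status string
}

// isReady reports whether the node has a Ready condition with status True.
// A node with no Ready condition at all is treated as not ready.
func isReady(conditions []NodeCondition) bool {
	for _, c := range conditions {
		if c.Type == "Ready" {
			return c.Status == "True"
		}
	}
	return false
}

// shouldEmit decides whether an autodiscover event is emitted for a node.
// Stop events always pass so that existing monitors can be cleaned up;
// any other event requires the node to be Ready.
func shouldEmit(conditions []NodeCondition, flag string) bool {
	if flag == "stop" {
		return true
	}
	return isReady(conditions)
}

func main() {
	notReady := []NodeCondition{{Type: "Ready", Status: "False"}}
	fmt.Println(shouldEmit(notReady, "start"))
	fmt.Println(shouldEmit(notReady, "stop"))
}
```

The trade-off discussed above still applies: a monitor is not kept running against a NotReady node, so recovery is only picked up once the node reports Ready again.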

		if address.Type == v1.NodeHostName && address.Address != "" {
			return address.Address
		}
	}
Member

👍

@jsoriano
Member

jenkins run the tests

@jsoriano jsoriano added the needs_backport and v7.11.0 labels Nov 20, 2020
@ChrsMark
Member

ChrsMark commented Nov 26, 2020

Thanks for addressing the comments @vjsamuel ! It looks good to me. Would it be possible to add a couple of unit tests for this please?

Also current CI failures look related: https://travis-ci.org/github/elastic/beats/jobs/745712376#L701

@vjsamuel
Contributor Author

@ChrsMark updated tests and addressed failures.

@ChrsMark
Member

jenkins run the tests

Member

@jsoriano jsoriano left a comment

👍

@ChrsMark
Member

jenkins run the tests

@ChrsMark
Member

ChrsMark commented Dec 1, 2020

jenkins run the tests

@ChrsMark
Member

ChrsMark commented Dec 1, 2020

Thanks for adding this @vjsamuel ! Merging.

@ChrsMark ChrsMark merged commit 09008c8 into elastic:master Dec 1, 2020
ChrsMark pushed a commit to ChrsMark/beats that referenced this pull request Dec 1, 2020
@ChrsMark ChrsMark removed the needs_backport label Dec 1, 2020
ChrsMark added a commit that referenced this pull request Dec 2, 2020
v1v added a commit to v1v/beats that referenced this pull request Dec 2, 2020
…-issues

* upstream/master: (41 commits)
  Fix version parser regex for packaging (elastic#22581)
  Fix local_dynamic documentation and add providers inline doc. (elastic#22657)
  fix: use proper param name for e2e tests (elastic#22836)
  [Heartbeat] Fix exit on disabled monitor (elastic#22829)
  Update Golang to 1.14.12 (elastic#22790)
  docs: fix setup.template.overwrite typos (elastic#22804)
  Add docs section for ECS EC2 monitoring (elastic#22784)
  Fixing logic to keep list of unique cluster UUIDs (elastic#22808)
  Skip somewhat flaky UDP system test on Windows (elastic#22810)
  Fix polling node when it is not ready and monitor by hostname (elastic#22666)
  Skip Filebeat test_shutdown on windows 7 (elastic#22797)
  Make monitoring Namespace thread-safe (elastic#22640)
  Drop pkt_dstaddr and pkt_srcaddr when equals to "-" (elastic#22721)
  Add support for reading from UNIX datagram sockets (elastic#22699)
  Fix export dashboard command from Elastic Cloud (elastic#22746)
  Skip flaky winlogbeat test on Windows-7 (elastic#22754)
  Missing `>` (elastic#22763) (elastic#22766)
  Fix k8s watcher issue when node access to list nodes and ns (elastic#22714)
  [Metricbeat/Kibana/stats] Enforce `exclude_usage=true` (elastic#22732)
  Avoid sending non-numeric floats in cloud foundry integrations (elastic#22634)
  ...
v1v added a commit to v1v/beats that referenced this pull request Dec 2, 2020
…dows-7

Labels: Team:Platforms, v7.11.0
5 participants