Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing logic to keep list of unique cluster UUIDs #22808

Merged
merged 7 commits into from
Dec 1, 2020
Merged

Fixing logic to keep list of unique cluster UUIDs #22808

merged 7 commits into from
Dec 1, 2020

Conversation

ycombinator
Copy link
Contributor

@ycombinator ycombinator commented Dec 1, 2020

What does this PR do?

This PR fixes a pretty bad bug with the logstash module when xpack.enabled: true is set.

The logstash module has two metricsets: node and node_stats. When xpack.enabled: true is set, these metricsets emit events for Stack Monitoring with type: logstash_state and type: logstash_stats, respectively.

The node metricset is supposed to emit as many events per Fetch() cycle as there are monitored Logstash pipelines x the number of Elasticsearch clusters that each such pipeline talks to. If a Logstash pipeline does not talk to an Elasticsearch cluster, one event for that pipeline must be emitted.

Instead, due to terribly faulty logic in the code, the node metricset was emitting as many events per Fetch() cycle as there are monitored Logstash pipelines x the number of vertices in each such pipeline!

There was similar faulty logic in the node_stats metricset.

This PR fixes this bug in both metricsets.

Why is it important?

By fixing this bug, we are eliminating a huge source of redundant and unnecessary events being published from the logstash module when xpack.enabled: true is set. This will not only make the module's behavior correct but also drastically reduce memory consumption in Metricbeat and the number of events sent to Metricbeat's output over the wire.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

@ycombinator ycombinator added bug module Metricbeat Metricbeat Feature:Stack Monitoring Team:Services (Deprecated) Label for the former Integrations-Services team v7.11.0 v7.10.1 labels Dec 1, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/stack-monitoring (Stack monitoring)

@elasticmachine
Copy link
Collaborator

Pinging @elastic/integrations-services (Team:Services)

@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Dec 1, 2020
@ycombinator ycombinator requested a review from faec December 1, 2020 07:31
@elasticmachine
Copy link
Collaborator

elasticmachine commented Dec 1, 2020

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview

Expand to view the summary

Build stats

  • Build Cause: Pull request #22808 updated

  • Start Time: 2020-12-01T12:59:31.896+0000

  • Duration: 63 min 55 sec

Test stats 🧪

Test Results
Failed 0
Passed 2293
Skipped 508
Total 2801

💚 Flaky test report

Tests succeeded.

Expand to view the summary

Test stats 🧪

Test Results
Failed 0
Passed 2293
Skipped 508
Total 2801

clusterToPipelinesMap[overrideClusterUUID] = pipelines
return clusterToPipelinesMap
}

Copy link
Contributor Author

@ycombinator ycombinator Dec 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this section because it's also incorrect. We shouldn't be using the override cluster UUID, if set, for all pipelines so broadly. Instead we should figure out (as we do further below) the cluster UUIDs appropriate for each pipeline and build the cluster UUID => pipelines map that way.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still a change in semantics and maybe a bc breaking change.

Before this change all pipelines have had been associated with overrideClusterUUID. With this change pipelines are only associated with overrideClusterUUID iff they do no clusterUUID set. This way overrideClusterUUID actually becomes defaultClusterUUID.

Instead of removing this line, how about printing a deprecation warning if this setting is configured and introduce an a default cluster uuid setting that matches the new semantics?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, looking at where the value of overrideClusterUUID comes from ultimately, it comes from the monitoring.cluster_uuid setting in logstash.yml: https://www.elastic.co/guide/en/logstash/current/monitoring-with-metricbeat.html#define-cluster__uuid.

Reading the documentation for that setting:

To bind the metrics of Logstash to a specific cluster, optionally define ...

This makes me think, if this setting is set and therefore overrideClusterUUID is set, we should make that the cluster UUID for all pipelines, ignoring what Elasticsearch clusters each pipeline might individually be talking to.

So I'm going to revert this change and, in fact, add a similar code block in the makeClusterToPipelinesMap function in the node metricset.

@andresrc
Copy link
Contributor

andresrc commented Dec 1, 2020

/cc @sayden

for _, pipeline := range pipelines {
var clusterUUIDs []string
clusterUUIDs := make(map[string]struct{}, 0)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is common.StringSet.

The overall data expension starts by converting the map[string]pipelineState into a []pipelineState. Would it make sense to keep the map instead, to reduce the chance of creating duplicate entries in general?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed the code to use common.StringSet in 4cb337f but I didn't follow this:

The overall data expension starts by converting the map[string]pipelineState into a []pipelineState. Would it make sense to keep the map instead, to reduce the chance of creating duplicate entries in general?

I think it's because I'm not finding the map[string]pipelineState you mentioned. Where are you seeing that?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do the conversion here.
I was wondering if it is necessary, or maybe can even lead to similar problems (in the future) if we are not careful about not producing duplicates.

}

for _, clusterUUID := range clusterUUIDs {
for clusterUUID, _ := range clusterUUIDs {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit clusterUUID := range clusterUUIDS { ... }

clusterToPipelinesMap[overrideClusterUUID] = pipelines
return clusterToPipelinesMap
}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still a change in semantics and maybe a bc breaking change.

Before this change all pipelines have had been associated with overrideClusterUUID. With this change pipelines are only associated with overrideClusterUUID iff they do no clusterUUID set. This way overrideClusterUUID actually becomes defaultClusterUUID.

Instead of removing this line, how about printing a deprecation warning if this setting is configured and introduce an a default cluster uuid setting that matches the new semantics?

},
},
},
}
Copy link

@urso urso Dec 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this particular case able to reproduce the issue we have had? If not, we should add one.

The expannsion into the cluster map is somewhat critical in order to produce correct docs, that show the correct association of configurations to single Elasticsearch clusters. Can we make this test more exhaustive?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is the case that was problematic. Before this PR this case would've produced 3 PipelineState associated with the prod_cluster_id cluster UUID when we should've only got 1 PipelineState.

I will expand this test to add a few more test cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added more test cases in 672ee5e.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@@ -67,20 +67,20 @@ func makeClusterToPipelinesMap(pipelines []logstash.PipelineState, overrideClust
clusterToPipelinesMap = make(map[string][]logstash.PipelineState)

for _, pipeline := range pipelines {
var clusterUUIDs []string
clusterUUIDs := make(map[string]struct{}, 0)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

common.StringSet

@ycombinator
Copy link
Contributor Author

@urso I've addressed all your feedback. Would you mind re-reviewing please, when you get a chance? Thanks!

@ycombinator ycombinator requested a review from urso December 1, 2020 12:26
}

for _, clusterUUID := range clusterUUIDs {
for _, clusterUUID := range clusterUUIDs.ToSlice() {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ToSlice is a little overkill here (it copies and sorts all strings). The StringSet is defined as type StringSet map[string]struct{} and can be directly use with the range-clause.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed in dcfce4d.

@ycombinator ycombinator merged commit cddf4d4 into elastic:master Dec 1, 2020
@ycombinator ycombinator deleted the mon-ls-dup-cluster-uuid branch December 1, 2020 14:05
@ycombinator ycombinator added the needs_backport PR is waiting to be backported to other branches. label Dec 1, 2020
@ycombinator ycombinator removed the needs_backport PR is waiting to be backported to other branches. label Dec 1, 2020
ycombinator added a commit that referenced this pull request Dec 1, 2020
#22808) (#22817)

* Fixing logic to keep list of unique cluster UUIDs (#22808)

* Fixing logic to keep list of unique cluster UUIDs

* Adding CHANGELOG entry

* Use common.StringSet

* Adding more test cases

* Adding back logic to broadly override cluster UUID for all pipelines, if set

* Removing ToSlice()

* Fixing loop

* Fixing up CHANGELOG
ycombinator added a commit that referenced this pull request Dec 1, 2020
* Fixing logic to keep list of unique cluster UUIDs

* Adding CHANGELOG entry

* Use common.StringSet

* Adding more test cases

* Adding back logic to broadly override cluster UUID for all pipelines, if set

* Removing ToSlice()

* Fixing loop
v1v added a commit to v1v/beats that referenced this pull request Dec 2, 2020
…-issues

* upstream/master: (41 commits)
  Fix version parser regex for packaging (elastic#22581)
  Fix local_dynamic documentation and add providers inline doc. (elastic#22657)
  fix: use proper param name for e2e tests (elastic#22836)
  [Heartbeat] Fix exit on disabled monitor (elastic#22829)
  Update Golang to 1.14.12 (elastic#22790)
  docs: fix setup.template.overwrite typos (elastic#22804)
  Add docs section for ECS EC2 monitoring (elastic#22784)
  Fixing logic to keep list of unique cluster UUIDs (elastic#22808)
  Skip somewhat flaky UDP system test on Windows (elastic#22810)
  Fix polling node when it is not ready and monitor by hostname (elastic#22666)
  Skip Filebeat test_shutdown on windows 7 (elastic#22797)
  Make monitoring Namespace thread-safe (elastic#22640)
  Drop pkt_dstaddr and pkt_srcaddr when equals to "-" (elastic#22721)
  Add support for reading from UNIX datagram sockets (elastic#22699)
  Fix export dashboard command from Elastic Cloud (elastic#22746)
  Skip flaky winlogbeat test on Windows-7 (elastic#22754)
  Missing `>` (elastic#22763) (elastic#22766)
  Fix k8s watcher issue when node access to list nodes and ns (elastic#22714)
  [Metricbeat/Kibana/stats] Enforce `exclude_usage=true` (elastic#22732)
  Avoid sending non-numeric floats in cloud foundry integrations (elastic#22634)
  ...
v1v added a commit to v1v/beats that referenced this pull request Dec 2, 2020
…dows-7

* upstream/master: (41 commits)
  Fix version parser regex for packaging (elastic#22581)
  Fix local_dynamic documentation and add providers inline doc. (elastic#22657)
  fix: use proper param name for e2e tests (elastic#22836)
  [Heartbeat] Fix exit on disabled monitor (elastic#22829)
  Update Golang to 1.14.12 (elastic#22790)
  docs: fix setup.template.overwrite typos (elastic#22804)
  Add docs section for ECS EC2 monitoring (elastic#22784)
  Fixing logic to keep list of unique cluster UUIDs (elastic#22808)
  Skip somewhat flaky UDP system test on Windows (elastic#22810)
  Fix polling node when it is not ready and monitor by hostname (elastic#22666)
  Skip Filebeat test_shutdown on windows 7 (elastic#22797)
  Make monitoring Namespace thread-safe (elastic#22640)
  Drop pkt_dstaddr and pkt_srcaddr when equals to "-" (elastic#22721)
  Add support for reading from UNIX datagram sockets (elastic#22699)
  Fix export dashboard command from Elastic Cloud (elastic#22746)
  Skip flaky winlogbeat test on Windows-7 (elastic#22754)
  Missing `>` (elastic#22763) (elastic#22766)
  Fix k8s watcher issue when node access to list nodes and ns (elastic#22714)
  [Metricbeat/Kibana/stats] Enforce `exclude_usage=true` (elastic#22732)
  Avoid sending non-numeric floats in cloud foundry integrations (elastic#22634)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Feature:Stack Monitoring Metricbeat Metricbeat module Team:Services (Deprecated) Label for the former Integrations-Services team v7.10.1 v7.11.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants