[Metricbeat autodiscover][Provider Kubernetes] Add condition to node/namespace watchers #37181

Closed
wants to merge 6 commits into from

Conversation

constanca-m
Contributor

@constanca-m constanca-m commented Nov 22, 2023

Proposed commit message

To disable the watchers we need to do both of these:

  1. Disable hints:
hints.enabled: false
  2. Disable resource metadata:
add_resource_metadata:
  namespace.enabled: false

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

  1. Follow the instructions in this file.
  2. Make sure that at least the namespace / node resource permissions are not granted cluster-wide.
  3. Run metricbeat without these permissions; you should no longer see any error.

Related issues

Logs

The error

W1122 12:17:27.683394      23 reflector.go:324] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: failed to list *v1.Namespace: namespaces is forbidden: User "system:serviceaccount:kube-system:metricbeat" cannot list resource "namespaces" in API group "" at the cluster scope

is no longer present in the logs.
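
When re-running the test above, it can help to scan the Metricbeat logs for this warning automatically. The following is only a minimal sketch: `isForbiddenListError` is a hypothetical helper (not part of Beats or client-go), and it assumes the warning keeps the "failed to list ... is forbidden" wording shown in the log line above.

```go
package main

import (
	"fmt"
	"strings"
)

// isForbiddenListError is a hypothetical helper. It reports whether a log
// line looks like the reflector warning quoted above for the given resource
// (for example "namespaces" or "nodes"). It only matches on message text.
func isForbiddenListError(line, resource string) bool {
	return strings.Contains(line, "failed to list") &&
		strings.Contains(line, resource+" is forbidden")
}

func main() {
	// Message part of the warning from the PR description.
	warning := `failed to list *v1.Namespace: namespaces is forbidden: User "system:serviceaccount:kube-system:metricbeat" cannot list resource "namespaces" in API group "" at the cluster scope`

	fmt.Println(isForbiddenListError(warning, "namespaces"))
	fmt.Println(isForbiddenListError("some unrelated log line", "namespaces"))
}
```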

@constanca-m constanca-m added backport-v7.11.0 Automated backport with mergify Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team backport-v8.11.0 Automated backport with mergify labels Nov 22, 2023
@constanca-m constanca-m self-assigned this Nov 22, 2023
@constanca-m constanca-m requested a review from a team as a code owner November 22, 2023 13:10
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Nov 22, 2023
@constanca-m constanca-m requested review from a team, ChrsMark and tetianakravchenko and removed request for a team November 22, 2023 13:11
Member

@ChrsMark ChrsMark left a comment

Let's ensure that node and namespace settings are preserved with the patch.

ref: https://www.elastic.co/guide/en/beats/metricbeat/current/configuration-autodiscover.html#_kubernetes.

@gizas
Contributor

gizas commented Nov 22, 2023

Please add a new changelog entry for this PR.

Also this PR needs to be backported to 8.11 and 7.17 as it is a bug

@constanca-m
Contributor Author

Also this PR needs to be backported to 8.11 and 7.17 as it is a bug

The backports are there.

I will add an entry in changelog

@elasticmachine
Collaborator

❕ Build Aborted

There is a new build on-going so the previous on-going builds have been aborted.

Build stats

  • Start Time: 2023-11-22T13:11:05.530+0000

  • Duration: 47 min 33 sec

Test stats 🧪

Test Results
Failed 2
Passed 21723
Skipped 1582
Total 23307

Test errors 2

Expand to view the tests failures

Build&Test / libbeat-unitTest / [empty] – TEST-go-unit.xml
  • no error details
  • Expand to view the stacktrace

     Test report file /var/lib/jenkins/workspace/PR-37181-1-ab3df46e-de8f-4ed2-be7d-d6815a8d393b/src/github.com/elastic/beats/build/libbeat/build/TEST-go-unit.xml was length 0 
    

Build&Test / libbeat-goIntegTest / [empty] – TEST-go-integration.xml
  • no error details
  • Expand to view the stacktrace

     Test report file /var/lib/jenkins/workspace/PR-37181-1-f5919e51-b99e-4342-9f33-d79361e78ce5/src/github.com/elastic/beats/build/libbeat/build/TEST-go-integration.xml was length 0 
    

Steps errors 1

Expand to view the steps failures

Error signal
  • Took 0 min 0 sec . View more details here
  • Description: Error 'org.jenkinsci.plugins.workflow.steps.FlowInterruptedException'

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@elasticmachine
Collaborator

❕ Build Aborted

There is a new build on-going so the previous on-going builds have been aborted.

Build stats

  • Start Time: 2023-11-22T13:47:44.893+0000

  • Duration: 100 min 22 sec

Test stats 🧪

Test Results
Failed 0
Passed 28258
Skipped 2015
Total 30273

Steps errors 1

Expand to view the steps failures

Error signal
  • Took 0 min 0 sec . View more details here
  • Description: Error 'org.jenkinsci.plugins.workflow.steps.FlowInterruptedException'

@elasticmachine
Collaborator

💚 Build Succeeded

Build stats

  • Start Time: 2023-11-22T15:20:48.444+0000

  • Duration: 133 min 5 sec

Test stats 🧪

Test Results
Failed 0
Passed 28692
Skipped 2015
Total 30707

💚 Flaky test report

Tests succeeded.

updater := kubernetes.NewNodePodUpdater(p.unlockedUpdate, watcher.Store(), &p.crossUpdate)
nodeWatcher.AddEventHandler(updater)
}

- if namespaceWatcher != nil && (config.Hints.Enabled() || metaConf.Namespace.Enabled()) {
+ if namespaceWatcher != nil && config.Hints.Enabled() {
Member

@ChrsMark ChrsMark Nov 22, 2023

Now that I see this line again, I realize that metaConf.Namespace/Node was not checked when initializing the watchers above, because these watchers are used for hints-based autodiscovery anyway, no matter what add_resource_metadata defines.

Let's take for example the namespaceWatcher:
We want to trigger the updater even when metaConf.Namespace.Enabled() is false, as long as hints are enabled. Some details can be found at #25117.

So I believe we need to rethink this patch more carefully. I see 2 options here:

  1. We initialize the watchers if config.Hints.Enabled() || metaConf.Namespace.Enabled() anyways in order to have the hints to work as expected.
  2. Introduce new settings to disable the Watchers only for users that actually don't have permissions to watch on Namespaces and/or Nodes.

The point is that we cannot couple the hints-based autodiscovery functionality with the add_resource_metadata setting. That's why we had this || in the if statement here.

Most probably that's why we fetch the namespace Annotations at

namespaceAnnotations := kubernetes.PodNamespaceAnnotations(pod, p.namespaceWatcher)
using the Watcher.

A real use case would be the following:

a) the users have disabled the namespace metadata enrichment with add_resource_metadata.namespace.enabled: false
b) but they want hints-based autodiscovery driven by the Namespace's annotations (see https://www.elastic.co/guide/en/beats/metricbeat/current/configuration-autodiscover-hints.html#_namespace_defaults)

The hints events won't be complete because we won't have the Namespace annotations at

kubemeta["namespace_annotations"] = namespaceAnnotations
.

I might be missing something here, but my point is that we need to revisit this with the big picture in mind and then decide accordingly. As we can see, there are lots of different pieces affected here.
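
To make the coupling described above concrete, here is a minimal sketch of the condition from option 1. The `watcherConfig` type and `needsNamespaceWatcher` function are hypothetical stand-ins for illustration only; they are not the actual Beats config structs or provider code.

```go
package main

import "fmt"

// watcherConfig is a hypothetical, simplified stand-in for the two settings
// discussed in this thread. It is not the real Beats config struct.
type watcherConfig struct {
	HintsEnabled         bool // corresponds to hints.enabled
	NamespaceMetaEnabled bool // corresponds to add_resource_metadata.namespace.enabled
}

// needsNamespaceWatcher mirrors option 1: the namespace watcher is required
// if either hints or namespace metadata enrichment is enabled, so it can
// only be skipped when both are disabled.
func needsNamespaceWatcher(c watcherConfig) bool {
	return c.HintsEnabled || c.NamespaceMetaEnabled
}

func main() {
	cases := []watcherConfig{
		{HintsEnabled: true, NamespaceMetaEnabled: true},
		{HintsEnabled: true, NamespaceMetaEnabled: false}, // the use case a) + b) above
		{HintsEnabled: false, NamespaceMetaEnabled: true},
		{HintsEnabled: false, NamespaceMetaEnabled: false},
	}
	for _, c := range cases {
		fmt.Printf("hints=%t namespaceMeta=%t -> watcher=%t\n",
			c.HintsEnabled, c.NamespaceMetaEnabled, needsNamespaceWatcher(c))
	}
}
```

Under this sketch, the case with hints enabled and namespace metadata disabled still keeps the watcher, so namespace annotations remain available to hints.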

Contributor Author

We initialize the watchers if config.Hints.Enabled() || metaConf.Namespace.Enabled() anyways in order to have the hints to work as expected.

I like this option, because this way we can still stop the watchers with hints.enabled: false. The main problem as it is now is that we have no way to stop them.
We would have to update the documentation for Autodiscover with this, as we never mention the option hints (only at this page).

Introduce new settings to disable the Watchers only for users that actually don't have permissions to watch on Namespaces and/or Nodes.

Maybe the user would have way too many options when we can already use the ones available.

A real use case would be the following:

Thanks, it is clear 👍

Contributor Author

We initialize the watchers if config.Hints.Enabled() || metaConf.Namespace.Enabled() anyways in order to have the hints to work as expected.

I committed new changes so now it works like this.

Member

Also somehow related: #34717

@ChrsMark
Member

ChrsMark commented Nov 22, 2023

The PR's description mentions the following:

Make sure that at least namespace / node resources permissions are not granted cluster wide.
Run metricbeat without this permission and you should no longer see any error.

However, for changes to the codebase that affect multiple features, we cannot just rely on verifying that a bug or specific thing is fixed/works. From the conversation it's obvious that proper e2e testing is missing here in order to catch possible regressions like the one mentioned at #37181 (comment).

@gizas @bturquet that is something I suggest we should take into account some time soon.

@elasticmachine
Collaborator

❕ Build Aborted

There is a new build on-going so the previous on-going builds have been aborted.

Build stats

  • Start Time: 2023-11-23T07:14:44.185+0000

  • Duration: 10 min 44 sec

Steps errors 1

Expand to view the steps failures

Error signal
  • Took 0 min 0 sec . View more details here
  • Description: Error 'org.jenkinsci.plugins.workflow.steps.FlowInterruptedException'

// in order to be able to retrieve 2nd layer Owner metadata like in case of:
// Deployment -> Replicaset -> Pod
// CronJob -> job -> Pod
if metaConf.Deployment {
- replicaSetWatcher, err = kubernetes.NewNamedWatcher("resource_metadata_enricher_rs", client, &kubernetes.ReplicaSet{}, kubernetes.WatchOptions{
+ options = kubernetes.WatchOptions{
Member

@ChrsMark ChrsMark Nov 23, 2023

Minor suggestion since we touch this codebase: would it make sense to use the same Namespace scope setting as the one used for the Pod watcher?

If we watch for Pods in a specific Namespace, then their parent Deployments and CronJobs will be in the same Namespace as well.
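
A rough sketch of this suggestion, under stated assumptions: `watchOptions` is a hypothetical, trimmed-down struct defined only for this example (not the real `kubernetes.WatchOptions`), and `ownerWatchOptions` is an illustrative helper. Dropping the node filter for owner resources is also an assumption, based on ReplicaSets and Jobs not being bound to a node.

```go
package main

import "fmt"

// watchOptions is a hypothetical, trimmed-down options struct used only in
// this sketch. An empty Namespace means "watch cluster-wide".
type watchOptions struct {
	Namespace string
	Node      string
}

// ownerWatchOptions derives the options for the ReplicaSet/Job watchers from
// the Pod watcher options: it keeps the Namespace scope, because owners live
// in the same Namespace as their Pods, and drops the node filter (an
// assumption of this sketch).
func ownerWatchOptions(pod watchOptions) watchOptions {
	return watchOptions{Namespace: pod.Namespace}
}

func main() {
	podOpts := watchOptions{Namespace: "kube-system", Node: "node-1"}
	fmt.Printf("pod watcher:   %+v\n", podOpts)
	fmt.Printf("owner watcher: %+v\n", ownerWatchOptions(podOpts))
}
```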

@constanca-m
Contributor Author

constanca-m commented Nov 23, 2023

From the conversation it's obvious that proper e2e testing is missing here in order to catch possible regressions like the one mentioned at #37181 (comment).

I could implement some tests covering all the options (hints enabled, add resource metadata enabled, or neither) to check whether the watchers exist or not. I don't think much else could be done on this part of the code. What do you think? Is it worth checking whether they are nil, or do we leave the test files as they are, @ChrsMark?

Edit: I added the test to check the conditions.

@elasticmachine
Collaborator

❕ Build Aborted

There is a new build on-going so the previous on-going builds have been aborted.

Build stats

  • Start Time: 2023-11-23T07:19:30.327+0000

  • Duration: 90 min 29 sec

Test stats 🧪

Test Results
Failed 0
Passed 28260
Skipped 2015
Total 30275

Steps errors 1

Expand to view the steps failures

Error signal
  • Took 0 min 0 sec . View more details here
  • Description: Error 'org.jenkinsci.plugins.workflow.steps.FlowInterruptedException'

@constanca-m constanca-m added backport-7.17 Automated backport to the 7.17 branch with mergify and removed backport-v7.11.0 Automated backport with mergify labels Nov 23, 2023
@elasticmachine
Collaborator

💚 Build Succeeded

Build stats

  • Duration: 132 min 16 sec

❕ Flaky test report

No test was executed to be analysed.

@constanca-m
Contributor Author

I am closing this PR for now based on the discussion held yesterday. It seems the watchers issue is more complex, and this implementation needs to be changed.

@constanca-m constanca-m deleted the namespace-watchers branch December 7, 2023 09:05
Labels
backport-7.17 Automated backport to the 7.17 branch with mergify backport-v8.11.0 Automated backport with mergify Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team
5 participants