Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add insecure_skip_verify in kubeletMetrics preset #1034

Closed

Conversation

ChrsMark
Copy link
Member

The OTel documentation for the Kubeletstats Receiver component suggests the following config:

receivers:
  kubeletstats:
    collection_interval: 10s
    auth_type: 'serviceAccount'
    endpoint: '${env:K8S_NODE_NAME}:10250'
    insecure_skip_verify: true
    metric_groups:
      - node
      - pod
      - container

However kubeletMetrics preset of the respective Helm Chart does not include the insecure_skip_verify.

Thus, using the preset results in not being able to scrape metrics from Kubelet:

2024-02-12T13:06:38.943Z	error	scraperhelper/scrapercontroller.go:200	Error scraping metrics	{"kind": "receiver", "name": "kubeletstats", "data_type": "metrics", "error": "Get \"https://kind-worker2:10250/stats/summary\": tls: failed to verify certificate: x509: certificate signed by unknown authority", "scraper": "kubeletstats"}

I guess it would be nice to skip the tls verification with the preset as well since this is mostly the case and would also be aligned with what the plain config example suggests.

@ChrsMark ChrsMark requested a review from a team February 12, 2024 13:14
@@ -169,6 +169,7 @@ receivers:
collection_interval: 20s
auth_type: "serviceAccount"
endpoint: "${env:K8S_NODE_NAME}:10250"
insecure_skip_verify: true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want to enable this by default since it isn't the recommended behavior for production. Really the receiver's README should be updated to not set that value. If the cluster's certificates are in order this value is not necessary. @jinja2 do you agree?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also @povilasv and @dmitryax

Copy link
Contributor

@povilasv povilasv Feb 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general I agree, though these certificates are a bit of a hassle to manage and we don't really have good docs how to add these to collector.

Also it used to work before 0.87 as there was a bug in kubeletstats, which caused to ignore ceritficate verification -> https://github.com/open-telemetry/opentelemetry-collector-contrib/releases/tag/v0.87.0

So I'm a bit conflicted here.

Copy link
Member Author

@ChrsMark ChrsMark Feb 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment the preset does not add any of the ca_file, key_file, cert_file, right? So I wonder in which cases the preset is actually useful with the current implementation.

ref: https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/27070/files#r1334828771

From my perspective, insecure_skip_verify serves as a good base and starting point for the cases where certs are not required. For production clusters, our users would need to explicitly configure the receiver (without the preset) and manually set the certs anyways.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Someone correct me if I'm wrong, but I thought the receiver was looking in an expected place for the certs and ca_file, key_file, and cert_file was only necessary if your certs didn't live in that expected place.

Are the certs for a standard production k8s cluster no longer installed at that location?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The preset in the chart is using the serviceAccount auth type. This auth type does not use the ca_file, key_file, cert_file options at all. These are used for auth type = TLS so basically if you are using client certificates to authenticate. With serviceAccount auth we get the cluster's root CA cert so it can be added to the trust store/cert pool, and this is the ca.crt we see along with the token for service account at the well-known path that Tyler linked. The standard practice is to setup kubelet to serve certificate signed by the cluster CA which we have access to already when using serviceaccount auth.

Some setups might not set the k8s node name in the Subject alternative name in the kubelet's server cert. In such cases users might get the hostname mismatch error since we use the K8S_NODE_NAME as host here. The referenced comment has more details on my findings on some of the more commonly used distros.

I would recommend to not add insecure_skip_verify: true as default cause that would be a step back in terms of secure practices. This is the first issue about this setting after the change in behavior went out so it might not be impacting most users. A compromise might be an option under the preset variable?

The error posted in the issue with running in kind cluster can most likely be fixed by adding below patch to the kind cluster's config.

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
kubeadmConfigPatches:
  - |
    kind: KubeletConfiguration
    serverTLSBootstrap: true

After the cluster is up, you'll probably need to approve the certificate signing request for the kubelet certs by running kubectl certificate approve {PENDING CSR}.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank's for the details @jinja2 @TylerHelmuth on how the internals of serviceAccount auth type work.

Based on what you shared insecure_skip_verify: true does not make a lot of sense to be the default.
I tested the preset on a GKE cluster and works without any issues.
However would that make sense to document these details around the serviceAccount type so as to make it clear how it works and what is required in order for this work?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ChrsMark Yes, let's definitely document it and the possible fix.

@ChrsMark
Copy link
Member Author

Closing this in honor of open-telemetry/opentelemetry-collector-contrib#31280 as per the discussions at #1034 (comment). Let's continue the conversation there.

@ChrsMark ChrsMark closed this Feb 15, 2024
TylerHelmuth pushed a commit to open-telemetry/opentelemetry-collector-contrib that referenced this pull request Feb 21, 2024
#31280)

**Description:** <Describe what has changed.>
<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->

This PR adds some clarifications in the documentation of the
`serviceAccount` auth type of the `kubeletstats` receiver.
This is an outcome of the discussions at
open-telemetry/opentelemetry-helm-charts#1034 (comment).

Signed-off-by: ChrsMark <[email protected]>
XinRanZhAWS pushed a commit to XinRanZhAWS/opentelemetry-collector-contrib that referenced this pull request Mar 13, 2024
open-telemetry#31280)

**Description:** <Describe what has changed.>
<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->

This PR adds some clarifications in the documentation of the
`serviceAccount` auth type of the `kubeletstats` receiver.
This is an outcome of the discussions at
open-telemetry/opentelemetry-helm-charts#1034 (comment).

Signed-off-by: ChrsMark <[email protected]>
@ChrsMark
Copy link
Member Author

ChrsMark commented Jul 3, 2024

Bumping this up again since I got an error while testing the preset on GKE.

Config:

mode: daemonset
presets:
  kubeletMetrics:
    enabled: true

config:
  exporters:
    debug:
      verbosity: detailed

  service:
    pipelines:
      metrics:
        receivers: [kubeletstats]
        processors: [batch]
        exporters: [debug]

Logs:

2024-07-03T13:16:16.966Z	error	[email protected]/scraper.go:77	call to /stats/summary endpoint failed	{"kind": "receiver", "name": "kubeletstats", "data_type": "metrics", "error": "Get \"https://gke-otel-chrismark-test-default-pool-b71e7458-hnfn:10250/stats/summary\": tls: failed to verify certificate: x509: certificate is not valid for any names, but wanted to match gke-otel-chrismark-test-default-pool-b71e7458-hnfn"}
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/kubeletstatsreceiver.(*kubletScraper).scrape
	github.com/open-telemetry/opentelemetry-collector-contrib/receiver/[email protected]/scraper.go:77
go.opentelemetry.io/collector/receiver/scraperhelper.ScrapeFunc.Scrape
	go.opentelemetry.io/collector/[email protected]/scraperhelper/scraper.go:20
go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).scrapeMetricsAndReport
	go.opentelemetry.io/collector/[email protected]/scraperhelper/scrapercontroller.go:194
go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).startScraping.func1
	go.opentelemetry.io/collector/[email protected]/scraperhelper/scrapercontroller.go:169
2024-07-03T13:16:16.967Z	error	scraperhelper/scrapercontroller.go:197	Error scraping metrics	{"kind": "receiver", "name": "kubeletstats", "data_type": "metrics", "error": "Get \"https://gke-otel-chrismark-test-default-pool-b71e7458-hnfn:10250/stats/summary\": tls: failed to verify certificate: x509: certificate is not valid for any names, but wanted to match gke-otel-chrismark-test-default-pool-b71e7458-hnfn", "scraper": "kubeletstats"}
go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).scrapeMetricsAndReport
	go.opentelemetry.io/collector/[email protected]/scraperhelper/scrapercontroller.go:197
go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).startScraping.func1
	go.opentelemetry.io/collector/[email protected]/scraperhelper/scrapercontroller.go:169

Container image: "otel/opentelemetry-collector-k8s:0.100.0"
GKE version: v1.29.4-gke.1043002

I don't get these same on EKS, so I guess that sth changed recently on how GKE generates these certs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants