Unable to detect the kubelet URL automatically / cannot validate certificate #2582
This issue has been automatically marked as stale because it has not had activity in the last 30 days. Note that the issue will not be automatically closed, but this notification will remind us to investigate why there's been inactivity. Thank you for participating in the Datadog open source community.
I'm seeing this issue on Kubernetes v1.12 on DigitalOcean as well.
@mjhuber I opened a ticket on the Datadog issue tracker. Advice was to set
I'm running into this same issue on AWS EKS, using the EKS-optimized AMI image for a worker node. I suppose the Datadog agent should get stats from the kubelet using a different method.
The recommendation for newer Kubernetes versions is to use kube-state-metrics for cluster-level metrics and use the metrics API (powered by e.g. metrics-server) for node-level and pod-level metrics.
@praseodym oh hi Mark ;-) Where did you find this recommendation? Both the integrations page in Datadog and the documentation pointed me towards a 'standard' Kubernetes deployment that uses the kubelet read-only port. Someone then pointed me to the Helm chart for deploying Datadog, which uses the method you suggest, and that works for me.
@PHameete did you solve your issue using
Sorry, it seems I had an old version of the page open.
@PHameete Sorry for the confusion here: I meant that the Datadog agent itself should be updated to use kube-state-metrics and the metrics API, which should prevent it from needing access to the kubelets directly. This is more of an improvement than an actual bug, though. Regarding your issue with EKS, you should still be able to connect to the TLS port (10250) if RBAC is configured correctly so that the agent can authenticate to kubelet. We’re running without the read-only port on Kubernetes v1.13.2 and disabling TLS verification in the agent was all we had to do. Edit: I only now noticed that the Helm chart you linked does mention the agent using kube-state-metrics, so I guess that part is already implemented :)
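For reference, the RBAC setup this comment alludes to (letting the agent authenticate to the kubelet on port 10250) can be sketched roughly as below. The exact resource list is an assumption modeled on the stock Datadog RBAC manifests, so compare against the official ones before using it:

```yaml
# Hedged sketch: a ClusterRole granting the Datadog agent's service account
# read access to kubelet endpoints (resource names are assumptions; verify
# against the official Datadog RBAC manifests).
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: datadog-agent-kubelet
rules:
  - apiGroups: [""]
    resources:
      - nodes/metrics
      - nodes/spec
      - nodes/stats
      - nodes/proxy
    verbs: ["get"]
```

Bind this to the agent's service account with a matching ClusterRoleBinding so the DaemonSet pods can present a token the kubelet accepts.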
I ran into the same problem! The problem is with the new EKS AMI (worker node): with it, Datadog and even some CPU- and memory-related metrics are not working properly. I used ami-0a0b913ef3249b655 and it works fine.
Just worked through this and wanted to share what I understand it would take to not set it. We use typhoon, which runs the kubelet via systemd. It disables the read-only port and passes I hopped onto a worker and tried By default, the kubelet creates a self-signed key/cert for its server on start. If you specified Here's some discussion about addressing this issue in kubeadm: Given that the datadog role is effectively read-only, we felt the risks of unverified TLS were acceptable until we have an opportunity to look at ways to sign kubelet API certs with a known CA or have the kubelet write its CA cert out to disk.
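To illustrate why verification fails against such a certificate, here is a small standalone demo (all names are made up): a CN-only self-signed cert, the same shape as the kubelet's default serving cert, carries no subjectAltName for a client to validate against.

```shell
# Illustration only (names are made up): generate a CN-only self-signed
# certificate, mimicking the kubelet's default self-signed serving cert.
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout kubelet.key -out kubelet.crt -subj "/CN=worker-node-1"

# The subject carries only a CN; with no subjectAltName extension, a client
# that connects by node IP has nothing to match the certificate against.
openssl x509 -in kubelet.crt -noout -subject
```

Because there is no CA on disk that signed this cert either, a client cannot validate the chain at all, which is exactly the situation the agent finds itself in.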
We are seeing this error even after applying the setting. I have tried running all versions from 6.6.0 to 6.9.0.
@sridhar81 have you tried my solution?
@VinayVanama Thanks for the pointer. We are not using EKS. We are running our own cluster. Changing the AMI is going to be hard.
Hi everyone, there seem to be several problems here.

For @jcassee: The certificate cannot be validated because there is no SAN for the IP address of the node.
By:
We are also in touch with DigitalOcean to suggest adding the node IP as a SAN in the certificate.

For @mjhuber:

For @PHameete and @praseodym: We also have the kubernetes_state integration, which queries the KSM pod and gets these metrics: https://docs.datadoghq.com/agent/kubernetes/metrics/#kube-state-metrics Disabling TLS verification should not be needed if the correct certificates are used.

For @bendrucker: Please reach out to our support team if you need further details: [email protected]

For @sridhar81: Please reach out to our support team if more troubleshooting is needed: [email protected]
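For comparison, a serving certificate that validates against the node IP needs that IP listed in its SANs. The snippet below shows what such a cert looks like (the IP address and hostname are examples, not values from this thread):

```shell
# Example (IP and names are made up): a self-signed cert that carries the
# node IP as a SAN, the property being suggested to DigitalOcean above, so
# clients connecting to https://<node-ip>:10250 could validate it.
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout node.key -out node.crt -subj "/CN=worker-node-1" \
  -addext "subjectAltName=IP:10.0.0.5,DNS:worker-node-1"

# The SAN extension now lists the IP address:
openssl x509 -in node.crt -noout -text | grep -A1 "Subject Alternative Name"
```

Note that `-addext` requires OpenSSL 1.1.1 or newer; the cert would still need to be signed by a CA the agent trusts for full verification to pass.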
@Simwar I determined that the certificate does not, unfortunately, have the plain hostname as the common name:
Also, the node hostname cannot be resolved from within the pod:
Can confirm that adding it works. For posterity: https://github.com/chris-short/wingedblade/blob/master/datadog-agent.yaml#L35
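The change confirmed above amounts to one environment variable on the agent container. A minimal sketch of the relevant DaemonSet fragment (container name and image tag are assumptions; the surrounding manifest is elided):

```yaml
# Hedged sketch: disable kubelet TLS verification on the agent container.
# Only the env entry is from this thread; the rest is illustrative scaffolding.
containers:
  - name: datadog-agent
    image: datadog/agent:latest
    env:
      - name: DD_KUBELET_TLS_VERIFY
        value: "false"
```

This trades away certificate validation for the agent-to-kubelet connection, so weigh it against the risk discussion elsewhere in this thread.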
Confirming that with Kubernetes 1.13 installed via kubeadm
I am running k8s v1.11.5 on DigitalOcean. This confirms the findings of @jcassee, along with the warning:
We're seeing the same thing on EKS running 1.12. Anybody got a workaround for this?
Hi @groodt, if it still doesn't work after redeploying the agent with the RBACs provided and the correct service account, feel free to reach out to our support team: [email protected]
@jonhoare I tried to confirm what you wrote here:
However, I compared a 1.14 AKS cluster to a 1.16 cluster, and the path to the cert appeared identical on both. Both the 1.14 cluster and the 1.16 cluster had a certificate at that path.
@apeschel according to the config_template.yaml
@PSanetra It appears as though (Note: in these examples, I mounted 1.16:
1.14:
Further, the proposed cause for this issue was that the file (Note: 1.14:
Here's a comparison of 1.16:
1.14:
I dug into the cause that @mopalinski suggested, and was able to verify that it is the actual cause of the breakage on AKS. All the discussion about moving CA files and self-signed certificates is incorrect and misleading.

The truth is that Datadog has never worked correctly on AKS, and has been silently relying on the insecure kubelet fallback port this whole time. The removal of this insecure port only revealed that Datadog has been broken all along. It appears this insecure fallback port was removed at some point in the AKS 1.16 line, which is what ultimately exposed the problem with the Datadog agent.

It's trivially easy to verify this is the actual cause: 1.14:
1.16:
Datadog should hopefully prioritize a fix for this problem on their end, since it actually affects all versions of AKS. Until then, it seems the simplest workaround is to set
Some more info for you all on this, direct from MS:

Even with AKS version 1.16.x, the kubelet is accessible over HTTP on port 10255 if the cluster was upgraded from a previous version. The plan to discontinue this has been rolled out, and is scheduled to take effect in upcoming versions: 1.18.x.

I did a repro in my lab environment and found that the new version of AKS does not allow access to the kubelet over plain HTTP, and that port 10255 is discontinued. I launched a cluster with version 1.17.5, then tried to access the plain HTTP port 10255 for the kubelet:

I can confirm that on my node pools upgraded to 1.16.x the kubelet checks do work if I have DD_KUBELET_TLS_VERIFY=false set. However, on brand-new node pools I can't get any access to the kubelet via Datadog.
If someone is looking for how to deploy on AKS with
I can see my Kubernetes metrics in DD now!
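A sketch of the AKS-style workaround discussed in this thread: mount the node's kubelet serving certificate into the agent pod and point the agent at it via `DD_KUBELET_CLIENT_CA` (the env var mentioned later in the thread). The host path shown is an assumption about where AKS nodes keep the cert; verify it on your own nodes before relying on it:

```yaml
# Hedged sketch (host path and surrounding fields are assumptions):
# mount the kubelet's serving cert and tell the agent to trust it.
containers:
  - name: datadog-agent
    env:
      - name: DD_KUBELET_CLIENT_CA
        value: /etc/kubernetes/certs/kubeletserver.crt
    volumeMounts:
      - name: kubelet-cert
        mountPath: /etc/kubernetes/certs/kubeletserver.crt
        readOnly: true
volumes:
  - name: kubelet-cert
    hostPath:
      path: /etc/kubernetes/certs/kubeletserver.crt
```

Because the cert is self-signed, trusting it directly like this only works if the agent connects using a name or IP the certificate actually covers, which is why results in this thread vary between node pools.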
Those of you deploying DD with Helm: here's something you can copy into your values file.
As a workaround, disabling TLS is OK, but I'm not sure it's a production-ready recommendation. Otherwise, why does this TLS verification exist?
For AKS version 1.17.9, disabling TLS verification appears to work. The solution provided by @jonhoare appears to work for Linux nodes, but I am not positive it is the same for Windows node pools. I have attempted mounting C:\var\lib\kubelet\pki\kubelet.crt with DD_KUBELET_CLIENT_CA set, and the error still appears. When I use the following config, the kubernetes_state* metrics come in for the Windows node, but are shown under a separate host tagged as host:-<cluster_name>. The kubernetes* metrics still do not come in, though.
Hi! I have 2 clusters here:
The solution from @jonhoare works for the upgraded one, but not for the newly created one. Grml...
EDIT 2:
Just wanted to relate my experience using EKS 1.17/eks.3. I experienced this issue deploying using the instructions here: https://docs.datadoghq.com/agent/cluster_agent/setup/?tab=secret

I basically did this:
Eventually I noticed that the pods for both the cluster agent and the node agents weren't mounting anything at /var/run/secrets/kubernetes.io/serviceaccount, resulting in a failure to authenticate to the kubelet. The "unable to detect kubelet URL" error was actually a symptom of this problem. This turned out to be a quirk of the Terraform kubernetes provider; the fix was to specify

Note that I did not have to disable DD_KUBELET_TLS_VERIFY 🎉
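For anyone hitting the same symptom, the empty /var/run/secrets/kubernetes.io/serviceaccount mount corresponds to a standard pod-spec field; whether your tooling sets it is the quirk described above. The field name is standard Kubernetes, the rest of the fragment is illustrative:

```yaml
# If the service-account token directory is empty inside the pod, check that
# the pod spec isn't opting out of the token mount. Enabling it explicitly:
spec:
  serviceAccountName: datadog-agent   # name is illustrative
  automountServiceAccountToken: true
```

With the token mounted, the agent can present its service-account credentials to the kubelet's TLS port instead of failing authentication.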
@crielly Can you share the mount configuration that Terraform produced when it generated the deployment on your EKS?
All I did was take the yaml manifest from the link above and then add
I'm the last person to casually disable TLS verification, but in this case, with the connection staying on localhost, it shouldn't be a big deal. Or am I missing something?
@crielly does this setup work for Helm 3 as well?
We're using a custom DNS server with private DNS zones on our vnet and have run into a similar issue.
This is still an issue, and the root problem is still the same. The method used for TLS verification by the Datadog image is still completely broken, and the most viable workaround at the moment is to just disable TLS verification. For those using the Datadog Helm chart, you can fix it by setting:

```yaml
datadog:
  kubelet:
    tlsVerify: false
```
This is the case for my AKS clusters as well: changing the value as above resolved the issue.
Dear Datadog team: would it be possible to implement #2582 (comment)? I've tested with EKS 1.21 and Datadog 2.22.15, and it solved my issue. The solution could be something like this for the node agent, but the same applies to the cluster agent:
Without the above patch, one either manually modifies the agent DaemonSet/Deployment resources as above, or one needs to disable TLS verification, which is by no means a best practice, IMHO.
Hello, multiple issues were reported over time in this issue. We've added documentation dedicated to Kubernetes distribution specificities (including AKS). Feel free to open more dedicated issues or contact our support if your issue is not solved.
Output of the info page
Additional environment details (Operating System, Cloud provider, etc):
Kubernetes 1.12 cluster on DigitalOcean.
Steps to reproduce the issue:
Describe the results you received:
Many dashboard entries remain empty.
Describe the results you expected:
No errors, access to kubelet, functional Kubernetes dashboard.
Additional information you deem important (e.g. issue happens only occasionally):
Seems to be the same problem as #1829, however that issue is closed. Hosted Kubernetes services like DigitalOcean do not allow editing the kubelet configuration as far as I know.