
New Agent pools do not have kubelet metrics #1601

Closed · Aaron-ML opened this issue May 11, 2020 · 10 comments

@Aaron-ML
What happened:
Added secondary node pools due to the inflexibility of the original node pool.

What you expected to happen:
Expected non-default node pools to expose kubelet stats so we can monitor container-level statistics.

How to reproduce it (as minimally and precisely as possible):

Add a new node pool to an existing AKS cluster and set up Prometheus to scrape kubelet http-metrics from each node. Observe that only nodes in the default node pool are reachable via http://<node_IP>:10255/metrics.
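
An illustrative sketch of such a scrape job (the job name and relabel rule here are simplified assumptions, not our exact config):

# Illustrative only: scrape kubelets over the read-only HTTP port 10255.
- job_name: 'kubelet-http-metrics'
  scheme: http
  kubernetes_sd_configs:
    - role: node
  relabel_configs:
    # rewrite the discovered kubelet address (normally :10250)
    # to the read-only port 10255
    - source_labels: [__address__]
      regex: '(.+):\d+'
      replacement: '${1}:10255'
      target_label: __address__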

Anything else we need to know?:

We currently use the kubelet to provide container-level metrics to Prometheus, such as CPU/memory stats, among other things.

Support Ticket: 120050624005838

Environment:

  • Kubernetes version (use kubectl version):

v1.16.7

  • Size of cluster (how many worker nodes are in the cluster?)

This is currently happening on our smaller dev AKS clusters as well as staging and production, so anywhere from 1-15 nodes of varying VM sizes.

  • General description of workloads in the cluster (e.g. HTTP microservices, Java app, Ruby on Rails, machine learning, etc.)

Webservices/Java JVM applications

  • Others:
    This is a huge pain point for us, as we can't troubleshoot any container resources in our new primary node pools.
@andyzhangx
Contributor

andyzhangx commented May 13, 2020

Port 10255 is plain HTTP and not secure; please switch to port 10250. Here is an example of how to configure Prometheus to use port 10250:
https://github.com/prometheus/prometheus/blob/5bb7f00d00ba2d73488630851b352974511c233a/documentation/examples/prometheus-kubernetes.yml#L65-L73
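
For reference, that example boils down to something like the following (a sketch; the CA/token paths assume Prometheus runs in-cluster with a suitable service account):

# Scrape kubelets over HTTPS on the secure port 10250 using the
# in-cluster service account token.
- job_name: 'kubernetes-nodes'
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    # if the kubelet serving certificate is not signed by the cluster CA,
    # insecure_skip_verify: true may be needed here
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  kubernetes_sd_configs:
    - role: node
  relabel_configs:
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)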

@embik

embik commented May 13, 2020

@andyzhangx has the plain HTTP port 10255 been deactivated without mention in the changelog? This worked on clusters deployed a few months ago, so I'm very surprised to see it gone.

@andyzhangx
Contributor

cc @palma21

@Aaron-ML
Author

@andyzhangx I've attempted to switch it to https on port 10250, but I get "Unauthorized" as a response.

Note that this is still working on the original node pool... so I'm not sure what's actually changed on the new node pool. Can you clarify?

(screenshot: the "Unauthorized" response)

@Aaron-ML
Author

@andyzhangx @palma21

Can we open this ticket back up? I'm not sure anything got resolved here.

I've attempted to scale up the original node pool and can confirm those new nodes have working kubelet endpoints.

I've also scaled up the new node pool, and those new nodes do not have working kubelet endpoints.

@lBowlin

lBowlin commented May 14, 2020

@andyzhangx We attempted to use https on port 10250 and were not successful. Can you suggest other steps to try, or explain why we are getting "Unauthorized" on this port?

We only see this issue on secondary node pools; it does not occur on the primary node pool.

@palma21
Member

palma21 commented May 15, 2020

Mentioned in the release notes here: https://github.com/Azure/AKS/blob/master/CHANGELOG.md#release-2020-01-27

Did you upgrade your original node pool to 1.16, or was it created on 1.16?

There is currently a bug, which is being fixed, where the upgrade did not pick up that change, so you might still have 10255 working on that pool. New pools created directly on 1.16 would not. That could explain the different behavior between your pools; could you confirm?

Kubelet port 10255 is disabled by default:
kubernetes/kubernetes#59666 (comment)
It is required to either use the healthz port:
kubernetes/kubernetes#63812
or port 10250 with the required authorization.

I believe what is missing in your case is some required kubelet flags, which we are enabling right now as well. In the meantime, can you work around it with the script below and confirm whether that solves it?
https://github.com/jnoller/kubernaughty/blob/master/tools/enable-webhook
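
For reference, the kubelet settings involved look roughly like the following in the kubelet configuration (a sketch of the relevant fields only; not necessarily exactly what the linked script applies):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
authentication:
  anonymous:
    enabled: false
  webhook:
    enabled: true                # validate bearer tokens via TokenReview
  x509:
    clientCAFile: /etc/kubernetes/certs/ca.crt   # path is an assumption
authorization:
  mode: Webhook                  # authorize requests via SubjectAccessReview
readOnlyPort: 0                  # the unauthenticated 10255 port stays disabled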

@palma21 reopened this May 15, 2020
@Aaron-ML
Author

> Mentioned in the release notes here: https://github.com/Azure/AKS/blob/master/CHANGELOG.md#release-2020-01-27
>
> Did you upgrade your original node pool to 1.16, or was it created on 1.16?
>
> There is currently a bug, which is being fixed, where the upgrade did not pick up that change, so you might still have 10255 working on that pool. New pools created directly on 1.16 would not. That could explain the different behavior between your pools; could you confirm?
>
> Kubelet port 10255 is disabled by default:
> kubernetes/kubernetes#59666 (comment)
> It is required to either use the healthz port:
> kubernetes/kubernetes#63812
> or port 10250 with the required authorization.
>
> I believe what is missing in your case is some required kubelet flags, which we are enabling right now as well. In the meantime, can you work around it with the script below and confirm whether that solves it?
> https://github.com/jnoller/kubernaughty/blob/master/tools/enable-webhook

Thanks for responding and reopening the ticket!

Looks like it was referenced under "Azure Monitor" in the release notes, so I missed it; that's my mistake.

The original node pool was created on an older version but was upgraded to 1.16 along with the control plane; the secondary node pool came after. The difference between the two was what concerned us, and the bug you mention sounds like it could be what we are seeing.

We've since set up Prometheus authorization over HTTPS to scrape the new endpoint. Do you know if this bug is isolated to this endpoint, or are there other things we should be looking at as well?

@ShaggO

ShaggO commented Jun 12, 2020

> Port 10255 is plain HTTP and not secure; please switch to port 10250. Here is an example of how to configure Prometheus to use port 10250:
> https://github.com/prometheus/prometheus/blob/5bb7f00d00ba2d73488630851b352974511c233a/documentation/examples/prometheus-kubernetes.yml#L65-L73

System information/context:
I've just upgraded an "old" cluster (1.14.6 -> 1.15.11 -> 1.16.9) that uses availability sets, and now the HTTP endpoint at port 10255 is missing. I've switched to port 10250 and set up a service account bound to a ClusterRole with the appropriate rules:

- apiGroups: [""]
  resources: ["nodes/stats", "nodes/metrics"]
  verbs: ["get"]

I can now query https://${NODE_IP}:10250/metrics and https://${NODE_IP}:10250/stats/summary with the bearer token header from my service account, but I don't have the CA bundle to verify the connection/certificate.

Problem: I want to verify the "self-signed" kubelet certificate, but the service account's ca.crt does not work. That ca.crt works for verifying the API server endpoint but not the nodes' kubelet endpoints.
Is there a way to verify the nodes' kubelet certificate(s) or obtain the nodes' kubelet CA bundle(s) so that I can verify the connection?

@StianOvrevage

For those of you using ServiceMonitor CRDs, this worked for me.

Update the prometheus-operator-kubelet ServiceMonitor; in our case it's in the monitoring namespace.

kubectl edit servicemonitor -n monitoring prometheus-operator-kubelet

Change from http to https where applicable, and add tlsConfig.insecureSkipVerify: true.

  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 10s
    port: https-metrics
    scheme: https
    tlsConfig:
      insecureSkipVerify: true
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    honorLabels: true
    interval: 10s
    path: /metrics/cadvisor
    port: https-metrics
    scheme: https
    tlsConfig:
      insecureSkipVerify: true

Now both the kubelet and cAdvisor targets in Prometheus are up, and metrics are once again flowing :)

@ghost locked as resolved and limited conversation to collaborators Jul 29, 2020