Prometheus plugin #10928

Closed
simoelmou opened this issue Apr 1, 2022 · 18 comments
Labels
area/prometheus bug unexpected problem or unintended behavior

Comments

@simoelmou

Relevant telegraf.conf

[[inputs.prometheus]]
  ## An array of urls to scrape metrics from.
  # urls = ["$URL"]

  ## An array of Kubernetes services to scrape metrics from.
  # kubernetes_services = ["http://my-service-dns.my-namespace:9100/metrics" ]

  ## Kubernetes config file to create client from.
  # kube_config = "/path/to/kubernetes.config"

  ## Scrape Kubernetes pods for the following prometheus annotations:
  ## - prometheus.io/scrape: Enable scraping for this pod
  ## - prometheus.io/scheme: If the metrics endpoint is secured then you will need to
  ##     set this to 'https' & most likely set the tls config.
  ## - prometheus.io/path: If the metrics path is not /metrics, define it with this annotation.
  ## - prometheus.io/port: If port is not 9102 use this annotation
  monitor_kubernetes_pods = true

  ## Restricts Kubernetes monitoring to a single namespace
  monitor_kubernetes_pods_namespace = "$NAMESPACE"

  ## Use bearer token for authorization
  # bearer_token = /path/to/bearer/token

  ## Specify timeout duration for slower prometheus clients (default is 3s)
  response_timeout = "10s"

  ## Optional TLS Config
  # tls_ca = /path/to/cafile
  # tls_cert = /path/to/certfile
  # tls_key = /path/to/keyfile
  ## Use TLS but skip chain & host verification
  # insecure_skip_verify = false

  ## Excluding useless tags
  tagexclude = ["openshift.io/*","prometheus.io/*","kubernetes.io/*","url","deployment","deploymentConfig","address","host","app","exception","outcome"]

Logs from Telegraf

2022-03-30T22:47:43Z E! [inputs.prometheus] Error in plugin: error making HTTP request to http://10.134.3.200:8080/actuator/prometheus: Get "http://10.134.3.200:8080/actuator/prometheus": dial tcp 10.134.3.200:8080: connect: no route to host
2022-03-30T22:47:43Z E! [inputs.prometheus] Error in plugin: error making HTTP request to http://10.134.5.70:8081/actuator/prometheus: Get "http://10.134.5.70:8081/actuator/prometheus": dial tcp 10.134.5.70:8081: connect: no route to host
2022-03-30T22:47:43Z E! [inputs.prometheus] Error in plugin: error making HTTP request to http://10.133.2.165:8080/actuator/prometheus: Get "http://10.133.2.165:8080/actuator/prometheus": dial tcp 10.133.2.165:8080: connect: no route to host
2022-03-30T22:47:43Z E! [inputs.prometheus] Error in plugin: error making HTTP request to http://10.134.3.206:8081/actuator/prometheus: Get "http://10.134.3.206:8081/actuator/prometheus": dial tcp 10.134.3.206:8081: connect: no route to host
2022-03-30T22:47:43Z E! [inputs.prometheus] Error in plugin: error making HTTP request to http://10.132.11.127:8080/actuator/prometheus: Get "http://10.132.11.127:8080/actuator/prometheus": dial tcp 10.132.11.127:8080: connect: no route to host

System info

Telegraf 1.20.4, OpenShift v3.11.570 (Kubernetes v1.11.0+d4cacc0)

Docker

No response

Steps to reproduce

  1. Deploy Telegraf in OpenShift with the prometheus input plugin and the namespace setting (monitor_kubernetes_pods_namespace)
  2. Telegraf discovers the pods and retrieves their metrics
  3. Redeploy the pods a few times

Expected behavior

  1. Deploy Telegraf in OpenShift with the prometheus input plugin and the namespace setting (monitor_kubernetes_pods_namespace)
  2. Telegraf discovers the pods and retrieves their metrics
  3. Redeploy the pods a few times
  4. Telegraf drops the IPs of the old pods and picks up the IPs of the new pods

Actual behavior

  1. Deploy Telegraf in OpenShift with the prometheus input plugin and the namespace setting (monitor_kubernetes_pods_namespace)
  2. Telegraf discovers the pods and retrieves their metrics
  3. Redeploy the pods a few times
  4. Telegraf keeps trying to retrieve metrics from the old pods, doesn't detect the new pods, and shows this error:
    2022-03-30T22:47:43Z E! [inputs.prometheus] Error in plugin: error making HTTP request to http://10.132.11.127:8080/actuator/prometheus: Get "http://10.132.11.127:8080/actuator/prometheus": dial tcp 10.132.11.127:8080: connect: no route to host

Additional info

No response

@simoelmou simoelmou added the bug unexpected problem or unintended behavior label Apr 1, 2022
@MyaLongmire
Contributor

@simoelmou will you please test PR #10932?

@simoelmou
Author

simoelmou commented Apr 6, 2022

@MyaLongmire
We now get these errors, which we didn't have with 1.21.x:

W0406 15:54:35.155507       1 reflector.go:324] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: failed to list *v1.Pod: pods is forbidden: User "system:serviceaccount:namespace:telegraf" cannot list pods at the cluster scope: no RBAC policy matched
E0406 15:54:35.155894       1 reflector.go:138] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: pods is forbidden: User "system:serviceaccount:namespace:telegraf" cannot list pods at the cluster scope: no RBAC policy matched

It seems the monitor_kubernetes_pods_namespace parameter is not used in PR #10932.

@shubrajp

shubrajp commented Apr 7, 2022

Hi @simoelmou,

Can you check if you have permissions to get pods:
kubectl auth can-i get pods --as=system:serviceaccount:namespace:telegraf

If you get "no", you might have to edit the ClusterRole for the telegraf agent:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
...
rules:
  - verbs:
      - list
      - watch
      - get
    apiGroups:
      - ''
    resources:
      - pods
      - nodes
      - endpoints
...

@simoelmou
Author

Hi @shubrajp

➜ oc auth can-i get pods --as=system:serviceaccount:namespace:telegraf
yes

Why do we need to retrieve pods at the cluster scope if monitor_kubernetes_pods_namespace is set? The plugin only needs to retrieve pods at the namespace scope.

@shubrajp

shubrajp commented Apr 8, 2022

Ahh... Right...
Looks like this is a limitation of the informer. See: kubernetes-sigs/controller-runtime#124

As to "retrieve pods at the cluster scope if monitor_kubernetes_pods_namespace is enabled":
It is possible to add a check on the pod's namespace, so that unregisterPod(...) and registerPod(...) are only triggered for the given namespace. Do you think that will work?
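
A minimal sketch of the namespace check suggested above, using only the Go standard library. The pod and registry types and the onAdd / onDelete handlers are hypothetical stand-ins, not Telegraf's or client-go's actual types. Note that a check like this only filters events after they arrive; it does not change the scope at which the informer lists pods, so it would not by itself remove the need for cluster-scope permissions.

```go
package main

import "fmt"

// pod is a hypothetical stand-in holding only the fields this check needs;
// it is not the client-go Pod type.
type pod struct {
	Name      string
	Namespace string
}

// registry tracks scrape targets keyed by "namespace/name".
type registry struct {
	namespace string // empty means "all namespaces"
	targets   map[string]bool
}

func newRegistry(namespace string) *registry {
	return &registry{namespace: namespace, targets: make(map[string]bool)}
}

// wanted reports whether the pod belongs to the monitored namespace.
func (r *registry) wanted(p pod) bool {
	return r.namespace == "" || p.Namespace == r.namespace
}

// onAdd registers the pod only if it passes the namespace check.
func (r *registry) onAdd(p pod) {
	if !r.wanted(p) {
		return
	}
	r.targets[p.Namespace+"/"+p.Name] = true
}

// onDelete unregisters the pod only if it passes the namespace check.
func (r *registry) onDelete(p pod) {
	if !r.wanted(p) {
		return
	}
	delete(r.targets, p.Namespace+"/"+p.Name)
}

func main() {
	r := newRegistry("my-namespace")
	r.onAdd(pod{Name: "app-1", Namespace: "my-namespace"})
	r.onAdd(pod{Name: "other-1", Namespace: "kube-system"}) // ignored
	r.onDelete(pod{Name: "app-1", Namespace: "my-namespace"})
	r.onAdd(pod{Name: "app-2", Namespace: "my-namespace"})
	fmt.Println(len(r.targets), r.targets["my-namespace/app-2"])
}
```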

@shubrajp

shubrajp commented Apr 8, 2022

@simoelmou,
The PR is updated... Can you please give it another try?

@simoelmou
Author

@shubrajp
I used the build artifact of commit ~175ec58bfe90e324d3d12a0065b21b654f289e88 but I still get the same error:

W0406 15:54:35.155507       1 reflector.go:324] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: failed to list *v1.Pod: pods is forbidden: User "system:serviceaccount:namespace:telegraf" cannot list pods at the cluster scope: no RBAC policy matched
E0406 15:54:35.155894       1 reflector.go:138] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: pods is forbidden: User "system:serviceaccount:namespace:telegraf" cannot list pods at the cluster scope: no RBAC policy matched

@shubrajp

shubrajp commented Apr 8, 2022

Hi @simoelmou,

Can you please try this:
oc auth can-i list pods --as=system:serviceaccount:namespace:telegraf (this time it is list instead of get)

Also can you share the ClusterRole file for telegraf?

@simoelmou
Author

@shubrajp

➜ oc auth can-i list pods --as=system:serviceaccount:namespace:telegraf
yes

The telegraf service account has a Role in the namespace, not a ClusterRole, with the verbs [get list watch] on the pods resource.
I'm not an administrator of the cluster, so I'm unable to create a ClusterRole. The plugin is useful for operators without administration permissions.

@powersj
Contributor

powersj commented Apr 8, 2022

Ah shoot, I didn't see this back and forth before I merged. Do we still need to land additional fixes, or even revert the change?

@shubrajp

shubrajp commented Apr 8, 2022

@simoelmou...

➜ oc auth can-i list pods --as=system:serviceaccount:namespace:telegraf
yes

and

User "system:serviceaccount:namespace:telegraf" cannot list pods at the cluster scope: no RBAC policy matched 

are conflicting 😅

I had a similar issue:

pods "exporter-5d6bfd8696-qb9t9" is forbidden: User "system:serviceaccount:system:telegraf-agent" cannot get resource "pods" in API group "" in the namespace "system"

and

kubectl auth can-i get pods --as=system:serviceaccount:system:telegraf-agent -n system
no 

Adding get to the verbs under the ClusterRole resolved the issue for me.

@simoelmou
Author

@shubrajp
It seems the PR now requires a ClusterRole, which was not necessary before.

➜ oc project
Using project "namespace" on server "..."
➜ oc auth can-i list pods --as=system:serviceaccount:namespace:telegraf
yes
➜ oc auth can-i list pods --as=system:serviceaccount:namespace:telegraf --all-namespaces
no - no RBAC policy matched

I don't understand why the PR should introduce a permissions error on the user's end if the user has set the namespace configuration monitor_kubernetes_pods_namespace.

Can you please test the PR with a Role instead of a ClusterRole? I think you'll be able to reproduce the error on your end.

@shubrajp

shubrajp commented Apr 8, 2022

@simoelmou,
As I pointed out before, this is a limitation of the informer: kubernetes-sigs/controller-runtime#124
Currently the ListWatch for the cache's informers is non-namespaced.

	informerfactory := informers.NewSharedInformerFactory(clientset, resyncinterval)

	podinformer := informerfactory.Core().V1().Pods()
	podinformer.Informer().AddEventHandler(...)

So it has to list all the pods and then filter them by namespace and by label / field selectors.

@shubrajp

shubrajp commented Apr 8, 2022

FYI
The previous watch was namespaced, but it timed out after 30 minutes.

	watcher, err := client.CoreV1().Pods(p.PodNamespace).Watch(ctx, metav1.ListOptions{
		LabelSelector: p.KubernetesLabelSelector,
		FieldSelector: p.KubernetesFieldSelector,
	})

Any pod changes after those 30 minutes weren't reported.
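
To illustrate the failure mode described above, here is a minimal, standard-library-only sketch of a consumer that opens a new watch whenever the current one ends, so events from later sessions are not lost. The openWatch and watchAll functions are hypothetical and only simulate a watch with channels; this is not what Telegraf does (the merged PR moved to an informer instead), and a real implementation would also need to re-list pods to catch changes made between sessions.

```go
package main

import "fmt"

// openWatch is a hypothetical stand-in for starting a namespaced watch.
// It returns a channel that delivers the given events and is then closed,
// the way a watch ends once its timeout expires.
func openWatch(events []string) <-chan string {
	ch := make(chan string, len(events))
	for _, e := range events {
		ch <- e
	}
	close(ch)
	return ch
}

// watchAll consumes one watch session after another. sessions[i] holds the
// events delivered by the i-th watch. A consumer that stopped after the
// first session would miss everything in the later ones.
func watchAll(sessions [][]string) []string {
	var seen []string
	for _, events := range sessions {
		for e := range openWatch(events) {
			seen = append(seen, e)
		}
		// The channel closed: loop around and open a new watch.
	}
	return seen
}

func main() {
	got := watchAll([][]string{
		{"ADDED pod-a"},
		{"DELETED pod-a", "ADDED pod-b"},
	})
	fmt.Println(got)
}
```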

@simoelmou
Author

@powersj

Ah, thanks for the link to kubernetes-sigs/controller-runtime#124. It's unfortunate to have this limitation for non-admins, but it is not related to Telegraf.
The issue can be closed.

@powersj
Contributor

powersj commented Apr 8, 2022

@simoelmou, @shubrajp,

Thank you both for the back and forth. Is there any additional documentation that we should add to the plugin to aid others who may come across this?

Thanks

@shubrajp

@powersj,

We can add links about the informer and how it works.
Also, a note that telegraf must be able to list/get the pods.
We can include the "kubectl auth can-i" commands so anyone can verify their permissions.

Justification for the change in permissions:
The earlier implementation got the pod object from events, so these verbs (list, get) were not needed.
The new implementation gets the name and namespace, and uses them to list / get the pod.

The rest of the functionality is the same.
"register" / "unregister" are called only for pods that pass the checks based on namespace and label / field selectors.

@powersj
Contributor

powersj commented Apr 11, 2022

@shubrajp it looks like 2269ff1 already added a note to the README about needing a ClusterRole, so all we would want to add is the kubectl commands for debug/triage? If so, would you be willing to add a debug section under the ### Kubernetes scraping section with those commands?

Thanks again!

4 participants