Prometheus plugin #10928

simoelmou · 2022-04-01T07:52:51Z

Relevant telegraf.conf

[[inputs.prometheus]]
  ## An array of urls to scrape metrics from.
  # urls = ["$URL"]

  ## An array of Kubernetes services to scrape metrics from.
  # kubernetes_services = ["http://my-service-dns.my-namespace:9100/metrics" ]

  ## Kubernetes config file to create client from.
  # kube_config = "/path/to/kubernetes.config"

  ## Scrape Kubernetes pods for the following prometheus annotations:
  ## - prometheus.io/scrape: Enable scraping for this pod
  ## - prometheus.io/scheme: If the metrics endpoint is secured then you will need to
  ##     set this to 'https' & most likely set the tls config.
  ## - prometheus.io/path: If the metrics path is not /metrics, define it with this annotation.
  ## - prometheus.io/port: If port is not 9102 use this annotation
  monitor_kubernetes_pods = true

  ## Restricts Kubernetes monitoring to a single namespace
  monitor_kubernetes_pods_namespace = "$NAMESPACE"

  ## Use bearer token for authorization
  # bearer_token = /path/to/bearer/token

  ## Specify timeout duration for slower prometheus clients (default is 3s)
  response_timeout = "10s"

  ## Optional TLS Config
  # tls_ca = /path/to/cafile
  # tls_cert = /path/to/certfile
  # tls_key = /path/to/keyfile
  ## Use TLS but skip chain & host verification
  # insecure_skip_verify = false

  ## Excluding useless tags
  tagexclude = ["openshift.io/*","prometheus.io/*","kubernetes.io/*","url","deployment","deploymentConfig","address","host","app","exception","outcome"]
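
To make the annotation defaults in the config comments concrete, here is a minimal Go sketch of how a scrape URL could be derived from a pod IP and its prometheus.io/* annotations. The names podAnnotations, scrapeURL, and valueOr are hypothetical, and so are the assumptions that scraping is enabled by the literal value "true" and that the scheme defaults to http. This illustrates the documented defaults only; it is not Telegraf's actual implementation.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
)

// podAnnotations is a hypothetical stand-in for a pod's metadata annotations.
type podAnnotations map[string]string

// scrapeURL builds the metrics URL for a pod IP from the prometheus.io/*
// annotations. It falls back to the defaults named in the sample config
// (path /metrics, port 9102) and assumes plain http when no scheme is set.
// It returns false when the pod has not opted in via prometheus.io/scrape.
func scrapeURL(ip string, a podAnnotations) (string, bool) {
	if a["prometheus.io/scrape"] != "true" { // assumed opt-in value
		return "", false
	}
	scheme := valueOr(a["prometheus.io/scheme"], "http")
	path := valueOr(a["prometheus.io/path"], "/metrics")
	port := valueOr(a["prometheus.io/port"], "9102")
	u := url.URL{Scheme: scheme, Host: net.JoinHostPort(ip, port), Path: path}
	return u.String(), true
}

// valueOr returns def when v is empty.
func valueOr(v, def string) string {
	if v == "" {
		return def
	}
	return v
}

func main() {
	u, ok := scrapeURL("10.134.3.200", podAnnotations{
		"prometheus.io/scrape": "true",
		"prometheus.io/path":   "/actuator/prometheus",
		"prometheus.io/port":   "8080",
	})
	fmt.Println(u, ok) // http://10.134.3.200:8080/actuator/prometheus true
}
```

The example annotations reproduce the URL shape seen in the logs of this report: a non-default path and port on a pod IP.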

Logs from Telegraf

2022-03-30T22:47:43Z E! [inputs.prometheus] Error in plugin: error making HTTP request to http://10.134.3.200:8080/actuator/prometheus: Get "http://10.134.3.200:8080/actuator/prometheus": dial tcp 10.134.3.200:8080: connect: no route to host
2022-03-30T22:47:43Z E! [inputs.prometheus] Error in plugin: error making HTTP request to http://10.134.5.70:8081/actuator/prometheus: Get "http://10.134.5.70:8081/actuator/prometheus": dial tcp 10.134.5.70:8081: connect: no route to host
2022-03-30T22:47:43Z E! [inputs.prometheus] Error in plugin: error making HTTP request to http://10.133.2.165:8080/actuator/prometheus: Get "http://10.133.2.165:8080/actuator/prometheus": dial tcp 10.133.2.165:8080: connect: no route to host
2022-03-30T22:47:43Z E! [inputs.prometheus] Error in plugin: error making HTTP request to http://10.134.3.206:8081/actuator/prometheus: Get "http://10.134.3.206:8081/actuator/prometheus": dial tcp 10.134.3.206:8081: connect: no route to host
2022-03-30T22:47:43Z E! [inputs.prometheus] Error in plugin: error making HTTP request to http://10.132.11.127:8080/actuator/prometheus: Get "http://10.132.11.127:8080/actuator/prometheus": dial tcp 10.132.11.127:8080: connect: no route to host

System info

Telegraf 1.20.4, OpenShift v3.11.570 (Kubernetes v1.11.0+d4cacc0)

Docker

No response

Steps to reproduce

1. Deploy Telegraf in OpenShift with the prometheus input plugin and the namespace setting (monitor_kubernetes_pods_namespace).
2. Telegraf discovers the pods and retrieves their metrics.
3. Redeploy the pods a few times.

Expected behavior

1. Deploy Telegraf in OpenShift with the prometheus input plugin and the namespace setting (monitor_kubernetes_pods_namespace).
2. Telegraf discovers the pods and retrieves their metrics.
3. Redeploy the pods a few times.
4. Telegraf drops the IPs of the old pods and picks up the IPs of the new pods.

Actual behavior

1. Deploy Telegraf in OpenShift with the prometheus input plugin and the namespace setting (monitor_kubernetes_pods_namespace).
2. Telegraf discovers the pods and retrieves their metrics.
3. Redeploy the pods a few times.
4. Telegraf keeps trying to retrieve metrics from the old pods, does not detect the new pods, and logs this error:
2022-03-30T22:47:43Z E! [inputs.prometheus] Error in plugin: error making HTTP request to http://10.132.11.127:8080/actuator/prometheus: Get "http://10.132.11.127:8080/actuator/prometheus": dial tcp 10.132.11.127:8080: connect: no route to host

Additional info

No response


MyaLongmire · 2022-04-04T15:37:45Z

@simoelmou will you please test pr #10932?

simoelmou · 2022-04-06T13:58:07Z

@MyaLongmire
We have these errors now that we didn't have with 1.21.x :

W0406 15:54:35.155507       1 reflector.go:324] pkg/mod/k8s.io/client-go@.../tools/cache/reflector.go:167: failed to list *v1.Pod: pods is forbidden: User "system:serviceaccount:namespace:telegraf" cannot list pods at the cluster scope: no RBAC policy matched
E0406 15:54:35.155894       1 reflector.go:138] pkg/mod/k8s.io/client-go@.../tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: pods is forbidden: User "system:serviceaccount:namespace:telegraf" cannot list pods at the cluster scope: no RBAC policy matched

It seems the monitor_kubernetes_pods_namespace parameter is not used with PR #10932.

shubrajp · 2022-04-07T18:00:37Z

Hi @simoelmou,

Can you check if you have permissions to get pods:
kubectl auth can-i get pods --as=system:serviceaccount:namespace:telegraf

If you get "no", you might have to edit the ClusterRole for the Telegraf agent:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
...
rules:
  - verbs:
      - list
      - watch
      - get
    apiGroups:
      - ''
    resources:
      - pods
      - nodes
      - endpoints
...

simoelmou · 2022-04-08T08:24:50Z

Hi @shubrajp

➜ oc auth can-i get pods --as=system:serviceaccount:namespace:telegraf
yes

Why do we need to retrieve pods at the cluster scope if monitor_kubernetes_pods_namespace is set? Telegraf only needs to retrieve pods at the namespace scope.

shubrajp · 2022-04-08T08:44:51Z

Ahh, right.
It looks like this is a limitation of the informer. See kubernetes-sigs/controller-runtime#124

As to "retrieve pods at the cluster scope if monitor_kubernetes_pods_namespace is enabled":
It is possible to add a check on the pod's namespace so that unregisterPod(...) and registerPod(...) are only triggered for the given namespace. Do you think that would work?
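
The namespace check proposed here could look roughly like the following Go sketch. The pod and registry types and the onAdd/onDelete handlers are hypothetical stand-ins, not the plugin's real registerPod/unregisterPod code. Note that the check runs on the client after an event has already been delivered, so on its own it would not reduce the permissions needed to list or watch pods.

```go
package main

import "fmt"

// pod is a hypothetical, trimmed-down stand-in for a Kubernetes pod object.
type pod struct{ Namespace, Name, IP string }

// registry tracks scrape targets keyed by "namespace/name".
type registry struct {
	namespace string            // empty means "all namespaces"
	targets   map[string]string // key -> pod IP
}

func newRegistry(namespace string) *registry {
	return &registry{namespace: namespace, targets: map[string]string{}}
}

// inScope reports whether the pod belongs to the monitored namespace.
func (r *registry) inScope(p pod) bool {
	return r.namespace == "" || p.Namespace == r.namespace
}

// onAdd registers a pod as a scrape target if it is in scope.
func (r *registry) onAdd(p pod) {
	if !r.inScope(p) {
		return
	}
	r.targets[p.Namespace+"/"+p.Name] = p.IP
}

// onDelete drops the pod's target so stale IPs are no longer scraped.
func (r *registry) onDelete(p pod) {
	if !r.inScope(p) {
		return
	}
	delete(r.targets, p.Namespace+"/"+p.Name)
}

func main() {
	r := newRegistry("namespace")
	r.onAdd(pod{"namespace", "app-1", "10.134.3.200"})
	r.onAdd(pod{"other", "db-1", "10.1.1.1"}) // ignored: outside the namespace
	r.onDelete(pod{"namespace", "app-1", "10.134.3.200"})
	r.onAdd(pod{"namespace", "app-2", "10.134.5.70"})
	fmt.Println(r.targets)
}
```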

shubrajp · 2022-04-08T10:12:27Z

@simoelmou,
The PR is updated... Can you please give it another try?

simoelmou · 2022-04-08T10:51:08Z

@shubrajp
I used the build artifact of the commit ~175ec58bfe90e324d3d12a0065b21b654f289e88, but I get the same error:

W0406 15:54:35.155507       1 reflector.go:324] pkg/mod/k8s.io/client-go@.../tools/cache/reflector.go:167: failed to list *v1.Pod: pods is forbidden: User "system:serviceaccount:namespace:telegraf" cannot list pods at the cluster scope: no RBAC policy matched
E0406 15:54:35.155894       1 reflector.go:138] pkg/mod/k8s.io/client-go@.../tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: pods is forbidden: User "system:serviceaccount:namespace:telegraf" cannot list pods at the cluster scope: no RBAC policy matched

shubrajp · 2022-04-08T11:13:50Z

Hi @simoelmou,

Can you please try this:
oc auth can-i list pods --as=system:serviceaccount:namespace:telegraf (this time it is list instead of get)

Also can you share the ClusterRole file for telegraf?

simoelmou · 2022-04-08T14:08:32Z

@shubrajp

➜ oc auth can-i list pods --as=system:serviceaccount:namespace:telegraf
yes

The telegraf service account has a namespaced Role, not a ClusterRole, with the verbs [get list watch] on the pods resource.
I'm not an administrator of the cluster, so I cannot create a ClusterRole. The plugin is useful for operators who have no administration permissions.

powersj · 2022-04-08T14:15:30Z

Ah shoot, I didn't see this back and forth before I merged. Do we still need to land additional fixes, or even revert the change?

shubrajp · 2022-04-08T14:17:24Z

@simoelmou...

➜ oc auth can-i list pods --as=system:serviceaccount:namespace:telegraf
yes

and

User "system:serviceaccount:namespace:telegraf" cannot list pods at the cluster scope: no RBAC policy matched

are conflicting 😅

I had a similar issue:

pods "exporter-5d6bfd8696-qb9t9" is forbidden: User "system:serviceaccount:system:telegraf-agent" cannot get resource "pods" in API group "" in the namespace "system"

and

kubectl auth can-i get pods --as=system:serviceaccount:system:telegraf-agent -n system
no

Adding get to verbs under ClusterRole resolved the issue for me...

simoelmou · 2022-04-08T14:52:35Z

@shubrajp
It seems the PR now requires a ClusterRole, which was not necessary before.

➜ oc project
Using project "namespace" on server "..."
➜ oc auth can-i list pods --as=system:serviceaccount:namespace:telegraf
yes
➜ oc auth can-i list pods --as=system:serviceaccount:namespace:telegraf --all-namespaces
no - no RBAC policy matched

I don't understand why the PR should introduce a permissions error on the user's end when the user has set monitor_kubernetes_pods_namespace.

Can you please test the PR with a Role instead of a ClusterRole? I think you'll be able to reproduce the error on your end.

shubrajp · 2022-04-08T14:59:26Z

@simoelmou,
As I pointed out before, this is a limitation of the informer: kubernetes-sigs/controller-runtime#124
Currently the ListWatch for the cache's informers is non-namespaced.

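	// Shared informer factory created without a namespace option,
	// so it lists and watches pods across the whole cluster.
	informerFactory := informers.NewSharedInformerFactory(clientset, resyncInterval)

	podInformer := informerFactory.Core().V1().Pods()
	podInformer.Informer().AddEventHandler(...)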

So, it has to list all the pods then filter things out based on Namespace, Label / Field Selectors.
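
As a rough illustration of the client-side filtering described here, the Go sketch below matches a pod's labels against an equality-only selector string. matchesSelector is a hypothetical helper written for this example. Real Kubernetes label selectors also support "!=" and set-based expressions, and this sketch simply treats anything it does not understand as non-matching.

```go
package main

import (
	"fmt"
	"strings"
)

// matchesSelector reports whether labels satisfy an equality-only selector
// such as "app=web,tier=frontend" ("==" is accepted as well). Requirements
// in any other form are treated as non-matching by this sketch.
func matchesSelector(selector string, labels map[string]string) bool {
	if strings.TrimSpace(selector) == "" {
		return true // an empty selector matches every pod
	}
	for _, req := range strings.Split(selector, ",") {
		key, want, ok := strings.Cut(strings.TrimSpace(req), "=")
		if !ok {
			return false // no "=" at all: unsupported requirement
		}
		want = strings.TrimPrefix(want, "=") // tolerate "key==value"
		if got, found := labels[key]; !found || got != want {
			return false
		}
	}
	return true
}

func main() {
	labels := map[string]string{"app": "web", "tier": "frontend"}
	fmt.Println(matchesSelector("app=web", labels))
	fmt.Println(matchesSelector("app=web,tier=backend", labels))
}
```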

shubrajp · 2022-04-08T15:01:58Z

FYI
The previous watch was namespaced, but it timed out after 30 minutes.

	watcher, err := client.CoreV1().Pods(p.PodNamespace).Watch(ctx, metav1.ListOptions{
		LabelSelector: p.KubernetesLabelSelector,
		FieldSelector: p.KubernetesFieldSelector,
	})

Any pod changes after 30 minutes weren't reported.
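
One general way to keep receiving changes after a watch session ends is to open a new session whenever the previous one closes. The Go sketch below shows that loop in isolation, using only the standard library. The event type, watchFunc, watchForever, and replay are all hypothetical names, and the sessions are simulated rather than real Kubernetes watches. A production version would also need to re-list pods between sessions to catch changes that happened while no watch was open, which is part of what client-go informers take care of.

```go
package main

import (
	"context"
	"fmt"
)

// event is a hypothetical stand-in for a pod change notification.
type event struct{ Type, Pod string }

// watchFunc opens one watch session. The returned channel is closed when the
// session ends, for example because a server-side timeout expired.
type watchFunc func(ctx context.Context) (<-chan event, error)

// watchForever re-opens the watch each time a session ends, so changes made
// after the first session's timeout still reach handle. It returns when ctx
// is cancelled or a session cannot be opened.
func watchForever(ctx context.Context, open watchFunc, handle func(event)) error {
	for {
		ch, err := open(ctx)
		if err != nil {
			return err
		}
	session:
		for {
			select {
			case <-ctx.Done():
				return ctx.Err()
			case ev, ok := <-ch:
				if !ok {
					break session // session ended: open a new one
				}
				handle(ev)
			}
		}
	}
}

// replay runs watchForever against pre-recorded sessions and returns every
// event that reached the handler. It exists only to demonstrate the loop.
func replay(sessions [][]event) []event {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	next := 0
	open := func(ctx context.Context) (<-chan event, error) {
		if next == len(sessions) {
			cancel() // nothing left to simulate: stop the loop
			return nil, ctx.Err()
		}
		ch := make(chan event, len(sessions[next]))
		for _, ev := range sessions[next] {
			ch <- ev
		}
		close(ch)
		next++
		return ch, nil
	}
	var got []event
	_ = watchForever(ctx, open, func(ev event) { got = append(got, ev) })
	return got
}

func main() {
	got := replay([][]event{
		{{"ADDED", "app-1"}},
		{{"DELETED", "app-1"}, {"ADDED", "app-2"}},
	})
	fmt.Println(len(got), "events delivered across 2 sessions")
}
```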

simoelmou · 2022-04-08T15:25:35Z

@powersj

Ah, thanks for the link to kubernetes-sigs/controller-runtime#124. It's unfortunate to have this limitation for non-admins, but it is not related to Telegraf.
The issue can be closed.

powersj · 2022-04-08T20:44:37Z

@simoelmou, @shubrajp,

Thank you both for the back and forth. Is there any additional documentation that we should add to the plugin to aid others who may come across this?

Thanks

shubrajp · 2022-04-10T15:24:50Z

@powersj,

We can add links about the informer and how it works.
We should also note that Telegraf must be able to list/get the pods.
We can link to the "kubectl auth can-i" commands so that anyone can verify their permissions.

Justification for the change in permissions:
In the earlier implementation, the pod object came directly from the events, so these verbs (list, get) were not needed.
In the new implementation, the plugin receives the name and namespace, which it then has to use to list/get the pod.

The rest of the functionality is the same.
"register" / "unregister" are called only for pods that pass the checks based on namespace and label/field selectors.

powersj · 2022-04-11T18:56:00Z

@shubrajp looks like 2269ff1 added to the README about needing ClusterRole already, so all we would want to add is the kubectl commands for debug/triage? If so would you be willing to add a debug section under the ### Kubernetes scraping section with those commands?

Thanks again!

simoelmou added the bug unexpected problem or unintended behavior label Apr 1, 2022

telegraf-tiger bot added the area/prometheus label Apr 1, 2022

shubrajp mentioned this issue Apr 3, 2022

cpu utilisation hitting cpu limit #10878

Closed

mesksr mentioned this issue Apr 3, 2022

fix(inputs.prometheus): moved from watcher to informer #10932

Merged

3 tasks

simoelmou closed this as completed Apr 8, 2022

skrech mentioned this issue Apr 6, 2023

inputs.prometheus require cluster level permissions even when scoped to a namespace #12780

Closed
