CPU utilisation hitting CPU limit #10878
Comments
Hi,
If requests were stacking up over time I would expect a possible upward trend and maybe not a direct spike. While I do not doubt there are possible issues with the prometheus input, we need an actionable, specific issue to make any changes here. Thanks
Hi powersj...
We have tried it with different numbers of pods and different cpu request and cpu limit values, but no luck. We have narrowed it down to watchPod() as the culprit.
If I add a big enough timeout (say a week) under ListOptions, I don't see this spike.
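(For context, here is a minimal, hypothetical sketch of what setting a server-side watch timeout via ListOptions can look like with client-go. This is not the actual Telegraf code; the package name podwatch, the function watchPodsWithTimeout, and the one-week value mirroring the comment above are illustrative only.)

// Illustrative sketch, not Telegraf's watchPod: a pod watch started with
// TimeoutSeconds set, so the API server closes the watch after the timeout
// and the caller can re-establish it instead of holding one watch forever.
package podwatch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func watchPodsWithTimeout(ctx context.Context, client kubernetes.Interface, ns string) error {
	timeout := int64(7 * 24 * 60 * 60) // e.g. one week, as tried in the comment above
	watcher, err := client.CoreV1().Pods(ns).Watch(ctx, metav1.ListOptions{
		TimeoutSeconds: &timeout,
	})
	if err != nil {
		return err
	}
	defer watcher.Stop()

	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case event, ok := <-watcher.ResultChan():
			if !ok {
				// Server closed the watch (timeout or connection drop);
				// return so the caller can start a fresh watch.
				return nil
			}
			fmt.Printf("pod event: %s\n", event.Type)
		}
	}
}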
@MyaLongmire
@shubrajp Yay! Thank you for testing it out and reporting back to us :)
Fixed with #10932
Relevant telegraf.conf
Logs from Telegraf
2022-03-22T10:30:48Z I! Starting Telegraf 1.21.4
2022-03-22T10:30:48Z I! Using config file: /etc/telegraf/telegraf.conf
2022-03-22T10:30:48Z I! Loaded inputs: internal prometheus
2022-03-22T10:30:48Z I! Loaded aggregators:
2022-03-22T10:30:48Z I! Loaded processors:
2022-03-22T10:30:48Z I! Loaded outputs: prometheus_client
2022-03-22T10:30:48Z I! Tags enabled: host=[...]
2022-03-22T10:30:48Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"[...]", Flush Interval:1m10s
2022-03-22T10:30:48Z D! [agent] Initializing plugins
2022-03-22T10:30:48Z D! [agent] Connecting outputs
2022-03-22T10:30:48Z D! [agent] Attempting connection to [outputs.prometheus_client]
2022-03-22T10:30:48Z I! [outputs.prometheus_client] Listening on http://[::]:9273/metrics
2022-03-22T10:30:48Z D! [agent] Successfully connected to outputs.prometheus_client
2022-03-22T10:30:48Z D! [agent] Starting service inputs
System info
Telegraf 1.21.4
Docker
resources:
limits:
cpu: 100m
memory: 1Gi
requests:
cpu: 10m
memory: 40Mi
Steps to reproduce
kubectl top pod | grep telegraf
Expected behavior
CPU util % remains nearly constant
Actual behavior
CPU util % spikes up after some time
Additional info
Profile (after cpu increments):
(pprof) top
Showing nodes accounting for 4590ms, 92.17% of 4980ms total
Dropped 29 nodes (cum <= 24.90ms)
Showing top 10 nodes out of 33
flat flat% sum% cum cum%
1090ms 21.89% 21.89% 1090ms 21.89% runtime.futex
860ms 17.27% 39.16% 860ms 17.27% runtime.lock2
630ms 12.65% 51.81% 630ms 12.65% runtime.unlock2
560ms 11.24% 63.05% 2320ms 46.59% runtime.chanrecv
470ms 9.44% 72.49% 500ms 10.04% sync.(*Mutex).Unlock (inline)
450ms 9.04% 81.53% 970ms 19.48% context.(*cancelCtx).Done
200ms 4.02% 85.54% 3740ms 75.10% github.com/influxdata/telegraf/plugins/inputs/prometheus.(*Prometheus).watchPod
120ms 2.41% 87.95% 120ms 2.41% runtime.memclrNoHeapPointers
110ms 2.21% 90.16% 230ms 4.62% runtime.typedmemclr
100ms 2.01% 92.17% 310ms 6.22% runtime.selectnbrecv
I've looked at the profile multiple times after CPU spikes; watchPod is consistently present.
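(For reference, a profile like this can be captured, assuming Telegraf was started with its --pprof-addr flag, by pointing the standard Go tooling at the exposed endpoint, e.g. `go tool pprof http://localhost:6060/debug/pprof/profile`, and then running `top` at the `(pprof)` prompt. The exact address is an assumption; use whatever was passed to --pprof-addr.)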
Possible culprit:
https://github.com/influxdata/telegraf/blob/master/plugins/inputs/prometheus/kubernetes.go#L108
There is no exit from the infinite for loop.
It never hits the "<-ctx.Done()" case.
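(To illustrate the failure mode described above: this is a simplified, hypothetical reconstruction, not the code at kubernetes.go#L108. Once the API server closes the watch (the related issue below notes this happens after roughly 30 minutes), the receive on ResultChan() fires immediately on every iteration with ok == false, and since the context is not cancelled during normal operation the <-ctx.Done() case never runs, so the loop spins at full CPU.)

// Hypothetical reconstruction of the problematic pattern, for illustration only.
package podwatch

import (
	"context"

	"k8s.io/apimachinery/pkg/watch"
)

func buggyWatchLoop(ctx context.Context, watcher watch.Interface) {
	for {
		select {
		case <-ctx.Done():
			// Only reached if the context is cancelled, which does not
			// happen during normal operation.
			return
		case event, ok := <-watcher.ResultChan():
			if !ok {
				// The watch channel was closed by the server. Without a
				// return or re-watch here, this branch fires on every
				// iteration and the loop busy-spins, burning CPU.
				continue
			}
			_ = event // handle pod add/update/delete events
		}
	}
}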
Related Issue:
#10148
Here watchPod stops monitoring pod changes (new pod creation / pod deletion / etc.) after 30 minutes.