Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix CPU system plugin that get stuck after suspend #3342

Merged
merged 1 commit into from
Oct 16, 2017

Conversation

PierreF
Copy link
Contributor

@PierreF PierreF commented Oct 16, 2017

  • Signed CLA.
  • Associated README.md updated.
  • Has appropriate unit tests.

This fix an issue when a Linux system is suspended, after resuming, Telegraf no longer send CPU metrics and log:

oct. 16 11:09:12 xps-pierref telegraf[8363]: 2017-10-16T09:09:12Z E! Error in plugin [inputs.cpu]: Error: current total CPU time is less than previous total CPU time
oct. 16 11:09:22 xps-pierref telegraf[8363]: 2017-10-16T09:09:22Z E! Error in plugin [inputs.cpu]: Error: current total CPU time is less than previous total CPU time
[...]

This occur after a suspend, because /proc/stat counter decrease ! [1]
On previous Telegraf version this issue did not occured because it updated lastStats and complained only ONCE. PR #3306 changed this behavior.

This PR restore the old behavior: If total CPU time decrease, still update lastStat (so next metrics gather should work) and complain.

[1]: Result of cat /proc/stat before and after a system suspend:

cpu  136963790 9176378 61945301 138764848 806738 0 2044968 0 0 0
cpu0 34341664 2336158 15383085 138710292 803998 0 682240 0 0 0
cpu1 34505950 2339404 15716869 17092 898 0 481592 0 0 0
cpu2 34076860 2258930 15290613 19244 944 0 518477 0 0 0
cpu3 34039314 2241884 15554733 18218 896 0 362658 0 0 0

## After a suspend of ~30 seconds

cpu  136964299 9176378 61945616 138711862 804044 0 2044977 0 0 0
cpu0 34341802 2336158 15383149 138710907 804014 0 682247 0 0 0
cpu1 34506062 2339404 15716962 303 30 0 481594 0 0 0
cpu2 34076987 2258930 15290694 325 0 0 518477 0 0 0
cpu3 34039446 2241884 15554810 326 0 0 362658 0 0 0

4th field (idle) and 5th field (iowait) of cpu1, cpu2, cpu3 are reseted to 0 after suspend. Which cause them for decrease.

@PierreF PierreF force-pushed the fix-cpu-times-decrease branch from 4b1953d to 57c75c5 Compare October 16, 2017 10:16
@danielnelson danielnelson added this to the 1.4.3 milestone Oct 16, 2017
@danielnelson danielnelson added the fix pr to fix corresponding bug label Oct 16, 2017
@danielnelson danielnelson merged commit f5a9d1b into influxdata:master Oct 16, 2017
danielnelson pushed a commit that referenced this pull request Oct 16, 2017
@danielnelson
Copy link
Contributor

Thank you, I'll include this in 1.4.3

@brookst
Copy link

brookst commented Nov 3, 2017

I upgraded from 1.4.2 to 1.4.3 to get this fix. Seems to work great for me. For reference; this fixes issue #721.

@ghost
Copy link

ghost commented Dec 28, 2017

I don't know If I'm able to re-open this issue but I am still seeing this problem. I have tried both telegraf 1.4.4 and 1.5.0-1 .

I see this error in the logs:

E! Error in plugin [inputs.cpu]: Error: current total CPU time is less than previous total CPU time

on telegraf version
Telegraf v1.4.4 (git: release-1.4 ddcb931)

This has started to happen since my Ubuntu host in AWS decided to briefly suspend. I still don't know why it suspended in the first place. It's the telegraf stats that alerted me to the problem.

The host is pending a reboot, which I will do tonight, but I've been uable to resolve the issue so far by re-installing different versions of telegraf.

@danielnelson
Copy link
Contributor

@richm-ww Do you see this message once or does it repeat each interval?

@ghost
Copy link

ghost commented Dec 28, 2017 via email

@krzysztof-indyk
Copy link
Contributor

krzysztof-indyk commented Dec 29, 2017

@danielnelson I have the same problem as @richm-ww. It started for couple of ours amazon linux AWS instances in about the same time. What is strange the instances was working when the problem arise.

Telegraf was updated to 1.5.0 but it didn't solve the problem. Only after rebooting instance telegraf starts working.

output from /proc/stat

cpu  144718161 824 31704023 1158093470 147685 0 2596711 47014973794 0 0
cpu0 75929654 429 16652163 576318362 62144 0 26748 92166828454 0 0
cpu1 68788506 394 15051860 581775108 85541 0 2569963 1799522552711 0 0
intr 5559820188 4955407 9 0 0 555 0 0 0 0 0 0 0 74 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 911664148 1047326976 0 0 5357230 0 918702874 138017838 0 0 5121124 0 273 1329090301 3716808 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ctxt 5150271662
btime 1507792915
processes 11644518
procs_running 5
procs_blocked 0
softirq 10977403649 0 1665140927 3302682450 2442900239 0 0 9512755 1565663427 0 1991503851

after couple of seconds

cpu  144718161 824 31704023 1158096908 147686 0 2596711 1609248466318 0 0
cpu0 75929654 429 16652163 576320096 62144 0 26748 1798613661078 0 0
cpu1 68788506 394 15051860 581776812 85541 0 2569963 1655309212611 0 0
intr 5559851263 4955816 9 0 0 555 0 0 0 0 0 0 0 74 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 911667128 1047330090 0 0 5357249 0 918706097 138018280 0 0 5121143 0 273 1329107562 3716822 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ctxt 5150288585
btime 1507792915
processes 11644560
procs_running 2
procs_blocked 0
softirq 10977439880 0 1665146263 3302692619 2442907724 0 0 9512780 1565668369 0 1991512125

after next couple of seconds

cpu0 75929654 429 16652163 576323146 62144 0 26748 1514864598836 0 0
cpu1 68788506 394 15051860 581779834 85542 0 2569963 1366684328260 0 0
intr 5559924750 4956571 9 0 0 555 0 0 0 0 0 0 0 74 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 911673635 1047337778 0 0 5357280 0 918712647 138019192 0 0 5121178 0 273 1329149883 3716844 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ctxt 5150327508
btime 1507792915
processes 11644624
procs_running 3
procs_blocked 0
softirq 10977517515 0 1665157423 3302717527 2442926121 0 0 9512844 1565678835 0 1991524765

@danielnelson
Copy link
Contributor

If I can properly count it looks like the decreasing column is cpu steal, and in fact it is going all over the place. You aren't suspending between these samples are you @krise3k?

Curious, what does top report as your cpu steal? Last item on line like this:

%Cpu(s):  3.5 us,  1.5 sy,  0.0 ni, 94.9 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st

@krzysztof-indyk
Copy link
Contributor

@danielnelson No I don't doing anything with this host.
top reports:
Cpu(s): 0.0%us, 0.0%sy, 0.0%ni,30000.0%id, 0.0%wa, 0.0%hi, 0.0%si,-150635151360.0%st

Three days ago we had the same problem with another instance.

@krzysztof-indyk
Copy link
Contributor

@danielnelson I wonder if adding option to ignore cpu steal would be a solution for this case. On cloud environment we can't do anything with host server

@danielnelson
Copy link
Contributor

I guess I would terminate and recreate any instance with -150635151360.0% cpu steal... this is just to weird for my taste.

I think what we should do is remove the error message and just report what /proc/stat shows. If that means the percents are negative like top is showing then at least you can see and alert on it.

@krzysztof-indyk
Copy link
Contributor

krzysztof-indyk commented Jan 3, 2018

First I thought that it's some kind of bug with reporting cpu steal but now I think that it's related to https://www.theregister.co.uk/2018/01/02/intel_cpu_design_flaw/ and https://twitter.com/jschauma/status/941447173245370368

@danielnelson
Copy link
Contributor

I've been thinking about my idea to remove the error message and report percentages, and I don't think we can do this. There are a couple known cases where the value will decrease, and I don't want to cause issues in these cases where the plugin can recover.

What does everyone think about waiting to see if this clears up on AWS and if it continues perhaps we could place guards around the logging message so that it will only log once. Another option would be to add an option to only report raw time, the opposite of collect_cpu_time, so those affected can use only the raw values?

@PierreF PierreF deleted the fix-cpu-times-decrease branch August 4, 2018 13:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fix pr to fix corresponding bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants