Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

boot_time() is wrong #658

Open
MauroMombelli opened this issue Jul 18, 2015 · 20 comments
Open

boot_time() is wrong #658

MauroMombelli opened this issue Jul 18, 2015 · 20 comments
Labels

Comments

@MauroMombelli
Copy link

hi,

on a system without HW clock (raspberry) the boot_time() give 45 years (since epoch), while "uptime" command from shell give correct results.

@giampaolo
Copy link
Owner

What OS is that? Linux?

@MauroMombelli
Copy link
Author

sorry, yes, arch linux arm, updated to latest version

@giampaolo
Copy link
Owner

Please paste the output of cat /proc/stat.

@MauroMombelli
Copy link
Author

cat /proc/stat
cpu 5896703 6291 435376 40166914 949451 381 55053 0 0 0
cpu0 5896703 6291 435376 40166914 949451 381 55053 0 0 0
intr 169712969 0 0 0 33510989 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 394073 158525 0 0 0 0 0 0 133113018 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 12 2 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 3 2536346 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ctxt 92586406
btime 1437869567
processes 7898
procs_running 1
procs_blocked 0
softirq 29134775 5943048 9746842 95726 2945587 142 0 7961860 0 22627 2418943

@giampaolo
Copy link
Owner

Uhm... that btime 1437869567 line looks correct:

>>> import datetime
>>> datetime.datetime.fromtimestamp(1437869567.0).strftime("%Y-%m-%d %H:%M:%S")
'2015-07-26 02:12:47'

Maybe your system clock was messed up?

@MauroMombelli
Copy link
Author

the system clock is messed up because the raspi does not have the HW clock, so it start as timestamp=0 and then NTP set it to correct value.
Now i can't replicate the problem, but i played a bit with NTP and systemd settings; and as i don't have physical access to the machine i won't play with it again :/

@giampaolo
Copy link
Owner

Since this is no longer reproducible I'm gonna close it for now. Feel free to reopen in case you bump into this issue again.

@Gui13
Copy link

Gui13 commented Feb 12, 2020

I have a pretty easy way to reproduce it here with telegraf, which uses the gopsutil library, which is pretty much a word for word rewrite of your library:

influxdata/telegraf#7018

I have added a bug at gopsutil but your bug is pretty much the same: launch the python script before NTP does its magic, then wait until NTP corrects the date, and then enjoy your 40+ years uptime :D

@giampaolo
Copy link
Owner

Mmm... I guess that means that /proc/stat doesn't get updated for some reason.
You should check whether also /proc/uptime is updated or not. If it's updated we can use that one.

@giampaolo giampaolo reopened this Feb 12, 2020
@Gui13
Copy link

Gui13 commented Feb 12, 2020

This is what I get for a problematic Rpi:

pi@eanode5:~ $ cat /proc/uptime
11779.00 45685.18
pi@eanode5:~ $ cat /proc/stat
cpu  9382 37 8924 4689169 923 0 143 0 0 0
cpu0 2066 0 2431 1171940 258 0 83 0 0 0
cpu1 2058 2 2132 1173105 141 0 20 0 0 0
cpu2 2263 3 2249 1172836 276 0 18 0 0 0
cpu3 2995 32 2112 1171286 247 0 22 0 0 0
intr 952432 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6690 7917 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 729 0 3763 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 269 7179 0 0 0 0 90339 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 526853 0 0 0
ctxt 1400316
btime 1581528798
processes 15829
procs_running 1
procs_blocked 0
softirq 811644 0 263970 2039 19548 0 0 86407 266106 0 173574
pi@eanode5:~ $ date +%s
1581540580

If you do the math, it actually makes sense, all the numbers are correct. btime has been adjusted by the NTP.

The issue is that you retrieve btime when your module is first instanciated and keep it for the rest of the duration of the process, so if the btime changes, you will do computations on an older value.

@giampaolo
Copy link
Owner

giampaolo commented Feb 12, 2020

The issue is that you retrieve btime when your module is first instanciated and keep it for the rest of the duration of the process, so if the btime changes, you will do computations on an older value.

That is true, Process.create_time() on Linux relies on a cached version of whatever psutil.boot_time() returned the last time it was called.
Maybe this should change.
The original issue was about psutil.boot_time() though.

@giampaolo
Copy link
Owner

giampaolo commented Feb 12, 2020

There is another flip of this coin which is probably worth mentioning.
Process.create_time() is different than other methods because it's value is set on __init__ in order to identify the process instance uniquely over time. This way we avoid killing the wrong process if its PID has been reused.
If the clock changes and create_time() returns something different, that Process instance will probably raise NoSuchProcess("this PID has been reused by another process") on kill(), terminate() and other "write" methods (set ionice() or whatever). Probably it's an edge case, and it's either that or live with process ctimes 40 years into the future.

The cost of giving up the cache and return the right process creation time after a system clock update is a 2x slowdown when creating a Process instance (after that the result is cached):

~/svn/psutil {master}$ python3 -m timeit -s "import psutil" "psutil.Process()"
10000 loops, best of 3: 25.2 usec per loop
~/svn/psutil {master}$ python3 -m timeit -s "import psutil" "psutil.Process()"
10000 loops, best of 3: 56.2 usec per loop

@Gui13
Copy link

Gui13 commented Apr 15, 2020

I understand the performance issue, but the returned value is still wrong for any system that encountered NTP adjustment during the life of the python process that uses psutil.

As I said in the original bug in telegraf (which uses the python gopsutil but has the same bug), I suspect that many monitoring systems might be minutes or hours off their real uptime, depending on how skewed their RTC is, and how often the NTP process has to compensate for it. Right now the only way to solve the issue is to restart the python process doing the monitoring.

Maybe there is a middle ground where the actual btime is read again at regular intervals (say, at least every 15 minutes) and cached in between?

@giampaolo
Copy link
Owner

giampaolo commented Apr 15, 2020

I agree this should be fixed. We can resolve the slowdown by getting the boot time by using sysinfo(2) syscall instead of parsing /proc/stat:
https://stackoverflow.com/a/1544090/376587
It's already exposed in psutil, we just have to return the extra struct info. I made a quick test calling sysinfo(2) on create_time() and the slowdown is minimal, so I'd say we can proceed with this:

before:
~/svn/psutil {master}$ python3 -m timeit -s "import psutil" "psutil.Process()"
10000 loops, best of 3: 23.5 usec per loop

after:
~/svn/psutil {master}$ python3 -m timeit -s "import psutil" "psutil.Process()"
10000 loops, best of 3: 25.8 usec per loop

Note to self: this issue is related to #1724.

@giampaolo
Copy link
Owner

giampaolo commented Apr 15, 2020

Also, another note to self: originally we were already using sysinfo(2) to get the boot_time(), but decided to use /proc/stat instead so one can realiably determine the boot time of a Docker instance (when running psutil inside docker). There's a ticket somewhere but I'm lazy. Maybe it makes sense to rely on /proc/stat if we're running inside Docker, else use sysinfo(2).

@giampaolo
Copy link
Owner

giampaolo commented Apr 15, 2020

sysinfo(2) gives the number of seconds since boot, which changes on every call, and that is not good as we have to do some math in order to calculate it, and that may lead to a variation of result between calls. We have a similar problem on Windows, which I solved with this hack:

psutil/psutil/_pswindows.py

Lines 435 to 440 in 7ce6c3c

def boot_time():
"""The system boot time expressed in seconds since the epoch."""
# This dirty hack is to adjust the precision of the returned
# value which may have a 1 second fluctuation, see:
# https://github.com/giampaolo/psutil/issues/1007
global _last_btime

It appears that uptime -s returns a stable time, but still not sure how. This should be the source code:
https://github.com/coreutils/coreutils/blob/master/src/uptime.c

@Gui13
Copy link

Gui13 commented Apr 16, 2020

@giampaolo if sysinfo returns the corrected value after an adjustment, then it makes sense yes.

If what is said in #1724 is right, you can bypass the problem for the process isRunning() issue by reading the /proc/PID/stat instead of basing on /proc/stat

Then the last thing to check is that in VM/docker environment, the sysinfo call correctly adjusts (but my guess is that yes, it adjusts, since in VM it is like in real machine and in docker it actually responds what the host kernel knows)

@Gui13
Copy link

Gui13 commented Apr 16, 2020

Ah my bad, sysinfo doesn't return the btime. Well then /proc/stat would be the solution.. maybe with some caching so that you don't re-read it all the time?

@giampaolo
Copy link
Owner

giampaolo commented Apr 16, 2020

Well then /proc/stat would be the solution.. maybe with some caching so that you don't re-read it all the time?

The caching is why we have this problem currenly: since boot_time() value is cached (when using it from Process.create_time()) we don't get NTP updates.

@Gui13
Copy link

Gui13 commented Apr 16, 2020

Yeah but it might be sufficient to read it at regular intervals (every N calls, or store the last read timestamp and read every 15 minutes). The problem lies in reading it only once. If the uptime is eventually consistent, then it might be OK.

I'm just suggesting this because you mentioned the performance hit, the real, irrefutable fix is to read /proc/stat every time.

@giampaolo giampaolo added the bug label Nov 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants