Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

panic in elapsed() #80674

Closed
silence-coding opened this issue Jan 4, 2021 · 13 comments
Closed

panic in elapsed() #80674

silence-coding opened this issue Jan 4, 2021 · 13 comments
Labels
A-time Area: Time C-bug Category: This is a bug. C-defective-hardware Category: Issue that was filed as a possible software bug but is actually defective hardware T-libs Relevant to the library team, which will review and decide on the PR/issue. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@silence-coding
Copy link

The panic occurs in the third-party hyper.

hyperium/hyper#2385

@silence-coding silence-coding added the C-bug Category: This is a bug. label Jan 4, 2021
@silence-coding
Copy link
Author

image

@jyn514
Copy link
Member

jyn514 commented Jan 4, 2021

https://doc.rust-lang.org/std/time/struct.Instant.html#panics-1

This function may panic if the current time is earlier than this instant, which is something that can happen if an Instant is produced synthetically.

What operating system are you on? How are you setting the instant? On some systems the clock time is not actually monotonic.

// And here we come upon a sad state of affairs. The whole point of
// `Instant` is that it's monotonically increasing. We've found in the
// wild, however, that it's not actually monotonically increasing for
// one reason or another. These appear to be OS and hardware level bugs,
// and there's not really a whole lot we can do about them. Here's a
// taste of what we've found:
//
// * #48514 - OpenBSD, x86_64
// * #49281 - linux arm64 and s390x
// * #51648 - windows, x86
// * #56560 - windows, x86_64, AWS
// * #56612 - windows, x86, vm (?)
// * #56940 - linux, arm64
// * https://bugzilla.mozilla.org/show_bug.cgi?id=1487778 - a similar
// Firefox bug
//
// It seems that this just happens a lot in the wild.
// We're seeing panics across various platforms where consecutive calls
// to `Instant::now`, such as via the `elapsed` function, are panicking
// as they're going backwards. Placed here is a last-ditch effort to try
// to fix things up. We keep a global "latest now" instance which is
// returned instead of what the OS says if the OS goes backwards.
//
// To hopefully mitigate the impact of this, a few platforms are
// excluded as "these at least haven't gone backwards yet".
if time::Instant::actually_monotonic() {
return Instant(os_now);
}

@camelid camelid added T-libs Relevant to the library team, which will review and decide on the PR/issue. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Jan 4, 2021
@silence-coding
Copy link
Author

rust version: 1.45.2
operating system: EulerOS 2.0SP5 x86_64
https://developer.huaweicloud.com/en-us/euleros/lifecycle-management.html

@silence-coding
Copy link
Author

@jyn514 Thank you for your reply. Can I ask how to set the instant will trigger this situation?

@jyn514
Copy link
Member

jyn514 commented Jan 4, 2021

@silence-coding like the comment says, if the system time is not monotonic then this panic can happen when you set it to a time that's earlier. I don't know any more than that.

@silence-coding
Copy link
Author

Thank you

@the8472
Copy link
Member

the8472 commented Apr 20, 2021

To get an idea what might be causing this the hardware environment matters too.

Can you provide the following info?

  • kernel version
  • /proc/cpuinfo, if the cores are homogenous then one core will do
  • /sys/devices/system/clocksource/clocksource0/current_clocksource
  • is the system virtualized? if yes which hypervisor?

@silence-coding
Copy link
Author

silence-coding commented Apr 21, 2021

Circumstances have been lost, just looking for a similar

Linux version 3.10.0-862.14.1.0.h197.eulerosv2r7.x86_64

/proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 63
model name : Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
stepping : 2
microcode : 0x1
cpu MHz : 2593.992
cache size : 16384 KB
physical id : 0
siblings : 1
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt arat
bogomips : 5187.98
clflush size : 64
cache_alignment : 64
address sizes : 42 bits physical, 48 bits virtual
power management:

Hypervisor vendor: KVM

/sys/devices/system/clocksource/clocksource0/current_clocksource kvm-clock

@the8472
Copy link
Member

the8472 commented Apr 21, 2021

Ok, on the assumption that actually is identical to the hardware that triggered your initial report:

It's not actually using the tsc directly and instead relying on the hypervisor.

If I read this code correctly:

https://github.com/torvalds/linux/blob/8bb495e3f02401ee6f76d1b1d77f3ac9f079e376/arch/x86/kernel/kvmclock.c#L271-L272

and

https://github.com/torvalds/linux/blob/8bb495e3f02401ee6f76d1b1d77f3ac9f079e376/arch/x86/kernel/pvclock.c#L77-L100

then it'll only trust the KVM clock if the hypervisor explicitly promises that the host clock is stable and otherwise protects it with an atomic CAS. I only checked the mainline 3.10 source though. RHEL/CentOS are known to produce heavily patched frankenkernels, so their code might actually be doing something different there.

Anyway, it looks like the host explicitly promises that the clock is reliable and then breaks that promise.

So if this issue occurs again on a machine using kvm-clock you could report that to your hosting provider, they may have to update their hypervisor to fix this or you could switch to a different clock source, e.g. hpet. You can check which ones are available under /sys/devices/system/clocksource/clocksource0/available_clocksources

On the rust side we have the option to also change actually_monotonic() check to return false on x86 linux. Which would be unfortunate since this only affects broken hypervisors on old CPUs without nonstop_tsc since current kernels prefer tsc over kvm-clock if the cpu indicates constant_tsc + nonstop_tsc.

@the8472
Copy link
Member

the8472 commented Apr 21, 2021

Circumstances have been lost, just looking for a similar

That may not be reliable enough to report this issue to the linux kernel maintainers.

If you encounter the issue again can you gather the information (OS, cpuinfo, hypervisor, kernel version and clock source) for the particular failing system?

@silence-coding
Copy link
Author

@the8472 Next time I meet, I gather as much information as I can.

@saethlin
Copy link
Member

Should this have been closed by #89926?

@crlf0710
Copy link
Member

Seems indeed so. Closing.

@the8472 the8472 added the C-defective-hardware Category: Issue that was filed as a possible software bug but is actually defective hardware label Jun 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-time Area: Time C-bug Category: This is a bug. C-defective-hardware Category: Issue that was filed as a possible software bug but is actually defective hardware T-libs Relevant to the library team, which will review and decide on the PR/issue. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants