-
Notifications
You must be signed in to change notification settings - Fork 5.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Second attempt to fix the CPU startup failure #4591
Conversation
This reverts commit fe4cc0e. The speculative patch that this commit reverts is proving to not be effective any more, so revert it and try a new approach. Signed-off-by: Phil Elwell <[email protected]>
A failure of some CPU cores to come online has been traced to the failure of a stm instruction while the cache is disabled. The symptom is that the saved values read back as zeroes, a catastrophic error since one of the values is a return address. This patch forces a readback and retry until the correct value is returned, Notes: At this stage in the boot process the core is running with its cache disabled. Before enabling the cache its contents must be explicitly invalidated, a process that requires quite a few registers that the caller must preserve. Evidence suggests that something is writing a block of zeroes over that space at a time when all other cores should be idle, possibly some kind of write-combiner, and retrying is an attempt to avoid the problem. The previous attempted fix (forcing the accesses to only be 4-byte aligned) appears to have only worked for a while and likely for less obvious reasons such as a change in code alignment. See: Hexxeh/rpi-firmware#232 Signed-off-by: Phil Elwell <[email protected]>
@popcornmix has reminded me that the startup code was modified in 5.13 so as to no longer need to spill the registers to the stack. The changes are covered by this patchset (https://www.spinics.net/lists/arm-kernel/msg874939.html), comprising the following commits: c0e5073 ARM: 9057/1: cache-v7: add missing ISB after cache level selection A quick reading suggested that applying these patches on top of the reversion of my first patch would provide an alternative fix for the problem, but so far I've been unable to get it to boot. |
This patch fixes the problem for me to a reasonable level of confidence; I've seen 5+ CPU1 failures on a rev 1.1 2B running top-of-tree rpi-5.10.y (the failure mode also takes out USB) in about half an hour of reboots, and nothing in over an hour (including some cold boots) with the retry patch. |
The only question I had was whether the retry will always eventually succeed, or is there the vaguest possible of it infinite looping? |
Doing nothing is the lowest risk. |
Well, we've seen reports in the field, so anything is better than nothing. |
More importantly, since this is code that runs on each of the secondary cores, an infinite loop will only cause the symptoms people are seeing now ( |
I have a partial answer to that, having seen one failure (CPU2) out of 250 reboots - the primary core gives up waiting after a short interval and marks the secondary as offline. The difference with the patched kernel, apart from the reduced incidence, is that USB was functional after the failure. |
Sounds like a win to me. Can we increase the primary core timeout? |
Retesting now with a 10s timeout (was 1s). |
Just got another CPU2 failure, despite the 10s timeout - it's dead, Jim. |
I'm afraid I know sod-all about ARM assembler, but just out of curiosity I googled |
For us it boots (applied on top of 5.10.70), but with this the second CPU consistently fails to come up (4 times out of 4 tries):
95731b8 ARM: 9059/1: cache-v7: get rid of mini-stack is what seems to be breaking stuff, and not applying this one makes it come up successfully (though totally negating the intention). At a first glance don't see why though. So maybe there are some additional changes needed. |
Yes, 5.13/5.14/5.15 include "get rid of mini-stack" and boot, and never get the CPU startup failure. The good news is the CPU startup failure will go when we bump to the next LTS kernel (but that will be some way off). |
Good to know, then we'll maybe just sit this out until the next LTS kernel. |
After booting a lot of kernels, I identified the missing patch series..es: Applying commits torvalds/linux@4e79f02 to torvalds/linux@9443076 (11 commits) and torvalds/linux@67e3f82 to torvalds/linux@aaac373 (9 commits) on top of 5.10 makes the mini-stack removal actually work as expected on top of 5.10. A few of these commits have already been backported to 5.10 stable. I did not try to create a minimal patchset that makes it work, but having both series backported definitely works at a first glance. |
Thank you very much for doing that - it's been bothering me, since even with this second patch I have seen one or two failures. As you say, some of those patches have already been back-ported, leaving 17 additional commits and a reversion of my workaround, which are all now in the rpi-5.10y as of 462a5e2. |
kernel: clk: Move vec clock to clk-raspberrypi See: raspberrypi/linux#4639 kernel: ARM: v7: get rid of boot time mini stack See: raspberrypi/linux#4591 kernel: HVS Gamma LUT support for the RPi4 See: raspberrypi/linux#4435 kernel: ARM: dts: Add Pi Zero 2 support firmware: arm_loader: Allow VEC clock to be controlled by arm
kernel: clk: Move vec clock to clk-raspberrypi See: raspberrypi/linux#4639 kernel: ARM: v7: get rid of boot time mini stack See: raspberrypi/linux#4591 kernel: HVS Gamma LUT support for the RPi4 See: raspberrypi/linux#4435 kernel: ARM: dts: Add Pi Zero 2 support firmware: arm_loader: Allow VEC clock to be controlled by arm
The upstream form of fix is now in rpi-update kernel. Would be useful if anyone experiencing the issue can test and report back. Hopefully we can then get it into stable releases soon. |
We added a check after every reboot in our automated testing, and haven't seen a failure in about ~1.5k reboots. |
ARM: proc-v7: Retry uncached stmia if necessary
A failure of some CPU cores to come online has been traced to the
failure of a stm instruction while the cache is disabled. The symptom
is that the saved values read back as zeroes, a catastrophic error since
one of the values is a return address.
This patch forces a readback and retry until the correct value is
returned,
Notes:
At this stage in the boot process the core is running with its cache
disabled. Before enabling the cache its contents must be explicitly
invalidated, a process that requires quite a few registers that the
caller must preserve. Evidence suggests that something is writing a
block of zeroes over that space at a time when all other cores should
be idle, possibly some kind of write-combiner, and retrying is an
attempt to avoid the problem.
The previous attempted fix (forcing the accesses to only be 4-byte
aligned) appears to have only worked for a while and likely for less
obvious reasons such as a change in code alignment.
See: Hexxeh/rpi-firmware#232