Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Second attempt to fix the CPU startup failure #4591

Merged
merged 2 commits into from
Sep 20, 2021

Conversation

pelwell
Copy link
Contributor

@pelwell pelwell commented Sep 20, 2021

ARM: proc-v7: Retry uncached stmia if necessary

A failure of some CPU cores to come online has been traced to the
failure of a stm instruction while the cache is disabled. The symptom
is that the saved values read back as zeroes, a catastrophic error since
one of the values is a return address.

This patch forces a readback and retry until the correct value is
returned,

Notes:

At this stage in the boot process the core is running with its cache
disabled. Before enabling the cache its contents must be explicitly
invalidated, a process that requires quite a few registers that the
caller must preserve. Evidence suggests that something is writing a
block of zeroes over that space at a time when all other cores should
be idle, possibly some kind of write-combiner, and retrying is an
attempt to avoid the problem.

The previous attempted fix (forcing the accesses to only be 4-byte
aligned) appears to have only worked for a while and likely for less
obvious reasons such as a change in code alignment.

See: Hexxeh/rpi-firmware#232

This reverts commit fe4cc0e.

The speculative patch that this commit reverts is proving to not be
effective any more, so revert it and try a new approach.

Signed-off-by: Phil Elwell <[email protected]>
A failure of some CPU cores to come online has been traced to the
failure of a stm instruction while the cache is disabled. The symptom
is that the saved values read back as zeroes, a catastrophic error since
one of the values is a return address.

This patch forces a readback and retry until the correct value is
returned,

Notes:

At this stage in the boot process the core is running with its cache
disabled. Before enabling the cache its contents must be explicitly
invalidated, a process that requires quite a few registers that the
caller must preserve. Evidence suggests that something is writing a
block of zeroes over that space at a time when all other cores should
be idle, possibly some kind of write-combiner, and retrying is an
attempt to avoid the problem.

The previous attempted fix (forcing the accesses to only be 4-byte
aligned) appears to have only worked for a while and likely for less
obvious reasons such as a change in code alignment.

See: Hexxeh/rpi-firmware#232

Signed-off-by: Phil Elwell <[email protected]>
@pelwell
Copy link
Contributor Author

pelwell commented Sep 20, 2021

@popcornmix has reminded me that the startup code was modified in 5.13 so as to no longer need to spill the registers to the stack. The changes are covered by this patchset (https://www.spinics.net/lists/arm-kernel/msg874939.html), comprising the following commits:

c0e5073 ARM: 9057/1: cache-v7: add missing ISB after cache level selection
f9e7a99 ARM: 9058/1: cache-v7: refactor v7_invalidate_l1 to avoid clobbering r5/r6
95731b8 ARM: 9059/1: cache-v7: get rid of mini-stack

A quick reading suggested that applying these patches on top of the reversion of my first patch would provide an alternative fix for the problem, but so far I've been unable to get it to boot.

@pelwell
Copy link
Contributor Author

pelwell commented Sep 20, 2021

This patch fixes the problem for me to a reasonable level of confidence; I've seen 5+ CPU1 failures on a rev 1.1 2B running top-of-tree rpi-5.10.y (the failure mode also takes out USB) in about half an hour of reboots, and nothing in over an hour (including some cold boots) with the retry patch.

@JamesH65
Copy link
Contributor

The only question I had was whether the retry will always eventually succeed, or is there the vaguest possible of it infinite looping?

@pelwell
Copy link
Contributor Author

pelwell commented Sep 20, 2021

Doing nothing is the lowest risk.

@JamesH65
Copy link
Contributor

Well, we've seen reports in the field, so anything is better than nothing.

@pelwell
Copy link
Contributor Author

pelwell commented Sep 20, 2021

More importantly, since this is code that runs on each of the secondary cores, an infinite loop will only cause the symptoms people are seeing now (CPUx failed to come online) but without the horrible side-effects caused by a branch to zero...

@pelwell pelwell merged commit 3e1698e into raspberrypi:rpi-5.10.y Sep 20, 2021
popcornmix added a commit to raspberrypi/firmware that referenced this pull request Sep 20, 2021
popcornmix added a commit to Hexxeh/rpi-firmware that referenced this pull request Sep 20, 2021
@pelwell
Copy link
Contributor Author

pelwell commented Sep 21, 2021

The only question I had was whether the retry will always eventually succeed

I have a partial answer to that, having seen one failure (CPU2) out of 250 reboots - the primary core gives up waiting after a short interval and marks the secondary as offline. The difference with the patched kernel, apart from the reduced incidence, is that USB was functional after the failure.

@JamesH65
Copy link
Contributor

Sounds like a win to me. Can we increase the primary core timeout?

@pelwell
Copy link
Contributor Author

pelwell commented Sep 21, 2021

Retesting now with a 10s timeout (was 1s).

@pelwell
Copy link
Contributor Author

pelwell commented Sep 21, 2021

Just got another CPU2 failure, despite the 10s timeout - it's dead, Jim.

@lurch
Copy link
Contributor

lurch commented Sep 23, 2021

I'm afraid I know sod-all about ARM assembler, but just out of curiosity I googled stmia and found https://developer.arm.com/documentation/ddi0406/b/Application-Level-Architecture/Instruction-Details/Alphabetical-list-of-instructions/STM---STMIA---STMEA?lang=en which links to https://developer.arm.com/documentation/ddi0406/b/System-Level-Architecture/System-Instructions/Alphabetical-list-of-instructions/STM--user-registers-?lang=en which in turn says "Store Multiple (user registers) is unpredictable in User or System modes."
Could that explain the unpredictability you're seeing here, or is that referring to something else entirely? (I also know nothing about different CPU-modes)

@KanjiMonster
Copy link
Contributor

@popcornmix has reminded me that the startup code was modified in 5.13 so as to no longer need to spill the registers to the stack. The changes are covered by this patchset (https://www.spinics.net/lists/arm-kernel/msg874939.html), comprising the following commits:

c0e5073 ARM: 9057/1: cache-v7: add missing ISB after cache level selection
f9e7a99 ARM: 9058/1: cache-v7: refactor v7_invalidate_l1 to avoid clobbering r5/r6
95731b8 ARM: 9059/1: cache-v7: get rid of mini-stack

A quick reading suggested that applying these patches on top of the reversion of my first patch would provide an alternative fix for the problem, but so far I've been unable to get it to boot.

For us it boots (applied on top of 5.10.70), but with this the second CPU consistently fails to come up (4 times out of 4 tries):

smp: Bringing up secondary CPUs ...
CPU1: failed to come online
smp: Brought up 1 node, 1 CPU
SMP: Total of 1 processors activated (1000.00 BogoMIPS).

95731b8 ARM: 9059/1: cache-v7: get rid of mini-stack is what seems to be breaking stuff, and not applying this one makes it come up successfully (though totally negating the intention). At a first glance don't see why though.

So maybe there are some additional changes needed.

@popcornmix
Copy link
Collaborator

Yes, 5.13/5.14/5.15 include "get rid of mini-stack" and boot, and never get the CPU startup failure.
So there definitely is something between 5.10 and 5.13 that makes it work.

The good news is the CPU startup failure will go when we bump to the next LTS kernel (but that will be some way off).

@KanjiMonster
Copy link
Contributor

Good to know, then we'll maybe just sit this out until the next LTS kernel.

@pelwell pelwell deleted the cpustartfix branch October 5, 2021 13:04
@KanjiMonster
Copy link
Contributor

After booting a lot of kernels, I identified the missing patch series..es:

Applying commits torvalds/linux@4e79f02 to torvalds/linux@9443076 (11 commits) and torvalds/linux@67e3f82 to torvalds/linux@aaac373 (9 commits) on top of 5.10 makes the mini-stack removal actually work as expected on top of 5.10.

A few of these commits have already been backported to 5.10 stable. I did not try to create a minimal patchset that makes it work, but having both series backported definitely works at a first glance.

@pelwell
Copy link
Contributor Author

pelwell commented Oct 28, 2021

Thank you very much for doing that - it's been bothering me, since even with this second patch I have seen one or two failures. As you say, some of those patches have already been back-ported, leaving 17 additional commits and a reversion of my workaround, which are all now in the rpi-5.10y as of 462a5e2.

popcornmix added a commit to raspberrypi/firmware that referenced this pull request Oct 28, 2021
kernel: clk: Move vec clock to clk-raspberrypi
See: raspberrypi/linux#4639

kernel: ARM: v7: get rid of boot time mini stack
See: raspberrypi/linux#4591

kernel: HVS Gamma LUT support for the RPi4
See: raspberrypi/linux#4435

kernel: ARM: dts: Add Pi Zero 2 support

firmware: arm_loader: Allow VEC clock to be controlled by arm
popcornmix added a commit to raspberrypi/rpi-firmware that referenced this pull request Oct 28, 2021
kernel: clk: Move vec clock to clk-raspberrypi
See: raspberrypi/linux#4639

kernel: ARM: v7: get rid of boot time mini stack
See: raspberrypi/linux#4591

kernel: HVS Gamma LUT support for the RPi4
See: raspberrypi/linux#4435

kernel: ARM: dts: Add Pi Zero 2 support

firmware: arm_loader: Allow VEC clock to be controlled by arm
@popcornmix
Copy link
Collaborator

The upstream form of fix is now in rpi-update kernel. Would be useful if anyone experiencing the issue can test and report back. Hopefully we can then get it into stable releases soon.

@KanjiMonster
Copy link
Contributor

We added a check after every reboot in our automated testing, and haven't seen a failure in about ~1.5k reboots.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants