CPU1: failed to come online with 5.4.51-v7l+ #232
Comments
I believe I'm on that version:
I don't see anything in the list of changes that seems likely to cause this. |
I rebooted twice, and both times it came back with a CPU offline. Honestly, I didn't even notice it until this morning when I ran htop. The kernel log shows that every boot other than those with the July 17th kernel build comes up fine. I am wondering if it is something with the bootloader, which seems to have been updated as well. I rolled back everything to the 5.4.49 commit. Edit: When I say bootloader, I guess I am talking about maybe the second stage? /boot/start.elf |
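A quick way to see which boots were affected is to search the saved kernel log for the message in this issue's title. The sketch below is only an illustration: the default path /var/log/kern.log is an assumption (adjust it for your system), and the helper name failed_cpus_in_log is made up for this example.

```shell
#!/bin/bash
# Hypothetical helper: list every CPU that a kernel log reports as having
# failed to come online, with the number of times each one was reported.
# The default path is an assumption; pass another file as the first argument.
failed_cpus_in_log() {
    local log=${1:-/var/log/kern.log}
    # -o prints only the matching part, e.g. "CPU1: failed to come online".
    grep -oE 'CPU[0-9]+: failed to come online' "$log" \
        | cut -d: -f1 \
        | sort \
        | uniq -c
}
```

Running `failed_cpus_in_log` with no arguments reads the default log; each output line is a count followed by the CPU name, and no output means the message was not found.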
The other thing I noticed is that there are only 3 berries on the boot screen. First reboot 4 berries, second reboot 3 berries, so far. |
OK, looks like it may have been introduced with 8382ece (5.4.51). Just counting berries on the screen this is what I found:
|
Commands executed:
4/6 reboots resulted in 3 berries. |
I believe that is suggesting a kernel issue starting from 8382ece. |
@popcornmix Reverted that commit, rebuilt 5.4.51, rebooted 10 times and no problems. All 4 cores are coming up with all 10 reboots.
|
Just to be absolutely sure, with your own built kernel and that commit not reverted, is it failing? |
I haven't tried it, but the base was commit 9d49ae69a1448f2180229b82794bfaa1c78679f7.
|
Rebuilt using raspberrypi/linux@9d49ae69a144 and it's also totally fine (all CPUs are coming up for 10 reboots). So I used rpi-update to switch back to 8382ece and it still has the CPU problem about 60% of the time. Either something on rpi-5.4.y fixed it after the build (seems unlikely since I only see one additional commit) or maybe the build host has a local git repository that is out of sync. Or something else? |
Are you updating kernel, modules and dtbs after building your own? |
I don't buy that raspberrypi/linux@cc5c7ce is the cause -
With extra debugging (not emitted either) it started OK - something seems a bit marginal. I've also noticed some reboot failures - the firmware stopping with 7 short flashes, which means "kernel not found". |
I generally do not build the kernel; usually what you guys provide is fine. But yes, I update modules, dtbs, dtb overlays, and kernel. This is the script I use to cross-compile from Ubuntu 18.04 (x64); it spits out a tarball that can be unrolled from "/" on the Pi. I run it with an argument of "arm", and I am using the GCC 7.5.0 cross-compiler:
#!/bin/bash
ARCH=$1
case $ARCH in
    arm64)
        KERNEL=kernel8
        CROSS_COMPILE=aarch64-linux-gnu-
        ;;
    arm)
        KERNEL=kernel7l
        CROSS_COMPILE=arm-linux-gnueabihf-
        ;;
    *)
        echo "No architecture specified."
        exit 1
        ;;
esac
REV=$(git rev-parse --short HEAD)
KBASE=/tmp/kernel
NPROC=$(/usr/bin/nproc)
NPROC=$(( NPROC + NPROC / 2 ))
test -d ${KBASE} && rm -fr ${KBASE}
mkdir -p ${KBASE}/boot/overlays
make -j${NPROC} ARCH=${ARCH} CROSS_COMPILE=${CROSS_COMPILE} bcm2711_defconfig
KVER=$(make -j${NPROC} ARCH=${ARCH} CROSS_COMPILE=${CROSS_COMPILE} -s kernelrelease)
make -j${NPROC} ARCH=${ARCH} CROSS_COMPILE=${CROSS_COMPILE} Image modules dtbs
make -j${NPROC} ARCH=${ARCH} CROSS_COMPILE=${CROSS_COMPILE} INSTALL_MOD_PATH=${KBASE} modules_install
rm -f ${KBASE}/lib/modules/${KVER}/build ${KBASE}/lib/modules/${KVER}/source
cp arch/${ARCH}/boot/Image ${KBASE}/boot/${KERNEL}.img
case $ARCH in
    arm64)
        cp arch/${ARCH}/boot/dts/broadcom/*.dtb ${KBASE}/boot/
        ;;
    arm)
        cp arch/${ARCH}/boot/dts/*.dtb ${KBASE}/boot/
        ;;
esac
cp arch/${ARCH}/boot/dts/overlays/*.dtb* ${KBASE}/boot/overlays/
cp arch/${ARCH}/boot/dts/overlays/README ${KBASE}/boot/overlays/
tar cvzf ~/kernel-${ARCH}-${KVER}-${REV}.tar.gz -C ${KBASE} .
make -j${NPROC} ARCH=${ARCH} CROSS_COMPILE=${CROSS_COMPILE} mrproper
It is always CPU 1 that fails to come online:
|
From the description of problem firmware seems more likely, but the rpi-update test didn't seem to confirm that. |
Could anybody tell me why I have "You’re not receiving notifications from this thread." but get an email for every post ? |
No - sorry. That's GitHub, not us. |
I think we did that up in this comment: #232 (comment) Gave it another go, just went about it a different way:
bafd743 is the known working one for me, all 10/10 reboots come up with all 4 cores. This scenario was 40% failed to bring up CPU 1.
|
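Because the failure is intermittent, a clean run of reboots is only convincing if the run is long enough. As a rough sketch, assuming each boot fails independently with the observed rate p (an assumption; real failures may be correlated), the chance that a bad kernel survives n reboots without a failure is (1 - p)^n. The function name clean_run_probability is made up for this example.

```shell
#!/bin/bash
# Hypothetical helper: probability that a kernel which fails a fraction p of
# boots still shows n clean boots in a row, assuming independent boots.
#   $1 = failure rate per boot (0..1), $2 = number of reboots
clean_run_probability() {
    # LC_ALL=C keeps the decimal point a "." regardless of the user's locale.
    LC_ALL=C awk -v p="$1" -v n="$2" 'BEGIN { printf "%.4f\n", (1 - p) ^ n }'
}
```

For example, `clean_run_probability 0.4 10` prints 0.0060, so ten clean boots out of ten would be very unlikely from a kernel that fails 40% of the time. At lower failure rates the same ten reboots say much less.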
I just got a failure with the "suspect" commit (raspberrypi/linux@cc5c7ce) reverted, so it isn't that. |
In the bad state, retrying the wake of CPU1 doesn't help.
I'll test the 634e380 kernel with the latest firmware next, then vice versa, and take it from there. |
One core failing to start feels like an arm reset issue. We had this in the early days. Some chips were more susceptible than others. But that doesn't seem to tie up with either of the tests. |
Interesting, because da3752a is pretty solid for me, I didn't reboot it for 30 minutes but 10 cycles and never had a problem. The other thing is if I build 5.4.51 with GCC 7.5.0 (Ubuntu/Linaro) cross compiler I also have no issues. I don't know if the later version of the tool chain happens to be slightly more or less optimized and it's just by chance timing. |
Does this ever fail with the 4.19 kernel? I’ve looked through the ARM and clock related firmware changes and so far failed to reproduce the failure there. Although, it’s possible that both the latest firmware and 5.4 are required. |
Having found what I consider to be a 5.4 LKG I'm now working forwards, not backwards. |
Update: I think (and I can't be categorical because of the probabilistic nature of the failure) I've isolated the problem change to the kernel portion of a50c7d5 ("Bump to 5.4.45"). These commit hashes aren't all present in the current tree due to rebasing, but the last known good release is 3f54521ea and the 5.4.45 release is 9be502df. Comparing those two, 9be502df adds the following:
None of those commits stand out as obvious candidates, but I think we can rule out many of them as being either for the wrong platform (i.e. not compiled) or affecting code not yet run at the point of failure. I just hope it isn't a code placement problem. |
Having moved to the 5.4.47 release (dec0ddc5) after failing to find a bad commit in 5.4.45, I think we have a culprit:
It's a plausible result because it's a downstream patch that applies to our platform, and it deals with low-level stuff that might be run very early in the boot process. If anyone has a moment and a Pi to spare, build either or both of those commits and put it in some kind of a reboot loop to see how long it takes for CPU1 not to come up. N.B. Don't do this unless you have thought carefully about how to break out of the reboot loop - you need an exit strategy. |
Let me see if I can find a board that fails somewhat reliably. |
I added this just before the exit 0 in /etc/rc.local |
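The snippet that was actually added is not shown in this thread. As a minimal sketch of what such a reboot loop with an exit strategy might look like: the function names, the stop-file path /boot/stop-reboot-test and the 30-second delay are all assumptions for illustration. /sys/devices/system/cpu/online is the kernel's list of online CPUs (for example "0-3"), and the expected count of 4 matches the four cores discussed here.

```shell
#!/bin/bash
# Hypothetical helper: count the CPUs in a kernel CPU list such as "0-3"
# or "0,2-3" (the format of /sys/devices/system/cpu/online).
count_cpus_in_list() {
    local list=$1 total=0 range lo hi
    local IFS=,
    for range in $list; do
        lo=${range%-*}   # "2-3" -> "2", "0" -> "0"
        hi=${range#*-}   # "2-3" -> "3", "0" -> "0"
        total=$(( total + hi - lo + 1 ))
    done
    echo "$total"
}

# Hypothetical helper: decide what the reboot test should do next.
# Prints "stop" if the stop file exists (the exit strategy), "fail" if fewer
# CPUs than expected are online, otherwise "reboot". All paths are assumptions.
reboot_test_step() {
    local online_file=${1:-/sys/devices/system/cpu/online}
    local stop_file=${2:-/boot/stop-reboot-test}
    local expected=${3:-4}
    local online
    if [ -e "$stop_file" ]; then
        echo stop
        return 0
    fi
    online=$(count_cpus_in_list "$(cat "$online_file")")
    if [ "$online" -lt "$expected" ]; then
        echo fail
    else
        echo reboot
    fi
}

# Example use from /etc/rc.local, before "exit 0":
#   case $(reboot_test_step) in
#       reboot) sleep 30; /sbin/reboot ;;
#   esac
```

Keeping the stop file on /boot means it can be created from another machine by mounting the SD card, and the delay before rebooting leaves a window to log in and create it. On "fail" the loop simply stops, leaving the board in the failed state for inspection.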
In an attempt to prevent the problem of CPUn not starting, explicitly misalign the scratch space used to save registers across the cache invalidation. Notes: At this stage in the boot process the core is running with its cache disabled. Before enabling the cache its contents must be explicitly invalidated, a process that requires quite a few registers that the caller must preserve. Evidence suggests that something is writing a block of zeroes over that space at a time when all other cores should be idle, possibly some kind of write-combiner, and the misalignment is designed to disrupt any write-coalescing. In truth, I don't understand why this patch works, and when the failure is so random it is hard to be certain that this isn't just rolling the dice again. One interesting test would be to change the "addeq r12, #4"s to "addeq r12, #0"s to determine whether the offset itself is significant or just the additional code. See: Hexxeh/rpi-firmware#232 Signed-off-by: Phil Elwell <[email protected]>
commit 79d4926021402cf03b97ba04bbfb580367d7c42a from https://github.com/raspberrypi/linux.git rpi-5.12.y In an attempt to prevent the problem of CPUn not starting, explicitly misalign the scratch space used to save registers across the cache invalidation. Notes: At this stage in the boot process the core is running with its cache disabled. Before enabling the cache its contents must be explicitly invalidated, a process that requires quite a few registers that the caller must preserve. Evidence suggests that something is writing a block of zeroes over that space at a time when all other cores should be idle, possibly some kind of write-combiner, and the misalignment is designed to disrupt any write-coalescing. In truth, I don't understand why this patch works, and when the failure is so random it is hard to be certain that this isn't just rolling the dice again. One interesting test would be to change the "addeq r12, #4"s to "addeq r12, #0"s to determine whether the offset itself is significant or just the additional code. See: Hexxeh/rpi-firmware#232 Signed-off-by: Phil Elwell <[email protected]> Signed-off-by: Meng Li <[email protected]>
FYI The ARMv7 startup code has been changed significantly in the 5.13 kernel; it no longer needs to save registers to the stack at that point, so my questionable patch no longer applies, and we may find that the problem no longer occurs (the commit comment includes the statement that:
which might be a subtle reference to this problem. rpi-5.13.y is still very new and is lacking both some of the commits in rpi-5.10.y and a lot of testing, but I'll be curious to hear if this long-standing issue can finally be laid to rest. |
A failure of some CPU cores to come online has been traced to the failure of a stm instruction while the cache is disabled. The symptom is that the saved values read back as zeroes, a catastrophic error since one of the values is a return address. This patch forces a readback and retry until the correct value is returned. Notes: At this stage in the boot process the core is running with its cache disabled. Before enabling the cache its contents must be explicitly invalidated, a process that requires quite a few registers that the caller must preserve. Evidence suggests that something is writing a block of zeroes over that space at a time when all other cores should be idle, possibly some kind of write-combiner, and retrying is an attempt to avoid the problem. The previous attempted fix (forcing the accesses to only be 4-byte aligned) appears to have only worked for a while and likely for less obvious reasons such as a change in code alignment. See: Hexxeh/rpi-firmware#232 Signed-off-by: Phil Elwell <[email protected]>
The second attempt at fixing this is available in the rpi-update kernel. |
commit 3e1698ed5013c1054e3dbf8a9fcd3a8549a95ece from https://github.com/raspberrypi/linux.git rpi-5.10.y A failure of some CPU cores to come online has been traced to the failure of a stm instruction while the cache is disabled. The symptom is that the saved values read back as zeroes, a catastrophic error since one of the values is a return address. This patch forces a readback and retry until the correct value is returned. Notes: At this stage in the boot process the core is running with its cache disabled. Before enabling the cache its contents must be explicitly invalidated, a process that requires quite a few registers that the caller must preserve. Evidence suggests that something is writing a block of zeroes over that space at a time when all other cores should be idle, possibly some kind of write-combiner, and retrying is an attempt to avoid the problem. The previous attempted fix (forcing the accesses to only be 4-byte aligned) appears to have only worked for a while and likely for less obvious reasons such as a change in code alignment. See: Hexxeh/rpi-firmware#232 Signed-off-by: Phil Elwell <[email protected]> Signed-off-by: Meng Li <[email protected]>
Appears to be something in commit 7059841, because 8382ece booted fine, and I ended up going back to da3752a (5.4.49) and that is also fine.
kern.log