CPU1: failed to come online with 5.4.51-v7l+ #232

wagnerch · 2020-07-18T12:43:31Z

Appears to be something in commit 7059841, because 8382ece booted fine, and I ended up going back to da3752a (5.4.49) and that is also fine.

kern.log

Jul 17 20:54:17 kernel: [    0.000000] Booting Linux on physical CPU 0x0
Jul 17 20:54:18 kernel: [    0.000000] Linux version 5.4.51-v7l+ (dom@buildbot) (gcc version 4.9.3 (crosstool-NG crosstool-ng-1.22.0-88-g8460611)) #1326 SMP Fri Jul 17 10:51:18 BST 2020
Jul 17 20:54:18 kernel: [    0.000000] CPU: ARMv7 Processor [410fd083] revision 3 (ARMv7), cr=30c5383d
Jul 17 20:54:18 kernel: [    0.000000] CPU: div instructions available: patching division code
Jul 17 20:54:18 kernel: [    0.000000] CPU: PIPT / VIPT nonaliasing data cache, PIPT instruction cache
Jul 17 20:54:18 kernel: [    0.000000] OF: fdt: Machine model: Raspberry Pi 4 Model B Rev 1.2
...
Jul 17 20:54:18 kernel: [    0.003438] CPU: Testing write buffer coherency: ok
Jul 17 20:54:18 kernel: [    0.003949] CPU0: thread -1, cpu 0, socket 0, mpidr 80000000
Jul 17 20:54:18 kernel: [    0.004818] Setting up static identity map for 0x200000 - 0x20003c
Jul 17 20:54:18 kernel: [    0.005022] rcu: Hierarchical SRCU implementation.
Jul 17 20:54:18 kernel: [    0.005678] smp: Bringing up secondary CPUs ...
Jul 17 20:54:18 kernel: [    1.041640] CPU1: failed to come online
Jul 17 20:54:18 kernel: [    1.042925] CPU2: thread -1, cpu 2, socket 0, mpidr 80000002
Jul 17 20:54:18 kernel: [    1.044188] CPU3: thread -1, cpu 3, socket 0, mpidr 80000003
Jul 17 20:54:18 kernel: [    1.044333] smp: Brought up 1 node, 3 CPUs
Jul 17 20:54:18 kernel: [    1.044390] SMP: Total of 3 processors activated (324.00 BogoMIPS).
Jul 17 20:54:18 kernel: [    1.044415] CPU: All CPU(s) started in HYP mode.
Jul 17 20:54:18 kernel: [    1.044437] CPU: Virtualization extensions available.


popcornmix · 2020-07-18T13:06:36Z

I believe I'm on that version:

pi@pi4:~ $ uname -a
Linux domnfs 5.4.51-v7l+ #1326 SMP Fri Jul 17 10:51:18 BST 2020 armv7l GNU/Linux
pi@pi4:~ $ vcgencmd version
Jun 10 2020 17:47:19 
Copyright (c) 2012 Broadcom
version e46bba1638cca2708b31b9daf4636770ef981735 (clean) (release) (start)

[    0.000000] Booting Linux on physical CPU 0x0
[    0.000000] Linux version 5.4.51-v7l+ (dom@buildbot) (gcc version 4.9.3 (crosstool-NG crosstool-ng-1.22.0-88-g8460611)) #1326 SMP Fri Jul 17 10:51:18 BST 2020
[    0.000000] CPU: ARMv7 Processor [410fd083] revision 3 (ARMv7), cr=30c5383d
[    0.000000] CPU: div instructions available: patching division code
[    0.000000] CPU: PIPT / VIPT nonaliasing data cache, PIPT instruction cache
[    0.000000] OF: fdt: Machine model: Raspberry Pi 4 Model B Rev 1.2
...
[    0.003385] CPU: Testing write buffer coherency: ok
[    0.003874] CPU0: thread -1, cpu 0, socket 0, mpidr 80000000
[    0.004715] Setting up static identity map for 0x200000 - 0x20003c
[    0.004915] rcu: Hierarchical SRCU implementation.
[    0.005550] smp: Bringing up secondary CPUs ...
[    0.006707] CPU1: thread -1, cpu 1, socket 0, mpidr 80000001
[    0.007980] CPU2: thread -1, cpu 2, socket 0, mpidr 80000002
[    0.009185] CPU3: thread -1, cpu 3, socket 0, mpidr 80000003
[    0.009328] smp: Brought up 1 node, 4 CPUs
[    0.009398] SMP: Total of 4 processors activated (432.00 BogoMIPS).
[    0.009424] CPU: All CPU(s) started in HYP mode.
[    0.009447] CPU: Virtualization extensions available.

I don't see anything in the list of changes that seems likely to cause this.
Does it happen every time, or was it just once?

wagnerch · 2020-07-18T13:30:30Z

I don't see anything in the list of changes that seems likely to cause this.
Does it happen every time, or was it just once?

I rebooted twice, and both times it came back with a CPU offline. Honestly, I didn't even notice it until this morning when I ran htop. The kernel log shows that every boot other than those on the July 17th kernel build came up fine. I am wondering if it is something with the bootloader, which seems to have been updated as well. I rolled back everything to the 5.4.49 commit.

Edit: When I say bootloader, I guess I am talking about maybe second stage? /boot/start.elf

wagnerch · 2020-07-18T13:46:39Z

The other thing I noticed is that there are only 3 berries on the boot screen. First reboot 4 berries, second reboot 3 berries, so far.

popcornmix · 2020-07-18T13:52:19Z

nproc will return number of processors detected and number of berries will match that.
It would be useful to do a number of reboots on 7059841
and a number of reboots on 8382ece
and note how many processors are detected with each.

wagnerch · 2020-07-18T14:21:23Z

OK, looks like it may have been introduced with 8382ece (5.4.51). Just counting berries on the screen this is what I found:

5.4.50 bafd743 4444444444, all 10 reboots came up with 4 berries
5.4.51 8382ece 4433343433, 6/10 reboots came up with 3 berries
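
The berry counts above can be tallied mechanically rather than by eye. A minimal sketch, assuming one digit per reboot as in the strings above; the function name `tally_boots` is hypothetical and not part of any tool mentioned in this thread:

```shell
#!/bin/bash
# Hypothetical helper: summarise a string of per-boot CPU counts such as
# "4433343433" (one digit per reboot, as recorded above).
tally_boots() {
    local results=$1 expected=${2:-4}
    local total=${#results}
    # Strip every character that is not the expected digit, leaving the good boots.
    local good=${results//[^$expected]/}
    local bad=$(( total - ${#good} ))
    echo "${bad}/${total} reboots came up with fewer than ${expected} CPUs"
}

tally_boots 4444444444   # prints: 0/10 reboots came up with fewer than 4 CPUs
tally_boots 4433343433   # prints: 6/10 reboots came up with fewer than 4 CPUs
```

The second argument defaults to 4 because the board under test is a Pi 4; pass a different digit for other boards.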

popcornmix · 2020-07-18T14:29:37Z

Can you confirm if it was a firmware or kernel change? e.g. start with 8382ece
sudo SKIP_KERNEL=1 rpi-update bafd743
should give you firmware from bafd743 but still kernel from 8382ece. Is that okay?

wagnerch · 2020-07-18T14:40:15Z

Commands executed:

sudo \
SKIP_WARNING=1 \
UPDATE_SELF=0 \
    rpi-update 8382ece
    
sudo \
SKIP_WARNING=1 \
UPDATE_SELF=0 \
SKIP_KERNEL=1 \
    rpi-update bafd743

$ reboot
$ vcgencmd version
Jul  2 2020 14:59:18
Copyright (c) 2012 Broadcom
version 36c8be9515deddc9d2b1f469374f00d0a2df13f9 (clean) (release) (start)

$ uname -r
5.4.51-v7l+

4/6 reboots resulted in 3 berries.

popcornmix · 2020-07-18T14:44:58Z

I believe that is suggesting a kernel issue starting from 8382ece.
@pelwell any thoughts? raspberrypi/linux@cc5c7ce ?

wagnerch · 2020-07-18T16:47:30Z

@popcornmix Reverted that commit, rebuilt 5.4.51, rebooted 10 times and no problems. All 4 cores are coming up with all 10 reboots.

$ vcgencmd version
Jul 17 2020 10:59:17
Copyright (c) 2012 Broadcom
version 21a15cb094f41c7506ad65d2cb9b29c550693057 (clean) (release) (start)

$ uname -rmvs
Linux 5.4.51-v7l+ #1 SMP Sat Jul 18 15:19:16 UTC 2020 armv7l

popcornmix · 2020-07-18T16:49:16Z

Just to be absolutely sure, with your own built kernel and that commit not reverted, is it failing?

wagnerch · 2020-07-18T16:50:50Z

I haven't tried it, but the base was commit 9d49ae69a1448f2180229b82794bfaa1c78679f7.

commit 948290923306a7302a14869beae7a560f67cef94 (HEAD -> rpi-5.4.y)
Author: Chad Wagner <[email protected]>
Date:   Sat Jul 18 11:10:49 2020 -0400

    Revert "irqchip/bcm2835: Quiesce IRQs left enabled by bootloader"

    This reverts commit d178d70080f4691a4a5cb69b116d9b7fba4b5e16.

commit 9d49ae69a1448f2180229b82794bfaa1c78679f7 (raspberrypi/rpi-5.4.y)
Author: Phil Elwell <[email protected]>
Date:   Fri Jul 17 17:56:17 2020 +0100

    configs: Add MAXIM_THERMOCOUPLE=m

    See: https://github.com/raspberrypi/linux/issues/3732

    Signed-off-by: Phil Elwell <[email protected]>

wagnerch · 2020-07-18T19:16:35Z

Rebuilt using raspberrypi/linux@9d49ae69a144 and it's also totally fine (all CPUs are coming up for 10 reboots). So I used rpi-update to switch back to 8382ece and it still has the CPU problem about 60% of the time.

Either something on rpi-5.4.y fixed it after the build (seems unlikely since I only see one additional commit) or maybe the build host has a local git repository that is out of sync. Or something else?

popcornmix · 2020-07-18T19:20:36Z

Are you updating kernel, modules and dtbs after building your own?
If you start with a problematic rpi-update versions and then update with your built kernel does it cure the problem?

pelwell · 2020-07-18T19:29:00Z

I don't buy that raspberrypi/linux@cc5c7ce is the cause - armctrl_of_init isn't even called on Pi 4, and neither is bcm2836_arm_irqchip_l1_intc_of_init. However, in confirming this I did just see CPU1 fail to come online:

[    0.004958] rcu: Hierarchical SRCU implementation.
[    0.009544] smp: Bringing up secondary CPUs ...
[    1.041633] CPU1: failed to come online
[    1.043883] CPU2: thread -1, cpu 2, socket 0, mpidr 80000002
[    1.046066] CPU3: thread -1, cpu 3, socket 0, mpidr 80000003
[    1.046212] smp: Brought up 1 node, 3 CPUs

With extra debugging (not emitted either) it started OK - something seems a bit marginal.

I've also noticed some reboot failures - the firmware stopping with 7 short flashes, which means "kernel not found".

wagnerch · 2020-07-18T19:37:42Z

I generally do not build the kernel, usually what you guys provide is fine. But yes, I update modules, dtbs, dtb overlays, and kernel. This is the script I use to cross-compile from Ubuntu 18.04 (x64), it spits out a tarball that can be unrolled from "/" on the pi. I run it with an argument of "arm", and I am using GCC 7.5.0 cross-compiler:
arm-linux-gnueabihf-gcc (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) 7.5.0

#!/bin/bash
ARCH=$1
case $ARCH in
   arm64)
      KERNEL=kernel8
      CROSS_COMPILE=aarch64-linux-gnu-
      ;;
   arm)
      KERNEL=kernel7l
      CROSS_COMPILE=arm-linux-gnueabihf-
      ;;
   *)
      echo "No architecture specified."
      exit 1
      ;;
esac

REV=$(git rev-parse --short HEAD)
KBASE=/tmp/kernel
NPROC=$(/usr/bin/nproc)
NPROC=$(( NPROC + NPROC / 2 ))

test -d ${KBASE} && rm -fr ${KBASE}
mkdir -p ${KBASE}/boot/overlays

make -j${NPROC} ARCH=${ARCH} CROSS_COMPILE=${CROSS_COMPILE} bcm2711_defconfig
KVER=$(make -j${NPROC} ARCH=${ARCH} CROSS_COMPILE=${CROSS_COMPILE} -s kernelrelease)
make -j${NPROC} ARCH=${ARCH} CROSS_COMPILE=${CROSS_COMPILE} Image modules dtbs
make -j${NPROC} ARCH=${ARCH} CROSS_COMPILE=${CROSS_COMPILE} INSTALL_MOD_PATH=${KBASE} modules_install

rm -f ${KBASE}/lib/modules/${KVER}/build ${KBASE}/lib/modules/${KVER}/source
cp arch/${ARCH}/boot/Image ${KBASE}/boot/${KERNEL}.img
case $ARCH in
   arm64)
      cp arch/${ARCH}/boot/dts/broadcom/*.dtb ${KBASE}/boot/
      ;;
   arm)
      cp arch/${ARCH}/boot/dts/*.dtb ${KBASE}/boot/
      ;;
esac
cp arch/${ARCH}/boot/dts/overlays/*.dtb* ${KBASE}/boot/overlays/
cp arch/${ARCH}/boot/dts/overlays/README ${KBASE}/boot/overlays/
tar cvzf ~/kernel-${ARCH}-${KVER}-${REV}.tar.gz -C ${KBASE} .
make -j${NPROC} ARCH=${ARCH} CROSS_COMPILE=${CROSS_COMPILE} mrproper

It is always CPU 1 that fails to come online:

Jul 17 20:54:18 kernel: [    1.041640] CPU1: failed to come online
Jul 18 08:27:28 kernel: [    1.041632] CPU1: failed to come online
Jul 18 09:45:23 kernel: [    1.041638] CPU1: failed to come online
Jul 18 09:47:32 kernel: [    1.041633] CPU1: failed to come online
Jul 18 09:50:32 kernel: [    1.041635] CPU1: failed to come online
Jul 18 09:52:24 kernel: [    1.041633] CPU1: failed to come online
Jul 18 10:02:18 kernel: [    1.041637] CPU1: failed to come online
Jul 18 10:02:42 kernel: [    1.041635] CPU1: failed to come online
Jul 18 10:04:22 kernel: [    1.041634] CPU1: failed to come online
Jul 18 10:05:00 kernel: [    1.041636] CPU1: failed to come online
Jul 18 10:05:41 kernel: [    1.041635] CPU1: failed to come online
Jul 18 10:06:04 kernel: [    1.041631] CPU1: failed to come online
Jul 18 10:34:06 kernel: [    1.041638] CPU1: failed to come online
Jul 18 10:34:24 kernel: [    1.041634] CPU1: failed to come online
Jul 18 10:34:46 kernel: [    1.041636] CPU1: failed to come online
Jul 18 10:35:31 kernel: [    1.041634] CPU1: failed to come online
Jul 18 14:53:17 kernel: [    1.041640] CPU1: failed to come online
Jul 18 14:55:20 kernel: [    1.041637] CPU1: failed to come online
Jul 18 14:55:47 kernel: [    1.041639] CPU1: failed to come online
Jul 18 14:56:11 kernel: [    1.041637] CPU1: failed to come online
Jul 18 14:57:23 kernel: [    1.041634] CPU1: failed to come online
Jul 18 14:57:48 kernel: [    1.041635] CPU1: failed to come online
Jul 18 14:58:35 kernel: [    1.041633] CPU1: failed to come online
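
A list like the one above can be reduced to a per-CPU count to check whether any core other than CPU1 ever fails. This is a rough sketch; the function name `count_offline_cpus` is hypothetical, and the log location in the usage note is an assumption about a default Raspberry Pi OS install:

```shell
#!/bin/bash
# Hypothetical helper: read kernel log text on stdin and print one line per
# CPU that failed to come online, in the form "<CPUn> <number of failures>".
count_offline_cpus() {
    grep -o 'CPU[0-9]*: failed to come online' \
        | cut -d: -f1 \
        | sort \
        | uniq -c \
        | awk '{ print $2, $1 }'
}

# Two sample lines copied from the list above.
printf '%s\n' \
    'Jul 17 20:54:18 kernel: [    1.041640] CPU1: failed to come online' \
    'Jul 18 08:27:28 kernel: [    1.041632] CPU1: failed to come online' \
    | count_offline_cpus   # prints: CPU1 2
```

On a real system one might run `count_offline_cpus < /var/log/kern.log`, assuming that is where the kernel log is written.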

popcornmix · 2020-07-18T19:40:01Z

From the description of the problem, firmware seems more likely, but the rpi-update test didn't seem to confirm that.
You could try with everything at latest, then manually copy the firmware (start4.elf/fixup4.dat) from a good commit and see if problem is resolved.

MikeDB1 · 2020-07-18T19:43:04Z

Could anybody tell me why I have "You’re not receiving notifications from this thread." but get an email for every post ?

pelwell · 2020-07-18T19:44:25Z

No - sorry. That's GitHub, not us.

wagnerch · 2020-07-18T20:03:26Z

I think we did that up in this comment: #232 (comment)

Gave it another go, just went about it a different way:

sudo \
SKIP_WARNING=1 \
UPDATE_SELF=0 \
    rpi-update

curl -L -A curl https://github.com/Hexxeh/rpi-firmware/tarball/bafd743eeb3e8a2a863936594cd7201a0af136fa |tar xzf - -C "/tmp/firmware" --strip-components=1
cd /tmp/firmware
cp -p *.elf /boot/
cp -p *.dat /boot/
cp -p *.bin /boot/

bafd743 is the known working one for me: all 10/10 reboots come up with all 4 cores. In this scenario, 40% of reboots failed to bring up CPU 1.

$ vcgencmd version
Jul  2 2020 14:59:18
Copyright (c) 2012 Broadcom
version 36c8be9515deddc9d2b1f469374f00d0a2df13f9 (clean) (release) (start)

$ uname -rmvs
Linux 5.4.51-v7l+ #1326 SMP Fri Jul 17 10:51:18 BST 2020 armv7l

pelwell · 2020-07-18T20:07:27Z

I just got a failure with the "suspect" commit (raspberrypi/linux@cc5c7ce) reverted, so it isn't that.

pelwell · 2020-07-20T15:34:07Z

In the bad state, retrying the wake of CPU1 doesn't help.
Working backwards through releases:

7059841 fails.
8382ece fails.
bafd743 fails.
da3752a fails.
634e380 hasn't failed yet in >30 minutes of rebooting.

I'll test the 634e380 kernel with the latest firmware next, then vice versa, and take it from there.

popcornmix · 2020-07-20T15:47:26Z

One core failing to start feels like an arm reset issue. We had this in the early days. Some chips were more susceptible than others.
The fix involved ensuring one of the stb clocks was running prior to the (synchronous) arm reset.
So my guess was firmware, and a commit related to changing when clocks were enabled.

But that doesn't seem to tie up with either of the tests.

wagnerch · 2020-07-20T21:42:50Z

Interesting, because da3752a is pretty solid for me. I didn't reboot it for 30 minutes, but over 10 cycles I never had a problem. The other thing is that if I build 5.4.51 with the GCC 7.5.0 (Ubuntu/Linaro) cross-compiler I also have no issues. I don't know if the later version of the toolchain happens to be slightly more or less optimized and it's just chance timing.

pelwell · 2020-07-21T08:23:26Z

New kernel with 634e380 firmware fails. New firmware with 634e380 kernel also fails eventually. So I retested 634e380 as a whole and did eventually get it to fail.

The first 5.4 release (f0236cc) rebooted all night, and I'll continue to bisect through the day.
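
Because the failure is intermittent, a run of clean reboots only shows a release is "good" with some probability. The following is a back-of-the-envelope sketch, assuming each boot fails independently with a fixed probability p; that independence is an assumption, and nothing in this thread establishes it. Both function names are hypothetical:

```shell
#!/bin/bash
# Chance that n consecutive boots all succeed when each boot fails
# independently with probability p (hypothetical helper).
clean_run_probability() {
    awk -v p="$1" -v n="$2" 'BEGIN { printf "%.4f\n", (1 - p) ^ n }'
}

# Smallest n such that a bad build would survive n clean boots with
# probability at most alpha: n = ceil(log(alpha) / log(1 - p)).
reboots_needed() {
    awk -v p="$1" -v alpha="$2" 'BEGIN {
        x = log(alpha) / log(1 - p)
        n = int(x)
        if (n < x) n++
        print n
    }'
}

clean_run_probability 0.6 10   # 6/10 failure rate, as in the 8382ece tally
reboots_needed 0.6 0.01
```

With a failure rate as high as the 60% reported earlier, ten clean boots are fairly convincing. On a board that fails only once in many boots, the same arithmetic calls for a far longer run, which is consistent with 634e380 looking good for 30 minutes and then eventually failing.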

timg236 · 2020-07-21T08:44:01Z

Does this ever fail with the 4.19 kernel? I’ve looked through the ARM and clock-related firmware changes and so far failed to reproduce the failure there. Although, it’s possible that both the latest firmware and 5.4 are required.

pelwell · 2020-07-21T08:50:53Z

Having found what I consider to be a 5.4 LKG I'm now working forwards, not backwards.

pelwell · 2020-07-22T13:03:54Z

Update: I think (and I can't be categorical because of the probabilistic nature of the failure) I've isolated the problem change to the kernel portion of a50c7d5 ("Bump to 5.4.45").

These commit hashes aren't all present in the current tree due to rebasing, but the last known good release is 3f54521ea and the 5.4.45 release is 9be502df. Comparing those two, 9be502df adds the following:

  Upstream:
    3604bc0 Linux 5.4.45
    40caf1b net: smsc911x: Fix runtime PM imbalance on error
    2528015 selftests: mlxsw: qos_mc_aware: Specify arping timeout as an integer
    aea1423 net: ethernet: stmmac: Enable interface clocks on probe for IPQ806x
    6992c89 net/ethernet/freescale: rework quiesce/activate for ucc_geth
    6a90489 null_blk: return error for invalid zone size
    b5cb7fe s390/mm: fix set_huge_pte_at() for empty ptes
    c0063f39 drm/edid: Add Oculus Rift S to non-desktop list
    c90e773 net: bmac: Fix read of MAC address from ROM
    92c09e8 x86/mmiotrace: Use cpumask_available() for cpumask_var_t variables
    ba55015 io_uring: initialize ctx->sqo_wait earlier
    f1c5821 i2c: altera: Fix race between xfer_msg and isr thread
    1857d7d scsi: pm: Balance pm_only counter of request queue during system resume
    1610cd9 evm: Fix RCU list related warnings
    31ca642 ARC: [plat-eznps]: Restrict to CONFIG_ISA_ARCOMPACT
    935ba01 ARC: Fix ICCM & DCCM runtime size checks
    8a69220 RDMA/qedr: Fix synchronization methods and memory leaks in qedr
    49e9267 RDMA/qedr: Fix qpids xarray api used
    0377fda s390/ftrace: save traced function caller
    0734b58 ASoC: intel - fix the card names
    6106585 spi: dw: use "smp_mb()" to avoid sending spi data error
    99c63ba powerpc/xmon: Restrict when kernel is locked down
    f2adfe1 powerpc/powernv: Avoid re-registration of imc debugfs directory
    a293045 scsi: hisi_sas: Check sas_port before using it
    cfd5ac76 drm/i915: fix port checks for MST support on gen >= 11
    74028c9 airo: Fix read overflows sending packets
    63ad3fb net: dsa: mt7530: set CPU port to fallback mode
    d628f7a scsi: ufs: Release clock if DMA map fails
    95ffc2a media: staging: ipu3-imgu: Move alignment attribute to field
    5b6e152 media: Revert "staging: imgu: Address a compiler warning on alignment"
    a122eef mmc: fix compilation of user API
    1c44e6e kernel/relay.c: handle alloc_percpu returning NULL in relay_open
    91e863a mt76: mt76x02u: Add support for newer versions of the XBox One wifi adapter
    8a6744e p54usb: add AirVasT USB stick device-id
    ac09eae HID: i2c-hid: add Schneider SCL142ALM to descriptor override
    3e8410c HID: multitouch: enable multi-input as a quirk for some devices
    aa0dd0e HID: sony: Fix for broken buttons on DS3 USB dongles
    df4988a mm: Fix mremap not considering huge pmd devmap
    3209e3e Revert "cgroup: Add memory barriers to plug cgroup_rstat_updated() race window"

  Downstream:
    9be502d w1_therm: remove redundant assignments to variable ret
    cd9e064 w1_therm: Free the correct variable
    525d235 w1_therm: adding bulk read support to trigger multiple conversion on bus
    6272c0b w1_therm: adding alarm sysfs entry
    56d2e43 w1_therm: optimizing temperature read timings
    0e55ffd w1_therm: adding eeprom sysfs entry
    6bc69d4 w1_therm: adding resolution sysfs entry
    fadb881 w1_therm: adding ext_power sysfs entry
    0931a4c5 w1_therm: fix reset_select_slave during discovery
    0a6dbaa w1_therm: adding code comments and code reordering
    3ee63cb overlays: Update upstream overlays after vc4-kms-v3d change
    20509f5 overlays: i2c-gpio: Avoid open-drain warnings
    7744086 Revert "overlays: gpio-keys: Avoid open-drain warnings"
    46b071e snd_bcm2835: disable HDMI audio when vc4 is used (#3640)
    0654fb6 vc4: cec: Restore cec physical address on reconnect
    4203e65 staging: vchiq_arm: Use g_dma_dev for dma_unmap_sg

None of those commits stand out as obvious candidates, but I think we can rule out many of them as being either for the wrong platform (i.e. not compiled) or affecting code not yet run at the point of failure. I just hope it isn't a code placement problem.

pelwell · 2020-07-23T12:51:36Z

Having moved to the 5.4.47 release (dec0ddc5) after failing to find a bad commit in 5.4.45, I think we have a culprit:

d79f26f99acb is the last known good commit, and
2c2a2ea4d585 is the first bad commit.

It's a plausible result because it's a downstream patch that applies to our platform, and it deals with low-level stuff that might be run very early in the boot process.

If anyone has a moment and a Pi to spare, build either or both of those commits and put it in some kind of a reboot loop to see how long it takes for CPU1 not to come up. N.B. Don't do this unless you have thought carefully about how to break out of the reboot loop - you need an exit strategy.

popcornmix · 2020-07-23T13:00:20Z

Let me see if I can find a board that fails somewhat reliably.
I did run a reboot loop script a couple of days ago and it did fail but took a long time.
I'll try on a few other boards in case I have a quicker one.

popcornmix · 2020-07-23T13:04:18Z

I added this just before the exit 0 on /etc/rc.local
if [ $(nproc) -eq 4 ]; then sudo reboot; fi
But as @pelwell says, you'll need to use a Linux machine to remove that line if it doesn't fail.
(if it comes up with 3 cores you will get a prompt and can edit it directly).
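
One way to build in the exit strategy mentioned above is to bound the loop and honour a stop file. This is a sketch only: the file paths, the boot limit and the function name `reboot_loop_decision` are all hypothetical choices, and the stop file is placed on /boot on the assumption that it is the FAT partition, which can be edited from any PC:

```shell
#!/bin/bash
# Sketch of a bounded reboot loop for /etc/rc.local. Paths and limits are
# hypothetical defaults, not anything specified in this thread.
STOP_FILE=${STOP_FILE:-/boot/stop-reboot-loop}       # create this file to halt the loop
COUNT_FILE=${COUNT_FILE:-/var/tmp/reboot-loop.count}  # persistent boot counter
MAX_BOOTS=${MAX_BOOTS:-200}

# Prints "reboot" or "stop". The detected CPU count is passed as $1 so the
# decision can be exercised without actually rebooting anything.
reboot_loop_decision() {
    local cpus=$1 count=0
    [ -f "$COUNT_FILE" ] && count=$(cat "$COUNT_FILE")
    count=$(( count + 1 ))
    echo "$count" > "$COUNT_FILE"
    if [ -e "$STOP_FILE" ] || [ "$cpus" -ne 4 ] || [ "$count" -ge "$MAX_BOOTS" ]; then
        echo stop
    else
        echo reboot
    fi
}

# In rc.local one might then write (deliberately left commented out here):
# [ "$(reboot_loop_decision "$(nproc)")" = reboot ] && sudo reboot
```

The counter file also records how many boots it took before a core failed to start, which is the figure being asked for.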

In an attempt to prevent the problem of CPUn not starting, explicitly misalign the scratch space used to save registers across the cache invalidation.

Notes:

At this stage in the boot process the core is running with its cache disabled. Before enabling the cache its contents must be explicitly invalidated, a process that requires quite a few registers that the caller must preserve. Evidence suggests that something is writing a block of zeroes over that space at a time when all other cores should be idle, possibly some kind of write-combiner, and the misalignment is designed to disrupt any write-coalescing.

In truth, I don't understand why this patch works, and when the failure is so random it is hard to be certain that this isn't just rolling the dice again. One interesting test would be to change the "addeq r12, #4"s to "addeq r12, #0"s to see if the offset itself is significant or just the additional code.

See: Hexxeh/rpi-firmware#232

Signed-off-by: Phil Elwell <[email protected]>

commit 79d4926021402cf03b97ba04bbfb580367d7c42a from https://github.com/raspberrypi/linux.git rpi-5.12.y (same commit message as above)

Signed-off-by: Phil Elwell <[email protected]>
Signed-off-by: Meng Li <[email protected]>

pelwell · 2021-07-08T09:36:33Z

FYI The ARMv7 startup code has been changed significantly in the 5.13 kernel; it no longer needs to save registers to the stack at that point, so my questionable patch no longer applies, and we may find that the problem no longer occurs. The commit comment includes the statement that:

This works around any issues regarding cache behavior in relation to the uncached accesses to this memory

which might be a subtle reference to this problem.

rpi-5.13.y is still very new and is lacking both some of the commits in rpi-5.10.y and a lot of testing, but I'll be curious to hear if this long-standing issue can finally be laid to rest.

A failure of some CPU cores to come online has been traced to the failure of a stm instruction while the cache is disabled. The symptom is that the saved values read back as zeroes, a catastrophic error since one of the values is a return address. This patch forces a readback and retry until the correct value is returned.

Notes:

At this stage in the boot process the core is running with its cache disabled. Before enabling the cache its contents must be explicitly invalidated, a process that requires quite a few registers that the caller must preserve. Evidence suggests that something is writing a block of zeroes over that space at a time when all other cores should be idle, possibly some kind of write-combiner, and retrying is an attempt to avoid the problem.

The previous attempted fix (forcing the accesses to only be 4-byte aligned) appears to have only worked for a while, and likely for less obvious reasons such as a change in code alignment.

See: Hexxeh/rpi-firmware#232

Signed-off-by: Phil Elwell <[email protected]>

popcornmix · 2021-09-20T16:17:41Z

The second attempt at fixing this is available in the rpi-update kernel.
If you have the issue please update and report.

commit 3e1698ed5013c1054e3dbf8a9fcd3a8549a95ece from https://github.com/raspberrypi/linux.git rpi-5.10.y (same commit message as above)

Signed-off-by: Phil Elwell <[email protected]>
Signed-off-by: Meng Li <[email protected]>

pelwell mentioned this issue Sep 20, 2021

Second attempt to fix the CPU startup failure raspberrypi/linux#4591

Merged
