assert hit on PM_IDLE exit in subsys/pm/pm.c:133 #69807

kv2019i · 2024-03-05T14:13:57Z

Describe the bug
Starting with commit b9d4b9d ("pm: Remove CURRENT_CPU macro"), SOF public test have started failing with

ASSERTION FAIL [!z_smp_cpu_mobile()] @ ZEPHYR_BASE/subsys/pm/pm.c:133
[ 220.790855] os: print_fatal_exception: ** FATAL EXCEPTION

... the system is running active use-case on core0 and the assert happens when SOF releases the IDLE state lock with
"pm_policy_state_lock_put(PM_STATE_RUNTIME_IDLE, PM_ALL_SUBSTATES);" and allow Zephyr to enter RUNTIME_IDLE on core0.

To Reproduce
Steps to reproduce the behavior:

build SOF with Zephyr for Intel MTL target
run sof-test case: TPLG=/lib/firmware/intel/development/sof-mtl-nocodec.tplg MODEL=MTLP_RVP_NOCODEC SOF_TEST_INTERVAL=5 ~/sof-test/test-case/check-playback.sh -d 10 -l 1 -r 1

Expected behavior
Test passes.

Impact
Loss of audio on end-user system.

Logs and console output
https://sof-ci.01.org/sofpr/PR8901/build3107/devicetest/index.html?model=MTLP_RVP_NOCODEC&testcase=check-playback-10sec
See the "mtrace" tab to see Zephyr log.

Environment (please complete the following information):

OS: Linux
Toolchain: XCC / cadence
commit: v3.6.0 tag (also tested with v3.6.0 plus 672c21d1db68678b171729caf746d9ddf1427548 )

Additional context
With the identified Zephyr commit revert, the same test passes reliably with no other failures seen in the test.

cc:

[BUG] DSP panic with Zephyr upstream main thesofproject/sof#8908

The text was updated successfully, but these errors were encountered:

kv2019i · 2024-03-05T16:22:26Z

We did have an earlier discussion about this in #68234 . The consensus back then was that the case I originally reported was fixed by #67863 . We are however seeing this assert hit again, even with #67863 part of the tree.

kv2019i · 2024-03-05T16:28:25Z

Notable Kconfig settings used that affect the PM state transitions:

CONFIG_ADSP_POWER_DOWN_HPSRAM=y
CONFIG_ADSP_IMR_CONTEXT_SAVE=y
CONFIG_ADSP_IDLE_CLOCK_GATING=y
CONFIG_PM_POLICY_CUSTOM=y
CONFIG_PM_DEVICE=y
CONFIG_PM_DEVICE_POWER_DOMAIN=y
CONFIG_PM_DEVICE_RUNTIME=y

Update to Zephyr in sof/main-rebase-20240305 branch of SOF project's clone of Zephyr upstream repository. Revert one Zephyr commit "pm: Remove CURRENT_CPU macro" that is leading to failed tests in SOF CI test suite. The revert allows us to update Zephyr to a newer version and tackle the SMP boot and cache interface changes in SOF. The latest Zephyr upstream has further changes needed in SOF for platform configuration and these will require separarate changes. Link: thesofproject#8818 Link: zephyrproject-rtos/zephyr#69807 Signed-off-by: Kai Vehmanen <[email protected]>

Update to Zephyr in sof/main-rebase-20240305 branch of SOF project's clone of Zephyr upstream repository. Revert one Zephyr commit "pm: Remove CURRENT_CPU macro" that is leading to failed tests in SOF CI test suite. The revert allows us to update Zephyr to a newer version and tackle the SMP boot and cache interface changes in SOF. The latest Zephyr upstream has further changes needed in SOF for platform configuration and these will require separarate changes. Link: #8818 Link: zephyrproject-rtos/zephyr#69807 Signed-off-by: Kai Vehmanen <[email protected]>

kv2019i · 2024-03-06T12:17:48Z

Still hit the error with Zephyr main of today (commit 0d5a670 ). Needed some HWMv2 work to make SOF build again.

The issue does not happen on every run, so reproduction is not 100% but with enough repeats, I still get it. The crash looks similar, so interrupt state is unexpected when system resume is done:

[37076.296076] <inf> ipc: ipc_cmd: rx   : 0x48000000|0x12
[37076.380685] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 1627, max 2743, overruns 0
[37077.404685] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 1840, max 2279, overruns 0
[37078.428685] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 1841, max 3378, overruns 0
ASSERTION FAIL [!z_smp_cpu_mobile()] @ ZEPHYR_BASE/subsys/pm/pm.c:133
[37079.232761] <err> os: print_fatal_exception:  ** FATAL EXCEPTION
[37079.232770] <err> os: print_fatal_exception:  ** CPU 0 EXCCAUSE 63 (zephyr exception)
[37079.232775] <err> os: print_fatal_exception:  **  PC 0xa0065773 VADDR (nil)
[37079.232780] <err> os: print_fatal_exception:  **  PS 0x60020
[37079.232798] <err> os: print_fatal_exception:  **    (INTLEVEL:0 EXCM: 0 UM:1 RING:0 WOE:1 OWB:0 CALLINC:2)
[37079.232803] <err> os: xtensa_dump_stack:  **  A0 0xa0063e08  SP 0xa010b290  A2 0x4  A3 0xa010b2a0
[37079.232808] <err> os: xtensa_dump_stack:  **  A4 0xa010b280  A5 0x4  A6 0xfffbfff0  A7 0x4
[37079.232815] <err> os: xtensa_dump_stack:  **  A8 0xa0063d52  A9 0xa010b220 A10 0xa00d9451 A11 0xa010b2a0
[37079.232820] <err> os: xtensa_dump_stack:  ** A12 0xa010b280 A13 0x4 A14 (nil) A15 0xffffffff
[37079.232826] <err> os: xtensa_dump_stack:  ** LBEG 0xa0062e3b LEND 0xa0062e48 LCOUNT 0xa003669e
[37079.232831] <err> os: xtensa_dump_stack:  ** SAR 0xc
[37079.232836] <err> os: xtensa_dump_stack:  **  THREADPTR 0xa00da203


Backtrace:0xa0065770:0xa010b290 0xa0063e05:0xa010b2a0 0xa0036c6c:0xa010b2c0 0xa0037107:0xa010b2f0 0xa00847d3:0xa010b310 0xa0063ddb:0xa010b330

[37079.232910] <err> os: z_fatal_error: >>> ZEPHYR FATAL ERROR 4: Kernel panic on CPU 0
[37079.232918] <err> os: z_fatal_error: Current thread: 0x4010e030 (unknown)
[37079.234985] <err> coredump_error: coredump_mem_window_backend_start: #CD:BEGIN#
[37079.235156] <err> coredump_error: coredump_mem_window_backend_end: #CD:END#
[37079.235273] <err> zephyr: k_sys_fatal_error_handler: Halting system

This reverts commit b9d4b9d. System resume fails occasionally on Intel ACE platform to assert with this patch in place. Revert the change until a proper fix is found. Link: zephyrproject-rtos#69807 Signed-off-by: Kai Vehmanen <[email protected]>

ceolin · 2024-03-09T16:19:14Z

@kv2019i does it happens using a386493 ? The reason to see this problem is because intel_adsp/ace power management is unlocking interruptions before the pm subsystem calling pm_state_exit_post_ops.

kv2019i · 2024-03-11T11:27:34Z

@ceolin wrote:

@kv2019i does it happens using a386493 ? The reason to see this problem is because intel_adsp/ace power management is unlocking interruptions before the pm subsystem calling pm_state_exit_post_ops.

In local test this at least affects the reproduction rate. I now scheduled a larger test plan with SOF CI infra at thesofproject/sof#8928 .. will update results here

UPDATE: results seem good, there's one fail on MTL, but the DSP panics are gone:
https://sof-ci.01.org/sofpr/PR8928/build3255/devicetest/index.html

kv2019i · 2024-03-11T12:40:57Z

@ceolin It does seem the power gated state is not entered when the bug occurs and PR68493 does help . See comment in #68493 (comment)

In case the core is not power gated, waiti will restore intlevel. In this case we lock interruption after it. In the bug scenario, the host starts streaming and via SOF APIs, keeps a lock to prevent Zephyr from entering PM_STATE_RUNTIME_IDLE. During the test case, host removes this block and core0 is allowed to enter IDLE state. When core0 enters power gated state, interrrupts are left enabled (so the core can be woken up when something happens). This leaves a race where suitably timed interrupt will actually block entry to power gated state and k_cpu_idle() in power_gate_entry() will return. This is rare, but happens often enough that the relatively short test plan run on SOF pull-requests will trigger this case. Fixes zephyrproject-rtos#69807 Signed-off-by: Flavio Ceolin <[email protected]> Signed-off-by: Anas Nashif <[email protected]>

In case the core is not power gated, waiti will restore intlevel. In this case we lock interruption after it. In the bug scenario, the host starts streaming and via SOF APIs, keeps a lock to prevent Zephyr from entering PM_STATE_RUNTIME_IDLE. During the test case, host removes this block and core0 is allowed to enter IDLE state. When core0 enters power gated state, interrrupts are left enabled (so the core can be woken up when something happens). This leaves a race where suitably timed interrupt will actually block entry to power gated state and k_cpu_idle() in power_gate_entry() will return. This is rare, but happens often enough that the relatively short test plan run on SOF pull-requests will trigger this case. Fixes #69807 Signed-off-by: Flavio Ceolin <[email protected]> Signed-off-by: Anas Nashif <[email protected]> (cherry picked from commit 07426a8)

marc-hb · 2024-03-13T00:24:04Z

Incredibly stupid, Github auto-close anti-feature: https://github.com/orgs/community/discussions/17308

The tentative fix is ~~in a different repo~~ and not even part of sof/west.yml yet!

Issues should be closed when tests are passing again and never before.

kv2019i · 2024-03-13T11:44:02Z

@marc-hb We do have a separate bug in SOF to track this issue thesofproject/sof#8908 , so I think we can close this bug on Zephyr as I've tested the fix. We can use the SOF bug to track integration of the fix into SOF (but that's outside Zephyr). Let's keep open for 24h as we complete more testing.

marc-hb · 2024-03-13T14:27:23Z

Sorry I confused this with thesofproject/sof#8908

(the feature is still stupid, it didn't know you tested it)

In case the core is not power gated, waiti will restore intlevel. In this case we lock interruption after it. In the bug scenario, the host starts streaming and via SOF APIs, keeps a lock to prevent Zephyr from entering PM_STATE_RUNTIME_IDLE. During the test case, host removes this block and core0 is allowed to enter IDLE state. When core0 enters power gated state, interrrupts are left enabled (so the core can be woken up when something happens). This leaves a race where suitably timed interrupt will actually block entry to power gated state and k_cpu_idle() in power_gate_entry() will return. This is rare, but happens often enough that the relatively short test plan run on SOF pull-requests will trigger this case. Fixes zephyrproject-rtos#69807 Signed-off-by: Flavio Ceolin <[email protected]> Signed-off-by: Anas Nashif <[email protected]>

In case the core is not power gated, waiti will restore intlevel. In this case we lock interruption after it. In the bug scenario, the host starts streaming and via SOF APIs, keeps a lock to prevent Zephyr from entering PM_STATE_RUNTIME_IDLE. During the test case, host removes this block and core0 is allowed to enter IDLE state. When core0 enters power gated state, interrrupts are left enabled (so the core can be woken up when something happens). This leaves a race where suitably timed interrupt will actually block entry to power gated state and k_cpu_idle() in power_gate_entry() will return. This is rare, but happens often enough that the relatively short test plan run on SOF pull-requests will trigger this case. Fixes #69807 Signed-off-by: Flavio Ceolin <[email protected]> Signed-off-by: Anas Nashif <[email protected]> (cherry picked from commit 07426a8)

kv2019i added bug The issue is a bug, or the PR is fixing a bug platform: Intel ADSP Intel Audio platforms labels Mar 5, 2024

kv2019i assigned ceolin Mar 5, 2024

This was referenced Mar 5, 2024

kernel: Update k_wakeup() #69255

Merged

west.yml: update Zephyr to 9d1df132b1c + one revert thesofproject/sof#8903

Merged

kv2019i mentioned this issue Mar 5, 2024

[BUG] Zephyr baseline update blocked thesofproject/sof#8818

Closed

marc-hb mentioned this issue Mar 5, 2024

pm: Remove CURRENT_CPU macro #67753

Merged

aescolar added priority: medium Medium impact/importance bug area: Power Management labels Mar 5, 2024

kv2019i mentioned this issue Mar 6, 2024

[BUG] DSP panic with Zephyr upstream main thesofproject/sof#8908

Closed

kv2019i mentioned this issue Mar 6, 2024

west.yml: update Zephyr to f9f44b6dcdd (Mar 07) thesofproject/sof#8913

Merged

kv2019i mentioned this issue Mar 7, 2024

Revert "pm: Remove CURRENT_CPU macro" #69937

Closed

marc-hb mentioned this issue Mar 8, 2024

kernel: smp: Optimize delivery of IPIs #69770

Merged

marc-hb added priority: high High impact/importance bug and removed priority: medium Medium impact/importance bug labels Mar 8, 2024

kv2019i mentioned this issue Mar 11, 2024

intel_adsp/ace: power: Lock interruption when power gate fails #68493

Merged

dleach02 closed this as completed in 07426a8 Mar 12, 2024

marc-hb reopened this Mar 13, 2024

marc-hb closed this as completed Mar 13, 2024

kv2019i mentioned this issue Mar 13, 2024

assert hit on non-zero core PM_SOFT_OFF exit in subsys/pm/pm.c:133 #70181

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

assert hit on PM_IDLE exit in subsys/pm/pm.c:133 #69807

assert hit on PM_IDLE exit in subsys/pm/pm.c:133 #69807

kv2019i commented Mar 5, 2024 •

edited by marc-hb

Loading

kv2019i commented Mar 5, 2024

kv2019i commented Mar 5, 2024

kv2019i commented Mar 6, 2024

ceolin commented Mar 9, 2024

kv2019i commented Mar 11, 2024 •

edited

Loading

kv2019i commented Mar 11, 2024

marc-hb commented Mar 13, 2024 •

edited

Loading

kv2019i commented Mar 13, 2024

marc-hb commented Mar 13, 2024 •

edited

Loading

assert hit on PM_IDLE exit in subsys/pm/pm.c:133 #69807

assert hit on PM_IDLE exit in subsys/pm/pm.c:133 #69807

Comments

kv2019i commented Mar 5, 2024 • edited by marc-hb Loading

kv2019i commented Mar 5, 2024

kv2019i commented Mar 5, 2024

kv2019i commented Mar 6, 2024

ceolin commented Mar 9, 2024

kv2019i commented Mar 11, 2024 • edited Loading

kv2019i commented Mar 11, 2024

marc-hb commented Mar 13, 2024 • edited Loading

kv2019i commented Mar 13, 2024

marc-hb commented Mar 13, 2024 • edited Loading

kv2019i commented Mar 5, 2024 •

edited by marc-hb

Loading

kv2019i commented Mar 11, 2024 •

edited

Loading

marc-hb commented Mar 13, 2024 •

edited

Loading

marc-hb commented Mar 13, 2024 •

edited

Loading