Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

assert hit on PM_IDLE exit in subsys/pm/pm.c:133 #69807

Closed
kv2019i opened this issue Mar 5, 2024 · 9 comments
Closed

assert hit on PM_IDLE exit in subsys/pm/pm.c:133 #69807

kv2019i opened this issue Mar 5, 2024 · 9 comments
Assignees
Labels
area: Power Management bug The issue is a bug, or the PR is fixing a bug platform: Intel ADSP Intel Audio platforms priority: high High impact/importance bug

Comments

@kv2019i
Copy link
Collaborator

kv2019i commented Mar 5, 2024

Describe the bug
Starting with commit b9d4b9d ("pm: Remove CURRENT_CPU macro"), SOF public test have started failing with

ASSERTION FAIL [!z_smp_cpu_mobile()] @ ZEPHYR_BASE/subsys/pm/pm.c:133
[ 220.790855] os: print_fatal_exception: ** FATAL EXCEPTION

... the system is running active use-case on core0 and the assert happens when SOF releases the IDLE state lock with
"pm_policy_state_lock_put(PM_STATE_RUNTIME_IDLE, PM_ALL_SUBSTATES);" and allow Zephyr to enter RUNTIME_IDLE on core0.

To Reproduce
Steps to reproduce the behavior:

  1. build SOF with Zephyr for Intel MTL target
  2. run sof-test case: TPLG=/lib/firmware/intel/development/sof-mtl-nocodec.tplg MODEL=MTLP_RVP_NOCODEC SOF_TEST_INTERVAL=5 ~/sof-test/test-case/check-playback.sh -d 10 -l 1 -r 1

Expected behavior
Test passes.

Impact
Loss of audio on end-user system.

Logs and console output
https://sof-ci.01.org/sofpr/PR8901/build3107/devicetest/index.html?model=MTLP_RVP_NOCODEC&testcase=check-playback-10sec
See the "mtrace" tab to see Zephyr log.

Environment (please complete the following information):

  • OS: Linux
  • Toolchain: XCC / cadence
  • commit: v3.6.0 tag (also tested with v3.6.0 plus 672c21d1db68678b171729caf746d9ddf1427548 )

Additional context
With the identified Zephyr commit revert, the same test passes reliably with no other failures seen in the test.

cc:

@kv2019i kv2019i added bug The issue is a bug, or the PR is fixing a bug platform: Intel ADSP Intel Audio platforms labels Mar 5, 2024
@kv2019i
Copy link
Collaborator Author

kv2019i commented Mar 5, 2024

We did have an earlier discussion about this in #68234 . The consensus back then was that the case I originally reported was fixed by #67863 . We are however seeing this assert hit again, even with #67863 part of the tree.

@kv2019i
Copy link
Collaborator Author

kv2019i commented Mar 5, 2024

Notable Kconfig settings used that affect the PM state transitions:

CONFIG_ADSP_POWER_DOWN_HPSRAM=y
CONFIG_ADSP_IMR_CONTEXT_SAVE=y
CONFIG_ADSP_IDLE_CLOCK_GATING=y
CONFIG_PM_POLICY_CUSTOM=y
CONFIG_PM_DEVICE=y
CONFIG_PM_DEVICE_POWER_DOMAIN=y
CONFIG_PM_DEVICE_RUNTIME=y

@aescolar aescolar added priority: medium Medium impact/importance bug area: Power Management labels Mar 5, 2024
kv2019i added a commit to kv2019i/sof that referenced this issue Mar 5, 2024
Update to Zephyr in sof/main-rebase-20240305 branch of SOF
project's clone of Zephyr upstream repository. Revert one
Zephyr commit "pm: Remove CURRENT_CPU macro" that is leading to
failed tests in SOF CI test suite.

The revert allows us to update Zephyr to a newer version and tackle the
SMP boot and cache interface changes in SOF. The latest Zephyr upstream
has further changes needed in SOF for platform configuration and these
will require separarate changes.

Link: thesofproject#8818
Link: zephyrproject-rtos/zephyr#69807
Signed-off-by: Kai Vehmanen <[email protected]>
dbaluta pushed a commit to thesofproject/sof that referenced this issue Mar 6, 2024
Update to Zephyr in sof/main-rebase-20240305 branch of SOF
project's clone of Zephyr upstream repository. Revert one
Zephyr commit "pm: Remove CURRENT_CPU macro" that is leading to
failed tests in SOF CI test suite.

The revert allows us to update Zephyr to a newer version and tackle the
SMP boot and cache interface changes in SOF. The latest Zephyr upstream
has further changes needed in SOF for platform configuration and these
will require separarate changes.

Link: #8818
Link: zephyrproject-rtos/zephyr#69807
Signed-off-by: Kai Vehmanen <[email protected]>
@kv2019i
Copy link
Collaborator Author

kv2019i commented Mar 6, 2024

Still hit the error with Zephyr main of today (commit 0d5a670 ). Needed some HWMv2 work to make SOF build again.

The issue does not happen on every run, so reproduction is not 100% but with enough repeats, I still get it. The crash looks similar, so interrupt state is unexpected when system resume is done:

[37076.296076] <inf> ipc: ipc_cmd: rx   : 0x48000000|0x12
[37076.380685] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 1627, max 2743, overruns 0
[37077.404685] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 1840, max 2279, overruns 0
[37078.428685] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 1841, max 3378, overruns 0
ASSERTION FAIL [!z_smp_cpu_mobile()] @ ZEPHYR_BASE/subsys/pm/pm.c:133
[37079.232761] <err> os: print_fatal_exception:  ** FATAL EXCEPTION
[37079.232770] <err> os: print_fatal_exception:  ** CPU 0 EXCCAUSE 63 (zephyr exception)
[37079.232775] <err> os: print_fatal_exception:  **  PC 0xa0065773 VADDR (nil)
[37079.232780] <err> os: print_fatal_exception:  **  PS 0x60020
[37079.232798] <err> os: print_fatal_exception:  **    (INTLEVEL:0 EXCM: 0 UM:1 RING:0 WOE:1 OWB:0 CALLINC:2)
[37079.232803] <err> os: xtensa_dump_stack:  **  A0 0xa0063e08  SP 0xa010b290  A2 0x4  A3 0xa010b2a0
[37079.232808] <err> os: xtensa_dump_stack:  **  A4 0xa010b280  A5 0x4  A6 0xfffbfff0  A7 0x4
[37079.232815] <err> os: xtensa_dump_stack:  **  A8 0xa0063d52  A9 0xa010b220 A10 0xa00d9451 A11 0xa010b2a0
[37079.232820] <err> os: xtensa_dump_stack:  ** A12 0xa010b280 A13 0x4 A14 (nil) A15 0xffffffff
[37079.232826] <err> os: xtensa_dump_stack:  ** LBEG 0xa0062e3b LEND 0xa0062e48 LCOUNT 0xa003669e
[37079.232831] <err> os: xtensa_dump_stack:  ** SAR 0xc
[37079.232836] <err> os: xtensa_dump_stack:  **  THREADPTR 0xa00da203


Backtrace:0xa0065770:0xa010b290 0xa0063e05:0xa010b2a0 0xa0036c6c:0xa010b2c0 0xa0037107:0xa010b2f0 0xa00847d3:0xa010b310 0xa0063ddb:0xa010b330

[37079.232910] <err> os: z_fatal_error: >>> ZEPHYR FATAL ERROR 4: Kernel panic on CPU 0
[37079.232918] <err> os: z_fatal_error: Current thread: 0x4010e030 (unknown)
[37079.234985] <err> coredump_error: coredump_mem_window_backend_start: #CD:BEGIN#
[37079.235156] <err> coredump_error: coredump_mem_window_backend_end: #CD:END#
[37079.235273] <err> zephyr: k_sys_fatal_error_handler: Halting system

kv2019i added a commit to kv2019i/zephyr that referenced this issue Mar 7, 2024
This reverts commit b9d4b9d.

System resume fails occasionally on Intel ACE platform to assert with
this patch in place. Revert the change until a proper fix is found.

Link: zephyrproject-rtos#69807
Signed-off-by: Kai Vehmanen <[email protected]>
@marc-hb marc-hb added priority: high High impact/importance bug and removed priority: medium Medium impact/importance bug labels Mar 8, 2024
@ceolin
Copy link
Member

ceolin commented Mar 9, 2024

@kv2019i does it happens using a386493 ? The reason to see this problem is because intel_adsp/ace power management is unlocking interruptions before the pm subsystem calling pm_state_exit_post_ops.

@kv2019i
Copy link
Collaborator Author

kv2019i commented Mar 11, 2024

@ceolin wrote:

@kv2019i does it happens using a386493 ? The reason to see this problem is because intel_adsp/ace power management is unlocking interruptions before the pm subsystem calling pm_state_exit_post_ops.

In local test this at least affects the reproduction rate. I now scheduled a larger test plan with SOF CI infra at thesofproject/sof#8928 .. will update results here

UPDATE: results seem good, there's one fail on MTL, but the DSP panics are gone:
https://sof-ci.01.org/sofpr/PR8928/build3255/devicetest/index.html

@kv2019i
Copy link
Collaborator Author

kv2019i commented Mar 11, 2024

@ceolin It does seem the power gated state is not entered when the bug occurs and PR68493 does help . See comment in #68493 (comment)

nashif pushed a commit to ceolin/zephyr that referenced this issue Mar 12, 2024
In case the core is not power gated, waiti will restore intlevel. In
this case we lock interruption after it.

In the bug scenario, the host starts streaming and via SOF APIs, keeps a
lock to prevent Zephyr from entering PM_STATE_RUNTIME_IDLE. During the
test case, host removes this block and core0 is allowed to enter IDLE
state.

When core0 enters power gated state, interrrupts are left enabled (so
the core can be woken up when something happens). This leaves a race
where suitably timed interrupt will actually block entry to power gated
state and k_cpu_idle() in power_gate_entry() will return. This is rare,
but happens often enough that the relatively short test plan run on SOF
pull-requests will trigger this case.

Fixes zephyrproject-rtos#69807

Signed-off-by: Flavio Ceolin <[email protected]>
Signed-off-by: Anas Nashif <[email protected]>
zephyrbot pushed a commit that referenced this issue Mar 12, 2024
In case the core is not power gated, waiti will restore intlevel. In
this case we lock interruption after it.

In the bug scenario, the host starts streaming and via SOF APIs, keeps a
lock to prevent Zephyr from entering PM_STATE_RUNTIME_IDLE. During the
test case, host removes this block and core0 is allowed to enter IDLE
state.

When core0 enters power gated state, interrrupts are left enabled (so
the core can be woken up when something happens). This leaves a race
where suitably timed interrupt will actually block entry to power gated
state and k_cpu_idle() in power_gate_entry() will return. This is rare,
but happens often enough that the relatively short test plan run on SOF
pull-requests will trigger this case.

Fixes #69807

Signed-off-by: Flavio Ceolin <[email protected]>
Signed-off-by: Anas Nashif <[email protected]>
(cherry picked from commit 07426a8)
@marc-hb
Copy link
Collaborator

marc-hb commented Mar 13, 2024

Incredibly stupid, Github auto-close anti-feature: https://github.com/orgs/community/discussions/17308

The tentative fix is in a different repo and not even part of sof/west.yml yet!

Issues should be closed when tests are passing again and never before.

@marc-hb marc-hb reopened this Mar 13, 2024
@kv2019i
Copy link
Collaborator Author

kv2019i commented Mar 13, 2024

@marc-hb We do have a separate bug in SOF to track this issue thesofproject/sof#8908 , so I think we can close this bug on Zephyr as I've tested the fix. We can use the SOF bug to track integration of the fix into SOF (but that's outside Zephyr). Let's keep open for 24h as we complete more testing.

@marc-hb
Copy link
Collaborator

marc-hb commented Mar 13, 2024

Sorry I confused this with thesofproject/sof#8908

(the feature is still stupid, it didn't know you tested it)

@marc-hb marc-hb closed this as completed Mar 13, 2024
ashiroji pushed a commit to ashiroji/zephyr that referenced this issue Mar 19, 2024
In case the core is not power gated, waiti will restore intlevel. In
this case we lock interruption after it.

In the bug scenario, the host starts streaming and via SOF APIs, keeps a
lock to prevent Zephyr from entering PM_STATE_RUNTIME_IDLE. During the
test case, host removes this block and core0 is allowed to enter IDLE
state.

When core0 enters power gated state, interrrupts are left enabled (so
the core can be woken up when something happens). This leaves a race
where suitably timed interrupt will actually block entry to power gated
state and k_cpu_idle() in power_gate_entry() will return. This is rare,
but happens often enough that the relatively short test plan run on SOF
pull-requests will trigger this case.

Fixes zephyrproject-rtos#69807

Signed-off-by: Flavio Ceolin <[email protected]>
Signed-off-by: Anas Nashif <[email protected]>
henrikbrixandersen pushed a commit that referenced this issue Mar 29, 2024
In case the core is not power gated, waiti will restore intlevel. In
this case we lock interruption after it.

In the bug scenario, the host starts streaming and via SOF APIs, keeps a
lock to prevent Zephyr from entering PM_STATE_RUNTIME_IDLE. During the
test case, host removes this block and core0 is allowed to enter IDLE
state.

When core0 enters power gated state, interrrupts are left enabled (so
the core can be woken up when something happens). This leaves a race
where suitably timed interrupt will actually block entry to power gated
state and k_cpu_idle() in power_gate_entry() will return. This is rare,
but happens often enough that the relatively short test plan run on SOF
pull-requests will trigger this case.

Fixes #69807

Signed-off-by: Flavio Ceolin <[email protected]>
Signed-off-by: Anas Nashif <[email protected]>
(cherry picked from commit 07426a8)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Power Management bug The issue is a bug, or the PR is fixing a bug platform: Intel ADSP Intel Audio platforms priority: high High impact/importance bug
Projects
None yet
Development

No branches or pull requests

4 participants