intel_adsp/ace: power: Lock interruption when power gate fails #68493

ceolin · 2024-02-02T20:24:27Z

If the core is not power gated, waiti will restore intlevel. In that case we lock interrupts again after it returns.

andyross

Needs to use the correct API to mask interrupts. But I'm confused as to why this is needed or what the circumstances are under which the WAITI returns? If this is a "should never happen" maybe a k_panic() is a better idea?

andyross · 2024-02-02T23:21:15Z

soc/xtensa/intel_adsp/ace/power.c

+	/* It is unlikely we get in here, but when this happens
+	 * we need to lock interruptions again.
+	 */
+	(void)irq_lock();


Needs to be arch_irq_lock(), or else an RSIL instruction or some other sequence to directly set PS. When SMP=y, irq_lock() is a global recursive lock that emulates the traditional API for the benefit of drivers not yet ported to spinlocks, not what you want at all (and dangerous if there are other irq_lock() users, as it looks to me like this call is unmatched and would produce a deadlock if someone else tries it).

Ouch, I thought it was just an indirection to arch_irq_lock(). Yep, it has to be arch_irq_lock(). It is restored later by calling rsil directly.
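To make the distinction in this exchange concrete, here is a minimal toy model, not Zephyr code. It assumes an SMP-style irq_lock() masks locally and then takes a global recursive lock, while an arch_irq_lock()-style call only changes the calling core's interrupt level. Every `toy_` name, the `TOY_NO_OWNER` constant, and the lock level of 15 are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: none of these names exist in Zephyr. */
#define TOY_NO_OWNER (-1)

struct toy_cpu {
	int id;
	uint32_t intlevel;       /* stands in for PS.INTLEVEL on this core */
	unsigned int lock_depth; /* this CPU's recursion count on the global lock */
};

static int toy_global_owner = TOY_NO_OWNER; /* id of the CPU holding the global lock */

/* Local-only masking, in the spirit of arch_irq_lock(): returns the old level. */
static inline uint32_t toy_arch_irq_lock(struct toy_cpu *cpu)
{
	uint32_t key = cpu->intlevel;

	cpu->intlevel = 15; /* assumed "mask everything" level for this model */
	return key;
}

static inline void toy_arch_irq_unlock(struct toy_cpu *cpu, uint32_t key)
{
	cpu->intlevel = key;
}

/* SMP-style irq_lock(): mask locally, then take a global recursive lock.
 * Returns false where a real CPU would spin waiting for the current owner.
 */
static inline bool toy_smp_irq_lock(struct toy_cpu *cpu)
{
	(void)toy_arch_irq_lock(cpu);
	if (toy_global_owner != TOY_NO_OWNER && toy_global_owner != cpu->id) {
		return false;
	}
	toy_global_owner = cpu->id;
	cpu->lock_depth++;
	return true;
}

/* Simplified unlock: the last matching unlock releases the global lock
 * and unmasks the core.
 */
static inline void toy_smp_irq_unlock(struct toy_cpu *cpu)
{
	assert(toy_global_owner == cpu->id && cpu->lock_depth > 0);
	if (--cpu->lock_depth == 0) {
		toy_global_owner = TOY_NO_OWNER;
		cpu->intlevel = 0;
	}
}
```

In this model, an unmatched `toy_smp_irq_lock()` on one CPU leaves `toy_global_owner` set, so the same call on any other CPU returns false (the point where real code would spin). `toy_arch_irq_lock()` never touches the global owner, which is why it suits a per-core power path.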

andyross · 2024-02-02T23:22:07Z

soc/xtensa/intel_adsp/ace/power.c

+	/* It is unlikely we get in here, but when this happens
+	 * we need to lock interruptions again.
+	 */
+	(void)irq_lock();
 	z_xt_ints_off(0xffffffff);


Looks like this line is already masking all interrupts, though at one level higher in the abstraction stack (INTENABLE instead of PS.INTLVL). So why do you need both?

Because Zephyr only looks at PS.INTLEVEL: without this it assumes that interrupts are not masked and triggers an assert.
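The two masking mechanisms in this exchange can be told apart with a small toy model. It is a sketch under stated assumptions, not Zephyr or Xtensa HAL code. It assumes PS.INTLEVEL sits in the low four bits of PS, that an interrupt is taken only if its INTENABLE bit is set and its priority is above PS.INTLEVEL, and that a PS-only check is what decides whether interrupts count as "locked". All `toy_` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: these names are illustrative, not Zephyr or Xtensa HAL APIs. */
#define TOY_PS_INTLEVEL_MASK 0xfu /* PS.INTLEVEL: low four bits of PS */

struct toy_core {
	uint32_t ps;        /* processor state register */
	uint32_t intenable; /* per-interrupt enable bitmask */
};

/* What a PS-only "are interrupts locked?" check concludes. */
static inline bool toy_kernel_sees_locked(const struct toy_core *core)
{
	return (core->ps & TOY_PS_INTLEVEL_MASK) != 0;
}

/* Whether interrupt `irq` at priority `level` could actually be taken. */
static inline bool toy_irq_deliverable(const struct toy_core *core,
				       unsigned int irq, uint32_t level)
{
	assert(irq < 32);
	return (core->intenable & (UINT32_C(1) << irq)) != 0 &&
	       level > (core->ps & TOY_PS_INTLEVEL_MASK);
}
```

With `intenable == 0` and `ps == 0`, no interrupt is deliverable in this model, yet `toy_kernel_sees_locked()` still returns false. That is the mismatch the reply describes: masking through INTENABLE alone is invisible to a check that reads only PS.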

ceolin · 2024-02-03T19:31:22Z

Needs to use the correct API to mask interrupts. But I'm confused as to why this is needed or what the circumstances are under which the WAITI returns? If this is a "should never happen" maybe a k_panic() is a better idea?

It is not clear to me when this can happen in HW. I had removed it in a different PR and was asked to bring it back, on the grounds that something could happen and prevent the power gate. That said, the simulator that we have does not power gate the core, so this is needed there as well.

ceolin · 2024-02-06T14:31:43Z

@andyross would you mind checking it again?

andyross

It is not clear to me when this can happen in HW. I had removed it in a different PR and was asked to bring it back, on the grounds that something could happen and prevent the power gate.

Can you get a proper explanation from whoever made that request? :)

Again, as I read this patch it's a noop. The adjacent line is already masking all interrupts through a different mechanism. Both/either have the effect of causing the core to resume handling thread code but with interrupts masked, which can't possibly be correct: you wouldn't be able to receive the IDC interrupt delivering the "shut down now" command from the host (exactly the action that seems to have failed in the first place). So the only way to recover would be to hard-bounce the whole DSP anyway, which obviously doesn't care about interrupt masking state.

Basically I just don't see it. If this needs to be handled, this isn't the right way to handle it. If it shouldn't ever happen but might, it should be a panic.

All that said: this isn't a -1 over this issue. It's a noop, after all, and it's not my job to tell Intel how to manage its own hardware. (For propriety reasons. But also to keep powder dry for fights over SOF where I do have more of a stake, heh.)

ceolin · 2024-02-06T20:33:44Z

It is not clear to me when this can happen in HW. I had removed it in a different PR and was asked to bring it back, on the grounds that something could happen and prevent the power gate.

Can you get a proper explanation from whoever made that request? :)

Again, as I read this patch it's a noop. The adjacent line is already masking all interrupts through a different mechanism. Both/either have the effect of causing the core to resume handling thread code but with interrupts masked, which can't possibly be correct: you wouldn't be able to receive the IDC interrupt delivering the "shut down now" command from the host (exactly the action that seems to have failed in the first place). So the only way to recover would be to hard-bounce the whole DSP anyway, which obviously doesn't care about interrupt masking state.

Nope, if that code is executed, we return to the idle thread, which ends up calling pm_state_exit_post_ops(), which restores interrupts properly.

Basically I just don't see it. If this needs to be handled, this isn't the right way to handle it. If it shouldn't ever happen but might, it should be a panic.

I know we are talking upstream here, but we have a simulator used for validation where this happens (the SoC is not power gated), and since I was told that this may happen in HW, it seems better to me to have it. Also, the system goes idle and execution continues correctly.

All that said: this isn't a -1 over this issue. It's a noop, after all, and it's not my job to tell Intel how to manage its own hardware. (For propriety reasons. But also to keep powder dry for fights over SOF where I do have more of a stake, heh.)

That is not a noop. Zephyr only looks at PS.INTLEVEL to check whether interrupts are locked; without it, e.g. any use of _current_cpu will trigger an assert.

andyross · 2024-02-13T17:55:27Z

Not a -1, but still asking nicely if the commit message can be clarified to explain why this is necessary and under what circumstances this code can execute. Because it's not actually testable, right? Has this actually been observed to happen?

kv2019i · 2024-03-11T12:39:45Z

@andyross @ceolin I think this can actually fix #69807.

In the bug scenario, the host starts streaming and, via SOF APIs, holds a lock that prevents Zephyr from entering PM_STATE_RUNTIME_IDLE. During the test case, the host removes this block and core0 is allowed to enter the IDLE state.

When core0 enters the power gated state, interrupts are left enabled (so the core can be woken up when something happens). This leaves a race where a suitably timed interrupt will actually block entry to the power gated state, and k_cpu_idle() in power_gate_entry() will return. This is rare, but happens often enough that the relatively short test plan run on SOF pull requests will trigger this case.

Without this patch, current Zephyr main will hit a DSP panic as described in #69807.

Can you please revisit? The commit message was updated.
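The race described in this comment can be sketched as a toy control-flow model. This is not the actual power.c code. It assumes idle entry drops the interrupt level to 0, that a pending interrupt prevents the power gate so execution falls through, and that the fall-through path then re-locks before returning. The `toy_` names, the `relock` switch, and `TOY_LOCK_LEVEL` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: names and the lock level are assumptions for illustration. */
#define TOY_LOCK_LEVEL 5u

struct toy_dsp_core {
	uint32_t intlevel;  /* stands in for PS.INTLEVEL */
	uint32_t intenable; /* stands in for INTENABLE */
	bool irq_pending;   /* a wake interrupt raced with idle entry */
	bool powered;
};

/* Idle entry: unmask at level 0; a pending interrupt blocks the power gate. */
static inline void toy_cpu_idle(struct toy_dsp_core *core)
{
	core->intlevel = 0;
	if (!core->irq_pending) {
		core->powered = false; /* gated: execution would not continue here */
	}
}

/* Returns true if the core was gated, false if idle returned early. */
static inline bool toy_power_gate_entry(struct toy_dsp_core *core, bool relock)
{
	assert(core->powered);
	toy_cpu_idle(core);
	if (!core->powered) {
		return true;
	}
	/* Fall-through path discussed in the review. */
	if (relock) {
		core->intlevel = TOY_LOCK_LEVEL; /* models the arch_irq_lock() step */
	}
	core->intenable = 0; /* models the z_xt_ints_off(0xffffffff) step */
	return false;
}
```

When `irq_pending` is set, the model returns false. With `relock` disabled the interrupt level stays at 0 on the way out, which is the state a PS-only lock check would flag. With `relock` enabled the level is back at `TOY_LOCK_LEVEL` before the caller resumes.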

kv2019i

Very minor typo in commit "thatthe", but good otherwise. Thanks @nashif !

In case the core is not power gated, waiti will restore intlevel. In this case we lock interruption after it.

In the bug scenario, the host starts streaming and via SOF APIs, keeps a lock to prevent Zephyr from entering PM_STATE_RUNTIME_IDLE. During the test case, host removes this block and core0 is allowed to enter IDLE state.

When core0 enters power gated state, interrupts are left enabled (so the core can be woken up when something happens). This leaves a race where suitably timed interrupt will actually block entry to power gated state and k_cpu_idle() in power_gate_entry() will return. This is rare, but happens often enough that the relatively short test plan run on SOF pull-requests will trigger this case.

Fixes zephyrproject-rtos#69807

Signed-off-by: Flavio Ceolin <[email protected]>
Signed-off-by: Anas Nashif <[email protected]>

nashif · 2024-03-12T14:13:49Z

Very minor typo in commit "thatthe", but good otherwise. Thanks @nashif !

fixed

nashif · 2024-03-12T16:58:44Z

@andyross can you please take another look at this?

ceolin requested review from nashif and tmleman February 2, 2024 20:24

zephyrbot added the platform: Intel ADSP Intel Audio platforms label Feb 2, 2024

zephyrbot requested review from andyross, dcpleung, jxstelter, kv2019i, lgirdwood, lyakh, marc-hb, marcinszkudlinski and softwarecki February 2, 2024 20:25

zephyrbot assigned nashif Feb 2, 2024

nashif added this to the v3.6.0 milestone Feb 2, 2024

dcpleung previously approved these changes Feb 2, 2024

andyross previously requested changes Feb 2, 2024

ceolin dismissed dcpleung’s stale review via 1743b31 February 3, 2024 03:28

ceolin force-pushed the intel-adsp-lock branch from 46ac84e to 1743b31 Compare February 3, 2024 03:28

ceolin requested a review from andyross February 3, 2024 03:28

ceolin force-pushed the intel-adsp-lock branch from 1743b31 to f2252be Compare February 3, 2024 03:59

ceolin force-pushed the intel-adsp-lock branch from f2252be to a386493 Compare February 3, 2024 19:32

andyross reviewed Feb 6, 2024

andyross self-requested a review February 6, 2024 18:50

kv2019i previously approved these changes Feb 12, 2024

dcpleung previously approved these changes Feb 12, 2024

henrikbrixandersen added the backport v3.6-branch label Feb 21, 2024

henrikbrixandersen modified the milestones: v3.6.0, v3.7.0 Feb 21, 2024

This was referenced Mar 11, 2024

assert hit on PM_IDLE exit in subsys/pm/pm.c:133 #69807

Closed

[BUG] DSP panic with Zephyr upstream main thesofproject/sof#8908

Closed

Revert "pm: Remove CURRENT_CPU macro" #69937

Closed

nashif dismissed stale reviews from dcpleung and kv2019i via 74a879c March 11, 2024 19:29

nashif force-pushed the intel-adsp-lock branch from a386493 to 74a879c Compare March 11, 2024 19:29

nashif force-pushed the intel-adsp-lock branch from 74a879c to 4aa59ef Compare March 11, 2024 19:44

kv2019i approved these changes Mar 12, 2024

lyakh approved these changes Mar 12, 2024

tmleman approved these changes Mar 12, 2024

nashif approved these changes Mar 12, 2024

nashif force-pushed the intel-adsp-lock branch from 4aa59ef to 243419c Compare March 12, 2024 14:13

dleach02 merged commit 07426a8 into zephyrproject-rtos:main Mar 12, 2024
21 checks passed

zephyrbot mentioned this pull request Mar 12, 2024

[Backport v3.6-branch] intel_adsp/ace: power: Lock interruption when power gate fails #70127

Merged

ceolin deleted the intel-adsp-lock branch March 19, 2024 22:10
