Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

intel_adsp/ace: power: Lock interruption when power gate fails #68493

Merged
merged 1 commit into from
Mar 12, 2024

Conversation

ceolin
Copy link
Member

@ceolin ceolin commented Feb 2, 2024

In case the core is not power gated, waiti will restore intlevel. In this case we lock interruption after it.

dcpleung
dcpleung previously approved these changes Feb 2, 2024
Copy link
Contributor

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs to use the correct API to mask interrupts. But I'm confused as to why this is needed or what the circumstances are under which the WAITI returns? If this is a "should never happen" maybe a k_panic() is a better idea?

/* It is unlikely we get in here, but when this happens
* we need to lock interruptions again.
*/
(void)irq_lock();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs to be arch_irq_lock(), or else an RSIL instruction or some other sequence to directly set PS. When SMP=y, irq_lock() is a global recursive lock that emulates the traditional API for the benefit of drivers not yet ported to spinlocks, not what you want at all (and dangerous if there are other irq_lock() users as it looks to me like this call is unmached and would produce a deadlock if someone else tries it).

Copy link
Member Author

@ceolin ceolin Feb 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ouch thought it was just an indirection to arch_irq_lock. Yep, it has to be arch_irq_lock(), It is rstored later calling directly rsil.

/* It is unlikely we get in here, but when this happens
* we need to lock interruptions again.
*/
(void)irq_lock();
z_xt_ints_off(0xffffffff);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this line is already masking all interrupts, though at one level higher in the abstraction stack (INTENABLE instead of PS.INTLVL). So why do you need both?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because Zephyr only looks PS.INTENABLE and assumes that interruptions are not masked and triggers an assert.

@ceolin
Copy link
Member Author

ceolin commented Feb 3, 2024

Needs to use the correct API to mask interrupts. But I'm confused as to why this is needed or what the circumstances are under which the WAITI returns? If this is a "should never happen" maybe a k_panic() is a better idea?

When this can happen in HW is not clear for me, I had removed it in a different pr and was asked to get it back saying that something could happen and prevent the power gate. That's said, the simulator that we have does not power gate the core and this is need as well.

@ceolin
Copy link
Member Author

ceolin commented Feb 6, 2024

@andyross would mind check it again ?

Copy link
Contributor

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When this can happen in HW is not clear for me, I had removed it in a different pr and was asked to get it back saying that something could happen and prevent the power gate.

Can you get a proper explanation from whoever made that request? :)

Again, as I read this patch it's a noop. The adjacent line is already masking all interrupts through a different mechanism. Both/either have the effect of causing the core to resume handling thread code but with interrupts masked, which can't possibly be correct: you wouldn't be able to receive the IDC interrupt delivering the "shut down now" command from the host (exactly the action that seems to have failed in the first place). So the only way to recover would be to hard-bounce the whole DSP anyway, which obviously doesn't care about interrupt masking state.

Basically I just don't see it. If this needs to be handled, this isn't the right way to handle it. If it shouldn't ever happen but might, it should be a panic.

All that said: this isn't a -1 over this issue. It's a noop, after all, and it's not my job to tell Intel how to manage its own hardware. (For propriety reasons. But also to keep powder dry for fights over SOF where I do have more of a stake, heh.)

@andyross andyross self-requested a review February 6, 2024 18:50
@ceolin
Copy link
Member Author

ceolin commented Feb 6, 2024

When this can happen in HW is not clear for me, I had removed it in a different pr and was asked to get it back saying that something could happen and prevent the power gate.

Can you get a proper explanation from whoever made that request? :)

Again, as I read this patch it's a noop. The adjacent line is already masking all interrupts through a different mechanism. Both/either have the effect of causing the core to resume handling thread code but with interrupts masked, which can't possibly be correct: you wouldn't be able to receive the IDC interrupt delivering the "shut down now" command from the host (exactly the action that seems to have failed in the first place). So the only way to recover would be to hard-bounce the whole DSP anyway, which obviously doesn't care about interrupt masking state.

Nope, if that code is executed, we will return to the idle thread that will end up calling pm_state_exit_post_ops that restores interruptions properly.

Basically I just don't see it. If this needs to be handled, this isn't the right way to handle it. If it shouldn't ever happen but might, it should be a panic.

I know we are talking upstream here, but we have simulator used for validation where this happens (the soc is not power gated) and as it was told me that this may happen in HW, it seems to me that it is better to have it, also the system gets idle and execution continues correctly.

All that said: this isn't a -1 over this issue. It's a noop, after all, and it's not my job to tell Intel how to manage its own hardware. (For propriety reasons. But also to keep powder dry for fights over SOF where I do have more of a stake, heh.)

That is not noop, Zephyr only sees PS.INTLEVEL to check if interruptions are locked, without it, e.g any use of _current_cpu will trigger an assert.

kv2019i
kv2019i previously approved these changes Feb 12, 2024
dcpleung
dcpleung previously approved these changes Feb 12, 2024
@andyross
Copy link
Contributor

Not a -1, but still asking nicely if the commit message can be clarified to explain why this is necessary and under what circumstances this code can execute. Because it's not actually testable, right? Has this actually been observed to happen?

@henrikbrixandersen henrikbrixandersen modified the milestones: v3.6.0, v3.7.0 Feb 21, 2024
@kv2019i
Copy link
Collaborator

kv2019i commented Mar 11, 2024

@andyross @ceolin I think this can actually fix #69807 .

In the bug scenario, the host starts streaming and via SOF APIs, keeps a lock to prevent Zephyr from entering PM_STATE_RUNTIME_IDLE. During the test case, host removes this block and core0 is allowed to enter IDLE state.

When core0 enters power gated state, interrrupts are left enabled (so the core can be woken up when something happens). This leaves a race where suitably timed interrupt will actually block entry to power gated state and k_cpu_idle() in power_gate_entry() will return. This is rare, but happens often enough thatthe relatively short test plan run on SOF pull-requests will trigger this case.

Without this patch, current Zephyr main will hit a DSP panic as described in #69807 .

Copy link
Collaborator

@kv2019i kv2019i left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very minor typo in commit "thatthe", but good otherwise. Thanks @nashif !

In case the core is not power gated, waiti will restore intlevel. In
this case we lock interruption after it.

In the bug scenario, the host starts streaming and via SOF APIs, keeps a
lock to prevent Zephyr from entering PM_STATE_RUNTIME_IDLE. During the
test case, host removes this block and core0 is allowed to enter IDLE
state.

When core0 enters power gated state, interrrupts are left enabled (so
the core can be woken up when something happens). This leaves a race
where suitably timed interrupt will actually block entry to power gated
state and k_cpu_idle() in power_gate_entry() will return. This is rare,
but happens often enough that the relatively short test plan run on SOF
pull-requests will trigger this case.

Fixes zephyrproject-rtos#69807

Signed-off-by: Flavio Ceolin <[email protected]>
Signed-off-by: Anas Nashif <[email protected]>
@nashif
Copy link
Member

nashif commented Mar 12, 2024

Very minor typo in commit "thatthe", but good otherwise. Thanks @nashif !

fixed

@nashif
Copy link
Member

nashif commented Mar 12, 2024

@andyross can you please take another look at this?

@dleach02 dleach02 merged commit 07426a8 into zephyrproject-rtos:main Mar 12, 2024
21 checks passed
@ceolin ceolin deleted the intel-adsp-lock branch March 19, 2024 22:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
platform: Intel ADSP Intel Audio platforms
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants