-
Notifications
You must be signed in to change notification settings - Fork 6.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
intel_adsp/ace: power: Lock interruption when power gate fails #68493
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs to use the correct API to mask interrupts. But I'm confused as to why this is needed or what the circumstances are under which the WAITI returns? If this is a "should never happen" maybe a k_panic() is a better idea?
soc/xtensa/intel_adsp/ace/power.c
Outdated
/* It is unlikely we get in here, but when this happens | ||
* we need to lock interruptions again. | ||
*/ | ||
(void)irq_lock(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs to be arch_irq_lock(), or else an RSIL instruction or some other sequence to directly set PS. When SMP=y, irq_lock() is a global recursive lock that emulates the traditional API for the benefit of drivers not yet ported to spinlocks, not what you want at all (and dangerous if there are other irq_lock() users as it looks to me like this call is unmached and would produce a deadlock if someone else tries it).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ouch thought it was just an indirection to arch_irq_lock
. Yep, it has to be arch_irq_lock()
, It is rstored later calling directly rsil
.
soc/xtensa/intel_adsp/ace/power.c
Outdated
/* It is unlikely we get in here, but when this happens | ||
* we need to lock interruptions again. | ||
*/ | ||
(void)irq_lock(); | ||
z_xt_ints_off(0xffffffff); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this line is already masking all interrupts, though at one level higher in the abstraction stack (INTENABLE instead of PS.INTLVL). So why do you need both?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because Zephyr only looks PS.INTENABLE and assumes that interruptions are not masked and triggers an assert.
46ac84e
to
1743b31
Compare
1743b31
to
f2252be
Compare
When this can happen in HW is not clear for me, I had removed it in a different pr and was asked to get it back saying that something could happen and prevent the power gate. That's said, the simulator that we have does not power gate the core and this is need as well. |
f2252be
to
a386493
Compare
@andyross would mind check it again ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When this can happen in HW is not clear for me, I had removed it in a different pr and was asked to get it back saying that something could happen and prevent the power gate.
Can you get a proper explanation from whoever made that request? :)
Again, as I read this patch it's a noop. The adjacent line is already masking all interrupts through a different mechanism. Both/either have the effect of causing the core to resume handling thread code but with interrupts masked, which can't possibly be correct: you wouldn't be able to receive the IDC interrupt delivering the "shut down now" command from the host (exactly the action that seems to have failed in the first place). So the only way to recover would be to hard-bounce the whole DSP anyway, which obviously doesn't care about interrupt masking state.
Basically I just don't see it. If this needs to be handled, this isn't the right way to handle it. If it shouldn't ever happen but might, it should be a panic.
All that said: this isn't a -1 over this issue. It's a noop, after all, and it's not my job to tell Intel how to manage its own hardware. (For propriety reasons. But also to keep powder dry for fights over SOF where I do have more of a stake, heh.)
Nope, if that code is executed, we will return to the idle thread that will end up calling
I know we are talking upstream here, but we have simulator used for validation where this happens (the soc is not power gated) and as it was told me that this may happen in HW, it seems to me that it is better to have it, also the system gets idle and execution continues correctly.
That is not noop, Zephyr only sees PS.INTLEVEL to check if interruptions are locked, without it, e.g any use of |
Not a -1, but still asking nicely if the commit message can be clarified to explain why this is necessary and under what circumstances this code can execute. Because it's not actually testable, right? Has this actually been observed to happen? |
@andyross @ceolin I think this can actually fix #69807 . In the bug scenario, the host starts streaming and via SOF APIs, keeps a lock to prevent Zephyr from entering PM_STATE_RUNTIME_IDLE. During the test case, host removes this block and core0 is allowed to enter IDLE state. When core0 enters power gated state, interrrupts are left enabled (so the core can be woken up when something happens). This leaves a race where suitably timed interrupt will actually block entry to power gated state and k_cpu_idle() in power_gate_entry() will return. This is rare, but happens often enough thatthe relatively short test plan run on SOF pull-requests will trigger this case. Without this patch, current Zephyr main will hit a DSP panic as described in #69807 . |
Can you please revisit, commit message was updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very minor typo in commit "thatthe", but good otherwise. Thanks @nashif !
In case the core is not power gated, waiti will restore intlevel. In this case we lock interruption after it. In the bug scenario, the host starts streaming and via SOF APIs, keeps a lock to prevent Zephyr from entering PM_STATE_RUNTIME_IDLE. During the test case, host removes this block and core0 is allowed to enter IDLE state. When core0 enters power gated state, interrrupts are left enabled (so the core can be woken up when something happens). This leaves a race where suitably timed interrupt will actually block entry to power gated state and k_cpu_idle() in power_gate_entry() will return. This is rare, but happens often enough that the relatively short test plan run on SOF pull-requests will trigger this case. Fixes zephyrproject-rtos#69807 Signed-off-by: Flavio Ceolin <[email protected]> Signed-off-by: Anas Nashif <[email protected]>
fixed |
@andyross can you please take another look at this? |
In case the core is not power gated, waiti will restore intlevel. In this case we lock interruption after it.