[BUG] pipeline with DP-scheduled src_lite crashes upon 0x13000002 IPC #9124

Closed
kv2019i opened this issue May 14, 2024 · 6 comments
@kv2019i
Collaborator

kv2019i commented May 14, 2024

Describe the bug
Found in a CI run of #9119 and, as per the discussion in #9119 (comment), this has been seen in other PRs as well.

Filing as a bug to track occurrences in CI (and elsewhere).

To Reproduce
No known recipe

Reproduction Rate
<1/100

Expected behavior
No DSP crash

Impact
Audio loss when bug triggers

Environment

SOF Commit: Merge f389107 into 618bdbd
Zephyr Commit: b501278237bf
Platform: MTL

Screenshots or console output

The internal Intel CI hit the bug in the SrcLite test on MTL with the following log (via https://sof-ci.01.org/sof-pr-viewer/#/build/PR9119/build13962171 , internal "quickbuild"):

[    0.000000] <inf> init: print_version_banner: FW ABI 0x301d000 DBG ABI 0x5003000 tags SOF:v2.5-stable-branch-2376-g174d0cff4c63 zephyr:v3.6.0-3805-gb501278237bf src hash 0xd6863264 (ref hash 0xd6863264)
[    0.000000] <inf> clock: clock_set_freq: clock 0 set freq 20000000Hz freq_idx 0
[    0.000000] <inf> clock: clock_set_freq: clock 1 set freq 20000000Hz freq_idx 0
[    0.000000] <inf> clock: clock_set_freq: clock 2 set freq 20000000Hz freq_idx 0
*** Booting Zephyr OS build v3.6.0-3805-gb501278237bf ***
[    0.000000] <inf> ipc: telemetry_init: Telemetry enabled. May affect performance
[    0.000000] <inf> main: sof_app_main: SOF on intel_adsp
[    0.000000] <inf> main: sof_app_main: SOF initialized
[    0.000008] <inf> ipc: ipc_cmd: rx	: 0x43000000|0x30701000
[    0.000010] <inf> ipc: ipc_cmd: rx	: 0x43000000|0x30801000
[    0.000010] <inf> ipc: ipc_cmd: rx	: 0x44000000|0x3070000c
[    0.000010] <wrn> basefw_intel: basefw_set_fw_config: returning success for Set FW_CONFIG without handling it
[    0.000010] <inf> ipc: ipc_cmd: rx	: 0x44000000|0x31400008
[    0.001903] <inf> ipc: ipc_cmd: rx	: 0x44000000|0x30600064
[    0.007090] <inf> ipc: ipc_cmd: rx	: 0x44000000|0x3070000c
[    0.007188] <wrn> basefw_intel: basefw_set_fw_config: returning success for Set FW_CONFIG without handling it
[    0.032851] <inf> ipc: ipc_cmd: rx	: 0x11000009|0x0
[    0.032891] <inf> pipe: pipeline_new: pipeline new pipe_id 0 priority 0
[    0.035505] <inf> ipc: ipc_cmd: rx	: 0x40000004|0x15
[    0.035841] <inf> dma: dma_get: dma_get() ID 0 sref = 1 busy channels 0
[    0.037945] <inf> ipc: ipc_cmd: rx	: 0x40000019|0x1000000b
[    0.040640] <inf> ipc: ipc_cmd: rx	: 0x40010004|0x15
[    0.040955] <inf> dma: dma_get: dma_get() ID 0 sref = 1 busy channels 0
[    0.043401] <inf> ipc: ipc_cmd: rx	: 0x45000004|0x19
[    0.043465] <inf> ipc: buffer_new: buffer new size 0xb40 id 0.0 flags 0x0
[    0.043601] <inf> dp_queue: dp_queue_create: DpQueue created, id: 0 shared: 0 min_available: 1440 min_free_space 1440, size 2880
[    0.043685] <inf> buffer: buffer_create_shadow_dp_queue: buf:0 0x0 dp_queue attached to buffer as a shadow, at_input: 0
[    0.043715] <inf> pipe: pipeline_connect: comp:0 0x4 connect buffer 0 as sink
[    0.043746] <inf> pipe: pipeline_connect: comp:0 0x19 connect buffer 0 as source
[    0.045236] <inf> ipc: ipc_cmd: rx	: 0x45000019|0x10004
[    0.045300] <inf> ipc: buffer_new: buffer new size 0xc00 id 0.0 flags 0x0
[    0.045435] <inf> dp_queue: dp_queue_create: DpQueue created, id: 0 shared: 0 min_available: 1536 min_free_space 2048, size 4096
[    0.045465] <inf> buffer: buffer_create_shadow_dp_queue: buf:0 0x0 dp_queue attached to buffer as a shadow, at_input: 1
[    0.045493] <inf> pipe: pipeline_connect: comp:0 0x19 connect buffer 0 as sink
[    0.045523] <inf> pipe: pipeline_connect: comp:0 0x10004 connect buffer 0 as source
[    0.161408] <inf> ipc: ipc_cmd: rx	: 0x13000003|0x0
[    0.161515] <wrn> ipc: ipc_pipeline_complete: ipc_pipeline_complete(): no scheduling component specified, use comp 0x10004
[    0.163000] <inf> ipc: ipc_cmd: rx	: 0x13000004|0x0
[    0.163436] <inf> copier: copier_prepare: comp:0 0x4 copier_prepare()
[    0.163603] <inf> src: src_prepare: comp:0 0x19 src_prepare()
[    0.163688] <inf> src: src_params_general: comp:0 0x19 src_params()
[    0.163726] <inf> src: src_set_params: comp:0 0x19 SRC DP period calculated as: 1333
[    0.163783] <inf> src: src_params_general: comp:0 0x19 src_params(), source_rate = 44100, sink_rate = 48000
[    0.164208] <inf> copier: copier_prepare: comp:0 0x10004 copier_prepare()
[    0.164361] <inf> pipe: pipeline_trigger: pipe:0 0x0 pipe trigger cmd 7
[    0.164425] <inf> clock: clock_set_freq: clock 0 set freq 115018000Hz freq_idx 1
[    0.164448] <inf> clock: clock_set_freq: clock 1 set freq 115018000Hz freq_idx 1
[    0.164453] <inf> clock: clock_set_freq: clock 2 set freq 115018000Hz freq_idx 1
[    0.164458] <inf> pipe: pipeline_trigger: Sum of KCPS consumption: 115018, core: 0
[    0.164466] <inf> ll_schedule: zephyr_ll_task_schedule_common: task add 0xa0114cc0 0x20210U priority 0 flags 0x0
[    0.164483] <inf> ll_schedule: zephyr_domain_register: zephyr_domain_register domain->type 1 domain->clk 3 domain->ticks_per_ms 38400 period 1000
[    0.164633] <inf> copier: copier_comp_trigger: comp:0 0x4 No dai copier found, start/end offset is not calculated
[    0.164656] <inf> copier: copier_comp_trigger: comp:0 0x4 No dai copier found, start/end offset is not calculated
[    0.164663] <inf> ll_schedule: zephyr_ll_task_schedule_common: task add 0xa0109288 0x20150U priority 0 flags 0x0
[    0.164706] <inf> host_comp: host_get_copy_bytes_normal: comp:0 0x10004 no bytes to copy, available samples: 0, free_samples: 1536
[    1.187626] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 1970, max 3700, overruns 0
[    1.768773] <inf> ipc: ipc_cmd: rx	: 0x13000003|0x0
[    1.768785] <inf> pipe: pipeline_trigger: pipe:0 0x0 pipe trigger cmd 2
[    1.768796] <inf> clock: clock_set_freq: clock 0 set freq 20000000Hz freq_idx 0
[    1.768801] <inf> clock: clock_set_freq: clock 1 set freq 20000000Hz freq_idx 0
[    1.768841] <inf> clock: clock_set_freq: clock 2 set freq 20000000Hz freq_idx 0
[    1.768923] <inf> pipe: pipeline_trigger: Sum of KCPS consumption: 20000, core: 0
[    1.769866] <inf> copier: copier_comp_trigger: comp:0 0x4 No dai copier found, start/end offset is not calculated
[    1.769956] <inf> ll_schedule: zephyr_ll_task_done: task complete 0xa0114cc0 0x20210U
[    1.769983] <inf> ll_schedule: zephyr_ll_task_done: num_tasks 2 total_num_tasks 2
[    1.770013] <inf> ll_schedule: zephyr_ll_task_done: task complete 0xa0109288 0x20150U
[    1.770040] <inf> ll_schedule: zephyr_ll_task_done: num_tasks 1 total_num_tasks 1
[    1.770083] <inf> ll_schedule: zephyr_domain_unregister: zephyr_domain_unregister domain->type 1 domain->clk 3
[    1.771623] <inf> ipc: ipc_cmd: rx	: 0x13000002|0x0
[    1.771683] <inf> pipe: pipeline_trigger: pipe:0 0x0 pipe trigger cmd 0
[    1.771816] <inf> copier: copier_comp_trigger: comp:0 0x4 No dai copier found, start/end offset is not calculated
[    1.772156] <inf> src: src_reset: comp:0 0x19 src_reset()
[    1.772951] <err> os: print_fatal_exception:  ** FATAL EXCEPTION
[    1.773031] <err> os: print_fatal_exception:  ** CPU 0 EXCCAUSE 13 (load/store PIF data error)
[    1.773058] <err> os: print_fatal_exception:  **  PC 0xa0072f6c VADDR 0x18
[    1.773085] <err> os: print_fatal_exception:  **  PS 0x60620
[    1.773111] <err> os: print_fatal_exception:  **    (INTLEVEL:0 EXCM: 0 UM:1 RING:0 WOE:1 OWB:6 CALLINC:2)
[    1.773138] <err> os: xtensa_dump_stack:  **  A0 0xa00736a1  SP 0xa0110830  A2 0xa010e700  A3 0xa0112008
[    1.773166] <err> os: xtensa_dump_stack:  **  A4 0xa011399c  A5 (nil)  A6 0x2e8  A7 0xa004bd50
[    1.773193] <err> os: xtensa_dump_stack:  **  A8 0xa01176e0  A9 (nil) A10 0xa0110844 A11 0x3
[    1.773220] <err> os: xtensa_dump_stack:  ** A12 0x4 A13 0xfffffffc A14 0x4e0 A15 0x3
[    1.773246] <err> os: xtensa_dump_stack:  ** LBEG 0xa004bf1d LEND 0xa004bf2c LCOUNT (nil)
[    1.773271] <err> os: xtensa_dump_stack:  ** SAR (nil)
[    1.773298] <err> os: xtensa_dump_stack:  **  THREADPTR 0xa011a360

Backtrace:0xa0072f69:0xa0110830  |<-CORRUPTED

[    1.773638] <err> os: z_fatal_error: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
kv2019i added the bug (Something isn't working as expected) and MTL (Applies to Meteor Lake platform) labels May 14, 2024
@marcinszkudlinski
Contributor

marcinszkudlinski commented May 15, 2024

root cause found:

  • DP module runs in its own thread
  • IPC is stopping a pipeline ==> module reset
  • there's no sync between DP and IPC operations, IPC has higher priority

module is set to reset in the middle of processing ===> crash

A fix is on its way; it may not be a trivial one, as we can't simply wait for a module to finish processing - a DP module may process for several ms, sometimes >100 ms.
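
To make the race concrete, here is an illustrative sketch of the sequence described above; all names and the buffer layout are made up for the example and are not the actual SOF code.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>

/* Illustrative only - dp_module, dp_thread_fn() and ipc_reset_handler()
 * are hypothetical names, not the real SOF implementation. */
struct dp_module {
	int16_t *buffers;   /* processing buffers owned by the module */
	size_t frames;
};

/* DP thread: low priority, one process() slice can take several ms. */
static void dp_thread_fn(struct dp_module *mod)
{
	/* long-running DSP work that keeps dereferencing mod->buffers */
	for (size_t i = 0; i < mod->frames; i++)
		mod->buffers[i] /= 2;
}

/* IPC thread: higher priority, can preempt dp_thread_fn() at any point. */
static void ipc_reset_handler(struct dp_module *mod)
{
	free(mod->buffers);   /* reset tears down the module's resources */
	mod->buffers = NULL;
	/* When dp_thread_fn() resumes mid-loop, it loads from freed/NULL
	 * memory -> load/store exception (EXCCAUSE 13), as in the log above. */
}
```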

@lgirdwood
Member

Good find - thinking about it, could we have DP modules check pipeline state at the start of each process() time slice, i.e. if the pipeline is in reset state then we exit the DP thread?

@marcinszkudlinski
Contributor

marcinszkudlinski commented May 15, 2024

@lgirdwood there are 2 ways:

  1. Abort the thread at reset.
    Simple and fast, but risky. There may be unexpected behavior if the thread is just killed in the middle. I'm not talking about resource leaks, as no resources should be allocated during process(), but there may be other problems that are very hard to predict and track down.

  2. When executing "reset" in the IPC handler, wait until the DP thread finishes: mark the thread to terminate itself and execute k_thread_join().
    Clean and reliable, but more complicated and slow: sometimes the IPC will have to wait several ms until DP is finished; in the case of AEC it may be as long as 10 ms.

Solution number 1 is already implemented and tested (4 lines...), but I strongly recommend number 2 (in progress; sketched below).

> could we have DP modules check pipeline state at the start of each process() time slice, i.e. if the pipeline is in reset state then we exit the DP thread?

It is not that simple: process() uses resources which may be (and usually are) freed when the pipeline stops, so leaving process() running is not an option.
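
A minimal sketch of option 2, assuming Zephyr's threading primitives (an atomic flag plus k_thread_join()); the struct, the helper dp_process_one_period() and the 100 ms join timeout are assumptions for illustration, not the merged fix.

```c
#include <zephyr/kernel.h>
#include <zephyr/sys/atomic.h>

struct dp_task {
	struct k_thread thread;
	atomic_t terminate;   /* set by the IPC handler, polled by the DP thread */
};

/* hypothetical helper: runs one DP processing period for this task */
static void dp_process_one_period(struct dp_task *task);

/* DP thread body: finish the current slice, then exit cleanly. */
static void dp_thread_fn(void *p1, void *p2, void *p3)
{
	struct dp_task *task = p1;

	ARG_UNUSED(p2);
	ARG_UNUSED(p3);

	while (!atomic_get(&task->terminate)) {
		/* one process() slice; may run for several ms (AEC up to ~10 ms) */
		dp_process_one_period(task);
	}
}

/* Called from the IPC reset handler *before* any module buffers are freed. */
static int dp_task_stop(struct dp_task *task)
{
	atomic_set(&task->terminate, 1);

	/* Block the IPC until the DP thread has really finished; the host
	 * driver's ~300 ms IPC timeout (see the next comment) bounds how
	 * long we can afford to wait here. */
	return k_thread_join(&task->thread, K_MSEC(100));
}
```

The join makes the reset path deterministic: buffers are only freed once the DP thread has provably exited, at the cost of the IPC waiting out the current process() slice.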

@lgirdwood
Member

Ack, 2) sounds the cleanest - IIRC, driver has a 300ms IPC timeout today.

@marcinszkudlinski
Contributor

marcinszkudlinski commented May 17, 2024

marcinszkudlinski self-assigned this May 17, 2024
abonislawski added the P1 (Blocker bugs or important features) label May 21, 2024
lgirdwood modified the milestones: v2.10, v2.11 Jun 11, 2024
@kv2019i
Collaborator Author

kv2019i commented Jun 24, 2024

Bugfix merged to main and also to stable-v2.10 via #9248

kv2019i closed this as completed Jun 24, 2024