up_squared: slowdown on test execution and timing out on multiple tests #30573

jenmwms · 2020-12-10T01:43:25Z

Describe the bug
There is a regression in some of the tests on up2 (up_squared): they take longer to execute than before, to the point that sanitycheck reports a timeout error. Sanitycheck itself is working properly; the issue is the significant increase in execution time for these test suites. This is observed on hardware (up2). On the emulator (QEMU), the tests complete within the default timeout and show no noticeable change.

What have you tried to diagnose or workaround this issue?
A longer timeout can be specified in testcase.yaml, but this workaround is not scalable (#30374). The tests pass if sanitycheck is told to wait longer, e.g. 120-320 seconds, for the test to complete. This applies especially to the sys_put tests (e.g. test_sys_put_le24 took approximately 1 minute 52 seconds).

This occurs in the following:

tests/kernel/common (all of the test_sys_put_* tests)
tests/kernel/mem_protect/mem_protect (including test_permission_inheritance, test_mem_domain_remove_add_partition)
tests/kernel/device (pm - test_dummy_device)

To Reproduce
Compare the output for:

Use sanitycheck to run tests/kernel/common
Use west to build and flash tests/kernel/common

west build -p -b up_squared zephyr/tests/kernel/common/ -DCONFIG_THREAD_LOCAL_STORAGE=y
Use a stopwatch to time any test in the suite that appears to hang after START (this is the slow test execution).
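
As an alternative to a manual stopwatch, per-test durations can be derived from the console output if each line is timestamped as it arrives. The following is only a minimal sketch: the helper name `measure_durations` is hypothetical, it assumes the `START - <name>` / `PASS - <name>` line format shown in the logs in this report, and the timestamps in the usage example are made up for illustration (only the roughly 1 minute 52 second figure for test_sys_put_le24 comes from this report).

```python
import re
from typing import Dict, Iterable, Tuple

# Line formats assumed from the console output in this report.
START_RE = re.compile(r"^\s*START - (\S+)")
END_RE = re.compile(r"^\s*(?:PASS|FAIL|SKIP) - (\S+)")


def measure_durations(stamped_lines: Iterable[Tuple[float, str]]) -> Dict[str, float]:
    """Return seconds elapsed between START and the result line of each test.

    stamped_lines yields (timestamp_in_seconds, console_line) pairs, e.g.
    a time.monotonic() value taken when the line was read from the console.
    Tests that never print a result line are left out of the result.
    """
    started: Dict[str, float] = {}
    durations: Dict[str, float] = {}
    for timestamp, line in stamped_lines:
        match = START_RE.match(line)
        if match:
            started[match.group(1)] = timestamp
            continue
        match = END_RE.match(line)
        if match and match.group(1) in started:
            name = match.group(1)
            durations[name] = timestamp - started.pop(name)
    return durations


# Illustrative input only: these timestamps are invented for the example.
sample = [
    (0.0, "START - test_bootdelay"),
    (0.5, " PASS - test_bootdelay"),
    (1.0, "START - test_sys_put_le24"),
    (113.0, " PASS - test_sys_put_le24"),
]
print(measure_durations(sample))
```

Comparing these durations between an older build and the current one would show which individual tests slowed down, rather than only which suite timed out.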

Expected behavior
No significant change in test execution time that would yield a sanitycheck timeout error. The default sanitycheck timeout should be sufficient for these tests.

Impact
The increased test execution time exceeds the default timeout used by sanitycheck, so a timeout error is generated for those tests. Given the scope of these tests, the impact is high: the PASS/FAIL status of the individual tests in the suite cannot be checked in CI, which makes further regressions harder to detect without a workaround or alternative.

Logs and console output
Using sanitycheck without the workaround, the issue shows up as timeout errors when the test appears to "freeze"/hang. The output is clipped, the test is marked as an error, and sanitycheck moves on to the next test.

*** Booting Zephyr OS build zephyr-v2.4.0-1894-g04a421d1b86f  (delayed boot 500ms) ***
Running test suite common
===================================================================
START - test_bootdelay
 PASS - test_bootdelay
===================================================================
START - test_irq_offload
 PASS - test_irq_offload
===================================================================
START - test_byteorder_memcpy_swap
 PASS - test_byteorder_memcpy_swap
===================================================================
START - test_byteorder_mem_swap
 PASS - test_byteorder_mem_swap
===================================================================
START - test_sys_get_be64
 PASS - test_sys_get_be64
===================================================================
START - test_sys_put_be64
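
The clipped log above ends on a START line with no matching result, which is how the slow test can be identified. A small sketch of that check follows; the helper name `find_unfinished_test` is hypothetical, and it assumes the same `START - <name>` / `PASS - <name>` line format as the output above.

```python
import re
from typing import Iterable, Optional

START_RE = re.compile(r"^\s*START - (\S+)")
END_RE = re.compile(r"^\s*(?:PASS|FAIL|SKIP) - (\S+)")


def find_unfinished_test(lines: Iterable[str]) -> Optional[str]:
    """Return the name of the last test that started but printed no result."""
    pending: Optional[str] = None
    for line in lines:
        match = START_RE.match(line)
        if match:
            pending = match.group(1)
            continue
        match = END_RE.match(line)
        if match and match.group(1) == pending:
            pending = None
    return pending


# Tail of the clipped log from this report.
clipped = """\
START - test_sys_get_be64
 PASS - test_sys_get_be64
===================================================================
START - test_sys_put_be64
""".splitlines()
print(find_unfinished_test(clipped))  # prints: test_sys_put_be64
```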

With the workaround (an extended timeout for sanitycheck), or with the west method, the test gets the time it needs to complete:

*** Booting Zephyr OS build zephyr-v2.4.0-1894-g04a421d1b86f  (delayed boot 500ms) ***
Running test suite common
===================================================================
START - test_bootdelay
 PASS - test_bootdelay
===================================================================
START - test_irq_offload
 PASS - test_irq_offload
===================================================================
START - test_byteorder_memcpy_swap
 PASS - test_byteorder_memcpy_swap
===================================================================
START - test_byteorder_mem_swap
 PASS - test_byteorder_mem_swap
===================================================================
START - test_sys_get_be64
 PASS - test_sys_get_be64
===================================================================
START - test_sys_put_be64