-
Notifications
You must be signed in to change notification settings - Fork 6.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
up_squared: slowdown on test execution and timing out on multiple tests #30573
Comments
@nashif @andrewboie - fyi |
is it consistent where it hangs for ~2 minutes or does the place vary from run to run? |
On my UP2, it pauses in different places between reboots. |
some things to try:
My suspicion at this time is SMP, but we need to gather more data. The previous issue that @dcpleung isolated was a lengthy period to instantiate page tables for 2GB of RAM on thread creation which no longer happens. |
|
@dcpleung thanks this is really helpful. OK, I think this may be some race related to SMP and the recent changes I made to have self-exiting threads do their cleanup tasks in the context of the idle thread.
This looks like the expected code path where ztest calls
Problem is, it's not being woken up in a timely manner. The self-abort logic is as follows:
What's supposed to happen is that immediately after we set
The
Which calls
But what seems to be happening is that the idle thread is not immediately running afterwards. Instead some time passes and THEN it runs, probably due to some other event. So there may be some problem/race condition in how the offload-to-idle scheduling is implemented with SMP. I think this is a scheduling problem and likely orthogonal to user mode or recent paging changes which might just be making races turn out differently. |
Oh wow I think I see it. If SMP is enabled, |
Hmm but that's not the whole story. In the SMP case, we eventually get into
And that is implemented to call
So it should get the idle thread back, and that should be the next thing to run. Except apparently it doesn't sometimes. |
I would not be surprised if this issue has a relationship to #28105 which has been confounding us for some time. |
here another issue most likely caused by this bug: #30574, still fails with SMP disabled. |
Note that #30574 is caused by |
Inside the idle loop, in some configuration, IRQ is unlocked and then immediately locked again. There is a side effect: 1. IRQ is unlocked in middle of the loop. 2. Another thread (A) can now run so idle thread is un-scheduled. 3. Thread A runs to its end and going through the thread self-abort path. 4. Idle thread is rescheduled again, and continues to run the remaining loop when it eventuall calls k_cpu_idle(). The "pending abort" path is not being executed on thread A at this point. 5. Now, thread A is suspended, and the CPU is in idle waiting for interrupts (e.g. timeouts). 6. Thread B is waiting to join on thread A. Since thread A has not been terminated yet so thread B is waiting until the idle thread runs again and starts executing from the beginning of while loop. 7. Depending on how many threads are running and how active the platform is, idle thread may not run again for a while, resulting in thread B appearing to be stuck. To avoid this situation, the unlock/lock pair in middle of the loop is removed so no rescheduling can be done mid-loop. When there is no thread abort pending, it simply locks IRQ and calls k_cpu_idle(). This is almost identical to the idle loop before the thread abort code was introduced (except the check for cpu->pending_abort). Fixes zephyrproject-rtos#30573 Signed-off-by: Daniel Leung <[email protected]>
Inside the idle loop, in some configuration, IRQ is unlocked and then immediately locked again. There is a side effect: 1. IRQ is unlocked in middle of the loop. 2. Another thread (A) can now run so idle thread is un-scheduled. 3. Thread A runs to its end and going through the thread self-abort path. 4. Idle thread is rescheduled again, and continues to run the remaining loop when it eventuall calls k_cpu_idle(). The "pending abort" path is not being executed on thread A at this point. 5. Now, thread A is suspended, and the CPU is in idle waiting for interrupts (e.g. timeouts). 6. Thread B is waiting to join on thread A. Since thread A has not been terminated yet so thread B is waiting until the idle thread runs again and starts executing from the beginning of while loop. 7. Depending on how many threads are running and how active the platform is, idle thread may not run again for a while, resulting in thread B appearing to be stuck. To avoid this situation, the unlock/lock pair in middle of the loop is removed so no rescheduling can be done mid-loop. When there is no thread abort pending, it simply locks IRQ and calls k_cpu_idle(). This is almost identical to the idle loop before the thread abort code was introduced (except the check for cpu->pending_abort). Fixes #30573 Signed-off-by: Daniel Leung <[email protected]>
Inside the idle loop, in some configuration, IRQ is unlocked and then immediately locked again. There is a side effect: 1. IRQ is unlocked in middle of the loop. 2. Another thread (A) can now run so idle thread is un-scheduled. 3. Thread A runs to its end and going through the thread self-abort path. 4. Idle thread is rescheduled again, and continues to run the remaining loop when it eventuall calls k_cpu_idle(). The "pending abort" path is not being executed on thread A at this point. 5. Now, thread A is suspended, and the CPU is in idle waiting for interrupts (e.g. timeouts). 6. Thread B is waiting to join on thread A. Since thread A has not been terminated yet so thread B is waiting until the idle thread runs again and starts executing from the beginning of while loop. 7. Depending on how many threads are running and how active the platform is, idle thread may not run again for a while, resulting in thread B appearing to be stuck. To avoid this situation, the unlock/lock pair in middle of the loop is removed so no rescheduling can be done mid-loop. When there is no thread abort pending, it simply locks IRQ and calls k_cpu_idle(). This is almost identical to the idle loop before the thread abort code was introduced (except the check for cpu->pending_abort). Fixes zephyrproject-rtos#30573 Signed-off-by: Daniel Leung <[email protected]>
Describe the bug
There is a regression in some of the tests for up2 in which the tests are taking longer to execute than previously, to the extent that sanitycheck gives a timeout error. Sanitycheck is not an issue here, it is working properly. The issue here is the significant increase in time of test execution on these test suites. This is observed on HW (up2). On emulator (qemu) the tests work well during default time and have no noticeable change.
What have you tried to diagnose or workaround this issue?
A longer timeout can be specified in the testcase.yml. But the workaround is not scalable (#30374). This will pass if sanitycheck is informed to wait e.g. 120-320 seconds for the test to complete, especially sys_put (e.g.
test_sys_put_le24
took 1 minute 52 seconds approx.)This occurs in the following:
To Reproduce
Compare the output for:
west build -p -b up_squared zephyr/tests/kernel/common/ -DCONFIG_THREAD_LOCAL_STORAGE=y
Expected behavior
No significant change in test execution time that would yield a sanitycheck timeout error. The default time for sanitycheck on these tests should be sufficient.
Impact
The increased test execution time is exceeding the default timeout used by sanitycheck so a timeout error is generated for those tests. Due to the scope of these tests, it is high impact because the PASS/FAIL of the individual tests in the test suite cannot be checked in CI making it harder to detect further regression without using workaround or alternative.
Logs and console output
Using sanitycheck without the workaround, the issue is observed as timeout errors when the test appears to "freeze"/hang. The results are clipped, it's marked as an error, and moves on to the next test.
Using the workaround for an extended timeout for sanitycheck to use, or using west method to give the test the time it needs to complete:
Became frozen for ~2 minutes
then resumed
Became frozen for ~2 minutes
then resumed
Became frozen for ~2 minutes
then resumed
Became frozen for ~2 minutes
then resumed
Became frozen for ~2 minutes
then resumed
Environment (please complete the following information):
Additional context
Slow down on test execution started with this PR: #29411.
Works fine up to this commit: 86cfd90 from the above PR, then the issue starts.
For tests/kernel/common/, the issue starts at b8242bf.
For tests/kernel/mem_protect/mem_protect/, the test suite completes quickly back at b8242bf, then next commit 1367c4a we start seeing it take longer for test_permission_inheritance.
Now consolidated here, this issue was originally captured in separate bugs by test suite:
#30275
#30305
#30306
#30307
The text was updated successfully, but these errors were encountered: