Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

possible SMP race with k_thread_join() #26486

Closed
andrewboie opened this issue Jun 26, 2020 · 3 comments · Fixed by #27995
Closed

possible SMP race with k_thread_join() #26486

andrewboie opened this issue Jun 26, 2020 · 3 comments · Fixed by #27995
Assignees
Labels
area: Kernel area: SMP Symmetric multiprocessing bug The issue is a bug, or the PR is fixing a bug priority: medium Medium impact/importance bug

Comments

@andrewboie
Copy link
Contributor

andrewboie commented Jun 26, 2020

Describe the bug
From code inspection I think I have found a potential race condition in k_thread_join().

The specific scenario:

  1. Thread A is running
  2. Thread B blocks with k_thread_join(A, K_FOREVER);
  3. Thread A self-exits via k_thread_abort(k_current_get()) either via explicit call or by returning from its entry function
  4. In the code path for k_thread_abort(), we enter z_thread_single_abort(). While holding the sched_spinlock, we wake up waiters, in this case thread B. The sched_spinlock is released.
  5. Two things then happen simultaneously:
  • On the current CPU, thread A finishes its aborting process. We return from z_thread_single_abort() and then call z_swap_unlocked() to finally context switch out. This happens quickly but isn't instantaneous.
  • On another CPU, as soon as sched_spinlock was released in step 4, thread B wakes up. Since thread A has joined, thread B tries to launch a new thread re-using thread A's k_thread object.

The race is step 5. If thread B manages to instantiate the new thread using thread A's object, before thread A finishes its aborting process, we will end up with the same thread object (and the stack too if it was re-used) being active on two different CPUs causing data corruption.

This only affects SMP. In UP, if Thread A gets preempted after returning from z_thread_single_abort() it is never scheduled again and there is no race. This issue requires physical concurrency AFAICT.

To Reproduce
I don't have test code to demonstrate this reliably but it would explain some very intermittent crashes I see in tests that re-use thread objects after a join where both CPUs are crashing with the same thread object as current.

Expected behavior
We need to close this race somehow. Thread B needs to not wake up until thread A's k_thread and thread stack objects are truly unused.

It would of course be simplest to somehow have thread A keep holding sched_spinlock until it completely finishes aborting, but this is not feasible.

A promising alternative is to somehow bounce context off of Thread A to complete the aborting process; we could enter a special exception state (sort of like what irq_offload() does although this API was never intended for anything but test usage), or wake up and raise the priority of the idle thread and have the idle thread perform the logic in z_thread_single_abort().

I like the idea of using the idle thread since we already have one per CPU and we could avoid having thread abort tasks being queued up.

In general, we never want z_thread_single_abort() to run in the context of the thread being aborted.

Impact
This is concerning, but to be clear this only affects SMP and it requires a race to re-use a thread object, something that I expect happens more often in tests than in a real application.

@andrewboie andrewboie added bug The issue is a bug, or the PR is fixing a bug priority: medium Medium impact/importance bug labels Jun 26, 2020
@andrewboie andrewboie self-assigned this Jun 26, 2020
@andrewboie
Copy link
Contributor Author

Fixing this is I think a pre-condition to getting #23062 and #23063 working.

@andrewboie
Copy link
Contributor Author

andrewboie commented Jun 26, 2020

This will also help with code that wants to use the thread->fn_abort to free thread object or stack memory back to a heap, this can't be done if fn_abort function runs in the context of the thread being aborted. See #23062

@carlescufi carlescufi added area: Kernel area: SMP Symmetric multiprocessing labels Jul 8, 2020
andrewboie pushed a commit to andrewboie/zephyr that referenced this issue Sep 4, 2020
Fixes races where threads on another CPU are joining the
exiting thread, since it could still be running when
the joiners wake up on a different CPU.

Fixes problems where the thread object is still being
used by the kernel when the fn_abort() function is called,
preventing the thread object from being recycled or
freed back to a slab pool.

Precedent for doing this comes from FreeRTOS, which also
performs final thread cleanup in the idle thread.

Add some extra comments and log messages to the meta-irq
test case.

Some logic in z_thread_single_abort() rearranged such that
when we release sched_spinlock, the thread object pointer
is never dereferenced by the kernel again; join waiters
or fn_abort() logic may free it immediately.

An assertion added to z_thread_single_abort() to ensure
it never gets called with thread == _current outside of an ISR.

This requires an increase of the idle thread stack size.
This was already necessary for a lot of applications
that have any kind of power management since PM infrastructure
gets invoked from the idle thread.

Some logic has been added to ensure z_thread_single_abort()
tasks don't run more than once.

Fixes: zephyrproject-rtos#26486
Related to: zephyrproject-rtos#23063 zephyrproject-rtos#23062

Signed-off-by: Andrew Boie <[email protected]>
@github-actions
Copy link

github-actions bot commented Sep 7, 2020

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.

@github-actions github-actions bot added the Stale label Sep 7, 2020
@andrewboie andrewboie removed the Stale label Sep 8, 2020
andrewboie pushed a commit to andrewboie/zephyr that referenced this issue Sep 23, 2020
Fixes races where threads on another CPU are joining the
exiting thread, since it could still be running when
the joiners wake up on a different CPU.

Fixes problems where the thread object is still being
used by the kernel when the fn_abort() function is called,
preventing the thread object from being recycled or
freed back to a slab pool.

Fixes a race where a thread is aborted from one CPU while
it self-aborts on another CPU, that was currently worked
around with a busy-wait.

Precedent for doing this comes from FreeRTOS, which also
performs final thread cleanup in the idle thread.

Some logic in z_thread_single_abort() rearranged such that
when we release sched_spinlock, the thread object pointer
is never dereferenced by the kernel again; join waiters
or fn_abort() logic may free it immediately.

An assertion added to z_thread_single_abort() to ensure
it never gets called with thread == _current outside of an ISR.

Some logic has been added to ensure z_thread_single_abort()
tasks don't run more than once.

Fixes: zephyrproject-rtos#26486
Related to: zephyrproject-rtos#23063 zephyrproject-rtos#23062

Signed-off-by: Andrew Boie <[email protected]>
andrewboie pushed a commit to andrewboie/zephyr that referenced this issue Sep 30, 2020
Fixes races where threads on another CPU are joining the
exiting thread, since it could still be running when
the joiners wake up on a different CPU.

Fixes problems where the thread object is still being
used by the kernel when the fn_abort() function is called,
preventing the thread object from being recycled or
freed back to a slab pool.

Fixes a race where a thread is aborted from one CPU while
it self-aborts on another CPU, that was currently worked
around with a busy-wait.

Precedent for doing this comes from FreeRTOS, which also
performs final thread cleanup in the idle thread.

Some logic in z_thread_single_abort() rearranged such that
when we release sched_spinlock, the thread object pointer
is never dereferenced by the kernel again; join waiters
or fn_abort() logic may free it immediately.

An assertion added to z_thread_single_abort() to ensure
it never gets called with thread == _current outside of an ISR.

Some logic has been added to ensure z_thread_single_abort()
tasks don't run more than once.

Fixes: zephyrproject-rtos#26486
Related to: zephyrproject-rtos#23063 zephyrproject-rtos#23062

Signed-off-by: Andrew Boie <[email protected]>
nashif pushed a commit that referenced this issue Sep 30, 2020
Fixes races where threads on another CPU are joining the
exiting thread, since it could still be running when
the joiners wake up on a different CPU.

Fixes problems where the thread object is still being
used by the kernel when the fn_abort() function is called,
preventing the thread object from being recycled or
freed back to a slab pool.

Fixes a race where a thread is aborted from one CPU while
it self-aborts on another CPU, that was currently worked
around with a busy-wait.

Precedent for doing this comes from FreeRTOS, which also
performs final thread cleanup in the idle thread.

Some logic in z_thread_single_abort() rearranged such that
when we release sched_spinlock, the thread object pointer
is never dereferenced by the kernel again; join waiters
or fn_abort() logic may free it immediately.

An assertion added to z_thread_single_abort() to ensure
it never gets called with thread == _current outside of an ISR.

Some logic has been added to ensure z_thread_single_abort()
tasks don't run more than once.

Fixes: #26486
Related to: #23063 #23062

Signed-off-by: Andrew Boie <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Kernel area: SMP Symmetric multiprocessing bug The issue is a bug, or the PR is fixing a bug priority: medium Medium impact/importance bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants