Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Fix and re-enable dynamic scheduler scaling
Change dynamic scheduler scaling implementation in order to resolve the hangs encountered in #2451. The previous implementation assumed that signalling to wake a thread was a reliable operation. Apparently, that's not necessarily true (see https://en.wikipedia.org/wiki/Spurious_wakeup and https://askldjd.com/2010/04/24/the-lost-wakeup-problem/). Seeing as we couldn't find any other explanation for why the previous implementation was experiencing hangs, I've assumed it is either because of lost wake ups or spurious wake ups and redesigned the logic accordingly. Now, when a thread is about to suspend, it will decrement the `active_scheduler_count` and then suspend. When it wakes up, it will check to see if the `active_scheduler_count` is at least as big as its `index`. If the `active_scheduler_count` isn't big enough, the thread will suspend itself again immediately. If it is big enough, it will resume. Threads no longer modify `active_scheduler_count` when they wake up. `active_scheduler_count` must now be modified by the thread that is waking up another thread prior to sending the wake up notification. Additionally, since we're now assuming that wake up signals can be lost, we now send multiple wake up notifications just in case. While this is somewhat wasteful, it is better than being in a situation where some threads aren't woken up at all (i.e. a hang). Additionally, only use `scheduler_count_changing` for `signals` implementation of dynamic scheduler scaling. `pthreads` implementation now uses a mutex (`sched_mut`) in its place. We also now change logic to only unlock mutex in `pthreads` implementation once threads have been woken to avoid potential lost wake ups. This isn't an issue for the `signals` implementation and the unlocking of `scheduler_count_changing` can remain where it is prior to threads being woken up. This commit also splits out scheduler block/unblock message handling logic into their own functions (this is so that sched 0 can call those functions directly instead of sending messages to itself). This commit also includes a change inspired by #2474. Now, *all* scheduler threads can suspend as long as there is at least one noisy actor registered with the ASIO subsystem. If there are no noisy actors registered with the ASIO subsystem then scheduler 0 is not allowed to suspend itself because it is reponsible for quiescence detection. Lastly, this commit adds logic to allow a scheduler thread to suspend even if it has already sent a scheduler block message so that we can now suspend scheduler threads in most scenarios.
- Loading branch information