-
-
Notifications
You must be signed in to change notification settings - Fork 419
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix and re-enable dynamic scheduler scaling #2483
Fix and re-enable dynamic scheduler scaling #2483
Conversation
@winksaville This PR is my alternative implementation of dynamic scheduler scaling that I mentioned in the other PR. It would be great if you can test it out using |
As an example of all schedulers suspending due to lack of work, I compiled/tested the
|
NOTE: I've marked this as "DO NOT MERGE" until after @winksaville is able to confirm whether this resolves the hangs from #2451 or not. |
NIce |
I'll be testing this evening, @dipinhora what testing have you done and any particular flag/switches to test? |
@winksaville I honestly haven't done a lot of testing so far beyond basic correctness testing to ensure things are functioning as expected. I'll likely get a chance to do more testing of for the hangs case tomorrow. It would be great if you could do the following:
Assuming the
|
SG, which is preferred long term, pthreads or signals? |
My understanding of things is that the Ideally, we would keep both implementations, but if we had to pick one, I'd pick |
Ran release builds with signals for about 10hrs, no hang!!!
Now testing stdlib release-scheduler_scaling_pthreads/ponyc, will report back soon. |
That's great news! Hopefully the |
Bad news, stdlib hung after about 20min. Here are the versions:
Here is the gdb output:
|
Debug run with pthreads:
|
@dipinhora if you rebase this against master, the next time around, your 3.9.1 tests will run in CircleCI and you won't have any LLVM download issues. |
9d2150e
to
e55c1a1
Compare
@winksaville I've rebased on master and updated the logic for both the |
Will do some tests! |
Update, I've run stdlib using release pthreads for 2+ hours with no issues. I'm running my pony-ring now and so far no issues having run about 30min, will let it run for 2+ hours. Then I'll test stdlib and pony-ring again using signals. |
Tested stdlib and pony-ring with signals each for 2+ hours and no hangs!! |
AWESOME! |
Change dynamic scheduler scaling implementation in order to resolve the hangs encountered in ponylang#2451. The previous implementation assumed that signalling to wake a thread was a reliable operation. Apparently, that's not necessarily true (see https://en.wikipedia.org/wiki/Spurious_wakeup and https://askldjd.com/2010/04/24/the-lost-wakeup-problem/). Seeing as we couldn't find any other explanation for why the previous implementation was experiencing hangs, I've assumed it is either because of lost wake ups or spurious wake ups and redesigned the logic accordingly. Now, when a thread is about to suspend, it will decrement the `active_scheduler_count` and then suspend. When it wakes up, it will check to see if the `active_scheduler_count` is at least as big as its `index`. If the `active_scheduler_count` isn't big enough, the thread will suspend itself again immediately. If it is big enough, it will resume. Threads no longer modify `active_scheduler_count` when they wake up. `active_scheduler_count` must now be modified by the thread that is waking up another thread prior to sending the wake up notification. Additionally, since we're now assuming that wake up signals can be lost, we now send multiple wake up notifications just in case. While this is somewhat wasteful, it is better than being in a situation where some threads aren't woken up at all (i.e. a hang). Additionally, only use `scheduler_count_changing` for `signals` implementation of dynamic scheduler scaling. `pthreads` implementation now uses a mutex (`sched_mut`) in its place. We also now change logic to only unlock mutex in `pthreads` implementation once threads have been woken to avoid potential lost wake ups. This isn't an issue for the `signals` implementation and the unlocking of `scheduler_count_changing` can remain where it is prior to threads being woken up. This commit also splits out scheduler block/unblock message handling logic into their own functions (this is so that sched 0 can call those functions directly instead of sending messages to itself). This commit also includes a change inspired by ponylang#2474. Now, *all* scheduler threads can suspend as long as there is at least one noisy actor registered with the ASIO subsystem. If there are no noisy actors registered with the ASIO subsystem then scheduler 0 is not allowed to suspend itself because it is reponsible for quiescence detection. Lastly, this commit adds logic to allow a scheduler thread to suspend even if it has already sent a scheduler block message so that we can now suspend scheduler threads in most scenarios.
e55c1a1
to
ec6503f
Compare
@winksaville That's great! Thanks for testing all the variations. It is much appreciated. I've rebased onto latest master and squashed the commits with a more complete commit message describing the changes. Assuming CI passes and there aren't any additional issues or concerns, this is likely ready to merge. |
Change dynamic scheduler scaling implementation in order to resolve
the hangs encountered in #2451.
The previous implementation assumed that signalling to wake a thread
was a reliable operation. Apparently, that's not necessarily true
(see https://en.wikipedia.org/wiki/Spurious_wakeup and
https://askldjd.com/2010/04/24/the-lost-wakeup-problem/). Seeing
as we couldn't find any other explanation for why the previous
implementation was experiencing hangs, I've assumed it is either
because of lost wake ups or spurious wake ups and redesigned the
logic accordingly.
Now, when a thread is about to suspend, it will decrement the
active_scheduler_count
and then suspend. When it wakes up, it willcheck to see if the
active_scheduler_count
is at least as big asits
index
. If theactive_scheduler_count
isn't big enough, thethread will suspend itself again immediately. If it is big enough,
it will resume. Threads no longer modify
active_scheduler_count
when they wake up.
active_scheduler_count
must now be modified by the thread that iswaking up another thread prior to sending the wake up notification.
Additionally, since we're now assuming that wake up signals can be
lost, we now send multiple wake up notifications just in case. While
this is somewhat wasteful, it is better than being in a situation
where some threads aren't woken up at all (i.e. a hang).
This commit also includes a change inspired by #2474. Now, all
scheduler threads can suspend as long as there is at least one
noisy actor registered with the ASIO subsystem. If there are no
noisy actors registered with the ASIO subsystem then scheduler 0
is not allowed to suspend itself.