Dynamic scheduler thread scaling based on workload #2386
Conversation
I've marked this as DO NOT MERGE for now. In addition, it would be great if this were reviewed by folks on @ponylang/core, especially around the use of atomics.
I've done some basic performance testing using a modified version of an example program.

Note: This performance testing was done on an OS X Macbook where I compiled two versions of the application.

Summary: Overall, it seems that the scheduler scaling doesn't have any significant negative impact on performance (except for the expected impact from using fewer scheduler threads) when compared with master, but more rigorous performance testing with more complex applications is necessary to be more certain of the impact.

Not enough work for the default number of scheduler threads:

- Master with not enough work for the 4 default scheduler threads
- Master with not enough work for the 4 default scheduler threads
- This scaling PR with not enough work for the 4 default scheduler threads
- This scaling PR with not enough work for the 4 default scheduler threads

Overall, there doesn't seem to be much difference in performance when using scheduler thread scaling if you compare it with master with fewer scheduler threads.

Enough work for the default number of scheduler threads:

- Master with enough work for the 4 default scheduler threads
- This scaling PR with enough work for the 4 default scheduler threads
- This scaling PR with enough work for the 4 default scheduler threads

Overall, there doesn't seem to be much difference in performance if you compare it with master, whether scheduler thread scaling is enabled or disabled. The minor differences can likely be attributed to the fact that my macbook had tons of other applications running while I ran the programs to get numbers.
Also, forgot to mention, the following claims are based on my understanding (which could be wrong) of how the C code compiles down to machine code and not because I disassembled the final code to confirm things:
src/libponyrt/sched/scheduler.c
Outdated
```c
// and there are more active schedulers than the minimum requested
if ((sched == &scheduler[current_active_scheduler_count - 1])
  && (current_active_scheduler_count > min_scheduler_count) &&
  atomic_compare_exchange_strong_explicit(&scheduler_count_changing,
```
This can be `atomic_exchange_explicit(&scheduler_count_changing, true, memory_order_acquire)`. If the atomic is `false`, it will set it to `true` and return `false` (the old value), and if it's `true`, it will set it to `true` and return `true`, so you can know whether the locking was successful by looking at the return value.
The memory order should be acquire because the operation is used to mark the start of a critical region.
Thanks for the suggestion. I've changed the logic accordingly.
src/libponyrt/sched/scheduler.c
Outdated
```c
// decrement active_scheduler_count so other schedulers know we're
// sleeping
atomic_fetch_sub_explicit(&active_scheduler_count, 1,
```
If the above suggestion in `wake_suspended_threads` is implemented, this can be `atomic load; non-atomic dec; atomic store` instead, since it would be guaranteed that only one thread can modify the variable at one time.
Thanks for the suggestion. I've changed the logic accordingly.
src/libponyrt/sched/scheduler.c
Outdated
```c
// unlock the bool that controls modifying the active scheduler count
// variable
atomic_compare_exchange_strong_explicit(&scheduler_count_changing,
```
This can be a plain `atomic_store` since nobody will try to modify the variable while the current thread owns it. The operation should have a release memory ordering since it marks the end of the critical section.
Thanks for the suggestion. I've changed the logic accordingly.
src/libponyrt/sched/scheduler.c
Outdated
```c
#if defined(PLATFORM_IS_WINDOWS)
  ponyint_thread_suspend(sched->wait_event_object);
#else
  ponyint_thread_suspend(PONY_SCHED_SLEEP_WAKE_SIGNAL);
```
It seems to me that there is a race condition here, with the variable modification and thread suspend not being a single atomic operation. For instance, with the following sequence:
T1: unlock (count_changing is false)
T2: in sched_maybe_wakeup: lock, read, signal wakeup to T1
T1: suspend
If Windows events and pthread signals aren't queued up if sent when the receiver isn't waiting for them, then this is a bug.
Hmm.. Good point about the potential race condition here.

For Windows, the `SetEvent` documentation (https://msdn.microsoft.com/en-us/library/windows/desktop/ms686211(v=vs.85).aspx) states:

> The state of an auto-reset event object remains signaled until a single waiting thread is released, at which time the system automatically sets the state to nonsignaled. If no threads are waiting, the event object's state remains signaled.

Based on the above quote, I don't think the race condition exists on Windows: if the event gets signalled by T2 before T1 suspends, T1 should notice it is signalled immediately and return right away, because the event remains signalled until at least one thread is woken up by it.

For Posix platforms, I've made a change to block the signal for waking threads by default for all threads. The note at https://notes.shichao.io/apue/ch10/#reliable-signal-terminology-and-semantics states:

> A process has the option of blocking the delivery of a signal. If a signal that is blocked is generated for a process, and if the action for that signal is either the default action or to catch the signal, then the signal remains pending for the process until the process either:
> - unblocks the signal, or
> - changes the action to ignore the signal.

Based on the above quote, I don't think the race condition exists on Posix platforms (as long as the signal is properly blocked): if T1 gets signalled by T2 before T1 suspends, T1 should notice it is signalled immediately and return right away, because the signal remains in `pending` state until the thread unblocks it for the `sigwait`.
src/libponyrt/sched/scheduler.c
Outdated
```c
// increment active_scheduler_count so other schedulers know we're
// awake again
atomic_fetch_add_explicit(&active_scheduler_count, 1,
```
Same as decrement, this can be `atomic load; non-atomic inc; atomic store` if the suggestion in `wake_suspended_threads` is implemented.
Thanks for the suggestion. I've changed the logic accordingly.
src/libponyrt/sched/scheduler.c
Outdated
```c
// unlock the bool that controls modifying the active scheduler count
// variable. this is because the signalling thread locks the control
// variable before signalling except on termination/shutdown
atomic_compare_exchange_strong_explicit(&scheduler_count_changing,
```
Same as above, store with release memory ordering.
Thanks for the suggestion. I've changed the logic accordingly.
src/libponyrt/sched/scheduler.c
Outdated
```c
// if we have some schedulers that are sleeping, wake one up
if((current_active_scheduler_count < scheduler_count) &&
  atomic_compare_exchange_strong_explicit(&scheduler_count_changing, &cmp_val,
```
Same as in `steal`, exchange with acquire memory ordering.
Thanks for the suggestion. I've changed the logic accordingly.
src/libponyrt/sched/scheduler.c
Outdated
```c
uint32_t current_active_scheduler_count = get_active_scheduler_count();

// wake up any sleeping threads
while (current_active_scheduler_count < scheduler_count)
```
Since this is only used at program termination, a modification here would enable the use of less expensive atomic operations while the program is running (see comment below in `steal`). The required change would be to set `scheduler_count_changing` to `true` before sending a signal and then wait for it to go back to `false` before locking it again and sending the next signal. This would also avoid the loop when calling `wake_suspended_threads`, as it would be guaranteed that every thread has woken up when the function returns.
Thanks for the suggestion. I've changed the logic accordingly.
Sorry, fat fingers.
@Praetonus Thanks again for your feedback and suggestions. I've incorporated all of your suggestions and addressed the potential race condition issue in my comment above. I'd appreciate it if you could take another look whenever you have a chance. BTW, the CI failures on OSX all seem to be the same unrelated issue.
src/libponyrt/sched/scheduler.c
Outdated
```c
// unlock the bool that controls modifying the active scheduler count
// variable
atomic_store_explicit(&scheduler_count_changing, false,
  memory_order_relaxed);
```
`memory_order_release` here, since the operation marks the end of a critical section.
src/libponyrt/sched/scheduler.c
Outdated
```c
// variable. this is because the signalling thread locks the control
// variable before signalling
atomic_store_explicit(&scheduler_count_changing, false,
  memory_order_relaxed);
```
`memory_order_release` here.
@Praetonus Ugh, sorry... I could have sworn I changed those. Fixed now.
@dipinhora can you rebase this against master?
Force-pushed 316a5d9 to 4a72436.
Rebased.
packages/signals/sig.pony
Outdated
```pony
ifdef linux or bsd or osx then
  compile_error "SIGUSR2 reserved for runtime use"
else
  compile_error "no SIGUSR1"
```
Shouldn't this still be `SIGUSR2`?
Yes. I'll fix this when I rebase and fix conflicts.
@dipinhora can you rebase this again? @Praetonus looking good to you?
Force-pushed cb414f2 to 8e7695e.
@SeanTAllen Rebased.

@Praetonus @SeanTAllen (and anyone else interested), I also made some other changes. (NOTE: the commit message has been updated with all these details and I'd suggest reading that, but the following is a summary of the changes.)

Now, whenever there's a lot of work to do and at least one actor is muted in a scheduler thread, that thread tries to resume a suspended scheduler thread (if there are any) every time it is about to run an actor. This is different from before, where the code was waking up a thread only at the time an actor was muted. This would mean that if only one actor was muted, only one thread would be woken up (or possibly none if it couldn't acquire the lock). This could have potentially resulted in a situation where not enough threads would be woken to handle the workload.

Before, the code was also waking up a thread at the time an actor became overloaded. It doesn't do that any longer, since an overloaded actor doesn't necessarily mean another thread has work to do (for example: a single actor that creates work for itself until it's overloaded).

For Posix environments, the default implementation still relies on signals as before. However, an alternate implementation is now available using pthread condition variables via a `use=scheduler_scaling_pthreads` argument to make.

The PR also now adds DTRACE probes for thread suspend and thread resume.
Force-pushed 8e7695e to 7494e1a.
@SeanTAllen @Praetonus I don't know why but for some reason the
Restarting OSX job that timed out due to an LLVM install issue.
I've removed the "DO NOT MERGE" label because from my perspective this is ready to be merged unless there is additional feedback requiring changes.
Force-pushed f4b9ea4 to a18b0b1.
Rebased to resolve conflict in
Prior to this commit, the runtime would start up a specific number of scheduler threads (by default the same as the number of physical cores) on initialization and these scheduler threads would run actors and send block/unblock messages and steal actors from each other regardless of how many actors and what the workload of the program actually was. This usually resulted in wasted cpu cycles and cache thrashing if there wasn't enough work to keep all scheduler threads busy.

This commit changes things so that the runtime still starts up the threads on initialization, but now the threads can suspend execution if there isn't enough work to do, to minimize the work stealing overhead. The rough outline of how this works is:

* We now have three variables related to the number of schedulers: `maximum_scheduler_count` (the normal `--ponythreads` option), `active_scheduler_count`, and `minimum_scheduler_count` (a new `--ponyminthreads` option)
* On startup, we create all possible scheduler threads (up to `maximum_scheduler_count`)
* We can never have more than `maximum_scheduler_count` threads active at a time
* We can never have less than `minimum_scheduler_count` threads active at a time
* Scheduler threads can suspend themselves (i.e. effectively pretend as if they don't exist)
* A scheduler thread can only suspend itself if its actor queue is empty, it has no actors in its mute map, and it would normally send a block message
* Only one scheduler thread can suspend or resume at a time (the largest one running or the smallest one suspended, respectively)
* We can never skip a scheduler thread and suspend or wake up a scheduler thread out of order (i.e. thread 6 is active but thread 5 gets suspended, or thread 5 is suspended but thread 6 gets resumed)
* If there isn't enough work and a scheduler thread would normally block and it's the largest active scheduler thread, it suspends itself instead
* If there isn't enough work and a scheduler thread would normally block and it's not the largest active scheduler thread, it does normal scheduler block message sending
* If there's a lot of work to do and at least one actor is muted in a scheduler thread, that thread tries to resume a suspended scheduler thread (if there are any) every time it is about to run an actor. NOTE: This could result in a pathological case where only one thread has a muted actor but there is only one overloaded actor. In this case the extra scheduler threads will keep being woken up and then go back to sleep over and over again.
* The overhead to check if this scheduler thread is a candidate to be suspended (`&scheduler[current_active_scheduler_count - 1] == current scheduler address`) is a load and single branch check
* The overhead to check if this scheduler thread is a candidate to be suspended (because `&scheduler[current_active_scheduler_count - 1] == current scheduler address`) but cannot actually be suspended because we're at `maximum_scheduler_count` is one branch (this is in addition to the overhead from the previous bullet)
* The overhead to check if there are any scheduler threads to resume is a load and single branch check

The implementation of the scheduler suspend/resume is different depending on the platform.

For Windows, it relies on Event Objects and `WaitForSingleObject` to suspend threads and `SetEvent` to wake suspended threads.

For Posix environments, by default it relies on signals (specifically, SIGUSR2) as they are quicker than other mechanisms (pthread condition variables) (according to stackoverflow at: https://stackoverflow.com/a/4676069 and https://stackoverflow.com/a/23945651). It uses `sigwait` to suspend threads and `pthread_kill` to wake suspended threads. The signal allotted for this is `SIGUSR2` and so `SIGUSR2` has been disabled for use in the `signals` package with an error indicating that it is used by the runtime.

An alternative implementation using pthread condition variables is also available via a `use=scheduler_scaling_pthreads` argument to make. This implementation relies on pthread condition variables and frees `SIGUSR2` so it is available for use in the `signals` package. It uses `pthread_cond_wait` to suspend threads and `pthread_cond_signal` to wake suspended threads.

The old behavior of having all scheduler threads active all the time can be achieved by passing `--ponyminthreads=9999999` as an argument to a program (because minimum scheduler threads is capped to never exceed total scheduler threads).

This commit also adds DTRACE probes for thread suspend and thread resume.

This commit also switches from using `signal` to `sigaction` for the epoll/kqueue asio signals logic, because `sigaction` is more robust and reliable across platforms (https://stackoverflow.com/a/232711).
Force-pushed a18b0b1 to 438c625.
Rebased again to resolve conflict in
Release notes for this need to note that SIGUSR2 is no longer available to user programs.
Prior to this commit, the runtime would start up a specific number of scheduler threads (by default the same as the number of physical cores) on initialization and these scheduler threads would run actors and send block/unblock messages and steal actors from each other regardless of how many actors and what the workload of the program actually was. This usually resulted in wasted cpu cycles and cache thrashing if there wasn't enough work to keep all scheduler threads busy.

This commit changes things so that the runtime still starts up the threads on initialization, but now the threads can suspend execution if there isn't enough work to do, to minimize the work stealing overhead. The rough outline of how this works is:

* We now have three variables related to the number of schedulers: `maximum_scheduler_count` (the normal `--ponythreads` option), `active_scheduler_count`, and `minimum_scheduler_count` (a new `--ponyminthreads` option)
* On startup, we create all possible scheduler threads (up to `maximum_scheduler_count`)
* We can never have more than `maximum_scheduler_count` threads active at a time
* We can never have less than `minimum_scheduler_count` threads active at a time
* Scheduler threads can suspend themselves (i.e. effectively pretend as if they don't exist)
* A scheduler thread can only suspend itself if its actor queue is empty, it has no actors in its mute map, and it would normally send a block message
* Only one scheduler thread can suspend or resume at a time (the largest one running or the smallest one suspended, respectively)
* We can never skip a scheduler thread and suspend or wake up a scheduler thread out of order (i.e. thread 6 is active but thread 5 gets suspended, or thread 5 is suspended but thread 6 gets resumed)
* If there isn't enough work and a scheduler thread would normally block and it's the largest active scheduler thread, it suspends itself instead
* If there isn't enough work and a scheduler thread would normally block and it's not the largest active scheduler thread, it does normal scheduler block message sending
* If there's a lot of work to do and at least one actor is muted in a scheduler thread, that thread tries to resume a suspended scheduler thread (if there are any) every time it is about to run an actor. NOTE: This could result in a pathological case where only one thread has a muted actor but there is only one overloaded actor. In this case the extra scheduler threads will keep being woken up and then go back to sleep over and over again.
* The overhead to check if this scheduler thread is a candidate to be suspended (`&scheduler[current_active_scheduler_count - 1] == current scheduler address`) is a load and single branch check
* The overhead to check if this scheduler thread is a candidate to be suspended (because `&scheduler[current_active_scheduler_count - 1] == current scheduler address`) but cannot actually be suspended because we're at `maximum_scheduler_count` is one branch (this is in addition to the overhead from the previous bullet)
* The overhead to check if there are any scheduler threads to resume is a load and single branch check

The implementation of the scheduler suspend/resume is different depending on the platform.

For Windows, it relies on Event Objects and `WaitForSingleObject` to suspend threads and `SetEvent` to wake suspended threads.

For Linux environments, by default it relies on signals (specifically, SIGUSR2) as they are quicker than other mechanisms (pthread condition variables) (according to stackoverflow at: https://stackoverflow.com/a/4676069 and https://stackoverflow.com/a/23945651). It uses `sigwait` to suspend threads and `pthread_kill` to wake suspended threads. The signal allotted for this is `SIGUSR2` and so `SIGUSR2` has been disabled for use in the `signals` package with an error indicating that it is used by the runtime.

For MacOS, we use pthread condition variables; this implementation is also available on other platforms via a `use=scheduler_scaling_pthreads` argument to make. It relies on pthread condition variables and frees `SIGUSR2` so it is available for use in the `signals` package. It uses `pthread_cond_wait` to suspend threads and `pthread_cond_signal` to wake suspended threads.

The old behavior of having all scheduler threads active all the time can be achieved by passing `--ponyminthreads=9999999` as an argument to a program (because minimum scheduler threads is capped to never exceed total scheduler threads).

This commit also adds DTRACE probes for thread suspend and thread resume.

This commit also switches from using `signal` to `sigaction` for the epoll/kqueue asio signals logic, because `sigaction` is more robust and reliable across platforms (https://stackoverflow.com/a/232711).
🎉
Prior to this commit, the runtime would start up a specific number of scheduler threads (by default the same as the number of physical cores) on initialization and these scheduler threads would run actors and send block/unblock messages and steal actors from each other regardless of how many actors and what the workload of the program actually was. This usually resulted in wasted cpu cycles and cache thrashing if there wasn't enough work to keep all scheduler threads busy.

This commit changes things so that the runtime still starts up the threads on initialization, but now the threads can suspend execution if there isn't enough work to do, to minimize the work stealing overhead. The rough outline of how this works is:

* We now have three variables related to the number of schedulers: `maximum_scheduler_count` (the normal `--ponythreads` option), `active_scheduler_count`, and `minimum_scheduler_count` (a new `--ponyminthreads` option)
* On startup, we create all possible scheduler threads (up to `maximum_scheduler_count`)
* We can never have more than `maximum_scheduler_count` threads active at a time
* We can never have less than `minimum_scheduler_count` threads active at a time
* Scheduler threads can suspend themselves (i.e. effectively pretend as if they don't exist)
* A scheduler thread can only suspend itself if its actor queue is empty, it has no actors in its mute map, and it would normally send a block message
* Only one scheduler thread can suspend or resume at a time (the largest one running or the smallest one suspended, respectively)
* We can never skip a scheduler thread and suspend or wake up a scheduler thread out of order (i.e. thread 6 is active but thread 5 gets suspended, or thread 5 is suspended but thread 6 gets resumed)
* If there isn't enough work and a scheduler thread would normally block and it's the largest active scheduler thread, it suspends itself instead
* If there isn't enough work and a scheduler thread would normally block and it's not the largest active scheduler thread, it does normal scheduler block message sending
* If there's a lot of work to do and at least one actor is muted in a scheduler thread, the runtime tries to resume a suspended scheduler thread if there are any
* The overhead to check if this scheduler thread is a candidate to be suspended (`&scheduler[current_active_scheduler_count - 1] == current scheduler address`) is a load and single branch check
* The overhead to check if this scheduler thread is a candidate to be suspended (because `&scheduler[current_active_scheduler_count - 1] == current scheduler address`) but cannot actually be suspended because we're at `maximum_scheduler_count` is one branch (this is in addition to the overhead from the previous bullet)
* The overhead to check if there are any scheduler threads to resume is a load and single branch check

The implementation of the scheduler suspend/resume is different depending on the platform.

For Windows, it relies on Event Objects and `WaitForSingleObject` to suspend threads and `SetEvent` to wake suspended threads.

For Posix environments, it relies on signals (specifically, SIGUSR2) as they are quicker than other mechanisms (pthread condition variables) (according to stackoverflow at: https://stackoverflow.com/a/4676069 and https://stackoverflow.com/a/23945651). It uses `sigwait` to suspend threads and `pthread_kill` to wake suspended threads. The signal allotted for this is `SIGUSR2` and so `SIGUSR2` has been disabled for use in the `signals` package with an error indicating that it is used by the runtime.

The old behavior of having all scheduler threads active all the time can be achieved by passing `--ponyminthreads=9999999` as an argument to a program (because minimum scheduler threads is capped to never exceed total scheduler threads).

This commit also switches from using `signal` to `sigaction` for the epoll/kqueue asio signals logic, because `sigaction` is more robust and reliable across platforms (https://stackoverflow.com/a/232711).