Dynamic scheduler thread scaling based on workload #2386

Merged: 3 commits into ponylang:master on Dec 20, 2017

Conversation

@dipinhora (author):

Prior to this commit, the runtime would start a specific number
of scheduler threads (by default, the same as the number of physical
cores) on initialization, and these scheduler threads would run
actors, send block/unblock messages, and steal actors from each
other regardless of how many actors existed or what the program's
workload actually was. This usually resulted in wasted CPU cycles
and cache thrashing when there wasn't enough work to keep all
scheduler threads busy.

This commit changes things so that the runtime still starts up
the threads on initialization, but now the threads can suspend
execution when there isn't enough work to do, minimizing the
work-stealing overhead. The rough outline of how this works is:

  • We now have three variables related to the number of schedulers:
    maximum_scheduler_count (the normal --ponythreads option),
    active_scheduler_count, and minimum_scheduler_count
    (a new --ponyminthreads option)
  • On startup, we create all possible scheduler threads (up to
    maximum_scheduler_count)
  • We can never have more than maximum_scheduler_count threads
    active at a time
  • We can never have fewer than minimum_scheduler_count threads
    active at a time
  • Scheduler threads can suspend themselves (i.e. effectively
    pretend as if they don't exist)
  • A scheduler thread can only suspend itself if its actor queue
    is empty, it has no actors in its mute map, and it would
    normally send a block message
  • Only one scheduler thread can suspend or resume at a time (the
    largest one running or the smallest one suspended respectively)
  • We can never skip a scheduler thread and suspend or wake up a
    scheduler thread out of order (i.e. thread 6 is active, but
    thread 5 gets suspended or thread 5 is suspended but thread 6
    gets resumed)
  • If there isn't enough work and a scheduler thread would normally
    block and it's the largest active scheduler thread, it suspends
    itself instead
  • If there isn't enough work and a scheduler thread would normally
    block and it's not the largest active scheduler thread, it does
    normal scheduler block message sending
  • If there's a lot of work to do and an actor is muted,
    the runtime tries to resume a suspended scheduler thread if there
    are any
  • The overhead to check if this scheduler thread is a candidate to
    be suspended (&scheduler[current_active_scheduler_count - 1] ==
    current scheduler address) is a load and a single branch check
    (see the sketch after this list)
  • The overhead to check if this scheduler thread is a candidate to
    be suspended but cannot actually be suspended because we're at
    minimum_scheduler_count is one additional branch (on top of the
    overhead from the previous bullet)
  • The overhead to check if there are any scheduler threads to
    resume is a load and a single branch check
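
For illustration, here is a minimal sketch of those checks. The variable and array names are modeled on the description above; this is not the exact runtime code:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct scheduler_t scheduler_t;

extern scheduler_t* scheduler;                  /* array of all scheduler threads */
extern _Atomic uint32_t active_scheduler_count;
extern uint32_t min_scheduler_count;

/* A scheduler may suspend only if it is the highest-indexed active
 * scheduler (one load, one branch) and suspending would not drop the
 * active count below the minimum (one more branch). */
static bool can_suspend(scheduler_t* sched)
{
  uint32_t current = atomic_load_explicit(&active_scheduler_count,
    memory_order_relaxed);

  return (sched == &scheduler[current - 1]) &&
    (current > min_scheduler_count);
}
```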

The implementation of the scheduler suspend/resume is different
depending on the platform.

For Windows, it relies on Event Objects and WaitForSingleObject
to suspend threads and SetEvent to wake suspended threads.

For Posix environments, it relies on signals (specifically,
SIGUSR2), as they are quicker than other mechanisms such as
pthread condition variables (according to Stack Overflow:
https://stackoverflow.com/a/4676069 and
https://stackoverflow.com/a/23945651). It uses sigwait to
suspend threads and pthread_kill to wake suspended threads. The
signal allotted for this is SIGUSR2, so SIGUSR2 has been
disabled for use in the signals package, with an error indicating
that it is used by the runtime.
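
For reference, a minimal standalone sketch of that signal-based suspend/wake mechanism (these helpers are illustrative, not the runtime's actual functions, and assume the signal is already blocked in every thread):

```c
#include <pthread.h>
#include <signal.h>

#define SCHED_WAKE_SIGNAL SIGUSR2  /* signal reserved by the runtime */

/* Suspend the calling thread until the wake signal arrives. The signal
 * must already be blocked in every thread so that a wakeup sent early
 * stays pending instead of being lost. */
static void thread_suspend(void)
{
  sigset_t set;
  int sig;
  sigemptyset(&set);
  sigaddset(&set, SCHED_WAKE_SIGNAL);
  sigwait(&set, &sig);
}

/* Wake a suspended thread by delivering the signal directly to it. */
static void thread_wake(pthread_t tid)
{
  pthread_kill(tid, SCHED_WAKE_SIGNAL);
}
```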

The old behavior of having all scheduler threads active all the
time can be achieved by passing --ponyminthreads=9999999 as an
argument to a program (the minimum scheduler thread count is capped
so that it can never exceed the total scheduler thread count).

This commit also switches from using signal to sigaction for
the epoll/kqueue asio signals logic because sigaction is more
robust and reliable across platforms
(https://stackoverflow.com/a/232711).
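
For comparison, registering a handler via sigaction looks roughly like this (the handler name and flags here are illustrative assumptions, not the PR's exact code):

```c
#include <signal.h>
#include <string.h>

static void asio_signal_handler(int sig)
{
  (void)sig;
  /* ... forward the signal to the asio subsystem ... */
}

static void install_signal_handler(int sig)
{
  struct sigaction action;
  memset(&action, 0, sizeof(action));
  action.sa_handler = asio_signal_handler;
  sigemptyset(&action.sa_mask);  /* block no extra signals in the handler */
  action.sa_flags = SA_RESTART;  /* well-defined semantics, unlike signal() */
  sigaction(sig, &action, NULL);
}
```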

@dipinhora added the "do not merge" label Nov 28, 2017
@dipinhora (author):

I've marked this as DO NOT MERGE as it needs testing for behavior and performance by someone who isn't me. I am by no means an expert on this stuff, but everything seems to be working correctly.

In addition, it would be great if this were reviewed by folks on @ponylang/core, especially around the use of atomics and the use of signals for sleeping/waking threads on non-Windows platforms.

@jemc requested review from Praetonus and sylvanc November 28, 2017
@dipinhora (author) commented Nov 29, 2017:

I've done some basic performance testing using a modified version of examples/message-ubench (the diff can be found at https://gist.github.com/dipinhora/964debac0217b745063189aba557b715), as its output seems to reflect the overhead of the runtime. The modification adds a --total-iterations command line option so that the benchmark automagically exits once finished.

Note: This performance testing was done on an OS X MacBook where ponythreads defaults to 4, with tons of other applications running, so it is definitely not very reliable.

I compiled two versions of the application: message-ubench-master, built from master at commit 18533c5, and message-ubench-scaling, built with the changes from this PR. I then ran both versions with two sets of parameters: one that would not generate enough work to keep the default number of ponythreads busy, and one that would generate enough work to keep them all busy.


Summary:

Overall, it seems that scheduler scaling doesn't have any significant negative impact on performance (beyond the expected impact of using fewer scheduler threads) when compared with master, but more rigorous performance testing with more complex applications is necessary to be more certain of the impact.


Not enough work for default number of ponythreads comparison

Master with not enough work for the 4 default ponythreads.

Dipins-MBP:dhp dipinhora$ ./message-ubench-master 
# pingers 8, report-interval 10, initial-pings 5, total-iterations 10
time,run-ns,rate
1511911041.827202000,1000831000,22057862
1511911042.826349000,999113000,22329028
1511911043.824619000,998241000,22368515
1511911044.825412000,1000765000,22384713
1511911045.824592000,999149000,22429331
1511911046.823824000,999202000,22408859
1511911047.823014000,999152000,22492305
1511911048.822167000,999127000,21469565
1511911049.820643000,998445000,22360695
1511911050.820187000,999513000,22387627

Master with not enough work for the 4 default ponythreads but told to use only 1 ponythread.

Dipins-MBP:dhp dipinhora$ ./message-ubench-master --ponythreads=1
# pingers 8, report-interval 10, initial-pings 5, total-iterations 10
time,run-ns,rate
1511911911.727585000,1000352000,17616653
1511911912.726874000,999234000,17689682
1511911913.726117000,999205000,17688413
1511911914.725406000,999267000,17995684
1511911915.724648000,999202000,17787099
1511911916.723896000,999197000,17843643
1511911917.722969000,999034000,17373278
1511911918.722226000,999219000,17671836
1511911919.721540000,999273000,18011496
1511911920.720794000,999213000,18023472

This scaling PR with not enough work for the 4 default ponythreads.

Dipins-MBP:dhp dipinhora$ ./message-ubench-scaling 
# pingers 8, report-interval 10, initial-pings 5, total-iterations 10
time,run-ns,rate
1511911981.702790000,1001429000,17459670
1511911982.701071000,998234000,17213620
1511911983.700963000,999870000,17438576
1511911984.700254000,999250000,17540727
1511911985.698522000,998237000,17601469
1511911986.698879000,1000336000,17427735
1511911987.698114000,999193000,17561811
1511911988.697408000,999244000,17579399
1511911989.696666000,999215000,16896758
1511911990.695934000,999227000,17522404

This scaling PR with not enough work for the 4 default ponythreads, but with --ponyminthreads forced to be the same as --ponythreads, effectively disabling scheduler thread scaling.

Dipins-MBP:dhp dipinhora$ ./message-ubench-scaling --ponyminthreads=999999
# pingers 8, report-interval 10, initial-pings 5, total-iterations 10
time,run-ns,rate
1511911076.381846000,1000242000,22018045
1511911077.381077000,999198000,22328975
1511911078.380258000,999150000,22257245
1511911079.379441000,999145000,22314592
1511911080.379006000,999533000,22375454
1511911081.378140000,999112000,22351829
1511911082.376693000,998525000,22371559
1511911083.376186000,999464000,22428753
1511911084.376473000,1000258000,22359273
1511911085.375011000,998511000,22460565

Overall, there doesn't seem to be much difference in performance when using scheduler thread scaling, whether you compare it against master with fewer ponythreads, or effectively disable the scaling and compare against master with the default ponythreads. The minor differences can likely be attributed to the tons of other applications running on my MacBook while I gathered the numbers.


Enough work for default number of ponythreads comparison (--pingers=100)

Master with enough work for the 4 default ponythreads.

Dipins-MBP:dhp dipinhora$ ./message-ubench-master --pingers=100
# pingers 100, report-interval 10, initial-pings 5, total-iterations 10
time,run-ns,rate
1511911148.596786000,1001088000,17598105
1511911149.596143000,999250000,17948221
1511911150.595327000,999079000,17927916
1511911151.594556000,999131000,17890342
1511911152.593733000,999084000,17756510
1511911153.592962000,999130000,17783998
1511911154.592205000,999146000,17788162
1511911155.591335000,999030000,17816223
1511911156.589843000,998403000,16383893
1511911157.589749000,999796000,16962434

This scaling PR with enough work for the 4 default ponythreads.

Dipins-MBP:dhp dipinhora$ ./message-ubench-scaling --pingers=100
# pingers 100, report-interval 10, initial-pings 5, total-iterations 10
time,run-ns,rate
1511911166.722688000,999474000,17729182
1511911167.722974000,1000166000,17879973
1511911168.721716000,998632000,17350126
1511911169.721010000,999173000,17818367
1511911170.721340000,1000229000,17684335
1511911171.720116000,998661000,17917820
1511911172.719359000,999139000,17791456
1511911173.718605000,999145000,17898287
1511911174.717948000,999243000,17982331
1511911175.716522000,998467000,17866785

This scaling PR with enough work for the 4 default ponythreads, but with --ponyminthreads forced to be the same as --ponythreads, effectively disabling scheduler thread scaling.

Dipins-MBP:dhp dipinhora$ ./message-ubench-scaling --pingers=100 --ponyminthreads=9999
# pingers 100, report-interval 10, initial-pings 5, total-iterations 10
time,run-ns,rate
1511911187.395119000,1000551000,17021046
1511911188.394023000,998787000,16751350
1511911189.393210000,999086000,17834277
1511911190.392446000,999133000,17885750
1511911191.391633000,999080000,17896542
1511911192.390877000,999134000,17828693
1511911193.390268000,999291000,17950982
1511911194.390123000,999752000,17927982
1511911195.389304000,999084000,17740281
1511911196.388159000,998760000,17921059

Overall, there doesn't seem to be much difference in performance compared with master, whether scheduler thread scaling is enabled or disabled. The minor differences can likely be attributed to the tons of other applications running on my MacBook while I gathered the numbers.

@dipinhora (author):

Also, I forgot to mention: the following claims are based on my understanding (which could be wrong) of how the C code compiles down to machine code, not on disassembling the final code to confirm:

  • The overhead to check if this scheduler thread is a candidate to
    be suspended (&scheduler[current_active_scheduler_count - 1] ==
    current scheduler address) is a load and a single branch check
  • The overhead to check if this scheduler thread is a candidate to
    be suspended but cannot actually be suspended because we're at
    minimum_scheduler_count is one additional branch (on top of the
    overhead from the previous bullet)
  • The overhead to check if there are any scheduler threads to
    resume is a load and a single branch check

// and there are more active schedulers than the minimum requested
if ((sched == &scheduler[current_active_scheduler_count - 1])
&& (current_active_scheduler_count > min_scheduler_count) &&
atomic_compare_exchange_strong_explicit(&scheduler_count_changing,

Member:

This can be atomic_exchange_explicit(&scheduler_count_changing, true, memory_order_acquire). If the atomic is false, it will set it to true and return false (the old value), and if it's true, it will set it to true and return true, so you can know whether the locking was successful by looking at the return value.

The memory order should be acquire because the operation is used to mark the start of a critical region.
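
A sketch of the suggested pattern (not the final diff):

```c
#include <stdatomic.h>
#include <stdbool.h>

extern _Atomic bool scheduler_count_changing;

static bool try_lock_scheduler_count(void)
{
  /* exchange returns the previous value: false means the flag was
   * clear and we now own it; acquire marks the start of the
   * critical section */
  return !atomic_exchange_explicit(&scheduler_count_changing, true,
    memory_order_acquire);
}
```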

@dipinhora (author):

Thanks for the suggestion. I've changed the logic accordingly.


// decrement active_scheduler_count so other schedulers know we're
// sleeping
atomic_fetch_sub_explicit(&active_scheduler_count, 1,

Member:

If the above suggestion in wake_suspended_threads is implemented, this can be atomic load; non-atomic dec; atomic store instead since it would be guaranteed that only one thread can modify the variable at one time.
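
Sketched (assuming the single-writer guarantee described above):

```c
#include <stdatomic.h>
#include <stdint.h>

extern _Atomic uint32_t active_scheduler_count;

/* safe because the scheduler_count_changing flag guarantees a single
 * writer at a time: plain load, plain decrement, plain store */
static void decrement_active_count(void)
{
  uint32_t count = atomic_load_explicit(&active_scheduler_count,
    memory_order_relaxed);
  atomic_store_explicit(&active_scheduler_count, count - 1,
    memory_order_relaxed);
}
```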

@dipinhora (author):

Thanks for the suggestion. I've changed the logic accordingly.


// unlock the bool that controls modifying the active scheduler count
// variable
atomic_compare_exchange_strong_explicit(&scheduler_count_changing,

Member:

This can be a plain atomic_store since nobody will try to modify the variable while the current thread owns it.

The operation should have a release memory ordering since it marks the end of the critical section.
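
Sketched:

```c
#include <stdatomic.h>
#include <stdbool.h>

extern _Atomic bool scheduler_count_changing;

/* plain store is enough since only the owner clears the flag; release
 * ordering publishes the critical section's writes to the next
 * acquirer */
static void unlock_scheduler_count(void)
{
  atomic_store_explicit(&scheduler_count_changing, false,
    memory_order_release);
}
```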

@dipinhora (author):

Thanks for the suggestion. I've changed the logic accordingly.

#if defined(PLATFORM_IS_WINDOWS)
ponyint_thread_suspend(sched->wait_event_object);
#else
ponyint_thread_suspend(PONY_SCHED_SLEEP_WAKE_SIGNAL);

Member:

It seems to me that there is a race condition here, with the variable modification and thread suspend not being a single atomic operation. For instance, with the following sequence:

T1: unlock (count_changing is false)
T2: in sched_maybe_wakeup: lock, read, signal wakeup to T1
T1: suspend

If Windows events and pthread signals aren't queued up if sent when the receiver isn't waiting for them, then this is a bug.

@dipinhora (author):

Hmm.. Good point about the potential race condition here.

For Windows, SetEvent documentation (https://msdn.microsoft.com/en-us/library/windows/desktop/ms686211(v=vs.85).aspx) states:

The state of an auto-reset event object remains signaled until a single waiting thread is released, at which time the system automatically sets the state to nonsignaled. If no threads are waiting, the event object's state remains signaled.

Based on the above quote, I don't think the race condition exists on Windows: if the event gets signalled by T2 before T1 suspends, T1 should notice it is signalled immediately and return right away, since the event remains signalled until at least one thread is woken by it.

For Posix platforms, I've made a change to block the wake signal by default in all threads. The note at https://notes.shichao.io/apue/ch10/#reliable-signal-terminology-and-semantics states:

A process has the option of blocking the delivery of a signal. If a signal that is blocked is generated for a process, and if the action for that signal is either the default action or to catch the signal, then the signal remains pending for the process until the process either:

  • unblocks the signal, or
  • changes the action to ignore the signal.

Based on the above quote, I don't think the race condition exists on Posix platforms (as long as the signal is properly blocked): if T1 gets signalled by T2 before T1 suspends, T1 should notice the pending signal immediately and return right away, since the signal remains pending until the thread unblocks it for the sigwait.
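
For reference, the kind of signal-mask setup that change implies on Posix (a sketch; the actual runtime code may differ):

```c
#include <pthread.h>
#include <signal.h>
#include <stddef.h>

/* Block SIGUSR2 in every thread at startup. A wakeup sent between the
 * unlock and the suspend then stays pending, so the later sigwait()
 * returns immediately instead of the thread sleeping forever. */
static void block_wake_signal(void)
{
  sigset_t set;
  sigemptyset(&set);
  sigaddset(&set, SIGUSR2);
  pthread_sigmask(SIG_BLOCK, &set, NULL);
}
```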


// increment active_scheduler_count so other schedulers know we're
// awake again
atomic_fetch_add_explicit(&active_scheduler_count, 1,

Member:

Same as decrement, this can be atomic load; non-atomic inc; atomic store if the suggestion in wake_suspended_threads is implemented.

@dipinhora (author):

Thanks for the suggestion. I've changed the logic accordingly.

// unlock the bool that controls modifying the active scheduler count
// variable. this is because the signalling thread locks the control
// variable before signalling except on termination/shutdown
atomic_compare_exchange_strong_explicit(&scheduler_count_changing,

Member:

Same as above, store with release memory ordering.

@dipinhora (author):

Thanks for the suggestion. I've changed the logic accordingly.


// if we have some schedulers that are sleeping, wake one up
if((current_active_scheduler_count < scheduler_count) &&
atomic_compare_exchange_strong_explicit(&scheduler_count_changing, &cmp_val,

Member:

Same as in steal, exchange with acquire memory ordering.

@dipinhora (author):

Thanks for the suggestion. I've changed the logic accordingly.

uint32_t current_active_scheduler_count = get_active_scheduler_count();

// wake up any sleeping threads
while (current_active_scheduler_count < scheduler_count)

Member:

Since this is only used at program termination, a modification here would enable the use of less expensive atomic operations while the program is running (see the comment below in steal).

The required change would be to set scheduler_count_changing to true before sending a signal and then wait for it to go back to false before locking it again and sending the next signal. This would also avoid the loop when calling wake_suspended_threads, as it would be guaranteed that every thread has woken up when the function returns.
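
A sketch of that handshake, using the names from the surrounding snippets (wake_next_suspended_scheduler is a hypothetical helper standing in for the platform wake call):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

extern _Atomic bool scheduler_count_changing;
extern uint32_t scheduler_count;

extern uint32_t get_active_scheduler_count(void);
extern void wake_next_suspended_scheduler(void);  /* hypothetical helper */

/* Termination-time handshake: lock, wake one thread, then wait for the
 * woken thread to clear the flag before waking the next. When the loop
 * exits, every scheduler is guaranteed to be awake. */
static void wake_suspended_threads(void)
{
  while(get_active_scheduler_count() < scheduler_count)
  {
    if(!atomic_exchange_explicit(&scheduler_count_changing, true,
      memory_order_acquire))
      wake_next_suspended_scheduler();

    while(atomic_load_explicit(&scheduler_count_changing,
      memory_order_acquire))
      ;  /* the woken thread clears the flag once it has resumed */
  }
}
```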

@dipinhora (author):

Thanks for the suggestion. I've changed the logic accordingly.

@plietar closed this Nov 29, 2017
@plietar reopened this Nov 29, 2017
@plietar (Contributor) commented Nov 29, 2017:

Sorry, fat fingers

@dipinhora (author):

@Praetonus Thanks again for your feedback and suggestions. I've incorporated all of your suggestions and addressed the potential race condition issue in my comment above. I'd appreciate it if you could take another look whenever you have a chance.

BTW, the CI failures on OSX all seem to be the net/Broadcast test failures that have been an issue with OSX builds on Travis over the past couple of days.

// unlock the bool that controls modifying the active scheduler count
// variable
atomic_store_explicit(&scheduler_count_changing, false,
memory_order_relaxed);

Member:

memory_order_release here since the operation marks the end of a critical section.

// variable. this is because the signalling thread locks the control
// variable before signalling
atomic_store_explicit(&scheduler_count_changing, false,
memory_order_relaxed);

Member:

memory_order_release here.

@dipinhora (author):

@Praetonus Ugh, sorry... I could have sworn I changed those. Fixed now.

@SeanTAllen (Member):

@dipinhora can you rebase this against master?

@dipinhora (author):

Rebased.

ifdef linux or bsd or osx then
compile_error "SIGUSR2 reserved for runtime use"
else
compile_error "no SIGUSR1"

Member:

Shouldn't this still be SIGUSR2?

@dipinhora (author):

Yes. I'll fix this when I rebase and fix conflicts.

@SeanTAllen (Member):

@dipinhora can you rebase this again?

@Praetonus looking good to you?

@dipinhora (author):

@SeanTAllen Rebased.

@Praetonus @SeanTAllen (and anyone else interested), I also made some other changes (NOTE: the commit message has been updated with all these details, and I'd suggest reading that, but the following is a summary):

Now:

  • If there's a lot of work to do and at least one actor is muted
    in a scheduler thread, that thread tries to resume a suspended
    scheduler thread (if there are any) every time it is about to
    run an actor. NOTE: This could result in a pathological case
    where only one thread has a muted actor but there is only one
    overloaded actor. In this case the extra scheduler threads will
    keep being woken up and then go back to sleep over and over again.

This is different from before, when the code woke up a thread only at the moment an actor was muted. That meant that if only one actor was muted, only one thread would be woken (or possibly none, if the lock couldn't be acquired), which could have resulted in not enough threads being woken to handle the workload. The code also previously woke up a thread when an actor became overloaded. It no longer does that, since an overloaded actor doesn't necessarily mean another thread has work to do (for example, a single actor that creates work for itself until it's overloaded). A sketch of the new wakeup check follows.
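
Conceptually, the new per-actor-run check looks something like this (a sketch; wake_suspended_scheduler is a hypothetical helper, and the real code differs in detail):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

extern _Atomic bool scheduler_count_changing;
extern uint32_t scheduler_count;

extern uint32_t get_active_scheduler_count(void);
extern void wake_suspended_scheduler(void);  /* hypothetical helper */

/* Called before each actor run when this scheduler has muted actors:
 * if any scheduler is suspended, try (without blocking) to wake one. */
static void sched_maybe_wakeup(void)
{
  uint32_t current_active_scheduler_count = get_active_scheduler_count();

  if((current_active_scheduler_count < scheduler_count) &&
    !atomic_exchange_explicit(&scheduler_count_changing, true,
      memory_order_acquire))
  {
    /* the woken thread increments active_scheduler_count and clears
     * the flag itself once it resumes */
    wake_suspended_scheduler();
  }
}
```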

For Posix environments, the default implementation still relies on signals, as before. However, an alternate implementation using pthread condition variables is now available via a use=scheduler_scaling_pthreads argument to make (sketched below).
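
A minimal sketch of what such a pthread-condition-variable suspend/wake looks like (illustrative, not the PR's actual code):

```c
#include <pthread.h>
#include <stdbool.h>

/* Per-scheduler sleep object for the pthread-based variant. */
typedef struct sleep_object_t
{
  pthread_mutex_t mut;
  pthread_cond_t cond;
  bool signaled;
} sleep_object_t;

/* Suspend until woken; the loop guards against spurious wakeups. */
static void thread_suspend(sleep_object_t* obj)
{
  pthread_mutex_lock(&obj->mut);
  while(!obj->signaled)
    pthread_cond_wait(&obj->cond, &obj->mut);
  obj->signaled = false;
  pthread_mutex_unlock(&obj->mut);
}

/* Wake a suspended thread. */
static void thread_wake(sleep_object_t* obj)
{
  pthread_mutex_lock(&obj->mut);
  obj->signaled = true;
  pthread_cond_signal(&obj->cond);
  pthread_mutex_unlock(&obj->mut);
}
```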

The PR now also adds DTRACE probes for thread suspend and thread
resume actions.

@dipinhora (author):

@SeanTAllen @Praetonus I don't know why, but for some reason the signals implementation of scheduler scaling seems to cause hangs in the codegen tests on Travis OSX (which I'm not able to reproduce; it works perfectly on my machine, so I have no clue what's going on). Regardless, switching to the pthreads implementation for OSX seems to resolve the issue.

@dipinhora (author):

Restarting OSX job that timed out due to an LLVM install issue.

@dipinhora removed the "do not merge" label Dec 16, 2017
@dipinhora (author):

I've removed the "DO NOT MERGE" label because from my perspective this is ready to be merged unless there is additional feedback requiring changes.

@dipinhora (author):

Rebased to resolve conflict in Makefile.

Prior to this commit, the runtime would start a specific number
of scheduler threads (by default, the same as the number of physical
cores) on initialization, and these scheduler threads would run
actors, send block/unblock messages, and steal actors from each
other regardless of how many actors existed or what the program's
workload actually was. This usually resulted in wasted CPU cycles
and cache thrashing when there wasn't enough work to keep all
scheduler threads busy.

This commit changes things so that the runtime still starts up
the threads on initialization, but now the threads can suspend
execution when there isn't enough work to do, minimizing the
work-stealing overhead. The rough outline of how this works is:

* We now have three variables related to the number of schedulers:
  `maximum_scheduler_count` (the normal `--ponythreads` option),
  `active_scheduler_count`, and `minimum_scheduler_count`
  (a new `--ponyminthreads` option)
* On startup, we create all possible scheduler threads (up to
  `maximum_scheduler_count`)
* We can never have more than `maximum_scheduler_count` threads
  active at a time
* We can never have fewer than `minimum_scheduler_count` threads
  active at a time
* Scheduler threads can suspend themselves (i.e. effectively
  pretend as if they don't exist)
* A scheduler thread can only suspend itself if its actor queue
  is empty, it has no actors in its mute map, and it would
  normally send a block message
* Only one scheduler thread can suspend or resume at a time (the
  largest one running or the smallest one suspended respectively)
* We can never skip a scheduler thread and suspend or wake up a
  scheduler thread out of order (i.e. thread 6 is active, but
  thread 5 gets suspended or thread 5 is suspended but thread 6
  gets resumed)
* If there isn't enough work and a scheduler thread would normally
  block and it's the largest active scheduler thread, it suspends
  itself instead
* If there isn't enough work and a scheduler thread would normally
  block and it's not the largest active scheduler thread, it does
  normal scheduler block message sending
* If there's a lot of work to do and at least one actor is muted
  in a scheduler thread, that thread tries to resume a suspended
  scheduler thread (if there are any) every time it is about to
  run an actor. NOTE: This could result in a pathological case
  where only one thread has a muted actor but there is only one
  overloaded actor. In this case the extra scheduler threads will
  keep being woken up and then go back to sleep over and over again.
* The overhead to check if this scheduler thread is a candidate to
  be suspended (`&scheduler[current_active_scheduler_count - 1] ==
  current scheduler address`) is a load and a single branch check
* The overhead to check if this scheduler thread is a candidate to
  be suspended but cannot actually be suspended because we're at
  `minimum_scheduler_count` is one additional branch (on top of the
  overhead from the previous bullet)
* The overhead to check if there are any scheduler threads to
  resume is a load and a single branch check

The implementation of the scheduler suspend/resume is different
depending on the platform.

For Windows, it relies on Event Objects and `WaitForSingleObject`
to suspend threads and `SetEvent` to wake suspended threads.

For Posix environments, by default it relies on signals (specifically,
SIGUSR2), as they are quicker than other mechanisms such as
pthread condition variables (according to Stack Overflow:
https://stackoverflow.com/a/4676069 and
https://stackoverflow.com/a/23945651). It uses `sigwait` to
suspend threads and `pthread_kill` to wake suspended threads. The
signal allotted for this is `SIGUSR2`, so `SIGUSR2` has been
disabled for use in the `signals` package with an error indicating
that it is used by the runtime.

An alternative implementation is also available via a
`use=scheduler_scaling_pthreads` argument to make. It relies on
pthread condition variables instead, freeing `SIGUSR2` for use in
the `signals` package. It uses `pthread_cond_wait` to suspend
threads and `pthread_cond_signal` to wake suspended threads.

The old behavior of having all scheduler threads active all the
time can be achieved by passing `--ponyminthreads=9999999` as an
argument to a program (the minimum scheduler thread count is capped
so that it can never exceed the total scheduler thread count).

This commit also adds DTRACE probes for thread suspend and thread
resume.

This commit also switches from using `signal` to `sigaction` for
the epoll/kqueue asio signals logic because `sigaction` is more
robust and reliable across platforms
(https://stackoverflow.com/a/232711).

@dipinhora (author):

Rebased again to resolve conflict in main.c.

@SeanTAllen added the "changelog - added" label Dec 20, 2017
@SeanTAllen (Member):

Release notes for this need to note that SIGUSR2 is no longer available to user programs.

@SeanTAllen merged commit 2137eee into ponylang:master Dec 20, 2017
ponylang-main added a commit that referenced this pull request Dec 20, 2017
slayful pushed a commit to slayful/ponyc that referenced this pull request Dec 20, 2017
slayful pushed a commit to slayful/ponyc that referenced this pull request Dec 20, 2017
@jemc (Member) commented Dec 21, 2017:

🎉
