Dynamic scheduler thread scaling based on workload #2386

Merged: 3 commits into ponylang:master on Dec 20, 2017

Conversation

@dipinhora (author):

Prior to this commit, the runtime would start a specific number
of scheduler threads (by default, the same as the number of physical
cores) on initialization, and these scheduler threads would run
actors, send block/unblock messages, and steal actors from each
other regardless of how many actors existed or what the program's
workload actually was. This usually resulted in wasted CPU cycles
and cache thrashing when there wasn't enough work to keep all
scheduler threads busy.

This commit changes things so that the runtime still starts up
the threads on initialization, but now the threads can suspend
execution when there isn't enough work to do, minimizing the
work-stealing overhead. The rough outline of how this works is:

  • We now have three variables related to the number of schedulers:
    maximum_scheduler_count (the normal --ponythreads option),
    active_scheduler_count, and minimum_scheduler_count
    (a new --ponyminthreads option)
  • On startup, we create all possible scheduler threads (up to
    maximum_scheduler_count)
  • We can never have more than maximum_scheduler_count threads
    active at a time
  • We can never have fewer than minimum_scheduler_count threads
    active at a time
  • Scheduler threads can suspend themselves (i.e. effectively
    pretend as if they don't exist)
  • A scheduler thread can only suspend itself if its actor queue
    is empty, it has no actors in its mute map, and it would
    normally send a block message
  • Only one scheduler thread can suspend or resume at a time (the
    largest one running or the smallest one suspended respectively)
  • We can never skip a scheduler thread and suspend or wake up a
    scheduler thread out of order (i.e. thread 6 is active, but
    thread 5 gets suspended or thread 5 is suspended but thread 6
    gets resumed)
  • If there isn't enough work and a scheduler thread would normally
    block and it's the largest active scheduler thread, it suspends
    itself instead
  • If there isn't enough work and a scheduler thread would normally
    block and it's not the largest active scheduler thread, it does
    normal scheduler block message sending
  • If there's a lot of work to do and an actor is muted,
    the runtime tries to resume a suspended scheduler thread if there
    are any
  • The overhead to check if this scheduler thread is a candidate to
    be suspended (&scheduler[current_active_scheduler_count - 1] ==
    current scheduler address) is a load and a single branch check
    (see the sketch after this list)
  • The overhead to check if this scheduler thread is a candidate to
    be suspended but cannot actually be suspended because we're at
    minimum_scheduler_count is one additional branch (on top of the
    overhead from the previous bullet)
  • The overhead to check if there are any scheduler threads to
    resume is a load and a single branch check
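
For illustration, here is a minimal sketch of those checks. The variable and array names are modeled on the description above; this is not the exact runtime code:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct scheduler_t scheduler_t;

extern scheduler_t* scheduler;                  /* array of all scheduler threads */
extern _Atomic uint32_t active_scheduler_count;
extern uint32_t min_scheduler_count;

/* A scheduler may suspend only if it is the highest-indexed active
 * scheduler (one load, one branch) and suspending would not drop the
 * active count below the minimum (one more branch). */
static bool can_suspend(scheduler_t* sched)
{
  uint32_t current = atomic_load_explicit(&active_scheduler_count,
    memory_order_relaxed);

  return (sched == &scheduler[current - 1]) &&
    (current > min_scheduler_count);
}
```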

The implementation of the scheduler suspend/resume is different
depending on the platform.

For Windows, it relies on Event Objects and WaitForSingleObject
to suspend threads and SetEvent to wake suspended threads.

For Posix environments, it relies on signals (specifically,
SIGUSR2), as they are quicker than other mechanisms such as
pthread condition variables (according to Stack Overflow:
https://stackoverflow.com/a/4676069 and
https://stackoverflow.com/a/23945651). It uses sigwait to
suspend threads and pthread_kill to wake suspended threads. The
signal allotted for this is SIGUSR2, so SIGUSR2 has been
disabled for use in the signals package, with an error indicating
that it is used by the runtime.
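
For reference, a minimal standalone sketch of that signal-based suspend/wake mechanism (these helpers are illustrative, not the runtime's actual functions, and assume the signal is already blocked in every thread):

```c
#include <pthread.h>
#include <signal.h>

#define SCHED_WAKE_SIGNAL SIGUSR2  /* signal reserved by the runtime */

/* Suspend the calling thread until the wake signal arrives. The signal
 * must already be blocked in every thread so that a wakeup sent early
 * stays pending instead of being lost. */
static void thread_suspend(void)
{
  sigset_t set;
  int sig;
  sigemptyset(&set);
  sigaddset(&set, SCHED_WAKE_SIGNAL);
  sigwait(&set, &sig);
}

/* Wake a suspended thread by delivering the signal directly to it. */
static void thread_wake(pthread_t tid)
{
  pthread_kill(tid, SCHED_WAKE_SIGNAL);
}
```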

The old behavior of having all scheduler threads active all the
time can be achieved by passing --ponyminthreads=9999999 as an
argument to a program (the minimum scheduler thread count is capped
so that it can never exceed the total scheduler thread count).

This commit also switches from using signal to sigaction for
the epoll/kqueue asio signals logic because sigaction is more
robust and reliable across platforms
(https://stackoverflow.com/a/232711).
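
For comparison, registering a handler via sigaction looks roughly like this (the handler name and flags here are illustrative assumptions, not the PR's exact code):

```c
#include <signal.h>
#include <string.h>

static void asio_signal_handler(int sig)
{
  (void)sig;
  /* ... forward the signal to the asio subsystem ... */
}

static void install_signal_handler(int sig)
{
  struct sigaction action;
  memset(&action, 0, sizeof(action));
  action.sa_handler = asio_signal_handler;
  sigemptyset(&action.sa_mask);  /* block no extra signals in the handler */
  action.sa_flags = SA_RESTART;  /* well-defined semantics, unlike signal() */
  sigaction(sig, &action, NULL);
}
```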

@dipinhora added the "do not merge" label Nov 28, 2017
@dipinhora (author):

I've marked this as DO NOT MERGE as it needs testing for behavior and performance by someone who isn't me. I am by no means an expert on this stuff, but everything seems to be working correctly.

In addition, it would be great if this were reviewed by folks on @ponylang/core, especially around the use of atomics and the use of signals for sleeping/waking threads on non-Windows platforms.

@jemc requested review from Praetonus and sylvanc November 28, 2017
@dipinhora (author) commented Nov 29, 2017:

I've done some basic performance testing using a modified version of examples/message-ubench (the diff can be found at https://gist.github.com/dipinhora/964debac0217b745063189aba557b715), as its output seems to reflect the overhead of the runtime. The modification adds a --total-iterations command line option so that the benchmark automagically exits once finished.

Note: This performance testing was done on an OS X MacBook where ponythreads defaults to 4, with tons of other applications running, so it is definitely not very reliable.

I compiled two versions of the application: message-ubench-master, built from master at commit 18533c5, and message-ubench-scaling, built with the changes from this PR. I then ran both versions with two sets of parameters: one that would not generate enough work to keep the default number of ponythreads busy, and one that would generate enough work to keep them all busy.


Summary:

Overall, it seems that scheduler scaling doesn't have any significant negative impact on performance (beyond the expected impact of using fewer scheduler threads) when compared with master, but more rigorous performance testing with more complex applications is necessary to be more certain of the impact.


Not enough work for default number of ponythreads comparison

Master with not enough work for the 4 default ponythreads.

Dipins-MBP:dhp dipinhora$ ./message-ubench-master 
# pingers 8, report-interval 10, initial-pings 5, total-iterations 10
time,run-ns,rate
1511911041.827202000,1000831000,22057862
1511911042.826349000,999113000,22329028
1511911043.824619000,998241000,22368515
1511911044.825412000,1000765000,22384713
1511911045.824592000,999149000,22429331
1511911046.823824000,999202000,22408859
1511911047.823014000,999152000,22492305
1511911048.822167000,999127000,21469565
1511911049.820643000,998445000,22360695
1511911050.820187000,999513000,22387627

Master with not enough work for the 4 default ponythreads but told to use only 1 ponythread.

Dipins-MBP:dhp dipinhora$ ./message-ubench-master --ponythreads=1
# pingers 8, report-interval 10, initial-pings 5, total-iterations 10
time,run-ns,rate
1511911911.727585000,1000352000,17616653
1511911912.726874000,999234000,17689682
1511911913.726117000,999205000,17688413
1511911914.725406000,999267000,17995684
1511911915.724648000,999202000,17787099
1511911916.723896000,999197000,17843643
1511911917.722969000,999034000,17373278
1511911918.722226000,999219000,17671836
1511911919.721540000,999273000,18011496
1511911920.720794000,999213000,18023472

This scaling PR with not enough work for the 4 default ponythreads.

Dipins-MBP:dhp dipinhora$ ./message-ubench-scaling 
# pingers 8, report-interval 10, initial-pings 5, total-iterations 10
time,run-ns,rate
1511911981.702790000,1001429000,17459670
1511911982.701071000,998234000,17213620
1511911983.700963000,999870000,17438576
1511911984.700254000,999250000,17540727
1511911985.698522000,998237000,17601469
1511911986.698879000,1000336000,17427735
1511911987.698114000,999193000,17561811
1511911988.697408000,999244000,17579399
1511911989.696666000,999215000,16896758
1511911990.695934000,999227000,17522404

This scaling PR with not enough work for the 4 default ponythreads, but with --ponyminthreads forced to be the same as --ponythreads, effectively disabling scheduler thread scaling.

Dipins-MBP:dhp dipinhora$ ./message-ubench-scaling --ponyminthreads=999999
# pingers 8, report-interval 10, initial-pings 5, total-iterations 10
time,run-ns,rate
1511911076.381846000,1000242000,22018045
1511911077.381077000,999198000,22328975
1511911078.380258000,999150000,22257245
1511911079.379441000,999145000,22314592
1511911080.379006000,999533000,22375454
1511911081.378140000,999112000,22351829
1511911082.376693000,998525000,22371559
1511911083.376186000,999464000,22428753
1511911084.376473000,1000258000,22359273
1511911085.375011000,998511000,22460565

Overall, there doesn't seem to be much difference in performance when using scheduler thread scaling, whether you compare it against master with fewer ponythreads, or effectively disable the scaling and compare against master with the default ponythreads. The minor differences can likely be attributed to the tons of other applications running on my MacBook while I gathered the numbers.


Enough work for default number of ponythreads comparison (--pingers=100)

Master with enough work for the 4 default ponythreads.

Dipins-MBP:dhp dipinhora$ ./message-ubench-master --pingers=100
# pingers 100, report-interval 10, initial-pings 5, total-iterations 10
time,run-ns,rate
1511911148.596786000,1001088000,17598105
1511911149.596143000,999250000,17948221
1511911150.595327000,999079000,17927916
1511911151.594556000,999131000,17890342
1511911152.593733000,999084000,17756510
1511911153.592962000,999130000,17783998
1511911154.592205000,999146000,17788162
1511911155.591335000,999030000,17816223
1511911156.589843000,998403000,16383893
1511911157.589749000,999796000,16962434

This scaling PR with enough work for the 4 default ponythreads.

Dipins-MBP:dhp dipinhora$ ./message-ubench-scaling --pingers=100
# pingers 100, report-interval 10, initial-pings 5, total-iterations 10
time,run-ns,rate
1511911166.722688000,999474000,17729182
1511911167.722974000,1000166000,17879973
1511911168.721716000,998632000,17350126
1511911169.721010000,999173000,17818367
1511911170.721340000,1000229000,17684335
1511911171.720116000,998661000,17917820
1511911172.719359000,999139000,17791456
1511911173.718605000,999145000,17898287
1511911174.717948000,999243000,17982331
1511911175.716522000,998467000,17866785

This scaling PR with enough work for the 4 default ponythreads, but with --ponyminthreads forced to be the same as --ponythreads, effectively disabling scheduler thread scaling.

Dipins-MBP:dhp dipinhora$ ./message-ubench-scaling --pingers=100 --ponyminthreads=9999
# pingers 100, report-interval 10, initial-pings 5, total-iterations 10
time,run-ns,rate
1511911187.395119000,1000551000,17021046
1511911188.394023000,998787000,16751350
1511911189.393210000,999086000,17834277
1511911190.392446000,999133000,17885750
1511911191.391633000,999080000,17896542
1511911192.390877000,999134000,17828693
1511911193.390268000,999291000,17950982
1511911194.390123000,999752000,17927982
1511911195.389304000,999084000,17740281
1511911196.388159000,998760000,17921059

Overall, there doesn't seem to be much difference in performance compared with master, whether scheduler thread scaling is enabled or disabled. The minor differences can likely be attributed to the tons of other applications running on my MacBook while I gathered the numbers.

@dipinhora (author):

Also, I forgot to mention: the following claims are based on my understanding (which could be wrong) of how the C code compiles down to machine code, not on disassembling the final code to confirm:

  • The overhead to check if this scheduler thread is a candidate to
    be suspended (&scheduler[current_active_scheduler_count - 1] ==
    current scheduler address) is a load and a single branch check
  • The overhead to check if this scheduler thread is a candidate to
    be suspended but cannot actually be suspended because we're at
    minimum_scheduler_count is one additional branch (on top of the
    overhead from the previous bullet)
  • The overhead to check if there are any scheduler threads to
    resume is a load and a single branch check

// and there are more active schedulers than the minimum requested
if ((sched == &scheduler[current_active_scheduler_count - 1])
&& (current_active_scheduler_count > min_scheduler_count) &&
atomic_compare_exchange_strong_explicit(&scheduler_count_changing,

Member:

This can be atomic_exchange_explicit(&scheduler_count_changing, true, memory_order_acquire). If the atomic is false, it will set it to true and return false (the old value), and if it's true, it will set it to true and return true, so you can know whether the locking was successful by looking at the return value.

The memory order should be acquire because the operation is used to mark the start of a critical region.
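
A sketch of the suggested pattern (not the final diff):

```c
#include <stdatomic.h>
#include <stdbool.h>

extern _Atomic bool scheduler_count_changing;

static bool try_lock_scheduler_count(void)
{
  /* exchange returns the previous value: false means the flag was
   * clear and we now own it; acquire marks the start of the
   * critical section */
  return !atomic_exchange_explicit(&scheduler_count_changing, true,
    memory_order_acquire);
}
```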

@dipinhora (author):

Thanks for the suggestion. I've changed the logic accordingly.


// decrement active_scheduler_count so other schedulers know we're
// sleeping
atomic_fetch_sub_explicit(&active_scheduler_count, 1,

Member:

If the above suggestion in wake_suspended_threads is implemented, this can be atomic load; non-atomic dec; atomic store instead since it would be guaranteed that only one thread can modify the variable at one time.
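
Sketched (assuming the single-writer guarantee described above):

```c
#include <stdatomic.h>
#include <stdint.h>

extern _Atomic uint32_t active_scheduler_count;

/* safe because the scheduler_count_changing flag guarantees a single
 * writer at a time: plain load, plain decrement, plain store */
static void decrement_active_count(void)
{
  uint32_t count = atomic_load_explicit(&active_scheduler_count,
    memory_order_relaxed);
  atomic_store_explicit(&active_scheduler_count, count - 1,
    memory_order_relaxed);
}
```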

@dipinhora (author):

Thanks for the suggestion. I've changed the logic accordingly.


// unlock the bool that controls modifying the active scheduler count
// variable
atomic_compare_exchange_strong_explicit(&scheduler_count_changing,

Member:

This can be a plain atomic_store since nobody will try to modify the variable while the current thread owns it.

The operation should have a release memory ordering since it marks the end of the critical section.
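
Sketched:

```c
#include <stdatomic.h>
#include <stdbool.h>

extern _Atomic bool scheduler_count_changing;

/* plain store is enough since only the owner clears the flag; release
 * ordering publishes the critical section's writes to the next
 * acquirer */
static void unlock_scheduler_count(void)
{
  atomic_store_explicit(&scheduler_count_changing, false,
    memory_order_release);
}
```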

@dipinhora (author):

Thanks for the suggestion. I've changed the logic accordingly.

#if defined(PLATFORM_IS_WINDOWS)
ponyint_thread_suspend(sched->wait_event_object);
#else
ponyint_thread_suspend(PONY_SCHED_SLEEP_WAKE_SIGNAL);

Member:

It seems to me that there is a race condition here, with the variable modification and thread suspend not being a single atomic operation. For instance, with the following sequence:

T1: unlock (count_changing is false)
T2: in sched_maybe_wakeup: lock, read, signal wakeup to T1
T1: suspend

If Windows events and pthread signals aren't queued up if sent when the receiver isn't waiting for them, then this is a bug.

@dipinhora (author):

Hmm.. Good point about the potential race condition here.

For Windows, SetEvent documentation (https://msdn.microsoft.com/en-us/library/windows/desktop/ms686211(v=vs.85).aspx) states:

The state of an auto-reset event object remains signaled until a single waiting thread is released, at which time the system automatically sets the state to nonsignaled. If no threads are waiting, the event object's state remains signaled.

Based on the above quote, I don't think the race condition exists on Windows: if the event gets signalled by T2 before T1 suspends, T1 should notice it is signalled immediately and return right away, since the event remains signalled until at least one thread is woken by it.

For Posix platforms, I've made a change to block the wake signal by default in all threads. The note at https://notes.shichao.io/apue/ch10/#reliable-signal-terminology-and-semantics states:

A process has the option of blocking the delivery of a signal. If a signal that is blocked is generated for a process, and if the action for that signal is either the default action or to catch the signal, then the signal remains pending for the process until the process either:

  • unblocks the signal, or
  • changes the action to ignore the signal.

Based on the above quote, I don't think the race condition exists on Posix platforms (as long as the signal is properly blocked): if T1 gets signalled by T2 before T1 suspends, T1 should notice the pending signal immediately and return right away, since the signal remains pending until the thread unblocks it for the sigwait.
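
For reference, the kind of signal-mask setup that change implies on Posix (a sketch; the actual runtime code may differ):

```c
#include <pthread.h>
#include <signal.h>
#include <stddef.h>

/* Block SIGUSR2 in every thread at startup. A wakeup sent between the
 * unlock and the suspend then stays pending, so the later sigwait()
 * returns immediately instead of the thread sleeping forever. */
static void block_wake_signal(void)
{
  sigset_t set;
  sigemptyset(&set);
  sigaddset(&set, SIGUSR2);
  pthread_sigmask(SIG_BLOCK, &set, NULL);
}
```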


// increment active_scheduler_count so other schedulers know we're
// awake again
atomic_fetch_add_explicit(&active_scheduler_count, 1,

Member:

Same as decrement, this can be atomic load; non-atomic inc; atomic store if the suggestion in wake_suspended_threads is implemented.

@dipinhora (author):

Thanks for the suggestion. I've changed the logic accordingly.

// unlock the bool that controls modifying the active scheduler count
// variable. this is because the signalling thread locks the control
// variable before signalling except on termination/shutdown
atomic_compare_exchange_strong_explicit(&scheduler_count_changing,

Member:

Same as above, store with release memory ordering.

@dipinhora (author):

Thanks for the suggestion. I've changed the logic accordingly.


// if we have some schedulers that are sleeping, wake one up
if((current_active_scheduler_count < scheduler_count) &&
atomic_compare_exchange_strong_explicit(&scheduler_count_changing, &cmp_val,

Member:

Same as in steal, exchange with acquire memory ordering.

@dipinhora (author):

Thanks for the suggestion. I've changed the logic accordingly.

uint32_t current_active_scheduler_count = get_active_scheduler_count();

// wake up any sleeping threads
while (current_active_scheduler_count < scheduler_count)

Member:

Since this is only used at program termination, a modification here would enable the use of less expensive atomic operations while the program is running (see the comment below in steal).

The required change would be to set scheduler_count_changing to true before sending a signal and then wait for it to go back to false before locking it again and sending the next signal. This would also avoid the loop when calling wake_suspended_threads, as it would be guaranteed that every thread has woken up when the function returns.
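
A sketch of that handshake, using the names from the surrounding snippets (wake_next_suspended_scheduler is a hypothetical helper standing in for the platform wake call):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

extern _Atomic bool scheduler_count_changing;
extern uint32_t scheduler_count;

extern uint32_t get_active_scheduler_count(void);
extern void wake_next_suspended_scheduler(void);  /* hypothetical helper */

/* Termination-time handshake: lock, wake one thread, then wait for the
 * woken thread to clear the flag before waking the next. When the loop
 * exits, every scheduler is guaranteed to be awake. */
static void wake_suspended_threads(void)
{
  while(get_active_scheduler_count() < scheduler_count)
  {
    if(!atomic_exchange_explicit(&scheduler_count_changing, true,
      memory_order_acquire))
      wake_next_suspended_scheduler();

    while(atomic_load_explicit(&scheduler_count_changing,
      memory_order_acquire))
      ;  /* the woken thread clears the flag once it has resumed */
  }
}
```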

@dipinhora (author):

Thanks for the suggestion. I've changed the logic accordingly.

@plietar closed this Nov 29, 2017
@plietar reopened this Nov 29, 2017
@plietar (Contributor) commented Nov 29, 2017:

Sorry, fat fingers

@dipinhora (author):

@Praetonus Thanks again for your feedback and suggestions. I've incorporated all of your suggestions and addressed the potential race condition issue in my comment above. I'd appreciate it if you could take another look whenever you have a chance.

BTW, the CI failures on OSX all seem to be the net/Broadcast test failures that have been an issue with OSX builds on Travis over the past couple of days.

// unlock the bool that controls modifying the active scheduler count
// variable
atomic_store_explicit(&scheduler_count_changing, false,
memory_order_relaxed);

Member:

memory_order_release here since the operation marks the end of a critical section.

// variable. this is because the signalling thread locks the control
// variable before signalling
atomic_store_explicit(&scheduler_count_changing, false,
memory_order_relaxed);

Member:

memory_order_release here.

@dipinhora (author):

@Praetonus Ugh, sorry... I could have sworn I changed those. Fixed now.

@SeanTAllen (Member):

@dipinhora can you rebase this against master?

@dipinhora (author):

Rebased.

ifdef linux or bsd or osx then
compile_error "SIGUSR2 reserved for runtime use"
else
compile_error "no SIGUSR1"

Member:

Shouldn't this still be SIGUSR2?

@dipinhora (author):

Yes. I'll fix this when I rebase and fix conflicts.

@SeanTAllen (Member):

@dipinhora can you rebase this again?

@Praetonus looking good to you?

@dipinhora (author):

@SeanTAllen Rebased.

@Praetonus @SeanTAllen (and anyone else interested), I also made some other changes (NOTE: the commit message has been updated with all these details, and I'd suggest reading that, but the following is a summary):

Now:

  • If there's a lot of work to do and at least one actor is muted
    in a scheduler thread, that thread tries to resume a suspended
    scheduler thread (if there are any) every time it is about to
    run an actor. NOTE: This could result in a pathological case
    where only one thread has a muted actor but there is only one
    overloaded actor. In this case the extra scheduler threads will
    keep being woken up and then go back to sleep over and over again.

This is different from before, when the code woke up a thread only at the moment an actor was muted. That meant that if only one actor was muted, only one thread would be woken (or possibly none, if the lock couldn't be acquired), which could have resulted in not enough threads being woken to handle the workload. The code also previously woke up a thread when an actor became overloaded. It no longer does that, since an overloaded actor doesn't necessarily mean another thread has work to do (for example, a single actor that creates work for itself until it's overloaded). A sketch of the new wakeup check follows.
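
Conceptually, the new per-actor-run check looks something like this (a sketch; wake_suspended_scheduler is a hypothetical helper, and the real code differs in detail):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

extern _Atomic bool scheduler_count_changing;
extern uint32_t scheduler_count;

extern uint32_t get_active_scheduler_count(void);
extern void wake_suspended_scheduler(void);  /* hypothetical helper */

/* Called before each actor run when this scheduler has muted actors:
 * if any scheduler is suspended, try (without blocking) to wake one. */
static void sched_maybe_wakeup(void)
{
  uint32_t current_active_scheduler_count = get_active_scheduler_count();

  if((current_active_scheduler_count < scheduler_count) &&
    !atomic_exchange_explicit(&scheduler_count_changing, true,
      memory_order_acquire))
  {
    /* the woken thread increments active_scheduler_count and clears
     * the flag itself once it resumes */
    wake_suspended_scheduler();
  }
}
```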

For Posix environments, the default implementation still relies on signals, as before. However, an alternate implementation using pthread condition variables is now available via a use=scheduler_scaling_pthreads argument to make (sketched below).
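
A minimal sketch of what such a pthread-condition-variable suspend/wake looks like (illustrative, not the PR's actual code):

```c
#include <pthread.h>
#include <stdbool.h>

/* Per-scheduler sleep object for the pthread-based variant. */
typedef struct sleep_object_t
{
  pthread_mutex_t mut;
  pthread_cond_t cond;
  bool signaled;
} sleep_object_t;

/* Suspend until woken; the loop guards against spurious wakeups. */
static void thread_suspend(sleep_object_t* obj)
{
  pthread_mutex_lock(&obj->mut);
  while(!obj->signaled)
    pthread_cond_wait(&obj->cond, &obj->mut);
  obj->signaled = false;
  pthread_mutex_unlock(&obj->mut);
}

/* Wake a suspended thread. */
static void thread_wake(sleep_object_t* obj)
{
  pthread_mutex_lock(&obj->mut);
  obj->signaled = true;
  pthread_cond_signal(&obj->cond);
  pthread_mutex_unlock(&obj->mut);
}
```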

The PR now also adds DTRACE probes for thread suspend and thread
resume actions.

@dipinhora (author):

@SeanTAllen @Praetonus I don't know why, but for some reason the signals implementation of scheduler scaling seems to cause hangs in the codegen tests on Travis OSX (which I'm not able to reproduce; it works perfectly on my machine, so I have no clue what's going on). Regardless, switching to the pthreads implementation for OSX seems to resolve the issue.

@dipinhora (author):

Restarting OSX job that timed out due to an LLVM install issue.

@dipinhora removed the "do not merge" label Dec 16, 2017
@dipinhora (author):

I've removed the "DO NOT MERGE" label because from my perspective this is ready to be merged unless there is additional feedback requiring changes.

@dipinhora (author):

Rebased to resolve conflict in Makefile.

Prior to this commit, the runtime would start a specific number
of scheduler threads (by default, the same as the number of physical
cores) on initialization, and these scheduler threads would run
actors, send block/unblock messages, and steal actors from each
other regardless of how many actors existed or what the program's
workload actually was. This usually resulted in wasted CPU cycles
and cache thrashing when there wasn't enough work to keep all
scheduler threads busy.

This commit changes things so that the runtime still starts up
the threads on initialization, but now the threads can suspend
execution when there isn't enough work to do, minimizing the
work-stealing overhead. The rough outline of how this works is:

* We now have three variables related to the number of schedulers:
  `maximum_scheduler_count` (the normal `--ponythreads` option),
  `active_scheduler_count`, and `minimum_scheduler_count`
  (a new `--ponyminthreads` option)
* On startup, we create all possible scheduler threads (up to
  `maximum_scheduler_count`)
* We can never have more than `maximum_scheduler_count` threads
  active at a time
* We can never have fewer than `minimum_scheduler_count` threads
  active at a time
* Scheduler threads can suspend themselves (i.e. effectively
  pretend as if they don't exist)
* A scheduler thread can only suspend itself if its actor queue
  is empty, it has no actors in its mute map, and it would
  normally send a block message
* Only one scheduler thread can suspend or resume at a time (the
  largest one running or the smallest one suspended respectively)
* We can never skip a scheduler thread and suspend or wake up a
  scheduler thread out of order (i.e. thread 6 is active, but
  thread 5 gets suspended or thread 5 is suspended but thread 6
  gets resumed)
* If there isn't enough work and a scheduler thread would normally
  block and it's the largest active scheduler thread, it suspends
  itself instead
* If there isn't enough work and a scheduler thread would normally
  block and it's not the largest active scheduler thread, it does
  normal scheduler block message sending
* If there's a lot of work to do and at least one actor is muted
  in a scheduler thread, that thread tries to resume a suspended
  scheduler thread (if there are any) every time it is about to
  run an actor. NOTE: This could result in a pathological case
  where only one thread has a muted actor but there is only one
  overloaded actor. In this case the extra scheduler threads will
  keep being woken up and then go back to sleep over and over again.
* The overhead to check if this scheduler thread is a candidate to
  be suspended (`&scheduler[current_active_scheduler_count - 1] ==
  current scheduler address`) is a load and a single branch check
* The overhead to check if this scheduler thread is a candidate to
  be suspended but cannot actually be suspended because we're at
  `minimum_scheduler_count` is one additional branch (on top of the
  overhead from the previous bullet)
* The overhead to check if there are any scheduler threads to
  resume is a load and a single branch check

The implementation of the scheduler suspend/resume is different
depending on the platform.

For Windows, it relies on Event Objects and `WaitForSingleObject`
to suspend threads and `SetEvent` to wake suspended threads.

For Posix environments, by default it relies on signals (specifically,
SIGUSR2), as they are quicker than other mechanisms such as
pthread condition variables (according to Stack Overflow:
https://stackoverflow.com/a/4676069 and
https://stackoverflow.com/a/23945651). It uses `sigwait` to
suspend threads and `pthread_kill` to wake suspended threads. The
signal allotted for this is `SIGUSR2`, so `SIGUSR2` has been
disabled for use in the `signals` package with an error indicating
that it is used by the runtime.

An alternative implementation is also available via a
`use=scheduler_scaling_pthreads` argument to make. It relies on
pthread condition variables instead, freeing `SIGUSR2` for use in
the `signals` package. It uses `pthread_cond_wait` to suspend
threads and `pthread_cond_signal` to wake suspended threads.

The old behavior of having all scheduler threads active all the
time can be achieved by passing `--ponyminthreads=9999999` as an
argument to a program (the minimum scheduler thread count is capped
so that it can never exceed the total scheduler thread count).

This commit also adds DTRACE probes for thread suspend and thread
resume.

This commit also switches from using `signal` to `sigaction` for
the epoll/kqueue asio signals logic because `sigaction` is more
robust and reliable across platforms
(https://stackoverflow.com/a/232711).

@dipinhora (author):

Rebased again to resolve conflict in main.c.

@SeanTAllen added the "changelog - added" label Dec 20, 2017
@SeanTAllen (Member):

Release notes for this need to note that SIGUSR2 is no longer available to user programs.

@SeanTAllen merged commit 2137eee into ponylang:master Dec 20, 2017
ponylang-main added a commit that referenced this pull request Dec 20, 2017
slayful pushed a commit to slayful/ponyc that referenced this pull request Dec 20, 2017
slayful pushed a commit to slayful/ponyc that referenced this pull request Dec 20, 2017
@jemc (Member) commented Dec 21, 2017:

🎉
