Fix and re-enable dynamic scheduler scaling #2483

dipinhora · 2018-01-10T23:32:00Z

Change dynamic scheduler scaling implementation in order to resolve
the hangs encountered in #2451.

The previous implementation assumed that signalling to wake a thread
was a reliable operation. Apparently, that's not necessarily true
(see https://en.wikipedia.org/wiki/Spurious_wakeup and
https://askldjd.com/2010/04/24/the-lost-wakeup-problem/). Seeing
as we couldn't find any other explanation for why the previous
implementation was experiencing hangs, I've assumed it is either
because of lost wake ups or spurious wake ups and redesigned the
logic accordingly.

Now, when a thread is about to suspend, it will decrement the
active_scheduler_count and then suspend. When it wakes up, it will
check to see if the active_scheduler_count is at least as big as
its index. If the active_scheduler_count isn't big enough, the
thread will suspend itself again immediately. If it is big enough,
it will resume. Threads no longer modify active_scheduler_count
when they wake up.

active_scheduler_count must now be modified by the thread that is
waking up another thread prior to sending the wake up notification.
Additionally, since we're now assuming that wake up signals can be
lost, we now send multiple wake up notifications just in case. While
this is somewhat wasteful, it is better than being in a situation
where some threads aren't woken up at all (i.e. a hang).

This commit also includes a change inspired by #2474. Now, all
scheduler threads can suspend as long as there is at least one
noisy actor registered with the ASIO subsystem. If there are no
noisy actors registered with the ASIO subsystem then scheduler 0
is not allowed to suspend itself.

dipinhora · 2018-01-10T23:33:54Z

@winksaville This PR is my alternative implementation of dynamic scheduler scaling that I mentioned in the other PR. It would be great if you can test it out using stdlib to see if you still get hangs or not.

dipinhora · 2018-01-10T23:36:12Z

As an example of all schedulers suspending due to lack of work, I compiled/tested the examples/echo with some printfs included in the runtime:

Dipins-MBP:dhp dipinhora$ ./echo 
Thread: 3 sleeping...
listening on 127.0.0.1:53097
Thread: 2 sleeping...
Thread: 0 sleeping...
Thread: 1 sleeping...
Thread: 0 resuming...
connection accepted
Thread: 0 sleeping...
Thread: 0 resuming...
data received, looping it back
Thread: 0 sleeping...
Thread: 0 resuming...
data received, looping it back
Thread: 0 sleeping...
Thread: 0 resuming...
data received, looping it back
Thread: 0 sleeping...
Thread: 0 resuming...
data received, looping it back
Thread: 0 sleeping...
Thread: 0 resuming...
data received, looping it back
Thread: 0 sleeping...
Thread: 0 resuming...
data received, looping it back
Thread: 0 sleeping...
Thread: 0 resuming...
data received, looping it back
Thread: 0 sleeping...
Thread: 0 resuming...
data received, looping it back
Thread: 0 sleeping...
Thread: 0 resuming...
server closed
Thread: 0 sleeping...

dipinhora · 2018-01-10T23:37:16Z

NOTE: I've marked this as "DO NOT MERGE" until after @winksaville is able to confirm whether this resolves the hangs from #2451 or not.

SeanTAllen · 2018-01-11T00:47:48Z

NIce

winksaville · 2018-01-11T01:35:04Z

I'll be testing this evening, @dipinhora what testing have you done and any particular flag/switches to test?

dipinhora · 2018-01-11T02:42:45Z

@winksaville I honestly haven't done a lot of testing so far beyond basic correctness testing to ensure things are functioning as expected. I'll likely get a chance to do more testing of for the hangs case tomorrow.

It would be great if you could do the following:

Run stdlib or pony-ring or message-ubench (or all three) using pthreads and debug ponyc (to confirm if the hanging issue still exists or not)
Run stdlib or pony-ring or message-ubench (or all three) using signals and debug ponyc (to confirm if the hanging issue still exists or not)

Assuming the debug tests are successful and nothing hangs and you have time:

Run stdlib or pony-ring or message-ubench (or all three) using pthreads and release ponyc (to confirm if the hanging issue still exists or not for a release build)
Run stdlib or pony-ring or message-ubench (or all three) using signals and release ponyc (to confirm if the hanging issue still exists or not for a release build)

winksaville · 2018-01-11T03:05:13Z

SG, which is preferred long term, pthreads or signals?

dipinhora · 2018-01-11T03:18:38Z

My understanding of things is that the signals implementation should be faster than the pthreads implementation (see my original dynamic scaling PR comment at #2386 (comment)) but the signals implementation has issues on OSX (for some reason travis CI sometimes hangs when running the test suite but I'm unable to reproduce) so pthreads is the default there.

Ideally, we would keep both implementations, but if we had to pick one, I'd pick pthreads based on the existence of the issue on OSX for the signals implementation.

winksaville · 2018-01-11T14:28:17Z

Ran release builds with signals for about 10hrs, no hang!!!

$ ./build/release/ponyc --version
0.21.2-2c31eda9a [release]
compiled with: llvm 5.0.1 -- cc (GCC) 7.2.1 20171224
wink@wink-desktop:~/prgs/pony/ponyc (test-Fix-and-re-enable-dynamic-scheduler-scaling-pr-2483)
$ ./stdlib --ponyversion
0.21.2-2c31eda9a [release]
compiled with: llvm 5.0.1 -- cc (GCC) 7.2.1 20171224

Now testing stdlib release-scheduler_scaling_pthreads/ponyc, will report back soon.

dipinhora · 2018-01-11T15:07:40Z

That's great news! Hopefully the pthreads implementation also works equally well.

winksaville · 2018-01-11T15:40:58Z

Bad news, stdlib hung after about 20min. Here are the versions:

$ ./build/release-scheduler_scaling_pthreads/ponyc --version
0.21.2-2c31eda9a [release]
compiled with: llvm 5.0.1 -- cc (GCC) 7.2.1 20171224
wink@wink-desktop:~/prgs/pony/ponyc (test-Fix-and-re-enable-dynamic-scheduler-scaling-pr-2483)
$ ./stdlib --ponyversion
0.21.2-2c31eda9a [release]
compiled with: llvm 5.0.1 -- cc (GCC) 7.2.1 20171224

Here is the gdb output:

GNU gdb (GDB) 8.0.1
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./stdlib...(no debugging symbols found)...done.
Attaching to program: /home/wink/prgs/pony/ponyc/stdlib, process 23806
[New LWP 23807]
[New LWP 23808]
[New LWP 23809]
[New LWP 23810]
[New LWP 23811]
[New LWP 23812]
[New LWP 23813]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
0x00007fd4f239144d in pthread_join () from /usr/lib/libpthread.so.0
(gdb) info threads
  Id   Target Id         Frame 
* 1    Thread 0x7fd4f350b6c0 (LWP 23806) "stdlib" 0x00007fd4f239144d in pthread_join () from /usr/lib/libpthread.so.0
  2    Thread 0x7fd4e965f700 (LWP 23807) "stdlib" 0x00007fd4f1960f86 in epoll_pwait () from /usr/lib/libc.so.6
  3    Thread 0x7fd4e8e5e700 (LWP 23808) "stdlib" 0x00007fd4f239a240 in nanosleep () from /usr/lib/libpthread.so.0
  4    Thread 0x7fd4e065e700 (LWP 23809) "stdlib" 0x00007fd4f239a240 in nanosleep () from /usr/lib/libpthread.so.0
  5    Thread 0x7fd4dfe5e700 (LWP 23810) "stdlib" 0x00007fd4f239a240 in nanosleep () from /usr/lib/libpthread.so.0
  6    Thread 0x7fd4cf65e700 (LWP 23811) "stdlib" 0x00007fd4f239a240 in nanosleep () from /usr/lib/libpthread.so.0
  7    Thread 0x7fd4c6e5e700 (LWP 23812) "stdlib" 0x00007fd4f239a240 in nanosleep () from /usr/lib/libpthread.so.0
  8    Thread 0x7fd4c665e700 (LWP 23813) "stdlib" 0x00007fd4f239639d in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
(gdb) thread 1
[Switching to thread 1 (Thread 0x7fd4f350b6c0 (LWP 23806))]
#0  0x00007fd4f239144d in pthread_join () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007fd4f239144d in pthread_join () from /usr/lib/libpthread.so.0
#1  0x000055695cfd675b in ponyint_thread_join ()
#2  0x000055695cfcd513 in ponyint_sched_shutdown ()
#3  0x000055695cfcd7d7 in ponyint_sched_start ()
#4  0x000055695cfcd459 in pony_start ()
#5  0x000055695cfbf0aa in main ()
(gdb) p *this_scheduler 
Cannot access memory at address 0xd53e50279460
(gdb) thread 2
[Switching to thread 2 (Thread 0x7fd4e965f700 (LWP 23807))]
#0  0x00007fd4f1960f86 in epoll_pwait () from /usr/lib/libc.so.6
(gdb) bt
#0  0x00007fd4f1960f86 in epoll_pwait () from /usr/lib/libc.so.6
#1  0x000055695cfd2813 in ponyint_asio_backend_dispatch ()
#2  0x00007fd4f239008c in start_thread () from /usr/lib/libpthread.so.0
#3  0x00007fd4f1960e1f in clone () from /usr/lib/libc.so.6
(gdb) thread 3
[Switching to thread 3 (Thread 0x7fd4e8e5e700 (LWP 23808))]
#0  0x00007fd4f239a240 in nanosleep () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007fd4f239a240 in nanosleep () from /usr/lib/libpthread.so.0
#1  0x000055695cfda8d3 in ponyint_cpu_core_pause ()
#2  0x000055695cfce4a4 in run_thread ()
#3  0x00007fd4f239008c in start_thread () from /usr/lib/libpthread.so.0
#4  0x00007fd4f1960e1f in clone () from /usr/lib/libc.so.6
(gdb) thread 4
[Switching to thread 4 (Thread 0x7fd4e065e700 (LWP 23809))]
#0  0x00007fd4f239a240 in nanosleep () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007fd4f239a240 in nanosleep () from /usr/lib/libpthread.so.0
#1  0x000055695cfda8d3 in ponyint_cpu_core_pause ()
#2  0x000055695cfce4a4 in run_thread ()
#3  0x00007fd4f239008c in start_thread () from /usr/lib/libpthread.so.0
#4  0x00007fd4f1960e1f in clone () from /usr/lib/libc.so.6
(gdb) thread 5
[Switching to thread 5 (Thread 0x7fd4dfe5e700 (LWP 23810))]
#0  0x00007fd4f239a240 in nanosleep () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007fd4f239a240 in nanosleep () from /usr/lib/libpthread.so.0
#1  0x000055695cfda8d3 in ponyint_cpu_core_pause ()
#2  0x000055695cfce4a4 in run_thread ()
#3  0x00007fd4f239008c in start_thread () from /usr/lib/libpthread.so.0
#4  0x00007fd4f1960e1f in clone () from /usr/lib/libc.so.6
(gdb) thread 6
[Switching to thread 6 (Thread 0x7fd4cf65e700 (LWP 23811))]
#0  0x00007fd4f239a240 in nanosleep () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007fd4f239a240 in nanosleep () from /usr/lib/libpthread.so.0
#1  0x000055695cfda8d3 in ponyint_cpu_core_pause ()
#2  0x000055695cfce4a4 in run_thread ()
#3  0x00007fd4f239008c in start_thread () from /usr/lib/libpthread.so.0
#4  0x00007fd4f1960e1f in clone () from /usr/lib/libc.so.6
(gdb) thread 7
[Switching to thread 7 (Thread 0x7fd4c6e5e700 (LWP 23812))]
#0  0x00007fd4f239a240 in nanosleep () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007fd4f239a240 in nanosleep () from /usr/lib/libpthread.so.0
#1  0x000055695cfda8d3 in ponyint_cpu_core_pause ()
#2  0x000055695cfce4a4 in run_thread ()
#3  0x00007fd4f239008c in start_thread () from /usr/lib/libpthread.so.0
#4  0x00007fd4f1960e1f in clone () from /usr/lib/libc.so.6
(gdb) thread 8
[Switching to thread 8 (Thread 0x7fd4c665e700 (LWP 23813))]
#0  0x00007fd4f239639d in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007fd4f239639d in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
#1  0x000055695cfd67af in ponyint_thread_suspend ()
#2  0x000055695cfce7c9 in run_thread ()
#3  0x00007fd4f239008c in start_thread () from /usr/lib/libpthread.so.0
#4  0x00007fd4f1960e1f in clone () from /usr/lib/libc.so.6
(gdb) printf "asio_cpu=%d scheduler_count=%d min_scheduler_count=%d active_scheduler_count=%d scheduler_count_changing=%d detect_quiescence=%d use_yield=%d\n", asio_cpu,scheduler_count,min_scheduler_count,active_scheduler_count,scheduler_count_changing,detect_quiescence,use_yield
asio_cpu=-1 scheduler_count=6 min_scheduler_count=0 active_scheduler_count=6 scheduler_count_changing=0 detect_quiescence=1 use_yield=257

winksaville · 2018-01-11T16:21:28Z

Debug run with pthreads:

GNU gdb (GDB) 8.0.1
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./stdlib...done.
Attaching to program: /home/wink/prgs/pony/ponyc/stdlib, process 32506
[New LWP 32507]
[New LWP 32508]
[New LWP 32509]
[New LWP 32510]
[New LWP 32511]
[New LWP 32512]
[New LWP 32513]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
0x00007fdcb70d144d in pthread_join () from /usr/lib/libpthread.so.0
(gdb) info threads
  Id   Target Id         Frame 
  1    Thread 0x7fdcb824b6c0 (LWP 32506) "stdlib" 0x00007fdcb70d144d in pthread_join () from /usr/lib/libpthread.so.0
  2    Thread 0x7fdcae39f700 (LWP 32507) "stdlib" 0x00007fdcb66a0f86 in epoll_pwait () from /usr/lib/libc.so.6
* 3    Thread 0x7fdcadb9e700 (LWP 32508) "stdlib" 0x00007fdcb70da240 in nanosleep () from /usr/lib/libpthread.so.0
  4    Thread 0x7fdcad39e700 (LWP 32509) "stdlib" 0x00007fdcb70da240 in nanosleep () from /usr/lib/libpthread.so.0
  5    Thread 0x7fdca4b9e700 (LWP 32510) "stdlib" 0x00007fdcb70da240 in nanosleep () from /usr/lib/libpthread.so.0
  6    Thread 0x7fdc9439e700 (LWP 32511) "stdlib" 0x00007fdcb70da240 in nanosleep () from /usr/lib/libpthread.so.0
  7    Thread 0x7fdc93b9e700 (LWP 32512) "stdlib" 0x00007fdcb70da240 in nanosleep () from /usr/lib/libpthread.so.0
  8    Thread 0x7fdc8b39e700 (LWP 32513) "stdlib" 0x00007fdcb70d639d in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
(gdb) thread 1
[Switching to thread 1 (Thread 0x7fdcb824b6c0 (LWP 32506))]
#0  0x00007fdcb70d144d in pthread_join () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007fdcb70d144d in pthread_join () from /usr/lib/libpthread.so.0
#1  0x0000556701ad5f88 in ponyint_thread_join (thread=140585784174336) at src/libponyrt/platform/threads.c:229
#2  0x0000556701ac8ca5 in ponyint_sched_shutdown () at src/libponyrt/sched/scheduler.c:643
#3  0x0000556701ac92cd in ponyint_sched_start (library=false) at src/libponyrt/sched/scheduler.c:764
#4  0x0000556701ac7d3e in pony_start (library=false, language_features=true, exit_code=0x0) at src/libponyrt/sched/start.c:168
#5  0x0000556701ab98ba in main ()
(gdb) thread 2
[Switching to thread 2 (Thread 0x7fdcae39f700 (LWP 32507))]
#0  0x00007fdcb66a0f86 in epoll_pwait () from /usr/lib/libc.so.6
(gdb) thread 1
[Switching to thread 1 (Thread 0x7fdcb824b6c0 (LWP 32506))]
#0  0x00007fdcb70d144d in pthread_join () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007fdcb70d144d in pthread_join () from /usr/lib/libpthread.so.0
#1  0x0000556701ad5f88 in ponyint_thread_join (thread=140585784174336) at src/libponyrt/platform/threads.c:229
#2  0x0000556701ac8ca5 in ponyint_sched_shutdown () at src/libponyrt/sched/scheduler.c:643
#3  0x0000556701ac92cd in ponyint_sched_start (library=false) at src/libponyrt/sched/scheduler.c:764
#4  0x0000556701ac7d3e in pony_start (library=false, language_features=true, exit_code=0x0) at src/libponyrt/sched/start.c:168
#5  0x0000556701ab98ba in main ()
(gdb) p *this_scheduler
$3 = {tid = 140585958946496, index = 0, cpu = 0, node = 0, terminate = false, asio_stopped = false, asio_noisy = false, 
  sleep_object = 0x0, last_victim = 0x0, ctx = {scheduler = 0x0, current = 0x7fdcb639d800, 
    trace_object = 0x556701accfe1 <ponyint_gc_sendobject>, trace_actor = 0x556701acd351 <ponyint_gc_sendactor>, stack = 0x0, acquire = {
      contents = {count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, serialise_buffer = 0x0, serialise_size = 0, serialise = {
      contents = {count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, serialise_alloc = 0x0, serialise_alloc_final = 0x0, 
    serialise_throw = 0x0}, block_count = 0, ack_token = 0, ack_count = 0, mute_mapping = {contents = {count = 0, size = 0, 
      item_bitmap = 0x0, buckets = 0x0}}, q = {head = 0x0, tail = {{object = 0x0, counter = 0}, 
      raw = 0x00000000000000000000000000000000}}, mq = {head = 0x0, tail = 0x0}}
(gdb) thread 2
[Switching to thread 2 (Thread 0x7fdcae39f700 (LWP 32507))]
#0  0x00007fdcb66a0f86 in epoll_pwait () from /usr/lib/libc.so.6
(gdb) bt
#0  0x00007fdcb66a0f86 in epoll_pwait () from /usr/lib/libc.so.6
#1  0x0000556701ad0476 in ponyint_asio_backend_dispatch (arg=0x7fdcb639dc00) at src/libponyrt/asio/epoll.c:215
#2  0x00007fdcb70d008c in start_thread () from /usr/lib/libpthread.so.0
#3  0x00007fdcb66a0e1f in clone () from /usr/lib/libc.so.6
(gdb) p *this_scheduler 
$4 = {tid = 140585792567040, index = 0, cpu = 0, node = 0, terminate = false, asio_stopped = false, asio_noisy = false, 
  sleep_object = 0x0, last_victim = 0x0, ctx = {scheduler = 0x0, current = 0x0, trace_object = 0x0, trace_actor = 0x0, stack = 0x0, 
    acquire = {contents = {count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, serialise_buffer = 0x0, serialise_size = 0, 
    serialise = {contents = {count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, serialise_alloc = 0x0, 
    serialise_alloc_final = 0x0, serialise_throw = 0x0}, block_count = 0, ack_token = 0, ack_count = 0, mute_mapping = {contents = {
      count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, q = {head = 0x0, tail = {{object = 0x0, counter = 0}, 
      raw = 0x00000000000000000000000000000000}}, mq = {head = 0x0, tail = 0x0}}
(gdb) thread 3
[Switching to thread 3 (Thread 0x7fdcadb9e700 (LWP 32508))]
#0  0x00007fdcb70da240 in nanosleep () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007fdcb70da240 in nanosleep () from /usr/lib/libpthread.so.0
#1  0x0000556701adc417 in ponyint_cpu_core_pause (tsc=2194124946679732, tsc2=2194505353769212, yield=true)
    at src/libponyrt/sched/cpu.c:348
#2  0x0000556701ac864a in quiescent (sched=0x7fdcb639e800, tsc=2194124946679732, tsc2=2194505353769212)
    at src/libponyrt/sched/scheduler.c:314
#3  0x0000556701ac87bd in steal (sched=0x7fdcb639e800) at src/libponyrt/sched/scheduler.c:399
#4  0x0000556701ac8b2c in run (sched=0x7fdcb639e800) at src/libponyrt/sched/scheduler.c:565
#5  0x0000556701ac8c48 in run_thread (arg=0x7fdcb639e800) at src/libponyrt/sched/scheduler.c:630
#6  0x00007fdcb70d008c in start_thread () from /usr/lib/libpthread.so.0
#7  0x00007fdcb66a0e1f in clone () from /usr/lib/libc.so.6
(gdb) p *this_scheduler 
$5 = {tid = 140585784174336, index = 0, cpu = 0, node = 0, terminate = false, asio_stopped = false, asio_noisy = false, 
  sleep_object = 0x7fdcb639e400, last_victim = 0x7fdcb639e9c0, ctx = {scheduler = 0x7fdcb639e800, current = 0x7fdca434d800, 
    trace_object = 0x556701acd08c <ponyint_gc_recvobject>, trace_actor = 0x556701acd3cc <ponyint_gc_recvactor>, stack = 0x0, acquire = {
      contents = {count = 0, size = 8, item_bitmap = 0x7fdca4387b00, buckets = 0x7fdca4387b08}}, serialise_buffer = 0x0, 
    serialise_size = 0, serialise = {contents = {count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, serialise_alloc = 0x0, 
    serialise_alloc_final = 0x0, serialise_throw = 0x0}, block_count = 5, ack_token = 45, ack_count = 0, mute_mapping = {contents = {
      count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, q = {head = 0x7fdc93256aa0, tail = {{object = 0x7fdc93256aa0, 
        counter = 688}, raw = 0x00000000000002b000007fdc93256aa0}}, mq = {head = 0x7fdc5bf89900, tail = 0x7fdc5bf89900}}
(gdb) thread 4
[Switching to thread 4 (Thread 0x7fdcad39e700 (LWP 32509))]
#0  0x00007fdcb70da240 in nanosleep () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007fdcb70da240 in nanosleep () from /usr/lib/libpthread.so.0
#1  0x0000556701adc417 in ponyint_cpu_core_pause (tsc=2194128307079564, tsc2=2194505353765065, yield=true)
    at src/libponyrt/sched/cpu.c:348
#2  0x0000556701ac864a in quiescent (sched=0x7fdcb639e9c0, tsc=2194128307079564, tsc2=2194505353765065)
    at src/libponyrt/sched/scheduler.c:314
#3  0x0000556701ac87bd in steal (sched=0x7fdcb639e9c0) at src/libponyrt/sched/scheduler.c:399
#4  0x0000556701ac8b2c in run (sched=0x7fdcb639e9c0) at src/libponyrt/sched/scheduler.c:565
#5  0x0000556701ac8c48 in run_thread (arg=0x7fdcb639e9c0) at src/libponyrt/sched/scheduler.c:630
#6  0x00007fdcb70d008c in start_thread () from /usr/lib/libpthread.so.0
#7  0x00007fdcb66a0e1f in clone () from /usr/lib/libc.so.6
(gdb) p *this_scheduler 
$6 = {tid = 140585775785728, index = 1, cpu = 1, node = 0, terminate = false, asio_stopped = false, asio_noisy = false, 
  sleep_object = 0x7fdcb639e440, last_victim = 0x7fdcb639ef00, ctx = {scheduler = 0x7fdcb639e9c0, current = 0x7fdcb639fa00, 
    trace_object = 0x556701accfe1 <ponyint_gc_sendobject>, trace_actor = 0x556701acd351 <ponyint_gc_sendactor>, stack = 0x0, acquire = {
      contents = {count = 0, size = 8, item_bitmap = 0x7fdc8ab80000, buckets = 0x7fdc8ab80008}}, serialise_buffer = 0x0, 
    serialise_size = 0, serialise = {contents = {count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, serialise_alloc = 0x0, 
    serialise_alloc_final = 0x0, serialise_throw = 0x0}, block_count = 0, ack_token = 0, ack_count = 0, mute_mapping = {contents = {
      count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, q = {head = 0x7fdc5b881e80, tail = {{object = 0x7fdc5b881e80, 
        counter = 846}, raw = 0x000000000000034e00007fdc5b881e80}}, mq = {head = 0x7fdc5b32b300, tail = 0x7fdc5b32b300}}
(gdb) thread 5
[Switching to thread 5 (Thread 0x7fdca4b9e700 (LWP 32510))]
#0  0x00007fdcb70da240 in nanosleep () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007fdcb70da240 in nanosleep () from /usr/lib/libpthread.so.0
#1  0x0000556701adc417 in ponyint_cpu_core_pause (tsc=2194128325226844, tsc2=2194505353703958, yield=true)
    at src/libponyrt/sched/cpu.c:348
#2  0x0000556701ac864a in quiescent (sched=0x7fdcb639eb80, tsc=2194128325226844, tsc2=2194505353703958)
    at src/libponyrt/sched/scheduler.c:314
#3  0x0000556701ac87bd in steal (sched=0x7fdcb639eb80) at src/libponyrt/sched/scheduler.c:399
#4  0x0000556701ac8b2c in run (sched=0x7fdcb639eb80) at src/libponyrt/sched/scheduler.c:565
#5  0x0000556701ac8c48 in run_thread (arg=0x7fdcb639eb80) at src/libponyrt/sched/scheduler.c:630
#6  0x00007fdcb70d008c in start_thread () from /usr/lib/libpthread.so.0
#7  0x00007fdcb66a0e1f in clone () from /usr/lib/libc.so.6
(gdb) p *this_scheduler 
$7 = {tid = 140585633179392, index = 2, cpu = 2, node = 0, terminate = false, asio_stopped = false, asio_noisy = false, 
  sleep_object = 0x7fdcb639e480, last_victim = 0x7fdcb639e9c0, ctx = {scheduler = 0x7fdcb639eb80, current = 0x7fdcb639fa00, 
    trace_object = 0x556701accfe1 <ponyint_gc_sendobject>, trace_actor = 0x556701acd351 <ponyint_gc_sendactor>, stack = 0x0, acquire = {
      contents = {count = 0, size = 8, item_bitmap = 0x7fdc7ab62900, buckets = 0x7fdc7ab62908}}, serialise_buffer = 0x0, 
    serialise_size = 0, serialise = {contents = {count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, serialise_alloc = 0x0, 
    serialise_alloc_final = 0x0, serialise_throw = 0x0}, block_count = 0, ack_token = 0, ack_count = 0, mute_mapping = {contents = {
      count = 0, size = 8, item_bitmap = 0x7fdc93337e00, buckets = 0x7fdc93337e08}}, q = {head = 0x7fdc5aea3380, tail = {{
        object = 0x7fdc5aea3380, counter = 1422}, raw = 0x000000000000058e00007fdc5aea3380}}, mq = {head = 0x7fdc5b32b380, 
    tail = 0x7fdc5b32b380}}
(gdb) thread 6
[Switching to thread 6 (Thread 0x7fdc9439e700 (LWP 32511))]
#0  0x00007fdcb70da240 in nanosleep () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007fdcb70da240 in nanosleep () from /usr/lib/libpthread.so.0
#1  0x0000556701adc417 in ponyint_cpu_core_pause (tsc=2194128265828645, tsc2=2194505353886054, yield=true)
    at src/libponyrt/sched/cpu.c:348
#2  0x0000556701ac864a in quiescent (sched=0x7fdcb639ed40, tsc=2194128265828645, tsc2=2194505353886054)
    at src/libponyrt/sched/scheduler.c:314
#3  0x0000556701ac87bd in steal (sched=0x7fdcb639ed40) at src/libponyrt/sched/scheduler.c:399
#4  0x0000556701ac8b2c in run (sched=0x7fdcb639ed40) at src/libponyrt/sched/scheduler.c:565
#5  0x0000556701ac8c48 in run_thread (arg=0x7fdcb639ed40) at src/libponyrt/sched/scheduler.c:630
#6  0x00007fdcb70d008c in start_thread () from /usr/lib/libpthread.so.0
#7  0x00007fdcb66a0e1f in clone () from /usr/lib/libc.so.6
(gdb) p *this_scheduler 
$8 = {tid = 140585356355328, index = 3, cpu = 3, node = 0, terminate = false, asio_stopped = false, asio_noisy = false, 
  sleep_object = 0x7fdcb639e4c0, last_victim = 0x7fdcb639eb80, ctx = {scheduler = 0x7fdcb639ed40, current = 0x7fdcb639fa00, 
    trace_object = 0x556701accfe1 <ponyint_gc_sendobject>, trace_actor = 0x556701acd351 <ponyint_gc_sendactor>, stack = 0x0, acquire = {
      contents = {count = 0, size = 8, item_bitmap = 0x7fdc932a0e00, buckets = 0x7fdc932a0e08}}, serialise_buffer = 0x0, 
    serialise_size = 0, serialise = {contents = {count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, serialise_alloc = 0x0, 
    serialise_alloc_final = 0x0, serialise_throw = 0x0}, block_count = 0, ack_token = 0, ack_count = 0, mute_mapping = {contents = {
      count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, q = {head = 0x7fdc5bcc2400, tail = {{object = 0x7fdc5bcc2400, 
        counter = 1339}, raw = 0x000000000000053b00007fdc5bcc2400}}, mq = {head = 0x7fdc5b323c00, tail = 0x7fdc5b323c00}}
(gdb) thread 7
[Switching to thread 7 (Thread 0x7fdc93b9e700 (LWP 32512))]
#0  0x00007fdcb70da240 in nanosleep () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007fdcb70da240 in nanosleep () from /usr/lib/libpthread.so.0
#1  0x0000556701adc417 in ponyint_cpu_core_pause (tsc=2194128325284946, tsc2=2194419522497772, yield=true)
    at src/libponyrt/sched/cpu.c:348
#2  0x0000556701ac864a in quiescent (sched=0x7fdcb639ef00, tsc=2194128325284946, tsc2=2194419522497772)
    at src/libponyrt/sched/scheduler.c:314
#3  0x0000556701ac87bd in steal (sched=0x7fdcb639ef00) at src/libponyrt/sched/scheduler.c:399
#4  0x0000556701ac8b2c in run (sched=0x7fdcb639ef00) at src/libponyrt/sched/scheduler.c:565
#5  0x0000556701ac8c48 in run_thread (arg=0x7fdcb639ef00) at src/libponyrt/sched/scheduler.c:630
#6  0x00007fdcb70d008c in start_thread () from /usr/lib/libpthread.so.0
#7  0x00007fdcb66a0e1f in clone () from /usr/lib/libc.so.6
(gdb) p *this_scheduler 
$9 = {tid = 140585347966720, index = 4, cpu = 4, node = 0, terminate = false, asio_stopped = false, asio_noisy = false, 
  sleep_object = 0x7fdcb639e500, last_victim = 0x7fdcb639f0c0, ctx = {scheduler = 0x7fdcb639ef00, current = 0x7fdcb639fa00, 
    trace_object = 0x556701acd08c <ponyint_gc_recvobject>, trace_actor = 0x556701acd3cc <ponyint_gc_recvactor>, stack = 0x0, acquire = {
      contents = {count = 0, size = 8, item_bitmap = 0x7fdc9c31f000, buckets = 0x7fdc9c31f008}}, serialise_buffer = 0x7fdc8ab3a0c0, 
    serialise_size = 0, serialise = {contents = {count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, 
    serialise_alloc = 0x556701abd580 <serialise_$36$2_apply_oZo>, serialise_alloc_final = 0x556701978b60 <serialise_$36$3_apply_oZo>, 
    serialise_throw = 0x5567019c2600 <serialise_$36$4_apply_o>}, block_count = 0, ack_token = 0, ack_count = 0, mute_mapping = {
    contents = {count = 0, size = 8, item_bitmap = 0x7fdc7ab88200, buckets = 0x7fdc7ab88208}}, q = {head = 0x7fdc5ba81d00, tail = {{
        object = 0x7fdc5ba81d00, counter = 929}, raw = 0x00000000000003a100007fdc5ba81d00}}, mq = {head = 0x7fdc5b323c80, 
    tail = 0x7fdc5b323c80}}
(gdb) thread 8
[Switching to thread 8 (Thread 0x7fdc8b39e700 (LWP 32513))]
#0  0x00007fdcb70d639d in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007fdcb70d639d in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
#1  0x0000556701ad5fee in ponyint_thread_suspend (signal=0x7fdcb639e540) at src/libponyrt/platform/threads.c:262
#2  0x0000556701ac89a1 in steal (sched=0x7fdcb639f0c0) at src/libponyrt/sched/scheduler.c:497
#3  0x0000556701ac8b2c in run (sched=0x7fdcb639f0c0) at src/libponyrt/sched/scheduler.c:565
#4  0x0000556701ac8c48 in run_thread (arg=0x7fdcb639f0c0) at src/libponyrt/sched/scheduler.c:630
#5  0x00007fdcb70d008c in start_thread () from /usr/lib/libpthread.so.0
#6  0x00007fdcb66a0e1f in clone () from /usr/lib/libc.so.6
(gdb) p *this_scheduler 
$10 = {tid = 140585205360384, index = 5, cpu = 5, node = 0, terminate = false, asio_stopped = false, asio_noisy = false, 
  sleep_object = 0x7fdcb639e540, last_victim = 0x7fdcb639e800, ctx = {scheduler = 0x7fdcb639f0c0, current = 0x7fdcb639fa00, 
    trace_object = 0x556701acd08c <ponyint_gc_recvobject>, trace_actor = 0x556701acd3cc <ponyint_gc_recvactor>, stack = 0x0, acquire = {
      contents = {count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, serialise_buffer = 0x0, serialise_size = 0, serialise = {
      contents = {count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, serialise_alloc = 0x0, serialise_alloc_final = 0x0, 
    serialise_throw = 0x0}, block_count = 0, ack_token = 0, ack_count = 0, mute_mapping = {contents = {count = 0, size = 0, 
      item_bitmap = 0x0, buckets = 0x0}}, q = {head = 0x7fdc9c3856e0, tail = {{object = 0x7fdc9c3856e0, counter = 81}, 
      raw = 0x000000000000005100007fdc9c3856e0}}, mq = {head = 0x7fdc5b323d00, tail = 0x7fdcb639fd40}}
(gdb) printf "asio_cpu=%d scheduler_count=%d min_scheduler_count=%d active_scheduler_count=%d scheduler_count_changing=%d detect_quiescence=%d use_yield=%d\n", asio_cpu,scheduler_count,min_scheduler_count,active_scheduler_count,scheduler_count_changing,detect_quiescence,use_yield
asio_cpu=-1 scheduler_count=6 min_scheduler_count=0 active_scheduler_count=6 scheduler_count_changing=0 detect_quiescence=1 use_yield=1

SeanTAllen · 2018-01-13T17:36:41Z

@dipinhora if you rebase this against master, the next time around, your 3.9.1 tests will run in CircleCI and you won't have any LLVM download issues.

dipinhora · 2018-01-14T22:38:56Z

@winksaville I've rebased on master and updated the logic for both the pthreads and signals implementations to suspend the threads in more scenarios. I haven't gotten a chance to test things fully yet (mainly for the hanging issue; it seems to be working correctly) and might not be able to for a few days. If you have time and are interested, please test out this latest code for the hanging issue.

winksaville · 2018-01-15T00:54:39Z

Will do some tests!

winksaville · 2018-01-16T15:59:59Z

Update, I've run stdlib using release pthreads for 2+ hours with no issues. I'm running my pony-ring now and so far no issues having run about 30min, will let it run for 2+ hours. Then I'll test stdlib and pony-ring again using signals.

winksaville · 2018-01-17T01:13:38Z

Tested stdlib and pony-ring with signals each for 2+ hours and no hangs!!

SeanTAllen · 2018-01-17T02:24:08Z

AWESOME!

Change dynamic scheduler scaling implementation in order to resolve the hangs encountered in ponylang#2451. The previous implementation assumed that signalling to wake a thread was a reliable operation. Apparently, that's not necessarily true (see https://en.wikipedia.org/wiki/Spurious_wakeup and https://askldjd.com/2010/04/24/the-lost-wakeup-problem/). Seeing as we couldn't find any other explanation for why the previous implementation was experiencing hangs, I've assumed it is either because of lost wake ups or spurious wake ups and redesigned the logic accordingly. Now, when a thread is about to suspend, it will decrement the `active_scheduler_count` and then suspend. When it wakes up, it will check to see if the `active_scheduler_count` is at least as big as its `index`. If the `active_scheduler_count` isn't big enough, the thread will suspend itself again immediately. If it is big enough, it will resume. Threads no longer modify `active_scheduler_count` when they wake up. `active_scheduler_count` must now be modified by the thread that is waking up another thread prior to sending the wake up notification. Additionally, since we're now assuming that wake up signals can be lost, we now send multiple wake up notifications just in case. While this is somewhat wasteful, it is better than being in a situation where some threads aren't woken up at all (i.e. a hang). Additionally, only use `scheduler_count_changing` for `signals` implementation of dynamic scheduler scaling. `pthreads` implementation now uses a mutex (`sched_mut`) in its place. We also now change logic to only unlock mutex in `pthreads` implementation once threads have been woken to avoid potential lost wake ups. This isn't an issue for the `signals` implementation and the unlocking of `scheduler_count_changing` can remain where it is prior to threads being woken up. This commit also splits out scheduler block/unblock message handling logic into their own functions (this is so that sched 0 can call those functions directly instead of sending messages to itself). This commit also includes a change inspired by ponylang#2474. Now, *all* scheduler threads can suspend as long as there is at least one noisy actor registered with the ASIO subsystem. If there are no noisy actors registered with the ASIO subsystem then scheduler 0 is not allowed to suspend itself because it is reponsible for quiescence detection. Lastly, this commit adds logic to allow a scheduler thread to suspend even if it has already sent a scheduler block message so that we can now suspend scheduler threads in most scenarios.

dipinhora · 2018-01-19T20:04:57Z

@winksaville That's great! Thanks for testing all the variations. It is much appreciated.

I've rebased onto latest master and squashed the commits with a more complete commit message describing the changes.

Assuming CI passes and there aren't any additional issues or concerns, this is likely ready to merge.

dipinhora added the do not merge This PR should not be merged at this time label Jan 10, 2018

SeanTAllen mentioned this pull request Jan 13, 2018

Huge number of unix-systemcalls under macOS #2474

Closed

dipinhora force-pushed the fix_dynamic_scheduler_scaling branch from 9d2150e to e55c1a1 Compare January 14, 2018 22:30

dipinhora force-pushed the fix_dynamic_scheduler_scaling branch from e55c1a1 to ec6503f Compare January 19, 2018 20:00

dipinhora removed the do not merge This PR should not be merged at this time label Jan 19, 2018

winksaville approved these changes Jan 19, 2018

View reviewed changes

SeanTAllen added the changelog - changed Automatically add "Changed" CHANGELOG entry on merge label Jan 19, 2018

SeanTAllen merged commit fc80968 into ponylang:master Jan 20, 2018

ponylang-main added a commit that referenced this pull request Jan 20, 2018

Update CHANGELOG for PR #2483 [skip ci]

4cc65d6

dipinhora mentioned this pull request Jan 24, 2018

stdlib in a loop hangs #2451

Closed

dipinhora pushed a commit to dipinhora/ponyc that referenced this pull request Jun 5, 2018

Update CHANGELOG for PR ponylang#2483 [skip ci]

468818b

SeanTAllen mentioned this pull request Jun 28, 2018

Parallel twitter hashtags example fails consistently on Wallaro:master WallarooLabs/wally#2225

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix and re-enable dynamic scheduler scaling #2483

Fix and re-enable dynamic scheduler scaling #2483

dipinhora commented Jan 10, 2018

dipinhora commented Jan 10, 2018

dipinhora commented Jan 10, 2018

dipinhora commented Jan 10, 2018

SeanTAllen commented Jan 11, 2018

winksaville commented Jan 11, 2018

dipinhora commented Jan 11, 2018

winksaville commented Jan 11, 2018

dipinhora commented Jan 11, 2018

winksaville commented Jan 11, 2018

dipinhora commented Jan 11, 2018

winksaville commented Jan 11, 2018

winksaville commented Jan 11, 2018

SeanTAllen commented Jan 13, 2018

dipinhora commented Jan 14, 2018

winksaville commented Jan 15, 2018

winksaville commented Jan 16, 2018

winksaville commented Jan 17, 2018 •

edited

Loading

SeanTAllen commented Jan 17, 2018

dipinhora commented Jan 19, 2018

Fix and re-enable dynamic scheduler scaling #2483

Fix and re-enable dynamic scheduler scaling #2483

Conversation

dipinhora commented Jan 10, 2018

dipinhora commented Jan 10, 2018

dipinhora commented Jan 10, 2018

dipinhora commented Jan 10, 2018

SeanTAllen commented Jan 11, 2018

winksaville commented Jan 11, 2018

dipinhora commented Jan 11, 2018

winksaville commented Jan 11, 2018

dipinhora commented Jan 11, 2018

winksaville commented Jan 11, 2018

dipinhora commented Jan 11, 2018

winksaville commented Jan 11, 2018

winksaville commented Jan 11, 2018

SeanTAllen commented Jan 13, 2018

dipinhora commented Jan 14, 2018

winksaville commented Jan 15, 2018

winksaville commented Jan 16, 2018

winksaville commented Jan 17, 2018 • edited Loading

SeanTAllen commented Jan 17, 2018

dipinhora commented Jan 19, 2018

winksaville commented Jan 17, 2018 •

edited

Loading