Segfault on Ruby 3.3 with `RUBY_MN_THREADS=1` #33

jpcamara · 2023-11-10T02:39:04Z

Probably not a huge surprise, but vernier segfaults if run with MaNy threads turned on. I've compiled Ruby source locally and the following test.rb can trigger it:

require "net/http"
require "vernier"

uri = "https://www.google.com"

Vernier.trace(out: "../time_profile.json") do
  5.times.map do
    Thread.new { Net::HTTP.get(URI(uri)) }
  end.map(&:join)
end

Without RUBY_MN_THREADS=1, it runs fine. With it, I just get a generic Segmentation fault. Run with gdb-ruby I get:

Thread 4 "test.rb:9" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xffff7219f120 (LWP 29637)]
0x0000aaaaea8f45c4 in thread_profile_frames (ec=0x0, start=0, limit=2048, buff=0xffff929a8850, lines=0xffff929ac850)
    at ../vm_backtrace.c:1591
1591	    const rb_control_frame_t *cfp = ec->cfp, *end_cfp = RUBY_VM_END_CONTROL_FRAME(ec);

Happy to provide more info and steps if there's interest. But I can also understand if there's no interest in supporting such a nascent feature.

The text was updated successfully, but these errors were encountered:

casperisfine · 2023-11-10T07:38:30Z

Could you provide the full backtrace? There is a known incompatibility with MaNy that is Vernier and other GVL instrumentation consumer have no way to know what the Ruby thread is. ruby/ruby#8885 solves that but isn't merged yet.

jpcamara · 2023-11-10T09:30:07Z

Here's the full backtrace:

Thread 4 "test.rb:8" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xffff6360f120 (LWP 109765)]
0x0000aaaae08148e8 in thread_profile_frames (ec=0x0, start=0, limit=2048, buff=0xffff83e18850, lines=0xffff83e1c850) at ../vm_backtrace.c:1591
1591	    const rb_control_frame_t *cfp = ec->cfp, *end_cfp = RUBY_VM_END_CONTROL_FRAME(ec);
(gdb) backtrace
#0  0x0000aaaae08148e8 in thread_profile_frames (ec=0x0, start=0, limit=2048, buff=0xffff83e18850, lines=0xffff83e1c850) at ../vm_backtrace.c:1591
#1  0x0000aaaae0814b4c in rb_profile_frames (start=0, limit=2048, buff=0xffff83e18850, lines=0xffff83e1c850) at ../vm_backtrace.c:1656
#2  0x0000ffff8415ef2c in RawSample::sample (this=this@entry=0xffff83e18850) at vernier.cc:313
#3  0x0000ffff8415ef54 in LiveSample::sample_current_thread (this=0xffff83e18850) at vernier.cc:355
#4  GlobalSignalHandler::signal_handler (sig=<optimized out>, sinfo=<optimized out>, ucontext=<optimized out>) at vernier.cc:1161
#5  <signal handler called>
#6  0x0000ffff9ebaff9c in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0xaaab02177c24)
    at ./nptl/futex-internal.c:57
#7  __futex_abstimed_wait_common (cancel=true, private=0, abstime=0x0, clockid=0, expected=0, futex_word=0xaaab02177c24) at ./nptl/futex-internal.c:87
#8  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0xaaab02177c24, expected=expected@entry=0, clockid=clockid@entry=0,
    abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
#9  0x0000ffff9ebb2ab0 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0xaaab02177bb8, cond=0xaaab02177bf8) at ./nptl/pthread_cond_wait.c:503
#10 ___pthread_cond_wait (cond=0xaaab02177bf8, mutex=0xaaab02177bb8) at ./nptl/pthread_cond_wait.c:627
#11 0x0000aaaae078e5a4 in rb_native_cond_wait (cond=0xaaab02177bf8, mutex=0xaaab02177bb8) at ../thread_pthread.c:214
#12 0x0000aaaae079012c in ractor_sched_deq (vm=0xaaab02177b10, cr=0x0) at ../thread_pthread.c:1230
#13 0x0000aaaae0791884 in nt_start (ptr=0xaaab024885a0) at ../thread_pthread.c:2209
#14 0x0000ffff9ebb37d0 in start_thread (arg=0x0) at ./nptl/pthread_create.c:444
#15 0x0000ffff9ec1f35c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:79

jhawthorn · 2023-11-13T22:35:37Z

@jpcamara Thanks for the report.

I haven't been able to reproduce (on ruby/ruby@e8ab3f7). What version of Ruby 3.3 are you using.

jhawthorn@spire:~/src/vernier arm64 (main ) [ruby 3.3.0]
$ RUBY_MN_THREADS=1 bundle exec ruby examples/threaded_http_requests.rb
http://example.com 200
https://www.johnhawthorn.com 200
https://tenderlovemaking.com/ 200

jpcamara · 2023-11-14T03:12:02Z

@jhawthorn dang, thanks for giving it a try! Unfortunate it wasn't an easy repro for you!

I was able to repro again using the latest commit as of this message (ruby/ruby@fabf5be). I brought in your example as well from https://github.com/jhawthorn/vernier/blob/main/examples/threaded_http_requests.rb and hit the same error.

Here's some context on my setup:

M1 Macbook
Code is executed in an ubuntu:23.10 docker container since MaNy only works on Linux, and is ignored on other platforms
Running directly from the ruby source folder, and the command i'm running is RUBY_MN_THREADS=1 GEM_PATH=$(./ruby -e 'puts Gem.dir') make gdb-ruby and executing a test.rb containing your threaded_http_requests.rb code
Running outside of gdb it's the same, just with less context (a generic Segmentation Fault message) - RUBY_MN_THREADS=1 GEM_PATH=$(./ruby -e 'puts Gem.dir') ./ruby ../test.rb

Here's the backtrace from the threaded_http_requests.rb run:

RUBY_MN_THREADS=1 GEM_PATH=$(./ruby -e 'puts Gem.dir') make gdb-ruby
# or
RUBY_MN_THREADS=1 GEM_PATH=$(./ruby -e 'puts Gem.dir') ./ruby ../test.rb

Thread 4 "test.rb:23" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xffff9d71f120 (LWP 79945)]
0x0000aaaae9a77dc4 in thread_profile_frames (ec=0x0, start=0, limit=2048, buff=0xffff9e6d8850, lines=0xffff9e6dc850) at ../vm_backtrace.c:1589
1589	    const rb_control_frame_t *cfp = ec->cfp, *end_cfp = RUBY_VM_END_CONTROL_FRAME(ec);
(gdb) backtrace
#0  0x0000aaaae9a77dc4 in thread_profile_frames (ec=0x0, start=0, limit=2048, buff=0xffff9e6d8850, lines=0xffff9e6dc850) at ../vm_backtrace.c:1589
#1  0x0000aaaae9a78028 in rb_profile_frames (start=0, limit=2048, buff=0xffff9e6d8850, lines=0xffff9e6dc850) at ../vm_backtrace.c:1654
#2  0x0000ffff9ea0ef2c in RawSample::sample (this=this@entry=0xffff9e6d8850) at vernier.cc:313
#3  0x0000ffff9ea0ef54 in LiveSample::sample_current_thread (this=0xffff9e6d8850) at vernier.cc:355
#4  GlobalSignalHandler::signal_handler (sig=<optimized out>, sinfo=<optimized out>, ucontext=<optimized out>) at vernier.cc:1161
#5  <signal handler called>
#6  0x0000ffffb92fff9c in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0xaaaaf063ec24) at ./nptl/futex-internal.c:57
#7  __futex_abstimed_wait_common (cancel=true, private=0, abstime=0x0, clockid=0, expected=0, futex_word=0xaaaaf063ec24) at ./nptl/futex-internal.c:87
#8  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0xaaaaf063ec24, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0)
    at ./nptl/futex-internal.c:139
#9  0x0000ffffb9302ab0 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0xaaaaf063ebb8, cond=0xaaaaf063ebf8) at ./nptl/pthread_cond_wait.c:503
#10 ___pthread_cond_wait (cond=0xaaaaf063ebf8, mutex=0xaaaaf063ebb8) at ./nptl/pthread_cond_wait.c:627
#11 0x0000aaaae99f0860 in rb_native_cond_wait (cond=0xaaaaf063ebf8, mutex=0xaaaaf063ebb8) at ../thread_pthread.c:249
#12 0x0000aaaae99f2424 in ractor_sched_deq (vm=0xaaaaf063eb10, cr=0x0) at ../thread_pthread.c:1270
#13 0x0000aaaae99f3b68 in nt_start (ptr=0xaaaaf0c39050) at ../thread_pthread.c:2240
#14 0x0000ffffb93037d0 in start_thread (arg=0x0) at ./nptl/pthread_create.c:444
#15 0x0000ffffb936f35c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:79

Thread.new do # <= test.rb:23
  threads.each(&:join)
  received.close
end

jhawthorn · 2023-11-23T08:14:32Z

Thanks! I've been able to reproduce and think I'm approaching a solution. The segfault occurs because we're trying to sample a M-N native thread which isn't running any code, and at that time Ruby sets EC to NULL, which is the same issue stackprof is seeing in tmm1/stackprof#221.

Vernier should be able to deal with this, and we shouldn't be sampling a thread which is suspended. However I think the new M-N thread implementation isn't feeding us the same GVL instrumentation events which traditional threads do, which causes us to mistake a suspended thread as running and attempt to sample it from a native thread which actually has no thread running.

Traditional thread

Correctly recorded as spending most of its time suspended

M-N thread

Incorrectly recorded as "running" most of the time (all threads have this behaviour in this profile). This was recorded by adding a failsafe upstream to just record nothing if we try to sample a thread with no EC.

casperisfine · 2023-11-23T08:49:55Z

Incorrectly recorded as "running" most of the time

So the GVL instrumentation API is broken with M-N threads?

jpcamara · 2023-11-23T12:39:18Z

Incorrectly recorded as "running" most of the time

So the GVL instrumentation API is broken with M-N threads?

I've seen this as well, yea. It still fires all the correct events, but in the wrong order. I was just about to open an issue around it - started always occurs after ready and resumed, and exited fires before suspended. There are probably others but those are the simplest to reproduce.

It also happens even when not running MN threads - the order was broken by refactoring the overall pthread code.

casperisfine · 2023-11-23T13:50:53Z

Aaron mentioned something like that. Happy to dig into it and try to fix it if you have some repros.

jpcamara · 2023-11-23T14:23:27Z

Aaron mentioned something like that. Happy to dig into it and try to fix it if you have some repros.

Sounds great - I'll provide some today

jpcamara · 2023-11-23T15:51:53Z

@casperisfine I'll open a Ruby issue as well, but here's a draft PR with a spec i'd been working on to verify the order: ruby/ruby#9019

casperisfine · 2023-11-23T15:58:47Z

here's a draft PR with a spec i'd been working on to verify the order: ruby/ruby#9019

Nice! I'll try to make it pass tomorrow (Well except MaNy cause I don't know much about it). Thank you very much.

casperisfine · 2023-11-24T15:51:49Z

ruby/ruby#9029 should fix this (not sure about M:N though).

jhawthorn · 2023-12-09T22:58:16Z

Fixed by ruby/ruby#9073 and I added a CI build in #49

jhawthorn added bug Something isn't working C Likely requires C/C++ changes labels Nov 12, 2023

casperisfine mentioned this issue Nov 24, 2023

Refactor and fix the GVL instrumentation API ruby/ruby#9029

Merged

jhawthorn closed this as completed Dec 9, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Segfault on Ruby 3.3 with `RUBY_MN_THREADS=1` #33

Segfault on Ruby 3.3 with `RUBY_MN_THREADS=1` #33

jpcamara commented Nov 10, 2023 •

edited

Loading

casperisfine commented Nov 10, 2023

jpcamara commented Nov 10, 2023

jhawthorn commented Nov 13, 2023

jpcamara commented Nov 14, 2023 •

edited

Loading

jhawthorn commented Nov 23, 2023 •

edited

Loading

casperisfine commented Nov 23, 2023

jpcamara commented Nov 23, 2023 •

edited

Loading

casperisfine commented Nov 23, 2023

jpcamara commented Nov 23, 2023

jpcamara commented Nov 23, 2023

casperisfine commented Nov 23, 2023

casperisfine commented Nov 24, 2023

jhawthorn commented Dec 9, 2023

Segfault on Ruby 3.3 with RUBY_MN_THREADS=1 #33

Segfault on Ruby 3.3 with RUBY_MN_THREADS=1 #33

Comments

jpcamara commented Nov 10, 2023 • edited Loading

casperisfine commented Nov 10, 2023

jpcamara commented Nov 10, 2023

jhawthorn commented Nov 13, 2023

jpcamara commented Nov 14, 2023 • edited Loading

jhawthorn commented Nov 23, 2023 • edited Loading

casperisfine commented Nov 23, 2023

jpcamara commented Nov 23, 2023 • edited Loading

casperisfine commented Nov 23, 2023

jpcamara commented Nov 23, 2023

jpcamara commented Nov 23, 2023

casperisfine commented Nov 23, 2023

casperisfine commented Nov 24, 2023

jhawthorn commented Dec 9, 2023

Segfault on Ruby 3.3 with `RUBY_MN_THREADS=1` #33

Segfault on Ruby 3.3 with `RUBY_MN_THREADS=1` #33

jpcamara commented Nov 10, 2023 •

edited

Loading

jpcamara commented Nov 14, 2023 •

edited

Loading

jhawthorn commented Nov 23, 2023 •

edited

Loading

jpcamara commented Nov 23, 2023 •

edited

Loading