SIGSEGV in `reply_append` #221

patrobinson · 2024-12-06T01:18:15Z

We've experienced numerous segfaults in production that point to this specific line of code.
We can reliably reproduce this by triggering Sidekiq to pause and unpause a queue, which causes it to receive a message from a pubsub channel.

queue = Sidekiq::Queue.new(ApplicationWorker::Queue::QUEUE_NAME)
queue.pause!
queue.unpause!

This happens on a handful of the dozens of containers we run.

I was able to get a coredump from one of the containers and here's the backtrace:

(gdb) bt full
#0  0x00007f8c6d51cebc in ?? () from buildkite-hiredis-segfault/eloquent_ride/lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#1  0x00007f8c6d4cdfb2 in raise () from buildkite-hiredis-segfault/eloquent_ride/lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#2  0x00007f8c6da9c8bf in ruby_default_signal (sig=<optimized out>) at signal.c:422
No locals.
#3  0x00007f8c6d8864b8 in rb_bug_for_fatal_signal (default_sighandler=0x0, sig=sig@entry=11, ctx=ctx@entry=0x7f8c11fd9880, fmt=fmt@entry=0x7f8c6dcac4d5 "Segmentation fault at %p") at error.c:1069
        file = <optimized out>
        line = 92
#4  0x00007f8c6da9b84b in sigsegv (sig=11, info=0x7f8c11fd99b0, ctx=0x7f8c11fd9880) at signal.c:926
No locals.
#5  <signal handler called>
No symbol table info available.
#6  0x00007f8c48a792bd in reply_append (value=<REDACTED>, task=0x7f8c12223690) at hiredis_connection.c:143
        state = 0x7f8c0fb5d300
        task_index = <optimized out>
        state = <optimized out>
        task_index = <optimized out>
        parent = <optimized out>
        key = <optimized out>
        rb_gc_guarded_ptr = <optimized out>
        rb_gc_guarded_ptr = <optimized out>
        rb_gc_guarded_ptr = <optimized out>
#7  reply_create_array (task=0x7f8c12223690, elements=<optimized out>) at hiredis_connection.c:212
        value = <REDACTED>

(gdb) p *((hiredis_reader_state_t *)(0x7f8c0fb5d300))
$1 = {stack = <REDACTED>, task_index = 0x0}

Somehow task_index is a null pointer, which shouldn't be possible given the code path?

byroot · 2024-12-06T08:23:03Z

We can reliably reproduce this

Any chance you could create a repro script using bundler/inline, or even a Dockerfile? That would be very useful for me to investigate.

patrobinson · 2024-12-09T01:06:00Z

@byroot I don't think I will be able to: Sidekiq Pro is commercially licensed, and there are multiple threads that seem to interact in some unknown way to trigger the bug.

I'm trying to replicate the issue with ASAN, as we did in #208.

mperham · 2024-12-09T05:02:43Z

I can give @byroot Sidekiq Pro access; just have your Gemfile use a local :path by using “gem unpack”.

byroot · 2024-12-12T08:24:19Z

@patrobinson any news?

rianmcguire · 2025-01-06T05:22:18Z

@byroot I managed to uncover the cause and write a minimal reproduction:

require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem "redis-client"
  gem "hiredis-client"
end

CHANNEL = "test"

redis_config = RedisClient.config(host: "localhost", port: 6379)
redis = redis_config.new_client
redis_pub = redis_config.new_client

pubsub = redis.pubsub
pubsub.call("subscribe", CHANNEL)

# This will be the subscribe event
event = pubsub.next_event
puts "event: #{event.inspect}"

# Read with no data available, so the client times out. This leaves hiredis in a state where:
# connection->context->reader->ridx == 0 (not -1, so there's 1 entry in the task stack), and
# connection->context->reader->task[0]->privdata points to hiredis_read_internal's stack-allocated
# reader_state after return
event = pubsub.next_event
puts "event: #{event.inspect}"

# Publish an event
redis_pub.call("publish", CHANNEL, "hello")

# Invoke RedisClient::HiredisConnection#read enough times to get it JITed. This causes
# the stack to have a different layout when it's called, so the next call to
# next_event/hiredis_read_internal won't end up with identical stack addresses and
# work by accident
RubyVM::YJIT.enable
30.times do
  redis_pub.call("ping")
end

# Crash in reply_append for task[0], as privdata is pointing at invalid memory
event = pubsub.next_event
puts "event: #{event.inspect}"

This crashes reliably for me on:

ruby 3.3.6 (2024-11-05 revision 75015d4c1f) [x86_64-linux]
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [x86_64-linux]
ruby 3.3.6 (2024-11-05 revision 75015d4c1f) [arm64-darwin23]

byroot · 2025-01-06T09:58:59Z

Oh, thank you so much. I can confirm it reproduces on my machine too.

Unfortunately I have some unpleasant personal stuff to deal with today, but I'll dig into this as soon as I have time, either tomorrow or Wednesday.

byroot · 2025-01-08T11:38:04Z

That repro is really excellent, but I'm struggling to understand what's going on.

Your comment mentions the pointer to the stack-allocated hiredis_reader_state_t struct being invalid after the code is JITed, but it's updated from hiredis_read_internal, so I don't understand how that could happen. I'll keep digging though.

byroot · 2025-01-08T12:04:01Z

I have a fix here: #224

It's purely based on the explanation of the repro, and it prevents the crash, but I still don't get it, which worries me a bit.

@XrXr perhaps you understand what's going on? If so I'd love if you could enlighten me.

XrXr · 2025-01-08T18:15:15Z

Taking the comments in the repro at face value, ASAN should be able to catch this type of thing readily. YJIT seems to be used just to scrub the stack, and on paper you should be able to write a repro that doesn't use YJIT (maybe through something like 1.times.each{1.times.each{1.times.each{1.times.each{1.times.each{redis_pub.call("ping")}}}}}?)

byroot · 2025-01-10T06:51:22Z

ASAN should be able to catch this type of thing readily.

Interestingly, the bug disappears if I compile with -fsanitize=address -fno-omit-frame-pointer.

rianmcguire · 2025-01-12T06:45:22Z

Your comment mentions the pointer to the stack-allocated hiredis_reader_state_t struct being invalid after the code is JITed, but it's updated from hiredis_read_internal, so I don't understand how that could happen. I'll keep digging though.

It's not that the pointer to the stack-allocated hiredis_reader_state_t becomes invalid after JIT - a client timeout in next_event always leaves hiredis pointing at invalid stack memory.

During a client timeout, hiredis will still copy the reader->privdata pointer into reader->task[0]->privdata:

redis-client/hiredis-client/ext/redis_client/hiredis/vendor/read.c

Line 711 in 2040c15

r->task[0]->privdata = r->privdata;

...and that task[0]->privdata value is used when processing replies in subsequent calls to next_event.

Surprisingly, we get away with this most of the time, because subsequent calls to next_event are typically made from the same C stack frame and end up putting the new reader_state in the same stack memory location on the next call.

YJIT (through no fault of its own) triggers the crash for us in production, because it changes the stack address of reader_state.

You can see the address changing in the repro:

(gdb) break hiredis_read_internal
Function "hiredis_read_internal" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (hiredis_read_internal) pending.
(gdb) commands
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>silent
>p &reader_state
>continue
>end
(gdb) r
Starting program: /home/rian/.asdf/installs/ruby/3.4.1/bin/ruby hiredis_pubsub_crash.rb
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
[New Thread 0x7fffddbff6c0 (LWP 5981)]
$1 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$2 = (hiredis_reader_state_t *) 0x7fffffffd1c0
event: ["subscribe", "test", 1]
$3 = (hiredis_reader_state_t *) 0x7fffffffd1c0
event: nil
$4 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$5 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$6 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$7 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$8 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$9 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$10 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$11 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$12 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$13 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$14 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$15 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$16 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$17 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$18 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$19 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$20 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$21 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$22 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$23 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$24 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$25 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$26 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$27 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$28 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$29 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$30 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$31 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$32 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$33 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$34 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$35 = (hiredis_reader_state_t *) 0x7fffffffd270
$36 = (hiredis_reader_state_t *) 0x7fffffffd270

Thread 1 "ruby" received signal SIGSEGV, Segmentation fault.
0x00007fffdbe494ec in reply_append (task=0x555555f83840, value=140736882525640) at hiredis_connection.c:143
143	   int task_index = *state->task_index;

I don't think this affects non-pubsub use cases, because the connection would normally be discarded after a client timeout?
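A subscriber loop typically treats a timeout as "nothing yet" and reads again on the same connection, which is the sequence that exposes the stale pointer. A minimal sketch of that pattern, assuming `next_event` accepts a timeout and returns nil when it expires (consistent with the `event: nil` output above). `poll_events` and `FakePubSub` are hypothetical names used only so the example runs without a Redis server:

```ruby
# Hypothetical helper: polls a pubsub-like object and treats nil as "timed out".
def poll_events(pubsub, timeout:, max_polls:)
  events = []
  timeouts = 0
  max_polls.times do
    event = pubsub.next_event(timeout)
    if event.nil?
      timeouts += 1 # the connection is kept and read from again
    else
      events << event
    end
  end
  [events, timeouts]
end

# Stand-in for a real pubsub connection (an assumption, not redis-client's class).
class FakePubSub
  def initialize(script)
    @script = script.dup
  end

  # Returns the next scripted event, or nil to simulate a client timeout.
  def next_event(_timeout = nil)
    @script.shift
  end
end

fake = FakePubSub.new([["subscribe", "test", 1], nil, ["message", "test", "hello"]])
events, timeouts = poll_events(fake, timeout: 0.1, max_polls: 3)
puts "events: #{events.inspect}, timeouts: #{timeouts}"
```

The read after the timeout is the one that matters: it happens on a connection whose reader still holds state from the timed-out call.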

rianmcguire · 2025-01-12T07:11:04Z

Taking the comments in the repro at face value, ASAN should be able to catch this type of thing readily.

I would have thought so too. I'll have a go at investigating why it isn't.

YJIT seems to be used just to scrub the stack, and on paper you should be able to write a repro that doesn't use YJIT (maybe through something like 1.times.each{1.times.each{1.times.each{1.times.each{1.times.each{redis_pub.call("ping")}}}}}?)

👍 removing the YJIT code and doing this instead also segfaults:

# Crash in reply_append for task[0], as privdata is pointing at invalid memory
1.times.each do
  event = pubsub.next_event
  puts "event: #{event.inspect}"
end

byroot · 2025-01-12T09:39:46Z

...and that task[0]->privdata value is used when processing replies in subsequent calls to next_event.

That's the part I can't seem to see in the code.

rianmcguire · 2025-01-12T11:06:28Z

...and that task[0]->privdata value is used when processing replies in subsequent calls to next_event.

That's the part I can't seem to see in the code.

hiredis_read_internal does update connection->context->reader->privdata with a valid pointer to the stack allocated hiredis_reader_state_t on every call:

redis-client/hiredis-client/ext/redis_client/hiredis/hiredis_connection.c

Lines 728 to 732 in 2040c15

    
    hiredis_reader_state_t reader_state = {
        .stack = stack,
        .task_index = &connection->context->reader->ridx,
    };
    connection->context->reader->privdata = &reader_state;

But during the call to next_event that reads no data and hits the client timeout, hiredis also initializes reader->task[0], which has its own copy of that privdata pointer:

redis-client/hiredis-client/ext/redis_client/hiredis/vendor/read.c

Lines 704 to 713 in 2040c15

    
    /* Set first item to process when the stack is empty. */
    if (r->ridx == -1) {
        r->task[0]->type = -1;
        r->task[0]->elements = -1;
        r->task[0]->idx = -1;
        r->task[0]->obj = NULL;
        r->task[0]->parent = NULL;
        r->task[0]->privdata = r->privdata;
        r->ridx = 0;
    }

connection->context->reader (r in the code above) lives for the duration of the connection.

During the following call to next_event that has a reply to read, it sees that r->ridx is no longer -1 and doesn't re-initialize task[0] with the new privdata value - it still has the old invalid value. That task[0] is assigned type = REDIS_REPLY_PUSH and gets passed to the createArray callback function:

redis-client/hiredis-client/ext/redis_client/hiredis/vendor/read.c

Line 498 in 2040c15

obj = r->fn->createArray(cur,elements);

And that crashes in reply_append when it tries to use the invalid privdata pointer:

redis-client/hiredis-client/ext/redis_client/hiredis/hiredis_connection.c

Lines 141 to 143 in 2040c15

    
static void *reply_append(const redisReadTask *task, VALUE value) {
    hiredis_reader_state_t *state = (hiredis_reader_state_t *)task->privdata;
    int task_index = *state->task_index;
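
The sequence described above can be modelled in plain Ruby. This is only a toy simulation under stated assumptions: `ReaderState`, `Task`, `Reader`, and `read_internal` are hypothetical stand-ins for the C structs and functions quoted in this thread, and a `live?` flag stands in for "this stack memory is still valid", since Ruby cannot produce a real dangling pointer.

```ruby
# Stands in for the stack-allocated hiredis_reader_state_t.
class ReaderState
  attr_reader :stack

  def initialize
    @stack = []
    @live = true
  end

  def live?
    @live
  end

  # Models the C stack frame being popped when hiredis_read_internal returns.
  def invalidate!
    @live = false
  end
end

Task = Struct.new(:privdata)

# Stands in for the hiredis reader, which lives as long as the connection.
class Reader
  attr_accessor :privdata
  attr_reader :ridx, :task

  def initialize
    @ridx = -1
    @task = [Task.new(nil)]
    @privdata = nil
  end

  # data is nil when the read timed out before a reply arrived.
  def get_reply(data)
    # As in read.c: task[0] is only initialised when the task stack is empty.
    if @ridx == -1
      @task[0].privdata = @privdata
      @ridx = 0
    end
    return :timeout if data.nil?

    state = @task[0].privdata
    raise "use of dead stack memory (SIGSEGV in C)" unless state.live?

    state.stack << data
    @ridx = -1 # reply complete, so the task stack is empty again
    data
  end
end

# Stands in for hiredis_read_internal: fresh state per call, dead on return.
def read_internal(reader, data)
  state = ReaderState.new
  reader.privdata = state
  reader.get_reply(data)
ensure
  state.invalidate!
end

reader = Reader.new
first = read_internal(reader, ["subscribe", "test", 1]) # normal read
second = read_internal(reader, nil)                     # timeout, ridx stays 0
crash = begin
  read_internal(reader, ["message", "test", "hello"])   # uses the stale task[0]
rescue RuntimeError => e
  e.message
end
puts first.inspect, second.inspect, crash
```

In the model, the third call fails because `task[0].privdata` still refers to the state created during the timed-out call, even though `reader.privdata` was updated to a fresh one.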

byroot · 2025-01-12T11:07:53Z

Ah! Thank you that's the part I was missing.

byroot · 2025-01-12T11:30:22Z

Thanks for the really excellent repro, I turned it into a unit test and released 0.23.1 with the fix.

byroot added a commit that referenced this issue Jan 8, 2025

Allocate hiredis_reader_state_t inside hiredis_connection_t

686077d

Fix: #221 Avoid that pointer becoming invalid when YJIT kicks in.

byroot mentioned this issue Jan 8, 2025

Allocate hiredis_reader_state_t inside hiredis_connection_t #224

Merged

byroot added a commit that referenced this issue Jan 12, 2025

Allocate hiredis_reader_state_t inside hiredis_connection_t

95af2b3

Fix: #221 Avoid that pointer becoming invalid when YJIT kicks in. Co-Authored-By: Rian McGuire

byroot added a commit that referenced this issue Jan 12, 2025

Allocate hiredis_reader_state_t inside hiredis_connection_t

66c4a2c

Fix: #221 Avoid that pointer becoming invalid when YJIT kicks in. Co-Authored-By: Rian McGuire

byroot closed this as completed in #224 Jan 12, 2025
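
The commit title above, "Allocate hiredis_reader_state_t inside hiredis_connection_t", indicates that the state was moved into the connection struct, so it lives as long as the reader that points at it. A toy Ruby model of that idea follows. `LongLivedState`, `ModelReader`, and `ModelConnection` are hypothetical names, and the actual C patch may differ in its details (for example, how it handles the per-call `stack` field is not shown in this thread).

```ruby
# Stands in for a hiredis_reader_state_t owned by the connection.
class LongLivedState
  attr_reader :stack

  def initialize
    @stack = []
  end
end

# Stands in for the hiredis reader; same task[0] logic as read.c.
class ModelReader
  attr_accessor :privdata
  attr_reader :ridx, :task

  def initialize
    @ridx = -1
    @task = [Struct.new(:privdata).new(nil)]
    @privdata = nil
  end

  # data is nil when the read timed out before a reply arrived.
  def get_reply(data)
    if @ridx == -1
      @task[0].privdata = @privdata
      @ridx = 0
    end
    return :timeout if data.nil?

    @task[0].privdata.stack << data
    @ridx = -1
    data
  end
end

# Stands in for hiredis_connection_t: it owns both the reader and the state.
class ModelConnection
  attr_reader :reader, :reader_state

  def initialize
    @reader = ModelReader.new
    @reader_state = LongLivedState.new # same lifetime as the connection
    @reader.privdata = @reader_state
  end

  def read(data)
    @reader.get_reply(data)
  end
end

conn = ModelConnection.new
a = conn.read(["subscribe", "test", 1])
b = conn.read(nil) # timeout: task[0] keeps pointing at reader_state
c = conn.read(["message", "test", "hello"])
puts a.inspect, b.inspect, c.inspect
```

Because the state is never replaced between calls, the copy held in `task[0]` after a timeout still refers to a valid object on the next read.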
