Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SIGSEGV in reply_append #221

Closed
patrobinson opened this issue Dec 6, 2024 · 16 comments · Fixed by #224
Closed

SIGSEGV in reply_append #221

patrobinson opened this issue Dec 6, 2024 · 16 comments · Fixed by #224

Comments

@patrobinson
Copy link

patrobinson commented Dec 6, 2024

We've experienced numerous segfaults in production that point to this specific line of code
We can reliably reproduce this by simply triggering sidekiq to pause and unpause a queue, which causes it to receive a message from a pubsub channel.

queue = Sidekiq::Queue.new(ApplicationWorker::Queue::QUEUE_NAME)
queue.pause!
queue.unpause!

This happens on a handful of the dozens of containers we run.

I was able to get a coredump from one of the containers and here's the backtrace:

(gdb) bt full
#0  0x00007f8c6d51cebc in ?? () from buildkite-hiredis-segfault/eloquent_ride/lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#1  0x00007f8c6d4cdfb2 in raise () from buildkite-hiredis-segfault/eloquent_ride/lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#2  0x00007f8c6da9c8bf in ruby_default_signal (sig=<optimized out>) at signal.c:422
No locals.
#3  0x00007f8c6d8864b8 in rb_bug_for_fatal_signal (default_sighandler=0x0, sig=sig@entry=11, ctx=ctx@entry=0x7f8c11fd9880, fmt=fmt@entry=0x7f8c6dcac4d5 "Segmentation fault at %p") at error.c:1069
        file = <optimized out>
        line = 92
#4  0x00007f8c6da9b84b in sigsegv (sig=11, info=0x7f8c11fd99b0, ctx=0x7f8c11fd9880) at signal.c:926
No locals.
#5  <signal handler called>
No symbol table info available.
#6  0x00007f8c48a792bd in reply_append (value=<REDACTED>, task=0x7f8c12223690) at hiredis_connection.c:143
        state = 0x7f8c0fb5d300
        task_index = <optimized out>
        state = <optimized out>
        task_index = <optimized out>
        parent = <optimized out>
        key = <optimized out>
        rb_gc_guarded_ptr = <optimized out>
        rb_gc_guarded_ptr = <optimized out>
        rb_gc_guarded_ptr = <optimized out>
#7  reply_create_array (task=0x7f8c12223690, elements=<optimized out>) at hiredis_connection.c:212
        value = <REDACTED>
(gdb) p *((hiredis_reader_state_t *)(0x7f8c0fb5d300))
$1 = {stack = <REDACTED>, task_index = 0x0}

Somehow task_index is a null pointer, which shouldn't be possible given the code path?

@byroot
Copy link
Member

byroot commented Dec 6, 2024

We can reliably reproduce this

Any chance you could create a repro script using bundler/inline, or even a Dockerfile? That would be very useful for me to investigate.

@patrobinson
Copy link
Author

@byroot I don't think I will be able to, Sidekiq Pro is commercially licensed and there's multiple threads which seem to interact in some unknown way to trigger the bug.

I'm trying to replicate the issue with ASAN, as we did in #208

@mperham
Copy link

mperham commented Dec 9, 2024

I can give @byroot Sidekiq Pro access, just have your gemfile use a local :path by using “gem unpack”.

@byroot
Copy link
Member

byroot commented Dec 12, 2024

@patrobinson any news?

@rianmcguire
Copy link
Contributor

@byroot I managed to uncover the cause and write a minimal reproduction:

require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem "redis-client"
  gem "hiredis-client"
end

CHANNEL = "test"

redis_config = RedisClient.config(host: "localhost", port: 6379)
redis = redis_config.new_client
redis_pub = redis_config.new_client

pubsub = redis.pubsub
pubsub.call("subscribe", CHANNEL)

# This will be the subscribe event
event = pubsub.next_event
puts "event: #{event.inspect}"

# Read and client time out. This leaves hiredis in a state where:
# connection->context->reader->ridx == 0 (not -1, so there's 1 entry in the task stack), and
# connection->context->reader->task[0]->privdata points to hiredis_read_internal's stack-allocated 
# reader_state after return
event = pubsub.next_event
puts "event: #{event.inspect}"

# Publish an event
redis_pub.call("publish", CHANNEL, "hello")

# Invoke RedisClient::HiredisConnection#read enough times to get it JITed. This causes
# the stack to have a different layout when it's called, so the next call to
# next_event/hiredis_read_internal won't end up with identical stack addresses and
# work by accident
RubyVM::YJIT.enable
30.times do
    redis_pub.call("ping")
end

# Crash in reply_append for task[0], as privdata is pointing at invalid memory
event = pubsub.next_event
puts "event: #{event.inspect}"

This crashes reliably for me on:

  • ruby 3.3.6 (2024-11-05 revision 75015d4c1f) [x86_64-linux]
  • ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [x86_64-linux]
  • ruby 3.3.6 (2024-11-05 revision 75015d4c1f) [arm64-darwin23]

@byroot
Copy link
Member

byroot commented Jan 6, 2025

Oh thank you so much. I can confirm it repro on my machine too.

Unfortunately I have some unpleasant personal stuff to deal with today, but I'll dig into this as soon as I have time, either tomorrow or Wednesday.

@byroot
Copy link
Member

byroot commented Jan 8, 2025

That repro is really excellent, but I'm strugling to understand what's going on.

You comment mention the pointer to the stack allocated hiredis_reader_state_t struct being invalid after the code is JITed, but it's updated from hiredis_read_internal, so I don't understand how that could happen. I'll keep digging though.

byroot added a commit that referenced this issue Jan 8, 2025
Fix: #221

Avoid that pointer becoming invalid when YJIT kicks in.
@byroot
Copy link
Member

byroot commented Jan 8, 2025

I have a fix here: #224

It's purely based on the explanation of the repro, and it prevents the crash, but I still don't get it, which worries me a bit.

@XrXr perhaps you understand what's going on? If so I'd love if you could enlighten me.

@XrXr
Copy link
Contributor

XrXr commented Jan 8, 2025

Taking the comments in the repro at face value, ASAN should be able to catch this type of things readily. YJIT seems to be used just to scrub the stack, and on paper you should be able to write a repro that doesn't use YJIT (maybe through something like 1.times.each{1.times.each{1.times.each{1.times.each{1.times.each{redis_pub.call("ping")}}}}}?)

@byroot
Copy link
Member

byroot commented Jan 10, 2025

ASAN should be able to catch this type of things readily.

Interestingly the bug disapear if I compile with -fsanitize=address -fno-omit-frame-pointer.

@rianmcguire
Copy link
Contributor

rianmcguire commented Jan 12, 2025

You comment mention the pointer to the stack allocated hiredis_reader_state_t struct being invalid after the code is JITed, but it's updated from hiredis_read_internal, so I don't understand how that could happen. I'll keep digging though.

It's not that the pointer to the stack-allocated hiredis_reader_state_t becomes invalid after JIT - a client timeout in next_event always leaves hiredis pointing at invalid stack memory.

During a client timeout, hiredis will still copy the reader->privdata pointer into reader->task[0]->privdata:

r->task[0]->privdata = r->privdata;

...and that task[0]->privdata value is used when processing replies in subsequent calls to next_event.

Surprisingly we get away with this most of the time, because subsequent calls to next_event are typically from the same C stack frame and end up putting the new reader_state in the same stack memory location on the next call.

YJIT (through no fault of its own) triggers the crash for us in production, because it changes the stack address of reader_state.

You can see the address changing in the repro:

(gdb) break hiredis_read_internal
Function "hiredis_read_internal" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (hiredis_read_internal) pending.
(gdb) commands
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>silent
>p &reader_state
>continue
>end
(gdb) r
Starting program: /home/rian/.asdf/installs/ruby/3.4.1/bin/ruby hiredis_pubsub_crash.rb
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
[New Thread 0x7fffddbff6c0 (LWP 5981)]
$1 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$2 = (hiredis_reader_state_t *) 0x7fffffffd1c0
event: ["subscribe", "test", 1]
$3 = (hiredis_reader_state_t *) 0x7fffffffd1c0
event: nil
$4 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$5 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$6 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$7 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$8 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$9 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$10 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$11 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$12 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$13 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$14 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$15 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$16 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$17 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$18 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$19 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$20 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$21 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$22 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$23 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$24 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$25 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$26 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$27 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$28 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$29 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$30 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$31 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$32 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$33 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$34 = (hiredis_reader_state_t *) 0x7fffffffd1c0
$35 = (hiredis_reader_state_t *) 0x7fffffffd270
$36 = (hiredis_reader_state_t *) 0x7fffffffd270

Thread 1 "ruby" received signal SIGSEGV, Segmentation fault.
0x00007fffdbe494ec in reply_append (task=0x555555f83840, value=140736882525640) at hiredis_connection.c:143
143	   int task_index = *state->task_index;

I don't think this affects non-pubsub use cases, because the connection would normally be discarded after a client timeout?

@rianmcguire
Copy link
Contributor

Taking the comments in the repro at face value, ASAN should be able to catch this type of things readily.

I would have thought so too. I'll have a go at investigating why it isn't.

YJIT seems to be used just to scrub the stack, and on paper you should be able to write a repro that doesn't use YJIT (maybe through something like 1.times.each{1.times.each{1.times.each{1.times.each{1.times.each{redis_pub.call("ping")}}}}}?)

👍 removing the YJIT code and doing this instead also segfaults:

# Crash in reply_append for task[0], as privdata is pointing at invalid memory
1.times.each do
    event = pubsub.next_event
    puts "event: #{event.inspect}"
end

@byroot
Copy link
Member

byroot commented Jan 12, 2025

...and that task[0]->privdata value is used when processing replies in subsequent calls to next_event.

That's the part I can't seem to see in the code.

@rianmcguire
Copy link
Contributor

...and that task[0]->privdata value is used when processing replies in subsequent calls to next_event.

That's the part I can't seem to see in the code.

hiredis_read_internal does update connection->context->reader->privdata with a valid pointer to the stack allocated hiredis_reader_state_t on every call:

hiredis_reader_state_t reader_state = {
.stack = stack,
.task_index = &connection->context->reader->ridx,
};
connection->context->reader->privdata = &reader_state;

But during the call to next_event that reads no data and hits the client timeout, hiredis also initializes reader->task[0], which has its own copy of that privdata pointer:

/* Set first item to process when the stack is empty. */
if (r->ridx == -1) {
r->task[0]->type = -1;
r->task[0]->elements = -1;
r->task[0]->idx = -1;
r->task[0]->obj = NULL;
r->task[0]->parent = NULL;
r->task[0]->privdata = r->privdata;
r->ridx = 0;
}

connection->context->reader (r in the code above) lives for the duration of the connection.

During the following call to next_event that has a reply to read, it sees that r->ridx is no longer -1 and doesn't re-initialize task[0] with the new privdata value - it still has the old invalid value. That task[0] is assigned type = REDIS_REPLY_PUSH and gets passed to the createArray callback function:

obj = r->fn->createArray(cur,elements);

And that crashes in reply_append when it tries to use the invalid privdata pointer:

static void *reply_append(const redisReadTask *task, VALUE value) {
hiredis_reader_state_t *state = (hiredis_reader_state_t *)task->privdata;
int task_index = *state->task_index;

@byroot
Copy link
Member

byroot commented Jan 12, 2025

Ah! Thank you that's the part I was missing.

byroot added a commit that referenced this issue Jan 12, 2025
Fix: #221

Avoid that pointer becoming invalid when YJIT kicks in.

Co-Authored-By: Rian McGuire <[email protected]>
byroot added a commit that referenced this issue Jan 12, 2025
Fix: #221

Avoid that pointer becoming invalid when YJIT kicks in.

Co-Authored-By: Rian McGuire <[email protected]>
@byroot
Copy link
Member

byroot commented Jan 12, 2025

Thanks for the really excellent repro, I turned it into a unit test and released 0.23.1 with the fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants