-
Notifications
You must be signed in to change notification settings - Fork 62
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SIGSEGV in reply_append
#221
Comments
Any chance you could create a repro script using |
@byroot I don't think I will be able to, Sidekiq Pro is commercially licensed and there's multiple threads which seem to interact in some unknown way to trigger the bug. I'm trying to replicate the issue with ASAN, as we did in #208 |
I can give @byroot Sidekiq Pro access, just have your gemfile use a local :path by using “gem unpack”. |
@patrobinson any news? |
@byroot I managed to uncover the cause and write a minimal reproduction: require "bundler/inline"
gemfile do
source "https://rubygems.org"
gem "redis-client"
gem "hiredis-client"
end
CHANNEL = "test"
redis_config = RedisClient.config(host: "localhost", port: 6379)
redis = redis_config.new_client
redis_pub = redis_config.new_client
pubsub = redis.pubsub
pubsub.call("subscribe", CHANNEL)
# This will be the subscribe event
event = pubsub.next_event
puts "event: #{event.inspect}"
# Read and client time out. This leaves hiredis in a state where:
# connection->context->reader->ridx == 0 (not -1, so there's 1 entry in the task stack), and
# connection->context->reader->task[0]->privdata points to hiredis_read_internal's stack-allocated
# reader_state after return
event = pubsub.next_event
puts "event: #{event.inspect}"
# Publish an event
redis_pub.call("publish", CHANNEL, "hello")
# Invoke RedisClient::HiredisConnection#read enough times to get it JITed. This causes
# the stack to have a different layout when it's called, so the next call to
# next_event/hiredis_read_internal won't end up with identical stack addresses and
# work by accident
RubyVM::YJIT.enable
30.times do
redis_pub.call("ping")
end
# Crash in reply_append for task[0], as privdata is pointing at invalid memory
event = pubsub.next_event
puts "event: #{event.inspect}" This crashes reliably for me on:
|
Oh thank you so much. I can confirm it repro on my machine too. Unfortunately I have some unpleasant personal stuff to deal with today, but I'll dig into this as soon as I have time, either tomorrow or Wednesday. |
That repro is really excellent, but I'm strugling to understand what's going on. You comment mention the pointer to the stack allocated |
Fix: #221 Avoid that pointer becoming invalid when YJIT kicks in.
Taking the comments in the repro at face value, ASAN should be able to catch this type of things readily. YJIT seems to be used just to scrub the stack, and on paper you should be able to write a repro that doesn't use YJIT (maybe through something like |
Interestingly the bug disapear if I compile with |
It's not that the pointer to the stack-allocated During a client timeout, hiredis will still copy the
...and that task[0]->privdata value is used when processing replies in subsequent calls to next_event .
Surprisingly we get away with this most of the time, because subsequent calls to YJIT (through no fault of its own) triggers the crash for us in production, because it changes the stack address of You can see the address changing in the repro:
I don't think this affects non-pubsub use cases, because the connection would normally be discarded after a client timeout? |
I would have thought so too. I'll have a go at investigating why it isn't.
👍 removing the YJIT code and doing this instead also segfaults:
|
That's the part I can't seem to see in the code. |
redis-client/hiredis-client/ext/redis_client/hiredis/hiredis_connection.c Lines 728 to 732 in 2040c15
But during the call to redis-client/hiredis-client/ext/redis_client/hiredis/vendor/read.c Lines 704 to 713 in 2040c15
During the following call to
And that crashes in redis-client/hiredis-client/ext/redis_client/hiredis/hiredis_connection.c Lines 141 to 143 in 2040c15
|
Ah! Thank you that's the part I was missing. |
Fix: #221 Avoid that pointer becoming invalid when YJIT kicks in. Co-Authored-By: Rian McGuire <[email protected]>
Fix: #221 Avoid that pointer becoming invalid when YJIT kicks in. Co-Authored-By: Rian McGuire <[email protected]>
Thanks for the really excellent repro, I turned it into a unit test and released 0.23.1 with the fix. |
We've experienced numerous segfaults in production that point to this specific line of code
We can reliably reproduce this by simply triggering sidekiq to pause and unpause a queue, which causes it to receive a message from a pubsub channel.
This happens on a handful of the dozens of containers we run.
I was able to get a coredump from one of the containers and here's the backtrace:
Somehow
task_index
is a null pointer, which shouldn't be possible given the code path?The text was updated successfully, but these errors were encountered: