-
Notifications
You must be signed in to change notification settings - Fork 15.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use RUBY_ASSERT when available #12216
Conversation
It should be present from Ruby 2.7 onwards. The advantage over a raw `assert` is that it will trigger MRI's crash report on failure, making it much easier to debug at a glance. For context I'm trying to debug an assertion failure we've been seeing on Ruby nigthly builds: ``` ruby: protobuf.c:404: ObjectCache_Add: Assertion `ObjectCache_Get(key) == val' failed. ```
8d38656
to
cb3b121
Compare
Looks like this PR triggers the same assertion failure you are referring to:
|
Oh, that's very interesting. I was suspecting it was because I recently added write barriers to On our CI it failed in a bout 3% of workers, so is suspect it's caused by GC triggering at a very specifc place, I may be able to repro with |
I was indeed able to reproduce on first try with: diff --git a/conformance/conformance_ruby.rb b/conformance/conformance_ruby.rb
index 0fa43391d..bf9c00d15 100755
--- a/conformance/conformance_ruby.rb
+++ b/conformance/conformance_ruby.rb
@@ -31,7 +31,10 @@
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
require 'conformance/conformance_pb'
+p [:here]
+GC.stress = true
require 'google/protobuf/test_messages_proto3_pb'
+GC.stress = false
require 'google/protobuf/test_messages_proto2_pb'
$test_count = 0
diff --git a/ruby/ext/google/protobuf_c/protobuf.c b/ruby/ext/google/protobuf_c/protobuf.c
index bee8ac288..bc318366a 100644
--- a/ruby/ext/google/protobuf_c/protobuf.c
+++ b/ruby/ext/google/protobuf_c/protobuf.c
@@ -403,6 +403,9 @@ void ObjectCache_Add(const void *key, VALUE val) {
#if USE_SECONDARY_MAP
rb_mutex_unlock(secondary_map_mutex);
#endif
+ if (ObjectCache_Get(key) != val) {
+ rb_bug("ObjectCache_Get(key) != val");
+ }
PBRUBY_ASSERT(ObjectCache_Get(key) == val);
} |
a732a44
to
e834ad3
Compare
I have a good repro loop, but I'm having a hard time figuring out why it's happening. Clearly either What is certain is that it happens frequently with GC.stress. This may be a bug in Ruby. I'll try to find some time to dig deeper with @peterzhu2118 |
I wonder if we might need an |
That was my initial thought, but no,
I tried nonetheless just to be sure, and it's still failing. At this stage I suspect it might be a bug in WeakMap, as it uses two I'll try to reproduce this in a standalone script. |
Seems like I can't repro, in isolation :/ |
I found an unrelated crash though -_- wm = ObjectSpace::WeakMap.new
GC.stress = true
GC.auto_compact = true
obj = Object.new
1000.times do |i|
wm[i] = obj # [BUG] Segmentation fault at 0x0000000000000003
raise "wat" unless wm[i].equal?(obj)
end |
Ok, so I have a good news and a bad news. The good new is that I have figured out the bug. The bad news is that it's a bug in Ruby. Here's the reproduction: wmap = ObjectSpace::WeakMap.new
a = "A"
b = "B"
wmap[1] = a
wmap[1] = b # the table entry with 1 is still in the list of entries to clear when `a` is GCed
a = nil
GC.start
p wmap[1] # Should be `"B"`, but is `nil` The bug is here: https://github.com/ruby/ruby/blob/f613c18912f4bf5492a8dc02ab5e3661f4ae907c/weakmap.c#L193-L194 When an object is added in a WeakMap as a value, the weak map keeps a list of all keys associated with that object, e.g. class WeakMap
def []=(key, value)
@hash[key] = value
@reverse[value] << key
end
end However when we store another value for that given key, that I think I can fix it upstream, but no idea how to make this protobuf code work on versions that won't receive a backported fix. |
Upstream bug report: https://bugs.ruby-lang.org/issues/19531 |
Excellent work on the repro. Thanks so much for tracking this down, filing the upstream bug, and working on a fix! As a side note, it would be so nice if WeakMap was capable of making only the values weak, but not the keys. Then we could remove the code that uses a secondary map. Any chance this could be added to WeakMap upstream? |
We discussed this last year: https://bugs.ruby-lang.org/issues/16038, At this stage I wonder if the simplest wouldn't be to implement you own WeakMap, which I don't think would be too hard, especially given that your keys are pointer types, so a simple |
I always figured that WeakMap had privileged access to the GC, and that its efficiency depended on this. But it looks like it is just defining finalizers for every map element. In that case, I agree we could just define our own weak map that makes only the values weak. It would probably be more efficient, because we could use plain void* as the key instead of converting it to a |
We could also potentially avoid the finalizer by clearing the map from the |
It's not high priority for me, but I'll see if I can find some time to try that around next week. |
I was going to say that these objects don't currently have free callbacks. But I guess that RUBY_DEFAULT_FREE is still a free callback. Is there any guarantee about how promptly the free callbacks will be called? If the free() was delayed, it could run after a different insert had already happened, and delete the map entry inappropriately. |
btw, an alternative to st_table would be to use upb's internal hash table: https://github.com/protocolbuffers/upb/blob/main/upb/hash/int_table.h It could make us less dependent on the internals of Ruby, if that is a possible concern. |
Yes. And even if they didn't have one, it wouldn't be a problem, we could just define one without any worry.
If If either the flag isn't set or a finalizer is set, then the callback is delayed to the dedicated "sweep" phase. But either way we should be able to handle "race conditions" here, by cheking the map entry still points the GCed object. The problem is that the
At this stage, But either way, this is all conjecture, I'll have to dig more into the gem to figure what would be the most practical (finalizer vs free callback, ruby st_table vs upb int_table). |
The C struct contains the pointer to the upb object which is the cache key to the weak map, so I don't think this should be a problem. |
Ok, so I tried implementing a custom weakmap, however It appears it's not possible, as we'd need access to |
@@ -432,7 +432,7 @@ static VALUE RepeatedField_dup(VALUE _self) { | |||
*/ | |||
VALUE RepeatedField_to_ary(VALUE _self) { | |||
RepeatedField* self = ruby_to_RepeatedField(_self); | |||
int size = upb_Array_Size(self->array); | |||
int size = self->array ? upb_Array_Size(self->array) : 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This I'll submit individually, with similar fixes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you get a chance to submit a separate PR for this? I didn't see one.
What was the root cause that led you to this fix? I ask because it looks very similar to what I've been debugging in #11968
There is a similar symptom; self->array
ends up being NULL
. But in that case it's happening in RepeatedField_each
. I've isolated it to a race condition involving the object cache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hum, I was working on it at some point but never submitted it. I think at some point I assumed this case wasn't supposed to happen and was just a fallout of another bug, but I might have been wrong not sure.
I can submit a PR if you want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this case is supposed to happen, and I do think it is the fallout of another bug. I think this is probably not the right fix.
But I am curious what triggered this bug for you. I'm wondering if it's a similar root cause to #11968 (a race condition on the object cache) or something else.
As for the Ruby bug, it's fixed upstream but hasn't been backported yet unfortunately. |
It should be present from Ruby 2.7 onwards.
The advantage over a raw
assert
is that it will trigger MRI's crash report on failure, making it much easier to debug at a glance.For context I'm trying to debug an assertion failure we've been seeing on Ruby nigthly builds: