segfault error 4 in libcoreclr.so #42885
The relevant assembly that caused the issue is:
The issue seems to be in Following that address we get:
So this looks like a null reference issue.
So this is a range of about 2.6 GB, but no further information is provided. LLDB doesn't have a way to provide the memory map of a process, so I used
What I find really interesting is that I can't find this memory region anywhere. Note that the Heap 2 segment is I'm not sure what conclusion to draw from this. Our application uses native memory quite heavily, but this is the only cluster (out of hundreds) where we are seeing this kind of behavior. I tried looking upward in the stack, giving:
And here is the relevant disassembly:
Any pointers to figure out what is going on would be very helpful. |
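For reference, a minimal sketch of one way to see the memory map captured in a core file (the core file name is an assumption): the LOAD program headers of the core describe every region that was dumped.

```
# List the memory regions preserved in the core dump; each LOAD segment
# corresponds to one mapped region (virtual address, size, permissions).
readelf --program-headers core.19552 | grep -A1 LOAD
```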
This is likely caused by an argument or local variable in the method containing garbage instead of a valid object reference. You need to go to |
I'll see what this gives us. Any possibility that this relates to: #41413 ? |
Unlikely. |
This is what I get when I switch to the
I'm not sure how to get the relevant values. I think you mean to go to the
And the instructions are:
Trying to use: I tried to go to: I assume that the
Any help would be great. |
I think so. Try https://docs.microsoft.com/en-us/dotnet/core/diagnostics/dotnet-symbol |
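A minimal sketch of running that tool against a core dump, assuming the dump is named `core.19552` and symbols should go to `./symbols`:

```
# Install dotnet-symbol and download symbol files for the modules referenced by the dump.
dotnet tool install -g dotnet-symbol
dotnet-symbol --output ./symbols ./core.19552
```

lldb can then pick up the downloaded files when they are placed next to the corresponding modules, or they can be loaded explicitly with `target symbols add`.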
Thank you, the symbols help a lot. That said, I'm not sure how to interpret the data I have. Currently going through the code and trying to figure out what the method name really is.
And:
|
Trying further, I think that this should give me the relevant:
That doesn't seem to be the right value, so I'm afraid that I'm lost. |
Okay, I think I got it:
Which gives:
Which is more reasonable, this gives me:
This is where I'm forced to stop, because I cannot trace it further:
BTW, I'm basically trying to get the Given the information that I have, how do I turn the |
It is very labor intensive to do this manually. You need to get a working SOS: https://github.com/dotnet/diagnostics/blob/master/documentation/installing-sos-instructions.md and then use |
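A minimal sketch of getting SOS working under lldb (the paths and dump name are assumptions; `clrstack -a` dumps the managed frames together with their arguments and locals):

```
dotnet tool install -g dotnet-sos
dotnet-sos install                    # writes the SOS plugin load command into ~/.lldbinit
lldb --core ./core.19552 ./dotnet     # open the dump with the matching host binary
(lldb) clrstack -a                    # managed stack with arguments and locals
```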
🤦 Of course. For some reason I was so focused on the native side of things that I forgot about SOS |
And with that, I now have a pretty good lead. The
Now, this is automatically generated code, included below. I apologize for how it looks. We have had issues before with the number of local variables created by this style of code. Could it be something similar? Any idea how to go forward? I assume that if I can narrow down the type of the object, I can get more information about the root cause. dumpobj 0x00007fb6cdd93f28 But it complained about a missing class field. How do I go from the pointer /
|
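For reference, a sketch (not necessarily what was attempted here) of going from a raw pointer to a type name when `dumpobj` complains: the first pointer-sized word of a managed object is its MethodTable, so reading that word and passing it to `dumpmt` can show whether the pointer is a valid object and, if so, what type it is.

```
(lldb) memory read -s 8 -f x -c 1 0x00007fb6cdd93f28   # first 8-byte word = MethodTable pointer, if this is an object
(lldb) dumpmt <address-read-above>                      # prints the type name for a valid MethodTable
```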
We have another failed coredump, and there we have the same stack trace, but the failure is on:
Looking at the error further, I get:
Note that we have a null Looking further, here is the faulting instruction:
And the registers say:
So we have an issue with The good news is that we have a good idea where this goes off the rails, but I'm not sure how to narrow this further. |
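A small sketch, assuming SOS is loaded, of one way to test whether a suspect register value is actually a managed object (the address below is just a placeholder):

```
(lldb) dso                                # possible object references found on the stack and in registers of the current thread
(lldb) dumpobj <suspect-register-value>   # check whether a suspect value is a valid managed object and see its type
```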
This time the faulty code is elsewhere, but the situation is similar: a lot of operations and a deeply nested structure. |
Probably not.
ppObject argument of |
How do I find the name of the argument? I don't think that I saw a command for that. |
There is no command for that. It requires looking at the disassembly and correlating it with the source code. |
Tagging subscribers to this area: @dotnet/gc |
I'm afraid I'm not sure I'm following. I looked at the disassembly, and I can point to which variable it is looking at in terms of the unmanaged code. I tried to follow the |
Right. The way to do that is to manually match the instructions in the disassembly with the C# source code. The SOS "u" command can interleave the disassembly with line numbers and IL (dotnet/diagnostics#452), which can help you with this. |
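A sketch of that workflow, assuming the `-n`/`-il` options are available in your SOS build and that you have an address inside the jitted method:

```
(lldb) ip2md <managed-code-address>   # resolve an instruction address in jitted code to its MethodDesc
(lldb) u -n -il <MethodDesc>          # disassembly interleaved with source line numbers and IL
```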
Managed heap corruptions usually aren't debugged by the GC team... CCing @ChrisAhna to see if he has cycles to help. |
Also if you could share a repro and/or a few dumps that would be helpful to continue investigating further. Thx |
We have a new core dump - for some reason the crash rate changed.
@janvorli I'm missing debug symbols for the new libcoreclr.so:
Could you send us this file please? |
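As an aside, a sketch of one way to check whether a given libcoreclr.so matches the module recorded in a dump, by comparing GNU build-ids (the paths are assumptions):

```
# Build-id embedded in the library on disk:
readelf -n ./libcoreclr.so | grep "Build ID"
# Build-id (UUID) lldb recorded for the module in the core dump:
(lldb) image list -u libcoreclr.so
```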
Of course, I am sorry I haven't shared it before. Here is the link to get it: You'll need to ungzip it. |
OK @janvorli we've got it in place and we can continue. I tried to figure out the method (the RavenDB index in question) that it is related to, but failed with I attach:
https://gist.github.com/gregolsky/2a1241f5cabe39b09adf396bedd3efac Do you know why would |
Could we get an update on this one please? |
@gregolsky thank you for the reminder, I am sorry for the delay. I'll take a look at the stuff you've logged and get back to you later today. |
It is really strange that dumpmd cannot dump the MethodDesc; it seems as if it was corrupted or something. Do you happen to use unloading or collectible dynamic assemblies (https://docs.microsoft.com/en-us/dotnet/framework/reflection-and-codedom/collectible-assemblies)?
That one should match the one you were trying to dump before. |
Can you also please dump the stack trace for thread #1004 using And also please |
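A sketch of dumping both the native and managed stacks for one particular thread in lldb (assuming 1004 is the lldb thread index; use `thread list` to find it otherwise):

```
(lldb) thread list            # locate the lldb index of the thread of interest
(lldb) thread select 1004     # select it
(lldb) bt                     # native stack for the selected thread
(lldb) clrstack -a            # managed stack, with arguments and locals, for the same thread
```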
We don't use these as far as I know. Our indexes are compiled at runtime, but not unloaded. ip2md wasn't successful:
For 1004 the only thing that shows anything (at least to me) is
|
I've just realized the most likely reason for the MethodDesc and everything else not working properly: since I've modified libcoreclr.so, I should have also shared libmscordaccore.so, which needs to match it. Since that file is read by SOS, you can just update libmscordaccore.so and open the same dump; I think things will just work. |
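If replacing the file in place is inconvenient, SOS can also be pointed at a directory containing the matching runtime binaries; a sketch, with the path as an assumption:

```
(lldb) setclrpath /path/to/private-3.1.8   # directory holding the matching libcoreclr.so and libmscordaccore.so
(lldb) clrstack                            # DAC-dependent commands should now resolve correctly
```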
Thank you! Glad we have that sorted out. Attaching the requested outputs: |
I am afraid I made a mistake when building the libcoreclr for you, and I am really sorry for that. It seems I somehow missed making the change, so I basically shared a self-build of plain 3.1.8 without any modification. I was originally building it on Ubuntu and then realized it would be better to build it on CentOS 7, where we build the official builds, and I guess that by mistake I didn't cherry-pick the change and built plain 3.1.8. I've just checked the source tree on the VM where I built it (and from which I've been sending you the libraries), and that confirmed it. |
OK then. We will install it and grab another core dump if it crashes. |
@janvorli I am not sure if this is a coincidence or it was planned, but at first look it seems the memory usage has decreased significantly after installation of the custom binary... Is that possible? Similar behavior can be observed on the other 2 cluster nodes. |
@gregolsky that was not expected. Maybe the issue doesn't always lead to a crash, but somehow falsely keeps objects alive. However, I don't have a clear explanation for such a mechanism. |
@janvorli I would like to report that we have not experienced another segfault since the installation (2020-12-14) of the custom libcoreclr.so on the system. |
Great! Thank you for the info. I'll ask for approval for getting the change I've made for you into 3.1. |
@gregolsky the fix was approved for the 3.1.12 release that should be released in about a month. |
@janvorli That's great news! Thank you for the update. |
Is this also fixed in 5.0.3? Using 5.0.2, I am also experiencing a crash with the same error, a few times a week. |
@ToxicLand what are the symptoms in your case? Are you sure it is the same problem? |
I am using .NET Core 5.0.2. The program I use crashes with a very similar log: It seems to happen about twice a week. |
Ah, this indicates just a SIGSEGV; there can be many reasons for such a problem, and it doesn't mean it is related to the specific issue here. Can you please create a new issue for your problem? I'll work with you there to get more details on it. |
@janvorli Could you confirm it landed in the 3.1.12 release? |
Yes, it landed in the 3.1.12 release. See dotnet/coreclr#28119 |
Description
Process crashes on a regular basis (a few times a week).
We were able to take a coredump and this is what we got from `lldb-3.9` on thread 19552:

We've run the `dotnet-dump analyze` `verifyheap` command and it did not detect any corruption.

Calculated the failed method offset 0x00007fb6d0863a7a - 0x7fb6d0794000 = 0xCFA7A and did `addr2line`:
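A minimal sketch of the commands described above, with the dump and library paths as assumptions:

```
dotnet-dump analyze ./core.19552
> verifyheap                   # reported no corruption
> exit

# Offset of the faulting instruction inside libcoreclr.so:
#   0x00007fb6d0863a7a - 0x7fb6d0794000 = 0xCFA7A
addr2line -C -f -e ./libcoreclr.so 0xCFA7A
```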
:Configuration
Other information