Deadlock in GC #105780
Comments
Is this with regular CoreCLR with a JIT or with native AOT?
@jkotas : I'm sorry, I don't understand the difference. When I did a code search for CORINFO_HELP_NEWARR_1_VC, it seems to be mapped to RhpNewArray by RyuJIT. The environment is whatever
cc @VSadov
Tagging subscribers to this area: @mangod9
Hi @jhudsoncedaron, you mention that this started occurring recently, so do you observe that this is a regression in 8.0.7?
@mangod9 : Barring exotic hypotheses, yes; however, we jumped straight from 8.0.5 to 8.0.7 and are not sure of the exact start date of the problem. Support was slow to report the issue to engineering. In addition, we just turned on gcDynamic a couple of months ago. That said, exotic hypotheses are on the table; the data being processed, as seen in the dump, is trying to allocate about 5,000 StringBuilder objects in 50 ms.
Hey @jhudsoncedaron, are you able to privately share a dump to investigate further?
@mangod9 : I will immediately ask the guy capable of approving that. If that doesn't work, there are other ways of getting the necessary information from a dump.
@mangod9 : Dump upload is not happening. So it comes down to this: tell me what to look for and I can look for it.
So I feel the need to explain a few things. We are not sitting idle on this; other things are happening at the same time.
Could you please load the dump into the windbg debugger and share the stacktraces of all threads from windbg? The stacktraces above seem to be from Visual Studio and are missing low-level details that we need to diagnose this. Also, could you please run
Output of !threads and ~e!dumpstacks. Unfortunately, !stacks is broken. What does "PsActiveProcessHead!" mean?
Could you please share the output of
@jkotas : You're in luck; both windbg and sos.dll were downloaded and installed anew (for the first time on this SSD) today. Here's the output of your additional commands. I didn't find anything worth redacting this time around.
This shows that all threads are waiting for the GC to finish, but there does not seem to be any thread making progress toward finishing the GC. One possible explanation of this situation is a use-after-free bug with Win32 HANDLEs somewhere else in the program. If some other part of the program inadvertently waited on the HANDLE used by the GC, it could explain why the GC gets stuck like this. It may be useful to find the thread that's holding the ThreadStore lock that the
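To make that hypothesis concrete, a use-after-free on a raw Win32 HANDLE generally looks like the sketch below. This is hypothetical code, not taken from the application: a numeric handle value is kept after CloseHandle, the OS later reuses that value for an unrelated object (such as an event owned by the GC), and a wait on the stale value then blocks on the wrong object.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical sketch of the failure mode being described, not code from the
// application: waiting on a HANDLE value after it has been closed. If the OS
// has reused the numeric value for an unrelated handle (e.g. an event owned by
// the GC), the wait targets that unrelated object and may never return.
static class StaleHandleSketch
{
    private const uint INFINITE = 0xFFFFFFFF;

    [DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Unicode)]
    private static extern IntPtr CreateEventW(IntPtr securityAttributes, bool manualReset, bool initialState, string? name);

    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern bool CloseHandle(IntPtr handle);

    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern uint WaitForSingleObject(IntPtr handle, uint milliseconds);

    public static void Demonstrate()
    {
        IntPtr evt = CreateEventW(IntPtr.Zero, manualReset: true, initialState: false, name: null);
        CloseHandle(evt);   // the numeric handle value is now free for reuse by any component

        // ... some other component (the runtime, a library, ...) creates a handle
        // and may receive the same numeric value ...

        // Bug: the stale value is waited on. Whoever owns the reused handle now has
        // an unexpected waiter, and this call may block forever.
        WaitForSingleObject(evt, INFINITE);
    }
}
```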
0:001> !locks
CritSec +36c3c6f0 at 000002be36c3c6f0
Scanned 48 critical sections
Oh good, there's exactly one held critical section; it must be this one. OwningThread is 1bf0, which is a .NET Server GC thread. Confirmed: 1be8 is in coreclr!CrstBase::Enter, this->mCriticalSection:
I think the EventPipe is a red herring. We don't use it. This thread appears to be from when we tried to run dotnet-stack on the already-deadlocked process.
That was my original analysis of the stacks from Visual Studio.
That particular explanation would require the HANDLE that was used after it was freed to have been allocated before the GC's event HANDLE. I am having a hard time believing this, as I would expect the GC to eagerly allocate its own resources at process startup. Correct me if I'm wrong, but I think this is a testable hypothesis: if there's an early handle in managed code that is being used after it's freed, while being allocated before the GC runs the first time, I should be able to move the problem to a different handle by putting the following at the top of main:
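The snippet itself did not survive in this thread. A minimal sketch of the idea as described (hypothetical, with an invented field name) would be to allocate and hold an extra event handle at the very start of Main, so that handles created afterwards, including the GC's event, land on different numeric values and a stale early handle would hit the decoy instead:

```csharp
using System.Threading;

class Program
{
    // Hypothetical sketch only; the snippet from the comment above was not preserved.
    // The extra event is created first thing in Main and held for the life of the
    // process, so handles created afterwards land on different numeric values than
    // they otherwise would. A use-after-free of an early handle should then hit
    // this decoy instead of the handle it previously collided with.
    private static ManualResetEvent? s_decoyHandle;

    static void Main(string[] args)
    {
        s_decoyHandle = new ManualResetEvent(false);   // allocate one extra OS event handle up front
        // ... rest of the application ...
    }
}
```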
Tagging subscribers to this area: @dotnet/gc
I have noticed that stacks2-full.txt shows 8x
This looks like a GC threading issue. cc @dotnet/gc
@jhudsoncedaron You can test whether the problem is related to background GC (BGC) by trying to disable it: https://learn.microsoft.com/en-us/dotnet/core/runtime-config/garbage-collector#background-gc
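The linked page documents the switch; for a published app, disabling background GC comes down to a runtimeconfig.json fragment like the following (or, equivalently, the DOTNET_gcConcurrent=0 environment variable). The exact file contents here are a sketch, not the project's actual configuration:

```json
{
  "runtimeOptions": {
    "configProperties": {
      "System.GC.Concurrent": false
    }
  }
}
```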
@jkotas : Yes, I noticed that too. Since the number was even I didn't think much of it.
Current status of the running code has been set to:
Now we wait, I guess. If it doesn't recur, it was likely a fault in background GC.
SVR threads are created during initialization, but BGC threads are created on demand and DATAS is on, so the 6 vs 8 could be normal behavior. However, the BGC join object is created during initialization. Disabling DATAS (but leaving BGC enabled) would provide another interesting data point on this. Also, GC traces will probably be useful in determining context for any possible GC issues here.
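For reference (a generic suggestion, not a step confirmed in this thread): GC traces can usually be captured from a running process with the dotnet-trace tool, for example:

```
dotnet-trace collect --process-id <PID> --profile gc-collect
```

The built-in gc-collect profile records collection-level GC events at low overhead; gc-verbose adds more detail, including sampled object allocations.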
DATAS - DOTNET_GCDynamicAdaptationMode or System.GC.DynamicAdaptationMode |
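Both spellings refer to the same DATAS switch: the first is the environment variable, the second the runtime configuration property. The setup described in this thread (server GC with DATAS explicitly enabled) would be expressed roughly like the sketch below; setting the value to 0 (or DOTNET_GCDynamicAdaptationMode=0) disables DATAS:

```json
{
  "runtimeOptions": {
    "configProperties": {
      "System.GC.Server": true,
      "System.GC.DynamicAdaptationMode": 1
    }
  }
}
```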
@jhudsoncedaron, did the issue not reproduce without BGC? Also, since this is .NET 8, I assume you have explicitly enabled DATAS?
@mangod9 : Correct, DATAS is enabled explicitly. I have not had a report that it reproduced again after setting BGC off; however, not enough time has elapsed to be sure. Note that the business-day assumption is not valid, so we do not yet have 50% confidence. The way the math works out, we need two more weeks for 99% confidence (assuming a normal distribution), which is an amazing coincidence because I will be on vacation for the next two weeks. Operations is monitoring the situation; however, it is likely this thread will not be updated until I return. A mitigation is scheduled to be installed Tuesday evening, 8/6. This mitigation reduces the number of allocations required in the allocation-hot loop (the stack that locked up allocating a StringBuilder) by removing the "small substrings are cheap" assumption. The number of allocations was reduced by almost two-thirds.
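For context, that mitigation is application-side. A typical way to drop the "small substrings are cheap" assumption in a delimiter-driven parsing loop (purely illustrative; this is not the actual Cedaron code, and the names are invented) is to slice with spans instead of allocating a string per field:

```csharp
using System;

// Illustrative only: walk '|'-delimited fields without allocating a string per
// field. ReadOnlySpan<char> slices are views over the original string, so the
// hot loop itself performs no allocations.
delegate void FieldVisitor(int index, ReadOnlySpan<char> field);

static class SegmentScanner
{
    public static void VisitFields(string segment, FieldVisitor onField)
    {
        ReadOnlySpan<char> rest = segment.AsSpan();
        int index = 0;
        while (true)
        {
            int bar = rest.IndexOf('|');
            if (bar < 0)
            {
                onField(index, rest);              // last field
                return;
            }
            onField(index++, rest.Slice(0, bar));  // field without allocating a substring
            rest = rest.Slice(bar + 1);
        }
    }
}
```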
Perhaps also trying with DATAS off but BGC on would be helpful. We have made quite a few changes to DATAS in 9, but we are wondering whether it has shown reasonable improvements for your app since you enabled it in 8.
@mangod9 : That's next on the list to try; however, this entails switching back to workstation GC. Trying to enable server GC without DATAS will invariably lock up the server. The only way this server works is with the ability to shuffle memory between running processes on demand.
Status Update: No more incidents occurred after turning off background GC. |
So just to confirm, it's with DATAS enabled but BGC disabled?
@mangod9 : Correct; current status is DATAS enabled, BGC disabled, no more deadlocks.
@jhudsoncedaron and I got in contact via email and will go from there. |
I did some debugging with @jhudsoncedaron yesterday (thanks so much for agreeing to do the debugging session!). The GC did try to create 7 BGC threads, but one of the BGC threads is simply not there, yet we do have a Thread object for it and the thread creation call returned success, so I'm still looking at that part. This caused the deadlock, since we are supposed to have 7 threads joining but only got 6.
The instrumented binary is ready and can be accessed at https://github.com/Maoni0/clrgc/tree/main/issues/105780. Detailed instructions on how to use it can be found at https://github.com/Maoni0/clrgc/blob/main/README.md. Please let me know if you need any further assistance or information regarding this. |
A little update. I don't want to say too much right now at the risk of getting it wrong; however, @Maoni0 has found a bad interaction between DATAS and background GC in the code.
Description
An active worker thread is hung with a diagnostic call stack that looks something like this:
ntdll.dll!syscall_gate
KERNELBASE.dll!WaitForMultipleObjects
???
CORINFO_HELP_NEWARR_1_VC
inlined!StringBuilder.ctor
Cedaron.Common.HL7.DLL!Cedaron.Common.HL7.HL7Segment.HL7Decode
All managed stacks are as follows:
Reproduction Steps
No controlled reproduction available. We currently get the issue about three times per two weeks.
Expected behavior
new StringBuilder() does not hang forever.
Actual behavior
new StringBuilder() calls WaitForMultipleObjects, which does not return.
Regression?
Yes; the issue is less than three months old.
Known Workarounds
No response
Configuration
.NET Runtime 8.0.7 via dotnet publish -r win-x64
OS: Windows Server Core; probably 2022
Other information
We have a full memory dump but can't release it publicly. This is a production instance and PHI is on the stack.