
Deadlock in GC #105780

Open · jhudsoncedaron opened this issue Jul 31, 2024 · 34 comments

jhudsoncedaron commented Jul 31, 2024

Description

The active worker thread is hung with a diagnostic call stack that looks something like this:

ntdll.dll!syscall_gate
KERNELBASE.dll!WaitForMultipleObjects
???
CORINFO_HELP_NEWARR_1_VC
inlined!StringBuilder.ctor
Cedaron.Common.HL7.DLL!Cedaron.Common.HL7.HL7Segment.HL7Decode

All managed stacks are as follows:

Main thread:
 	ntdll.dll!00007ff9df400d24()	Unknown
 	KERNELBASE.dll!00007ff9dc9d464c()	WaitForSingleObject
 	[Managed to Native Transition]	
 	System.Private.CoreLib.dll!System.Threading.Monitor.Wait(object obj, int millisecondsTimeout) Line 156	C#
 	System.Private.CoreLib.dll!System.Threading.ManualResetEventSlim.Wait(int millisecondsTimeout, System.Threading.CancellationToken cancellationToken) Line 561	C#
 	System.Private.CoreLib.dll!System.Threading.Tasks.Task.SpinThenBlockingWait(int millisecondsTimeout, System.Threading.CancellationToken cancellationToken) Line 3072	C#
 	System.Private.CoreLib.dll!System.Threading.Tasks.Task.InternalWaitCore(int millisecondsTimeout, System.Threading.CancellationToken cancellationToken) Line 3007	C#
 	System.Private.CoreLib.dll!System.Threading.Tasks.Task.Wait(int millisecondsTimeout, System.Threading.CancellationToken cancellationToken) Line 2768	C#
 	System.Private.CoreLib.dll!System.Threading.Tasks.Task.Wait(int millisecondsTimeout) Line 2728	C#
 	[Waiting on Async Operation, double-click or press enter to view Async Call Stacks]	
	System.IO.FileSystem.Watcher.dll!System.IO.FileSystemWatcher.WaitForChanged(System.IO.WatcherChangeTypes changeType, int timeout) Line 631	C#
 	Cedaron.Interface.Engine.dll!Cedaron.Interface.Engine.Program.Main(string[] args) Line 300	C#

Background task scheduler thread:
 	KERNELBASE.dll!00007ff9dc9ad71e()	WaitForMultipleObjects
 	[Managed to Native Transition]	
 	Cedaron.Common.dll!Cedaron.Common.Portable.MultiWaiter.Win32WaitForAnyObject(nint[] waithandle, int tv_sec, int tv_msec)	Unknown
 	Cedaron.Common.dll!Cedaron.Common.Portable.MultiWaiter.WaitForAnyRead(System.Collections.Generic.IList<Cedaron.Common.Either<System.Net.Sockets.Socket, Cedaron.Common.Portable.IMultiWaitable>> waitfor, int tv_sec, int tv_msec)	Unknown
	Cedaron.Interface.Engine.dll!Cedaron.Interface.Engine.InterfaceEngine.BackgroundJobRunner()

Background keepalive thread:
 	ntdll.dll!00007ff9df400254()	Unknown
 	KERNELBASE.dll!00007ff9dc9ad71e()	WaitForMultipleObjects
 	[Managed to Native Transition]	
	System.Private.CoreLib.dll!System.Threading.Thread.Sleep(int millisecondsTimeout) Line 367	C#
 	Cedaron.Interface.FormBuilder.dll!Cedaron.Interface.FormBuilder.APIConfiguration.RefreshToken()	Unknown

.NET TP Gate:
 	ntdll.dll!00007ff9df400d24()	Unknown
 	KERNELBASE.dll!00007ff9dc9d464c()	WaitForSingleObject
 	[Managed to Native Transition]	
>	System.Private.CoreLib.dll!System.Threading.WaitHandle.WaitOneNoCheck(int millisecondsTimeout) Line 128	C#
 	System.Private.CoreLib.dll!System.Threading.PortableThreadPool.GateThread.GateThreadStart() Line 48	C#

.NET ThreadPool IO:
 	ntdll.dll!00007ff9df4030c4()	Unknown
 	KERNELBASE.dll!00007ff9dcab2724()	Unknown
 	[Managed to Native Transition]	
>	System.Private.CoreLib.dll!System.Threading.PortableThreadPool.IOCompletionPoller.Poll() Line 188	C#

.NET Timer:
 	ntdll.dll!00007ff9df400254()	Unknown
 	KERNELBASE.dll!00007ff9dc9ad71e()	WaitForMultipleObjects
 	[Managed to Native Transition]	
>	System.Private.CoreLib.dll!System.Threading.WaitHandle.WaitOneNoCheck(int millisecondsTimeout) Line 128	C#
 	System.Private.CoreLib.dll!System.Threading.TimerQueue.TimerThread() Line 87	C#

Connection Listener:
 	ntdll.dll!00007ff9df400254()	Unknown
 	KERNELBASE.dll!00007ff9dc9ad71e()	WaitForMultipleObjects
 	[Managed to Native Transition]	
 	Cedaron.Common.dll!Cedaron.Common.Portable.MultiWaiter.Win32WaitForAnyObject(nint[] waithandle, int tv_sec, int tv_msec)	Unknown
 	Cedaron.Common.dll!Cedaron.Common.Portable.MultiWaiter.WaitForAnyRead(System.Collections.Generic.IList<Cedaron.Common.Either<System.Net.Sockets.Socket, Cedaron.Common.Portable.IMultiWaitable>> waitfor, int tv_sec, int tv_msec)	Unknown
 	Cedaron.HL7.Client.dll!Cedaron.HL7.Client.SocketTransport.GetConnected(System.Action monitor, int monitorinterval)	Unknown
>	Cedaron.Interface.Engine.dll!Cedaron.Interface.Receivers.TCP.Cedaron.Interface.Declarations.IInterfaceReceiver.ReceiveLoop()
 	Cedaron.Interface.Engine.dll!Cedaron.Interface.Engine.Interface.ReceiveLoop()

.NET Finalizer:
>	ntdll.dll!00007ff9df400d24()	Unknown
 	KERNELBASE.dll!00007ff9dc9d464c()	WaitForSingleObject
 	kernel32.dll!00007ff9dde74cb0()	Unknown
 	ntdll.dll!00007ff9df3deceb()	Unknown

.NET Server GC:
>	ntdll.dll!00007ff9df400254()	Unknown
 	KERNELBASE.dll!00007ff9dc9ad71e()	WaitForMultipleObjects
 	kernel32.dll!00007ff9dde74cb0()	Unknown
 	ntdll.dll!00007ff9df3deceb()	Unknown

Only Active Worker Thread:
 	ntdll.dll!00007ff9df400254()	Unknown
 	KERNELBASE.dll!00007ff9dc9ad71e()	WaitForMultipleObjects
 	[Managed to Native Transition]	
>	Cedaron.Common.HL7.dll!Cedaron.Common.HL7.HL7Segment.HL7Decode(string data, int offset, int length, char msh1, string msh2, bool translateunknownesc) Line 391	C#
 	Cedaron.Common.HL7.dll!Cedaron.Common.HL7.HL7Segment.HL7Segment(string msgdata, int offset, int length, char msh1, string msh2, bool translateunknownesc) Line 300	C#
 	Cedaron.Common.HL7.dll!Cedaron.Common.HL7.HL7Message.HL7Message(byte[] message, int offset, int length, bool translateunknownesc) Line 523	C#
 	Cedaron.Interface.FormBuilder.dll!Cedaron.Interface.FormBuilder.ADT.UpdateManualPromoteRecords()
 	Cedaron.Interface.FormBuilder.dll!Cedaron.Interface.FormBuilder.ADT.ProcessADT()
 	Cedaron.Interface.FormBuilder.dll!Cedaron.Interface.FormBuilder.ADT.Cedaron.Interface.Declarations.IMessageProcessor.ProcessMessage()
 	Cedaron.Interface.FormBuilder.dll!Cedaron.Interface.FormBuilder.DatabaseArchiver.Cedaron.Interface.Declarations.IMessageProcessor.ProcessMessage()
 	Cedaron.Interface.Engine.dll!Cedaron.Interface.Engine.Interface.ProcessMessage()
 	Cedaron.Interface.Engine.dll!Cedaron.Interface.Receivers.TCP.Cedaron.Interface.Declarations.IInterfaceReceiver.ReceiveLoop.AnonymousMethod__0(Cedaron.HL7.Client.HL7BlockingStream stream)
 	Cedaron.Interface.Engine.dll!Cedaron.Interface.Receivers.TCP.Cedaron.Interface.Declarations.IInterfaceReceiver.ReceiveLoop()
 	Cedaron.Interface.Engine.dll!Cedaron.Interface.Engine.Interface.ReceiveLoop()

.NET Server GC:
>	ntdll.dll!00007ff9df400254()	Unknown
 	KERNELBASE.dll!00007ff9dc9ad71e()	WaitForMultipleObjects
 	kernel32.dll!00007ff9dde74cb0()	Unknown
 	ntdll.dll!00007ff9df3deceb()	Unknown

.NET Server GC:
>	ntdll.dll!00007ff9df400254()	Unknown
 	KERNELBASE.dll!00007ff9dc9ad71e()	WaitForMultipleObjects
 	kernel32.dll!00007ff9dde74cb0()	Unknown
 	ntdll.dll!00007ff9df3deceb()	Unknown

.NET Debugger:
>	ntdll.dll!00007ff9df400d24()	Unknown
 	KERNELBASE.dll!00007ff9dc9d464c()	WaitForSingleObject
 	kernel32.dll!00007ff9dde74cb0()	Unknown
 	ntdll.dll!00007ff9df3deceb()	Unknown

.NET Server GC:
>	ntdll.dll!00007ff9df400254()	Unknown
 	KERNELBASE.dll!00007ff9dc9ad71e()	WaitForMultipleObjects
 	kernel32.dll!00007ff9dde74cb0()	Unknown
 	ntdll.dll!00007ff9df3deceb()	Unknown

.NET Server GC:
>	ntdll.dll!00007ff9df400254()	Unknown
 	KERNELBASE.dll!00007ff9dc9ad71e()	WaitForMultipleObjects
 	kernel32.dll!00007ff9dde74cb0()	Unknown
 	ntdll.dll!00007ff9df3deceb()	Unknown

.NET Event Pipe:
 	[Waiting on a lock, load symbols for ntdll.dll to show thread lock information]	
>	ntdll.dll!00007ff9df403cc4()	Unknown
 	ntdll.dll!00007ff9df379f09()	Unknown
 	ntdll.dll!00007ff9df3774c2()	Unknown
 	kernel32.dll!00007ff9dde74cb0()	Unknown
 	ntdll.dll!00007ff9df3deceb()	Unknown

.NET Server GC:
>	ntdll.dll!00007ff9df400254()	Unknown
 	KERNELBASE.dll!00007ff9dc9ad71e()	WaitForMultipleObjects
 	kernel32.dll!00007ff9dde74cb0()	Unknown
 	ntdll.dll!00007ff9df3deceb()	Unknown

.NET Server GC:
>	ntdll.dll!00007ff9df400254()	Unknown
 	KERNELBASE.dll!00007ff9dc9ad71e()	WaitForMultipleObjects
 	kernel32.dll!00007ff9dde74cb0()	Unknown
 	ntdll.dll!00007ff9df3deceb()	Unknown

.NET Server GC:
>	ntdll.dll!00007ff9df400254()	Unknown
 	KERNELBASE.dll!00007ff9dc9ad71e()	WaitForMultipleObjects
 	kernel32.dll!00007ff9dde74cb0()	Unknown
 	ntdll.dll!00007ff9df3deceb()	Unknown

.NET TP Worker:
 	ntdll.dll!00007ff9df400254()	Unknown
 	KERNELBASE.dll!00007ff9dc9ad71e()	WaitForMultipleObjects
 	[Managed to Native Transition]	
>	System.Private.CoreLib.dll!System.Threading.LowLevelLifoSemaphore.WaitForSignal(int timeoutMs) Line 140	C#
 	System.Private.CoreLib.dll!System.Threading.LowLevelLifoSemaphore.Wait(int timeoutMs, bool spinWait) Line 65	C#
 	System.Private.CoreLib.dll!System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart() Line 100	C#

.NET BGC:
>	ntdll.dll!00007ff9df400254()	Unknown
 	KERNELBASE.dll!00007ff9dc9ad71e()	WaitForMultipleObjects
 	kernel32.dll!00007ff9dde74cb0()	Unknown
 	ntdll.dll!00007ff9df3deceb()	Unknown

.NET BGC:
>	ntdll.dll!00007ff9df400254()	Unknown
 	KERNELBASE.dll!00007ff9dc9ad71e()	WaitForMultipleObjects
 	kernel32.dll!00007ff9dde74cb0()	Unknown
 	ntdll.dll!00007ff9df3deceb()	Unknown

.NET BGC:
>	ntdll.dll!00007ff9df400254()	Unknown
 	KERNELBASE.dll!00007ff9dc9ad71e()	WaitForMultipleObjects
 	kernel32.dll!00007ff9dde74cb0()	Unknown
 	ntdll.dll!00007ff9df3deceb()	Unknown

.NET BGC:
>	ntdll.dll!00007ff9df400254()	Unknown
 	KERNELBASE.dll!00007ff9dc9ad71e()	WaitForMultipleObjects
 	kernel32.dll!00007ff9dde74cb0()	Unknown
 	ntdll.dll!00007ff9df3deceb()	Unknown

.NET BGC:
>	ntdll.dll!00007ff9df400254()	Unknown
 	KERNELBASE.dll!00007ff9dc9ad71e()	WaitForMultipleObjects
 	kernel32.dll!00007ff9dde74cb0()	Unknown
 	ntdll.dll!00007ff9df3deceb()	Unknown

.NET BGC:
>	ntdll.dll!00007ff9df400254()	Unknown
 	KERNELBASE.dll!00007ff9dc9ad71e()	WaitForMultipleObjects
 	kernel32.dll!00007ff9dde74cb0()	Unknown
 	ntdll.dll!00007ff9df3deceb()	Unknown

.NET TP Worker:
 	ntdll.dll!00007ff9df400254()	Unknown
 	KERNELBASE.dll!00007ff9dc9ad71e()	WaitForMultipleObjects
 	[Managed to Native Transition]	
>	System.Private.CoreLib.dll!System.Threading.LowLevelLifoSemaphore.WaitForSignal(int timeoutMs) Line 140	C#
 	System.Private.CoreLib.dll!System.Threading.LowLevelLifoSemaphore.Wait(int timeoutMs, bool spinWait) Line 65	C#
 	System.Private.CoreLib.dll!System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart() Line 100	C#
 	[Native to Managed Transition]	
 	kernel32.dll!00007ff9dde74cb0()	Unknown
 	ntdll.dll!00007ff9df3deceb()	Unknown

Reproduction Steps

No controlled reproduction is available. We currently hit the issue about three times every two weeks.

Expected behavior

new StringBuilder() does not hang forever.

Actual behavior

new StringBuilder() calls WaitForMultipleObjects inside the GC allocation helper (CORINFO_HELP_NEWARR_1_VC), and the wait never returns.

Regression?

Yes; the issue is less than three months old.

Known Workarounds

No response

Configuration

.NET Runtime 8.0.7 via dotnet publish -r win-x64
OS: Windows Server Core; probably 2022

Other information

We have a full memory dump but can't release it publicly. This is a production instance and PHI is on the stack.

dotnet-issue-labeler bot added the needs-area-label label Jul 31, 2024
dotnet-policy-service bot added the untriaged label Jul 31, 2024
jkotas (Member) commented Aug 1, 2024

Is this with regular CoreCLR with a JIT, or with native AOT? RhpNewArray is a native AOT-specific method, but the stacktraces you have shared look like regular CoreCLR with a JIT.

jhudsoncedaron (Author):

@jkotas: I'm sorry, I don't understand the difference. When I did a code search for CORINFO_HELP_NEWARR_1_VC, it appears to be mapped to RhpNewArray by RyuJIT. The environment is whatever dotnet publish -r win-x64 spits out.

jkotas added the area-VM-coreclr label and removed the needs-area-label label Aug 1, 2024
jkotas (Member) commented Aug 1, 2024

cc @VSadov


Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

mangod9 (Member) commented Aug 1, 2024

Hi @jhudsoncedaron, you mention that this started occurring recently, so do you observe that this is a regression in 8.0.7?

jhudsoncedaron (Author) commented Aug 1, 2024

@mangod9: Barring exotic hypotheses, yes; however, we jumped straight from 8.0.5 to 8.0.7 and are not sure of the exact start date of the problem. Support was slow to report the issue to engineering. In addition, we turned on gcDynamic just a couple of months ago.

However, exotic hypotheses are on the table: the data being processed, as seen in the dump, is trying to allocate about 5,000 StringBuilder objects in 50 ms.

jhudsoncedaron changed the title from "Potential Deadlock in GC Allocate via RhpNewArray" to "Potential Deadlock in GC Allocate via CORINFO_HELP_NEWARR_1_VC" Aug 1, 2024
mangod9 (Member) commented Aug 1, 2024

Hey @jhudsoncedaron, are you able to privately share a dump to investigate further?

jhudsoncedaron (Author):

@mangod9: I will immediately ask the person who can approve that. If that doesn't work, there are other ways of getting the necessary information out of a dump.

jhudsoncedaron (Author):

@mangod9: The dump upload is not happening. So it comes down to this: tell me what to look for, and I can look for it.

jhudsoncedaron (Author):

So I feel the need to explain a few things.

We are not sitting idle on this. In the meantime, several things are happening:

  1. We are trying to switch that one instance to workstation GC to see if it clears the problem (this will hit production first).
  2. We are optimizing the allocation-hot loop (in fact, the very one with the blocked thread) to allocate less. This should make the issue occur less often.
  3. We looked at the wait handle code and found a very unlikely pathway that could corrupt wait handle references. It requires two billion exceptions to be thrown from the same finally block, where the only possible exception is an OutOfMemoryException while allocating a List enumerator. Triggering the fault would overflow the reference counter inside SafeWaitHandle, but I don't see how it could cause this particular failure. (Fix: replace that particular foreach loop with a counted loop; see the sketch after this list.)
  4. We are trying to find a way to reproduce this in a dev environment. It's not looking promising.
  5. The dump contains medical records data. There's no way any production dump is getting uploaded anywhere until all other options are exhausted. I am reasonably certain a paid support case will happen first.
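
For item 3, a minimal sketch of the kind of change involved, with illustrative types rather than the actual Cedaron code: iterating an IList&lt;T&gt; through the interface allocates an enumerator object, which is the one allocation (and thus the one possible OutOfMemoryException) inside such a loop; an index-based loop removes it.

```csharp
using System.Collections.Generic;
using System.Threading;

static class WaitHandleLoops
{
    // Before: foreach over an IList<T> calls GetEnumerator() through the
    // interface, which allocates a boxed enumerator. Under memory pressure
    // that allocation can throw OutOfMemoryException from inside the loop
    // (or the finally block containing it).
    public static void DisposeAll(IList<WaitHandle> handles)
    {
        foreach (WaitHandle h in handles)
            h.Dispose();
    }

    // After: a counted loop allocates nothing, so that failure mode is
    // gone entirely.
    public static void DisposeAllCounted(IList<WaitHandle> handles)
    {
        for (int i = 0; i < handles.Count; i++)
            handles[i].Dispose();
    }
}
```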

jkotas (Member) commented Aug 2, 2024

Could you please load the dump into the windbg debugger and share the stacktraces of all threads from windbg? The stacktraces above seem to be from Visual Studio, and they are missing low-level details that we need to diagnose this.

Also, could you please run the !threads command in windbg and share its output?

jhudsoncedaron (Author):

Output of !threads and ~e!dumpstacks attached.

Unfortunately, !stacks is broken. What does "PsActiveProcessHead!" mean?

stacks-lightredacted.txt

jkotas (Member) commented Aug 2, 2024

Could you please share the output of ~*k and !sos.threads?

~*k has more precise stacktraces. dumpstack prints approximate stacktraces, and it is hard to tell exactly what's going on from those.

!sos.threads explicitly invokes the threads command from the sos debugger extension. The !threads output that you shared is from the built-in windbg !threads command; it is missing .NET runtime-specific information. If the sos extension is not present in your windbg, it probably means that your windbg is too old. Try installing the latest windbg from https://learn.microsoft.com/en-us/windows-hardware/drivers/debugger/debugger-download-tools .

jhudsoncedaron (Author):

@jkotas: You're in luck: both windbg and sos.dll were downloaded and installed anew (for the first time on this SSD) today. Here's the output of your additional commands. I didn't find anything worth redacting this time around.

stacks2-full.txt

jkotas (Member) commented Aug 3, 2024

> stacks2-full.txt

This shows that all threads are waiting for the GC to finish, but there does not seem to be any thread making progress toward finishing the GC.

One possible explanation of this situation is a use-after-free bug with Win32 HANDLEs somewhere else in the program. If some other part of the program inadvertently waited on the HANDLE used by the GC, that could explain why the GC gets stuck like this.

It may be useful to find the thread that's holding the ThreadStore lock that the .NET EventPipe thread is trying to acquire. It may give us some clues. Could you please find the OwningThread from the CRITICAL_SECTION argument passed into RtlEnterCriticalSection?

   1  Id: 1568.1be8 Suspend: 0 Teb: 000000cf`a93fd000 Unfrozen ".NET EventPipe"
 # Child-SP          RetAddr               Call Site
00 000000cf`a987f558 00007ff9`df379f09     ntdll!NtWaitForAlertByThreadId+0x14
01 000000cf`a987f560 00007ff9`df3774c2     ntdll!RtlpEnterCriticalSectionContended+0x339
02 000000cf`a987f680 00007ff9`d52f8c6a     ntdll!RtlEnterCriticalSection+0x42
03 000000cf`a987f6b0 00007ff9`d5364498     coreclr!CrstBase::Enter+0x5a [D:\a\_work\1\s\src\coreclr\vm\crst.cpp @ 328] 
04 (Inline Function) --------`--------     coreclr!ThreadStore::Enter+0x10 [D:\a\_work\1\s\src\coreclr\vm\threads.cpp @ 5179] 
05 000000cf`a987f6e0 00007ff9`d5368e0b     coreclr!ThreadSuspend::LockThreadStore+0x90 [D:\a\_work\1\s\src\coreclr\vm\threadsuspend.cpp @ 1894] 
06 (Inline Function) --------`--------     coreclr!ThreadStore::LockThreadStore+0x5 [D:\a\_work\1\s\src\coreclr\vm\threads.cpp @ 5201] 
07 (Inline Function) --------`--------     coreclr!StateHolder<&ThreadStore::LockThreadStore,&ThreadStore::UnlockThreadStore>::Acquire+0x5 [D:\a\_work\1\s\src\coreclr\inc\holder.h @ 354] 
08 (Inline Function) --------`--------     coreclr!StateHolder<&ThreadStore::LockThreadStore,&ThreadStore::UnlockThreadStore>::{ctor}+0x5 [D:\a\_work\1\s\src\coreclr\inc\holder.h @ 340] 
09 000000cf`a987f710 00007ff9`d5368631     coreclr!ThreadStore::AddThread+0x1f [D:\a\_work\1\s\src\coreclr\vm\threads.cpp @ 5227] 
0a 000000cf`a987f740 00007ff9`d5658c20     coreclr!SetupUnstartedThread+0x59 [D:\a\_work\1\s\src\coreclr\vm\threads.cpp @ 879] 
0b 000000cf`a987f770 00007ff9`d565b411     coreclr!ep_rt_thread_create+0x100 [D:\a\_work\1\s\src\coreclr\vm\eventing\eventpipe\ep-rt-coreclr.h @ 855] 
0c 000000cf`a987f830 00007ff9`d5655030     coreclr!sample_profiler_enable+0x45 [D:\a\_work\1\s\src\native\eventpipe\ep-sample-profiler.c @ 204] 

jhudsoncedaron (Author) commented Aug 5, 2024

0:001> !locks

CritSec +36c3c6f0 at 000002be36c3c6f0
WaiterWoken No
LockCount 1
RecursionCount 1
OwningThread 1bf0
EntryCount 0
ContentionCount 42
*** Locked

Scanned 48 critical sections

Oh good, there's exactly one held critical section; it must be this one. OwningThread is 1bf0, which is a .NET Server GC thread.

Confirmed: on thread 1be8, coreclr!CrstBase::Enter's this->mCriticalSection shows:

LockCount -6
RecursionCount 1
OwningThread 0x1bf0
LockSemaphore 0xffffffffffffffff
SpinCount 0x200007cf

I think the event pipe is a red herring. We don't use it. This thread appears to be from when we tried to run dotnet-stacks on the already deadlocked process.

> This shows that all threads are waiting for the GC to finish, but there does not seem to be any thread making progress toward finishing the GC.

Which was my original analysis of the stacks from Visual Studio.

> One possible explanation of this situation is a use-after-free bug with Win32 HANDLEs somewhere else in the program.

That particular explanation would require the HANDLE that was used after free to have been allocated before the GC's event HANDLE. I'm having a hard time believing this, as I would expect the GC to eagerly allocate its own resources at process startup.

Correct me if I'm wrong, but I think this is a testable hypothesis. If there's an early handle in managed code that is being used after it's freed, while having been allocated before the GC runs the first time, I should be able to move the problem to a different handle by putting the following at the top of Main: GC.Collect(); GC.WaitForPendingFinalizers();.
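
A minimal sketch of that experiment (Program and Main are illustrative; the point is only that the forced collection runs before anything else):

```csharp
using System;

static class Program
{
    static void Main(string[] args)
    {
        // Forcing a GC at process start changes the order in which HANDLEs
        // get allocated. If the deadlock is a use-after-free on an early
        // handle, the problem should move to a different handle.
        GC.Collect();
        GC.WaitForPendingFinalizers();

        // ... normal application startup continues here ...
    }
}
```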

jkotas changed the title from "Potential Deadlock in GC Allocate via CORINFO_HELP_NEWARR_1_VC" to "Deadlock in GC" Aug 5, 2024

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

jkotas (Member) commented Aug 5, 2024

I noticed that stacks2-full.txt shows 8x .NET Server GC threads but only 6x .NET BGC threads. Did we lose, or fail to create, two .NET BGC threads?

This looks like GC threading issue. cc @dotnet/gc

jkotas (Member) commented Aug 5, 2024

@jhudsoncedaron You can test whether the problem is related to background GC (BGC) by trying to disable it: https://learn.microsoft.com/en-us/dotnet/core/runtime-config/garbage-collector#background-gc
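
For reference, the knob documented on that page is System.GC.Concurrent; a runtimeconfig.json sketch that keeps server GC on while disabling background GC (equivalently, set DOTNET_gcConcurrent=0 in the environment) looks like this:

```json
{
  "runtimeOptions": {
    "configProperties": {
      "System.GC.Server": true,
      "System.GC.Concurrent": false
    }
  }
}
```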

jhudsoncedaron (Author):

@jkotas: Yes, I noticed that too. Since the number was even, I didn't think much of it.

jhudsoncedaron (Author):

Current status of the running code has been set to:

- Server GC: ON
- DATAS: ON
- Background GC: OFF (confirmed by dump)

Now we wait, I guess. If it doesn't recur, it was likely a fault in background GC.

markples (Contributor) commented Aug 5, 2024

SVR threads are created during initialization, but BGC threads are created on demand, and DATAS is on, so the 6 vs. 8 could be normal behavior. However, the BGC join object is created during initialization. Disabling DATAS (but leaving BGC enabled) would provide another interesting data point. Also, GC traces will probably be useful in determining context for any possible GC issues here.

markples (Contributor) commented Aug 5, 2024

DATAS is controlled by DOTNET_GCDynamicAdaptationMode (environment variable) or System.GC.DynamicAdaptationMode (runtimeconfig property).
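
In .NET 8, DATAS is opt-in; a minimal runtimeconfig.json sketch enabling it (equivalently, set DOTNET_GCDynamicAdaptationMode=1 in the environment) looks like this:

```json
{
  "runtimeOptions": {
    "configProperties": {
      "System.GC.DynamicAdaptationMode": 1
    }
  }
}
```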

mangod9 (Member) commented Aug 8, 2024

@jhudsoncedaron, did the issue not reproduce without BGC? Also, since this is .NET 8, I assume you have explicitly enabled DATAS?

jhudsoncedaron (Author) commented Aug 8, 2024

@mangod9: Correct, DATAS is enabled explicitly.

I have not had a report of it reproducing since setting BGC off; however, not enough time has elapsed to be sure. Note that a business-day assumption is not valid here, so we do not yet have even 50% confidence.

The way the math works out, we need two more weeks for 99% confidence (assuming a normal distribution), which is an amazing coincidence because I will be on vacation for the next two weeks. Operations is monitoring the situation; however, it is likely this thread will not be updated until I return.

A mitigation is scheduled to be installed Tuesday evening, 8/6. It reduces the number of allocations required in the allocation-hot loop (the one whose stack locked up allocating a StringBuilder) by removing the "small substrings are cheap" assumption. The number of allocations was reduced by almost two-thirds.

mangod9 (Member) commented Aug 8, 2024

Perhaps also trying with DATAS off but BGC on would be helpful. We have made quite a few changes to DATAS in 9, but we are wondering whether it has shown reasonable improvements for your app, since you enabled it in 8.

jhudsoncedaron (Author):

@mangod9: That's next on the list to try; however, this entails switching back to workstation GC. Trying to enable server GC without DATAS will invariably lock up the server. The only way this server works is with the ability to shuffle memory between running processes on demand.

mangod9 removed the untriaged label Aug 14, 2024
mangod9 added this to the Future milestone Aug 14, 2024
jhudsoncedaron (Author):

Status Update: No more incidents occurred after turning off background GC.

mangod9 (Member) commented Aug 26, 2024

So just to confirm: it's with DATAS enabled but BGC disabled?

jhudsoncedaron (Author):

@mangod9: Correct; current status is DATAS enabled, BGC disabled, no more deadlocks.

Maoni0 (Member) commented Aug 28, 2024

@jhudsoncedaron and I got in contact via email and will go from there.

Maoni0 (Member) commented Aug 29, 2024

I did some debugging with @jhudsoncedaron yesterday (thanks so much for agreeing to do the debugging session!). The GC did try to create 7 BGC threads, but one of the BGC threads is simply not there, yet we do have a Thread object for it, and the thread creation call returned success. So I'm still looking at that part. This caused the deadlock, since we are supposed to have 7 threads joining but only got 6.

cshung modified the milestone from Future to 9.0.0 Sep 6, 2024
cshung (Member) commented Sep 9, 2024

@jhudsoncedaron

The instrumented binary is ready and can be accessed at https://github.com/Maoni0/clrgc/tree/main/issues/105780. Detailed instructions on how to use it can be found at https://github.com/Maoni0/clrgc/blob/main/README.md.

Please let me know if you need any further assistance or information regarding this.

jhudsoncedaron (Author):

A little update. I don't want to say too much right now at the risk of getting it wrong; however, Maoni0 has found a bad interaction between DATAS and background GC in the code.
