Garbage Collection hang since upgrading to .Net 9 #112203

Open
scotttho-datacom opened this issue Feb 5, 2025 · 11 comments
Labels: area-GC-coreclr, untriaged (New issue has not been triaged by the area owner)



scotttho-datacom commented Feb 5, 2025

Description

Since upgrading our application to .NET 9, we are seeing processes lock up and hang indefinitely (stuck for several hours before being killed manually). Analysis of the memory dumps suggests a garbage collection issue.

We have collected many memory dumps and can share one from our dev environment privately if needed.
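For anyone wanting to capture a comparable dump, one common way to grab a full dump of a hung .NET process is sketched below (illustrative only; the PID is a placeholder and this may not match the exact tooling we used):

    # Capture a full memory dump of the hung process (PID 12345 is hypothetical)
    dotnet-dump collect --process-id 12345 --type Full

    # Or with Sysinternals ProcDump:
    procdump -ma 12345 hang.dmp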

WinDbg output:

0:000> !analyze -v
*******************************************************************************
*                                                                             *
*                        Exception Analysis                                   *
*                                                                             *
*******************************************************************************

*** WARNING: Unable to verify checksum for Datascape.exe

KEY_VALUES_STRING: 1

    Key  : Analysis.CPU.mSec
    Value: 3625

    Key  : Analysis.Elapsed.mSec
    Value: 76019

    Key  : Analysis.IO.Other.Mb
    Value: 3

    Key  : Analysis.IO.Read.Mb
    Value: 1

    Key  : Analysis.IO.Write.Mb
    Value: 13

    Key  : Analysis.Init.CPU.mSec
    Value: 437

    Key  : Analysis.Init.Elapsed.mSec
    Value: 6136

    Key  : Analysis.Memory.CommitPeak.Mb
    Value: 233

    Key  : Analysis.Version.DbgEng
    Value: 10.0.27725.1000

    Key  : Analysis.Version.Description
    Value: 10.2408.27.01 amd64fre

    Key  : Analysis.Version.Ext
    Value: 1.2408.27.1

    Key  : CLR.Engine
    Value: CORECLR

    Key  : CLR.Version
    Value: 9.0.24.52809

    Key  : Failure.Bucket
    Value: BREAKPOINT_80000003_coreclr.dll!SVR::gc_heap::wait_for_gc_done

    Key  : Failure.Hash
    Value: {e9e7019b-8511-725f-3866-9e920f973aa5}

    Key  : Failure.Source.FileLine
    Value: 14738

    Key  : Failure.Source.FilePath
    Value: D:\a\_work\1\s\src\coreclr\gc\gc.cpp

    Key  : Failure.Source.SourceServerCommand
    Value: raw.githubusercontent.com/dotnet/runtime/9d5a6a9aa463d6d10b0b0ba6d5982cc82f363dc3/src/coreclr/gc/gc.cpp

    Key  : Timeline.OS.Boot.DeltaSec
    Value: 886293

    Key  : Timeline.Process.Start.DeltaSec
    Value: 461511

    Key  : WER.OS.Branch
    Value: fe_release

    Key  : WER.OS.Version
    Value: 10.0.20348.1

    Key  : WER.Process.Version
    Value: 2.7.122.1440


FILE_IN_CAB:  Datascape (2).DMP

NTGLOBALFLAG:  0

APPLICATION_VERIFIER_FLAGS:  0

EXCEPTION_RECORD:  (.exr -1)
ExceptionAddress: 0000000000000000
   ExceptionCode: 80000003 (Break instruction exception)
  ExceptionFlags: 00000000
NumberParameters: 0

FAULTING_THREAD:  000023b4

PROCESS_NAME:  Datascape.dll

ERROR_CODE: (NTSTATUS) 0x80000003 - {EXCEPTION}  Breakpoint  A breakpoint has been reached.

EXCEPTION_CODE_STR:  80000003

STACK_TEXT:  
00000066`f4d7b438 00007ffd`d01ada4e     : 00000000`00000000 00000000`00001528 00000000`00000030 00000000`00000018 : ntdll!NtWaitForSingleObject+0x14
00000066`f4d7b440 00007ffd`a39d0991     : 00000000`00000000 00000000`00000018 00000260`00000000 00000000`000000a8 : KERNELBASE!WaitForSingleObjectEx+0x8e
00000066`f4d7b4e0 00007ffd`a3912632     : 00000000`00000001 00000000`00000000 00000000`00000000 00007ffd`00000000 : coreclr!SVR::gc_heap::wait_for_gc_done+0x5d
00000066`f4d7b510 00007ffd`a39f8819     : 00000260`a4ebf118 00000260`a4e21728 00000260`a4ebf118 00000000`00000040 : coreclr!SVR::GCHeap::GarbageCollectGeneration+0xee
00000066`f4d7b560 00007ffd`a3a34d96     : 00000000`00000000 00000260`a4ebf118 00000000`00000040 00000260`a4e21728 : coreclr!SVR::gc_heap::trigger_gc_for_alloc+0x15
00000066`f4d7b590 00007ffd`a3a34c7e     : 00000000`00000000 00000000`00000000 00000000`00000000 00000260`e851e288 : coreclr!SVR::gc_heap::try_allocate_more_space+0x656de
00000066`f4d7b600 00007ffd`a3920417     : 00000000`00000002 00000000`00000040 00000260`a4e21728 00000260`f1e34b28 : coreclr!SVR::gc_heap::allocate_more_space+0x65772
00000066`f4d7b660 00007ffd`a3948a71     : 00000000`00000002 00007ffd`44f31a4e 00000260`a4e21728 00000260`c5be9e68 : coreclr!SVR::GCHeap::Alloc+0xb7
00000066`f4d7b6c0 00007ffd`a39488cd     : 00000260`e854bb10 00000000`00000000 00000066`f4d7b970 00000260`e854c500 : coreclr!AllocateObject+0x101
00000066`f4d7b750 00007ffd`4ffdbe60     : 00007ffd`50236ce8 00000260`e854c758 00000260`e854c568 00000000`00000000 : coreclr!JIT_New+0xdd
00000066`f4d7b8b0 00007ffd`50236ce8     : 00000260`e854c758 00000260`e854c568 00000000`00000000 00000066`f4d7b8b0 : EntityFramework!System.Data.Entity.Core.Mapping.ViewGeneration.Validation.ForeignConstraint.CheckIfConstraintMappedToForeignKeyAssociation+0x580
00000066`f4d7b8b8 00000260`e854c758     : 00000260`e854c568 00000000`00000000 00000066`f4d7b8b0 00007ffd`4f85db7e : 0x00007ffd`50236ce8
00000066`f4d7b8c0 00000260`e854c568     : 00000000`00000000 00000066`f4d7b8b0 00007ffd`4f85db7e 00000000`00000000 : 0x00000260`e854c758
00000066`f4d7b8c8 00000000`00000000     : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x00000260`e854c568


STACK_COMMAND:  ~0s; .ecxr ; kb

FAULTING_SOURCE_LINE:  D:\a\_work\1\s\src\coreclr\gc\gc.cpp

FAULTING_SOURCE_FILE:  D:\a\_work\1\s\src\coreclr\gc\gc.cpp

FAULTING_SOURCE_LINE_NUMBER:  14738

FAULTING_SOURCE_SRV_COMMAND:  https://raw.githubusercontent.com/dotnet/runtime/9d5a6a9aa463d6d10b0b0ba6d5982cc82f363dc3/src/coreclr/gc/gc.cpp

FAULTING_SOURCE_CODE:  
No source found for 'D:\a\_work\1\s\src\coreclr\gc\windows\gcenv.windows.cpp'


SYMBOL_NAME:  coreclr!SVR::gc_heap::wait_for_gc_done+5d

MODULE_NAME: coreclr

IMAGE_NAME:  coreclr.dll

FAILURE_BUCKET_ID:  BREAKPOINT_80000003_coreclr.dll!SVR::gc_heap::wait_for_gc_done

OS_VERSION:  10.0.20348.1

BUILDLAB_STR:  fe_release

OSPLATFORM_TYPE:  x64

OSNAME:  Windows 10

IMAGE_VERSION:  9.0.24.52809

FAILURE_ID_HASH:  {e9e7019b-8511-725f-3866-9e920f973aa5}

Followup:     MachineOwner
---------

stacks.txt

Reproduction Steps

We have not been able to reproduce this on demand, but we are seeing it on a nightly basis.

Expected behavior

The application does not hang.

Actual behavior

The application hangs indefinitely.

Regression?

This seems to have come in since upgrading to .NET 9.

Known Workarounds

None

Configuration

Some background on our setup:

Application Servers (Windows x64)

  • Many instances (~40) of the app run as Windows services
  • The following GC environment variables are set (see the PowerShell sketch at the end of this section):
    • DOTNET_GCConserveMemory=6
    • DOTNET_gcServer=1
    • DOTNET_gcTrimCommitOnLowMemory=1
    • DOTNET_GCDynamicAdaptationMode=1
  • Note: DATAS made a fantastic improvement for us in .NET 8, and we have been pretty happy with how GC has been working with this config
  • We have not had a single instance of the issue on these servers

Processing Servers (Windows x64)

  • Run the same app as a command-line process to do some short processing (20-40 mins) overnight
  • Up to 5 instances running at a time
  • No GC environment variables set (we have never needed to worry about GC here, as these are short-lived processes)
  • Most nights we see 1-5 processes hang indefinitely

We have collected many memory dumps, and the hang is often in the same area of code (Entity Framework), which looks like it is doing quite a lot of allocations.

We have tried applying the same GC settings to the processing servers, with mixed results: no hangs in our dev environment after 2 days, but 2 hangs in production on the first day after applying the settings.
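For reference, a minimal sketch of how the application-server variables could be applied machine-wide via PowerShell (illustrative; how we actually deploy them may differ, and the services need to be restarted to pick up the new environment):

    # Set the GC knobs machine-wide (run elevated)
    [Environment]::SetEnvironmentVariable('DOTNET_gcServer', '1', 'Machine')
    [Environment]::SetEnvironmentVariable('DOTNET_GCConserveMemory', '6', 'Machine')
    [Environment]::SetEnvironmentVariable('DOTNET_gcTrimCommitOnLowMemory', '1', 'Machine')
    [Environment]::SetEnvironmentVariable('DOTNET_GCDynamicAdaptationMode', '1', 'Machine')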

Other information

Possibly related:
#110350
#105780

Apologies if this is a duplicate of either of those; it's a bit hard for me to tell, so I figured a separate issue report might be best.

dotnet-policy-service bot added the untriaged label Feb 5, 2025

dotnet-policy-service (Contributor) commented:

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

mangod9 (Member) commented Feb 5, 2025

Does this deadlock happen on startup? The fix for #105780 should be included in the latest .NET 9 servicing release.

The fix for #110350 is not done yet. In the meantime, disabling background GC should work as a temporary workaround.
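For reference, background (concurrent) GC can be disabled either with an environment variable (for processes started after it is set):

    set DOTNET_gcConcurrent=0

or with the documented System.GC.Concurrent property in runtimeconfig.json (a minimal sketch):

    {
      "runtimeOptions": {
        "configProperties": {
          "System.GC.Concurrent": false
        }
      }
    }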

scotttho-datacom (Author) commented:

It's not specifically a startup issue, and we should already be using the January servicing release of .NET 9. However, it looks like one of those tickets had its fix rolled back anyway.

I'll run some tests with background GC disabled and report back with the results next week. It will take a few days, as I'm not expecting much load on these servers until the middle of the week, and the hang only seems to happen when they are under high load.

mangod9 (Member) commented Feb 6, 2025

OK, sounds good. It looks related to #110350 then; will look at working on a fix for that.

scotttho-datacom (Author) commented Feb 12, 2025

So I have set the environment variable DOTNET_gcConcurrent=0 on all the processing servers. We had no issues for the last week, but then last night 3 processes hung on the same server (all at different times).

Initial analysis looks the same as before:

*******************************************************************************
*                                                                             *
*                        Exception Analysis                                   *
*                                                                             *
*******************************************************************************


KEY_VALUES_STRING: 1

    Key  : Analysis.CPU.mSec
    Value: 1656

    Key  : Analysis.Elapsed.mSec
    Value: 5480

    Key  : Analysis.IO.Other.Mb
    Value: 0

    Key  : Analysis.IO.Read.Mb
    Value: 1

    Key  : Analysis.IO.Write.Mb
    Value: 0

    Key  : Analysis.Init.CPU.mSec
    Value: 1109

    Key  : Analysis.Init.Elapsed.mSec
    Value: 29112

    Key  : Analysis.Memory.CommitPeak.Mb
    Value: 229

    Key  : Analysis.Version.DbgEng
    Value: 10.0.27725.1000

    Key  : Analysis.Version.Description
    Value: 10.2408.27.01 amd64fre

    Key  : Analysis.Version.Ext
    Value: 1.2408.27.1

    Key  : CLR.Engine
    Value: CORECLR

    Key  : CLR.Version
    Value: 9.0.24.52809

    Key  : Failure.Bucket
    Value: BREAKPOINT_80000003_coreclr.dll!SVR::gc_heap::wait_for_gc_done

    Key  : Failure.Hash
    Value: {e9e7019b-8511-725f-3866-9e920f973aa5}

    Key  : Failure.Source.FileLine
    Value: 14738

    Key  : Failure.Source.FilePath
    Value: D:\a\_work\1\s\src\coreclr\gc\gc.cpp

    Key  : Failure.Source.SourceServerCommand
    Value: raw.githubusercontent.com/dotnet/runtime/9d5a6a9aa463d6d10b0b0ba6d5982cc82f363dc3/src/coreclr/gc/gc.cpp

    Key  : Timeline.OS.Boot.DeltaSec
    Value: 2272696

    Key  : Timeline.Process.Start.DeltaSec
    Value: 25686

    Key  : WER.OS.Branch
    Value: rs1_release

    Key  : WER.OS.Version
    Value: 10.0.14393.6343

    Key  : WER.Process.Version
    Value: 2.7.212.1055


FILE_IN_CAB:  RSHL INTPLAY Hung Upgrade.DMP

NTGLOBALFLAG:  0

APPLICATION_VERIFIER_FLAGS:  0

EXCEPTION_RECORD:  (.exr -1)
ExceptionAddress: 0000000000000000
   ExceptionCode: 80000003 (Break instruction exception)
  ExceptionFlags: 00000000
NumberParameters: 0

FAULTING_THREAD:  00002e64

PROCESS_NAME:  Datascape.dll

ERROR_CODE: (NTSTATUS) 0x80000003 - {EXCEPTION}  Breakpoint  A breakpoint has been reached.

EXCEPTION_CODE_STR:  80000003

STACK_TEXT:  
0000005e`b577b658 00007ffe`07116d1f     : 00000000`00000000 00000000`00001528 00000000`00000060 00000000`00000048 : ntdll!NtWaitForSingleObject+0x14
0000005e`b577b660 00007ffd`d1ee0991     : 00000000`00000000 00000000`00000048 000001c8`00000000 00000000`00000244 : KERNELBASE!WaitForSingleObjectEx+0x8f
0000005e`b577b700 00007ffd`d1e22632     : 00000000`00000001 00000000`00000000 00000000`00000000 00007ffd`00000000 : coreclr!SVR::gc_heap::wait_for_gc_done+0x5d
0000005e`b577b730 00007ffd`d1f08819     : 000001c8`be16d488 000001c8`be1f6078 000001c8`be16d488 00000000`00000040 : coreclr!SVR::GCHeap::GarbageCollectGeneration+0xee
0000005e`b577b780 00007ffd`d1f44d96     : 00000000`00000000 000001c8`be16d488 00000000`00000040 000001c8`be1f6078 : coreclr!SVR::gc_heap::trigger_gc_for_alloc+0x15
0000005e`b577b7b0 00007ffd`d1f44c7e     : 00000000`00000000 00000000`00000000 00000000`00000000 000001c8`ffffffff : coreclr!SVR::gc_heap::try_allocate_more_space+0x656de
0000005e`b577b820 00007ffd`d1e30417     : 00000000`00000002 00000000`00000040 000001c8`be1f6078 000001c8`f9690628 : coreclr!SVR::gc_heap::allocate_more_space+0x65772
0000005e`b577b880 00007ffd`d1e58a71     : 00000000`00000002 00007ffd`7e757f10 000001c8`be1f6078 00007ffd`73438cdf : coreclr!SVR::GCHeap::Alloc+0xb7
0000005e`b577b8e0 00007ffd`d1e588cd     : 000001c8`ed9f81c8 00000000`00000000 0000005e`b577bb90 000001c8`f6118700 : coreclr!AllocateObject+0x101
0000005e`b577b970 00007ffd`7e45d4b7     : 00007ffd`7e756250 000001c9`0300cda0 000001c8`f6118750 00000000`00000000 : coreclr!JIT_New+0xdd
0000005e`b577bad0 00007ffd`7e756250     : 000001c9`0300cda0 000001c8`f6118750 00000000`00000000 0000005e`b577bad0 : EntityFramework!System.Data.Entity.Core.Mapping.ViewGeneration.Validation.ForeignConstraint.CheckIfConstraintMappedToForeignKeyAssociation+0x677
0000005e`b577bad8 000001c9`0300cda0     : 000001c8`f6118750 00000000`00000000 0000005e`b577bad0 00007ffd`7dd341be : 0x00007ffd`7e756250
0000005e`b577bae0 000001c8`f6118750     : 00000000`00000000 0000005e`b577bad0 00007ffd`7dd341be 00000000`00000000 : 0x000001c9`0300cda0
0000005e`b577bae8 00000000`00000000     : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x000001c8`f6118750


STACK_COMMAND:  ~0s; .ecxr ; kb

FAULTING_SOURCE_LINE:  D:\a\_work\1\s\src\coreclr\gc\gc.cpp

FAULTING_SOURCE_FILE:  D:\a\_work\1\s\src\coreclr\gc\gc.cpp

FAULTING_SOURCE_LINE_NUMBER:  14738

FAULTING_SOURCE_SRV_COMMAND:  https://raw.githubusercontent.com/dotnet/runtime/9d5a6a9aa463d6d10b0b0ba6d5982cc82f363dc3/src/coreclr/gc/gc.cpp

FAULTING_SOURCE_CODE:  
No source found for 'D:\a\_work\1\s\src\coreclr\gc\windows\gcenv.windows.cpp'


SYMBOL_NAME:  coreclr!SVR::gc_heap::wait_for_gc_done+5d

MODULE_NAME: coreclr

IMAGE_NAME:  coreclr.dll

FAILURE_BUCKET_ID:  BREAKPOINT_80000003_coreclr.dll!SVR::gc_heap::wait_for_gc_done

OS_VERSION:  10.0.14393.6343

BUILDLAB_STR:  rs1_release

OSPLATFORM_TYPE:  x64

OSNAME:  Windows 10

IMAGE_VERSION:  9.0.24.52809

FAILURE_ID_HASH:  {e9e7019b-8511-725f-3866-9e920f973aa5}

Followup:     MachineOwner

Looking at the stack traces, I do still see a few threads referencing bgc_thread_function. Does this suggest background GC is still enabled and my environment variable hasn't taken effect? (A quick way to check for these threads in a dump is sketched below.)
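For reference, one way to look for background GC threads in a dump is the !findstack extension (a sketch; the SVR-qualified symbol name assumes server GC):

    0:000> !findstack coreclr!SVR::gc_heap::bgc_thread_function 2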

Stacks Feb13.txt

mangod9 (Member) commented Feb 13, 2025

Yeah, it appears BGC was still enabled. You can check the environment variables using !peb in WinDbg.
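For reference, the environment block appears near the end of the !peb output; the excerpt below is illustrative rather than taken from the actual dump:

    0:000> !peb
    PEB at 0000005eb5776000
        ...
        Environment:
            ...
            DOTNET_gcConcurrent=0    <-- should be listed here if the variable reached the process
            ...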

scotttho-datacom (Author) commented:

Thanks. Yes, !peb confirms that the DOTNET_gcConcurrent environment variable was not present, which is a bit bizarre, so I'll need to dig into that. Elsewhere things seem to be fairly solid now, so we seem to be on the right track.

fabianoliver commented:
I think we faced the same issue.

I upgraded an application to .NET 9 a few weeks ago. We have around 60 different deployments with 2-3 pods each. Afterwards, on average every 3 or so days, some random pod would apparently just freeze entirely: no HTTP routes reachable anymore, logs stopping entirely, and then the pod quickly killed by k8s liveness probes.

For the time being, we've tried reverting to the old GC (COMPlus_GCName=libclrgc.so), which seems to have done the trick as well; in the ~10 days since changing over, I haven't observed any more hangs.
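For reference, in a Kubernetes deployment this is just an environment entry on the container spec; a minimal sketch (field placement is illustrative):

    # Fragment of a Deployment's container spec
    env:
      - name: COMPlus_GCName        # the DOTNET_ prefix also works on .NET 6+
        value: "libclrgc.so"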

Thanks for looking into this; we're keenly awaiting the fix, as I'm eager to revert to the current GC (and am hesitant to move further services onto .NET 9 for the time being).

mangod9 (Member) commented Mar 4, 2025

We have a fix in the works, which should be included in the next servicing release for .NET 9. Thanks.

jkotas (Member) commented Mar 4, 2025

@mangod9 Do you mean #110350 / #113055? I do not think that is going to fix any hangs on Linux; it is a Windows-specific change.

mangod9 (Member) commented Mar 4, 2025

Oh sorry, yeah, I was referring to the post from @scotttho-datacom. @fabianoliver, you are running into a different issue if it's on Linux (we had some fixes in the latest .NET 9 servicing releases; I'm assuming you are running the latest). If so, please create a separate issue with details (a dump or stack trace of the process would be helpful).
