FATAL_GC_ERROR produces hard to diagnose hangs or crashes #112599

Closed
jkotas opened this issue Feb 15, 2025 · 2 comments · Fixed by #112640
Labels
area-ExceptionHandling-coreclr in-pr There is an active PR which will close this issue when it is merged tenet-reliability Reliability/stability related issue (stress, load problems, etc.)

Comments

@jkotas
Member

jkotas commented Feb 15, 2025

The crashes caused by FATAL_GC_ERROR are very hard to diagnose. I helped somebody to diagnose one of these and it took more than a day to find out what's causing the problem.
 

Repro

  1. Add an unconditional FATAL_GC_ERROR(); to gc_heap::verify_free_lists
  2. Set DOTNET_HeapVerify=1
  3. Run a simple test that just calls GC.Collect on an x64 checked runtime

Actual behavior

This is one of the possible failure modes. I have also seen other asserts or hangs with more complex tests.

Assert failure(PID 24832 [0x00006100], Thread: 16656 [0x4110]): unbreakableLockCount == m_pThread->GetUnbreakableLockCount() || (!m_pThread->HasUnbreakableLock() && !m_pThread->HasThreadStateNC(Thread::TSNC_OwnsSpinLock))

CORECLR! FCallCheck::~FCallCheck + 0x40 (0x00007ffa`68c11370)
CORECLR! CallSettingFrameEncoded + 0x2A (0x00007ffa`69185c8a)
CORECLR! _FrameHandler4::FrameUnwindToState + 0x28E (0x00007ffa`691844ce)
CORECLR! _FrameHandler4::FrameUnwindToEmptyState + 0x4B (0x00007ffa`6917c1eb)
CORECLR! _InternalCxxFrameHandler<__FrameHandler4> + 0x283 (0x00007ffa`691829c3)
CORECLR! _InternalCxxFrameHandlerWrapper<__FrameHandler4> + 0x6A (0x00007ffa`69182cba)
CORECLR! _CxxFrameHandler4 + 0xFB (0x00007ffa`6917d04b)
CORECLR! _GSHandlerCheck_EH4 + 0x90 (0x00007ffa`69179070)
NTDLL! chkstk + 0x11F (0x00007ffb`04a43f8f)
NTDLL! RtlUnwindEx + 0x352 (0x00007ffb`048f4d22)
    File: C:\runtime\src\coreclr\vm\fcall.cpp:196
    Image: C:\runtime\artifacts\bin\coreclr\windows.x64.Checked\corerun.exe

Expected behavior

An error message that indicates a fatal GC error. No hangs or unrelated crashes.

@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Feb 15, 2025
@jkotas jkotas added area-ExceptionHandling-coreclr tenet-reliability Reliability/stability related issue (stress, load problems, etc.) and removed area-GC-coreclr untriaged New issue has not been triaged by the area owner labels Feb 15, 2025
Contributor

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

@jkotas
Member Author

jkotas commented Feb 15, 2025

cc @janvorli

Can we get the exception handling out of the way when we hit breakpoints in the GC, so that these fatal errors crash cleanly?

@jkotas jkotas added this to the 10.0.0 milestone Feb 15, 2025
janvorli added a commit to janvorli/runtime that referenced this issue Feb 17, 2025
The new exception handling doesn't work well with DebugBreak in some
cases, e.g. when it is invoked from FATAL_GC_ERROR. The new EH attempts
to handle the STATUS_BREAKPOINT stemming from the DebugBreak, tries to
allocate a managed exception object, and hangs because it cannot do that
while the GC is running.

The cause is a missing check for the breakpoint exception in
ProcessCLRExceptionNew; the check is present in the old
ProcessCLRException. To fix it, I've copied that code to
ProcessCLRExceptionNew.

Close dotnet#112599
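
For readers who have not looked at the handler, the shape of the check described in the commit message can be sketched roughly as follows. This is a simplified model, not the runtime's code: `Disposition`, `DispatchAsManagedException`, and `ProcessExceptionSketch` are hypothetical names invented for illustration, and the real ProcessCLRExceptionNew takes different parameters. The only value taken from the platform is the STATUS_BREAKPOINT code, 0x80000003.

```cpp
#include <cstdint>
#include <cstdio>

// Windows exception code raised by DebugBreak.
constexpr std::uint32_t kStatusBreakpoint = 0x80000003u;
// An ordinary fault code, used here only for contrast.
constexpr std::uint32_t kStatusAccessViolation = 0xC0000005u;

// Hypothetical stand-in for the handler's result.
enum class Disposition { ContinueSearch, HandledAsManaged };

// Hypothetical stand-in for the path that turns a fault into a managed
// exception. Per the commit message, the real path allocates a managed
// exception object, which cannot be done while the GC is running.
Disposition DispatchAsManagedException(std::uint32_t code) {
    std::printf("dispatching 0x%08X as a managed exception\n",
                static_cast<unsigned>(code));
    return Disposition::HandledAsManaged;
}

// Simplified model of the fixed behavior: breakpoints are never handed
// to the managed exception path, so they reach the debugger or crash
// the process cleanly instead of hanging.
Disposition ProcessExceptionSketch(std::uint32_t exceptionCode) {
    if (exceptionCode == kStatusBreakpoint) {
        return Disposition::ContinueSearch;
    }
    return DispatchAsManagedException(exceptionCode);
}
```

In this model, `ProcessExceptionSketch(kStatusBreakpoint)` returns `Disposition::ContinueSearch` without touching the allocation path, while any other code still goes through `DispatchAsManagedException`.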
@dotnet-policy-service dotnet-policy-service bot added the in-pr There is an active PR which will close this issue when it is merged label Feb 17, 2025
jkotas pushed a commit that referenced this issue Feb 19, 2025
* Fix new EH hang on DebugBreak

Close #112599

* Never process breakpoints via the new EH
2 participants