-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Hang during shutdown with Native AOT on IReferenceTrackerHost::ReleaseDisconnectedReferenceSources waiting for finalizers #109538
Comments
Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas |
cc. @AaronRobinsonMSFT @jkoritzinsky perhaps you might also have thoughts on this or some knowledge to share? 😄 |
Also + @VSadov in case this could be related to suspension |
It's not related to suspension, but definitely an area where expertise from @VSadov would help. Looks like native AOT will just wait on an event using a OS API and that's all there's to it. CoreCLR does a lot more to figure out how exactly to wait (Does the thread have a SynchronizationContext? Do we need to pump window messages with MsgWaitForMultipleObjects? etc.). As usual, this is a place that is a lot more complicated in CoreCLR VM and it's hard to say which part is relevant and how much of it we need to implement. runtime/src/coreclr/vm/threads.cpp Lines 3201 to 3484 in 3116db9
|
I think waiting with "alertable" is to allow a sleeping/waiting thread to react to Line 402 in 6d23ef4
I suspect the culprit is something with COM and message pumping, but it looks like we call |
CoreCLR may be doing the wait via user installed synchronization context. It would be useful to find out the callstack of the wait in CoreCLR. |
I did have a chat with some COM folks. On ASTA threads such as this one, My initial guess was the |
You can build a local native AOT package that changes this to alertable=true and try to repro it with that to prove or disprove this hypothesis. |
If there's something you'd like to try, here are the accelerated steps:
Then publish your project with native AOT as usual (might want to delete all of bin/obj before since this is not incremental), but set IlcSdkPath property like this |
This is not native AOT specific. I stepped through CoreCLR VM version of this. We end up taking a path where we wait like this:
There's a difference between native AOT and CoreCLR - on CoreCLR we pass In the end it doesn't make a difference because they both deadlock. |
Who has the next step here? If this is possible to hit in CoreCLR as well, is there a latent bug in this code? Do we know why we're hitting it more frequently in Native AOT? |
Yeah @manodasanW could consistently repro the same issue in CoreCLR as well, it seems this is not unique to NativeAOT. They're both broken in ASTA scenarios it would seem. Unclear why this is more consistent with NativeAOT. My understanding is the next steps are to debug the app shutdown with either .NETCore.Uwp or .NET Native (likely .NET Native) and compare traces and see if it's doing anything significantly different, and if so, figure out why that was changed. One hypothesis is that perhaps old .NET and .NET Native were not running finalizers at all upon shutdown (that would seem unlikely though), or perhaps doing that but not doing context switches in that scenario. There's also the additional caveat that this is technically not even a "shutdown" scenario, it's just PLM notifying the app that it's being suspended. This may or may not mean the app will then be closed, but it's not guaranteed. So not sure if there's even a way to tell. Checking what .NET Native was doing in an equivalent scenario should hopefully give us more information though 🤔 |
My hypothesis is that this is a race condition that was always there, but that was made more likely to be hit due to some unrelated changes (or by things being faster in general). |
The .NET Native side doesn't seem to be calling WaitForPendingFinalizers (the necessary condition for the hang). The closest I found is this: https://github.com/dotnet/corert/blob/a8830fe5158c499a75b19239524a6d1092a679fe/src/System.Private.Interop/src/Shared/RCWWalker.cs#L361-L369 |
That's quite surprising 👀
|
Do not block main app thread by waiting for finalizers. Wait for the finalizers on a dedicated thread so that the rest of the app can run undisturbed. |
But isn't the issue that PLM is requesting the app to suspend and that's what's triggered that reference tracking callback, and we have to block to ensure the suspension event handler doesn't complete before we're done? Basically if we used a different thread to wait, would that be doing anything at all vs. just letting things finalize on their own? We wouldn't be able to signal back to Jupiter when that's done anyway since we'd no longer be inside that wait callback anymore, right? Am I missing something? 😅 |
You can run a message pumping loop on the app thread that allows the remote calls through and ends when the finalization is finished. I am not sure whether the reference tracking callback is compatible with that. Is the reference tracking callback the only way for the app to get notified about pending suspend or is there a lower-level API that the app can subscribe to get notified? If there is a lower-level API like that, it may be a better option. |
Mmh good question. @AaronRobinsonMSFT @jkoritzinsky @manodasanW any thoughts on this? Related question for Jon and Keith: the docs for Eg. would it be fine and/or better than .NET Native's approach to at least eg. call a blocking full GC here to try to reclaim as much memory as possible as a best effort, and then any pending finalizers would just finish running when/if the process is potentially resumed? |
@jkotas would it be reasonable to do this:
|
Useful links:
_Check_return_ HRESULT DXamlCore::OnAfterAppSuspend()
{
HRESULT hr = S_OK;
// We request additional suspend time only for managed apps because
// GC is an asynchronous operation that can be slow.
if (ReferenceTrackerManager::HaveHost())
{
// Do a GC and wait for finalizers.
IGNOREHR(ReferenceTrackerManager::TriggerCollectionForSuspend());
IGNOREHR(ReferenceTrackerManager::TriggerFinalization());
}
// Flush the release queue
// We make this call for both native and managed apps because the queue
// is not guaranteed to be empty for native apps (for example, if some
// objects were recently released off thread).
ReleaseQueuedObjects();
if (m_pControl)
{
IFC(m_pControl->GetCoreServices()->OnSuspend(false /* isTriggeredByResourceTimer */, true /* allowOfferResources */));
}
Cleanup:
return hr;
} So it feels like (and the comments also say that), waiting for finalizers is just "recommended", not strictly required. |
Sounds reasonable to me.
I think that the proper fix may need to be in CsWinRT or Xaml (partially at least). |
…tedReferenceSources' (#110551) This PR updates the IReferenceTrackerHost::ReleaseDisconnectedReferenceSources implementation for CoreCLR and NativeAOT to match what .NET Native was doing, and not wait for finalizers, to avoid deadlocks in ASTA scenarios (UWP). Finalizers will just continue running normally, and if the process is suspended (which can only happen on UWP anyway), they'll resume when the process is resumed later on. Worst case scenario this would only cause a non-optimal memory use cleanup before suspension (which is better than a deadlock anyway). Can be revisited and improved for .NET 10 if needed. Contributes to #109538
…tedReferenceSources' (dotnet#110551) This PR updates the IReferenceTrackerHost::ReleaseDisconnectedReferenceSources implementation for CoreCLR and NativeAOT to match what .NET Native was doing, and not wait for finalizers, to avoid deadlocks in ASTA scenarios (UWP). Finalizers will just continue running normally, and if the process is suspended (which can only happen on UWP anyway), they'll resume when the process is resumed later on. Worst case scenario this would only cause a non-optimal memory use cleanup before suspension (which is better than a deadlock anyway). Can be revisited and improved for .NET 10 if needed. Contributes to dotnet#109538
I'm going to close this since #110551 fixed this to work same as .NET Native and given:
we don't know how the actual fix is supposed to look like (i.e. XAML/CsWinRT might need to define how this is actually supposed to work; we might even need new APIs or flags - who knows). There is no current known work on this within dotnet/runtime. |
…tedReferenceSources' (dotnet#110551) This PR updates the IReferenceTrackerHost::ReleaseDisconnectedReferenceSources implementation for CoreCLR and NativeAOT to match what .NET Native was doing, and not wait for finalizers, to avoid deadlocks in ASTA scenarios (UWP). Finalizers will just continue running normally, and if the process is suspended (which can only happen on UWP anyway), they'll resume when the process is resumed later on. Worst case scenario this would only cause a non-optimal memory use cleanup before suspension (which is better than a deadlock anyway). Can be revisited and improved for .NET 10 if needed. Contributes to dotnet#109538
Description
We're hitting a 100% consistent hang during application shutdown, only on Native AOT. It seems that the finalizer thread and the UI thread (ASTA) are possibly in a deadlock, resulting in the application process remaining alive after closing the main window. After a few seconds, Windows proceeds to kill the process, which shows up in WER as a hang (which is expected). Only repros with Native AOT, whereas CoreCLR seems to work fine.
Reproduction Steps
I don't have a minimal repro. Please ping me on Teams for instructions on how to deploy the Store locally to repro. Alternatively, I can also share an MSIX package for sideloading, with instructions on how to install it for testing (and how to restore the retail Store after that).
Here is a memory dump on the process during the hang (process was paused from WinDbg on the presumed deadlock).
Expected behavior
The application should shutdown correctly when closing the window.
Actual behavior
Here's the two relevant stacktraces I see in WinDbg.
Finalizer thread (!FinalizerStart) (click to expand)
UI thread (shcore!_WrapperThreadProc ApplicationView ASTA) (click to expand)
It seems that:
ComWrappers::IReferenceTrackerHost::ReleaseDisconnectedReferenceSources
is calledGC::WaitForPendingFinalizers
ObjectReferenceWithContext<T>
objectCallInContext
(here)WaitForMultipleObjectsEx
Some potentially relevant differences we noticed in the finalizer logic across CoreCLR (which works fine) and NativeAOT:
CoreCLR:
runtime/src/coreclr/vm/finalizerthread.cpp
Lines 493 to 548 in 302e0d4
Native AOT:
runtime/src/coreclr/nativeaot/Runtime/FinalizerHelpers.cpp
Lines 115 to 151 in 302e0d4
It seems that they're similar, however:
alertable: TRUE
alertable: FALSE
allowReentrantWait: TRUE
param, which makes it callCoWaitForMultipleHandles
hereNot sure whether that's intentional (why?) and whether it's related to the issue, just something we noticed.
Regression?
No.
Known Workarounds
None, this is a blocker 😅
Configuration
The text was updated successfully, but these errors were encountered: