Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a way to capture runtime fatal errors #3375

Closed
pakrym opened this issue Dec 5, 2018 · 30 comments
Closed

Add a way to capture runtime fatal errors #3375

pakrym opened this issue Dec 5, 2018 · 30 comments
Assignees
Milestone

Comments

@pakrym
Copy link
Contributor

pakrym commented Dec 5, 2018

As discussed in https://github.com/dotnet/core-setup/issues/4455#issuecomment-422214003 it would be nice for the host to be able to capture runtime fatal errors (including unhandled exception).

Original motivation: In ANCM we are trying to prevent the worker process from crashing by means of catching all native exceptions coming from hostfxr_main but doing so overrides coreclr unhandled exception handler that is logging to event log making diagnostics of unhandle exception very hard for customers.

@vitek-karas @jkotas

@jeffschwMSFT
Copy link
Member

@sdmaclea can you take a look? I was chatting with @davidfowl and there is an important web scenario that we should consider.

cc @jkotas

@vitek-karas
Copy link
Member

There are 2 parts to this:

  • Capturing runtime failures (so typically fatal runtime errors which cause immediate shutdown). The solution for this is probably relatively simple:
    Looked at how CLR does reporting of unhandled exceptions and it basically just writes to STDERR. We could add the ability to "forward" this output to the host somehow.
    As far as I can tell this would potentially redirect all STDERR from the runtime - not just unhandled exceptions. We could basically highjack the PrintToStdErr (and friends).
    The downside is that this would be thread agnostic - so the host would get all messages from all threads with no ordering guarantees (just like we write it to console).
    This would be unlike our potential solution for the hostfxr/hostpolicy which will be thread aware (it has to be as it's going to be used also for the runtime calling into host).

  • Capturing unhandled managed exceptions - this is trickier. The actual requirement from ASP.NET is to be able to catch these exception (and keep the process alive) and also capture the detailed error information about the exception. Currently the runtime has no way to do both. If the exception is handled, runtime does nothing special (it's basically not in the picture) - this includes cases where native code hosting the app catches the exception via SEH. Or if the exception is truly unhandled it will produce the detailed info (exception message and callstack) - this can be captured relatively easily, but then it forcefully exits the process.
    The design issue here is that in general we don't want to leave the process in a bad state, and having unhandled managed exception while keeping the process alive is a bad state (it could have been OOM, or other really bad exception which potentially leaves the process in inconsistent state).

Which part is more important for ASP.NET scenarios? The first can probably be done without much hassle, the second one is much more tricky.

@pakrym
Copy link
Contributor Author

pakrym commented Jan 15, 2019

Our main goal is to get exception information to customers.

The problem that we are hitting is that in addition to that we need to prevent the hosting process from crashing. To accomplish that we catch exceptions coming from the thread we run hostfxr_main on, this turns exception from unhandled to handled and prevents it from going through the mechanism described in point 1 and being logged anywhere stderr or event log.

All we really want is to get CLR to log Event Log message without crashing the process.

@jkotas
Copy link
Member

jkotas commented Jan 15, 2019

And what do you want CLR to do with the unhandled exception? Swallow it and hope for the best?

@vitek-karas
Copy link
Member

+1 to Jan's comment.

Or in more detail:
What should happen on the thread were the unhandled exception is thrown?
Note that currently there's no way to "continue" running any code on it, since there's nowhere to jump (no handler).

@pakrym
Copy link
Contributor Author

pakrym commented Jan 15, 2019

That's what we do now anyway. Catch it and hope for the best.

@jkotas
Copy link
Member

jkotas commented Jan 15, 2019

You are basically asking for LegacyUnhandledExceptionPolicy that was the default in .NET Framework 1.0. It is still available in CoreCLR a COMPlus config.

The problem with swallowing unhandled exceptions is that they tend gradually corrupt the process state and the process will crash anyway in much harder to diagnose way. This was explained in https://docs.microsoft.com/en-us/dotnet/standard/threading/exceptions-in-managed-threads .

If you really want to use LegacyUnhandledExceptionPolicy, you will be responsible for the support burden than comes with it.

@davidfowl
Copy link
Member

For context, here's what the user experience is like today dotnet/aspnetcore#5153. The main thing we need to handle is exceptions in Main. Exceptions on background threads can keep crashing as usual (those also log to the event log on windows which we want).

@vitek-karas
Copy link
Member

Can you wrap Main in that case - and just add try/catch around it - that feels like the safest solution.

@davidfowl
Copy link
Member

Can you wrap Main in that case - and just add try/catch around it - that feels like the safest solution.

No, because we don't want to change user code to get this behavior. We need to try/catch, then set some state in native code so that we can show an error until there is a change. The flow is that the user makes a request, then the CLR fails to boot (because of the error in main), then we set state to make future requests fail until there's a app/configuration change.

We want the CLR to log errors as normal in this case but we want to get control back so that we can set the state to make future requests fail.

What do you suggest?

@vitek-karas
Copy link
Member

What about this:
Instead of executing the application via "Execute assembly", start the runtime with the app but call into a framework functions - something like AspNet.Main which would do a try/catch and call the real application Main via reflection. This would not change user code.
Since you're already hosting the runtime in a custom way (sort of), further changes to the hosting code don't feel too disruptive.
The AspNet.Main in case of an exception would:

  • Log information about the exception - while this is not exactly what CoreCLR does, it should be reasonably easy to do something similar (since Exception.ToString would work, unlike in the native code).
  • Return some exit code, or some other mechanism through which it would tell the hosting layer that the failure occurred and that it should set this "fail all requests" flag.

@davidfowl
Copy link
Member

We don't want to write the code to find Main ourselves and call into it. We've basically been down this style of road before where we end up re-implementing parts of the CLR that already work to tweak 0.000001% of the logic, I'd like to avoid doing that here again. Doing this would require new methods in hostfxr that doesn't run main but instead returns a CLR host we can mutate and further configure before calling into the app.

We need a simple way to pass your exception handler enough state to avoid the code that aborts the process. I think we should look for solutions in that line of thinking.

@jeffschwMSFT
Copy link
Member

@davidfowl have you considered using the startup_hook and providing an unhandled exception filter: https://docs.microsoft.com/en-us/dotnet/api/system.appdomain.unhandledexception?view=netcore-2.2. That could do the necessary handling as the exception occured, though the main thread would still crash.

@davidfowl
Copy link
Member

@davidfowl have you considered using the startup_hook and providing an unhandled exception filter: https://docs.microsoft.com/en-us/dotnet/api/system.appdomain.unhandledexception?view=netcore-2.2. That could do the necessary handling as the exception occured, though the main thread would still crash.

Yes and yes. Using the startup hook I believe doesn't give a way to stop the crash which is what we need. I did consider this though. Ideally in a hook, we'd be able to wrap Main with a try/catch.

@jkotas
Copy link
Member

jkotas commented Jan 15, 2019

@davidfowl What would be the API or config setting you would propose to solve this problem that works for you?

@davidfowl
Copy link
Member

There would be a config setting/environment variable that would set a thread local on the thread that executes the Main entrypoint. That would flow to the exception handler logic and would disable the abort call.

COMPlus_DisableAbortOnExceptionsOnMainThread=1

@jkotas
Copy link
Member

jkotas commented Jan 15, 2019

That would flow to the exception handler logic and would disable the abort call.

Could you please elaborate?

Today, we have a global unhandled exception handler. It writes to event log, invokes Watson and aborts the process. We can remove the "invokes Watson and aborts the process" part and let the thread exit without crashing the process, but I am not sure whether it is actually what you want.

@davidfowl
Copy link
Member

We can remove the "invokes Watson and aborts the process" part and let the thread exit without crashing the process, but I am not sure whether it is actually what you want.

Yes, we want to avoid crashing the process on the main thread. Everything else is fine.

@jkotas
Copy link
Member

jkotas commented Jan 15, 2019

Hmm, there does not seem to be a way for us to exit the thread cleanly once the exception once it becomes unhandled. We can block the thread indefinitely that would prevent it from crashing the process.

Also, the environment variable would be inherited by all child processes...

@pakrym
Copy link
Contributor Author

pakrym commented Jan 16, 2019

We can block the thread indefinitely that would prevent it from crashing the process.

This won't work as host process won't be able to distinguish exception in main from server never starting and display the correct message.

@sdmaclea
Copy link
Contributor

Instead of executing the application via "Execute assembly", start the runtime with the app but call into a framework functions - something like AspNet.Main which would do a try/catch and call the real application Main via reflection. This would not change user code.

We don't want to write the code to find Main ourselves and call into it.

Hmm, there does not seem to be a way for us to exit the thread cleanly once the exception once it becomes unhandled

I think something like @vitek-karas idea is probably the solution. @davidfowl is right having this in ASP.NET is bad from a coupling perspective. Is there any reason we couldn't just implement this in the runtime? The try catch could write to a native buffer for tracing and return an error code corresponding to uncaught exception and disable the Watson + Abort.

@davidfowl
Copy link
Member

Right, its mostly there we just need to disable the abort on the main thread. We're ok with background threads crashing the process.

@jkotas
Copy link
Member

jkotas commented Jan 16, 2019

The try catch could write to a native buffer for tracing and return an error code corresponding to uncaught exception and disable the Watson + Abort.

I assume that we would add new methods on coreclr / hostfxr to do this. Correct?

@sdmaclea
Copy link
Contributor

I assume that we would add new methods on coreclr / hostfxr to do this. Correct?

That would be the plan I think. Possibly in an extensible way taking a options struct with a old school struct size.

@davidfowl
Copy link
Member

This needs to be part of the 3.0 milestone, what we have today would have been a ship stopper had all the right parties been notified

@MeiChin-Tsai
Copy link

Sorry to coming to this thread late. Do we have a proposal? I will suggest to separate two things -

  1. user unhandled exception
  2. runtime itself unhandled exception

Do we have a way to separate them? To be honest to you, in 2, continue really does not buy anything to us.

@muratg
Copy link

muratg commented Feb 27, 2019

@davidfowl ping

@sdmaclea sdmaclea self-assigned this Feb 27, 2019
@sdmaclea
Copy link
Contributor

We had a meeting to discuss this. We briefly discussed the requirements and the proposed solution. The next step is to document the proposed design. I have been focusing on infrastructure, but will be able to make progress next week.

@sdmaclea
Copy link
Contributor

sdmaclea commented Apr 4, 2019

We added dotnet/coreclr#23054. I was thinking that was all that was remaining. Can this be closed or do we have another action item?

@sdmaclea
Copy link
Contributor

Fixed by dotnet/coreclr#23054

@msftgits msftgits transferred this issue from dotnet/core-setup Jan 30, 2020
@msftgits msftgits added this to the 3.0 milestone Jan 30, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 14, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

9 participants