Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exception stacks #28878

Merged
merged 16 commits into from
Oct 23, 2018
Merged

Exception stacks #28878

merged 16 commits into from
Oct 23, 2018

Conversation

c42f
Copy link
Member

@c42f c42f commented Aug 24, 2018

An exception stack system for julia

Here's my attempt to fix #12485 / #19979 by introducing an exception stack
system. The basic problem is that you may throw a new exception while handling
an exception. What do we do then? The existing runtime simply overwrites the
first exception, loosing any information it might have had about the root
cause.

A distinct problem discussed in #12485 is that task switching looses all
exception and backtrace information. I've tackled this at the same time
though in hindsight it might have been better to do the exception stack
first, followed by fixing task switching.

Design

To solve the first problem, I've implemented an exception stack mechanism.
Conceptually:

  • Every throw() pushes an (exception,backtrace) pair onto a stack of current
    exceptions.
  • When you exit a catch block normally, any exception generated within the
    corresponding try is considered to have been handled. The stack is therefore
    popped back to the state at the entry of the corresponding try.
  • The full stack is available for root cause analysis via Base.catch_stack().

For example:

try
    error("A")
catch
    try
        error("B")
    catch
        for (e, bt) in Base.catch_stack()
            showerror(stdout, e, bt)
            println()
        end
    end
end

Produces

A
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] top-level scope at ./REPL[3]:2
 [3] eval(::Module, ::Any) at ./boot.jl:319
 [4] eval_user_input(::Any, ::REPL.REPLBackend) at /home/chris/dev/julia/usr/share/julia/stdlib/v1.0/REPL/src/REPL.jl:85
 [5] macro expansion at /home/chris/dev/julia/usr/share/julia/stdlib/v1.0/REPL/src/REPL.jl:117 [inlined]
 [6] (::getfield(REPL, Symbol("##28#29")){REPL.REPLBackend})() at ./task.jl:259
B
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] top-level scope at ./REPL[3]:5
 [3] eval(::Module, ::Any) at ./boot.jl:319
 [4] eval_user_input(::Any, ::REPL.REPLBackend) at /home/chris/dev/julia/usr/share/julia/stdlib/v1.0/REPL/src/REPL.jl:85
 [5] macro expansion at /home/chris/dev/julia/usr/share/julia/stdlib/v1.0/REPL/src/REPL.jl:117 [inlined]
 [6] (::getfield(REPL, Symbol("##28#29")){REPL.REPLBackend})() at ./task.jl:259

Implementation

Lowering

A new lowered expression head :pop_exc is introduced and emitted in any
location where a catch block exits normally (either by falling out the bottom,
or using return, break, goto). This pops the exception stack back to the state
of the associated enter. There needs to be some way to track the associated
enter in the IR, so I chose to emit an SSAValue token from enter which is
consumed by pop_exc. I tried various alternatives, but this was by far the
nicest in terms of non-disruptive integration into the SSAIR processing code,
and supporting both the interpreter and codegen.

finally lowering no longer includes a foreigncall, so expressions using
finally can now be interpreted. This includes the output of @test so may
give a nice speed boost.

Runtime

  • The exception in transit is no longer a single pointer in the thread local
    storage. It's stored in ptls->current_task->exc_stack instead, along with
    associated backtrace. Space for the exception and backtrace is allocated
    on demand inside throw_internal, after changing GC mode.

  • The exception stack is copied out onto the Task naturally preserved during task switching.
    I've been back and forth on several variations here, as the exact detail
    of how this is done depends on how the memory for the exception
    stack is allocated (GC vs plain malloc).

  • jl_current_exception() is now the method to retreive the current exception
    from within a JL_CATCH dynamic context.

  • _resetstkoflw() handling is consolidated into jl_eh_restore_state.
    Needs proper testing on windows.

  • Several places where manual restoration of the exception and backtrace was
    done are no longer necessary because the stack system handles these
    automatically.

  • An exception stack type, jl_exc_stack_t and associated manipulation
    functions has been added. Conceptually stores a stack of
    (exception,backtrace) pairs. It's stored in a shared buffer so we can avoid
    reallocating when throwing and catching the same exception or set of
    exceptions repeatedly.

  • jl_eh_restore_state has become non-inline. It seemed good to get this out
    of the public header.

  • jl_sig_throw is now used with jl_throw_in_ctx from signal handlers, to
    make the special circumstances clear and to avoid conflation with rethrow
    which now simply rethrows the existing exception stack.

Interpreter / Codegen

Some minor rearrangement in the interpreter to make the eh buffer management
and leave handling a little easier to understand as I found this surprisingly
confusing.

Mostly small changes in codegen. Representing the enter token as an SSAValue
really integrates well here when you need to retreive it for the associated
pop_exc.

GC

What? How did this rabbit hole lead down into the GC?

Unfortunately the raw backtrace buffers contain references to GC allocated
objects. Since these buffers can now live on a Task, their contents need to
be GC rooted. Unfortunately the raw buffers are also in a non-julia native
format, so I had to write specialized GC code for them.

I'm not sure this was a good idea, especially having extra code to maintain in
this rather sensitive part of the codebase. Perhaps it would be better to
modify our raw backtrace buffer format so that it can be understood by the GC
without major changes.

TODO

  • Check some potential GC issues I found in testing.
  • Decide on detail of Base.catch_stack() API and export it.
  • Double check consolidated _resetstkoflw handling on windows (checked on cross compiled version via wine on ubuntu)
  • Tests for exception preservation during task switching
  • Reinstate rooting of temporary bt_data
  • Check that access to backtraces for failed Tasks still works correctly and figure out how to access the full exception stack for those.

Optional

  • Reconsider best practices for rethrow(exc) vs just throwing a new exception. Can be done separately
  • jl_apply_with_saved_exception_state may be unnecessary now
  • jl_rethrow_other is probably unnecessary now is required for rethrow(ex)
  • Persist backtrace when task exits with an exception

@iamed2
Copy link
Contributor

iamed2 commented Aug 24, 2018

I won't comment on the technical aspects of the solution but I like it conceptually! And I'm happy to see those issues resolved, they made error handling in Dispatcher and stacktrace generation in Memento tricky business.

@ararslan ararslan added the error handling Handling of exceptions by Julia or the user label Aug 24, 2018
@c42f
Copy link
Member Author

c42f commented Aug 25, 2018

@iamed2 Hah yes I originally came across this when working on logging.

To add to the commentary above, One of my biggest difficulties with this patch was the need to GC root objects in the bt buffer, and also GC the buffer itself in the case that it's attached to a Task. Combine this with the need to actually fill the buffer from inside a signal handler (and potentially other restricted contexts) and you're in a tricky situation. There's probably several holes I've yet to plug there.

Actually I hope to delete most of the GC changes which were done here as they seem like overkill. My current best idea is to use the new efficient array-of-small-Union support to represent the raw backtrace buffer as a julia array of Union{Ptr{Nothing},UInt,CodeInfo,...} which could be filled without allocation and with only a little more book keeping than what we currently do. Need to sort out whether we can call jl_gc_wb in the contexts where we fill the buffer.

Is this a sound idea? Advice on how to improve the memory management in this patch would be particularly welcome.

@c42f c42f force-pushed the cjf/exception-stack branch 2 times, most recently from af0afe3 to f76eb32 Compare August 25, 2018 07:55
@c42f
Copy link
Member Author

c42f commented Aug 25, 2018

Ok, CI failures are because I've got something wrong in GC land and the debug build is failing on an assertion. My best guess is that I've broken whatever invariants jl_gc_alloc_buf are meant to obey.

Anyway, the release build does "work" so the branch is in a state where the curious can play with it until the real problem is fixed.

@c42f
Copy link
Member Author

c42f commented Aug 27, 2018

Looking into alternative storage for backtrace buffers, the Union isn't going to work because we really need to store pointers in there. So perhaps we're stuck with two parallel arrays instead like the current bt, bt2 which appears in some places. Or just put up with having some special purpose code in the gc.

@JeffBezanson
Copy link
Member

Thanks for tackling this; great work! I think this is basically the right thing. I'll review in detail at some point. A bit of custom code for GC marking doesn't seem so bad to me.

@JeffBezanson JeffBezanson added this to the 1.1 milestone Sep 6, 2018
@c42f c42f force-pushed the cjf/exception-stack branch 2 times, most recently from aca0df5 to 9d51768 Compare September 14, 2018 14:09
src/gc.c Outdated
@@ -1901,7 +1901,7 @@ excstack: {
size_t i = stackitr->i;
while (itr > 0) {
size_t bt_size = jl_exc_stack_bt_size(exc_stack, itr);
uint64_t *bt_data = jl_exc_stack_bt_data(exc_stack, itr);
uintptr_t *bt_data = jl_exc_stack_bt_data(exc_stack, itr);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hah, this was a hilariously stupid bug which cost a lot of time due to randomly busted GC on 32 bit, I don't know what I was thinking using uint64_t here :-/

It would have been caught early if we were able to use -Werror on CI builds but I suspect that is quite disruptive in other ways.

Copy link
Member

@StefanKarpinski StefanKarpinski Oct 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would have been caught early if we were able to use -Werror on CI builds but I suspect that is quite disruptive in other ways.

Please, we should absolutely do this. We've had a creeping increase of compiler warnings in the past month or so.

@c42f
Copy link
Member Author

c42f commented Sep 18, 2018

Hmm, just encountered a spurious test failure on 32 bit due to #27788.

@c42f c42f force-pushed the cjf/exception-stack branch from b853caf to bcc7bab Compare September 18, 2018 14:03
@c42f
Copy link
Member Author

c42f commented Sep 18, 2018

Tests have finally passed at least once on all systems and I've done a fair amount of cleanup. Should be good for a closer look.

I've exported catch_stack() for now, it could do with a better name / better cohesion with the catch_backtrace() API.

@c42f
Copy link
Member Author

c42f commented Sep 19, 2018

Ok, tests have passed after restarting the appveyor x86_64 build three times. There is an intermittent failure in a single one of the spawn tests on that platform related to deleting a directory, see https://ci.appveyor.com/project/JuliaLang/julia/build/1.0.30766/job/wnxk1pxa27xte6un for example.

This seems to be the test from #26687 coming back again after #28107 was merged.

The fact that it's recurring here more frequently could be related to the changes in this PR.

@c42f c42f force-pushed the cjf/exception-stack branch from ebf472a to 66e934c Compare October 3, 2018 05:41
@c42f
Copy link
Member Author

c42f commented Oct 3, 2018

I just rebased this again to clean up some recent conflicts, and to consolidate a lot of WIP commits into a small number of consistent changes.

This reminded me of an interesting side effect of this patch which I should point out — code generated by @test should now be able to run in the interpreter. This used not to work because @test emits a finally block, and the lowering of finally previously contained a :foreigncall to jl_rethrow_other which caused the interpreter to be bypassed.

c42f added 4 commits October 6, 2018 07:45
The logic for this was repeated in three quite different ways, in
JL_CATCH, interpreter and codegen but putting it into the runtime in
jl_eh_restore_state should be sufficient.

Also move jl_eh_restore_state out of main header - seems unnecessary to
expose such implementation detail, and inconsistent with other eh
functions.
* A new lowered expression head :pop_exc is introduced and emitted in any
  location where a catch block exits normally (either by stepping out,
  or using return, break, goto). The semantics of :pop_exc are to pop
  the exception stack back to the state of the associated enter.

* Make Expr(:enter) return a token which may be consumed by :pop_exc,
  thereby allowing the interpreter and codegen to know which :enter
  state should be used to pop the exception stack.  I tried various
  alternatives for this association, but this was by far the nicest in
  terms of non-disruptive integration into the SSAIR processing code,
  and supporting both the interpreter and codegen.
Runtime
-------

* An exception stack type, `jl_exc_stack_t` and associated manipulation functions
  has been added. Conceptually this stores a stack of (exception,backtrace)
  pairs. It's stored in a contiguous buffer so we can avoid reallocating when
  throwing and catching the same exception or set of exceptions repeatedly. Space
  for the exception and backtrace is allocated on demand inside `throw_internal`,
  after changing GC mode. Several variations were tried for allocating this
  sgorage, including allocating up front with malloc on the thread local state
  and copying during task switching. Keeping everything on the task seemed the
  simplest as it involves the least copying and keeps track of the exception
  stack buffers in a unified way.

* The exception in transit is no longer a single pointer in the thread local
  storage. It's stored in `ptls->current_task->exc_stack` instead, along with
  associated backtrace.

* `jl_current_exception()` is now the method to retreive the current exception
  from within a `JL_CATCH` dynamic context.

* Several places where manual restoration of the exception and backtrace was
  done are no longer necessary because the stack system handles these
  automatically. This code is removed, including
  `jl_apply_with_saved_exception_state`.

* `jl_eh_restore_state` has become non-inline. It seemed good to get this out
  of the public header.

* `jl_sig_throw` is now used with `jl_throw_in_ctx` from signal handlers, to
  make the special circumstances clear and to avoid conflation with rethrow
  which now simply rethrows the existing exception stack.

* Make rethrow outside a catch block into an error.

* Use `ptls->previous_exception` to support `jl_exception_occurred` in
  embedding API.

* finally lowering no longer includes a `foreigncall`, so expressions
  using finally can now be interpreted.

Interpreter / Codegen
---------------------

Mostly small changes here, to track the `:enter` and `:pop_exc` association
with the SSAValue token. The token SSAValue slot is ideal here for storing the
state of the exception stack at the `:enter`.

GC
--

Integrate exception and raw backtrace scanning as a special case into GC mark
loop. This is necessary now that Task objects can refer to exception and
backtrace data in raw form.
@c42f c42f force-pushed the cjf/exception-stack branch from 66e934c to 5333b07 Compare October 5, 2018 22:41
@ararslan
Copy link
Member

Should be fixed now. The currently running build looks poised to complete successfully.

@c42f
Copy link
Member Author

c42f commented Oct 23, 2018

@ararslan Thanks so much!

Pending CI passing once more and the (github outage related) JuliaLang/julia merge lockout being removed, I will go ahead and merge this. Hopefully later today.

@c42f c42f merged commit 24f1316 into master Oct 23, 2018
@c42f c42f deleted the cjf/exception-stack branch October 23, 2018 12:57
@iamed2
Copy link
Contributor

iamed2 commented Oct 23, 2018

Congrats on this!! I'm excited to use these when I can!

c42f added a commit that referenced this pull request Oct 31, 2018
Fix uninitialized previous_exception bug introduced in #28878
@c42f c42f mentioned this pull request Nov 2, 2018
fredrikekre added a commit that referenced this pull request Dec 1, 2018
fredrikekre added a commit that referenced this pull request Dec 1, 2018
fredrikekre added a commit that referenced this pull request Dec 3, 2018
fredrikekre added a commit that referenced this pull request Dec 4, 2018
fredrikekre added a commit that referenced this pull request Dec 4, 2018
fredrikekre added a commit that referenced this pull request Dec 5, 2018
changes between Julia 1.0 and 1.1, including:

- Custom .css-style for compat admonitions.

- Information about compat annotations to CONTRIBUTING.md.

- NEWS.md entry for PRs #30090, #30035, #30022, #29978,
  #29969, #29858, #29845, #29754, #29638, #29636, #29615,
  #29600, #29506, #29469, #29316, #29259, #29178, #29153,
  #29033, #28902, #28761, #28745, #28708, #28696, #29997,
  #28790, #29092, #29108, #29782

- Compat annotation for PRs #30090, #30013, #29978,
  #29890, #29858, #29827, #29754, #29679, #29636, #29623,
  #29600, #29440, #29316, #29259, #29178, #29157, #29153,
  #29033, #28902, #28878, #28761, #28708, #28156, #29733,
  #29670, #29997, #28790, #29092, #29108, #29782, #25278

- Documentation for broadcasting CartesianIndices (#30230).
- Documentation for Base.julia_cmd().
- Documentation for colon constructor of CartesianIndices (#29440).
- Documentation for ^(::Matrix, ::Number) and ^(::Number, ::Matrix).

- Run NEWS-update.jl.

Co-authored-by: Morten Piibeleht <[email protected]>
Co-authored-by: Fredrik Ekre <[email protected]>
fredrikekre added a commit that referenced this pull request Dec 5, 2018
changes between Julia 1.0 and 1.1, including:

- Custom .css-style for compat admonitions.

- Information about compat annotations to CONTRIBUTING.md.

- NEWS.md entry for PRs #30090, #30035, #30022, #29978,
  #29969, #29858, #29845, #29754, #29638, #29636, #29615,
  #29600, #29506, #29469, #29316, #29259, #29178, #29153,
  #29033, #28902, #28761, #28745, #28708, #28696, #29997,
  #28790, #29092, #29108, #29782

- Compat annotation for PRs #30090, #30013, #29978,
  #29890, #29858, #29827, #29754, #29679, #29636, #29623,
  #29600, #29440, #29316, #29259, #29178, #29157, #29153,
  #29033, #28902, #28878, #28761, #28708, #28156, #29733,
  #29670, #29997, #28790, #29092, #29108, #29782, #25278

- Documentation for broadcasting CartesianIndices (#30230).
- Documentation for Base.julia_cmd().
- Documentation for colon constructor of CartesianIndices (#29440).
- Documentation for ^(::Matrix, ::Number) and ^(::Number, ::Matrix).

- Run NEWS-update.jl.

Co-authored-by: Morten Piibeleht <[email protected]>
Co-authored-by: Fredrik Ekre <[email protected]>
fredrikekre added a commit that referenced this pull request Dec 5, 2018
Addition of NEWS and compat admonitions for important changes between Julia 1.0 and 1.1, including:

- Custom .css-style for compat admonitions.

- Information about compat annotations to CONTRIBUTING.md.

- NEWS.md entry for PRs #30090, #30035, #30022, #29978,
  #29969, #29858, #29845, #29754, #29638, #29636, #29615,
  #29600, #29506, #29469, #29316, #29259, #29178, #29153,
  #29033, #28902, #28761, #28745, #28708, #28696, #29997,
  #28790, #29092, #29108, #29782

- Compat annotation for PRs #30090, #30013, #29978,
  #29890, #29858, #29827, #29754, #29679, #29636, #29623,
  #29600, #29440, #29316, #29259, #29178, #29157, #29153,
  #29033, #28902, #28878, #28761, #28708, #28156, #29733,
  #29670, #29997, #28790, #29092, #29108, #29782, #25278

- Documentation for broadcasting CartesianIndices (#30230).
- Documentation for Base.julia_cmd().
- Documentation for colon constructor of CartesianIndices (#29440).
- Documentation for ^(::Matrix, ::Number) and ^(::Number, ::Matrix).

- Run NEWS-update.jl.


Co-authored-by: Morten Piibeleht <[email protected]>
Co-authored-by: Fredrik Ekre <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
error handling Handling of exceptions by Julia or the user
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Tasks and exceptions
8 participants