Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lwt semantic improvements: exception safety and stack overflows #500

Merged
merged 7 commits into from
Nov 26, 2017

Conversation

aantron
Copy link
Collaborator

@aantron aantron commented Nov 10, 2017

This PR improves the stack- and exception-safety of most functions in Lwt: bind, catch, try_bind, finalize, join, choose, pick, nchoose, npick, nchoose_split, cancel, protected, no_cancel, on_cancel, on_success, on_failure, on_any, and on_termination.

To present, the text describes:

  • the two problems, stack- and exception-safety,
  • the solution in this PR (to both of them),
  • what is actually implemented in this PR, and a soft upgrade path for users,
  • some miscellaneous thoughts.

Problem 1: Lwt.bind and similar functions leak exceptions

This program raises an exception:

let () =
  Lwt.bind Lwt.return_unit (fun () -> raise Exit) |> ignore

The behavior one would expect from Lwt.bind here is that the exception be stored in the promise Lwt.bind creates, i.e. the promise is created and rejected. Indeed, that is what happens with map:

let () =
  let p = Lwt.map (fun () -> raise Exit) Lwt.return_unit in
  assert (Lwt.state p = Lwt.Fail Exit)

Much more disturbingly, that is what happens in Lwt.bind if the promise it is called on started out pending instead of resolved:

let () =
  let (p, r) = Lwt.wait () in
  let p' = Lwt.bind p (fun () -> raise Exit) in
  Lwt.wakeup r ();
  assert (Lwt.state p' = Lwt.Fail Exit)

...which suggests that merely refactoring some code, or a different execution trace, could alter which way exceptions go from a callback of bind: forward into a promise, or back up the stack. That seems like a needless source of anxiety.

This was originally reported in #329. Besides bind, analogous problems also affect catch, try_bind, finalize, and their backtrace_* variants, which are used internally by the PPX.

The reason for this problem is that if p is already resolved, Lwt.bind p f calls f immediately, as a performance optimization. However, since bind might be called in a loop, the call to f is implemented as a tail call: otherwise, a loop that does a few hundred thousand binds will overflow the stack. This means that it is not possible for bind to wrap an exception handler around the call to f.


Problem 2: Lwt can resolve an unbounded number of promises in one go

...and since each promise resolution deepens the stack, this can trigger a stack overflow.

I didn't originally set out to fix this problem, but fixing it is a side effect of fixing problem 1.

Basically, if one promise triggers the resolution of another, and so on, for long enough, the result is a crash. Here is a contrived example using only Lwt.join, but actually this can be triggered using any sufficiently large combination of the functions listed at the beginning of this PR:

let rec make_huge_promise_chain n last_promise =
  if n = 0 then
    last_promise
  else
    make_huge_promise_chain (n - 1) (Lwt.join [last_promise])
in
let (p, r) = Lwt.wait () in
make_huge_promise_chain 1_000_000 p |> ignore;

Lwt.wakeup r ()

How the problems are solved by this PR

The observation is that Lwt actually never can do tail calls: all the places where Lwt does tail calls before this PR, are places where Lwt sacrifices correct exception semantics.

So, this PR abandons tail calls completely. As a result, the callbacks called by Lwt.bind, etc., can now be wrapped in exception handlers for correct exception semantics. Without doing anything else, this introduces unbounded stack growth in loops of Lwt.bind.

To avoid this stack growth due to Lwt.bind, and eveywhere else, the PR introduces a disciplined callback deferral mechanism. Lwt now tracks how many callback calls it has nested inside each other. There is a limit of 42 such nested callback calls. When Lwt needs to call a callback, if the depth is less than 42, it increments the depth and calls the callback immediately. Otherwise, if the depth is 42 or more, Lwt defers the callback call into a queue, which is drained later, when the depth is about to reach zero again.

So, instead of immediate callback calls, Lwt now has hybrid immeidate/deferred callback calls.

I don't expect this change to affect most code that uses Lwt. However, it can still be dangerous for some programs, so this PR takes a somewhat gentle approach. The code here introduces a generalized mechanism, that is capable of both the broken and the patched behavior, and choosing between the two is a matter of changing a few lines of code. The code is currently used to emulate the broken behavior. There is a separate branch, safer-semantics, which adds one commit over this PR, and has the patched behavior. I suggest releasing this PR in 3.2.0 with the broken behavior, announcing the patched branch so that users can test against it, and then applying the final patch in Lwt 4.0.0.

The changes to the test suite in safer-semantics suggest some of the improvements that will be made in 4.0.0.

Note that safer-semantics needs PR #499, in order to be able to fully pass the Lwt test suite, because fixing Lwt.bind turns out to be one way to (accidentally) fix some bugs in Lwt_list. The Lwt_list test suite without #499 expects incorrect Lwt_list behavior.


Effect on performance

These are timings of an Lwt.bind loop on my machine:

  • 8ns per bind for current (broken) Lwt.
  • 16ns per bind for this PR, when emulating broken behavior (as I suggest to release in 3.2.0).
  • 50ns per bind with the patched behavior (4.0.0, with a deferral every 42 binds).

All the measurements were done with Flambda disabled, as it seems Flambda is being reworked right now.

The 16ns can be decreased somewhat. For example, avoiding capture in some new closures in bind reduced it to 13ns. It should be possible to restore something close to 8ns, but I wanted to stick with idiomatic and relatively concise code for now, and not get into premature optimization. I suspect the 50ns can also be improved in this way, though probably to a lesser degree.

I chose 42 for the maximum number of immediate callback calls based on a few measurements:

nesting    time per
 limit       bind

  1         144ns
  8          66ns
 16          53ns
 42          50ns
420          50ns

It seems that 42 is close to where the cost of deferral is amortized enough to become insignificant.


Miscellaneous

Calling callbacks immediately is a kind of LIFO callback-calling strategy, while deferral, as implemented, is FIFO, and the hybrid behavior in this PR is just unpredictable. Lwt does not promise any callback-calling order in its API, but I am worried about surprising effects and starvation. We could set the nesting limit to 1, to get JS-like always-defer FIFO behavior, except maybe on the first callback call.

Since Lwt now has a centralized mechanism for deferring callbacks, or deciding where in the queue they will go, it should be fairly easy to randomize behavior for testing. cc @stedolan, @yomimono.

@domsj, @copy, @smondet, if you have time at all, would you be willing to run this branch disciplined-callbacks, and the patched branch safer-semantics through your stress tests? It could be very helpful :)

cc also @mfp, @rgrinberg, @hcarty, @mfp, @c-cube based on past interest in this topic, overheard(seen) discussions on IRC, etc. :)

Before this commit, Lwt only tracked whether it was already inside a
callback call or not, i.e. a single boolean value. Now, it tracks how
many callback calls it has nested.
Lwt's internal [resolve] function does two things:

- Sets the state of a pending promise to either Fulfilled or Rejected.
- Calls all the callbacks that were attached to the promise while it was
  still pending.

The old version of this function unconditionally called the callbacks
immediately. The version added in this commit first checks the current
callback nesting depth. If the depth exceeds a limit, callbacks are
deferred into a queue instead of being called right away.

To support the old behavior until Lwt 4.0.0, the new function has an
[?allow_deferring] optional argument that can be set to [false]. If so
set, callbacks are not deferred, even if the nesting limit is exceeded.

To support the old behavior of specifically [Lwt.wakeup_later], the new
[resolve] function has a [?maximum_callback_nesting_depth] optional
argument. [Lwt.wakeup_later] sets this to 1.

The old [resolve] could only be safely called during an already-started
resolution loop, i.e. inside a surrounding [run_in_resolution_loop]
call. To statically ensure this, a dummy [in_resolution_loop] value was
passed around Lwt, and the old [resolve] could only be called from
contexts where [in_resolution_loop] was available.

The new [resolve] enters the resolution loop on its own. Indeed, this is
part of tracking the callback nesting depth. As a result, the
[in_resolution_loop] safety mechanism is no longer necessary, and it is
removed by this commit.
The goal of this is to hide the callback-deferring machinery from as
much other Lwt code as possible, to make sure that it is invoked
correctly, and remains easy to understand.
As in [Lwt.bind p f] when [p] is already fulfilled.
@aantron
Copy link
Collaborator Author

aantron commented Nov 10, 2017

One more thing, the PR is broken up into commits that should individually hopefully be easy to review. The net diff is not as easy.

@aantron aantron added this to the 3.2.0 milestone Nov 10, 2017
@copy
Copy link
Contributor

copy commented Nov 10, 2017

Cool! I've been bitten by this quite a few times.

Neither safer-semantics nor disciplined-callbacks has a measurable effect on the performance of my application (which is generally more CPU-bound, so that doesn't surprise me).

screenshot_2017-11-10_18-22-21

@aantron
Copy link
Collaborator Author

aantron commented Nov 10, 2017

@copy Wow, neat graph, and thanks :)

@talex5
Copy link
Contributor

talex5 commented Nov 11, 2017

Note that this breaks the rule (return x >>= f) = (f x). Isn't that one of the monad laws?

This change would (slightly) break mirage-firewall at least, which relies on this to know whether a packet has been sent or queued. Bad design perhaps, but I think this is going to cause some subtle breakage.

In general, I think it's really important to have a defined order of execution. Looking at a piece of code, I should be able to tell which side-effects happen first.

FWIW, E always defers, and I think Async does too, and that seems better than the current behaviour, even if it is slower. But changing seems risky. Deferring some times and not others, depending on stack depth, sounds very bad :-(

Perhaps we could have a separate Lwt.Defer.(>>=) that always defers, so people can opt-in? But even so, I think Lwt should define the order of callbacks rather than saying it's unspecified. Otherwise, every library has to implement its own queuing system on top of Lwt, and I suspect that doesn't compose well (capnp-rpc relies on Lwt.pause to defer with a known ordering, but that is also unspecified by the docs).

@aantron
Copy link
Collaborator Author

aantron commented Nov 14, 2017

Note that this breaks the rule (return x >>= f) = (f x).

Indeed, the rule is broken by this PR (actually, the safer-semantics branch).

For other readers: in the future, if f raises an exception, return x >>= f will catch the exception and create a rejected promise, while f x will "leak" the exception. The current behavior is that both expressions leak the exception.

I'm comfortable with breaking the rule, however. Here are various semi-reasons:

  1. Note that one can tell syntactically which side of the equivalence involves potentially asynchronous execution with no synchronous "prefix": return x >>= f does, while f x might run some code synchronously first.

    This is precisely what I'd like to achieve with safer-semantics: everything that is passed to Lwt, to be run potentially later, can syntactically be known not to leak exceptions. In particular, this includes anything to the right of >>=, like f in return x >>= f. It shouldn't matter that the left is syntactically a resolved promise. The user should be able to refactor that without worrying about different, syntactically unpredictable, behavior from >>=.

    On the other hand, if you don't pass f to bind, i.e. you just do f x, you can, syntactically, tell that you are on your own.

    The exception handling rule becomes very uniform in safer-semantics: if a function is called by Lwt, Lwt always deals with any raised exception by either rejecting a promise that is handy, or else passing the exception to !Lwt.async_exception_hook and terminating the process. It never allows the exception to leak past Lwt.

    Essentially, safer-semantics priveleges having less things to worry about when refactoring return x >>= f to e >>= f for some other e, over refactoring return x >>= f to f x. I claim that it's more important to have consistent behavior when keeping consistent usage of Lwt, than to have consistent behavior when deleting or introducing syntactically-visible usage of Lwt, because in the latter case, it is visible that the user is calling into Lwt, an asynchronous execution library, at different places from before, so the user is warned.

    safer-semantics also favors other refactorings, more on those later.

  2. An ugly modified rule does hold:

    return x >>= f   ≡   try (f x) with exn -> fail exn
    

    In other words, if you want to Lwt-ize f x so that it doesn't leak an exception, catch the exception and turn it into a rejected promise.

    This seems fairly intuitive once one is aware of it.

  3. The semantics are already broken in at least one way, which is that map f p ≡ p >>= fun v -> return (f v) doesn't hold. This isn't one of the monad laws, but it is part of how one converts between the two main formulations of what is a monad. safer-semantics gives Lwt this equivalence. I recall talk of Lwt breaking monad laws in some other way also, but I don't have time to find it now.

    More generally, giving Lwt uniform exception-handling semantics means that one can refactor Lwt code without worrying that exception behavior will change subtly, based on which exact Lwt functions are called. Back to the particular case, changing between bind and map is very common, at least for me: I often replace >>= by >|= and vice versa. I don't want to worry that replacing a >|= by a >>= can suddenly cause leaked exceptions, and only on some traces that I might miss in testing.

    I suppose another way to restore the equivalence is to make map also "unsafe," like bind is, but this seems like a step in the opposite direction, towards behavior dependent on runtime promise state.

  4. I acknowledge monads are a useful structure, but I doubt the actual relevance of Lwt being a monad. As explained above, I think it's much more important to provide other aspects of predictable semantics, than to satisfy the above monad law. Also note that the proposed new module Lwt reference makes no mention of monads.

    We may just have a proof that Lwt is not a monad, and/or that Lwt's style exception handling is inherently ill-conceived :p

  5. Ultimately, I hope to create a new module, something like Lwt.Promise, and we can figure out an even better way to deal with exceptions when designing that. We might have a better shot at ensuring Lwt.Promise actually is a monad.

FWIW, E always defers, and I think Async does too, and that seems better than the current behaviour, even if it is slower. But changing seems risky. Deferring some times and not others, depending on stack depth, sounds very bad :-(

I'm certainly open to always deferring. However, I want to point out, that it would take somewhat more refactoring to get "fully-ticked" behavior, like Node has.

In particular, if current Lwt always deferred, it would still have a sort of mixed behavior. This is because Lwt has both a top-level I/O loop, which figures out which promises to resolve when the process wakes up from waiting for I/O, and a "core" loop which runs all callbacks and completely drains the deferral queue of everything that got triggered by resolving only one promise.

So, if the I/O loop decided that it needs to resolve two promises, resolved the first one, and that triggered more Lwt code that eventually queued a deferred callback, that deferred callback would still be run before the second promise's callbacks run.

To get ticked behavior, we would need to get rid of this "core" loop, and rely on the I/O loop to handle the deferral queue – so basically, a real scheduler.

This would actually make the Lwt code much simpler. It is also how Lwt works in the JS promises branch (it relies on the JS engine for maintaining the deferral queue, and doesn't have one in lwt.ml).

I'm saving this for future work, and not this PR, though :p Let me know any thoughts on this.

Perhaps we could have a separate Lwt.Defer.(>>=) that always defers, so people can opt-in? But even so, I think Lwt should define the order of callbacks rather than saying it's unspecified. Otherwise, every library has to implement its own queuing system on top of Lwt, and I suspect that doesn't compose well (capnp-rpc relies on Lwt.pause to defer with a known ordering, but that is also unspecified by the docs).

I'm actually in favor of documenting that Lwt.pause is FIFO, and encouraging it as the "default" way to do basic queueing in Lwt.

Apart from that, though, I think it's some kind of anti-pattern to depend on the order of deferrals. It does seem better to provide an explicit queueing library or libraries. Towards this, Lwt.pause is a very rudimentary such thing.


Thanks for the insight, @talex5!

@aantron
Copy link
Collaborator Author

aantron commented Nov 14, 2017

I also want to add that if the code was written specifically to always defer, it would probably be faster than the 144ns/bind suggests above. That 144ns/bind was for some generalized code, configured by runtime checks to always defer. Because the code is so generalized, and because I didn't want to manually inline everything to optimize it, that code is slower than it has to be.

@talex5
Copy link
Contributor

talex5 commented Nov 15, 2017

These are interesting points, but actually (I wasn't very clear), I wasn't thinking about exceptions but about mutation. e.g.

let x = ref 0 in
let y = Lwt.return () >>= fun () -> incr x; Lwt.return () in
!x

In current Lwt, this always returns 1. In always-defer, it always returns 0. With this PR, it depends on the stack depth.

(the always-defer behaviour has the additional advantage of returning 0 in all cases, even if the return () is replaced by a promise)

@raphael-proust
Copy link
Collaborator

A useful toll to have would be a test framework that runs code with different deferral strategies (defer as often as possible, defer as rarely as possible, defer sometimes). This way, developers can create test suites to check that their functions does not rely on deference strategy.

@aantron
Copy link
Collaborator Author

aantron commented Nov 22, 2017

@talex5: I see, yes. But again, I think depending on implicit assumptions about when anything after >>= runs is a mistake. The example is a special case of a race condition. I think the only guarantee Lwt should make, and can reasonably make, is that a run of code that does not call into Lwt nor return to it, cannot interleave with any other code. The example shows two different such runs interacting, so I would prefer to encourage explicit synchronization if it is needed.

As a small aside, implicit assumptions make Lwt more difficult to learn, as beginners have to worry what these unstated assumptions are. It seems better and more uniform to discourage such assumptions, and always encourage explicit synchronization.

@raphael-proust Agreed. Perhaps @yomimono also has interest in this, at risk of ire for ccing twice :p

@yomimono
Copy link

Yes, she does have interest in that ;)

@aantron aantron merged commit 52e6c34 into master Nov 26, 2017
@aantron
Copy link
Collaborator Author

aantron commented Nov 26, 2017

Thanks for the reviews, benchmarking, discussion, etc.! Even though the pull request is now merged, we still have until 4.0.0 (i.e., at least three months) before the semantics change. This gives us time for more thinking, discussion, and getting ideas from a wider audience before we fully commit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants