-
Notifications
You must be signed in to change notification settings - Fork 177
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Lwt semantic improvements: exception safety and stack overflows #500
Conversation
Before this commit, Lwt only tracked whether it was already inside a callback call or not, i.e. a single boolean value. Now, it tracks how many callback calls it has nested.
Lwt's internal [resolve] function does two things: - Sets the state of a pending promise to either Fulfilled or Rejected. - Calls all the callbacks that were attached to the promise while it was still pending. The old version of this function unconditionally called the callbacks immediately. The version added in this commit first checks the current callback nesting depth. If the depth exceeds a limit, callbacks are deferred into a queue instead of being called right away. To support the old behavior until Lwt 4.0.0, the new function has an [?allow_deferring] optional argument that can be set to [false]. If so set, callbacks are not deferred, even if the nesting limit is exceeded. To support the old behavior of specifically [Lwt.wakeup_later], the new [resolve] function has a [?maximum_callback_nesting_depth] optional argument. [Lwt.wakeup_later] sets this to 1. The old [resolve] could only be safely called during an already-started resolution loop, i.e. inside a surrounding [run_in_resolution_loop] call. To statically ensure this, a dummy [in_resolution_loop] value was passed around Lwt, and the old [resolve] could only be called from contexts where [in_resolution_loop] was available. The new [resolve] enters the resolution loop on its own. Indeed, this is part of tracking the callback nesting depth. As a result, the [in_resolution_loop] safety mechanism is no longer necessary, and it is removed by this commit.
The goal of this is to hide the callback-deferring machinery from as much other Lwt code as possible, to make sure that it is invoked correctly, and remains easy to understand.
As in [Lwt.bind p f] when [p] is already fulfilled.
One more thing, the PR is broken up into commits that should individually hopefully be easy to review. The net diff is not as easy. |
@copy Wow, neat graph, and thanks :) |
Note that this breaks the rule This change would (slightly) break mirage-firewall at least, which relies on this to know whether a packet has been sent or queued. Bad design perhaps, but I think this is going to cause some subtle breakage. In general, I think it's really important to have a defined order of execution. Looking at a piece of code, I should be able to tell which side-effects happen first. FWIW, E always defers, and I think Async does too, and that seems better than the current behaviour, even if it is slower. But changing seems risky. Deferring some times and not others, depending on stack depth, sounds very bad :-( Perhaps we could have a separate |
Indeed, the rule is broken by this PR (actually, the For other readers: in the future, if I'm comfortable with breaking the rule, however. Here are various semi-reasons:
I'm certainly open to always deferring. However, I want to point out, that it would take somewhat more refactoring to get "fully-ticked" behavior, like Node has. In particular, if current Lwt always deferred, it would still have a sort of mixed behavior. This is because Lwt has both a top-level I/O loop, which figures out which promises to resolve when the process wakes up from waiting for I/O, and a "core" loop which runs all callbacks and completely drains the deferral queue of everything that got triggered by resolving only one promise. So, if the I/O loop decided that it needs to resolve two promises, resolved the first one, and that triggered more Lwt code that eventually queued a deferred callback, that deferred callback would still be run before the second promise's callbacks run. To get ticked behavior, we would need to get rid of this "core" loop, and rely on the I/O loop to handle the deferral queue – so basically, a real scheduler. This would actually make the Lwt code much simpler. It is also how Lwt works in the JS promises branch (it relies on the JS engine for maintaining the deferral queue, and doesn't have one in I'm saving this for future work, and not this PR, though :p Let me know any thoughts on this.
I'm actually in favor of documenting that Apart from that, though, I think it's some kind of anti-pattern to depend on the order of deferrals. It does seem better to provide an explicit queueing library or libraries. Towards this, Thanks for the insight, @talex5! |
I also want to add that if the code was written specifically to always defer, it would probably be faster than the 144ns/ |
These are interesting points, but actually (I wasn't very clear), I wasn't thinking about exceptions but about mutation. e.g.
In current Lwt, this always returns 1. In always-defer, it always returns 0. With this PR, it depends on the stack depth. (the always-defer behaviour has the additional advantage of returning 0 in all cases, even if the |
A useful toll to have would be a test framework that runs code with different deferral strategies (defer as often as possible, defer as rarely as possible, defer sometimes). This way, developers can create test suites to check that their functions does not rely on deference strategy. |
@talex5: I see, yes. But again, I think depending on implicit assumptions about when anything after As a small aside, implicit assumptions make Lwt more difficult to learn, as beginners have to worry what these unstated assumptions are. It seems better and more uniform to discourage such assumptions, and always encourage explicit synchronization. @raphael-proust Agreed. Perhaps @yomimono also has interest in this, at risk of ire for ccing twice :p |
Yes, she does have interest in that ;) |
Thanks for the reviews, benchmarking, discussion, etc.! Even though the pull request is now merged, we still have until 4.0.0 (i.e., at least three months) before the semantics change. This gives us time for more thinking, discussion, and getting ideas from a wider audience before we fully commit. |
This PR improves the stack- and exception-safety of most functions in Lwt:
bind
,catch
,try_bind
,finalize
,join
,choose
,pick
,nchoose
,npick
,nchoose_split
,cancel
,protected
,no_cancel
,on_cancel
,on_success
,on_failure
,on_any
, andon_termination
.To present, the text describes:
Problem 1:
Lwt.bind
and similar functions leak exceptionsThis program raises an exception:
The behavior one would expect from
Lwt.bind
here is that the exception be stored in the promiseLwt.bind
creates, i.e. the promise is created and rejected. Indeed, that is what happens withmap
:Much more disturbingly, that is what happens in
Lwt.bind
if the promise it is called on started out pending instead of resolved:...which suggests that merely refactoring some code, or a different execution trace, could alter which way exceptions go from a callback of
bind
: forward into a promise, or back up the stack. That seems like a needless source of anxiety.This was originally reported in #329. Besides
bind
, analogous problems also affectcatch
,try_bind
,finalize
, and theirbacktrace_*
variants, which are used internally by the PPX.The reason for this problem is that if
p
is already resolved,Lwt.bind p f
callsf
immediately, as a performance optimization. However, sincebind
might be called in a loop, the call tof
is implemented as a tail call: otherwise, a loop that does a few hundred thousandbind
s will overflow the stack. This means that it is not possible forbind
to wrap an exception handler around the call tof
.Problem 2: Lwt can resolve an unbounded number of promises in one go
...and since each promise resolution deepens the stack, this can trigger a stack overflow.
I didn't originally set out to fix this problem, but fixing it is a side effect of fixing problem 1.
Basically, if one promise triggers the resolution of another, and so on, for long enough, the result is a crash. Here is a contrived example using only
Lwt.join
, but actually this can be triggered using any sufficiently large combination of the functions listed at the beginning of this PR:How the problems are solved by this PR
The observation is that Lwt actually never can do tail calls: all the places where Lwt does tail calls before this PR, are places where Lwt sacrifices correct exception semantics.
So, this PR abandons tail calls completely. As a result, the callbacks called by
Lwt.bind
, etc., can now be wrapped in exception handlers for correct exception semantics. Without doing anything else, this introduces unbounded stack growth in loops ofLwt.bind
.To avoid this stack growth due to
Lwt.bind
, and eveywhere else, the PR introduces a disciplined callback deferral mechanism. Lwt now tracks how many callback calls it has nested inside each other. There is a limit of 42 such nested callback calls. When Lwt needs to call a callback, if the depth is less than 42, it increments the depth and calls the callback immediately. Otherwise, if the depth is 42 or more, Lwt defers the callback call into a queue, which is drained later, when the depth is about to reach zero again.So, instead of immediate callback calls, Lwt now has hybrid immeidate/deferred callback calls.
I don't expect this change to affect most code that uses Lwt. However, it can still be dangerous for some programs, so this PR takes a somewhat gentle approach. The code here introduces a generalized mechanism, that is capable of both the broken and the patched behavior, and choosing between the two is a matter of changing a few lines of code. The code is currently used to emulate the broken behavior. There is a separate branch,
safer-semantics
, which adds one commit over this PR, and has the patched behavior. I suggest releasing this PR in 3.2.0 with the broken behavior, announcing the patched branch so that users can test against it, and then applying the final patch in Lwt 4.0.0.The changes to the test suite in
safer-semantics
suggest some of the improvements that will be made in 4.0.0.Note that
safer-semantics
needs PR #499, in order to be able to fully pass the Lwt test suite, because fixingLwt.bind
turns out to be one way to (accidentally) fix some bugs inLwt_list
. TheLwt_list
test suite without #499 expects incorrectLwt_list
behavior.Effect on performance
These are timings of an
Lwt.bind
loop on my machine:bind
for current (broken) Lwt.bind
for this PR, when emulating broken behavior (as I suggest to release in 3.2.0).bind
with the patched behavior (4.0.0, with a deferral every 42 binds).All the measurements were done with Flambda disabled, as it seems Flambda is being reworked right now.
The 16ns can be decreased somewhat. For example, avoiding capture in some new closures in
bind
reduced it to 13ns. It should be possible to restore something close to 8ns, but I wanted to stick with idiomatic and relatively concise code for now, and not get into premature optimization. I suspect the 50ns can also be improved in this way, though probably to a lesser degree.I chose 42 for the maximum number of immediate callback calls based on a few measurements:
It seems that 42 is close to where the cost of deferral is amortized enough to become insignificant.
Miscellaneous
Calling callbacks immediately is a kind of LIFO callback-calling strategy, while deferral, as implemented, is FIFO, and the hybrid behavior in this PR is just unpredictable. Lwt does not promise any callback-calling order in its API, but I am worried about surprising effects and starvation. We could set the nesting limit to 1, to get JS-like always-defer FIFO behavior, except maybe on the first callback call.
Since Lwt now has a centralized mechanism for deferring callbacks, or deciding where in the queue they will go, it should be fairly easy to randomize behavior for testing. cc @stedolan, @yomimono.
@domsj, @copy, @smondet, if you have time at all, would you be willing to run this branch
disciplined-callbacks
, and the patched branchsafer-semantics
through your stress tests? It could be very helpful :)cc also @mfp, @rgrinberg, @hcarty, @mfp, @c-cube based on past interest in this topic, overheard(seen) discussions on IRC, etc. :)