Free holding cell on monitor-updating-restored when there's no upd #755
Conversation
If there are no pending channel update messages when monitor updating is restored (though there may be an RAA to send), and we're connected to our peer and not awaiting a remote RAA, we need to free anything in our holding cell. Without this, chanmon_fail_consistency was able to find a stuck condition where we sit on an HTLC failure in our holding cell and never handle it (at least until we have other actions to take which empty the holding cell).

Still, this approach sucks - it introduces reentrancy in a particularly dangerous form:
a) we re-enter user code around monitor updates while being called from user code around monitor updates, making deadlocks very likely (in fact, our current tests have a bug here!), and
b) the reentrancy only occurs in a very rare case, making it likely users will not hit it in testing, only deadlocking in production.

I'm not entirely sure what the alternative is, however - we could move to a world where we poll for holding cell events that can be freed on our 1-minute timer, but we'd still have a super-rare reentrancy case, just in timer_chan_freshness_every_min() instead.

Because I'm not sure this is the right fix, and the case should be incredibly rare, and it's not really a critical bug per se, and this is a more involved change than everything else that's left, I'm not going to bother tagging this 0.0.12 and suggest we hold off on fixing it.
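The condition described above can be sketched as follows. This is a minimal, hypothetical model, not rust-lightning's actual `Channel` internals: the method name mirrors the real `monitor_updating_restored`, but every field and type here is invented for illustration.

```rust
// Illustrative sketch only -- all names and fields are hypothetical,
// not rust-lightning's real data structures.

#[derive(Debug, PartialEq)]
enum HoldingCellHtlc {
    // An HTLC failure sitting in the holding cell, by (hypothetical) id.
    Fail(u64),
}

struct Channel {
    awaiting_remote_raa: bool,
    peer_connected: bool,
    // Queued channel update messages we still owe the peer.
    pending_update_msgs: Vec<String>,
    holding_cell: Vec<HoldingCellHtlc>,
}

impl Channel {
    /// Called when a previously-failed monitor update completes.
    /// Returns any holding-cell entries that can now be processed;
    /// without this drain, a lone HTLC failure could sit here forever.
    fn monitor_updating_restored(&mut self) -> Vec<HoldingCellHtlc> {
        // Free the holding cell only when there are no pending channel
        // update messages, we're connected, and we're not awaiting a
        // remote RAA -- the condition described in the PR text above.
        if self.pending_update_msgs.is_empty()
            && self.peer_connected
            && !self.awaiting_remote_raa
        {
            std::mem::take(&mut self.holding_cell)
        } else {
            Vec::new()
        }
    }
}

fn main() {
    let mut chan = Channel {
        awaiting_remote_raa: false,
        peer_connected: true,
        pending_update_msgs: Vec::new(),
        holding_cell: vec![HoldingCellHtlc::Fail(42)],
    };
    let freed = chan.monitor_updating_restored();
    println!("freed {} holding-cell entries", freed.len()); // prints "freed 1 holding-cell entries"
}
```

Note that the reentrancy hazard isn't visible in this sketch: it arises because the real caller is itself user code reacting to a monitor update, and processing the freed entries re-enters user code from there.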
Oh, I should mention this was found by the updates in #753.
Codecov Report

@@            Coverage Diff             @@
##             main     #755      +/-   ##
==========================================
- Coverage   91.45%   91.38%   -0.07%
==========================================
  Files          37       37
  Lines       22249    22263      +14
==========================================
- Hits        20347    20346       -1
- Misses       1902     1917      +15
==========================================

Continue to review full report at Codecov.
Could I see the fuzz failure that this is solving? I'm having a hard time wrapping my mind around #756, and I think it's because it seems to be solving multiple distinct problems that could be separated out. So, gonna try to understand them one by one.
This input to chanmon_consistency fails on current upstream.
For me, this fuzz input^ fails with the same error. If this PR can stand on its own -- and it's not too inconvenient -- I'd rather reopen, review, and get this mergeable separately from addressing the other commits in #756.
Right - honestly it's been a while and I don't remember exactly the contours of what is/isn't in this PR, but that fuzz input does pass on #756. Most of the other commits on #756 just merge codepaths that were near-identical so that they are now identical, and then fix a few minor further bugs in the fuzztarget. I'm happy to try to dig deeper into it, but I'm not sure just this version is really an option, so much as an intended demonstration of an idea seeking concept ACKs.
I didn't bother writing a test because I'm not entirely sure this is the right approach - mostly I'd like concept feedback.