Free holding cell on monitor-updating-restored when there's no upd #755

Conversation

TheBlueMatt
Collaborator

If there are no pending channel update messages when monitor updating
is restored (though there may be an RAA to send), and we're
connected to our peer and not awaiting a remote RAA, we need to
free anything in our holding cell.
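
As a rough illustration of that gating condition, here is a minimal, self-contained sketch; all of the type, field, and method names below are hypothetical placeholders, not actual rust-lightning APIs:

```rust
// Hypothetical sketch of the condition described above; none of these
// names are taken from rust-lightning itself.
struct ChannelSketch {
    /// HTLC adds/fails/fulfills queued up while the monitor update was failed.
    holding_cell: Vec<&'static str>,
    /// True once the previously-failed ChannelMonitor update has been restored.
    monitor_update_restored: bool,
    /// True if commitment-update messages are already queued to send.
    has_pending_channel_update: bool,
    /// True while we are still waiting on the peer's revoke_and_ack.
    awaiting_remote_raa: bool,
    /// True if the peer is currently connected.
    peer_connected: bool,
}

impl ChannelSketch {
    /// Returns the holding-cell contents that can be freed immediately after
    /// monitor updating is restored, per the condition described above.
    fn free_holding_cell_if_possible(&mut self) -> Vec<&'static str> {
        if self.monitor_update_restored
            && !self.has_pending_channel_update
            && !self.awaiting_remote_raa
            && self.peer_connected
        {
            std::mem::take(&mut self.holding_cell)
        } else {
            Vec::new()
        }
    }
}
```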

Without this, chanmon_fail_consistency was able to find a stuck
condition where we sit on an HTLC failure in our holding cell and
don't ever handle it (at least until we have other actions to take
which empty the holding cell).

Still, this approach sucks - it introduces reentrancy in a
particularly dangerous form:
a) we re-enter user code around monitor updates while being called
from user code around monitor updates, making deadlocks very
likely (in fact, our current tests have a bug here!),
b) the re-entrancy only occurs in a very rare case, making it
likely users will not hit it in testing, only deadlocking in
production.
I'm not entirely sure what the alternative is, however - we could
move to a world where we poll for holding cell events that can be
freed on our 1-minute timer, but we still have a super rare
reentrancy case, just in timer_chan_freshness_every_min() instead.
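
To make the deadlock shape in (a) concrete, here is a toy example (again, not rust-lightning code - the names are made up) of user persistence code that holds its own lock while telling the library a monitor update has been restored, and then gets called back into from inside that same call:

```rust
use std::sync::Mutex;

/// Toy stand-in for a user's persistence layer; NOT a rust-lightning API.
struct UserPersister {
    store: Mutex<Vec<String>>,
}

impl UserPersister {
    /// Invoked by the library whenever a channel monitor update must be persisted.
    fn persist_update(&self, update: &str) {
        // A re-entrant lock() on a std Mutex already held by this thread
        // deadlocks (or panics) rather than succeeding.
        let mut store = self.store.lock().unwrap();
        store.push(update.to_string());
    }

    /// User code signalling that a previously-failed persist has now succeeded.
    /// If the library reacts by freeing its holding cell and generating a new
    /// update, it calls persist_update() again from inside this frame.
    fn on_monitor_update_restored(&self, notify_library: impl Fn(&UserPersister)) {
        let _held = self.store.lock().unwrap(); // user holds their own lock...
        notify_library(self); // ...while the library may re-enter persist_update()
    }
}

fn main() {
    let p = UserPersister { store: Mutex::new(Vec::new()) };
    // The closure stands in for the library re-entering user code: this hangs.
    p.on_monitor_update_restored(|persister| persister.persist_update("freed HTLC"));
}
```

The timer-polling alternative doesn't remove this hazard, it only moves the re-entrant call from the monitor-restore path into the timer path.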

I didn't bother writing a test because I'm not entirely sure this is the right approach - mostly I'd like concept feedback.

@TheBlueMatt
Collaborator Author

Because I'm not sure this is the right fix, the case should be incredibly rare, it's not really a critical bug per se, and this is a more involved change than everything else that's left, I'm not going to bother tagging this 0.0.12 and suggest we hold off on fixing it.

@TheBlueMatt
Collaborator Author

Oh, I should mention this was found by the updates in #753.

@codecov

codecov bot commented Nov 19, 2020

Codecov Report

Merging #755 (07fb28e) into main (4e82003) will decrease coverage by 0.06%.
The diff coverage is 93.18%.


@@            Coverage Diff             @@
##             main     #755      +/-   ##
==========================================
- Coverage   91.45%   91.38%   -0.07%     
==========================================
  Files          37       37              
  Lines       22249    22263      +14     
==========================================
- Hits        20347    20346       -1     
- Misses       1902     1917      +15     
Impacted Files Coverage Δ
lightning/src/ln/channel.rs 87.37% <81.25%> (-0.07%) ⬇️
lightning/src/ln/channelmanager.rs 85.48% <100.00%> (+0.11%) ⬆️
lightning/src/ln/functional_tests.rs 96.92% <0.00%> (-0.25%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@valentinewallace
Contributor

Could I see the fuzz failure that this is solving?

I'm having a hard time wrapping my mind around #756 and I think it's because it seems to be solving multiple distinct problems that could be separated out. So, gonna try to understand them one by one.

@TheBlueMatt
Collaborator Author

This input to chanmon_consistency fails on current upstream.

3c10143411341434100d5454543033333038345454545454545454545454
5454545454545454545454545454110015542d3454545411151154543154
541cff14
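
For anyone trying to reproduce this, a small hex-to-bytes helper (plain Rust, no external crates) can turn the dump above into a raw input file; how that file is then fed to the chanmon_consistency target depends on which fuzz harness you're using, so that part is left out:

```rust
/// Decode a whitespace-separated hex dump into raw bytes.
fn hex_to_bytes(hex: &str) -> Vec<u8> {
    let cleaned: String = hex.chars().filter(|c| !c.is_whitespace()).collect();
    cleaned
        .as_bytes()
        .chunks(2)
        .map(|pair| u8::from_str_radix(std::str::from_utf8(pair).unwrap(), 16).unwrap())
        .collect()
}

fn main() -> std::io::Result<()> {
    let input = "3c10143411341434100d5454543033333038345454545454545454545454\
                 5454545454545454545454545454110015542d3454545411151154543154\
                 541cff14";
    // Write the decoded bytes out as a reproduction/corpus file.
    std::fs::write("chanmon_consistency_input.bin", hex_to_bytes(input))
}
```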

@valentinewallace
Contributor

valentinewallace commented Feb 10, 2021

This input to chanmon_consistency fails on current upstream.

3c10143411341434100d5454543033333038345454545454545454545454
5454545454545454545454545454110015542d3454545411151154543154
541cff14

For me this fuzz input^ fails with the same error on main and this PR's commit. Is that expected?

If this PR can stand on its own -- and it's not too inconvenient -- I'd rather reopen, review + get this mergeable separately from addressing the other commits in #756.

@TheBlueMatt
Collaborator Author

Right, honestly it's been a while and I don't remember exactly the contours of what is/isn't in this PR, but that fuzz input does pass on #756. Most of the other commits on 756 just merge two codepaths that were near-identical so that they are now identical, and then fix a few minor further bugs in the fuzztarget. I'm happy to try to dig deeper into it, but I'm not sure just this version is really an option on its own, so much as an intended demonstration of an idea seeking concept ACKs.
