
Fail holding-cell AddHTLCs on Channel deser to match disconnection #754

Conversation

TheBlueMatt
Collaborator

Note that this bug was found thanks to the changes in #753, but it's reasonable to take this by itself for 0.0.12 without #753.

As Channel::write says in the comment at the top: "we write out as
if remove_uncommitted_htlcs_and_mark_paused had just been called",
except that we previously deliberately included holding-cell
AddHTLC events in the serialization. On the flip side, in
remove_uncommitted_htlcs_and_mark_paused, we removed pending
AddHTLC events under the assumption that, if we can't forward
something ASAP, it's better to fail it back to the origin than to
sit on it for a while.

Given there's likely to be just as large a time-lag between
ser/deserialization as between when a peer dis/reconnects, there
isn't much of a reason for this difference. Worse, we debug_assert
that there are no pending AddHTLC holding cell events when doing a
reconnect, so any tests or fuzzers which deserialized a
ChannelManager with AddHTLC events would panic.

We resolve this by adding logic to fail any holding-cell AddHTLC
events upon deserialization, in part because trying to forward them
before we're sure we have an up-to-date chain is somewhat risky -
the sender may have already gone to chain while our upstream has
not.
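In code terms, the change amounts to draining queued AddHTLCs out of the holding cell while rebuilding a Channel from disk and handing them back so the caller can fail them toward the origin, just as remove_uncommitted_htlcs_and_mark_paused already does on disconnect. Below is a minimal, self-contained sketch of that shape; the types and the drain_holding_cell_adds helper are simplified stand-ins for illustration, not rust-lightning's actual Channel internals:

```rust
// Simplified stand-in types: model a channel's holding cell and drop its
// queued AddHTLCs when the channel is reconstructed from disk, mirroring the
// behavior on peer disconnection.

#[derive(Debug)]
enum HoldingCellUpdate {
    AddHTLC { payment_hash: [u8; 32], amount_msat: u64 },
    FailHTLC { htlc_id: u64 },
}

struct Channel {
    holding_cell: Vec<HoldingCellUpdate>,
}

impl Channel {
    /// Called right after deserialization: remove queued AddHTLCs and return
    /// their payment hashes so the caller can fail them back to the origin.
    /// Fails/fulfills stay queued - those must still be delivered later.
    fn drain_holding_cell_adds(&mut self) -> Vec<[u8; 32]> {
        let mut failed = Vec::new();
        self.holding_cell.retain(|upd| match upd {
            HoldingCellUpdate::AddHTLC { payment_hash, .. } => {
                // Better to fail the forward back now than to sit on it until
                // we're sure our chain view is current.
                failed.push(*payment_hash);
                false
            }
            _ => true,
        });
        failed
    }
}

fn main() {
    let mut chan = Channel {
        holding_cell: vec![
            HoldingCellUpdate::AddHTLC { payment_hash: [1; 32], amount_msat: 10_000 },
            HoldingCellUpdate::FailHTLC { htlc_id: 7 },
        ],
    };
    let to_fail_back = chan.drain_holding_cell_adds();
    assert_eq!(to_fail_back.len(), 1); // the queued forward gets failed back
    assert_eq!(chan.holding_cell.len(), 1); // the FailHTLC stays queued
    println!("failing {} HTLC(s) back to origin", to_fail_back.len());
}
```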

TheBlueMatt modified the milestone: 0.0.12 on Nov 19, 2020
@TheBlueMatt
Collaborator Author

Note that I think we only need to take this for 0.0.12 because of the risk of forwarding something when we don't have a current chain - the assertion is only a debug assertion, and as far as I can tell the behavior even when hitting it is correct.

TheBlueMatt force-pushed the 2020-11-holding-cell-add-panic branch from 4956ad6 to 1ecee17 on November 19, 2020 at 01:23
@codecov

codecov bot commented Nov 19, 2020

Codecov Report

Merging #754 (728ead6) into main (4e82003) will increase coverage by 0.04%.
The diff coverage is 93.85%.


@@            Coverage Diff             @@
##             main     #754      +/-   ##
==========================================
+ Coverage   91.45%   91.49%   +0.04%     
==========================================
  Files          37       37              
  Lines       22249    22340      +91     
==========================================
+ Hits        20347    20440      +93     
+ Misses       1902     1900       -2     
Impacted Files                                  Coverage Δ
lightning/src/ln/channel.rs                     88.08% <63.63%> (+0.64%) ⬆️
lightning/src/ln/chanmon_update_fail_tests.rs   97.51% <96.70%> (-0.06%) ⬇️
lightning/src/ln/channelmanager.rs              85.41% <100.00%> (+0.05%) ⬆️
lightning/src/ln/functional_tests.rs            96.94% <0.00%> (-0.23%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4e82003...1ecee17.

@valentinewallace
Contributor

valentinewallace commented Nov 20, 2020

In terms of "concept," the one question I have is whether this could lead to unexpected behavior for users. It's not obvious that reading state from disk would lead to failing HTLCs back.

For example, if a user has a method that reads all of RL's info from disk, and they batch ChannelMonitor updates such that they do have to read from disk periodically while the node is still running (inefficient on their part but maybe plausible), then after this PR, HTLCs would be unexpectedly failed backwards while the node is running.

Maybe this PR speaks to a need for ChannelManager to have a shutdown() function, so it'd have the opportunity to fail HTLCs back on shutdown? (then we could also persist to disk one last time on shutdown, which might be nice)

(edit: possibly relevant tweet: https://twitter.com/jaredpalmer/status/1171415929865064449?s=20)

@TheBlueMatt
Collaborator Author

> For example, if a user has a method that reads all of RL's info from disk, and they batch ChannelMonitor updates such that they do have to read from disk periodically while the node is still running (inefficient on their part but maybe plausible), then after this PR, HTLCs would be unexpectedly failed backwards while the node is running.

I'm a little confused by this - you have to provide the ChannelMonitor data to ChannelManager during deserialization, and if a channel isn't in the same state in both, the channel will be closed. Doing some kind of late updates is definitely not allowed by our API.
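To make that ordering rule concrete, here's a rough sketch: monitor state is supplied up front when the manager is read back, and a channel whose monitor is missing or behind has to be closed rather than patched up later. The types and the read_manager function are simplified stand-ins for illustration, not the actual ChannelManagerReadArgs API:

```rust
// Illustrative only: model the "monitors are provided at deserialization
// time" rule. A channel whose monitor is missing or stale is force-closed;
// there is no way to feed in newer monitor state after the fact.
use std::collections::HashMap;

struct ChannelMonitor { latest_update_id: u64 }
struct PersistedChannel { funding_txo: [u8; 36], monitor_update_id: u64 }

fn read_manager(
    channels: Vec<PersistedChannel>,
    monitors: &HashMap<[u8; 36], ChannelMonitor>,
) -> (Vec<PersistedChannel>, Vec<[u8; 36]>) {
    let mut live = Vec::new();
    let mut to_force_close = Vec::new();
    for chan in channels {
        match monitors.get(&chan.funding_txo) {
            // The monitor is at least as new as the manager's view: keep it.
            Some(mon) if mon.latest_update_id >= chan.monitor_update_id => live.push(chan),
            // Missing or stale monitor: the channel must be closed.
            _ => to_force_close.push(chan.funding_txo),
        }
    }
    (live, to_force_close)
}

fn main() {
    let mut monitors = HashMap::new();
    monitors.insert([0u8; 36], ChannelMonitor { latest_update_id: 5 });
    let channels = vec![
        PersistedChannel { funding_txo: [0u8; 36], monitor_update_id: 5 }, // in sync
        PersistedChannel { funding_txo: [1u8; 36], monitor_update_id: 2 }, // no monitor given
    ];
    let (live, closed) = read_manager(channels, &monitors);
    assert_eq!(live.len(), 1);
    assert_eq!(closed.len(), 1);
}
```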

> Maybe this PR speaks to a need for ChannelManager to have a shutdown() function, so it'd have the opportunity to fail HTLCs back on shutdown? (then we could also persist to disk one last time on shutdown, which might be nice)

Hopefully not. I tend to subscribe to the view that "having clean shutdown logic lulls you into a false sense of security" - user apps crash, especially on mobile, and getting a shutdown() function called reliably is a difficult problem. Either way we have to ensure that we support reloading after a crash, so we might as well just make that the default.

@TheBlueMatt
Collaborator Author

Discussed this offline for a while with @valentinewallace. In short, I think this doesn't make sense because we should be going in a different direction - solve #661 and the debug-assert failure by not failing back holding-cell HTLCs just because monitor updating has paused or we went to disk, and think harder about how to handle pre-chain-sync load-time HTLC relaying more generally (probably, for now, just say that you shouldn't connect to peers until the chain is synced).
