
Fail holding-cell AddHTLCs on Channel deser to match disconnection #754

Conversation

TheBlueMatt
Collaborator

Note that this bug was found thanks to the changes in #753, but it's reasonable to take this by itself for 0.0.12 without #753.

As Channel::write says in the comment at the top: "we write out as
if remove_uncommitted_htlcs_and_mark_paused had just been called",
except that we previously deliberately included holding-cell
AddHTLC events in the serialization. On the flip side, in
remove_uncommitted_htlcs_and_mark_paused, we removed pending
AddHTLC events under the assumption that, if we can't forward
something ASAP, it's better to fail it back to the origin than to
sit on it for a while.

Given there's likely to be just as large a time-lag between
ser/deserialization as between when a peer dis/reconnects, there
isn't much of a reason for this difference. Worse, we debug_assert
that there are no pending AddHTLC holding cell events when doing a
reconnect, so any tests or fuzzers which deserialized a
ChannelManager with AddHTLC events would panic.

We resolve this by adding logic to fail any holding-cell AddHTLC
events upon deserialization, in part because trying to forward them
before we're sure we have an up-to-date chain is somewhat risky -
the sender may have already gone to chain while our upstream has
not.
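In code terms, the change amounts to draining queued AddHTLCs out of the holding cell while rebuilding a Channel from disk and handing them back so the caller can fail them toward the origin, just as remove_uncommitted_htlcs_and_mark_paused already does on disconnect. Below is a minimal, self-contained sketch of that shape; the types and the drain_holding_cell_adds helper are simplified stand-ins for illustration, not rust-lightning's actual Channel internals:

```rust
// Simplified stand-in types: model a channel's holding cell and drop its
// queued AddHTLCs when the channel is reconstructed from disk, mirroring the
// behavior on peer disconnection.

#[derive(Debug)]
enum HoldingCellUpdate {
    AddHTLC { payment_hash: [u8; 32], amount_msat: u64 },
    FailHTLC { htlc_id: u64 },
}

struct Channel {
    holding_cell: Vec<HoldingCellUpdate>,
}

impl Channel {
    /// Called right after deserialization: remove queued AddHTLCs and return
    /// their payment hashes so the caller can fail them back to the origin.
    /// Fails/fulfills stay queued - those must still be delivered later.
    fn drain_holding_cell_adds(&mut self) -> Vec<[u8; 32]> {
        let mut failed = Vec::new();
        self.holding_cell.retain(|upd| match upd {
            HoldingCellUpdate::AddHTLC { payment_hash, .. } => {
                // Better to fail the forward back now than to sit on it until
                // we're sure our chain view is current.
                failed.push(*payment_hash);
                false
            }
            _ => true,
        });
        failed
    }
}

fn main() {
    let mut chan = Channel {
        holding_cell: vec![
            HoldingCellUpdate::AddHTLC { payment_hash: [1; 32], amount_msat: 10_000 },
            HoldingCellUpdate::FailHTLC { htlc_id: 7 },
        ],
    };
    let to_fail_back = chan.drain_holding_cell_adds();
    assert_eq!(to_fail_back.len(), 1); // the queued forward gets failed back
    assert_eq!(chan.holding_cell.len(), 1); // the FailHTLC stays queued
    println!("failing {} HTLC(s) back to origin", to_fail_back.len());
}
```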

TheBlueMatt modified the milestone: 0.0.12 on Nov 19, 2020
@TheBlueMatt
Collaborator Author

Note that I think we only need to take this for 0.0.12 because of the risk of forwarding something when we don't have a current chain - the assertion is only a debug assertion, and as far as I can tell the behavior even when hitting it is correct.

TheBlueMatt force-pushed the 2020-11-holding-cell-add-panic branch from 4956ad6 to 1ecee17 on November 19, 2020 at 01:23
@codecov

codecov bot commented Nov 19, 2020

Codecov Report

Merging #754 (728ead6) into main (4e82003) will increase coverage by 0.04%.
The diff coverage is 93.85%.


@@            Coverage Diff             @@
##             main     #754      +/-   ##
==========================================
+ Coverage   91.45%   91.49%   +0.04%     
==========================================
  Files          37       37              
  Lines       22249    22340      +91     
==========================================
+ Hits        20347    20440      +93     
+ Misses       1902     1900       -2     
Impacted Files                                  Coverage Δ
lightning/src/ln/channel.rs                     88.08% <63.63%> (+0.64%) ⬆️
lightning/src/ln/chanmon_update_fail_tests.rs   97.51% <96.70%> (-0.06%) ⬇️
lightning/src/ln/channelmanager.rs              85.41% <100.00%> (+0.05%) ⬆️
lightning/src/ln/functional_tests.rs            96.94% <0.00%> (-0.23%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4e82003...1ecee17.

@valentinewallace
Contributor

valentinewallace commented Nov 20, 2020

In terms of "concept," the one question I have is whether this could lead to unexpected behavior for users. It's not obvious that reading state from disk would lead to failing HTLCs back.

For example, if a user has a method that reads all of RL's info from disk, and they batch ChannelMonitor updates such that they do have to read from disk periodically while the node is still running (inefficient on their part but maybe plausible), then after this PR, HTLCs would be unexpectedly failed backwards while the node is running.

Maybe this PR speaks to a need for ChannelManager to have a shutdown() function, so it'd have the opportunity to fail HTLCs back on shutdown? (then we could also persist to disk one last time on shutdown, which might be nice)

(edit: possibly relevant tweet: https://twitter.com/jaredpalmer/status/1171415929865064449?s=20)

@TheBlueMatt
Collaborator Author

> For example, if a user has a method that reads all of RL's info from disk, and they batch ChannelMonitor updates such that they do have to read from disk periodically while the node is still running (inefficient on their part but maybe plausible), then after this PR, HTLCs would be unexpectedly failed backwards while the node is running.

I'm a little confused by this - you have to provide the ChannelMonitor data to ChannelManager during deserialization, and if a channel isn't in the same state in both, the channel will be closed. Doing some kind of late updates is definitely not allowed by our API.
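To make that ordering rule concrete, here's a rough sketch: monitor state is supplied up front when the manager is read back, and a channel whose monitor is missing or behind has to be closed rather than patched up later. The types and the read_manager function are simplified stand-ins for illustration, not the actual ChannelManagerReadArgs API:

```rust
// Illustrative only: model the "monitors are provided at deserialization
// time" rule. A channel whose monitor is missing or stale is force-closed;
// there is no way to feed in newer monitor state after the fact.
use std::collections::HashMap;

struct ChannelMonitor { latest_update_id: u64 }
struct PersistedChannel { funding_txo: [u8; 36], monitor_update_id: u64 }

fn read_manager(
    channels: Vec<PersistedChannel>,
    monitors: &HashMap<[u8; 36], ChannelMonitor>,
) -> (Vec<PersistedChannel>, Vec<[u8; 36]>) {
    let mut live = Vec::new();
    let mut to_force_close = Vec::new();
    for chan in channels {
        match monitors.get(&chan.funding_txo) {
            // The monitor is at least as new as the manager's view: keep it.
            Some(mon) if mon.latest_update_id >= chan.monitor_update_id => live.push(chan),
            // Missing or stale monitor: the channel must be closed.
            _ => to_force_close.push(chan.funding_txo),
        }
    }
    (live, to_force_close)
}

fn main() {
    let mut monitors = HashMap::new();
    monitors.insert([0u8; 36], ChannelMonitor { latest_update_id: 5 });
    let channels = vec![
        PersistedChannel { funding_txo: [0u8; 36], monitor_update_id: 5 }, // in sync
        PersistedChannel { funding_txo: [1u8; 36], monitor_update_id: 2 }, // no monitor given
    ];
    let (live, closed) = read_manager(channels, &monitors);
    assert_eq!(live.len(), 1);
    assert_eq!(closed.len(), 1);
}
```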

> Maybe this PR speaks to a need for ChannelManager to have a shutdown() function, so it'd have the opportunity to fail HTLCs back on shutdown? (then we could also persist to disk one last time on shutdown, which might be nice)

Hopefully not. I tend to subscribe to the view that "having clean shutdown logic lulls you into a false sense of security" - user apps crash, especially on mobile, and getting a shutdown() function called reliably is a difficult problem. Either way we have to ensure that we support reloading after a crash, so we might as well just make that the default.

@TheBlueMatt
Collaborator Author

Discussed this offline for a while with @valentinewallace. In short, I think this doesn't make sense because we should be going in a different direction - solve #661 and the debug-assert failure by not failing back holding-cell HTLCs just because monitor updating has paused or we went to disk, and think harder about how to handle pre-chain-sync load-time HTLC relaying more generally (probably, for now, just say that you shouldn't connect to peers until the chain is synced).
