Clean up and more liberally free holding cell HTLCs (without re-entrancy) #851
Conversation
Codecov Report

|          | main   | #851   | +/-    |
|----------|--------|--------|--------|
| Coverage | 90.48% | 90.56% | +0.08% |
| Files    | 59     | 59     |        |
| Lines    | 29896  | 30015  | +119   |
| Hits     | 27050  | 27184  | +134   |
| Misses   | 2846   | 2831   | -15    |

Continue to review full report at Codecov.
order = RAACommitmentOrder::RevokeAndACKFirst;
}
return_monitor_err!(self, e, channel_state, chan, order, revoke_and_ack.is_some(), commitment_update.is_some());
//TODO: Resend the funding_locked if needed once we get the monitor running again
This todo doesn't need to be migrated to the new code, right?
Hmm, that's a good question! I don't think it matters - this PR switched the ordering of message resending to resend the funding_locked before handling the monitor update, meaning we'll never need to resend it because we always send it (instead of maybe not sending it if the monitor update fails).
The "don't send a funding_locked if we're paused pending a monitor update" thing isn't really a security thing (we just expose a next_per_commitment_point, which doesn't allow our counterparty to... do anything); it's just kind of nice not to tell our counterparty we're ready to accept HTLCs when we're definitely not. That said, Channel::channel_reestablish doesn't really handle it all that carefully anyway, and we may end up re-sending a funding_locked on reconnect while we are still pending a monitor update failure.
Force-pushed from 0993c08 to b14ea14.
Nothing major. This looks pretty reasonable to me!
if self.channel_state >= ChannelState::ChannelFunded as u32 &&
	(self.channel_state & (ChannelState::AwaitingRemoteRevoke as u32 | ChannelState::PeerDisconnected as u32 | ChannelState::MonitorUpdateFailed as u32)) == 0 {
Why not just move these checks into `free_holding_cell(..)`? Also, `revoke_and_ack` currently calls the regular `free_holding_cell_htlcs(..)`, wonder if it should call `maybe_free` instead?
Hmm, we already check all the preconditions at the callsites. I suppose we could move to just calling `maybe_...`, but that seems more brittle, given we already check them.
lightning/src/ln/channelmanager.rs
Outdated
if let Err(e) = self.chain_monitor.update_channel(chan.get_funding_txo().unwrap(), monitor_update) {
	let (res, drop) = handle_monitor_err!(self, e, short_to_id, chan, RAACommitmentOrder::CommitmentFirst, false, true, Vec::new(), Vec::new(), channel_id);
	handle_errors.push((chan.get_counterparty_node_id(), res));
	if drop { return false; }
Hmm, so `drop == true` implies that we don't need to push the `UpdateHTLCs` event below? Maybe could use a comment for why.
I renamed `drop` to `close_channel`, is that clearer?
}

for (counterparty_node_id, err) in handle_errors.drain(..) {
	let _ = handle_error!(self, err, counterparty_node_id);
I don't think it matters, but if we `Permanent`ly failed to `update_channel` above, we may still attempt another `update_channel` here.
Hmm, right, that's awkward, but I think it's also globally true - we always generate a `ChannelForceClosed` update for the monitor, even if the user can't persist it.
// disconnect, we may be handed some HTLCs to fail backwards here.
assert!(htlcs_to_fail.is_empty());
return Ok((resend_funding_locked, required_revoke, Some(commitment_update), Some(monitor_update), self.resend_order.clone(), shutdown_msg));
return Ok((resend_funding_locked, required_revoke, Some(commitment_update), Some(monitor_update), self.resend_order.clone(), htlcs_to_fail, shutdown_msg));
More of a general point: might we want to think about how this logic fits into "cancelling payments" in the future? Because I could see a situation where a user attempts a payment which gets stuck in the holding cell, and they'd prefer to attempt the payment again by a different route and cancel the existing one.
Hmm, possibly, but I'm not very worried. The holding cell is only really intended for things that are just pending some network latency - you should update a channel monitor or receive a revoke_and_ack from your peer within some hundreds of milliseconds, at most - not for things that are going to take a while - i.e. it's not (really) for when your counterparty is offline.
Force-pushed from b14ea14 to af14d44.
Force-pushed from af14d44 to 386cc60.
Addressed Val's comments, full diff is:
Force-pushed from 386cc60 to 0df3461.
Rebased on latest upstream.
Force-pushed from 478044e to 72b730c.
Force-pushed from 72b730c to ffbe950.
Rebased on upstream, with pretty trivial changes.
Looks like this is slipping to 0.0.15. It would be good to land some of these fixes, as they address potential bugs, even if very rare.
Reviewed up to and including e33012f so far.
lightning/src/ln/channelmanager.rs
Outdated
}

let chanmon_update: Option<ChannelMonitorUpdate> = $chanmon_update; // Force type-checking to resolve
if chanmon_update.is_some() {
	// On reconnect (or monitor restoration), we, by definition, only resend a
I think mentioning monitor restoration (`monitor_updating_restored`?) is confusing, as this function never returns a monitor update.
I would remove it and instead, in the `$funding_locked` branch, say something like: "at monitor restoration, we might resend funding_locked to our counterparty, but this event doesn't generate a channel monitor update itself".
Hmm, you're right. I removed the reference to monitor restoration, but I'm not sure your comment communicates enough information to re-analyze the situation later.
Hmmmm "In case of monitor restoration, funding_locked will never be generated as this event isn't altering monitor state"?
lightning/src/ln/channelmanager.rs
Outdated
macro_rules! handle_cs { () => {
	if let Some(monitor_update) = chanmon_update {
		assert!($order == RAACommitmentOrder::RevokeAndACKFirst);
I can't make up my mind about the correctness of this assertion. It implies the path in `Channel::channel_reestablish` L3023, where we return HTLC failures to route backward included in a new commitment transaction? That doesn't necessarily mean the peer sent us a commitment_signed for which we should send an RAA.
Yeah, I can't convince myself it's safe anymore, so I walked this bit back.
Force-pushed from 0b365ec to 1005f07.
Rebased on latest upstream; relevant changes responding to comments are in fixup commits.
Left a few more comments; the top commits sound good to me, but I want to do a second pass once the comments are fixed.
lightning/src/ln/channelmanager.rs
Outdated
match chan.maybe_free_holding_cell_htlcs(&self.logger) {
	Ok((None, ref htlcs)) if htlcs.is_empty() => true,
	Ok((commitment_opt, failed_htlcs)) => {
		forwarding_failed_htlcs.push((failed_htlcs, *channel_id));
I think it should be `backwarding_failed_htlcs` here? Per the `fail_holding_cell_htlcs` comment, those HTLCs are designated as "The HTLCs need to be failed backwards or, if they were one of our outgoing HTLCs".
Yeah, likely. We should give the whole HTLC routing section a brush-up at some point, with consistent naming and better comments.
Yeah, that's confusing - the name meant "HTLCs which failed during forwarding", but that wasn't clear.
Force-pushed from 9b1aa05 to 7eaca87.
I believe I've addressed all the comments.
The channel restoration code in channel monitor updating and peer reconnection both do incredibly similar things, and there is little reason to have them be separate. Sadly, because they require holding a lock with a reference to elements in the lock, it's not practical to make them utility functions, so instead we introduce a two-step macro here which will eventually be used for both. Because we still support pre-NLL Rust, the macro has to be in two parts - one which runs with the channel_state lock, and one which does not.
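A minimal sketch of that two-step shape may help. This is hypothetical illustration code, not the actual rust-lightning macros: the names `prepare_restoration!` / `finish_restoration!` and the toy `ChannelState` are stand-ins. The first macro runs while the lock is held and only extracts data; the second runs after the guard is dropped.

```rust
use std::sync::Mutex;

struct ChannelState {
	pending_msgs: Vec<String>,
}

// Step 1: runs while the channel_state lock is held and only extracts the data
// we need; the MutexGuard is dropped when this block ends.
macro_rules! prepare_restoration {
	($state_lock:expr) => {{
		let mut state = $state_lock.lock().unwrap();
		state.pending_msgs.drain(..).collect::<Vec<_>>()
	}};
}

// Step 2: runs after the lock is released, so it may safely call into code
// that could take the same lock (or into user-provided callbacks).
macro_rules! finish_restoration {
	($msgs:expr) => {{
		for msg in $msgs {
			println!("resending: {}", msg);
		}
	}};
}

fn main() {
	let channel_state = Mutex::new(ChannelState {
		pending_msgs: vec!["funding_locked".to_string()],
	});
	let msgs = prepare_restoration!(channel_state); // lock taken and dropped inside
	finish_restoration!(msgs); // no lock held here
}
```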
This mostly swaps some Vecs that can only ever contain one element for Options.
This makes chanmon_consistency fail on IgnoreError error events and on messages left over to be sent to a just-disconnected peer, which should have been drained. These should never appear, so we consider them fuzzer failure cases.
Because of the merge between the peer reconnection and channel monitor updating channel restoration code, we now sometimes generate (somewhat spurious) announcement signatures when monitor updating is restored. This should not result in a fuzzing failure.
Force-pushed from 365cee1 to 7fde891.
Rebased with no changes aside from two new fixup commits to respond to Jeff's comments.
Looks good to me, but I'd be more comfortable if @ariard gave it a look before merging.
This merges the code for restoring channel functionality between channel monitor updating restored and peer reconnection, reducing redundant code.
Both break_chan_entry and try_chan_entry do almost identical work, only differing in whether they `break` or `return` in response to an error. Because we will now also need an option to do neither, we break out the common code into a shared `convert_chan_err` macro.
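As a rough illustration of that factoring (simplified types and hypothetical macro names, not the actual `ChannelManager` code), the shared conversion lives in one macro while thin wrappers decide whether to `break` or `return`:

```rust
// Shared conversion logic, used by both wrappers below.
macro_rules! convert_err {
	($e:expr) => {
		format!("channel error: {}", $e)
	};
}

// Breaks out of the enclosing `loop` with an Err, so cleanup code after the
// loop still runs.
macro_rules! break_on_err {
	($res:expr) => {
		match $res {
			Ok(v) => v,
			Err(e) => break Err(convert_err!(e)),
		}
	};
}

// Returns early from the enclosing function with an Err.
macro_rules! return_on_err {
	($res:expr) => {
		match $res {
			Ok(v) => v,
			Err(e) => return Err(convert_err!(e)),
		}
	};
}

fn sign_commitment(ok: bool) -> Result<u32, &'static str> {
	if ok { Ok(42) } else { Err("bad signature") }
}

fn run(ok: bool) -> Result<u32, String> {
	// Use the breaking variant where we want to fall through to shared
	// cleanup after the loop rather than returning immediately.
	let sig = loop {
		let s = break_on_err!(sign_commitment(ok));
		break Ok(s);
	}?;
	// Use the returning variant where an early return is fine.
	let again = return_on_err!(sign_commitment(ok));
	Ok(sig + again)
}

fn main() {
	println!("{:?}", run(true));  // Ok(84)
	println!("{:?}", run(false)); // Err("channel error: bad signature")
}
```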
If there are no pending channel update messages when monitor updating is restored (though there may be an RAA to send), and we're connected to our peer and not awaiting a remote RAA, we need to free anything in our holding cell. However, we don't want to immediately free the holding cell during channel_monitor_updated, as it presents a somewhat bug-prone case of reentrancy: a) it would re-enter user code around a monitor update while being called from user code notifying us of the same monitor being updated, making deadlocks very likely (in fact, our fuzzers would have a bug here!), and b) the re-entrancy only occurs in a very rare case, making it likely users will not hit it in testing, only deadlocking in production.

Thus, we add a holding-cell-free pass over each channel in get_and_clear_pending_msg_events. This fits nicely with the anticipated bug - users almost certainly need to process new network messages immediately after monitor updating has been restored in order to send the messages which were not sent while monitor updating was paused. Without this, chanmon_fail_consistency was able to find a stuck condition where we sit on an HTLC failure in our holding cell and never handle it (at least until we have other actions to take which empty the holding cell).
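To make the resulting flow concrete, here is a heavily simplified sketch (hypothetical `Manager`/`Channel` types, not the real `ChannelManager` internals) of a holding-cell-free pass running at the top of the message-fetch call, so nothing re-enters user code from within a monitor-update notification:

```rust
use std::sync::Mutex;

#[derive(Debug)]
struct MsgEvent(String);

struct Channel {
	holding_cell: Vec<String>, // HTLC updates queued while we couldn't send
	can_send: bool,            // e.g. connected, not awaiting an RAA or a monitor update
}

struct Manager {
	channels: Mutex<Vec<Channel>>,
}

impl Manager {
	fn get_and_clear_pending_msg_events(&self) -> Vec<MsgEvent> {
		let mut events = Vec::new();
		let mut channels = self.channels.lock().unwrap();
		for chan in channels.iter_mut() {
			// Holding-cell-free pass: generate the pending updates now, while we're
			// already inside a user-initiated message fetch, instead of re-entrantly
			// from a monitor-update notification.
			if chan.can_send && !chan.holding_cell.is_empty() {
				for htlc in chan.holding_cell.drain(..) {
					events.push(MsgEvent(format!("UpdateHTLCs: {}", htlc)));
				}
			}
		}
		events
	}
}

fn main() {
	let mgr = Manager {
		channels: Mutex::new(vec![Channel {
			holding_cell: vec!["htlc-fail #1".to_string()],
			can_send: true,
		}]),
	};
	// Users call this right after a monitor update completes anyway, so the
	// queued HTLC updates get flushed without any re-entrancy.
	for ev in mgr.get_and_clear_pending_msg_events() {
		println!("{:?}", ev);
	}
}
```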
Because we may now generate a monitor update during get_and_clear_pending_msg_events calls, we need to ensure we re-serialize the relevant ChannelManager before attempting to reload it, if such a monitor update occurred.
Previously, if we got disconnected from a peer while there were HTLCs pending forwarding in the holding cell, we'd clear them and fail them all backwards. This is largely fine, but since we now have support for handling such HTLCs on reconnect, we might as well not do so, instead relying on our timeout logic to fail them backwards if it takes too long to forward them.
If the fuzz target is failing due to a channel force-close, the immediately-visible error is that we're signing a stale state. This is because the ChannelMonitorUpdateStep::ChannelForceClosed event results in a signature in the test clone which was deserialized using an OnlyReadsKeysInterface. Instead, we need to deserialize using the full KeysInterface instance.
Force-pushed from 7fde891 to 34fcd99.
Squashed and renamed one variable as @jkczyz suggested:
Gonna go ahead and take this since it's been a long time. Post-merge review from @ariard would also be welcome, but I'd rather not wait on it if he's busy.
We use `Channel::is_live()` to gate inclusion of a channel in `ChannelManager::list_usable_channels()` and when sending an HTLC to select whether a channel is available for forwarding through/sending to. In both of these cases, we almost certainly want `Channel::is_live()` to include channels which are simply pending a monitor update, as some clients may update monitors asynchronously, and thus rejecting HTLCs based on a still-pending monitor update risks a race condition. After lightningdevkit#851, we always ensure any holding cells are free'd when sending P2P messages, making this much more trivially correct - instead of having to ensure that we always have a matching holding cell free any time we add something to the holding cell, we can simply rely on the fact that it always happens. Fixes lightningdevkit#661.
We use `Channel::is_live()` to gate inclusion of a channel in `ChannelManager::list_usable_channels()` and when sending an HTLC to select whether a channel is available for forwarding through/sending to. In both of these cases, we should consider a channel `is_live()` even when it is pending a monitor update. Some clients may update monitors asynchronously, thus we may simply be waiting a short duration for a monitor update to complete, and shouldn't fail all forwarding HTLCs during that time. After lightningdevkit#851, we always ensure any holding cells are free'd when sending P2P messages, making this change much more trivially correct - instead of having to ensure that we always free the holding cell when a channel becomes live again after adding something to the holding cell, we can simply rely on the fact that it always happens. Fixes lightningdevkit#661.
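A tiny sketch of the `is_live()` change those commit messages describe, using hypothetical bitflag constants and a toy `Channel` rather than the real channel state machine - the point being that a pending monitor update no longer excludes the channel, since anything queued meanwhile just sits in the holding cell that #851 guarantees will be freed:

```rust
struct ChannelState(u32);

const PEER_DISCONNECTED: u32 = 1 << 0;
const MONITOR_UPDATE_FAILED: u32 = 1 << 1;

struct Channel {
	state: ChannelState,
	funded: bool,
}

impl Channel {
	// Before: a pending monitor update made the channel "not live", so it dropped
	// out of list_usable_channels() and HTLC sends through it were rejected.
	fn is_live_old(&self) -> bool {
		self.funded && (self.state.0 & (PEER_DISCONNECTED | MONITOR_UPDATE_FAILED)) == 0
	}

	// After: only a disconnected peer makes the channel not-live; while a monitor
	// update is pending, new HTLCs simply wait in the holding cell.
	fn is_live_new(&self) -> bool {
		self.funded && (self.state.0 & PEER_DISCONNECTED) == 0
	}
}

fn main() {
	let chan = Channel { state: ChannelState(MONITOR_UPDATE_FAILED), funded: true };
	println!("old: {}, new: {}", chan.is_live_old(), chan.is_live_new()); // old: false, new: true
}
```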
This is #756, reworked a decent chunk. In #756 the reentrancy story around `channel_monitor_updated` scared me significantly, but for some reason I apparently never bothered to really think deeply about the options. Here we move the holding cell free to `get_and_clear_pending_msg_events`, which is a super obvious location for it - it should be called right after `channel_monitor_updated` in most cases, but it shouldn't have the same reentrancy issues because it already can generate a monitor update. It requires some macro rework, but it's not too bad.