
Clean up and more liberally free holding cell HTLCs #756

Conversation

TheBlueMatt
Collaborator

This is effectively a superset of #755 (with the same commit) and an alternate approach to #754: instead, we hold onto the holding cell HTLCs and support handling them correctly in response to a channel_reestablish message. This was more of a slap-together thing and still needs testing, but I'm seeking concept ACKs on the restructure and approach (also as a step towards addressing #661).

@TheBlueMatt
Collaborator Author

@valentinewallace are you happier with this?

@TheBlueMatt TheBlueMatt added this to the 0.0.13 milestone Nov 21, 2020
@TheBlueMatt TheBlueMatt force-pushed the 2020-11-holding-cell-clear-mon-upd branch 2 times, most recently from 4a8bec0 to 2eb9070 Compare November 24, 2020 01:20
@valentinewallace
Contributor

valentinewallace commented Nov 28, 2020

In terms of:

an alternate approach to #754

If it would be fair to summarize this approach as "don't fail back holding cell add HTLCs on disconnect" (aka mainly just 54a601c), then I'm concept ACK. I still have to review the other parts of the PR in detail, but that LGTM approach-wise!

Edit: fwiw, lnd also doesn't fail back HTLCs on disconnect afaict.

@TheBlueMatt
Collaborator Author

If it would be fair to summarize this approach as: "don't fail back holding cell add HTLCs on disconnect,"

Correct.

then I'm conceptACK

Alright, I'll try to clean this up and get the macros to build on pre-NLL rust.

Edit: fwiw, lnd also doesn't fail back HTLCs on disconnect afaict.

Right, I presume most others don't. Of course, this is just on holding-cell stuff, so it's a really strange edge case anyway.

@TheBlueMatt TheBlueMatt force-pushed the 2020-11-holding-cell-clear-mon-upd branch 2 times, most recently from f31497a to 5552b67 Compare December 15, 2020 22:00
@TheBlueMatt TheBlueMatt marked this pull request as ready for review December 15, 2020 22:00
@TheBlueMatt
Collaborator Author

This has been updated to build on pre-NLL rust and rebased.

@codecov

codecov bot commented Dec 15, 2020

Codecov Report

Merging #756 (7f7fee5) into main (beb88e6) will increase coverage by 1.49%.
The diff coverage is 92.85%.


@@            Coverage Diff             @@
##             main     #756      +/-   ##
==========================================
+ Coverage   91.00%   92.50%   +1.49%     
==========================================
  Files          48       45       -3     
  Lines       25483    30571    +5088     
==========================================
+ Hits        23192    28279    +5087     
- Misses       2291     2292       +1     
Impacted Files Coverage Δ
lightning/src/ln/functional_test_utils.rs 94.97% <ø> (ø)
lightning/src/ln/channel.rs 92.13% <92.59%> (+4.32%) ⬆️
lightning/src/ln/channelmanager.rs 89.64% <93.93%> (+4.40%) ⬆️
lightning-block-sync/src/lib.rs 0.00% <0.00%> (-95.38%) ⬇️
lightning/src/ln/wire.rs 59.24% <0.00%> (-4.87%) ⬇️
lightning/src/util/ser.rs 90.32% <0.00%> (-0.30%) ⬇️
lightning/src/ln/features.rs 98.76% <0.00%> (-0.08%) ⬇️
lightning/src/chain/mod.rs 100.00% <0.00%> (ø)
lightning/src/util/errors.rs 71.42% <0.00%> (ø)
lightning/src/chain/chainmonitor.rs 94.11% <0.00%> (ø)
... and 20 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@TheBlueMatt TheBlueMatt force-pushed the 2020-11-holding-cell-clear-mon-upd branch from 5552b67 to 393a420 Compare January 27, 2021 19:16
@TheBlueMatt
Collaborator Author

Rebased without changes (just resolving conflicts).

lightning/src/ln/channelmanager.rs (outdated review thread, resolved)
lightning/src/ln/channel.rs (outdated review thread, resolved)
Comment on lines 2300 to 2367
/// In some cases, this may generate a monitor update, resulting in a call to the
/// `chain::Watch`'s `update_channel` method for the same channel monitor which is being
/// notified of a successful update here. Because of this, please be very careful with
/// reentrancy bugs! It is incredibly easy to write an implementation of `update_channel` which
/// will take a lock which is also held when calling this method.
///
Contributor

Is this situation any different from other pub methods that call update_channel?

Collaborator Author

Only in the sense that we're updating precisely the ChannelMonitor (via update_channel) that we're being notified has been successfully updated. I'm not quite sure how to better capture this in the docs here, but it's rather easy to hit this case (as we do in tests), though only rarely in practice, because the relevant callbacks are themselves rare.

struct TestChainMonitor {
    monitor_state: Mutex<HashMap<OutPoint, StateInformation>>,
}
impl TestChainMonitor {
    // Called by the user's persistence logic once an async monitor update completes.
    fn finish_async_update(&self) {
        let monitor_state = self.monitor_state.lock().unwrap();
        ...
        // May call back into update_channel below for this same monitor, which
        // re-locks monitor_state -- a deadlock if the guard above is still held.
        channel_manager.channel_monitor_updated(...);
    }
}
impl chain::Watch for TestChainMonitor {
    fn update_channel(&self, funding_txo: OutPoint, update: channelmonitor::ChannelMonitorUpdate) -> Result<(), channelmonitor::ChannelMonitorUpdateErr> {
        let this_monitor_state = self.monitor_state.lock().unwrap();
        ...
        start_async_update(update);
        Err(channelmonitor::ChannelMonitorUpdateErr::TemporaryFailure)
    }
    ...
}

lightning/src/ln/channelmanager.rs (outdated review thread, resolved)
Comment on lines -927 to 945
-			if let Some((id, _)) = monitor_b.latest_monitors.lock().unwrap().get(&chan_1_funding) {
-				nodes[1].channel_monitor_updated(&chan_1_funding, *id);
+			{
+				let mon_id = monitor_b.latest_monitors.lock().unwrap().get(&chan_1_funding).map(|(id, _)| *id);
+				if let Some(id) = mon_id {
+					nodes[1].channel_monitor_updated(&chan_1_funding, id);
+				}
 			}
Contributor

Was the lock held outside of this scope and thus causing a deadlock?

Collaborator Author

Yes, latest_monitors is locked again inside update_channel, which can be called eventually by the channel_monitor_updated call.
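
As a minimal standalone sketch of the deadlock shape (the types and methods below are simplified stand-ins, not the fuzzer's actual ones):

use std::collections::HashMap;
use std::sync::Mutex;

struct FakeMonitors {
    // Maps a (stand-in) funding outpoint to the latest monitor update id.
    latest_monitors: Mutex<HashMap<u64, u64>>,
}

impl FakeMonitors {
    // Stand-in for chain::Watch::update_channel: re-locks latest_monitors.
    fn update_channel(&self, funding: u64) {
        let mut mons = self.latest_monitors.lock().unwrap();
        mons.entry(funding).and_modify(|id| *id += 1);
    }

    fn channel_monitor_updated(&self, funding: u64) {
        // Deadlock variant: the temporary guard from lock() lives for the whole
        // `if let` body, so the nested update_channel call re-locks a mutex we
        // already hold:
        //
        //   if let Some(id) = self.latest_monitors.lock().unwrap().get(&funding) {
        //       self.update_channel(funding); // deadlock
        //   }
        //
        // Fixed variant: copy the id out first; the guard is dropped at the end
        // of the `let` statement, before we call back in.
        let mon_id = self.latest_monitors.lock().unwrap().get(&funding).copied();
        if let Some(_id) = mon_id {
            self.update_channel(funding);
        }
    }
}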

@valentinewallace valentinewallace left a comment (Contributor)

I'm conceptACK on this approach, and in favor of opening an issue or so to look into a more events-based approach that's less risky but requires more refactoring (as discussed offline).

It'd be nice to add a test for the fuzz failure if it's not too difficult.

Otherwise, this is looking pretty good to me. Most of the changes are super mechanical.

lightning/src/ln/channelmanager.rs (outdated review thread, resolved)
($self: ident, $locked_res: expr, $pending_failures: expr, $forwarding_failures: expr) => { {
let (htlc_forwards, funding_broadcast_safe, res, channel_id, counterparty_node_id) = $locked_res;

let _ = handle_error!($self, res, counterparty_node_id);
Contributor

This line also adds reentrancy risk, right? (in addition to the reentrancy risk of freeing the holding cell HTLCs)

Collaborator Author

I don't believe so, at least not from the PoV of calling user code - handle_error shouldn't ever call user code directly; it only pushes events onto the message/regular event queues.

Contributor

Oh, it's hard to untangle the macros, but handle_error does call finish_force_close, which looks like it can call update_channel.

Collaborator Author

Ah! I see your point. Tracing it back a bit, I think the only way we have an Err in res here is if we break the handle_chan_restoration_locked loop with a handle_monitor_err result, which happens precisely when the monitor update that we generated failed.

@TheBlueMatt TheBlueMatt force-pushed the 2020-11-holding-cell-clear-mon-upd branch from ed1194c to 7f7fee5 Compare February 21, 2021 02:05
@TheBlueMatt
Collaborator Author

I'll finish working on some tests today, given it seems likely we'll move in this direction, at least in the short term.

@TheBlueMatt TheBlueMatt modified the milestones: 0.0.13, 0.0.14 Feb 26, 2021
The channel restoration code in channel monitor updating and peer
reconnection both do incredibly similar things, and there is little
reason to have them be separate. Sadly, because they require holding
a lock with a reference to elements in the lock, it's not practical
to make them utility functions, so instead we introduce a two-step
macro here which will eventually be used for both.

Because we still support pre-NLL Rust, the macro has to be in two
parts - one which runs with the channel_state lock held, and one
which does not.
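
A rough sketch of the two-part shape (the macro bodies and names here are purely illustrative, not the PR's actual macros):

macro_rules! handle_restoration_locked {
    ($channel_state: expr) => {{
        // Step one: runs while the channel_state lock is held and returns only
        // owned data, so nothing borrowing from the locked state escapes the
        // lock scope.
        $channel_state.len()
    }};
}

macro_rules! post_handle_restoration {
    ($locked_res: expr) => {{
        // Step two: runs after the lock guard has been dropped, so it can call
        // into code that may itself want the channel_state lock.
        let pending = $locked_res;
        println!("{} entries to act on now that the lock is released", pending);
    }};
}

fn restore(channel_state: &std::sync::Mutex<Vec<u8>>) {
    let locked_res = {
        let guard = channel_state.lock().unwrap();
        handle_restoration_locked!(guard)
    }; // guard dropped here
    post_handle_restoration!(locked_res);
}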
This mostly swaps some Vecs that can only ever contain one element
for Options.
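
For instance (hypothetical names, just to show the shape of the change):

struct FundingBroadcastSafeEvent;

// Before: a Vec that, by construction, never holds more than one element.
struct RestorationUpdatesOld {
    funding_broadcast_safe: Vec<FundingBroadcastSafeEvent>,
}

// After: an Option makes the zero-or-one invariant explicit in the type.
struct RestorationUpdatesNew {
    funding_broadcast_safe: Option<FundingBroadcastSafeEvent>,
}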
If there are no pending channel update messages when monitor updating
is restored (though there may be an RAA to send), and we're
connected to our peer and not awaiting a remote RAA, we need to
free anything in our holding cell.

Without this, chanmon_fail_consistency was able to find a stuck
condition where we sit on an HTLC failure in our holding cell and
don't ever handle it (at least until we have other actions to take
which empty the holding cell).

Still, this approach sucks - it introduces reentrancy in a
particularly dangerous form:
 a) we re-enter user code around monitor updates while being called
    from user code around monitor updates, making deadlocks very
    likely (in fact, our current tests have a bug here!),
 b) the re-entrancy only occurs in a very rare case, making it
    likely users will not hit it in testing, only deadlocking in
    production.
I'm not entirely sure what the alternative is, however - we could
move to a world where we poll for holding cell events that can be
freed on our 1-minute-timer, but we still have a super rare
reentrancy case, just in timer_chan_freshness_every_min() instead.
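
A rough sketch of the freeing condition from the first paragraph above (the parameter names are illustrative, not the actual Channel fields):

struct CommitmentUpdate;

fn should_free_holding_cell(
    commitment_update: &Option<CommitmentUpdate>,
    peer_connected: bool,
    awaiting_remote_raa: bool,
) -> bool {
    // No commitment update already queued to send, the peer is connected, and
    // we are not waiting on the counterparty's revoke_and_ack.
    commitment_update.is_none() && peer_connected && !awaiting_remote_raa
}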
This merges the channel-restoration code shared between the
channel-monitor-updating-restored path and the peer-reconnection
path, reducing redundant code.
Previously, if we got a temporary monitor update failure while
there were HTLCs pending forwarding in the holding cell, we'd clear
them and fail them all backwards. This makes sense if temporary
failures are rare, but in an async environment, temporary monitor
update failures may be the normal case. In such a world, this
results in potentially a lot of spurious HTLC forwarding failures
(which is the topic of lightningdevkit#661).
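
A simplified before/after sketch of that behavior change (the holding-cell representation here is a stand-in, not the real Channel type):

struct AddHTLC { htlc_id: u64 }
struct FailureReason;

struct HoldingCell {
    pending_adds: Vec<AddHTLC>,
}

impl HoldingCell {
    // Old behavior on a temporary monitor update failure: drain the holding
    // cell and fail every pending forward back to the previous hop.
    fn on_temporary_monitor_failure_old(&mut self) -> Vec<(AddHTLC, FailureReason)> {
        self.pending_adds.drain(..).map(|add| (add, FailureReason)).collect()
    }

    // New behavior: keep the adds queued; they are freed once the monitor
    // update completes (or after reconnection), avoiding spurious forwarding
    // failures when temporary failures are the common case.
    fn on_temporary_monitor_failure_new(&mut self) -> Vec<(AddHTLC, FailureReason)> {
        Vec::new()
    }
}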
As pointed out by Jeff, using a return struct instead of an
incredibly-long tuple improves readability in several places.
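
For example (the field names here are hypothetical, roughly matching the kinds of values being passed around in this PR):

struct RevokeAndACK;
struct CommitmentUpdate;
struct FundingBroadcastSafeEvent;

// Instead of returning a long positional tuple such as
//   (Option<RevokeAndACK>, Option<CommitmentUpdate>, Option<FundingBroadcastSafeEvent>, [u8; 32])
// a named struct documents what each piece means at every call site.
struct ChannelRestorationUpdates {
    raa: Option<RevokeAndACK>,
    commitment_update: Option<CommitmentUpdate>,
    funding_broadcast_safe: Option<FundingBroadcastSafeEvent>,
    channel_id: [u8; 32],
}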
60d83ef introduced reentrancy when
calling channel_monitor_updated. This commit fixes the
chanmon_consistency fuzzer to no longer deadlock as a result of this
reentrancy.
Our fuzz tests previously only printed the log output of the first
fuzz test case to fail. This commit changes that (with lots of
auto-generated updates) to ensure we print all log outputs.
These should never appear, so consider them a fuzzer fail case.
Because the channel restoration code for peer reconnection and
channel monitor updating has been merged, we now sometimes generate
(somewhat spurious) announcement signatures when restoring channel
monitor updating. This should not result in a fuzzing failure.
If the fuzz target is failing due to a channel force-close, the
immediately-visible error is that we're signing a stale state. This
is because the ChannelMonitorUpdateStep::ChannelForceClosed event
results in a signature in the test clone which was deserialized
using an OnlyReadsKeysInterface. Instead, we need to deserialize
using the full KeysInterface instance.
@TheBlueMatt TheBlueMatt force-pushed the 2020-11-holding-cell-clear-mon-upd branch from 7f7fee5 to 76db5db Compare March 1, 2021 02:04
@TheBlueMatt
Collaborator Author

Added a test which I think demonstrates the specific issue well (while also getting good coverage).


if commitment_update.is_none() && self.channel_state & (ChannelState::AwaitingRemoteRevoke as u32) == 0 {
order = RAACommitmentOrder::RevokeAndACKFirst;

let (update_opt, mut failed_htlcs) = self.free_holding_cell_htlcs(logger).unwrap();
Contributor

Since we now have the whole process_background_events thing, I wonder if it'd ease the reentrancy concerns a bit to shift this call over there?

Collaborator Author

I'm not a fan of using that for this because it's hooked on the one-minute timer, which may take a while... However, I think it makes sense in get_and_clear_pending_msg_events. Going to open a new PR with that.
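
A rough sketch of that follow-up idea (the manager type and helper below are hypothetical, not the actual ChannelManager API):

struct MessageSendEvent;

struct FakeChannelManager {
    pending_msg_events: Vec<MessageSendEvent>,
}

impl FakeChannelManager {
    // Hypothetical helper: try to free each channel's holding cell, pushing any
    // resulting update messages onto pending_msg_events.
    fn check_free_holding_cells(&mut self) {
        // ... per-channel holding-cell processing would go here ...
    }

    // Freeing holding cells here means freed HTLCs go out with the very next
    // message fetch instead of waiting on the one-minute background timer.
    fn get_and_clear_pending_msg_events(&mut self) -> Vec<MessageSendEvent> {
        self.check_free_holding_cells();
        std::mem::take(&mut self.pending_msg_events)
    }
}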
