
Clean up and more liberally free holding cell HTLCs #756

Conversation

TheBlueMatt
Collaborator

This is effectively a superset of #755 (with the same commit) and an alternate approach to #754: instead, we hold onto the holding cell HTLCs and support handling them correctly in response to a channel_reestablish message. This was more of a slap-together thing and still needs testing, but I'm seeking concept ACKs on the restructure and approach (also as a step towards addressing #661).

@TheBlueMatt
Collaborator Author

@valentinewallace are you happier with this?

@TheBlueMatt TheBlueMatt added this to the 0.0.13 milestone Nov 21, 2020
@TheBlueMatt TheBlueMatt force-pushed the 2020-11-holding-cell-clear-mon-upd branch 2 times, most recently from 4a8bec0 to 2eb9070 Compare November 24, 2020 01:20
@valentinewallace
Contributor

valentinewallace commented Nov 28, 2020

In terms of:

an alternate approach to #754

If it would be fair to summarize this approach as "don't fail back holding cell add HTLCs on disconnect" (aka mainly just 54a601c), then I'm concept ACK. I still have to review the other parts of the PR in detail, but that LGTM approach-wise!

Edit: fwiw, lnd also doesn't fail back HTLCs on disconnect afaict.

@TheBlueMatt
Collaborator Author

If it would be fair to summarize this approach as: "don't fail back holding cell add HTLCs on disconnect,"

Correct.

then I'm conceptACK

Alright, I'll try to clean this up and get the macros to build on pre-NLL rust.

Edit: fwiw, lnd also doesn't fail back HTLCs on disconnect afaict.

Right, I presume most others don't. Of course, this is just on holding-cell stuff, so it's a really strange edge case anyway.

@TheBlueMatt TheBlueMatt force-pushed the 2020-11-holding-cell-clear-mon-upd branch 2 times, most recently from f31497a to 5552b67 Compare December 15, 2020 22:00
@TheBlueMatt TheBlueMatt marked this pull request as ready for review December 15, 2020 22:00
@TheBlueMatt
Collaborator Author

This has been updated to build on pre-NLL rust and rebased.

@codecov

codecov bot commented Dec 15, 2020

Codecov Report

Merging #756 (7f7fee5) into main (beb88e6) will increase coverage by 1.49%.
The diff coverage is 92.85%.


@@            Coverage Diff             @@
##             main     #756      +/-   ##
==========================================
+ Coverage   91.00%   92.50%   +1.49%     
==========================================
  Files          48       45       -3     
  Lines       25483    30571    +5088     
==========================================
+ Hits        23192    28279    +5087     
- Misses       2291     2292       +1     
Impacted Files Coverage Δ
lightning/src/ln/functional_test_utils.rs 94.97% <ø> (ø)
lightning/src/ln/channel.rs 92.13% <92.59%> (+4.32%) ⬆️
lightning/src/ln/channelmanager.rs 89.64% <93.93%> (+4.40%) ⬆️
lightning-block-sync/src/lib.rs 0.00% <0.00%> (-95.38%) ⬇️
lightning/src/ln/wire.rs 59.24% <0.00%> (-4.87%) ⬇️
lightning/src/util/ser.rs 90.32% <0.00%> (-0.30%) ⬇️
lightning/src/ln/features.rs 98.76% <0.00%> (-0.08%) ⬇️
lightning/src/chain/mod.rs 100.00% <0.00%> (ø)
lightning/src/util/errors.rs 71.42% <0.00%> (ø)
lightning/src/chain/chainmonitor.rs 94.11% <0.00%> (ø)
... and 20 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@TheBlueMatt TheBlueMatt force-pushed the 2020-11-holding-cell-clear-mon-upd branch from 5552b67 to 393a420 Compare January 27, 2021 19:16
@TheBlueMatt
Collaborator Author

Rebased without changes (just resolving conflicts).

lightning/src/ln/channelmanager.rs (outdated review thread, resolved)
lightning/src/ln/channel.rs (outdated review thread, resolved)
Comment on lines 2300 to 2367
/// In some cases, this may generate a monitor update, resulting in a call to the
/// `chain::Watch`'s `update_channel` method for the same channel monitor which is being
/// notified of a successful update here. Because of this, please be very careful with
/// reentrancy bugs! It is incredibly easy to write an implementation of `update_channel` which
/// will take a lock which is also held when calling this method.
///
Contributor

Is this situation any different from other pub methods that call update_channel?

Collaborator Author

Only in the sense that we're updating precisely the ChannelMonitor (via update_channel) that we're being notified has been successfully updated. I'm not quite sure how to better capture this in the docs here, but it's rather easy to hit this case (as we do in tests), though only rarely in practice, because the relevant callbacks are themselves rare.

struct TestChainMonitor {
    monitor_state: Mutex<HashMap<OutPoint, StateInformation>>,
}
impl TestChainMonitor {
    // Called by the user's persistence logic once an async monitor update completes.
    fn finish_async_update(&self) {
        let monitor_state = self.monitor_state.lock().unwrap();
        ...
        // May call back into update_channel below for this same monitor, which
        // re-locks monitor_state -- a deadlock if the guard above is still held.
        channel_manager.channel_monitor_updated(...);
    }
}
impl chain::Watch for TestChainMonitor {
    fn update_channel(&self, funding_txo: OutPoint, update: channelmonitor::ChannelMonitorUpdate) -> Result<(), channelmonitor::ChannelMonitorUpdateErr> {
        let this_monitor_state = self.monitor_state.lock().unwrap();
        ...
        start_async_update(update);
        Err(channelmonitor::ChannelMonitorUpdateErr::TemporaryFailure)
    }
    ...
}

lightning/src/ln/channelmanager.rs (outdated review thread, resolved)
Comment on lines -927 to 945
-			if let Some((id, _)) = monitor_b.latest_monitors.lock().unwrap().get(&chan_1_funding) {
-				nodes[1].channel_monitor_updated(&chan_1_funding, *id);
+			{
+				let mon_id = monitor_b.latest_monitors.lock().unwrap().get(&chan_1_funding).map(|(id, _)| *id);
+				if let Some(id) = mon_id {
+					nodes[1].channel_monitor_updated(&chan_1_funding, id);
+				}
 			}
Contributor

Was the lock held outside of this scope and thus causing a deadlock?

Collaborator Author

Yes, latest_monitors is locked again inside update_channel, which can be called eventually by the channel_monitor_updated call.
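
As a minimal standalone sketch of the deadlock shape (the types and methods below are simplified stand-ins, not the fuzzer's actual ones):

use std::collections::HashMap;
use std::sync::Mutex;

struct FakeMonitors {
    // Maps a (stand-in) funding outpoint to the latest monitor update id.
    latest_monitors: Mutex<HashMap<u64, u64>>,
}

impl FakeMonitors {
    // Stand-in for chain::Watch::update_channel: re-locks latest_monitors.
    fn update_channel(&self, funding: u64) {
        let mut mons = self.latest_monitors.lock().unwrap();
        mons.entry(funding).and_modify(|id| *id += 1);
    }

    fn channel_monitor_updated(&self, funding: u64) {
        // Deadlock variant: the temporary guard from lock() lives for the whole
        // `if let` body, so the nested update_channel call re-locks a mutex we
        // already hold:
        //
        //   if let Some(id) = self.latest_monitors.lock().unwrap().get(&funding) {
        //       self.update_channel(funding); // deadlock
        //   }
        //
        // Fixed variant: copy the id out first; the guard is dropped at the end
        // of the `let` statement, before we call back in.
        let mon_id = self.latest_monitors.lock().unwrap().get(&funding).copied();
        if let Some(_id) = mon_id {
            self.update_channel(funding);
        }
    }
}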

@valentinewallace valentinewallace left a comment (Contributor)

I'm conceptACK on this approach, and in favor of opening an issue or so to look into a more events-based approach that's less risky but requires more refactoring (as discussed offline).

It'd be nice to add a test for the fuzz failure if it's not too difficult.

Otherwise, this is looking pretty good to me. Most of the changes are super mechanical.

lightning/src/ln/channelmanager.rs (outdated review thread, resolved)
($self: ident, $locked_res: expr, $pending_failures: expr, $forwarding_failures: expr) => { {
let (htlc_forwards, funding_broadcast_safe, res, channel_id, counterparty_node_id) = $locked_res;

let _ = handle_error!($self, res, counterparty_node_id);
Contributor

This line also adds reentrancy risk, right? (in addition to the reentrancy risk of freeing the holding cell HTLCs)

Collaborator Author

I don't believe so, at least not from the PoV of calling user code - handle_error shouldn't ever call user code directly; it only pushes events onto the message/regular event queues.

Contributor

Oh, it's hard to untangle the macros, but handle_error does call finish_force_close, which looks like it can call update_channel.

Collaborator Author

Ah! I see your point. Tracing it back a bit, I think the only way we have an Err in res here is if we break the handle_chan_restoration_locked loop with a handle_monitor_err result, which happens precisely when the monitor update that we generated failed.

@TheBlueMatt TheBlueMatt force-pushed the 2020-11-holding-cell-clear-mon-upd branch from ed1194c to 7f7fee5 Compare February 21, 2021 02:05
@TheBlueMatt
Collaborator Author

I'll finish working on some tests today, given it seems likely we'll move in this direction, at least in the short term.

@TheBlueMatt TheBlueMatt modified the milestones: 0.0.13, 0.0.14 Feb 26, 2021
The channel restoration code in channel monitor updating and peer
reconnection both do incredibly similar things, and there is little
reason to have them be separate. Sadly, because they require holding
a lock with a reference to elements in the lock, it's not practical
to make them utility functions, so instead we introduce a two-step
macro here which will eventually be used for both.

Because we still support pre-NLL Rust, the macro has to be in two
parts - one which runs with the channel_state lock held, and one
which does not.
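
A rough sketch of the two-part shape (the macro bodies and names here are purely illustrative, not the PR's actual macros):

macro_rules! handle_restoration_locked {
    ($channel_state: expr) => {{
        // Step one: runs while the channel_state lock is held and returns only
        // owned data, so nothing borrowing from the locked state escapes the
        // lock scope.
        $channel_state.len()
    }};
}

macro_rules! post_handle_restoration {
    ($locked_res: expr) => {{
        // Step two: runs after the lock guard has been dropped, so it can call
        // into code that may itself want the channel_state lock.
        let pending = $locked_res;
        println!("{} entries to act on now that the lock is released", pending);
    }};
}

fn restore(channel_state: &std::sync::Mutex<Vec<u8>>) {
    let locked_res = {
        let guard = channel_state.lock().unwrap();
        handle_restoration_locked!(guard)
    }; // guard dropped here
    post_handle_restoration!(locked_res);
}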
This mostly swaps some Vecs that can only ever contain one element
for Options.
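
For instance (hypothetical names, just to show the shape of the change):

struct FundingBroadcastSafeEvent;

// Before: a Vec that, by construction, never holds more than one element.
struct RestorationUpdatesOld {
    funding_broadcast_safe: Vec<FundingBroadcastSafeEvent>,
}

// After: an Option makes the zero-or-one invariant explicit in the type.
struct RestorationUpdatesNew {
    funding_broadcast_safe: Option<FundingBroadcastSafeEvent>,
}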
If there are no pending channel update messages when monitor updating
is restored (though there may be an RAA to send), and we're
connected to our peer and not awaiting a remote RAA, we need to
free anything in our holding cell.

Without this, chanmon_fail_consistency was able to find a stuck
condition where we sit on an HTLC failure in our holding cell and
don't ever handle it (at least until we have other actions to take
which empty the holding cell).

Still, this approach sucks - it introduces reentrancy in a
particularly dangerous form:
 a) we re-enter user code around monitor updates while being called
    from user code around monitor updates, making deadlocks very
    likely (in fact, our current tests have a bug here!),
 b) the re-entrancy only occurs in a very rare case, making it
    likely users will not hit it in testing, only deadlocking in
    production.
I'm not entirely sure what the alternative is, however - we could
move to a world where we poll for holding cell events that can be
freed on our 1-minute-timer, but we still have a super rare
reentrancy case, just in timer_chan_freshness_every_min() instead.
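
A rough sketch of the freeing condition from the first paragraph above (the parameter names are illustrative, not the actual Channel fields):

struct CommitmentUpdate;

fn should_free_holding_cell(
    commitment_update: &Option<CommitmentUpdate>,
    peer_connected: bool,
    awaiting_remote_raa: bool,
) -> bool {
    // No commitment update already queued to send, the peer is connected, and
    // we are not waiting on the counterparty's revoke_and_ack.
    commitment_update.is_none() && peer_connected && !awaiting_remote_raa
}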
This merges the channel-restoration code shared between the
channel-monitor-updating-restored path and the peer-reconnection
path, reducing redundant code.
Previously, if we got a temporary monitor update failure while
there were HTLCs pending forwarding in the holding cell, we'd clear
them and fail them all backwards. This makes sense if temporary
failures are rare, but in an async environment, temporary monitor
update failures may be the normal case. In such a world, this
results in potentially a lot of spurious HTLC forwarding failures
(which is the topic of lightningdevkit#661).
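
A simplified before/after sketch of that behavior change (the holding-cell representation here is a stand-in, not the real Channel type):

struct AddHTLC { htlc_id: u64 }
struct FailureReason;

struct HoldingCell {
    pending_adds: Vec<AddHTLC>,
}

impl HoldingCell {
    // Old behavior on a temporary monitor update failure: drain the holding
    // cell and fail every pending forward back to the previous hop.
    fn on_temporary_monitor_failure_old(&mut self) -> Vec<(AddHTLC, FailureReason)> {
        self.pending_adds.drain(..).map(|add| (add, FailureReason)).collect()
    }

    // New behavior: keep the adds queued; they are freed once the monitor
    // update completes (or after reconnection), avoiding spurious forwarding
    // failures when temporary failures are the common case.
    fn on_temporary_monitor_failure_new(&mut self) -> Vec<(AddHTLC, FailureReason)> {
        Vec::new()
    }
}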
As pointed out by Jeff, using a return struct instead of an
incredibly-long tuple improves readability in several places.
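
For example (the field names here are hypothetical, roughly matching the kinds of values being passed around in this PR):

struct RevokeAndACK;
struct CommitmentUpdate;
struct FundingBroadcastSafeEvent;

// Instead of returning a long positional tuple such as
//   (Option<RevokeAndACK>, Option<CommitmentUpdate>, Option<FundingBroadcastSafeEvent>, [u8; 32])
// a named struct documents what each piece means at every call site.
struct ChannelRestorationUpdates {
    raa: Option<RevokeAndACK>,
    commitment_update: Option<CommitmentUpdate>,
    funding_broadcast_safe: Option<FundingBroadcastSafeEvent>,
    channel_id: [u8; 32],
}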
60d83ef introduced reentrancy when
calling channel_monitor_updated. This commit fixes the
chanmon_consistency fuzzer to no longer deadlock as a result of this
reentrancy.
Our fuzz tests previously only printed the log output of the first
fuzz test case to fail. This commit changes that (with lots of
auto-generated updates) to ensure we print all log outputs.
These should never appear, so consider them a fuzzer fail case.
Because the channel restoration code for peer reconnection and
channel monitor updating has been merged, we now sometimes generate
(somewhat spurious) announcement signatures when restoring channel
monitor updating. This should not result in a fuzzing failure.
If the fuzz target is failing due to a channel force-close, the
immediately-visible error is that we're signing a stale state. This
is because the ChannelMonitorUpdateStep::ChannelForceClosed event
results in a signature in the test clone which was deserialized
using an OnlyReadsKeysInterface. Instead, we need to deserialize
using the full KeysInterface instance.
@TheBlueMatt TheBlueMatt force-pushed the 2020-11-holding-cell-clear-mon-upd branch from 7f7fee5 to 76db5db Compare March 1, 2021 02:04
@TheBlueMatt
Collaborator Author

Added a test which I think demonstrates the specific issue well (while also getting good coverage).


if commitment_update.is_none() && self.channel_state & (ChannelState::AwaitingRemoteRevoke as u32) == 0 {
order = RAACommitmentOrder::RevokeAndACKFirst;

let (update_opt, mut failed_htlcs) = self.free_holding_cell_htlcs(logger).unwrap();
Contributor

Since we now have the whole process_background_events thing, I wonder if it'd ease the reentrancy concerns a bit to shift this call over there?

Collaborator Author

I'm not a fan of using that for this because it's hooked on the one-minute timer, which may take a while... However, I think it makes sense in get_and_clear_pending_msg_events. Going to open a new PR with that.
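
A rough sketch of that follow-up idea (the manager type and helper below are hypothetical, not the actual ChannelManager API):

struct MessageSendEvent;

struct FakeChannelManager {
    pending_msg_events: Vec<MessageSendEvent>,
}

impl FakeChannelManager {
    // Hypothetical helper: try to free each channel's holding cell, pushing any
    // resulting update messages onto pending_msg_events.
    fn check_free_holding_cells(&mut self) {
        // ... per-channel holding-cell processing would go here ...
    }

    // Freeing holding cells here means freed HTLCs go out with the very next
    // message fetch instead of waiting on the one-minute background timer.
    fn get_and_clear_pending_msg_events(&mut self) -> Vec<MessageSendEvent> {
        self.check_free_holding_cells();
        std::mem::take(&mut self.pending_msg_events)
    }
}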
