Clean up and more liberally free holding cell HTLCs (without re-entrancy) #851
Conversation
Codecov Report

|          | main   | #851   | +/-    |
|----------|--------|--------|--------|
| Coverage | 90.48% | 90.56% | +0.08% |
| Files    | 59     | 59     |        |
| Lines    | 29896  | 30015  | +119   |
| Hits     | 27050  | 27184  | +134   |
| Misses   | 2846   | 2831   | -15    |

Continue to review full report at Codecov.
order = RAACommitmentOrder::RevokeAndACKFirst;
}
return_monitor_err!(self, e, channel_state, chan, order, revoke_and_ack.is_some(), commitment_update.is_some());
//TODO: Resend the funding_locked if needed once we get the monitor running again
This todo doesn't need to be migrated to the new code, right?
Hmm, that's a good question! I don't think it matters - this PR switched the ordering of message resending to resend the funding_locked before handling the monitor update, meaning we'll never need to resend it because we always send it (instead of maybe not sending it if the monitor update fails).
The "don't send a funding_locked if we're paused pending a monitor update" thing isn't really a security thing (we just expose a next_per_commitment_point, which doesn't allow our counterparty to... do anything); it's just kind of nice not to tell our counterparty we're ready to accept HTLCs when we're definitely not. That said, Channel::channel_reestablish doesn't really handle it all that carefully anyway, and we may end up re-sending a funding_locked on reconnect while we are still pending a monitor update failure.
Force-pushed from 0993c08 to b14ea14.
Nothing major. This looks pretty reasonable to me!
if self.channel_state >= ChannelState::ChannelFunded as u32 &&
	(self.channel_state & (ChannelState::AwaitingRemoteRevoke as u32 | ChannelState::PeerDisconnected as u32 | ChannelState::MonitorUpdateFailed as u32)) == 0 {
Why not just move these checks into `free_holding_cell(..)`? Also, `revoke_and_ack` currently calls the regular `free_holding_cell_htlcs(..)`, wonder if it should call `maybe_free` instead?
Hmm, we already check all the preconditions at the callsites. I suppose we could move to just calling `maybe_...`, but that seems more brittle, given we already check them.
lightning/src/ln/channelmanager.rs
Outdated
if let Err(e) = self.chain_monitor.update_channel(chan.get_funding_txo().unwrap(), monitor_update) {
	let (res, drop) = handle_monitor_err!(self, e, short_to_id, chan, RAACommitmentOrder::CommitmentFirst, false, true, Vec::new(), Vec::new(), channel_id);
	handle_errors.push((chan.get_counterparty_node_id(), res));
	if drop { return false; }
Hmm, so `drop == true` implies that we don't need to push the `UpdateHTLCs` event below? Maybe could use a comment for why.
I renamed `drop` to `close_channel`, is that clearer?
}

for (counterparty_node_id, err) in handle_errors.drain(..) {
	let _ = handle_error!(self, err, counterparty_node_id);
I don't think it matters, but if we `Permanent`ly failed to `update_channel` above, we may still attempt another `update_channel` here.
Hmm, right, that's awkward, but I think it's also globally true - we always generate a `ChannelForceClosed` update for the monitor, even if the user can't persist it.
// disconnect, we may be handed some HTLCs to fail backwards here.
assert!(htlcs_to_fail.is_empty());
return Ok((resend_funding_locked, required_revoke, Some(commitment_update), Some(monitor_update), self.resend_order.clone(), shutdown_msg));
return Ok((resend_funding_locked, required_revoke, Some(commitment_update), Some(monitor_update), self.resend_order.clone(), htlcs_to_fail, shutdown_msg));
More of a general point: might we want to think about how this logic fits into "cancelling payments" in the future? Because I could see a situation where a user attempts a payment which gets stuck in the holding cell, and they'd prefer to attempt the payment again by a different route and cancel the existing one.
Hmm, possibly, but I'm not very worried. The holding cell is only really intended for things that are just pending some network latency - you should update a channel monitor or receive a revoke_and_ack from your peer within some hundreds of milliseconds, at most - not for things that are going to take a while - i.e. it's not (really) for when your counterparty is offline.
Force-pushed from b14ea14 to af14d44.
Force-pushed from af14d44 to 386cc60.
Addressed Val's comments, full diff is:
Force-pushed from 386cc60 to 0df3461.
Rebased on latest upstream.
Force-pushed from 478044e to 72b730c.
Force-pushed from 72b730c to ffbe950.
Rebased on upstream, with pretty trivial changes.
Looks like this is slipping to 0.0.15. It would be good to land some of these fixes, as they address potential bugs, even if very rare.
Reviewed up to and including e33012f so far.
lightning/src/ln/channelmanager.rs
Outdated
}

let chanmon_update: Option<ChannelMonitorUpdate> = $chanmon_update; // Force type-checking to resolve
if chanmon_update.is_some() {
	// On reconnect (or monitor restoration), we, by definition, only resend a
I think mentioning monitor restoration (`monitor_updating_restored`?) is confusing, as this function never returns a monitor update.
I would remove it and instead, in the `$funding_locked` branch, say something like: "at monitor restoration, we might resend funding_locked to our counterparty, but this event doesn't generate a channel monitor update itself".
Hmm, you're right. I removed the reference to monitor restoration, but I'm not sure your comment communicates enough information to re-analyze the situation later.
Hmmmm "In case of monitor restoration, funding_locked will never be generated as this event isn't altering monitor state"?
lightning/src/ln/channelmanager.rs
Outdated
macro_rules! handle_cs { () => {
	if let Some(monitor_update) = chanmon_update {
		assert!($order == RAACommitmentOrder::RevokeAndACKFirst);
I can't make up my mind about the correctness of this assertion. It implies the path in `Channel::channel_reestablish` L3023, where we return HTLC failures to route backward included in a new commitment transaction? That doesn't necessarily mean the peer sent us a commitment_signed for which we should send an RAA.
Yeah, I can't convince myself it's safe anymore, so I walked this bit back.
Force-pushed from 0b365ec to 1005f07.
Rebased on latest upstream; relevant changes responding to comments are in fixup commits.
Left a few more comments; the top commits sound good to me, but I want to do a second pass once the comments are fixed.
lightning/src/ln/channelmanager.rs
Outdated
match chan.maybe_free_holding_cell_htlcs(&self.logger) {
	Ok((None, ref htlcs)) if htlcs.is_empty() => true,
	Ok((commitment_opt, failed_htlcs)) => {
		forwarding_failed_htlcs.push((failed_htlcs, *channel_id));
I think it should be `backwarding_failed_htlcs` here? Per the `fail_holding_cell_htlcs` comment, those HTLCs are designated as "The HTLCs need to be failed backwards or, if they were one of our outgoing HTLCs".
Yeah, likely. We should give the whole HTLC routing section a brush-up at some point, with consistent naming and better comments.
Yeah, that's confusing - the name meant "HTLCs which failed during forwarding", but that wasn't clear.
Force-pushed from 9b1aa05 to 7eaca87.
I believe I've addressed all the comments.
The channel restoration code in channel monitor updating and peer reconnection both do incredibly similar things, and there is little reason to have them be separate. Sadly, because they require holding a lock with a reference to elements in the lock, it's not practical to make them utility functions, so instead we introduce a two-step macro here which will eventually be used for both. Because we still support pre-NLL Rust, the macro has to be in two parts - one which runs with the channel_state lock, and one which does not.
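A minimal sketch of that two-step shape may help. This is hypothetical illustration code, not the actual rust-lightning macros: the names `prepare_restoration!` / `finish_restoration!` and the toy `ChannelState` are stand-ins. The first macro runs while the lock is held and only extracts data; the second runs after the guard is dropped.

```rust
use std::sync::Mutex;

struct ChannelState {
	pending_msgs: Vec<String>,
}

// Step 1: runs while the channel_state lock is held and only extracts the data
// we need; the MutexGuard is dropped when this block ends.
macro_rules! prepare_restoration {
	($state_lock:expr) => {{
		let mut state = $state_lock.lock().unwrap();
		state.pending_msgs.drain(..).collect::<Vec<_>>()
	}};
}

// Step 2: runs after the lock is released, so it may safely call into code
// that could take the same lock (or into user-provided callbacks).
macro_rules! finish_restoration {
	($msgs:expr) => {{
		for msg in $msgs {
			println!("resending: {}", msg);
		}
	}};
}

fn main() {
	let channel_state = Mutex::new(ChannelState {
		pending_msgs: vec!["funding_locked".to_string()],
	});
	let msgs = prepare_restoration!(channel_state); // lock taken and dropped inside
	finish_restoration!(msgs); // no lock held here
}
```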
This mostly swaps some Vecs that can only ever contain one element for Options.
This makes chanmon_consistency fail on IgnoreError error events and on messages left over to be sent to a just-disconnected peer, which should have been drained. These should never appear, so we consider them fuzzer failure cases.
Because of the merge between the peer reconnection and channel monitor updating channel restoration code, we now sometimes generate (somewhat spurious) announcement signatures when monitor updating is restored. This should not result in a fuzzing failure.
Force-pushed from 365cee1 to 7fde891.
Rebased with no changes aside from two new fixup commits to respond to Jeff's comments.
Looks good to me, but I'd be more comfortable if @ariard gave it a look before merging.
This merges the code for restoring channel functionality between channel monitor updating restored and peer reconnection, reducing redundant code.
Both break_chan_entry and try_chan_entry do almost identical work, only differing in whether they `break` or `return` in response to an error. Because we will now also need an option to do neither, we break out the common code into a shared `convert_chan_err` macro.
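As a rough illustration of that factoring (simplified types and hypothetical macro names, not the actual `ChannelManager` code), the shared conversion lives in one macro while thin wrappers decide whether to `break` or `return`:

```rust
// Shared conversion logic, used by both wrappers below.
macro_rules! convert_err {
	($e:expr) => {
		format!("channel error: {}", $e)
	};
}

// Breaks out of the enclosing `loop` with an Err, so cleanup code after the
// loop still runs.
macro_rules! break_on_err {
	($res:expr) => {
		match $res {
			Ok(v) => v,
			Err(e) => break Err(convert_err!(e)),
		}
	};
}

// Returns early from the enclosing function with an Err.
macro_rules! return_on_err {
	($res:expr) => {
		match $res {
			Ok(v) => v,
			Err(e) => return Err(convert_err!(e)),
		}
	};
}

fn sign_commitment(ok: bool) -> Result<u32, &'static str> {
	if ok { Ok(42) } else { Err("bad signature") }
}

fn run(ok: bool) -> Result<u32, String> {
	// Use the breaking variant where we want to fall through to shared
	// cleanup after the loop rather than returning immediately.
	let sig = loop {
		let s = break_on_err!(sign_commitment(ok));
		break Ok(s);
	}?;
	// Use the returning variant where an early return is fine.
	let again = return_on_err!(sign_commitment(ok));
	Ok(sig + again)
}

fn main() {
	println!("{:?}", run(true));  // Ok(84)
	println!("{:?}", run(false)); // Err("channel error: bad signature")
}
```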
If there are no pending channel update messages when monitor updating is restored (though there may be an RAA to send), and we're connected to our peer and not awaiting a remote RAA, we need to free anything in our holding cell. However, we don't want to immediately free the holding cell during channel_monitor_updated, as it presents a somewhat bug-prone case of reentrancy: a) it would re-enter user code around a monitor update while being called from user code notifying us of the same monitor being updated, making deadlocks very likely (in fact, our fuzzers would have a bug here!), and b) the re-entrancy only occurs in a very rare case, making it likely users will not hit it in testing, only deadlocking in production.

Thus, we add a holding-cell-free pass over each channel in get_and_clear_pending_msg_events. This fits nicely with the anticipated bug - users almost certainly need to process new network messages immediately after monitor updating has been restored in order to send the messages which were not sent while monitor updating was paused. Without this, chanmon_fail_consistency was able to find a stuck condition where we sit on an HTLC failure in our holding cell and never handle it (at least until we have other actions to take which empty the holding cell).
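To make the resulting flow concrete, here is a heavily simplified sketch (hypothetical `Manager`/`Channel` types, not the real `ChannelManager` internals) of a holding-cell-free pass running at the top of the message-fetch call, so nothing re-enters user code from within a monitor-update notification:

```rust
use std::sync::Mutex;

#[derive(Debug)]
struct MsgEvent(String);

struct Channel {
	holding_cell: Vec<String>, // HTLC updates queued while we couldn't send
	can_send: bool,            // e.g. connected, not awaiting an RAA or a monitor update
}

struct Manager {
	channels: Mutex<Vec<Channel>>,
}

impl Manager {
	fn get_and_clear_pending_msg_events(&self) -> Vec<MsgEvent> {
		let mut events = Vec::new();
		let mut channels = self.channels.lock().unwrap();
		for chan in channels.iter_mut() {
			// Holding-cell-free pass: generate the pending updates now, while we're
			// already inside a user-initiated message fetch, instead of re-entrantly
			// from a monitor-update notification.
			if chan.can_send && !chan.holding_cell.is_empty() {
				for htlc in chan.holding_cell.drain(..) {
					events.push(MsgEvent(format!("UpdateHTLCs: {}", htlc)));
				}
			}
		}
		events
	}
}

fn main() {
	let mgr = Manager {
		channels: Mutex::new(vec![Channel {
			holding_cell: vec!["htlc-fail #1".to_string()],
			can_send: true,
		}]),
	};
	// Users call this right after a monitor update completes anyway, so the
	// queued HTLC updates get flushed without any re-entrancy.
	for ev in mgr.get_and_clear_pending_msg_events() {
		println!("{:?}", ev);
	}
}
```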
Because we may now generate a monitor update during get_and_clear_pending_msg_events calls, we need to ensure we re-serialize the relevant ChannelManager before attempting to reload it, if such a monitor update occurred.
Previously, if we got disconnected from a peer while there were HTLCs pending forwarding in the holding cell, we'd clear them and fail them all backwards. This is largely fine, but since we now have support for handling such HTLCs on reconnect, we might as well not do so, instead relying on our timeout logic to fail them backwards if it takes too long to forward them.
If the fuzz target is failing due to a channel force-close, the immediately-visible error is that we're signing a stale state. This is because the ChannelMonitorUpdateStep::ChannelForceClosed event results in a signature in the test clone which was deserialized using an OnlyReadsKeysInterface. Instead, we need to deserialize using the full KeysInterface instance.
Force-pushed from 7fde891 to 34fcd99.
Squashed and renamed one variable as @jkczyz suggested:
Gonna go ahead and take this since it's been a long time. Post-merge review from @ariard would also be welcome, but I'd rather not wait on it if he's busy.
We use `Channel::is_live()` to gate inclusion of a channel in `ChannelManager::list_usable_channels()` and when sending an HTLC to select whether a channel is available for forwarding through/sending to. In both of these cases, we almost certainly want `Channel::is_live()` to include channels which are simply pending a monitor update, as some clients may update monitors asynchronously, and thus rejecting HTLCs based on a still-pending monitor update risks a race condition. After lightningdevkit#851, we always ensure any holding cells are free'd when sending P2P messages, making this much more trivially correct - instead of having to ensure that we always have a matching holding cell free any time we add something to the holding cell, we can simply rely on the fact that it always happens. Fixes lightningdevkit#661.
We use `Channel::is_live()` to gate inclusion of a channel in `ChannelManager::list_usable_channels()` and when sending an HTLC to select whether a channel is available for forwarding through/sending to. In both of these cases, we should consider a channel `is_live()` even when it is pending a monitor update. Some clients may update monitors asynchronously, thus we may simply be waiting a short duration for a monitor update to complete, and shouldn't fail all forwarding HTLCs during that time. After lightningdevkit#851, we always ensure any holding cells are free'd when sending P2P messages, making this change much more trivially correct - instead of having to ensure that we always free the holding cell when a channel becomes live again after adding something to the holding cell, we can simply rely on the fact that it always happens. Fixes lightningdevkit#661.
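A tiny sketch of the `is_live()` change those commit messages describe, using hypothetical bitflag constants and a toy `Channel` rather than the real channel state machine - the point being that a pending monitor update no longer excludes the channel, since anything queued meanwhile just sits in the holding cell that #851 guarantees will be freed:

```rust
struct ChannelState(u32);

const PEER_DISCONNECTED: u32 = 1 << 0;
const MONITOR_UPDATE_FAILED: u32 = 1 << 1;

struct Channel {
	state: ChannelState,
	funded: bool,
}

impl Channel {
	// Before: a pending monitor update made the channel "not live", so it dropped
	// out of list_usable_channels() and HTLC sends through it were rejected.
	fn is_live_old(&self) -> bool {
		self.funded && (self.state.0 & (PEER_DISCONNECTED | MONITOR_UPDATE_FAILED)) == 0
	}

	// After: only a disconnected peer makes the channel not-live; while a monitor
	// update is pending, new HTLCs simply wait in the holding cell.
	fn is_live_new(&self) -> bool {
		self.funded && (self.state.0 & PEER_DISCONNECTED) == 0
	}
}

fn main() {
	let chan = Channel { state: ChannelState(MONITOR_UPDATE_FAILED), funded: true };
	println!("old: {}, new: {}", chan.is_live_old(), chan.is_live_new()); // old: false, new: true
}
```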
This is #756, reworked a decent chunk. In #756 the reentrancy story around `channel_monitor_updated` scared me significantly, but for some reason I apparently never bothered to really think deeply about the options. Here we move the holding cell free to `get_and_clear_pending_msg_events`, which is a super obvious location for it - it should be called right after `channel_monitor_updated` in most cases, but it shouldn't have the same reentrancy issues because it already can generate a monitor update. It requires some macro rework, but it's not too bad.