Clean up and more liberally free holding cell HTLCs #756
Conversation
@valentinewallace are you happier with this?
Force-pushed from 4a8bec0 to 2eb9070
In terms of:
If it would be fair to summarize this approach as "don't fail back holding cell add HTLCs on disconnect" (aka mainly just 54a601c), then I'm conceptACK. Still have to review the other parts of the PR in detail, but that lgtm approach-wise! Edit: fwiw,
Correct.
Alright, I'll try to clean this up and get the macros to build on pre-NLL Rust.
Right, I presume most others don't. Of course, this is just on holding-cell stuff, so it's a really strange edge case anyway.
Force-pushed from f31497a to 5552b67
This has been updated to build on pre-NLL Rust and rebased.
Codecov Report
@@            Coverage Diff             @@
##             main     #756      +/-   ##
==========================================
+ Coverage   91.00%   92.50%   +1.49%
==========================================
  Files          48       45       -3
  Lines       25483    30571    +5088
==========================================
+ Hits        23192    28279    +5087
- Misses       2291     2292       +1
Continue to review full report at Codecov.
Force-pushed from 5552b67 to 393a420
Rebased without changes (just include conflicts).
Force-pushed from 393a420 to ed1194c
lightning/src/ln/channelmanager.rs (Outdated)
/// In some cases, this may generate a monitor update, resulting in a call to the
/// `chain::Watch`'s `update_channel` method for the same channel monitor which is being
/// notified of a successful update here. Because of this, please be very careful with
/// reentrancy bugs! It is incredibly easy to write an implementation of `update_channel` which
/// will take a lock which is also held when calling this method.
///
Is this situation any different from other pub methods that call update_channel?
Only in the sense that we're updating precisely the ChannelMonitor (via update_channel) that we're being notified has been successfully updated. I'm not quite sure how to better capture this in the docs here, but it's rather easy to hit this case (as we do in tests), though only rarely, since you only ever get the callbacks rarely.
struct TestChainMonitor {
	monitor_state: Mutex<HashMap<OutPoint, StateInformation>>,
}
impl TestChainMonitor {
	fn finish_async_update(&self) {
		// The lock is taken here and held across the call below...
		let monitor_state = self.monitor_state.lock().unwrap();
		...
		// ...which can re-enter update_channel and try to take the same lock: deadlock.
		channel_manager.channel_monitor_updated(...);
	}
}
impl chain::Watch for TestChainMonitor {
	fn update_channel(&self, funding_txo: OutPoint, update: channelmonitor::ChannelMonitorUpdate) -> Result<(), channelmonitor::ChannelMonitorUpdateErr> {
		// Takes the same lock that finish_async_update may already hold.
		let this_monitor_state = self.monitor_state.lock().unwrap();
		...
		start_async_update(update);
		Err(channelmonitor::ChannelMonitorUpdateErr::TemporaryFailure)
	}
	...
}
-	if let Some((id, _)) = monitor_b.latest_monitors.lock().unwrap().get(&chan_1_funding) {
-		nodes[1].channel_monitor_updated(&chan_1_funding, *id);
+	{
+		let mon_id = monitor_b.latest_monitors.lock().unwrap().get(&chan_1_funding).map(|(id, _)| *id);
+		if let Some(id) = mon_id {
+			nodes[1].channel_monitor_updated(&chan_1_funding, id);
+		}
+	}
Was the lock held outside of this scope and thus causing a deadlock?
Yes, latest_monitors is locked again inside update_channel, which can be called eventually by the channel_monitor_updated call.
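To make the deadlock mechanics concrete, here is a minimal, self-contained sketch (hypothetical types stand in for the fuzzer's OutPoint-keyed monitor map) of the pattern the fix above uses: copy the id out of the lock guard so the guard is dropped before the call that may re-enter and take the same lock.

use std::collections::HashMap;
use std::sync::Mutex;

// Hypothetical stand-in for the fuzzer's monitor map.
struct Monitors {
	latest: Mutex<HashMap<u64, (u64, Vec<u8>)>>,
}

impl Monitors {
	// Stand-in for channel_monitor_updated -> update_channel, which takes the lock again.
	fn reentrant_update(&self, _key: u64, _id: u64) {
		let _guard = self.latest.lock().unwrap();
	}

	fn notify_updated(&self, key: u64) {
		// Copy the id out; the temporary guard is dropped at the end of this
		// statement, before the re-entrant call below, so there is no deadlock.
		let mon_id = self.latest.lock().unwrap().get(&key).map(|(id, _)| *id);
		if let Some(id) = mon_id {
			self.reentrant_update(key, id);
		}
		// By contrast, `if let Some((id, _)) = self.latest.lock().unwrap().get(&key)`
		// keeps the guard alive for the whole `if let` body and deadlocks here.
	}
}

fn main() {
	let monitors = Monitors { latest: Mutex::new(HashMap::new()) };
	monitors.latest.lock().unwrap().insert(0, (1, Vec::new()));
	monitors.notify_updated(0);
}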
I'm conceptACK on this approach, and in favor of opening an issue or so to look into a more events-based approach that's less risky but requires more refactoring (as discussed offline).
It'd be nice to add a test for the fuzz failure if it's not too difficult.
Otherwise, this is looking pretty good to me. Most of the changes are super mechanical.
($self: ident, $locked_res: expr, $pending_failures: expr, $forwarding_failures: expr) => { {
	let (htlc_forwards, funding_broadcast_safe, res, channel_id, counterparty_node_id) = $locked_res;

	let _ = handle_error!($self, res, counterparty_node_id);
This line also adds reentrancy risk, right? (in addition to the reentrancy risk of freeing the holding cell HTLCs)
I don't believe so, at least not from the PoV of calling user code - handle_error shouldn't ever call user code directly; it only pushes events onto the message/regular event queues.
Oh, it's hard to untangle the macros, but handle_error does call finish_force_close, which looks like it can call update_channel.
Ah! I see your point. Tracing it back a bit, I think the only way we get an Err in res here is if we break out of the handle_chan_restoration_locked loop with a handle_monitor_err result, which happens precisely and only when the monitor update that we generated failed.
Force-pushed from ed1194c to 7f7fee5
I'll finish working on some tests today, given it seems likely we'll move in this direction, at least in the short term.
The channel restoration paths for channel monitor updating and peer reconnection do incredibly similar things, and there is little reason to keep them separate. Sadly, because they require holding a lock while referencing elements inside that lock, it's not practical to make them utility functions, so instead we introduce a two-step macro here which will eventually be used for both. Because we still support pre-NLL Rust, the macro has to be in two parts - one which runs with the channel_state lock held, and one which does not.
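As a rough, self-contained sketch of that two-step shape (names, fields, and macros here are illustrative, not the actual rust-lightning ones): the first macro runs under the lock and returns only owned values, and the second consumes them once the guard has been dropped.

use std::sync::Mutex;

struct ChannelState { pending_msgs: Vec<String>, failed_htlcs: Vec<u64> }
struct Manager { channel_state: Mutex<ChannelState>, outbox: Mutex<Vec<String>> }

// Step one: runs while the channel_state lock is held and only *collects* work,
// returning owned values so nothing borrows from the lock guard.
macro_rules! handle_restoration_locked {
	($state: expr) => { {
		let msgs = std::mem::replace(&mut $state.pending_msgs, Vec::new());
		let htlcs = std::mem::replace(&mut $state.failed_htlcs, Vec::new());
		(msgs, htlcs)
	} }
}

// Step two: runs after the guard has been dropped, so follow-up work that takes
// other locks (or re-enters) cannot deadlock on channel_state.
macro_rules! post_handle_restoration {
	($self: ident, $locked_res: expr) => { {
		let (msgs, htlcs) = $locked_res;
		$self.outbox.lock().unwrap().extend(msgs);
		for _htlc in htlcs { /* fail the HTLC backwards here */ }
	} }
}

impl Manager {
	fn restore(&self) {
		let locked_res = {
			let mut guard = self.channel_state.lock().unwrap();
			handle_restoration_locked!(guard)
		}; // channel_state guard dropped here
		post_handle_restoration!(self, locked_res);
	}
}

fn main() {
	let mgr = Manager {
		channel_state: Mutex::new(ChannelState { pending_msgs: vec!["msg".to_owned()], failed_htlcs: vec![42] }),
		outbox: Mutex::new(Vec::new()),
	};
	mgr.restore();
	assert_eq!(mgr.outbox.lock().unwrap().len(), 1);
}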
This mostly swaps some Vecs that can only ever contain one element for Options.
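For illustration only (the type and fields below are made up, not the ones the commit touches), the shape of that change is:

struct Update(u32);

fn apply(_upd: Update) {}

fn main() {
	// Before: a Vec whose "at most one element" invariant exists only by convention.
	let mut pending_vec: Vec<Update> = vec![Update(1)];
	if let Some(upd) = pending_vec.pop() { apply(upd); }

	// After: Option encodes the zero-or-one invariant in the type,
	// and take() replaces the push/pop dance.
	let mut pending_opt: Option<Update> = Some(Update(2));
	if let Some(upd) = pending_opt.take() { apply(upd); }
}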
If there are no pending channel update messages when monitor updating is restored (though there may be an RAA to send), and we're connected to our peer and not awaiting a remote RAA, we need to free anything in our holding cell. Without this, chanmon_fail_consistency was able to find a stuck condition where we sit on an HTLC failure in our holding cell and never handle it (at least until we have other actions to take which empty the holding cell). Still, this approach sucks - it introduces reentrancy in a particularly dangerous form: a) we re-enter user code around monitor updates while being called from user code around monitor updates, making deadlocks very likely (in fact, our current tests have a bug here!); b) the re-entrancy only occurs in a very rare case, making it likely users will not hit it in testing, only deadlocking in production. I'm not entirely sure what the alternative is, however - we could move to a world where we poll for holding cell events that can be freed on our 1-minute timer, but we'd still have a super rare reentrancy case, just in timer_chan_freshness_every_min() instead.
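A simplified sketch of that freeing condition (field and flag names here are illustrative; the real check lives on Channel's monitor-updating-restored path and appears in the review snippet further down):

const AWAITING_REMOTE_REVOKE: u32 = 1 << 0;

struct Channel {
	channel_state: u32,
	peer_connected: bool,
	holding_cell_htlcs: Vec<u64>, // pending holding-cell actions, reduced to ids
}

impl Channel {
	/// Called when monitor updating is restored: if we have no commitment update to
	/// send, the peer is connected, and we are not waiting on their RAA, drain the
	/// holding cell now rather than letting it sit until some unrelated action flushes it.
	fn maybe_free_holding_cell(&mut self, commitment_update_pending: bool) -> Vec<u64> {
		if !commitment_update_pending
			&& self.peer_connected
			&& self.channel_state & AWAITING_REMOTE_REVOKE == 0
		{
			std::mem::replace(&mut self.holding_cell_htlcs, Vec::new())
		} else {
			Vec::new()
		}
	}
}

fn main() {
	let mut chan = Channel { channel_state: 0, peer_connected: true, holding_cell_htlcs: vec![1, 2] };
	assert_eq!(chan.maybe_free_holding_cell(false).len(), 2);
}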
This merges the code for restoring channel functionality between the channel-monitor-updating-restored and peer-reconnection paths, reducing redundant code.
Previously, if we got a temporary monitor update failure while there were HTLCs pending forwarding in the holding cell, we'd clear them and fail them all backwards. This makes sense if temporary failures are rare, but in an async environment temporary monitor update failures may be the normal case. In such a world, this results in potentially a lot of spurious HTLC forwarding failures (which is the topic of lightningdevkit#661).
As pointed out by Jeff, using a return struct instead of an incredibly-long tuple improves readability in several places.
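For example (hypothetical names and fields, not the PR's actual return struct), the readability difference looks like this:

// Before: a long positional tuple whose meaning callers must memorize.
fn restore_tuple() -> (Option<String>, Option<String>, bool, Vec<u64>) {
	(Some("revoke_and_ack".to_owned()), None, true, Vec::new())
}

// After: a named struct documents each field at every call site.
struct RestorationOutput {
	raa: Option<String>,
	commitment_update: Option<String>,
	raa_first: bool,
	failed_htlcs: Vec<u64>,
}

fn restore_struct() -> RestorationOutput {
	RestorationOutput {
		raa: Some("revoke_and_ack".to_owned()),
		commitment_update: None,
		raa_first: true,
		failed_htlcs: Vec::new(),
	}
}

fn main() {
	let out = restore_struct();
	// `out.raa_first` reads better at the call site than `tuple.2`.
	if out.raa_first {
		if let Some(msg) = out.raa { println!("send {} first", msg); }
	}
	let _ = (out.commitment_update, out.failed_htlcs, restore_tuple());
}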
60d83ef introduced reentrancy when calling channel_monitor_updated. This commit fixes the chanmon_consistency fuzzer to no longer deadlock as a result of this reentrancy.
Our fuzz tests previously only printed the log output of the first fuzz test case to fail. This commit changes that (with lots of auto-generated updates) to ensure we print all log outputs.
These should never appear, so consider them a fuzzer fail case.
Because of the merge of the channel restoration code between peer reconnection and channel monitor updating, we now sometimes generate (somewhat spurious) announcement signatures when restoring channel monitor updating. This should not result in a fuzzing failure.
If the fuzz target is failing due to a channel force-close, the immediately-visible error is that we're signing a stale state. This is because the ChannelMonitorUpdateStep::ChannelForceClosed event results in a signature in the test clone which was deserialized using an OnlyReadsKeysInterface. Instead, we need to deserialize using the full KeysInterface instance.
Force-pushed from 7f7fee5 to 76db5db
Added a test which I think demonstrates the specific issue well (while also getting good coverage).
if commitment_update.is_none() && self.channel_state & (ChannelState::AwaitingRemoteRevoke as u32) == 0 {
	order = RAACommitmentOrder::RevokeAndACKFirst;

	let (update_opt, mut failed_htlcs) = self.free_holding_cell_htlcs(logger).unwrap();
Since we now have the whole process_background_events thing, I wonder if it'd ease the reentrancy concerns a bit to shift this call over there?
I'm not a fan of using that for this because it's hooked on the one-minute timer, which may take a while... However, I think it makes sense in get_and_clear_pending_msg_events. Going to open a new PR with that.
This is effectively a superset of #755 (with the same commit) and an alternate approach to #754, instead holding onto the holding cell HTLCs and supporting handling them correctly in response to a channel_reestablish message. This was more of a slap-together thing and still needs testing, but I'm seeking concept ACKs on the restructure and approach (also as a step towards addressing #661).