Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

connectd: fix accidental handling of old reconnections. #5256

Merged

Conversation

rustyrussell
Copy link
Contributor

@rustyrussell rustyrussell commented May 13, 2022

We had multiple reports of channels being unilaterally closed because
it seemed like the peer was sending old revocation numbers.

Turns out, it was actually old reestablish messages! When we have a
reconnection, we would put the new connection aside, and tell lightningd
to close the current connection: when it did, we would restart
processing of the initial reconnection.

However, we could end up with multiple "reconnecting" connections,
while waiting for an existing connection to close. Though the
connections were long gone, there could still be messages queued
(particularly the channel_reestablish message, which comes early on).

Eventually, a normal reconnection would cause us to process one of
these reconnecting connections, and channeld would see the (perhaps
very old!) messages, and get confused.

(I have a test which triggers this, but it also hangs the connect
command, due to other issues we will fix in the next release...)

Fixes: #5240
Fixes: #5235

@niftynei
Copy link
Contributor

niftynei commented May 13, 2022

ACK 3fab422

Reviewed w/ @endothermicdev

Copy link
Contributor

@vincenzopalazzo vincenzopalazzo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK 3fab422

This should fix also #5235 just to free all the issue :)

@ZmnSCPxj
Copy link
Contributor

ZmnSCPxj commented May 13, 2022

Looks OK to merge, just a style nit: the checks in our repo look for Changelog-whatever tags in commit messages, but this specific PR puts the changelog entries directly into the CHANGELOG.md file. Or are we planning a quick bugfix release that just contains this bugfix?

pr = peer_reconnected_htable_get(&daemon->reconnected, id);
if (pr) {
status_broken("Reconnected AGAIN");
peer_reconnected_htable_del(&daemon->reconnected, pr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the previous reconnection already had the destructor installed, would it not delete this twice? Or delete from htable is idempotent?

We had multiple reports of channels being unilaterally closed because
it seemed like the peer was sending old revocation numbers.

Turns out, it was actually old reestablish messages!  When we have a
reconnection, we would put the new connection aside, and tell lightningd
to close the current connection: when it did, we would restart
processing of the initial reconnection.

However, we could end up with *multiple* "reconnecting" connections,
while waiting for an existing connection to close.  Though the
connections were long gone, there could still be messages queued
(particularly the channel_reestablish message, which comes early on).

Eventually, a normal reconnection would cause us to process one of
these reconnecting connections, and channeld would see the (perhaps
very old!) messages, and get confused.

(I have a test which triggers this, but it also hangs the connect
 command, due to other issues we will fix in the next release...)

Fixes: ElementsProject#5240
Signed-off-by: Rusty Russell <[email protected]>
I have a test which reproduces this, too, and it's been seen in the
wild.  It seems we can add a subd as we're closing, which causes
this assert to trigger.

Fixes: ElementsProject#5254
Signed-off-by: Rusty Russell <[email protected]>
@rustyrussell rustyrussell force-pushed the guilt/connectd-cleanups branch from 3fab422 to b817e0a Compare May 14, 2022 04:58
@rustyrussell
Copy link
Contributor Author

Removed accidentally left BROKEN log, and added another assert removal to fix another report.

Copy link
Contributor

@vincenzopalazzo vincenzopalazzo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK b817e0a

frustating I know, but I think this will fix also #5254

@cdecker
Copy link
Member

cdecker commented May 15, 2022

ACK b817e0a

@svewa
Copy link

svewa commented May 15, 2022

bad reestablish revocation_number: 0 vs 3 in the changelog is maybe a bad example, as a revocation number of 0 usually indicates use of SCB?

### Fixed

- connectd: make sure we don't keep stale reconnections around. ([#5256])
- connectd: fix assert which we could trigger. ([#5256])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- connectd: fix assert which we could trigger. ([#5256])
- connectd: fix assert which we could trigger. ([#5254])

This should be #5254 ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, these refer to the PR, not the bug itself...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah damnt! Sorry just assuming wrong stuff

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Channels shutting down with "bad reestablish revocation_number" bad reestablish revocation_number: x vs y
6 participants