Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bolt2: add go_on to quiescence spec #14

Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
40 changes: 38 additions & 2 deletions 02-peer-protocol.md
Original file line number Diff line number Diff line change
Expand Up @@ -507,7 +507,7 @@ channel by indicating that "SomeThing Fundamental is Underway".
* [`channel_id`:`channel_id`]
* [`u8`:`initiator`]

### Requirements
#### Requirements

The sender of `stfu`:
- MUST NOT send `stfu` unless `option_quiesce` is negotiated.
Expand All @@ -532,7 +532,7 @@ The receiver of `stfu`:
Upon disconnection:
- the channel is no longer considered quiescent.

### Rationale
#### Rationale

The normal use would be to cease sending updates, then wait for all
the current updates to be acknowledged by both peers, then start
Expand All @@ -545,6 +545,42 @@ considered to be the channel funder (the sender of `open_channel`).
The quiescence effect is exactly the same as if one had replied to the
other.

### `go_on`

1. type: TODO (`go_on`)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope @cdecker can find us a better name for that one ;)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stands for "global operation is over now" 😁

But yeah I'm very open to improvement here 😂

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

carry_on 🧐

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Channel Active Renegotiation Realized. Yes it's Over Now

2. data:
* [`channel_id`:`channel_id`]
* **NOTE FOR REVIEWERS** do we need to include initiator here? I think not.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so either, since the only valid flow is:

  1. the quiescence initiator sends go_on
  2. the other node responds with go_on

Any other flow would be an error, that should trigger a disconnection.


#### Requirements

The sender of `go_on`:
- MUST NOT send `go_on` for a channel that is NOT quiescent.
- MUST NOT send `go_on` if it is not the quiescence "initiator".
- MUST NOT send `go_on` twice without an interceding full quiescence
negotiation.
- MUST send `go_on` once any dependent protocol has completed.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should be:

Suggested change
- MUST send `go_on` once any dependent protocol has completed.
- MUST send `go_on` only after a dependent protocol has completed.

Say if dependent protocol A completes, we might not want to end the quiescent state then because maybe we want to begin another dependent protocol session that requires a quiescent state.

- MAY send `update_*` messages afterwards.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't you wait to receive your peer's go_on (which acts as an ACK) before resuming updates? This would more clearly mark a collaborative exit of the quiescence state. If we don't cleanly exit that way, we should disconnect before resuming updates.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can tighten the requirement here, but I'm trying to figure out what problem it solves. This only really impacts the initiator so we can examine that.

If I'm the initiator, and I send go_on and do not wait before sending new update_* messages, the responder has two options. Due to the protocol assumption of in-order message processing, any subsequent updates must be processed after the go_on. As the responder, either I'm ready to process them, in which case I send my go_on before anything else as an ack and then proceed to handle the rest of the buffer, or I am not ready in which case I'll just hang up.

I think it's tempting to think that these messages are "commands" for the peer, but since the whole protocol is asynchronous we can't really think of things that way. I view stfu and shutdown as polite requests for your peer do do something but what they really are is declarations for what to expect from your future behavior. Under this mental model we don't really need to wait for the responder.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's simply cleaner from a state machine design point of view (which is why we're adding a go_on message in the first place). If you don't wait for your your peer to respond with go_on before sending updates, you're in a slightly weird state after sending the first go_on: you're not back to your normal, out-of-quiescence state because you need to handle receiving your peer's go_on message. So you'll need to import all of the message semantics from your "normal" state into that "waiting for peer's go_on" state plus a handler for receiving go_on, at which point you go back to your "normal" state and quiescence is completely exited.

Eclair uses a strict finite state machine design for most of its components, which shows this design weirdness. But if you feel strongly about it I don't really mind, because your current requirements also allow us to wait for the responding go_on before sending updates, which is what we'll implement.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can as also follow what the parent PR did and write, "must now consider this channel < whatever the opposite of quiescing is >"
https://github.com/lightning/bolts/blob/a68d7b07cea0fb243f25e955cca6134cae5872cf/02-peer-protocol.md#requirements-5

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you'll need to import all of the message semantics from your "normal" state into that "waiting for peer's go_on" state plus a handler for receiving go_on, at which point you go back to your "normal" state and quiescence is completely exited.

Yeah I can see what you're saying where if you go for the strict FSM design and the only valid exit of quiescence is send/receive. I have preferred not to think about it as a strict FSM due to the asynchronous nature of the rest of the protocol.

While we have a total ordering of all messages on each half of the connection, we don't have a total ordering of all messages exchanged since there are many valid zips. As a result, as the responder, I have no way to guarantee you've received my go_on after I send it. I suppose if I don't ever send it I can guarantee you haven't received it. So maybe you're saying that your ACK of my go_on response is the resumption of update_* messages.

I actually don't feel strongly about it! And int the process of responding to this comment I'm finding myself convinced by this line of reasoning.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a result, as the responder, I have no way to guarantee you've received my go_on after I send it.

Sure, but I have the guarantee that if you receive my future update_* messages, you must have received my go_on first so your FSM cleanly got back to the normal, non-quiescent channel state. I think that should be enough?

So maybe you're saying that your ACK of my go_on response is the resumption of update_* messages.

Yes exactly, or the fact that if you resume sending update_* messages I won't disconnect on you.

I actually don't feel strongly about it! And int the process of responding to this comment I'm finding myself convinced by this line of reasoning.

I don't feel too strongly about this either, as we still have the option of waiting for our peer's go_on before sending our update_* messages with the way you currently worded the spec. So it's your choice!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll update as I do believe your suggestion makes it more likely that people will implement with predictable behavior.


The receiver of `go_on`:
- if there is any incomplete downstream protocol state:
- SHOULD treat downstream protocol state as it would if the peer
disconnected.
Comment on lines +567 to +568
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would only make it worse with regards to the inner protocol. This indicates a buggy peer, so it would be safer to simply disconnect (which also guarantees the inner protocol will be aborted on both sides), wouldn't it?

Suggested change
- SHOULD treat downstream protocol state as it would if the peer
disconnected.
- SHOULD send a `warning` and disconnect.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the interpretation as a buggy peer, however I see it differently. I see it as a forced abort of the inner protocol. I don't think things need to be buggy in order for you to behave that way. Indeed, the very fact that you are suggesting disconnecting as a means of handling this implies that the only sensible way to handle this partially executed inner protocol is the same.

So it depends on if we want to bless the process of aborting the inner protocol using this message or not. If you don't, then we can consider it a bug to arrive early and then it would make perfect sense to hang up. However, we could also just truncate the inner protocol state (which is what we do anyway) and then not have to reestablish the connection. I am sympathetic to using a disconnect here though as it may be the simplest path to resetting the inner protocol state and there may not be much to be gained by having this abort mechanism.

I think either approach is well-defined, and I think the particulars of how the implementation was done may make a "live truncation" of inner protocol state varying levels of difficult. If you think it drive you nuts to do this in Eclair we can just ban it and accept the suggestion you lay out here.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think either approach is well-defined

I don't think that's correct. In this commit we're adding an explicit message to terminate the quiescence state to ensure that its state machine is well-defined independently of the inner protocols that may run once quiescent. By the same reasoning, every inner protocol must define its own clean termination mechanism as well.

In the case described in this requirement, the sender of go_on didn't use the correct inner protocol termination: we thus have no way of knowing how to make them terminate the inner protocol cleanly. The only general solution here is to disconnect, because if we don't, we can't know in what state they'll be?

Copy link
Author

@ProofOfKeags ProofOfKeags Apr 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I don't think the semantics are different. If you send an out-of-turn go_on it would have the same effect on the inner protocol as a disconnect. If you don't like it, there's no reason we have to go this route, but I think it is well-defined.

we thus have no way of knowing how to make them terminate the inner protocol cleanly. The only general solution here is to disconnect, because if we don't, we can't know in what state they'll be?

Any inner protocol like this will have volatile state and committed state. The spec should delineate between the two as it is required for any disconnect tolerant protocol. As such, you know exactly what state they'll be in and can just revert any volatile state, which is exactly what you do during reconnect.

Sending an unexpected go_on always has a sensible interpretation. If you want to disconnect anyway, you can. You always can.

Does this deviate from Eclair's implementation strategy? I understand that it may be annoying to go this route and as such I'm willing to commit your suggestion, but I just want to make sure that we're on the same page that this is an effort/practicality discussion not a question of whether a solution exists.

EDIT: One thing we can also do is add "MAY send warning and disconnect" alongside the "SHOULD treat downstream protocol state as it would if the peer disconnected". This would allow for us to upgrade to this more "liveness preserving" abort implementation later while still allowing for the simpler-to-implement connection-cycling implementation out the gate.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you send an out-of-turn go_on it would have the same effect on the inner protocol as a disconnect.

I don't think so: the node who receives an out-of-turn go_on can apply the same clean-up as if it were a disconnect, but it has no guarantee that the sending node did the same. The receiver cannot know in which state the sender is (it only knows that it's buggy so it could be in any state) so it can't know how the sender will react to future messages. I think the only way to be sure you resolve this is to hang up on that buggy sender, hoping that at least on disconnection they correctly clean up their state (which isn't guaranteed either since they're buggy, but is more likely).

As a rule of thumb, whenever we detect buggy behavior from our peer, I believe we should disconnect, because we have no way of knowing how messed up their current state is, and restarting "from scratch" is our best bet at fixing it, since we have mechanisms in channel_reestablish to get back to the latest safe point. If the buggy peer is too buggy and even that doesn't fix it, the only solution is to close channels and permanently hang up on them.

EDIT: One thing we can also do is add "MAY send warning and disconnect"

That would work for me 👍, as eclair will disconnect in that case.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TL;DR: I'll add the language around "MAY disconnect" and that should unify concerns. If you want to make a case for why this is YAGNI, I'm open to it. I haven't thought about the strength-to-weight ratio of such a feature. I was just trying to write as permissive of a spec as possible.


The receiver cannot know in which state the sender is

The receiver only needs to know that the protocol has been aborted.

but it has no guarantee that the sending node did the same

Good catch. I'll admit I should probably add the same requirement to the sender to truncate volatile state to be clear.

it only knows that it's buggy so it could be in any state

Buggy implies that doing a forced abort can only result from incorrect implementation, and that is not a valid assumption in context right now. If you want to make the claim that we should never do a forced abort of some negotiation and that any protocol that supports an abort operation MUST specify one, then we can. However, I'm aware of no such requirement up until this point.

I think we have to be a little more careful about what behavior we classify as "buggy". Given that the spec language I propose explicitly allows it, it -- by definition -- would not be buggy to send those messages out-of-turn.

I'm operating with the understanding that a bug is when the software either doesn't implement the spec, or if the spec violates an assumed property. In this case, I'm not sure what assumed property would be violated.

Because any property that would be violated by an out-of-turn go_on would necessarily be violated by a disconnect in the same place, and since disconnects are assumed to be always allowed, there must exist well-defined behavior for this case and if we want to take advantage of that well-defined behavior in the future, we shouldn't classify it as "buggy". We may deem that the implementation effort of that well-defined behavior isn't worth what it buys us. Fair enough! However, it's only buggy if we decide to define it as buggy, which is currently what's under debate!

Beyond that, I consider it a layer violation to make the validity of the messages defined in quiescence depend on the in-this-scope-unknowable state of the downstream protocol. So I would consider it incoherent to put the following statement into the spec:

MUST NOT send go_on while a downstream protocol session is active.

...or something to that effect. This is because while quiescence is aware that there may exist downstream protocols, it shouldn't be allowed to know any of the particulars of what they are and how they work. I contrast this with the current language where we make a recommendation for how the downstream protocol handles this case since the awareness of protocol state flows from dependency to dependent AND NOT the other way around.

As a rule of thumb, whenever we detect buggy behavior from our peer, I believe we should disconnect, because we have no way of knowing how messed up their current state is, and restarting "from scratch" is our best bet at fixing it, since we have mechanisms in channel_reestablish to get back to the latest safe point.

I wholeheartedly agree with this, modulo my philosophy on "buggy" behavior from above.

- MUST handle subsequent `update_*` messages from the peer.
- if it has not sent a `go_on` since it sent `stfu`:
- MUST reply with a `go_on` message.

#### Rationale

If we do not include a message that explicitly resumes update traffic then we
must depend on the the downstream protocol to mark when update traffic may
resume. This is problematic because it means we cannot implement this protocol
in isolation with all of its resumption semantics.

Only the initiator of the quiescence is allowed to commence with `go_on` as
it is that node that can determine when the quiescence purpose has been
accomplished.

## Channel Close

Nodes can negotiate a mutual close of the connection, which unlike a
Expand Down