-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bolt2: add go_on
to quiescence spec
#14
bolt2: add go_on
to quiescence spec
#14
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks simple enough, I'll experiment with that in eclair
and will report back.
@@ -545,6 +545,42 @@ considered to be the channel funder (the sender of `open_channel`). | |||
The quiescence effect is exactly the same as if one had replied to the | |||
other. | |||
|
|||
### `go_on` | |||
|
|||
1. type: TODO (`go_on`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope @cdecker can find us a better name for that one ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stands for "global operation is over now" 😁
But yeah I'm very open to improvement here 😂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
carry_on
🧐
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Channel Active Renegotiation Realized. Yes it's Over Now
- SHOULD treat downstream protocol state as it would if the peer | ||
disconnected. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would only make it worse with regards to the inner protocol. This indicates a buggy peer, so it would be safer to simply disconnect (which also guarantees the inner protocol will be aborted on both sides), wouldn't it?
- SHOULD treat downstream protocol state as it would if the peer | |
disconnected. | |
- SHOULD send a `warning` and disconnect. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the interpretation as a buggy peer, however I see it differently. I see it as a forced abort of the inner protocol. I don't think things need to be buggy in order for you to behave that way. Indeed, the very fact that you are suggesting disconnecting as a means of handling this implies that the only sensible way to handle this partially executed inner protocol is the same.
So it depends on if we want to bless the process of aborting the inner protocol using this message or not. If you don't, then we can consider it a bug to arrive early and then it would make perfect sense to hang up. However, we could also just truncate the inner protocol state (which is what we do anyway) and then not have to reestablish the connection. I am sympathetic to using a disconnect here though as it may be the simplest path to resetting the inner protocol state and there may not be much to be gained by having this abort mechanism.
I think either approach is well-defined, and I think the particulars of how the implementation was done may make a "live truncation" of inner protocol state varying levels of difficult. If you think it drive you nuts to do this in Eclair we can just ban it and accept the suggestion you lay out here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think either approach is well-defined
I don't think that's correct. In this commit we're adding an explicit message to terminate the quiescence state to ensure that its state machine is well-defined independently of the inner protocols that may run once quiescent. By the same reasoning, every inner protocol must define its own clean termination mechanism as well.
In the case described in this requirement, the sender of go_on
didn't use the correct inner protocol termination: we thus have no way of knowing how to make them terminate the inner protocol cleanly. The only general solution here is to disconnect, because if we don't, we can't know in what state they'll be?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, I don't think the semantics are different. If you send an out-of-turn go_on
it would have the same effect on the inner protocol as a disconnect. If you don't like it, there's no reason we have to go this route, but I think it is well-defined.
we thus have no way of knowing how to make them terminate the inner protocol cleanly. The only general solution here is to disconnect, because if we don't, we can't know in what state they'll be?
Any inner protocol like this will have volatile state and committed state. The spec should delineate between the two as it is required for any disconnect tolerant protocol. As such, you know exactly what state they'll be in and can just revert any volatile state, which is exactly what you do during reconnect.
Sending an unexpected go_on
always has a sensible interpretation. If you want to disconnect anyway, you can. You always can.
Does this deviate from Eclair's implementation strategy? I understand that it may be annoying to go this route and as such I'm willing to commit your suggestion, but I just want to make sure that we're on the same page that this is an effort/practicality discussion not a question of whether a solution exists.
EDIT: One thing we can also do is add "MAY send warning
and disconnect" alongside the "SHOULD treat downstream protocol state as it would if the peer disconnected". This would allow for us to upgrade to this more "liveness preserving" abort implementation later while still allowing for the simpler-to-implement connection-cycling implementation out the gate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you send an out-of-turn go_on it would have the same effect on the inner protocol as a disconnect.
I don't think so: the node who receives an out-of-turn go_on
can apply the same clean-up as if it were a disconnect, but it has no guarantee that the sending node did the same. The receiver cannot know in which state the sender is (it only knows that it's buggy so it could be in any state) so it can't know how the sender will react to future messages. I think the only way to be sure you resolve this is to hang up on that buggy sender, hoping that at least on disconnection they correctly clean up their state (which isn't guaranteed either since they're buggy, but is more likely).
As a rule of thumb, whenever we detect buggy behavior from our peer, I believe we should disconnect, because we have no way of knowing how messed up their current state is, and restarting "from scratch" is our best bet at fixing it, since we have mechanisms in channel_reestablish
to get back to the latest safe point. If the buggy peer is too buggy and even that doesn't fix it, the only solution is to close channels and permanently hang up on them.
EDIT: One thing we can also do is add "MAY send warning and disconnect"
That would work for me 👍, as eclair will disconnect in that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TL;DR: I'll add the language around "MAY disconnect" and that should unify concerns. If you want to make a case for why this is YAGNI, I'm open to it. I haven't thought about the strength-to-weight ratio of such a feature. I was just trying to write as permissive of a spec as possible.
The receiver cannot know in which state the sender is
The receiver only needs to know that the protocol has been aborted.
but it has no guarantee that the sending node did the same
Good catch. I'll admit I should probably add the same requirement to the sender to truncate volatile state to be clear.
it only knows that it's buggy so it could be in any state
Buggy implies that doing a forced abort can only result from incorrect implementation, and that is not a valid assumption in context right now. If you want to make the claim that we should never do a forced abort of some negotiation and that any protocol that supports an abort operation MUST specify one, then we can. However, I'm aware of no such requirement up until this point.
I think we have to be a little more careful about what behavior we classify as "buggy". Given that the spec language I propose explicitly allows it, it -- by definition -- would not be buggy to send those messages out-of-turn.
I'm operating with the understanding that a bug is when the software either doesn't implement the spec, or if the spec violates an assumed property. In this case, I'm not sure what assumed property would be violated.
Because any property that would be violated by an out-of-turn go_on
would necessarily be violated by a disconnect in the same place, and since disconnects are assumed to be always allowed, there must exist well-defined behavior for this case and if we want to take advantage of that well-defined behavior in the future, we shouldn't classify it as "buggy". We may deem that the implementation effort of that well-defined behavior isn't worth what it buys us. Fair enough! However, it's only buggy if we decide to define it as buggy, which is currently what's under debate!
Beyond that, I consider it a layer violation to make the validity of the messages defined in quiescence depend on the in-this-scope-unknowable state of the downstream protocol. So I would consider it incoherent to put the following statement into the spec:
MUST NOT send
go_on
while a downstream protocol session is active.
...or something to that effect. This is because while quiescence is aware that there may exist downstream protocols, it shouldn't be allowed to know any of the particulars of what they are and how they work. I contrast this with the current language where we make a recommendation for how the downstream protocol handles this case since the awareness of protocol state flows from dependency to dependent AND NOT the other way around.
As a rule of thumb, whenever we detect buggy behavior from our peer, I believe we should disconnect, because we have no way of knowing how messed up their current state is, and restarting "from scratch" is our best bet at fixing it, since we have mechanisms in channel_reestablish to get back to the latest safe point.
I wholeheartedly agree with this, modulo my philosophy on "buggy" behavior from above.
1. type: TODO (`go_on`) | ||
2. data: | ||
* [`channel_id`:`channel_id`] | ||
* **NOTE FOR REVIEWERS** do we need to include initiator here? I think not. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so either, since the only valid flow is:
- the quiescence initiator sends
go_on
- the other node responds with
go_on
Any other flow would be an error, that should trigger a disconnection.
- MUST NOT send `go_on` twice without an interceding full quiescence | ||
negotiation. | ||
- MUST send `go_on` once any dependent protocol has completed. | ||
- MAY send `update_*` messages afterwards. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't you wait to receive your peer's go_on
(which acts as an ACK) before resuming updates? This would more clearly mark a collaborative exit of the quiescence
state. If we don't cleanly exit that way, we should disconnect before resuming updates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can tighten the requirement here, but I'm trying to figure out what problem it solves. This only really impacts the initiator so we can examine that.
If I'm the initiator, and I send go_on
and do not wait before sending new update_*
messages, the responder has two options. Due to the protocol assumption of in-order message processing, any subsequent updates must be processed after the go_on
. As the responder, either I'm ready to process them, in which case I send my go_on
before anything else as an ack and then proceed to handle the rest of the buffer, or I am not ready in which case I'll just hang up.
I think it's tempting to think that these messages are "commands" for the peer, but since the whole protocol is asynchronous we can't really think of things that way. I view stfu
and shutdown
as polite requests for your peer do do something but what they really are is declarations for what to expect from your future behavior. Under this mental model we don't really need to wait for the responder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's simply cleaner from a state machine design point of view (which is why we're adding a go_on
message in the first place). If you don't wait for your your peer to respond with go_on
before sending updates, you're in a slightly weird state after sending the first go_on
: you're not back to your normal, out-of-quiescence state because you need to handle receiving your peer's go_on
message. So you'll need to import all of the message semantics from your "normal" state into that "waiting for peer's go_on" state plus a handler for receiving go_on
, at which point you go back to your "normal" state and quiescence is completely exited.
Eclair uses a strict finite state machine design for most of its components, which shows this design weirdness. But if you feel strongly about it I don't really mind, because your current requirements also allow us to wait for the responding go_on
before sending updates, which is what we'll implement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can as also follow what the parent PR did and write, "must now consider this channel < whatever the opposite of quiescing is >"
https://github.com/lightning/bolts/blob/a68d7b07cea0fb243f25e955cca6134cae5872cf/02-peer-protocol.md#requirements-5
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you'll need to import all of the message semantics from your "normal" state into that "waiting for peer's go_on" state plus a handler for receiving go_on, at which point you go back to your "normal" state and quiescence is completely exited.
Yeah I can see what you're saying where if you go for the strict FSM design and the only valid exit of quiescence is send/receive. I have preferred not to think about it as a strict FSM due to the asynchronous nature of the rest of the protocol.
While we have a total ordering of all messages on each half of the connection, we don't have a total ordering of all messages exchanged since there are many valid zips. As a result, as the responder, I have no way to guarantee you've received my go_on
after I send it. I suppose if I don't ever send it I can guarantee you haven't received it. So maybe you're saying that your ACK of my go_on
response is the resumption of update_*
messages.
I actually don't feel strongly about it! And int the process of responding to this comment I'm finding myself convinced by this line of reasoning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a result, as the responder, I have no way to guarantee you've received my go_on after I send it.
Sure, but I have the guarantee that if you receive my future update_*
messages, you must have received my go_on
first so your FSM cleanly got back to the normal, non-quiescent channel state. I think that should be enough?
So maybe you're saying that your ACK of my go_on response is the resumption of update_* messages.
Yes exactly, or the fact that if you resume sending update_*
messages I won't disconnect on you.
I actually don't feel strongly about it! And int the process of responding to this comment I'm finding myself convinced by this line of reasoning.
I don't feel too strongly about this either, as we still have the option of waiting for our peer's go_on
before sending our update_*
messages with the way you currently worded the spec. So it's your choice!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll update as I do believe your suggestion makes it more likely that people will implement with predictable behavior.
- MUST NOT send `go_on` if it is not the quiescence "initiator". | ||
- MUST NOT send `go_on` twice without an interceding full quiescence | ||
negotiation. | ||
- MUST send `go_on` once any dependent protocol has completed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this should be:
- MUST send `go_on` once any dependent protocol has completed. | |
- MUST send `go_on` only after a dependent protocol has completed. |
Say if dependent protocol A completes, we might not want to end the quiescent state then because maybe we want to begin another dependent protocol session that requires a quiescent state.
- MUST NOT send `go_on` twice without an interceding full quiescence | ||
negotiation. | ||
- MUST send `go_on` once any dependent protocol has completed. | ||
- MAY send `update_*` messages afterwards. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can as also follow what the parent PR did and write, "must now consider this channel < whatever the opposite of quiescing is >"
https://github.com/lightning/bolts/blob/a68d7b07cea0fb243f25e955cca6134cae5872cf/02-peer-protocol.md#requirements-5
I've been prototyping this additional message with splicing, and this is not as simple to implement as I thought. The main complexity comes from a clash between the two following properties:
I think we want to preserve (1), because it's important to be able to unilaterally get out of quiescence when necessary (e.g. to avoid force-closing when you have a pending This forces such inner protocols to track whether a disconnection happened and handle two cases:
This adds non-negligible complexity, and increases the likelihood of ending up in de-synchronized states in edge cases. Since we still need to be able to finalize the inner protocol without explicitly ending quiescence, it's simpler if that's always the case. I've tried to think about different ways of implementing splicing than what we did in At that point, I currently think it's not a good idea to add the Details on (2): inner protocols may reach a state where they cannot be aborted anymore, even though they're not finished yet. That's the case for splicing when one side has sent its signatures but hasn't received the remote signatures. At that point one peer has a fully signed transaction that spends the funding tx, so the protocol cannot be aborted anymore. If a disconnection happens, we thus need to re-send signatures to cleanly finalize the splice: we must finalize the inner protocol in a non-quiescent state (which is ok, we don't need the channel to be quiescent at that point). But we must also handle the case where a cheating peer would never send us their signatures, and would publish the transaction anyway. There is no reason to force-close in that case, once the splice transaction is confirmed (or seen in the mempool) we don't need their signatures anymore: this is an exit path for that splice session. I don't think you have the same issue for your current usage of dynamic commitments, because you don't broadcast the kick-off transaction nor expect it to confirm (you don't want the ability to later spend that kick-off transaction to stack upgrades), which is why this problem isn't obvious in your implementation. In your case, one side can end up with an unsigned kick-off transaction, but they can forever keep it unsigned: if they need to force-close, they'd publish the commit-tx that spends the funding transaction, not the kick-off one. Note that this means there is a fundamental limitation on inner protocols that depend on quiescence: if they reach a state where they cannot be aborted anymore, they must be able to finalize in a non-quiescent channel state. I don't know if that could be limiting: I don't think so, but I can't predict all future use-cases. |
Idk why you would choose to have the inner protocol track this rather than just signalling that it is complete and letting a quiescence session manager decide what to do with that knowledge.
Yeah this seems like the scenario when non-volatile state is half-committed. This will be a fundamental challenge for all protocols here. Dynamic Commitments has this challenge as well. IIUC the scenario you're trying to deal with is as follows:
Did I capture this properly? If so, I think you're saying that the core difference between a However, let's suppose we don't do it. Do we just make all downstream protocols responsible for defining the quiescence end state? I still think that is a problematic design from a clean boundaries perspective. Perhaps a subtle change to the language may help here. Rather than giving Splicing/DynComms the responsibility of defining quiescence termination, they just clearly state when they are themselves "terminal". Then we can go into the quiescence spec and say:
This somewhat improves the issues I had, but I have to say I still think leaving it implicit is not a great idea. I think this points fairly strongly to the fact that we don't have a good generalized way to handle half-committed states. It seems we depend on the fact that disconnection is the only thing that can produce half-committed state, which seems reasonable. If we want to continue with this design pattern, then we should disconnect any time we believe that a protocol is stuck between half-committed and fully-committed. |
This is indeed quite similar to what we do, but it doesn't change the core issue that with a
It's rather that an inner protocol needs to define a mechanism to guarantee finalization if we get to the point where we cannot abort anymore. This requires handling disconnections (because disconnections may happen), and thus requires a mechanism to finalize the protocol on reconnection (in
That indeed captures the intent correctly, but it adds more implementation complexity than considering that this terminal state also ends quiescence.
Yes, that's exactly it, but I don't think we can have a generalized way of handling this: the mechanism we introduced for dual funding and splicing is a generalized way of handling this for any
I wouldn't say that it's implicit if inner protocols have an explicit finalization (which is the case for splicing). It's rather a layering violation (which conceptually is not entirely satisfying, I agree with you on that point), but if the spec says that termination of the inner protocol terminates the quiescence session, this is explicit enough. As I mentioned initially, if we could get clean layering without additional complexity, I would be all for it, but in this case it doesn't seem to be the case: clean layering introduces non-trivial complexity, which is why I'm reluctant to have it... |
Yeah we would need to support both using In CLN we have a function called
It is called as the last step in a normal splice (and would probably be responsible for Looking through the code though, I suspect that part would be fairly trivial to add as long as we settle on some rules:
Looking at (3) feels a little complicated at first pass. The easy way to implement this would be to drop
So in this case ^ Node A sends |
Note that you need to be careful here that while you are resuming splice negotiation after a reconnect, your peer may also initiate |
Yeah I think this is the main challenge. With splicing we have explicit rules for when aborting is allowed or not -- basically if we sent signatures, you can't abort until you send yours. These rules have been fleshed out for |
Yes I think it does, because the quiescence initiator may want to follow-up with something else after the splice and send a message that would be invalidated by the non-initiator sending |
Ok so I'm finding myself convinced here that the cost-benefit calculation of adding this probably isn't worth it. But it has highlighted some expectations we have of the protocol design that I would like to state explicitly:
I'm closing this but will link this PR to the main BOLTs PR to minimize Lost Lore |
Thanks @ProofOfKeags, that's a very good summary! And thanks for investigating this idea, it was worth spending time thinking about this and prototyping to figure out those subtleties. |
Note: we should have an explicit "disconnect" message which disconnects a particular channel, (i.e. next message must be reestablish) so we don't have to actually disconnect at the TCP level... |
Should we cut an issue about this so it doesn't get lost? |
This change adds an explicit resume message to quiescence. Discussion on why this is needed is included in the
Rationale
section and for a deeper history of what motivates this we can examine the discussion on #869