Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bolt2: add go_on to quiescence spec #14

Conversation

ProofOfKeags
Copy link

This change adds an explicit resume message to quiescence. Discussion on why this is needed is included in the Rationale section and for a deeper history of what motivates this we can examine the discussion on #869

Copy link

@t-bast t-bast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks simple enough, I'll experiment with that in eclair and will report back.

@@ -545,6 +545,42 @@ considered to be the channel funder (the sender of `open_channel`).
The quiescence effect is exactly the same as if one had replied to the
other.

### `go_on`

1. type: TODO (`go_on`)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope @cdecker can find us a better name for that one ;)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stands for "global operation is over now" 😁

But yeah I'm very open to improvement here 😂

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

carry_on 🧐

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Channel Active Renegotiation Realized. Yes it's Over Now

Comment on lines +567 to +568
- SHOULD treat downstream protocol state as it would if the peer
disconnected.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would only make it worse with regards to the inner protocol. This indicates a buggy peer, so it would be safer to simply disconnect (which also guarantees the inner protocol will be aborted on both sides), wouldn't it?

Suggested change
- SHOULD treat downstream protocol state as it would if the peer
disconnected.
- SHOULD send a `warning` and disconnect.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the interpretation as a buggy peer, however I see it differently. I see it as a forced abort of the inner protocol. I don't think things need to be buggy in order for you to behave that way. Indeed, the very fact that you are suggesting disconnecting as a means of handling this implies that the only sensible way to handle this partially executed inner protocol is the same.

So it depends on if we want to bless the process of aborting the inner protocol using this message or not. If you don't, then we can consider it a bug to arrive early and then it would make perfect sense to hang up. However, we could also just truncate the inner protocol state (which is what we do anyway) and then not have to reestablish the connection. I am sympathetic to using a disconnect here though as it may be the simplest path to resetting the inner protocol state and there may not be much to be gained by having this abort mechanism.

I think either approach is well-defined, and I think the particulars of how the implementation was done may make a "live truncation" of inner protocol state varying levels of difficult. If you think it drive you nuts to do this in Eclair we can just ban it and accept the suggestion you lay out here.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think either approach is well-defined

I don't think that's correct. In this commit we're adding an explicit message to terminate the quiescence state to ensure that its state machine is well-defined independently of the inner protocols that may run once quiescent. By the same reasoning, every inner protocol must define its own clean termination mechanism as well.

In the case described in this requirement, the sender of go_on didn't use the correct inner protocol termination: we thus have no way of knowing how to make them terminate the inner protocol cleanly. The only general solution here is to disconnect, because if we don't, we can't know in what state they'll be?

Copy link
Author

@ProofOfKeags ProofOfKeags Apr 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I don't think the semantics are different. If you send an out-of-turn go_on it would have the same effect on the inner protocol as a disconnect. If you don't like it, there's no reason we have to go this route, but I think it is well-defined.

we thus have no way of knowing how to make them terminate the inner protocol cleanly. The only general solution here is to disconnect, because if we don't, we can't know in what state they'll be?

Any inner protocol like this will have volatile state and committed state. The spec should delineate between the two as it is required for any disconnect tolerant protocol. As such, you know exactly what state they'll be in and can just revert any volatile state, which is exactly what you do during reconnect.

Sending an unexpected go_on always has a sensible interpretation. If you want to disconnect anyway, you can. You always can.

Does this deviate from Eclair's implementation strategy? I understand that it may be annoying to go this route and as such I'm willing to commit your suggestion, but I just want to make sure that we're on the same page that this is an effort/practicality discussion not a question of whether a solution exists.

EDIT: One thing we can also do is add "MAY send warning and disconnect" alongside the "SHOULD treat downstream protocol state as it would if the peer disconnected". This would allow for us to upgrade to this more "liveness preserving" abort implementation later while still allowing for the simpler-to-implement connection-cycling implementation out the gate.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you send an out-of-turn go_on it would have the same effect on the inner protocol as a disconnect.

I don't think so: the node who receives an out-of-turn go_on can apply the same clean-up as if it were a disconnect, but it has no guarantee that the sending node did the same. The receiver cannot know in which state the sender is (it only knows that it's buggy so it could be in any state) so it can't know how the sender will react to future messages. I think the only way to be sure you resolve this is to hang up on that buggy sender, hoping that at least on disconnection they correctly clean up their state (which isn't guaranteed either since they're buggy, but is more likely).

As a rule of thumb, whenever we detect buggy behavior from our peer, I believe we should disconnect, because we have no way of knowing how messed up their current state is, and restarting "from scratch" is our best bet at fixing it, since we have mechanisms in channel_reestablish to get back to the latest safe point. If the buggy peer is too buggy and even that doesn't fix it, the only solution is to close channels and permanently hang up on them.

EDIT: One thing we can also do is add "MAY send warning and disconnect"

That would work for me 👍, as eclair will disconnect in that case.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TL;DR: I'll add the language around "MAY disconnect" and that should unify concerns. If you want to make a case for why this is YAGNI, I'm open to it. I haven't thought about the strength-to-weight ratio of such a feature. I was just trying to write as permissive of a spec as possible.


The receiver cannot know in which state the sender is

The receiver only needs to know that the protocol has been aborted.

but it has no guarantee that the sending node did the same

Good catch. I'll admit I should probably add the same requirement to the sender to truncate volatile state to be clear.

it only knows that it's buggy so it could be in any state

Buggy implies that doing a forced abort can only result from incorrect implementation, and that is not a valid assumption in context right now. If you want to make the claim that we should never do a forced abort of some negotiation and that any protocol that supports an abort operation MUST specify one, then we can. However, I'm aware of no such requirement up until this point.

I think we have to be a little more careful about what behavior we classify as "buggy". Given that the spec language I propose explicitly allows it, it -- by definition -- would not be buggy to send those messages out-of-turn.

I'm operating with the understanding that a bug is when the software either doesn't implement the spec, or if the spec violates an assumed property. In this case, I'm not sure what assumed property would be violated.

Because any property that would be violated by an out-of-turn go_on would necessarily be violated by a disconnect in the same place, and since disconnects are assumed to be always allowed, there must exist well-defined behavior for this case and if we want to take advantage of that well-defined behavior in the future, we shouldn't classify it as "buggy". We may deem that the implementation effort of that well-defined behavior isn't worth what it buys us. Fair enough! However, it's only buggy if we decide to define it as buggy, which is currently what's under debate!

Beyond that, I consider it a layer violation to make the validity of the messages defined in quiescence depend on the in-this-scope-unknowable state of the downstream protocol. So I would consider it incoherent to put the following statement into the spec:

MUST NOT send go_on while a downstream protocol session is active.

...or something to that effect. This is because while quiescence is aware that there may exist downstream protocols, it shouldn't be allowed to know any of the particulars of what they are and how they work. I contrast this with the current language where we make a recommendation for how the downstream protocol handles this case since the awareness of protocol state flows from dependency to dependent AND NOT the other way around.

As a rule of thumb, whenever we detect buggy behavior from our peer, I believe we should disconnect, because we have no way of knowing how messed up their current state is, and restarting "from scratch" is our best bet at fixing it, since we have mechanisms in channel_reestablish to get back to the latest safe point.

I wholeheartedly agree with this, modulo my philosophy on "buggy" behavior from above.

1. type: TODO (`go_on`)
2. data:
* [`channel_id`:`channel_id`]
* **NOTE FOR REVIEWERS** do we need to include initiator here? I think not.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so either, since the only valid flow is:

  1. the quiescence initiator sends go_on
  2. the other node responds with go_on

Any other flow would be an error, that should trigger a disconnection.

- MUST NOT send `go_on` twice without an interceding full quiescence
negotiation.
- MUST send `go_on` once any dependent protocol has completed.
- MAY send `update_*` messages afterwards.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't you wait to receive your peer's go_on (which acts as an ACK) before resuming updates? This would more clearly mark a collaborative exit of the quiescence state. If we don't cleanly exit that way, we should disconnect before resuming updates.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can tighten the requirement here, but I'm trying to figure out what problem it solves. This only really impacts the initiator so we can examine that.

If I'm the initiator, and I send go_on and do not wait before sending new update_* messages, the responder has two options. Due to the protocol assumption of in-order message processing, any subsequent updates must be processed after the go_on. As the responder, either I'm ready to process them, in which case I send my go_on before anything else as an ack and then proceed to handle the rest of the buffer, or I am not ready in which case I'll just hang up.

I think it's tempting to think that these messages are "commands" for the peer, but since the whole protocol is asynchronous we can't really think of things that way. I view stfu and shutdown as polite requests for your peer do do something but what they really are is declarations for what to expect from your future behavior. Under this mental model we don't really need to wait for the responder.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's simply cleaner from a state machine design point of view (which is why we're adding a go_on message in the first place). If you don't wait for your your peer to respond with go_on before sending updates, you're in a slightly weird state after sending the first go_on: you're not back to your normal, out-of-quiescence state because you need to handle receiving your peer's go_on message. So you'll need to import all of the message semantics from your "normal" state into that "waiting for peer's go_on" state plus a handler for receiving go_on, at which point you go back to your "normal" state and quiescence is completely exited.

Eclair uses a strict finite state machine design for most of its components, which shows this design weirdness. But if you feel strongly about it I don't really mind, because your current requirements also allow us to wait for the responding go_on before sending updates, which is what we'll implement.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can as also follow what the parent PR did and write, "must now consider this channel < whatever the opposite of quiescing is >"
https://github.com/lightning/bolts/blob/a68d7b07cea0fb243f25e955cca6134cae5872cf/02-peer-protocol.md#requirements-5

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you'll need to import all of the message semantics from your "normal" state into that "waiting for peer's go_on" state plus a handler for receiving go_on, at which point you go back to your "normal" state and quiescence is completely exited.

Yeah I can see what you're saying where if you go for the strict FSM design and the only valid exit of quiescence is send/receive. I have preferred not to think about it as a strict FSM due to the asynchronous nature of the rest of the protocol.

While we have a total ordering of all messages on each half of the connection, we don't have a total ordering of all messages exchanged since there are many valid zips. As a result, as the responder, I have no way to guarantee you've received my go_on after I send it. I suppose if I don't ever send it I can guarantee you haven't received it. So maybe you're saying that your ACK of my go_on response is the resumption of update_* messages.

I actually don't feel strongly about it! And int the process of responding to this comment I'm finding myself convinced by this line of reasoning.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a result, as the responder, I have no way to guarantee you've received my go_on after I send it.

Sure, but I have the guarantee that if you receive my future update_* messages, you must have received my go_on first so your FSM cleanly got back to the normal, non-quiescent channel state. I think that should be enough?

So maybe you're saying that your ACK of my go_on response is the resumption of update_* messages.

Yes exactly, or the fact that if you resume sending update_* messages I won't disconnect on you.

I actually don't feel strongly about it! And int the process of responding to this comment I'm finding myself convinced by this line of reasoning.

I don't feel too strongly about this either, as we still have the option of waiting for our peer's go_on before sending our update_* messages with the way you currently worded the spec. So it's your choice!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll update as I do believe your suggestion makes it more likely that people will implement with predictable behavior.

- MUST NOT send `go_on` if it is not the quiescence "initiator".
- MUST NOT send `go_on` twice without an interceding full quiescence
negotiation.
- MUST send `go_on` once any dependent protocol has completed.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should be:

Suggested change
- MUST send `go_on` once any dependent protocol has completed.
- MUST send `go_on` only after a dependent protocol has completed.

Say if dependent protocol A completes, we might not want to end the quiescent state then because maybe we want to begin another dependent protocol session that requires a quiescent state.

- MUST NOT send `go_on` twice without an interceding full quiescence
negotiation.
- MUST send `go_on` once any dependent protocol has completed.
- MAY send `update_*` messages afterwards.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can as also follow what the parent PR did and write, "must now consider this channel < whatever the opposite of quiescing is >"
https://github.com/lightning/bolts/blob/a68d7b07cea0fb243f25e955cca6134cae5872cf/02-peer-protocol.md#requirements-5

@t-bast
Copy link

t-bast commented Apr 26, 2024

I've been prototyping this additional message with splicing, and this is not as simple to implement as I thought. The main complexity comes from a clash between the two following properties:

  1. Disconnecting ends quiescence, allowing us to return to a clean channel state if something goes wrong
  2. But inner protocols like splicing may need to finalize after a disconnection, which means they must partly execute in a non-quiescent state

I think we want to preserve (1), because it's important to be able to unilaterally get out of quiescence when necessary (e.g. to avoid force-closing when you have a pending update_fulfill_htlc to apply). Disconnecting is really nice here, because even if on reconnection your peer sends stfu again, you can first send your updates and get them cross-signed before sending your own stfu.

This forces such inner protocols to track whether a disconnection happened and handle two cases:

  • we haven't disconnected and thus need to also end quiescence (exchanging go_on messages)
  • we have disconnected and thus can exit the inner protocol without explicitly ending quiescence (sending go_on at that point may mess up an unrelated instance of quiescence)

This adds non-negligible complexity, and increases the likelihood of ending up in de-synchronized states in edge cases. Since we still need to be able to finalize the inner protocol without explicitly ending quiescence, it's simpler if that's always the case. I've tried to think about different ways of implementing splicing than what we did in eclair to get around this limitation, but it seems to be implementation-agnostic. @ddustin does this also lead to non-negligible additional complexity for the cln implementation?

At that point, I currently think it's not a good idea to add the go_on message.


Details on (2): inner protocols may reach a state where they cannot be aborted anymore, even though they're not finished yet. That's the case for splicing when one side has sent its signatures but hasn't received the remote signatures. At that point one peer has a fully signed transaction that spends the funding tx, so the protocol cannot be aborted anymore. If a disconnection happens, we thus need to re-send signatures to cleanly finalize the splice: we must finalize the inner protocol in a non-quiescent state (which is ok, we don't need the channel to be quiescent at that point).

But we must also handle the case where a cheating peer would never send us their signatures, and would publish the transaction anyway. There is no reason to force-close in that case, once the splice transaction is confirmed (or seen in the mempool) we don't need their signatures anymore: this is an exit path for that splice session.

I don't think you have the same issue for your current usage of dynamic commitments, because you don't broadcast the kick-off transaction nor expect it to confirm (you don't want the ability to later spend that kick-off transaction to stack upgrades), which is why this problem isn't obvious in your implementation. In your case, one side can end up with an unsigned kick-off transaction, but they can forever keep it unsigned: if they need to force-close, they'd publish the commit-tx that spends the funding transaction, not the kick-off one.


Note that this means there is a fundamental limitation on inner protocols that depend on quiescence: if they reach a state where they cannot be aborted anymore, they must be able to finalize in a non-quiescent channel state. I don't know if that could be limiting: I don't think so, but I can't predict all future use-cases.

@ProofOfKeags
Copy link
Author

This forces such inner protocols to track whether a disconnection happened and handle two cases:

Idk why you would choose to have the inner protocol track this rather than just signalling that it is complete and letting a quiescence session manager decide what to do with that knowledge.

Details on (2): inner protocols may reach a state where they cannot be aborted anymore, even though they're not finished yet. That's the case for splicing when one side has sent its signatures but hasn't received the remote signatures. At that point one peer has a fully signed transaction that spends the funding tx, so the protocol cannot be aborted anymore. If a disconnection happens, we thus need to re-send signatures to cleanly finalize the splice: we must finalize the inner protocol in a non-quiescent state (which is ok, we don't need the channel to be quiescent at that point).

Yeah this seems like the scenario when non-volatile state is half-committed. This will be a fundamental challenge for all protocols here. Dynamic Commitments has this challenge as well. IIUC the scenario you're trying to deal with is as follows:

  1. Splice gets negotiated
  2. Alice sends sigs to Bob
  3. Bob does NOT send sigs back to Alice (blocking the clean termination of splicing)
  4. Bob sends go_on (instead of disconnecting, skipping the reest negotiation)
  5. Alice normally would want to reest in this state, but the connection is live and it's not presently OK/expected to do reest here
  6. Bob broadcasts the splice tx
  7. Alice sees this and is in a good state again

Did I capture this properly? If so, I think you're saying that the core difference between a go_on volatile state truncation and a disconnect state truncation is the fact that we don't know whether or not we need to reest, and if we do, how to actually do that. This seems like a thorny issue indeed. It definitely suggests to me that your recommendation to disconnect if you receive an out-of-turn go_on is likely to be better. I'm not sure I think it means that a go_on message is entirely the wrong idea.

However, let's suppose we don't do it. Do we just make all downstream protocols responsible for defining the quiescence end state? I still think that is a problematic design from a clean boundaries perspective. Perhaps a subtle change to the language may help here. Rather than giving Splicing/DynComms the responsibility of defining quiescence termination, they just clearly state when they are themselves "terminal". Then we can go into the quiescence spec and say:

only one downstream protocol that requires quiescence can be executed per quiescence session. Quiescence is considered terminated when the downstream protocol reaches a terminal state.

This somewhat improves the issues I had, but I have to say I still think leaving it implicit is not a great idea. I think this points fairly strongly to the fact that we don't have a good generalized way to handle half-committed states. It seems we depend on the fact that disconnection is the only thing that can produce half-committed state, which seems reasonable. If we want to continue with this design pattern, then we should disconnect any time we believe that a protocol is stuck between half-committed and fully-committed.

@t-bast
Copy link

t-bast commented Apr 29, 2024

Idk why you would choose to have the inner protocol track this rather than just signalling that it is complete and letting a quiescence session manager decide what to do with that knowledge.

This is indeed quite similar to what we do, but it doesn't change the core issue that with a go_on message, we need to handle two non-trivial code paths, while we only need to handle one without it (once the inner protocol reaches the state where it cannot be aborted, you must either complete it regardless of disconnections, or force-close if you fail to complete it). The channel state machine can already be quite thorny, so I'm trying to minimize the additional complexity we add to it...

Did I capture this properly? If so, I think you're saying that the core difference between a go_on volatile state truncation and a disconnect state truncation is the fact that we don't know whether or not we need to reest, and if we do, how to actually do that. This seems like a thorny issue indeed. It definitely suggests to me that your recommendation to disconnect if you receive an out-of-turn go_on is likely to be better. I'm not sure I think it means that a go_on message is entirely the wrong idea.

It's rather that an inner protocol needs to define a mechanism to guarantee finalization if we get to the point where we cannot abort anymore. This requires handling disconnections (because disconnections may happen), and thus requires a mechanism to finalize the protocol on reconnection (in channel_reestablish to ensure that it can combine with other channel state reconciliation steps).

Rather than giving Splicing/DynComms the responsibility of defining quiescence termination, they just clearly state when they are themselves "terminal". Then we can go into the quiescence spec and say:

That indeed captures the intent correctly, but it adds more implementation complexity than considering that this terminal state also ends quiescence.

I think this points fairly strongly to the fact that we don't have a good generalized way to handle half-committed states.

Yes, that's exactly it, but I don't think we can have a generalized way of handling this: the mechanism we introduced for dual funding and splicing is a generalized way of handling this for any interactive-tx session, which is a good step towards this, but I'm not sure we can do better...

This somewhat improves the issues I had, but I have to say I still think leaving it implicit is not a great idea.

I wouldn't say that it's implicit if inner protocols have an explicit finalization (which is the case for splicing). It's rather a layering violation (which conceptually is not entirely satisfying, I agree with you on that point), but if the spec says that termination of the inner protocol terminates the quiescence session, this is explicit enough. As I mentioned initially, if we could get clean layering without additional complexity, I would be all for it, but in this case it doesn't seem to be the case: clean layering introduces non-trivial complexity, which is why I'm reluctant to have it...

@ddustin
Copy link
Collaborator

ddustin commented Apr 29, 2024

This adds non-negligible complexity, and increases the likelihood of ending up in de-synchronized states in edge cases. Since we still need to be able to finalize the inner protocol without explicitly ending quiescence, it's simpler if that's always the case. I've tried to think about different ways of implementing splicing than what we did in eclair to get around this limitation, but it seems to be implementation-agnostic. @ddustin does this also lead to non-negligible additional complexity for the cln implementation?

Yeah we would need to support both using go_on and not depending on state.

In CLN we have a function called resume_splice_negotiation

/* Called to finish an ongoing splice OR on restart from channel_reestablish. */
static void resume_splice_negotiation(struct peer *peer,
				      bool send_commitments,
				      bool recv_commitments,
				      bool send_signature,
				      bool recv_signature)

It is called as the last step in a normal splice (and would probably be responsible for go_on) as well as during channel reestablish (which should not send go_on). So either detecting if we're in STFU mode or reestablishing and sending/expecting go_on.

Looking through the code though, I suspect that part would be fairly trivial to add as long as we settle on some rules:

  1. You can only send go_on immediately after tx_signatures, any other time would be a warning + disconnect or force close when not allowed (I suspect rules would be same as tx_abort here?)
  2. Does the order of who sends go_on first matter?
  3. How many messages (if any) can we receive after sending go_on that aren't the go_on ACK?

Looking at (3) feels a little complicated at first pass. The easy way to implement this would be to drop go_on into the spot we consider STFU ending today, but that requires one side to receive a tx_signatures then go_on after sending go_on.

     Node A        Node B
tx_signature ->
go_on        ->
             <- tx_signature
             <- go_on

So in this case ^ Node A sends go_on and then receives tx_signature before receiving the ACK of go_on.

@t-bast
Copy link

t-bast commented Apr 29, 2024

In CLN we have a function called resume_splice_negotiation

Note that you need to be careful here that while you are resuming splice negotiation after a reconnect, your peer may also initiate stfu for a different thing concurrently. That should be decorrelated from the splice session you're finalizing (not saying there's an issue with your implementation, just something to look out for - especially when considering the additional go_on requirements).

@ddustin
Copy link
Collaborator

ddustin commented Apr 29, 2024

I wouldn't say that it's implicit if inner protocols have an explicit finalization (which is the case for splicing). It's rather a layering violation (which conceptually is not entirely satisfying, I agree with you on that point), but if the spec says that termination of the inner protocol terminates the quiescence session, this is explicit enough. As I mentioned initially, if we could get clean layering without additional complexity, I would be all for it, but in this case it doesn't seem to be the case: clean layering introduces non-trivial complexity, which is why I'm reluctant to have it...

Yeah I think this is the main challenge. With splicing we have explicit rules for when aborting is allowed or not -- basically if we sent signatures, you can't abort until you send yours.

These rules have been fleshed out for tx_abort. go_on would need to follow those rules too. Which essentially means at a protocol design level the higher layer of go_on rules are intermingled with the lower layer of splicing.

@t-bast
Copy link

t-bast commented Apr 29, 2024

  1. Does the order of who sends go_on first matter?

Yes I think it does, because the quiescence initiator may want to follow-up with something else after the splice and send a message that would be invalidated by the non-initiator sending go_on concurrently...

@ProofOfKeags
Copy link
Author

Ok so I'm finding myself convinced here that the cost-benefit calculation of adding this probably isn't worth it. But it has highlighted some expectations we have of the protocol design that I would like to state explicitly:

  1. All synchronization of half-committed state happens through channel_reestablish (reest) and trying to do this synchronization at any other part of the protocol is a Very Bad Idea.
  2. We should not create any protocols that bless message sequences that result in half-committed state. Stated differently, the only way we should ever end up in a half-committed state is if the peer is faulty (network disconnects, software bugs, attacks). Any protocol that allows this by design is considered a flawed design.
  3. Disconnections are always possible, so we can never truly avoid half-committed state, but since we reest on reconnection we have a central place through which half-committed state is resolved.
  4. Layering violations are aesthetically dissatisfying, but not necessarily ill-defined. Aesthetic improvements that add serious implementation complexity should be abandoned.

I'm closing this but will link this PR to the main BOLTs PR to minimize Lost Lore

@t-bast
Copy link

t-bast commented May 2, 2024

Thanks @ProofOfKeags, that's a very good summary! And thanks for investigating this idea, it was worth spending time thinking about this and prototyping to figure out those subtleties.

@rustyrussell
Copy link
Owner

Note: we should have an explicit "disconnect" message which disconnects a particular channel, (i.e. next message must be reestablish) so we don't have to actually disconnect at the TCP level...

@ProofOfKeags
Copy link
Author

Note: we should have an explicit "disconnect" message which disconnects a particular channel, (i.e. next message must be reestablish) so we don't have to actually disconnect at the TCP level...

Should we cut an issue about this so it doesn't get lost?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants