Add delta handling to the node synchronization path to properly remove data #140
Comments
I'll be starting this project early next week at Phase 1 to make sure the tests are in good shape and the design passes some basic requirements before implementing.

@julianknutsen Info for other readers: @julianknutsen, @freimair, and I had a voice chat today where we discussed this proposal.

Phase 1 is complete as of bisq-network/bisq#3667

Closing as incomplete. I will be unable to complete this work, but it may be useful for future devs that want to pick it up.

Should this be re-opened? I may be able to take a stab at it once I'm in the heap.

Honestly, this doesn't feel that high priority given the other outstanding issues with Bisq. The only user-facing issue is duplicate compensation requests, and that isn't very critical. If you have the expertise, it is likely better to spend your time on something more important.

I just didn't want it to get lost.

It shouldn't get lost. The maintainers in charge of the pending work are aware of it, and if it is important enough it will bubble to the top. There is just too much noise, with everything that could be done ending up as a proposal issue.
This proposal suggests a design and implementation plan to fix bisq-network/bisq#3610. A bug was identified where nodes would keep stale data locally if it was removed while they were offline, which caused deleted Proposals to reappear in the BlindVote phase.
Design & Background
Development Plan
Future Work
Progress
Phase 1: Complete
Phase 2: On Track
Proposal
TL;DR: Nodes are informed of deletes that occurred while they were offline so they can synchronize their internal state with a seed node during the initialization sequence.
Before `ProtectedStorageEntrys`, the `PersistableNetworkPayload` objects were append-only and immutable. This meant that there were never issues with interleaved remove operations or `GetData` updating when nodes were offline. It was OK to just ask for all of the new objects when a node came online, and the set of Payloads would sync. But with the introduction of `ProtectedStorageEntry` objects that can be removed and receive updates, there is a window of risk where nodes will not see operations that occurred while they were offline.

The current initialization path uses a sequence of messages between an initializing node (consumer) and a seed node (provider). Currently, this sequence is limited to retrieving only the `ProtectedStorageEntrys` that exist on the provider node but not on the consumer node.

Current Synchronization Messages
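For reference, a simplified view of the message flow discussed below (a sketch, not an exhaustive protocol trace):

```
Consumer (initializing node)                 Provider (seed node)
        |--- PreliminaryGetDataRequest --------->|  hashes of locally known payloads
        |<-------- GetDataResponse --------------|  entries the consumer is missing
        |   [hidden service published; broadcasts begin]
        |--- GetUpdatedDataRequest ------------->|
        |<-------- GetDataResponse --------------|
```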
Specific Failure Cases
1. Lost Remove Messages
The current message sequence focuses only on adding missing data, not on synchronizing state between the consumer and provider. This causes issues in two cases:
a. RemoveData when node is offline
`ProtectedStoragePayload` objects implementing the `PersistablePayload` interface are loaded from disk early in startup. This means that the payload hashes for these objects will be sent in the initial `PreliminaryGetDataRequest` message. Even if the provider node doesn't recognize a hash, it is only interested in sending NEW objects, so the consumer node will never delete the should-be-deleted `ProtectedStorageEntry`.

This is the current bug with Compensation Requests (bisq-network/bisq#3610).
b. RemoveData between PreliminaryGetDataRequest & GetUpdatedDataRequest
Until the hidden service (HS) is published, the P2PDataStorage layer does not receive broadcast messages. But the timing of this operation is not synchronized with the `PreliminaryGetDataRequest` -> `GetDataResponse` -> `GetUpdatedDataRequest` sequence.

This means that if a `RemoveData` message was broadcast BEFORE the HS was published, but AFTER the `GetDataResponse` was sent to the consumer, the consumer would add the should-be-removed Entry and would never know the Entry was removed. It would be up to the TTL to remove it eventually, but in the meantime the node may show UI artifacts or apply state updates based on this invalid data.
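To make the window explicit, an illustrative timeline of this race (hypothetical Entry X, not from the original report):

```
t0  Consumer sends PreliminaryGetDataRequest
t1  Provider sends GetDataResponse (includes Entry X)
t2  Owner broadcasts RemoveData(X)        <- consumer's HS not yet published; message missed
t3  Consumer publishes its hidden service
t4  Consumer sends GetUpdatedDataRequest  <- X is already "known", so nothing corrects it
```

2. Lost Metadata Updates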
For `ProtectedStorageEntrys`, since only the payload hash is sent to the provider, the consumer will not receive any updates to known payloads that have had their metadata updated. Metadata here refers to the `metadata` section below.

So, if an update (read: `addProtectedStorageEntry()` for an existing Payload) was sent while a node was offline, the consumer node will not receive that update. I believe in the current system this isn't an issue, because there are no persistent `ProtectedStoragePayload` objects that depend on metadata (`OfferPayload` is non-persistent). But if the future holds more payload types that depend on metadata updates, it may make sense to fix this gap.

Design Goals
ProtectedStorageEntry Delta Design
This section will focus on a design for the P2PDataStorage system: `ProtectedStorageEntryDelta`. I'll explain the basic concepts and implementation and then tie it into how it can solve the initialization problems described above.

A `ProtectedStorageEntryDelta` is the minimum set of data required to perform an update or delete operation on a `ProtectedStorageEntry`. Think of it as a lightweight `ProtectedStorageEntry` that is only useful for objects that already exist. It leverages the fact that the `Payload` objects already exist on the peer, so they do not need to be retransmitted.
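To make the concept concrete, here is a minimal sketch of what such a delta type might carry. The field names are assumptions based on what a full `ProtectedStorageEntry` carries, not the actual Bisq implementation:

```java
// Hypothetical sketch only: the minimum data needed to re-run an add/remove
// against a Payload the peer already has, without retransmitting the Payload.
public final class ProtectedStorageEntryDelta {
    private final byte[] payloadHash;      // identifies the Payload already stored on the peer
    private final int sequenceNumber;      // feeds the existing seqNr validation
    private final byte[] signature;        // owner's signature, checked against the stored Payload's ownerPubKey
    private final long creationTimeStamp;  // Entry metadata (relevant to the UpdateDelta case)

    public ProtectedStorageEntryDelta(byte[] payloadHash, int sequenceNumber,
                                      byte[] signature, long creationTimeStamp) {
        this.payloadHash = payloadHash;
        this.sequenceNumber = sequenceNumber;
        this.signature = signature;
        this.creationTimeStamp = creationTimeStamp;
    }

    public byte[] getPayloadHash() { return payloadHash; }
    public int getSequenceNumber() { return sequenceNumber; }
    public byte[] getSignature() { return signature; }
    public long getCreationTimeStamp() { return creationTimeStamp; }
}
```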
Benefits

Currently, the system relies on full `ProtectedStorageEntry` messages to remove data (`RemoveDataMessage`, 1700 bytes) and a smaller version for updates (`RefreshOfferMessage`, 123 bytes). Using a new lightweight `ProtectedStorageEntryDelta` object can give the same functionality while reducing the on-wire size.

- During the GetData sequence, the provider can send `RemoveDelta` objects to the consumer to inform them of `RemoveDataMessage`s that happened while the node was offline. This will be less expensive in memory and bandwidth than storing the full `RemoveDataMessage` and sending it to them.
- During the GetData sequence, the provider could send `UpdateDelta` objects to the consumer to inform them of `AddDataMessage`s that happened while the node was offline. This will be less expensive in memory and bandwidth than storing the last full `AddDataMessage` and sending it to them. It is important to note that I don't think there are any interesting consumers of this right now. The only `Payload` that cares about the metadata is the `OfferPayload`, and since it is non-persistable it will never be sent from the consumer to the provider in the `PreliminaryGetDataRequest`.
- Future development work could replace `RefreshOfferMessage` with `UpdateDelta`, reducing on-wire size by ~20%.
- Future development work could replace `RemoveDataMessage` with `RemoveDelta`, reducing on-wire size by ~95%.
- Future development could optimize `addProtectedStorageEntry` to use an `UpdateDelta` when the `Payload` already exists.

Implementation Behavior
RemoveDelta

- If a `RemoveDelta` is received for a `ProtectedStorageEntry` that does not exist, it is discarded. Without the `Payload` we can't validate the signature.
- If a `RemoveDelta` is received for a `ProtectedStorageEntry` that does exist:
  - A `ProtectedStorageEntry` is created from the existing `protectedStoragePayload` and the new metadata provided in the `ProtectedStorageEntryDelta`. It is then sent to the `remove()` call as if it had been sent across the wire as a full `RemoveDataMessage(ProtectedStorageEntry)`.
  - Just the `Payload.ownerPubKey` can likely be used in both the Mailbox and Non-Mailbox cases. We should talk through it more to validate that assertion.
  - Using the `remove()` path ensures all the sequence number checks, remove-before-add handling, and other validation are performed equivalently between this delta-based remove and a full remove.
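A sketch of the consumer-side handling described above, using the hypothetical delta type from earlier. The map, the `ProtectedStorageEntry` constructor shape, and `remove()` are stand-ins for the real P2PDataStorage internals, not actual Bisq APIs:

```java
import java.util.HexFormat;
import java.util.Map;

// Hypothetical consumer-side handling of a RemoveDelta (see assumptions above).
class RemoveDeltaHandler {
    // Entries the consumer already holds, keyed by hex-encoded payload hash.
    private final Map<String, ProtectedStorageEntry> knownEntries;

    RemoveDeltaHandler(Map<String, ProtectedStorageEntry> knownEntries) {
        this.knownEntries = knownEntries;
    }

    void onRemoveDelta(ProtectedStorageEntryDelta delta) {
        String key = HexFormat.of().formatHex(delta.getPayloadHash());
        ProtectedStorageEntry existing = knownEntries.get(key);

        // Unknown Payload: discard the delta. Without the Payload we
        // cannot validate the signature.
        if (existing == null)
            return;

        // Rebuild a full Entry from the stored Payload plus the delta's metadata,
        // then run it through the normal remove() path so sequence number checks,
        // remove-before-add handling, and signature validation behave exactly as
        // they would for a full RemoveDataMessage received off the wire.
        ProtectedStorageEntry reconstructed = new ProtectedStorageEntry(
                existing.getProtectedStoragePayload(),
                delta.getSequenceNumber(),
                delta.getSignature(),
                delta.getCreationTimeStamp());
        remove(reconstructed);
    }

    private void remove(ProtectedStorageEntry entry) {
        // Stand-in for the existing P2PDataStorage remove() path; omitted here.
    }
}
```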
UpdateDelta

- If an `UpdateDelta` is received for a `ProtectedStorageEntry` that does not exist, it is discarded. Without the `Payload` we can't validate the signature.
- If an `UpdateDelta` is received for a `ProtectedStorageEntry` that does exist:
  - A new `ProtectedStorageEntry` is constructed from the existing `protectedStoragePayload` and `Payload.ownerPubKey` and the new metadata provided in the `ProtectedStorageEntryDelta`. It is then sent to the `addProtectedStorageEntry()` call as if it had originally been sent across the wire as a full `AddDataMessage(ProtectedStorageEntry)`.
  - Using the `addProtectedStorageEntry()` path ensures all sequence number checks and other validation are performed equivalently between this delta-based add and a full add.

Known Issues
Malicious Nodes
One of the goals of this design was to not put too much trust in seed nodes. But after looking through the current implementation, I don't believe this is possible without a major update to the P2P messages.
The main issue is that there is no signature verification of the message type, only of hash(seqNr, payload). This means any malicious node can take the contents of a valid `AddDataMessage`, put them inside a `RemoveDataMessage`, and it will be accepted by peer nodes as a `remove()`.

We just rely on the fact that the original `AddDataMessage` with the same sequence number will propagate faster than the malicious `RemoveDataMessage`.

With this current constraint, there is also no way for a consumer node to validate that a `RemoveDelta` sent from a seed node was originally a fully-formed `RemoveDataMessage` sent to the network by the original owner. So any of these designs must be aware that a malicious actor can cause issues regardless.

A potential fix for this is to update the Entry signatures to include the message type. This would remove the rebroadcast attack vector by giving nodes a way to validate the operation type as well, so they wouldn't remove data that wasn't intentionally deleted by the owner.
createdTimeStamp not verifiable

Similar to the above issue, it isn't possible for a consumer to verify that the `createdTimeStamp` in an `UpdateDelta` message is valid. This makes this path pretty uninteresting until the other Entry fields are signed or there is some common timekeeping mechanism for the network that doesn't rely on the system clock. The only path that would use this feature right now is the RefreshTTL path, and that relies on the current system time anyway.

Development Plan
Fix #1a (RemoveData when node is offline) [~4wks]
There are quite a few ideas here. Some are good long term; others are useful short term. The goal isn't to fix everything at once, but to make progress while keeping it maintainable long term. I think the minimum version of this just implements `RemoveDelta` and uses it only in the `GetData` sequence. This helps with a few of the edge cases and allows something to be released sooner that fixes a big issue. If we like the idea of using `UpdateDelta`, it may make sense to incorporate that into the initial object design.

This path would fix the current issue with Compensation Requests and is a possible long-term solution for other `PersistableProtectedStoragePayload` types.

I would recommend the following development plan:
Phase 1 (Refactor GetData Path) [1wk] COMPLETE
bisq-network/bisq#3667
I would like to push more of the knowledge about creating `GetData` messages and processing `GetDataResponse` messages down into `P2PDataStorage`. This will allow us to test the API for the P2PDataStore that constructs and processes the objects in this GetData path, so when we change it we can verify it works.

I'm also hopeful that we would be able to write the interop tests as P2PDataStorage unit tests, which would be a big win.
This would get all the plumbing and test fixtures in good shape so all changes will be easily testable.
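As a sketch of what those unit tests might look like once the GetData logic lives in `P2PDataStorage`. The method `processGetDataResponse` is named in this proposal, but `TestState` and the helper/builder methods here are assumed fixtures, not the actual test code:

```java
import org.junit.Assert;
import org.junit.Test;

public class P2PDataStorageGetDataTest {
    // Assumed fixture wrapping a P2PDataStorage instance with test doubles.
    private final TestState testState = new TestState();

    @Test
    public void processGetDataResponse_removeDeltaDeletesStaleEntry() {
        // Consumer knows an entry that the provider has since seen removed.
        ProtectedStorageEntry stale = testState.addEntryToLocalStore();

        // Provider's response carries no new entries, only a RemoveDelta for
        // the stale one (placeholder builder for the Phase 2 message shape).
        GetDataResponse response = testState.responseWithRemoveDelta(stale);

        testState.mockedStorage.processGetDataResponse(response);

        // The delta must route through the normal remove() path and delete the entry.
        Assert.assertFalse(testState.mockedStorage.contains(stale));
    }
}
```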
Phase 2 (Update new APIs (seen above) that utilize `Delta` objects) [2wk]

This will validate the design and ensure any strange test cases are covered. If it doesn't work, it will be obvious with just a few days of lost time.
- `RemoveDelta` requirements
- `private void applyDeltas(...)` [Consumer Side]
- `private void createDeltas(...)` [Provider Side]
- `RemoveDelta` created at deletion time for future consumers
- `P2PDataStorage::generateGetData()` and `P2PDataStorage::processGetDataResponse()` tests updated with new test cases, including an old GetDataResponse (older seed node) and an old GetData (nulled fields from an older Bisq node)

Phase 3 (Manual Testing) [2-3 days]
This involves developing a test plan covering combinations of old & new seed nodes and Bisq clients.
Phase 4 (Rollout) [whenever the next release happens]
Do we have dedicated seed nodes on testnet for beta testing? We could potentially start this during Phase 3.
With all the previous testing, everything should work regardless of the provider's version or the consumer's version. So, we can choose to either roll out the seed node updates first or just do a normal release.
Future Work
Here are some high-level ideas for next steps.
Remove 1b (RemoveData between PreliminaryGetDataRequest & GetUpdatedDataRequest) failure
This would involve synchronizing the HS publishing (more importantly, the first time we can receive Broadcast messages) with the GetData path to ensure that we can receive BroadcastMessages before we send a `GetUpdatedDataRequest`. This will ensure we see any `RemoveData` messages that come in that would otherwise have been lost.

Remove 2 (Lost Metadata Updates)
This is more about closing a potential issue later on, if any `PersistentPayload` objects start caring about the `Entry.createdTimeStamp`. For now, it may be fine that consumer nodes have older seqNrs and createdTimestamps.

Optimize RemoveDataMessage
By replacing `RemoveDataMessage` with `RemoveDelta` we would save about 95% of the bandwidth for those messages. They are infrequent, so it may not be a priority.

Optimize RefreshTTL
This message currently uses 32 bytes that aren't needed. Replacing it with an `UpdateDelta` would save some bandwidth.