Add delta handling to the node synchronization path to properly remove data #140

Closed
julianknutsen opened this issue Nov 17, 2019 · 8 comments

Comments


julianknutsen commented Nov 17, 2019

This proposal suggests a design and implementation plan to fix bisq-network/bisq#3610. A bug was identified where nodes would keep stale data locally if it was removed while they were offline, causing deleted Proposals to reappear in the BlindVote phase.

Design & Background
Development Plan
Future Work

Progress

Phase 1: Complete
Phase 2: On Track

Proposal

TL;DR Nodes are informed of deletes that occurred while they were offline so they can synchronize their internal state during the initialization sequence with a seednode.

Before ProtectedStorageEntrys existed, the PersistableNetworkPayload objects were append-only and immutable. This meant there were never issues with interleaved remove operations or with GetData updates while nodes were offline. It was enough to ask for all of the new objects when a node came online, and the set of Payloads would sync. But with the introduction of ProtectedStorageEntry objects, which can be removed and updated, there is a window of risk where nodes will not see operations that occurred while they were offline.

The current initialization path uses a sequence of messages between an initializing node (consumer) and a seed node (provider). This sequence is limited to retrieving only ProtectedStorageEntrys that exist on the provider node but not on the consumer node.

Current Synchronization Messages

message PreliminaryGetDataRequest {
    int32 nonce = 1;
    repeated bytes excluded_keys = 2;
    repeated int32 supported_capabilities = 3;
}

message GetDataResponse {
    int32 request_nonce = 1;
    bool is_get_updated_data_response = 2;
    repeated StorageEntryWrapper data_set = 3;
    repeated int32 supported_capabilities = 4;
    repeated PersistableNetworkPayload persistable_network_payload_items = 5;
}

message GetUpdatedDataRequest {
    NodeAddress sender_node_address = 1;
    int32 nonce = 2;
    repeated bytes excluded_keys = 3;
}
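The excluded_keys handshake above can be sketched as follows. This is an illustrative toy, not Bisq's real API: `ExcludedKeysSketch`, `entriesToSend`, and the use of `String` hashes (rather than `byte[]`) are hypothetical stand-ins. It shows how the provider filters its data set against the keys the consumer already holds, which is why removed entries the consumer still holds are never corrected:

```java
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class ExcludedKeysSketch {

    // Provider side: return only the entries whose hash the consumer did NOT
    // list in excluded_keys, i.e. only NEW objects. Nothing here can tell the
    // consumer to delete anything it already has.
    public static Set<String> entriesToSend(Map<String, String> providerDataByHash,
                                            Set<String> excludedKeys) {
        return providerDataByHash.keySet().stream()
                .filter(hash -> !excludedKeys.contains(hash))
                .collect(Collectors.toSet());
    }
}
```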

Specific Failure Cases

1. Lost Remove Messages

The current message sequence focuses only on adding missing data, not on synchronizing state between the consumer and provider. This causes issues in two cases:

a. RemoveData when node is offline

ProtectedStoragePayload objects implementing the PersistablePayload interface load from disk early in startup. This means the payload hashes for these objects will be sent in the initial PreliminaryGetDataRequest message. Even if the provider node doesn't recognize a hash, it only sends NEW objects, so the consumer node will never delete the should-be-deleted ProtectedStorageEntry.

This is the current bug with Compensation Requests (bisq-network/bisq#3610).

b. RemoveData between PreliminaryGetDataRequest & GetUpdatedDataRequest

Until the hidden service (HS) is published, the P2PDataStorage layer does not receive broadcast messages. But the timing of this operation is not synchronized with the PreliminaryGetDataRequest -> GetDataResponse -> GetUpdatedDataRequest sequence.

This means that if a RemoveData message was broadcast BEFORE the HS was published, but AFTER the GetDataResponse was sent to the consumer, the consumer would add the should-be-removed Entry and would never know the Entry was removed. It would be up to the TTL to remove it eventually, but it may have UI artifacts or state updates based on this data that may be invalid.

2. Lost Metadata Updates

For ProtectedStorageEntrys, since only the payload hash is sent to the provider, the consumer will not receive any updates to known payloads whose metadata has changed. Metadata here refers to the metadata section below.

    // metadata
    private byte[] signature;
    private long creationTimeStamp;
    private final int sequenceNumber;

    // immutable data
    private final ProtectedStoragePayload protectedStoragePayload;
    

So, if an update (read: addProtectedStorageEntry() for an existing Payload) was sent while a node was offline, the consumer node will not receive that update. I believe in the current system this isn't an issue, because there are no persistent ProtectedStoragePayload objects that depend on metadata (OfferPayload is non-persistent). But if the future holds more payload types that depend on metadata updates, it may make sense to fix this gap.

Design Goals

  1. Remove risk window for lost removes and metadata updates
  2. All ProtectedStorageEntry updates must be validated (no implied trust of SeedNodes). This isn't fully possible; I will explain more below. (See Known Issues section)
  3. SeedNode (old) <--> Bisq (new) should work
  4. Bisq (old) <--> SeedNode (new) should work

ProtectedStorageEntry Delta Design

This section will focus on a design for the P2PDataStorage system: ProtectedStorageEntryDelta. I'll explain the basic concepts and implementation and then tie it into how it can solve the initialization problems described above.

A ProtectedStorageEntryDelta is the minimum set of data required to perform an update or delete operation on a ProtectedStorageEntry. Think of them as a lightweight ProtectedStorageEntry that is only useful for objects that exist. They leverage the fact that Payload objects already exist on the peer, so they do not need to be retransmitted.

enum DeltaType {
    ADD = 1,
    UPDATE = 2,
    REMOVE = 3,
}

public class ProtectedStorageEntryDelta {
    private final int deltaType;                 // 4 bytes 
    private final byte[] signature;              // 46 bytes
    private final byte[] hashOfPayload;          // 20 bytes
    private final int sequenceNumber;            // 4 bytes
    private long timeStamp;                      // 8 bytes
} // 82 Bytes vs 124 Byte RefreshOffer and 1700 Byte RemoveData

Benefits

Currently, the system relies on a full ProtectedStorageEntry message, RemoveDataMessage (~1700 bytes), to remove data, and on a smaller RefreshOfferMessage (~123 bytes) for updates. A new lightweight ProtectedStorageEntryDelta object can provide the same functionality while reducing the on-wire size.

  • During the GetData sequence, the provider can send RemoveDelta objects to the consumer to inform it of RemoveDataMessages that happened while the node was offline. This is cheaper in both in-memory storage and bandwidth than storing the full RemoveDataMessage and sending it.

  • During the GetData sequence, the provider could send UpdateDelta objects to the consumer to inform it of AddDataMessages that happened while the node was offline. This is cheaper in both in-memory storage and bandwidth than storing the last full AddDataMessage and sending it.

    It is important to note that I don't think there are any interesting consumers of this right now. The only Payload that cares about the metadata is the OfferPayload, and since it is non-persistable it will never be sent from the consumer to the provider in the PreliminaryGetDataRequest.

  • Future development work could replace RefreshOfferMessage with UpdateDelta reducing on-wire size by ~20%.

  • Future development work could replace RemoveDataMessage with RemoveDelta reducing on-wire size by ~95%.

  • Future development could optimize addProtectedStorageEntry to use an UpdateDelta as an optimization if the Payload already exists.
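The provider-side bookkeeping implied by the bullets above might look like the following sketch. All names here are hypothetical (the real implementation would live in P2PDataStorage and use `byte[]` hashes): a RemoveDelta is recorded at deletion time and later handed out only for keys the consumer claims to hold.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class RemoveDeltaStore {
    public static class RemoveDelta {
        public final String hashOfPayload;
        public final int sequenceNumber;

        public RemoveDelta(String hashOfPayload, int sequenceNumber) {
            this.hashOfPayload = hashOfPayload;
            this.sequenceNumber = sequenceNumber;
        }
    }

    private final Map<String, String> liveEntriesByHash = new HashMap<>();
    private final Map<String, RemoveDelta> removeDeltasByHash = new HashMap<>();

    public void add(String hash, String entry) {
        liveEntriesByHash.put(hash, entry);
        removeDeltasByHash.remove(hash); // a re-add supersedes any stored delta
    }

    // At deletion time, keep a lightweight delta instead of the full
    // ~1700-byte RemoveDataMessage, for the benefit of future consumers.
    public void remove(String hash, int sequenceNumber) {
        if (liveEntriesByHash.remove(hash) != null)
            removeDeltasByHash.put(hash, new RemoveDelta(hash, sequenceNumber));
    }

    // Deltas to include in a GetDataResponse, restricted to the keys the
    // consumer reported in excluded_keys (it only needs removes for entries
    // it still holds).
    public List<RemoveDelta> deltasFor(Set<String> consumerKeys) {
        List<RemoveDelta> result = new ArrayList<>();
        for (String key : consumerKeys) {
            RemoveDelta delta = removeDeltasByHash.get(key);
            if (delta != null)
                result.add(delta);
        }
        return result;
    }
}
```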

Implementation Behavior

RemoveDelta

  • If a RemoveDelta is received for a ProtectedStorageEntry that does not exist, it is discarded.
    • Without an existing Payload we can't validate the signature.
  • If a RemoveDelta is received for a ProtectedStorageEntry that does exist:
    • Validation is performed on the post-delta object using checks similar to RefreshTTL: a ProtectedStorageEntry is created from the existing protectedStoragePayload and the new metadata provided in the ProtectedStorageEntryDelta. It is then passed to the remove() call as if it had arrived across the wire as a full RemoveDataMessage(ProtectedStorageEntry).
      • This is subtle, but the Payload.ownerPubKey can be used in both the Mailbox and Non-Mailbox case. This assertion should be talked through further to validate it.
    • Using the common remove() path ensures that sequence number checks, remove-before-add, and other validation are performed equivalently between this delta-based remove and a full remove.

UpdateDelta

  • If an UpdateDelta is received for a ProtectedStorageEntry that does not exist, it is discarded.
    • Without an existing Payload we can't validate the signature.
  • If an UpdateDelta is received for a ProtectedStorageEntry that does exist:
    • Validation is performed on the post-delta object using checks similar to RefreshTTL: a new ProtectedStorageEntry is constructed from the existing protectedStoragePayload and Payload.ownerPubKey plus the new metadata provided in the ProtectedStorageEntryDelta. It is then passed to the addProtectedStorageEntry() call as if it had originally arrived across the wire as a full AddDataMessage(ProtectedStorageEntry).
    • Using the common addProtectedStorageEntry() path ensures that sequence number checks and other validation are performed equivalently between this delta-based add and a full add.
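The discard/apply rules for both delta types can be sketched in one place. This is a toy model under stated assumptions (`DeltaApplier`, its string payloads, and the local maps are hypothetical; signature validation and entry reconstruction are elided), but it shows the two checks described above: deltas for unknown payloads are discarded, and stale sequence numbers are rejected just as on the full-message path.

```java
import java.util.HashMap;
import java.util.Map;

public class DeltaApplier {
    public enum DeltaType { UPDATE, REMOVE }

    public static class Delta {
        final DeltaType type;
        final String hashOfPayload;
        final int sequenceNumber;

        public Delta(DeltaType type, String hashOfPayload, int sequenceNumber) {
            this.type = type;
            this.hashOfPayload = hashOfPayload;
            this.sequenceNumber = sequenceNumber;
        }
    }

    private final Map<String, String> payloadsByHash = new HashMap<>();
    private final Map<String, Integer> sequenceNumberByHash = new HashMap<>();

    public void addPayload(String hash, String payload, int sequenceNumber) {
        payloadsByHash.put(hash, payload);
        sequenceNumberByHash.put(hash, sequenceNumber);
    }

    public boolean hasPayload(String hash) {
        return payloadsByHash.containsKey(hash);
    }

    // Returns true if the delta was applied, false if it was discarded.
    public boolean apply(Delta delta) {
        if (!payloadsByHash.containsKey(delta.hashOfPayload))
            return false; // unknown payload: signature can't be validated, discard

        int knownSeqNr = sequenceNumberByHash.getOrDefault(delta.hashOfPayload, 0);
        if (delta.sequenceNumber <= knownSeqNr)
            return false; // stale delta: same seqNr check the full-message path makes

        // In the real design, a full ProtectedStorageEntry would be rebuilt from
        // the existing payload plus the delta's metadata and routed through
        // remove() or addProtectedStorageEntry(); here we just mutate local maps.
        if (delta.type == DeltaType.REMOVE)
            payloadsByHash.remove(delta.hashOfPayload);
        sequenceNumberByHash.put(delta.hashOfPayload, delta.sequenceNumber);
        return true;
    }
}
```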

Known Issues

Malicious Nodes

One of the goals of this design was to not put too much trust in seed nodes. But, after looking through the current implementation I don't believe this is possible without a major update to the P2P messages.

The main issue is that there is no signature verification of the message type, only of hash(seqNr, payload). This means any malicious node can take the contents of a valid AddDataMessage, put them inside a RemoveDataMessage, and it will be accepted by peer nodes as a remove().

We just rely on the fact that the original AddDataMessage with the same sequence number will propagate faster than the malicious RemoveDataMessage.

With this constraint, there is also no way for a consumer node to validate that a RemoveDelta sent from a seed node originated as a fully-formed RemoveDataMessage sent to the network by the original owner. So any design here must account for the fact that a malicious actor can cause these issues regardless.

A potential fix for this is to update the Entry signatures to include the message type. This would remove the rebroadcast attack vector by giving nodes a way to validate the operation type as well, so they wouldn't remove data that wasn't intentionally deleted by the owner.
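One way to realize that fix is to include the operation type in the signed bytes, so a signature produced for an add can no longer be replayed inside a remove message. A minimal sketch, assuming RSA keys and a hypothetical byte layout (the real Entry signature format would differ):

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.Signature;

public class SignedOperation {
    // Hypothetical layout: the operation type is part of the signed bytes, so
    // a signature made for an "ADD" fails verification as a "REMOVE".
    static byte[] signedBytes(String opType, int seqNr, String payloadHash) {
        return (opType + "|" + seqNr + "|" + payloadHash).getBytes(StandardCharsets.UTF_8);
    }

    public static KeyPair newKeyPair() {
        try {
            KeyPairGenerator gen = KeyPairGenerator.getInstance("RSA");
            gen.initialize(2048);
            return gen.generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new RuntimeException(e);
        }
    }

    public static byte[] sign(PrivateKey key, String opType, int seqNr, String payloadHash) {
        try {
            Signature sig = Signature.getInstance("SHA256withRSA");
            sig.initSign(key);
            sig.update(signedBytes(opType, seqNr, payloadHash));
            return sig.sign();
        } catch (GeneralSecurityException e) {
            throw new RuntimeException(e);
        }
    }

    public static boolean verify(PublicKey key, byte[] signature,
                                 String opType, int seqNr, String payloadHash) {
        try {
            Signature sig = Signature.getInstance("SHA256withRSA");
            sig.initVerify(key);
            sig.update(signedBytes(opType, seqNr, payloadHash));
            return sig.verify(signature);
        } catch (GeneralSecurityException e) {
            throw new RuntimeException(e);
        }
    }
}
```

With this scheme, a malicious node that lifts a valid ADD signature into a RemoveDataMessage fails verification, because the signed operation type no longer matches.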

createdTimeStamp not verifiable

Similar to the above issue, it isn't possible for a consumer to verify that the createdTimeStamp in an UpdateDelta message is valid. This makes this path pretty uninteresting until the other Entry fields are signed or there is some common timekeeping mechanism for the network that doesn't rely on the system clock. The only path that would use this feature right now is the RefreshTTL path, and that relies on the current system time anyway.

Development Plan

Fix #1a (RemoveData when node is offline) [~4wks]

There are quite a few ideas here; some are good long term, others are useful short term. The goal isn't to fix everything at once, but to make progress while keeping the code maintainable long term. I think the minimum version just implements RemoveDelta and uses it only in the GetData sequence. This addresses a few of the edge cases and allows something to be released sooner that fixes a big issue. If we like the idea of using UpdateDelta, it may make sense to incorporate it into the initial object design.

This path would fix the current issue with Compensation Requests and is a possible long term solution for other PersistableProtectedStoragePayload types.

I would recommend the following development plan:

Phase 1 (Refactor GetData Path) [1wk] COMPLETE

bisq-network/bisq#3667

I would like to push more of the knowledge about creating GetData messages and processing GetDataResponse messages down into P2PDataStorage. This will allow us to test the P2PDataStorage API that constructs and processes the objects in this GetData path, so when we change it we can verify it works.

I'm also hopeful that we would be able to write the interop tests as P2PDataStorage unit tests which would be a big win.

This would get all the plumbing and test fixtures in good shape so all changes will be easily testable.

New APIs

class P2PDataStorage {
  public GetData generateGetData(boolean isPreliminaryDataRequest, ..., ...);
  public GetDataResponse generateGetDataResponse(GetData request, ..., ...);
  public void processGetDataResponse(GetDataResponse response);
}

Phase 2 (Update new APIs (seen above) that utilize Delta objects) [2wk]

This will validate the design and ensure any strange test cases are covered. If it doesn't work, that will be obvious early, with just a few days of lost time.

  • Updates to ProtectedMailboxStorageEntry to handle RemoveDelta requirements
  • Implementation and tests of private void applyDeltas(...) [Consumer Side]
  • Implementation and tests of private void createDeltas(...) [Provider Side]
    • Involves storing a RemoveDelta at deletion time for future consumers
  • P2PDataStorage::generateGetData() and P2PDataStorage::processGetDataResponse() tests updated with new test cases including an old GetDataResponse (older seed node) and old GetData (nulled fields from older Bisq node)

Phase 3 (Manual Testing) [2-3 days]

This involves developing a test plan covering combinations of old & new seed nodes and Bisq clients.

Phase 4 (Rollout) [whenever the next release happens]

Do we have dedicated seednodes on testnet for beta testing? We could potentially start this during Phase 3.

With all the previous testing, everything should work regardless of the provider's or the consumer's version. So we can choose either to roll out the seed node updates first or just do a normal release.

Future Work

Here are some high-level ideas for next steps.

  1. Remove 1b (RemoveData between PreliminaryGetDataRequest & GetUpdatedDataRequest) failure
    This would involve synchronizing HS publishing (more importantly, the first time we can receive broadcast messages) with the GetData path to ensure we can receive BroadcastMessages before we send a GetUpdatedDataRequest. This ensures we see any RemoveData messages that would otherwise have been lost.

  2. Remove 2 (Lost Metadata Updates)
    This is more about closing a potential issue later on if any PersistentPayload objects start caring about the Entry.createdTimeStamp. For now, it may be fine that consumer nodes have older seqNrs and createdTimestamps.

  3. Optimize RemoveDataMessage
    By replacing RemoveDataMessage with RemoveDelta we would save about 95% of the bandwidth for those messages. They are infrequent, so it may not be a priority.

  4. Optimize RefreshTTL
    This message currently uses 32 bytes that aren't needed. Replacing it with an UpdateDelta would save some bandwidth.

@julianknutsen julianknutsen changed the title ProtectedStorageEntry Deltas ProtectedStorageEntryDeltas Nov 17, 2019
@julianknutsen
Author

I'll be starting this project early next week at Phase 1 to make sure the tests are in good shape and the design passes some basic requirements before implementing RemoveDeltas as part of the GetData path.

@chimp1984

@julianknutsen
Thanks for the analysis and action plan. Could you give the proposal a title which reflects the intended action (e.g. "Add delta handling for ProtectedStorageEntry") and maybe a short summary section where it is clear what the proposal is about? See submission and review process

Info for other readers: @julianknutsen, @freimair, and I had a voice chat today where we discussed this proposal.

@julianknutsen julianknutsen changed the title ProtectedStorageEntryDeltas Add delta handling to the node synchronization path on startup to fix bisq-network/bisq#3610 Nov 17, 2019
@julianknutsen julianknutsen changed the title Add delta handling to the node synchronization path on startup to fix bisq-network/bisq#3610 Add delta handling to the node synchronization path to properly remove data Nov 17, 2019
@julianknutsen
Author

julianknutsen commented Nov 23, 2019

Phase 1 is complete as of bisq-network/bisq#3667

@julianknutsen
Author

Closing as incomplete. I will be unable to complete this work, but it may be useful for future devs that want to pick it up.

@objectorange

Should this be re-opened? I may be able to take a stab at it once I'm in the heap.

@julianknutsen
Author

Honestly, this doesn't feel that high priority given the other outstanding issues with Bisq. The only user-facing issue is duplicate compensation requests and that isn't very critical. If you have the expertise it is likely better to spend your time on something more important.

@objectorange

I just didn't want it to get lost.

@julianknutsen
Author

julianknutsen commented Jan 13, 2020

Then advocate for a real project tracker. Random GitHub issues aren't the right place.

It shouldn't get lost. The maintainers in charge of the pending work are aware of it and if it is important enough it will bubble to the top. There is just too much noise with everything that could be done ending up as a proposal issue.
