Prolly Trees under Waku synchronization process. #71
Comments
Thanks!
Could you elaborate more on this? Does this refer to some kind of grouping, e.g. grouping leaves (messages) by hour?
I haven't thought this through, but these properties of waku store might be relevant:
Just in case these properties help.
A bit misleading: Prolly trees are always ordered collections (key-only or key-value); if not, it's not a Prolly tree. As for the inserting, I'm not sure what you mean.
I'm not sure what you mean here either; Prolly trees were designed specifically for this purpose.
It would be much simpler to have 2 trees: one with key=timestamp, value=hashes, and one with key=hash, value=none.
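A minimal sketch of what this two-tree layout might look like (the in-memory model and names are illustrative assumptions, not the actual implementation):

```python
from typing import Dict, List, Set

# Illustrative in-memory model of the two proposed key spaces,
# before either is chunked into Prolly tree leaves.

# Tree 1: key = timestamp (ns), value = message hashes seen at that time.
time_index: Dict[int, List[bytes]] = {}

# Tree 2: key = message hash, value = none (effectively a set of hashes).
hash_set: Set[bytes] = set()

def insert(timestamp_ns: int, msg_hash: bytes) -> None:
    """Record a message in both indexes."""
    time_index.setdefault(timestamp_ns, []).append(msg_hash)
    hash_set.add(msg_hash)
```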
💯 Binary search is always used since keys are ordered.
No need IMO; Prolly trees are plenty fast.
Thank you for this summary. I have a couple of questions:
Does this imply we need some natural sorting order for message hashes? Even though all messages have timestamps, we do not generate epoch-sortable hashes. Perhaps we should, or perhaps we need to include the timestamp as a significant part of the key? Or am I missing something?
What is the tradeoff here? Do inserts become more expensive? Does tree depth become more unpredictable? Not for this post:
Interesting, although I may not quite get this idea yet. Do you mean that you have a tree of timestamp:hash_of_all_message_hashes_with_this_timestamp? Why wouldn't a tree built with key=time_sortable_hash, value=none be more effective?
One tree would represent a table with keys of type int representing message timestamps; the values would be the hashes of messages. The other tree would be a set of all the message hashes. 🤔 May not be that useful since range queries would be weird.
Right, so I don't think I see the use of the second tree here. As long as the number of hashes per timestamp remains relatively low (we expect the number of messages per second to be < 100), it may be okay-ish to just compare the hashes on each timestamp sequentially. However, this tree would indeed have modifiable values (new messages arrive for previous timestamps), conflicts, etc.
Which again seems to me to imply some natural sorting order of hashes based on encoded timestamp. Still not sure I see the advantage of this approach vs. some kind of key=(timestamp, hash), value=null
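A small sketch of this composite-key idea, assuming key = (timestamp, hash) with no value; tuple keys keep entries unique and time-ordered (helper names are hypothetical):

```python
from bisect import bisect_left, insort
from typing import List, Tuple

# Composite keys (timestamp_ns, msg_hash) compare lexicographically, so the
# collection is ordered by time first and by hash second, and keys stay unique
# even when many messages share a timestamp.
keys: List[Tuple[int, bytes]] = []

def insert(timestamp_ns: int, msg_hash: bytes) -> None:
    insort(keys, (timestamp_ns, msg_hash))  # keep the key list sorted

def range_query(start_ns: int, end_ns: int) -> List[Tuple[int, bytes]]:
    """All entries with start_ns <= timestamp <= end_ns."""
    lo = bisect_left(keys, (start_ns,))      # (t,) sorts before any (t, hash)
    hi = bisect_left(keys, (end_ns + 1,))
    return keys[lo:hi]
```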
Timestamps should be unique; don't we use nanosecond precision? We can store multiple hashes per key in the tree, but it's annoying.
Well, we have nanosecond-precision timestamp fields, but we definitely suggest timestamps tick on coarse boundaries (by design), otherwise timing attacks become pretty trivial. We should expect timestamps around 500ms boundaries.
Sure. Leaf nodes are actual chunks of relevant data grouped by some bucketing mechanism, i.e. a chunk is formed when the rolling hash of the preceding entries falls into a certain category (e.g. the aggregated hash is less than some threshold value). So the grouping works like this; it is not based on timeline buckets (although those would be great).
Indeed, but just documenting it here for broader understanding.
Yes, in the ideal case there will be no more insertions (except sync replies) if we construct the Prolly tree on entries older than 20 seconds (the limit for updated messages).
Oh yes, this statement needs more clarity; I will amend it. But here I mean the flow of incoming events/messages: random means the flow of keys can be out of order at creation time but eventually ordered. In our case, we take entries older than 20 seconds (the outdated-message limit), so the order will be predefined (except for the missing entries filled in by the Sync process), i.e. insert from the right.
I am talking about a case where the leaf entries are just messageHashes, with no key-value pair hashed together. In that case, we get a level-0 chunk/bucket that can hold 5 leaf nodes or 50,000 (based on the parameter/input set). To avoid resending the whole chunk/bucket, we can find the first messageHash that is causing the upper-level mismatch, but finding that starting range might take time.
But I think there is an assumption here that a single timestamp can have multiple messages, so I don't know how this timestamp-based key-value store will work. Imagine a user says "sync me between timestamps X and Y": to get X we need to hash it with its value, but the values can be different, so there is a collision.
@jm-clius @SionoiS So basically I am trying to avoid having two Prolly trees. If, using some clever trick, we can track down the first difference in messageHash at a level-0 leaf chunk/bucket, then answering range queries can easily be handled at the store/archive level itself. For example, given a root hash and the range "sync me between timestamp X and timestamp Y", then even in the "one timestamp, multiple hashes" case we can get the messageHash of the first occurrence of timestamp X and the last occurrence of timestamp Y. Syncing between those might give us an optimal result.
For a read you don't need to hash anything; hashing is only needed on inserts/deletes to determine "chunk" boundaries. If we have multiple values per key, Sync is less efficient because you lose granularity.
Had a good discussion with @ABresting. We probably need to allow multiple values per key in the case of a time index, but we lose sync efficiency as time precision decreases. Also, we could limit the types of queries by not building every possible index, OR do the opposite and build every index and allow arbitrary queries. I'm not sure of the advantages of either. Lastly, the chunking algorithm could be based on rolling hashes but can be changed later; the structure of the tree is agnostic to this.
I just had an idea. If we need multiple values per key, then we could sort the values too. That way, if we choose a tree shape with big leaves (chunks), searching a chunk is more efficient (this only works if values can be sorted) and deterministic.
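A rough sketch of the "sorted values inside a multi-value key" idea; the leaf layout and names are assumptions for illustration:

```python
from bisect import bisect_left, insort
from typing import Dict, List

# One leaf (chunk): key = timestamp, value = sorted list of message hashes.
# Keeping the values sorted makes membership checks inside a big leaf
# O(log n) via binary search, and the serialized leaf stays deterministic.
leaf: Dict[int, List[bytes]] = {}

def insert(timestamp_ns: int, msg_hash: bytes) -> None:
    insort(leaf.setdefault(timestamp_ns, []), msg_hash)

def contains(timestamp_ns: int, msg_hash: bytes) -> bool:
    hashes = leaf.get(timestamp_ns, [])
    i = bisect_left(hashes, msg_hash)
    return i < len(hashes) and hashes[i] == msg_hash
```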
So a B-Tree inside the multi-value key?
Prolly Trees
The Prolly Tree, an amalgamation of Merkle and B-Tree architectures, is designed to streamline data searching and classification. This structure is pivotal in swiftly discerning discrepancies between two data stores, which makes it particularly advantageous for synchronization applications. A Prolly Tree works by hashing input data (mostly key-value pairs) and generating a tree-like structure. It ensures that hash contents conforming to a specific pattern are clustered within the same subtree branch. The tree accommodates adjustable branching, controlled through a quotient parameter (Q) applied over the span of possible hash values; Q is used to calculate a threshold value. For example, if the generated hash values are 8 bytes in size, the hash space has 2^64 possible values and 2^Q acts as the chunking factor.
Threshold = Hash_Space_Size / Chunking_Factor
For example, if the hash value space has size 2^32 and the chunking factor is 2^8 (i.e. Q = 8), then the threshold is 2^24, i.e. 16,777,216. Any new hash value below this threshold triggers the formation of a new chunk; subsequent values that do not fall below the threshold become part of the same chunk.
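A short sketch of this boundary rule, assuming a 32-bit hash space and Q = 8 (the hash function used here is a stand-in, not the actual rolling hash):

```python
import hashlib

HASH_SPACE_BITS = 32                       # hash values live in [0, 2^32)
Q = 8                                      # quotient parameter, chunking factor 2^Q
THRESHOLD = 2 ** (HASH_SPACE_BITS - Q)     # 2^24 = 16_777_216

def entry_hash(entry: bytes) -> int:
    """Stand-in 32-bit hash of an entry (any uniform hash would do here)."""
    return int.from_bytes(hashlib.sha256(entry).digest()[:4], "big")

def starts_new_chunk(entry: bytes) -> bool:
    # A hash below the threshold marks a chunk boundary; on average one entry
    # in every 2^Q does, so the expected chunk size (fanout) is about 2^Q.
    return entry_hash(entry) < THRESHOLD
```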
A Prolly Tree can be constructed through various methods, including insert-left, insert-right, and insert-random, but the keys will eventually be ordered.
When constructing a Prolly Tree, the process of comparing diffs between two nodes can be depicted as follows:
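A minimal sketch of how such a diff between two stores can prune shared subtrees, assuming content-addressed nodes (the Node model and names are illustrative, not the actual implementation):

```python
from typing import Dict, List

class Node:
    """Illustrative content-addressed node: leaves carry message hashes,
    internal nodes carry the hashes of their child nodes."""
    def __init__(self, is_leaf: bool, items: List[bytes]):
        self.is_leaf = is_leaf
        self.items = items

def diff(a: Dict[bytes, Node], b: Dict[bytes, Node], root_a: bytes) -> List[bytes]:
    """Walk A's tree and collect leaf entries B is likely missing, pruning every
    subtree whose node hash B already stores (identical hash => identical content)."""
    if root_a in b:
        return []                       # whole subtree already present on B
    node = a[root_a]
    if node.is_leaf:
        return list(node.items)         # candidate diffs; B may still hold some
                                        # of them in differently chunked leaves
    out: List[bytes] = []
    for child_hash in node.items:
        out.extend(diff(a, b, child_hash))
    return out
```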
Issues:
A notable limitation arises with linear or chronological data management in Prolly trees, especially when random edits are involved, such as modifications, insertions, or deletions. Locating specific data for these operations lacks a direct, expedited pathway, although this can be mitigated to some degree in the following ways:
Timestamp Boundaries in Leaf Buckets: Implementing timestamp boundaries within the leaf chunk buckets, which hold ordered messageHashes, enhances navigational efficiency. This structure enables more rapid access to the pertinent bucket, leveraging logarithmic complexity.
Optimization with Flat Binary Search: Further refinement can be achieved using a flat binary search. By utilizing timestamps and other chronological ordering criteria, this search method can expedite the identification and access of specific data points within the tree.
Employing Bloom Filters for Efficient Filtering: Utilizing Bloom filters presents another expedient method for identifying missing entries within the Prolly Tree. While inherently probabilistic, Bloom filters can significantly streamline the process of locating missing entries. Their probabilistic nature implies a trade-off between accuracy and efficiency, but they can provide substantial benefits in swiftly narrowing down the search space for potential missing or altered data points in the tree structure (a minimal sketch follows below).
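A minimal Bloom filter sketch to illustrate the idea, built on SHA-256 for simplicity (the parameters and names are illustrative, not a tuned design):

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: insert message hashes, then test membership.
    False positives are possible; false negatives are not."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: bytes):
        # Derive several bit positions from independent salted hashes.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

# Example: a node advertises a filter of its message hashes; a peer checks
# which of its own hashes are definitely missing on the other side.
local_filter = BloomFilter()
local_filter.add(b"msg-hash-1")
assert local_filter.might_contain(b"msg-hash-1")
```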