Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sparse Merkle Trees (SMTs) designs #1472

Closed
protolambda opened this issue Nov 8, 2019 · 17 comments
Closed

Sparse Merkle Trees (SMTs) designs #1472

protolambda opened this issue Nov 8, 2019 · 17 comments
Labels

Comments

@protolambda
Copy link
Contributor

protolambda commented Nov 8, 2019

SSZ for general purpose merkleization is lacking a few types, of which sparse merkle trees are probably the most different from what is already there.

It only passed by a few times in the specs:

  • lists merkleization discussion 1115
  • multi-proof queries 644
  • lists/vectors, and the desire for a future sparse vector variant 1160

However I do think it is necessary to have if SSZ is to be adopted more widely than the current spec state machines.

Options

There is not one way to implement this, we have many options:

Opinion

Personally I dislike breaking changes here, and feel like those solutions are too specialized, hard to optimize or simply too incompatible with other parts of the spec.

The full hashing is also something I like to avoid, 256 hashes is computationally unacceptable, even with proof encoding compression (something proof backings should be optimizing, not sparse merkle trees).

@protolambda
Copy link
Contributor Author

By all means, do tear the new idea apart, there are many options and discussion is welcome.

@protolambda
Copy link
Contributor Author

protolambda commented Nov 9, 2019

Another possible variant of the compact sparse merkle trees, to make it work somewhat with generalized indices, and define two missing sparse data structures:

We define a MerkleSet[T], MerkleMap[K, V] and generalized index annotations.

MerkleSet[T]: Take the hash-tree-root of each element, this will be the identity in the set.
The set is merkleized as a compact tree, with each element in the position determined by its hash-tree-root, but each element is wrapped with a right-hand zero to mark it as a leaf element. H(obj_T, 0x000...)

MerkleMap[K, V]: As a MerkleSet[T], where each element is positioned by just the hash-tree-root of K, but is merkleized as H(H(V, K), 0x000...) (incl the right-hand-zeroes as leaf marker).

Now, to access a set or map value with a stable generalized index we need the generalized index to know where to stop and look for a 0x000... right hand in the tree.

A generalized index could be like:

1aaaa...aaaabbbb...bbbbcccc...cccc
aaaa...aaaa: some regular generalized index path, anything, leading to the root of the set or map.
bbbb...bbbb: 32 bytes key embedded. (set value hash-tree-root, or map key hash-tree-root)
cccc...cccc: path within the value

And then add an annotation like "at bit index i a 32 byte key can be found, follow this path like normal, but stop and skip to the end of the 32 byte key once a right-hand zeroed node is encountered. Check if the key matches the HTR of the left-hand".
And to work with for the MerkleMap encoding, define a second type of annotation, exactly like the first, but with a different key check: "the 32 byte key must match the next right child node" (i.e. in the pair left from the zeroes).

Such annotation could be just an index, maybe 2 bytes with the first bit reserved to describe the map case (2**15 = 32768 max merkle depth).

And one modification could be made to consider the hash with zeroes (H(e, 0x000...)) as the identity, instead of the element itself. Making the content keys uniform.
The pro of not doing this would be that keys can be chosen (and made uniform manually with a hash if necessary), and more readable in the generalized index.

@protolambda
Copy link
Contributor Author

As I am implementing a merkle-tree backing and typed views/partials in Go I'm starting to be a fan of single-node mixins. I.e. A mix-in like the list length is not opt-out: it's always part of a proof about the data. You are forced into the right direction, enabled to check in every case, and deal with less edge cases.

Similarly the zeroes as stopper for the sparse tree work well, they are always part of the proof data.
Maybe the flexibility and super-compatibility (with regular ssz merkle trees) offered by the content-vector commitment separation is still worth it, but it will be much more difficult to enforce to use it properly.

@protolambda
Copy link
Contributor Author

Edited this out of the main-post. It is a complementary idea to the compact trees that has some efficiency problems (2x proof size of individual leaf), but offers some fun new possibilities. Think of it as a thought-experiment to complete the picture of the different compact merkle trees.


The idea of compact sparse merkle trees has been re-hashed (no pun intended 🤔) by many authors, but I think gets close to fitting our needs. (ignoring stable merkle depth for a moment here). Let's just explore another variant.

Compact tree affinity

One thing that is particularly striking is that prefixes just fit well with binary trees and our generalized indices:

  • 1, 100, 1000000, etc are all exclusive: cannot represent them as leafs in the same sparse compact tree (unless you define splits over depth differences too). Aligning it to the left is no problem. E.g. 1000000000 ... 0000 can be a key.
  • Always starts with a 1, we can special case something that starts with a 0 when we merkleize this type of info.
  • If we do not mess the with leaf nodes, every current SSZ merkle structure is a valid sparse merkle tree!
    • vector commitment issue: positions are ignored by the sparse merkle tree. But the SSZ structures embedded here would have a static shape anyway.

Vector commitment mixin

Now take a look at SSZ lists:

  • Positional information is mixed in at the root, not per leaf. It's just that lists are not very complex, but it could be anything. And it is convenient for proofs to have this basic vector commitment: "elements 0...N are here"
  • Only when the vector commitment changes, the mix-in changes.

Now this new idea is to do the same for sparse trees: separate the vector commitment, and call it the "sparse tree mixin".

And this vector commitment can be a compact sparse tree of keys, optimized exactly the same way.

And since the vector commitment may not change as much (well, in some cases it does, in others never at all), an extra hash for an update there wouldn't be as bad.
So in the vector commitment, we take the "1 extra hash" to separate leafs (the actual keys) from hashes. And this hash can be 000...0000 as positions all start with 1 and can't collide with the zeroes. Nor should a random hash.

Also note that the contents part would back the values and the vector part would back the keys of the data-structure. And membership commitment depends on what you are after: content inclusion can be proven with just the values part, identity inclusion can be proven with just the keys part.

And now that the vector commitment can be summarized into 32 bytes, and the contents commitment part is compatible with regular SSZ trees, we can have really interesting type-less but safe tree interactions:
E.g. a BeaconState can be explored as a sparse tree, by just mixing in (to the regular BeaconState root) a constant magic 32 bytes derived from its type definition to convert a regular root into that of a valid sparse tree.

Another special property is that moving a value by just a little (slight key alternation) may only change the keys commitment part.

Compactness vs Stable depth

The remaining problem here is that compactness on a sparse structure cannot be achieved without breaking the stable depth. A generalized index to an actual sparse tree leaf cannot be derived without knowing about the contents of the tree. On the bright side, the separation of the vector commitment makes this easy to learn and communicate.

The compactness gets us O(log(N)) node depth for N nodes with uniformly distributed tree keys.
This compactness applies to both the values as the keys part of the tree.
And the compactness still puts adjacent keys close together efficiently, if a sparse tree is used for non-uniform data it is perfectly fine. (bit bigger proof size, but not much more than a regular List would do, and fast computation)

Drawbacks

size: If the only use case is to both proof the key and value at the same time, and the tree changes a lot, it would be ~2x more data (separate key and value proofs) compared to regular compact trees that harden the leaf nodes with a position. The total amount of nodes may be comparable still, but the 2x is especially noticeable when proofing

@vbuterin
Copy link
Contributor

vbuterin commented Nov 13, 2019

Out of all the options I think I like leaf-level mixins the most (so to store v at position k, store H(k, v)). The main challenge (really, with any of these schemes except my modified hash function proposal) is that generalized indices would not work because the path for any given object would be dynamic. I would suggest we just bite the bullet and in objects that contain sets/dicts we just require a path to consist of a generalized index (that pretends every key is the full 256 bits) along with a second object that is a list of pairs (depth of dict root, depth of leaf relative to root). Note that this list would need to be dynamically generated and passed along with each proof.

From this list of pairs, you can translate the generalized index into a "modified generalized index" (basically, the bit positions corresponding to levels that were elided in the tree get snipped out of the generalized index), and generate a list of mixin checks ("check that the value at generalized index x is y").

Something like:

def transform_generalized_index(gindex: Bitlist, positions: Sequence[Tuple[int,int]]) -> Bitlist:
    o = []
    my_pos = 0
    for pos, length in positions:
        assert pos >= my_pos
        # my_pos:pos is everything in between the dicts, pos:pos+branch is the branch within the dict itself
        o.extend(gindex[my_pos:pos + length])
        # Keys have length 32, plus one bit for the mixin
        my_pos = pos + 33
        # Mixin is the left child of the leaf node, value is the right child
        o.append(1)
    return o

The function could also be augmented to return a list of generalized indices at the leaf of each dict along the path, along with the key; this would be fed into a Merkle branch or multiproof checker so it can verify that the key provided in the mixin equals to the expected key.

@protolambda
Copy link
Contributor Author

Instead of transforming the generalized indices, and then still having to pass the information to check the keys, we could also mark the "anchor points" with generalized indices:
Say, for some type there is a SMT with its root anchored at 0b10010011, then it would be sufficient to pass that to the multiproof checker. The checker can check if it is in this subtree, and if so, check that the 32 bytes after this prefix in the gindex of a SMT leaf match the key part of the SMT leaf. (Assuming we standardize on 32 byte keys, and hash-tree-root if the actual key is bigger).
And this data would be the same for every entry in the sparse merkle tree: no need to duplicate it for each leaf in that subtree under the SMT anchor. And if regular SMTs are not nested very deep into the type of the proof as whole, this SMT anchor gindex only costs a few bytes. Negligible compared to the size of a single leaf node.

@Tolsi
Copy link

Tolsi commented Nov 14, 2019

What do you think about AVL+ trees?

@protolambda
Copy link
Contributor Author

@Tolsi Here is my view, primarily focused on how compact trees compare. I probably missed a few things, but will do a proper full read when I have more time.

Observation 1: Use Tree-Balancing Operations that Stay on Path

If single-node branches are already reduced to just the node itself, and keys are random and uniform, I do not think rebalancing is helpful. It is more like another way to compress a set of leafs into a binary looking tree, just with more edge-cases an optimized merkleization implementation has to deal with. A compact tree of N leafs has an avg. leaf depth of O(log(N) which is fine. And no re-hashing of anything else than on the branch of a modified leaf.

Observation 2: Do Not Hash Internal Keys

  • we do not add the keys of internal nodes
  • the verifier does not need to know how the leaf was found

There do not have to be internal keys indeed; if you already key the leafs, the internal nodes just have to reach them indeed, and be able to verify the leaf corresponds to some key. And IMHO it is much better if it is intuitive and simple: walk the path of the key until you find the key. Rather than determine your path based on the entire contents of the tree. All of this may end up in a every-op-counts smartcontract. And although the verifier does not need to know where the leaf was found, being able to quickly locate and read/write a leaf in a proof is important.

Observation 3: Skip Lists are Just a Variant of Treaps

Another thing with SSZ is to keep optimized tree descriptions separate from merkleization: new use-cases and algorithms will keep emerging. Merkleization should just be fast, minimal and straightforward. The lookup tricks and tree description can be provided alongside the proof. A "backing" is a representation that implements the merkle proof tree traversal needs, and optimizes reads/writes/etc. as it likes. Tree-rotation is something too much (ssz trees are static and predictable, and we hope SMTs can be nearly-statically described), but skip lists etc. can be part of the "backing". E.g. you can optimize lookups without moving nodes perfectly fine. See tree-offsets described here: https://github.com/protolambda/eth-merkle-trees

Observation 4: Deterministic is Better.
However, since inserted keys may be influenced by the adversary, this method of generating randomness may give an attacker the ability to make the data structure very slow and the proofs very long

Yes, deterministic structures and values are better. A SSZ SMT should hash to the same root regardless of tree constructor or insertion order.

Regarding determinism in compact trees: I do think choosing keys is not so much of a problem: keys are already hashes of other data (with the exception of basic keys, see below). And because of the compacting property (a branch stops at a single node, no extra depth), the only real attack surface is to insert a large amount of keys, or lots of keys with the same prefix (which is exponentially hard the longer the prefix is, so only effectively adds a few hashes).

For MerkleSet[A] or MerkleMap[A, X] of A a basic type or Bytes32, the user can pick a MerkleMap[K, V], where K = Hash(A). Chosen consecutive basic keys are not that bad either: keys are little-endian and so the prefix should differ immediately for regular use of basic keys 0 ... N has a minimal amount of common prefixes. However, if you try to use e.g. a range of 2**240 ... 2**240 + N as your keys, then yes, the common prefix is 240 bits long and the proof gets very large. Hash your keys, or avoid such arbitrary range. We can try to optimize out this common prefix, but the user can handle this exceptional case themselves, and I do not like some type of extension-nodes as in Merkle Patricia Trees: breaking otherwise viable optimizations and dealing with them in every single proof is not worth it. I may reconsider this if we get numbers on possible attacks and just the random birthday-like chance of a tree leaf key sharing a long common prefix with another.

Observation 5: AVL Trees Outperform on the Most Relevant Parameters
the worst-case distance [for leafs when inserting many random keys] for AVL trees is 1.44 times the optimal

We should run some numbers on compact trees, but I don't expect this worst case to occur very often with uniform keys.

Observation 6: Proofs for Multiple Operations Can Be Compressed.

As with any multi-merkle-proof? Providing just the leaf nodes, and witness data that can't be constructed from lower witness-data and leaves is relatively easy. There is some (naive but readable) pseudo code here: multiproofs. Compression of the actual description of the tree is then a responsibility of the "backing", replaceable later.

Implementation and Evaluation

Impressed by the amount of analysis in the paper, but prefer simplicity and closer affinity with the general binary merkle spec. The MerkleSet and MerkleMap are just a discussion thing for now, before anything gets standardized proposals, simulations and review will be done.

@vbuterin
Copy link
Contributor

Say, for some type there is a SMT with its root anchored at 0b10010011, then it would be sufficient to pass that to the multiproof checker. The checker can check if it is in this subtree, and if so, check that the 32 bytes after this prefix in the gindex of a SMT leaf match the key part of the SMT leaf. (Assuming we standardize on 32 byte keys, and hash-tree-root if the actual key is bigger).

Ah, I was thinking of the case where the SMT is fully embedded into the SSZ suite, so you could have the SMT be an element of other structures, other structures be an element of the SMT, etc, with possibly multiple SMTs in one path. Yes, if you have a gindex for just an SMT leaf from its own root then that is sufficient.

@protolambda
Copy link
Contributor Author

Right, for each SMT anchor (start of the tree, i.e. the SMT root) in an SSZ structure you would need to declare it's an SMT to get the special traversal behavior. But nothing extra depending on SMT contents. And it's just the anchor to be described if we standardize on 32 byte SMT keys. (Take hash-tree-root of key).

@vbuterin
Copy link
Contributor

You would also need to specify the length of the branches. Otherwise if you just provide a branch it's not clear what part is supposed to go down to the leaf and what part is doing something further after that leaf. One possible case to think about would be a direct 2-level SMT (ie. an SMT of SMTs); you need to declare where the first one ends and the second begins.

@protolambda
Copy link
Contributor Author

Well, you don't need an end and start I think, just the anchors of each SMT: if you have the start (anchor) info of both SMTs, and you traverse the first one, you can skip 256 bits of gindex after this first anchor. Navigate the SMT till you hit the bottom (zero right hand, or 256 gindex bits all consumed), check the key (against the skipped 256 bits), and then continue reading the remaining part of the gindex to navigate deeper. Which could be into the next deeper SMT, anchored at higher SMT anchor ++ its SMT 256 bit position ++ optional in-between SMTs navigation. (++ is generalized index concat). It's not pretty, but still minimal enough imo.

@vbuterin
Copy link
Contributor

vbuterin commented Nov 22, 2019

Navigate the SMT till you hit the bottom (zero right hand

Ah, are we assuming that the leaf in an SMT is (0, real_leaf)? Then where would the full key go?

@protolambda
Copy link
Contributor Author

protolambda commented Nov 22, 2019

I was thinking (and described it in MerkleMap/MerkleSet) for a map the bottom would be like H(H(htr(V), htr(K)), 0). First you break traversal on the 0 (as full 0 is not a natural hash of any kv pair). Then you check the key one level deeper (but to the left). And for a set you can say that H(V) is the position to check, and you don't need a K.

@JustinDrake
Copy link
Contributor

@protolambda What's the status on this issue? Do you still want to push it forward?

@protolambda
Copy link
Contributor Author

@JustinDrake Yes, but it simply is not required right now for phase0, and not impossible to implement in addition to what we currently have. So I'm alright with keeping this open for now, and focusing on the network/testnet problems. If anyone is not working on testnets and wants to give this another look, maybe a new proposal, I'm happy to review it though.

@farazhaider
Copy link

Hi @protolambda
I am the author of Compact sparse merkle trees (https://ethresear.ch/t/compact-sparse-merkle-trees/3741)
Unfortunately I could not find time for the past 2 years to do some follow up work on this but now I do thanks to the current world situation.
I'm not familiar with Ethereum's internals and what exactly are Ethereum's requirement in relation to this, could you please point me to a document where the broad requirements for using a Sparse merkle tree for Ethereum are listed.
Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants