Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tree hash functions for SSZ #54

Closed
vbuterin opened this issue Oct 5, 2018 · 6 comments
Closed

Tree hash functions for SSZ #54

vbuterin opened this issue Oct 5, 2018 · 6 comments
Labels
general:enhancement New feature or request

Comments

@vbuterin
Copy link
Contributor

vbuterin commented Oct 5, 2018

The following is a general-purpose strategy for making all data structures in the beacon chain more light client friendly. When (i) hashing the beacon chain active state, (ii) hashing the beacon chain crystallized state, or (iii) hashing beacon chain blocks, we instead use the following hash function specific to SSZ objects, where hash(x) is some underlying hash function with a 32-byte output (eg. blake(x)[0:32])

def hash_ssz_object(obj):
    if isinstance(obj, list):
        objhashes = [hash_ssz_object(o) for o in obj]
        return merkle_root(objhashes)
    elif not isinstance(obj, SSZObject):
        return hash(obj)
    else:
        o = b''
        for f in obj.fields:
            val = getattr(obj, f)
            o += hash_ssz_object(val)
        return hash(o)    

Where merkle_root is defined as follows:

def merkle_root(objs):
    min_pow_of_2 = 1
    while min_pow_of_2 <= len(objs):
        min_pow_of_2 *= 2
    o = [0] * min_pow_of_2 + [len(objs).to_bytes(32, 'big')] + objs + [b'\x00'*32] * (min_pow_of_2 - len(objs))
    for i in range(min_pow_of_2 - 1, 0, -1):
        o[i] = hash(o[i*2] + o[i*2+1])
    return o[1]

Collision resistance is only guaranteed between objects of the same type, not objects of different types.

Efficiency

Fundamentally, Merkle-hashing instead of regular hashing doubles the amount of data hashes, but because hash functions have fixed costs the overhead is higher. Here are some simulation results, using 111-byte objects for accounts because this is currently roughly the size of a beacon chain ValidatorRecord object:

>>> import blake2b
>>> def hash(x): blake2b(x).digest()[:32]
>>> import time
>>> accounts = [b'\x35' * 111 for _ in range (1000000)]
>>> a = time.time(); x = hash(b''.join(accounts)); print(time.time() - a)
0.42771387100219727
>>> a = time.time(); x = merkle_root(accounts); print(time.time() - a)
1.2481215000152588
@hwwhww hwwhww added the general:enhancement New feature or request label Oct 6, 2018
@paulhauner
Copy link
Contributor

paulhauner commented Oct 11, 2018

Very minor, but I suspect return hash(p) should be return hash(o).

Update: I was expecting hash_ssz_object to return a byte array (hash digest) but I'm getting an array with a mix of ints and byte arrays. Should I be expecting to see a hash digest?

@vbuterin
Copy link
Contributor Author

Fixed both! Merkle root accidentally returned the entire tree.

@sorpaas
Copy link

sorpaas commented Nov 2, 2018

In hash_ssz_object, for structs, it looks like we hash each fields individually, combine them, and then hash again. Are there any reasons why we need those many hashes (which I don't see how they can be used for light clients, as they're not part of merkle trie)? If we combine fields first, it would save many hash rounds.

@djrtwo
Copy link
Contributor

djrtwo commented Nov 6, 2018

@vbuterin I'd like to get this merged soon and reduce state to one state root.

I believe @sorpaas is correct. I don't see the need to hash the SSZ base types in the following line

    elif not isinstance(obj, SSZObject):
        return hash(obj)

We can just return the raw obj data here.

@mkalinin
Copy link
Contributor

mkalinin commented Nov 9, 2018

Did an evaluation in Java.
There is a room for tiny improvement which addresses very rare case (when min_pow_of_2 == len(objs), original lines are commented out):

def merkle_root(objs):
    min_pow_of_2 = 1
    # while min_pow_of_2 <= len(objs):
    while min_pow_of_2 < len(objs):
        min_pow_of_2 *= 2
    # o = [0] * min_pow_of_2 + [len(objs).to_bytes(32, 'big')] + objs + [b'\x00'*32] * (min_pow_of_2 - len(objs))
    o = [len(objs).to_bytes(32, 'big')] + [0] * (min_pow_of_2 - 1) + objs + [b'\x00'*32] * (min_pow_of_2 - len(objs))
    for i in range(min_pow_of_2 - 1, 0, -1):
        o[i] = hash(o[i*2] + o[i*2+1])
    # return o[1]
    return o[0] 

But original algorithm with @sorpaas's correction looks good to me either.

@vbuterin
Copy link
Contributor Author

vbuterin commented Nov 9, 2018

I made a modified algorithm that has some efficiency improvements:

#120

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
general:enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

6 participants