trie: make fullnode children hash calculation concurrently #15131
Conversation
@fjl PTAL
Hey there! I've looked through the PR; the idea is good, but the solution is less so. The main problem is that you spin up a new goroutine for every full-node child. Now, with my new sync code, we have instances where I hash tries consisting of 20-60K nodes. I haven't run the numbers, but I can easily imagine the trie containing at least a few thousand nodes, which means a few thousand goroutines are cycled and a few thousand generators are allocated, which hits the GC too. I'll try to put together a benchmark with some large trie to make a stronger case, but I think this can be seen even in your own benchmark: running on 4x cores only gives you a 28% edge, which suggests the cores are still mostly idle.

It's important to investigate how much time hashing a single trie node takes. As nodes are around 500 bytes, I'd assume a more or less insignificant amount. If so, the goroutine and memory-allocation overhead might be quite large compared to the actual work being done, so we need to do a lot more work per thread to make the concurrency worthwhile.

Given that tries are mostly storing SHA3 images (as the keys), they can be assumed to be evenly distributed. A possible quick-and-dirty solution would be to only ever thread out on the topmost full node. That way we still do parallel processing without spinning up too many goroutines. Of course, this may not perform best during block processing, since the trie loaded into memory may well not be balanced.

Note, I had a solution from a year or so ago which used a hasher pool: any hashing operation sent its task to the pool and processed the result, but the sync overhead was greater than the concurrency gain, especially since the buffer handling becomes more complex (as in your case). Any solution we come up with will need a much nicer benchmark suite, and it also needs to map the memory allocs in the benchmark to ensure we don't accidentally swap sha3 time for GC time.
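For illustration, a minimal, self-contained sketch of the "hasher pool" idea mentioned above (hypothetical names, not the original year-old code; sha256 stands in for the trie's Keccak-256): a fixed set of workers receives hashing tasks over a channel and sends results back, and the per-task channel synchronization is exactly the overhead being described.

package main

import (
	"crypto/sha256"
	"fmt"
	"runtime"
)

// hashTask carries the bytes of one node and a channel for its digest.
type hashTask struct {
	data   []byte
	result chan [32]byte
}

// startHasherPool launches a fixed number of worker goroutines that drain
// the returned task channel for the lifetime of the process.
func startHasherPool(workers int) chan<- hashTask {
	tasks := make(chan hashTask, workers)
	for i := 0; i < workers; i++ {
		go func() {
			for t := range tasks {
				t.result <- sha256.Sum256(t.data)
			}
		}()
	}
	return tasks
}

func main() {
	pool := startHasherPool(runtime.NumCPU())

	// Submit one task per "trie node" and wait for the digests.
	nodes := [][]byte{[]byte("node-a"), []byte("node-b"), []byte("node-c")}
	results := make([]chan [32]byte, len(nodes))
	for i, n := range nodes {
		results[i] = make(chan [32]byte, 1)
		pool <- hashTask{data: n, result: results[i]}
	}
	for i := range nodes {
		fmt.Printf("%x\n", <-results[i])
	}
}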
Thanks for your suggestions. As you mentioned, the solution wasn't well thought through. The topmost-parallelism plan may be feasible: the speedup from parallelism only really matters once the tree is large enough, and the imbalance should not be an issue in a "large" tree.
I've opened a PR to fix the old benchmark. The numbers are still in your favor btw, but now the benchmark is a bit more realistic: Old:
This PR:
That being said, running on 8 cores, I'd expect a bit more speedup.
The performance isn't so good. I'll try the topmost parallelism to see if better performance can be obtained, since that approach seems more stable.
I don't really have a good suggestion, just consider my post a brain dump :D
I just "hacked" my top level idea into your impl, and I seem to get double performance out of it, might be worth investigating: diff --git a/trie/hasher.go b/trie/hasher.go
index bfcf0b3..1f3ae42 100644
--- a/trie/hasher.go
+++ b/trie/hasher.go
@@ -64,7 +64,7 @@ func (h *hasher) returnCalculator(calculator *calculator) {
// hash collapses a node down into a hash node, also returning a copy of the
// original node initialized with the computed hash to replace the original one.
-func (h *hasher) hash(n node, db DatabaseWriter, force bool) (node, node, error) {
+func (h *hasher) hash(n node, db DatabaseWriter, force bool, threaded bool) (node, node, error) {
// If we're not storing the node, just hashing, use available cached data
if hash, dirty := n.cache(); hash != nil {
if db == nil {
@@ -81,7 +81,7 @@ func (h *hasher) hash(n node, db DatabaseWriter, force bool) (node, node, error)
}
}
// Trie not processed yet or needs storage, walk the children
- collapsed, cached, err := h.hashChildren(n, db)
+ collapsed, cached, err := h.hashChildren(n, db, threaded)
if err != nil {
return hashNode{}, n, err
}
@@ -111,7 +111,7 @@ func (h *hasher) hash(n node, db DatabaseWriter, force bool) (node, node, error)
// hashChildren replaces the children of a node with their hashes if the encoded
// size of the child is larger than a hash, returning the collapsed node as well
// as a replacement for the original node with the child hashes cached in.
-func (h *hasher) hashChildren(original node, db DatabaseWriter) (node, node, error) {
+func (h *hasher) hashChildren(original node, db DatabaseWriter, threaded bool) (node, node, error) {
var err error
switch n := original.(type) {
@@ -122,7 +122,7 @@ func (h *hasher) hashChildren(original node, db DatabaseWriter) (node, node, err
cached.Key = common.CopyBytes(n.Key)
if _, ok := n.Val.(valueNode); !ok {
- collapsed.Val, cached.Val, err = h.hash(n.Val, db, false)
+ collapsed.Val, cached.Val, err = h.hash(n.Val, db, false, threaded)
if err != nil {
return original, original, err
}
@@ -138,15 +138,23 @@ func (h *hasher) hashChildren(original node, db DatabaseWriter) (node, node, err
collapsed, cached := n.copy(), n.copy()
for i := 0; i < 16; i++ {
if n.Children[i] != nil {
- wg.Add(1)
- go func(i int) {
+ if !threaded {
+ wg.Add(1)
+ go func(i int) {
+ var herr error
+ collapsed.Children[i], cached.Children[i], herr = h.hash(n.Children[i], db, false, true)
+ if herr != nil {
+ err = herr
+ }
+ wg.Done()
+ }(i)
+ } else {
var herr error
- collapsed.Children[i], cached.Children[i], herr = h.hash(n.Children[i], db, false)
+ collapsed.Children[i], cached.Children[i], herr = h.hash(n.Children[i], db, false, true)
if herr != nil {
err = herr
}
- wg.Done()
- }(i)
+ }
} else {
collapsed.Children[i] = valueNode(nil) // Ensure that nil children are encoded as empty strings.
}
diff --git a/trie/proof.go b/trie/proof.go
index 298f648..81f91b3 100644
--- a/trie/proof.go
+++ b/trie/proof.go
@@ -72,7 +72,7 @@ func (t *Trie) Prove(key []byte) []rlp.RawValue {
for i, n := range nodes {
// Don't bother checking for errors here since hasher panics
// if encoding doesn't work and we're not writing to any database.
- n, _, _ = hasher.hashChildren(n, nil)
+ n, _, _ = hasher.hashChildren(n, nil, false)
hn, _ := hasher.store(n, nil, false)
if _, ok := hn.(hashNode); ok || i == 0 {
// If the node's database encoding is a hash (or is the
diff --git a/trie/trie.go b/trie/trie.go
index c211e75..e8c8fd5 100644
--- a/trie/trie.go
+++ b/trie/trie.go
@@ -501,5 +501,5 @@ func (t *Trie) hashRoot(db DatabaseWriter) (node, node, error) {
return hashNode(emptyRoot.Bytes()), nil, nil
}
h := newHasher(t.cachegen, t.cachelimit)
- return h.hash(t.root, db, true)
+ return h.hash(t.root, db, true, false)
}
EDIT: Of course the above diff is very, very ugly code, just a PoC :)
(force-pushed from 4cf83c9 to 2af4feb)
@karalabe Updated
trie/hasher.go
Outdated
type calculator struct {
	sha    hash.Hash
	buffer *bytes.Buffer
}
Since this is a utility type used by the hasher, it's cleaner to have it (and its methods) defined at the top of the file. Please bubble them up. Also please add docs for the type to state what it does.
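As a sketch of what that could look like (the doc-comment wording is assumed; the fields mirror the quoted snippet above):

// calculator bundles a hash state and a scratch buffer so the hasher can
// encode and hash trie nodes without allocating fresh state for each node.
type calculator struct {
	sha    hash.Hash     // running hash state, reset between nodes
	buffer *bytes.Buffer // scratch space for the node's encoding
}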
trie/hasher.go
Outdated
@@ -121,17 +135,44 @@ func (h *hasher) hashChildren(original node, db DatabaseWriter) (node, node, err

	case *fullNode:
		// Hash the full node's children, caching the newly hashed subtrees
		var wg sync.WaitGroup
You only use this WaitGroup in a deeper nesting. Please move it inside, there's no need to declare it out here.
trie/hasher.go
Outdated
		}
	}
	if err != nil {
		return original, original, err
I'm a bit reluctant about this code construct. It seems very brittle that we have the same code duplicated; it's asking for trouble down the line when someone wants to modify it. Couldn't we somehow unify the threaded and non-threaded versions?
trie/hasher.go
Outdated
				var herr error
				collapsed.Children[i], cached.Children[i], herr = h.hash(n.Children[i], db, false)
				if herr != nil {
					err = herr
This construct is racy, since multiple goroutines may set the same error variable.
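One race-free alternative, as a minimal self-contained sketch (hypothetical names, not the code that ended up in the PR): give each child its own error slot so no two goroutines ever write to the same variable, then scan the slots after the WaitGroup drains.

package main

import (
	"errors"
	"fmt"
	"sync"
)

// hashOneChild stands in for hashing a single full-node child.
func hashOneChild(i int) error {
	if i == 7 {
		return errors.New("child 7 failed")
	}
	return nil
}

func main() {
	var (
		wg   sync.WaitGroup
		errs [16]error // one slot per child: no shared error variable
	)
	for i := 0; i < 16; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			errs[i] = hashOneChild(i)
		}(i)
	}
	wg.Wait()
	for _, err := range errs {
		if err != nil {
			fmt.Println("first error:", err)
			return
		}
	}
	fmt.Println("all children hashed")
}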
(force-pushed from 2af4feb to cfb622a)
@karalabe Updated
trie/hasher.go
Outdated
			collapsed.Children[i], cached.Children[i], err = h.hash(n.Children[i], db, false)
			if err != nil {
				return original, original, err
			hashChild := func(index int, finalize func()) {
Callbacks are not the nicest option :) A cleaner solution is to have the *sync.WaitGroup passed as the parameter. Then if it's not nil, you can do a defer wg.Done(). It's the same thing in essence, just a bit less boilerplate involved.
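A small, self-contained sketch of that suggestion (hypothetical names; the work function stands in for hashing one child): the same function serves both the serial and the threaded path, and a non-nil WaitGroup signals that the caller launched it as a goroutine.

package main

import (
	"fmt"
	"sync"
)

// doWork stands in for hashing one child. If wg is non-nil, the caller
// launched us as a goroutine and we signal completion ourselves.
func doWork(wg *sync.WaitGroup, index int, out []int) {
	if wg != nil {
		defer wg.Done()
	}
	out[index] = index * index
}

func main() {
	out := make([]int, 4)

	// Serial path: nil WaitGroup, plain call.
	doWork(nil, 0, out)

	// Threaded path: same function, now tracked by a WaitGroup.
	var wg sync.WaitGroup
	for i := 1; i < 4; i++ {
		wg.Add(1)
		go doWork(&wg, i, out)
	}
	wg.Wait()
	fmt.Println(out)
}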
(force-pushed from cfb622a to 7fd6a8d)
(force-pushed from 8f46b6d to d7f1cf1)
(force-pushed from d7f1cf1 to 4863cc7)
LGTM
…thereum#15131)" This reverts commit 0f7fbb8.
…thereum#15131)" (ethereum#15889) This reverts commit 0f7fbb8.
Motivation:
For a fullnode in the trie, the children's hashes were previously calculated serially. However, hashing one child does not affect any other child, so this PR calculates them concurrently.
Performance:
I ran the BenchmarkHashBE and BenchmarkHashLE benchmarks on my server machine (8 cores, 3.7 GHz) for performance checking.

For parallel:
For serial:
Performance doubled when hashing a large tree.
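To reproduce a comparison like this, something along these lines should work (the exact invocation is an assumption, not taken from the PR); -cpu runs each benchmark with 1 and 8 CPUs so serial and parallel behaviour can be compared on the same machine:

go test ./trie -run NONE -bench 'BenchmarkHash(BE|LE)' -benchmem -cpu 1,8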