
Christmas trie #20481

Merged · 16 commits · Feb 3, 2020

Conversation

@holiman (Contributor) commented Dec 23, 2019

Dedicated hasher/committer

The previous hasher.go was a generic implementation used by both Commit and Hash. It did a
couple of things:

  • Walk the un-hashed / dirty trie towards the leaves,
  • Hash the tree on the return path,
  • Create a new 'shadow' trie containing the new hashes, cached, which will replace the old root,
  • Create a new collapsed trie containing the hashing preimages, which is incrementally used to create higher-order hash values,
  • Call store, if a commit is being performed, to send nodes to the database and to the onleaf handler (for Commit).

This means that for every Hash or Commit, every processed node was duplicated twice, using copy. When I split this into two dedicated implementations, it was possible
to remove a lot of redundant allocs.

Commit

The Commit operation is basically read-only, with the exception of setting the dirty-flag to false.

  • This means we can skip the whole 'create a new shadow trie for the new root' step, and cut allocs in half.
  • We very seldom need to do any hashing in Commit (but sometimes still do), so we can skip RLP encoding most of the time.
  • The database insertion path needs the size of a node for tracking purposes, but that can be estimated instead
    of calculated strictly.
  • Sending items to onleaf and the db has been split off onto a separate goroutine; this PR sends those tasks via a channel (a generic sketch of the pattern follows the benchmark numbers below).
name                             old time/op    new time/op    delta
CommitAfterHash/no-onleaf-6        4.95µs ± 9%    3.54µs ±19%  -28.39%  (p=0.000 n=10+9)
CommitAfterHash/with-onleaf-6      6.07µs ±18%    4.61µs ±21%  -24.07%  (p=0.000 n=10+10)
CommitAfterHashFixedSize/10-6      65.7µs ± 3%    34.0µs ± 5%  -48.23%  (p=0.000 n=9+10)
CommitAfterHashFixedSize/100-6      277µs ± 2%     157µs ± 7%  -43.21%  (p=0.000 n=9+10)
CommitAfterHashFixedSize/1K-6      3.29ms ±14%    1.68ms ±12%  -49.05%  (p=0.000 n=10+9)
CommitAfterHashFixedSize/10K-6     45.1ms ± 8%    26.8ms ±13%  -40.55%  (p=0.000 n=10+10)
CommitAfterHashFixedSize/100K-6     505ms ±15%     343ms ± 9%  -32.21%  (p=0.000 n=10+10)

name                             old alloc/op   new alloc/op   delta
CommitAfterHash/no-onleaf-6        1.79kB ± 1%    1.47kB ± 4%  -17.77%  (p=0.000 n=9+10)
CommitAfterHash/with-onleaf-6      1.98kB ± 2%    1.75kB ± 2%  -11.53%  (p=0.000 n=9+10)
CommitAfterHashFixedSize/10-6      37.5kB ± 0%    30.8kB ± 0%  -17.76%  (p=0.000 n=9+10)
CommitAfterHashFixedSize/100-6      179kB ± 0%     148kB ± 0%  -17.35%  (p=0.000 n=10+9)
CommitAfterHashFixedSize/1K-6      1.83MB ± 0%    1.49MB ± 0%  -18.41%  (p=0.000 n=10+10)
CommitAfterHashFixedSize/10K-6     19.9MB ± 0%    16.4MB ± 0%  -17.71%  (p=0.000 n=10+10)
CommitAfterHashFixedSize/100K-6     192MB ± 0%     157MB ± 0%  -18.19%  (p=0.000 n=10+10)

name                             old allocs/op  new allocs/op  delta
CommitAfterHash/no-onleaf-6          11.0 ± 0%       9.0 ± 0%  -18.18%  (p=0.000 n=10+10)
CommitAfterHash/with-onleaf-6        15.0 ± 0%      14.0 ± 0%   -6.67%  (p=0.002 n=8+10)
CommitAfterHashFixedSize/10-6         240 ± 0%       188 ± 0%  -21.67%  (p=0.000 n=10+10)
CommitAfterHashFixedSize/100-6      1.16k ± 0%     0.90k ± 0%  -22.45%  (p=0.000 n=10+10)
CommitAfterHashFixedSize/1K-6       11.9k ± 0%      9.2k ± 0%  -22.80%  (p=0.001 n=8+9)
CommitAfterHashFixedSize/10K-6       122k ± 0%       94k ± 0%  -22.82%  (p=0.000 n=10+10)
CommitAfterHashFixedSize/100K-6     1.21M ± 0%     0.94M ± 0%  -22.82%  (p=0.000 n=10+10)
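
For illustration, here is a minimal, generic sketch of the "send committed nodes to a writer over a channel" pattern described above. The names commitTask and writeLoop, and the payload fields, are made up for this sketch; they are not the PR's actual types.

```go
package main

import (
	"fmt"
	"sync"
)

// commitTask is a hypothetical payload: a hashed node ready to be written,
// plus an estimated size for cache accounting.
type commitTask struct {
	hash     [32]byte
	blob     []byte
	estimate int
}

// writeLoop drains tasks on a dedicated goroutine so the committer's trie
// walk never blocks on database insertion.
func writeLoop(tasks <-chan commitTask, wg *sync.WaitGroup) {
	defer wg.Done()
	for t := range tasks {
		// In the real code this would insert into the node database and
		// invoke the onleaf callback; here we just report the task.
		fmt.Printf("write %x (%d bytes est.)\n", t.hash[:4], t.estimate)
	}
}

func main() {
	tasks := make(chan commitTask, 128) // buffered so the walker rarely stalls
	var wg sync.WaitGroup
	wg.Add(1)
	go writeLoop(tasks, &wg)

	// The committer would send each collapsed node as it finishes it.
	tasks <- commitTask{hash: [32]byte{0xde, 0xad}, blob: []byte{0x01}, estimate: 33}
	close(tasks) // commit is done: let the writer drain and exit
	wg.Wait()
}
```

In the PR the writer goroutine is spun up per commit; the "Further improvements" section below discusses making it a dedicated, long-lived goroutine instead.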

Hash

The Hash operation has been optimized mainly to decrease the number of allocations.

name                  old time/op    new time/op    delta
Hash-6                  7.08µs ±12%    7.34µs ±14%     ~     (p=0.481 n=10+10)
HashFixedSize/10-6      70.0µs ± 2%    68.0µs ± 4%   -2.85%  (p=0.011 n=10+10)
HashFixedSize/100-6      328µs ± 3%     315µs ± 3%   -3.88%  (p=0.000 n=10+10)
HashFixedSize/1K-6      3.67ms ± 3%    3.54ms ±12%   -3.43%  (p=0.043 n=9+10)
HashFixedSize/10K-6     43.3ms ± 2%    40.9ms ± 3%   -5.46%  (p=0.000 n=9+8)
HashFixedSize/100K-6     525ms ±12%     465ms ±16%  -11.29%  (p=0.002 n=10+10)

name                  old alloc/op   new alloc/op   delta
Hash-6                    989B ± 1%      916B ± 2%   -7.36%  (p=0.000 n=10+10)
HashFixedSize/10-6      13.8kB ± 0%    12.5kB ± 0%   -9.26%  (p=0.000 n=10+10)
HashFixedSize/100-6     65.0kB ± 0%    58.7kB ± 0%   -9.83%  (p=0.000 n=10+10)
HashFixedSize/1K-6       695kB ± 0%     631kB ± 0%   -9.21%  (p=0.000 n=10+9)
HashFixedSize/10K-6     7.21MB ± 0%    6.57MB ± 0%   -8.88%  (p=0.000 n=10+10)
HashFixedSize/100K-6    71.3MB ± 0%    64.9MB ± 0%   -8.97%  (p=0.000 n=10+10)

name                  old allocs/op  new allocs/op  delta
Hash-6                    12.0 ± 0%      11.0 ± 0%   -8.33%  (p=0.000 n=10+10)
HashFixedSize/10-6         182 ± 0%       162 ± 0%  -10.99%  (p=0.000 n=10+10)
HashFixedSize/100-6        886 ± 0%       786 ± 0%  -11.29%  (p=0.000 n=10+10)
HashFixedSize/1K-6       9.16k ± 0%     8.15k ± 0%  -10.97%  (p=0.001 n=8+9)
HashFixedSize/10K-6      93.5k ± 0%     83.4k ± 0%  -10.81%  (p=0.000 n=10+10)
HashFixedSize/100K-6      932k ± 0%      830k ± 0%  -10.87%  (p=0.000 n=10+10)

Benchmark tests

The trie benchmark tests were not accurate. The trie benchmarks used b.N to set the number of leaves in the trie. However,
the correlation between 'work performed' and 'leaves' is not linear, so the benchmark results were not
comparable. On a faster codebase, the test framework will increase N, which increases the work
by N * f(N), and depending on whichever final N the go test framework decides to stop at, it might report a
slower average (time per op) than the slower codebase.

I added new Fixed benchmarks, which instead measure the time to Commit/Hash fixed-size tries.
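
For reference, a minimal sketch of the fixed-size benchmark pattern. The package/file names and the hashEntries helper are made up stand-ins for this sketch; the real benchmarks build and Commit/Hash actual tries.

```go
// fixed_bench_test.go — toy illustration, runnable with `go test -bench=.`
package triebench

import (
	"crypto/sha256"
	"fmt"
	"testing"
)

// hashEntries stands in for "hash or commit a trie with n leaves": the only
// property we need is that its cost grows with n, like the real operation.
func hashEntries(n int) [32]byte {
	var acc [32]byte
	for i := 0; i < n; i++ {
		acc = sha256.Sum256(append(acc[:], byte(i), byte(i>>8), byte(i>>16)))
	}
	return acc
}

// Anti-pattern (what the old trie benchmarks did): b.N sets the data size,
// so what one 'op' means changes with whatever N the framework settles on,
// and results are not comparable across codebases.
func BenchmarkGrowingWithN(b *testing.B) {
	hashEntries(b.N)
}

// Fixed-size pattern: every iteration does the same amount of work, so
// results stay comparable regardless of the final b.N.
func BenchmarkFixedSize(b *testing.B) {
	for _, size := range []int{10, 100, 1000} {
		b.Run(fmt.Sprint(size), func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				hashEntries(size)
			}
		})
	}
}
```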

Further improvements

There are a couple of further improvements that can be made:

  1. Parallelized hashing: by splitting up the trie walking at the first fullnode, we can have up to 16 goroutines, each walking its own subpath.

This should be an improvement, but only for tries of suitable size (perhaps 1,000 new leaves or so). Most likely (imo), the old attempt had too
much overhead on the storage trie hashes, since the storage tries are too small to gain any real benefit from the change.

It might be a good idea to add an update counter to the trie, and only use parallel hashing when the (estimated) number of updates is above a certain threshold. A minimal sketch of the fan-out idea follows.
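
This is a self-contained illustration only: toyNode, hashSeq and hashParallel are made-up stand-ins (using SHA-256 instead of keccak), not the trie package's node types or hashing.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sync"
)

// toyNode: a leaf value and/or a 16-way branch, standing in for a trie fullnode.
type toyNode struct {
	value    []byte
	children [16]*toyNode
}

// hashSeq hashes a subtrie sequentially (children first, parent last).
func hashSeq(n *toyNode) []byte {
	if n == nil {
		return nil
	}
	h := sha256.New()
	h.Write(n.value)
	for _, c := range n.children {
		h.Write(hashSeq(c))
	}
	return h.Sum(nil)
}

// hashParallel hashes the 16 child subtries of the root concurrently and
// combines the results; everything below the first branch stays sequential.
func hashParallel(root *toyNode) []byte {
	var (
		wg     sync.WaitGroup
		hashes [16][]byte
	)
	for i, c := range root.children {
		if c == nil {
			continue
		}
		wg.Add(1)
		go func(i int, c *toyNode) {
			defer wg.Done()
			hashes[i] = hashSeq(c) // each goroutine writes its own slot
		}(i, c)
	}
	wg.Wait()
	h := sha256.New()
	h.Write(root.value)
	for _, child := range hashes {
		h.Write(child)
	}
	return h.Sum(nil)
}

func main() {
	root := &toyNode{}
	for i := 0; i < 16; i++ {
		root.children[i] = &toyNode{value: []byte{byte(i)}}
	}
	fmt.Printf("root hash: %x\n", hashParallel(root))
}
```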

  2. Dedicated database inserter goroutine. This PR spins up a goroutine for db insertion at every trie commit. Although it's fairly fast, it's likely not a speed boost
    for the storage tries, for the same reason as above. So a better approach might be to have one dedicated goroutine (per trie, or per database) to which we can send data from the committer
    for the database + onleaf callbacks. A separate benefit of doing this is that during statedb.go commit, we don't have to wait for each individual trie to finish before
    starting to process the next one. Pipelining ftw!

I actually did an implementation of this, but it got kind of messy since my PoC exposed the node type and caused cyclic dependencies, so I've held off on it for now, until I figure out
how best to structure it.

Real life tests

I'm doing a full sync on mon08/mon09; it appears that this PR is significantly faster than master.

https://geth-bench.ethdevops.io/d/Jpk-Be5Wk/dual-geth?orgId=1&var-exp=mon08&var-master=mon09&var-percentile=50&from=1577128018418&to=now

@holiman (Contributor, Author) commented Dec 23, 2019

Here's the cut-off when switching from #20465 to this PR, showing the times for commit and hash ops.
[Screenshot: Dual Geth - Grafana, 2019-12-23]

Something is not right, though; this PR seems to have somehow broken the mem GC:
[Screenshot: Dual Geth - Grafana, 2019-12-23]

Which also shows in the leveldb stats:
[Screenshot: Dual Geth - Grafana, 2019-12-23]

@holiman (Contributor, Author) commented Dec 24, 2019

It's doing pretty well. Since last night, when I restarted it, the experimental node has taken a lead of about 45 minutes, I think.
[Screenshot: Dual Geth - Grafana, 2019-12-24]

@holiman (Contributor, Author) commented Dec 25, 2019

Daily update: experimental is ahead by about 2h 45m after roughly two days.
[Screenshot: Dual Geth - Grafana, 2019-12-25]

Here are the commit/hash charts:
[Screenshot: Dual Geth - Grafana, 2019-12-25]

@holiman (Contributor, Author) commented Dec 26, 2019

Daily update: after about 3 x 24h, this PR is about 4h 30m ahead of master.
I squashed/cleaned up the commits and added a description.

@holiman (Contributor, Author) commented Dec 28, 2019

After five days (around 118h), this PR is at block 7.731M, 7h ahead of master at 7.520M.

@holiman (Contributor, Author) commented Dec 30, 2019

master reached block 8.6M about 9h25m later than this PR.

@holiman (Contributor, Author) commented Dec 31, 2019

mon08 (this PR) finished the full sync at block 9188701 at 2019-12-31 03:05:02, after a total of 175h.
mon09 was 10 hours behind, at block 9.177M, when the run was stopped.

[Screenshot: Dual Geth - Grafana, 2019-12-31]

Here are some comparison tables. On master, account commit is the fourth 'heaviest' thing, whereas in this PR it dropped to sixth place, behind even account hash.

[Screenshot: Dual Geth - Grafana, 2019-12-31]

@holiman (Contributor, Author) commented Jan 4, 2020

Added a commit to prevent loading tries when not needed. At the moment, before this commit, we load the (non-existent) storage trie for every origin EOA, for example.

Later on, these tries are all committed -- and although the commit of a non-modified storage trie exits early, this PR also spins up a goroutine for each of them, which is now skipped as well.

@holiman force-pushed the pure_commit branch 2 times, most recently from 539071a to 4ce3930 on January 7, 2020.
core/state/state_object.go:
func (s *stateObject) updateTrie(db Database) Trie {
	// Make sure all dirty slots are finalized into the pending storage area
	s.finalise()

	if len(s.pendingStorage) == 0 {
		return s.trie
Reviewer (Member):

Curious question: what happens if, pre-Byzantium, one tx modifies the trie but a second one does not? Then pending will be empty I think, but s.trie is not nil any more, since a previous tx already loaded it. Shouldn't we explicitly return nil here?

@holiman (Contributor, Author) replied:

I guess it will enter the commit and quickly find that the trie root is not dirty. Since we don't provide a leaf callback for storage tries, there will not be a lot of overhead on that operation.

(Several more review threads on trie/committer.go and trie/hasher.go, since resolved.)

trie/committer.go:
		return hash, nil
	}
	if db == nil {
		return nil, errors.New("no db provided")
Reviewer (Member):

Wondering about this. Should we error or panic? If it's a committer, should it allow a nil db? Isn't that a programming error?

Reviewer (Member):

It also feels a bit weird to explicitly check this for every single trie node. Shouldn't this be checked way above, and here just assumed to be correct?

@holiman (Contributor, Author) replied:

If db is nil, it means that the trie.db is nil. I can't say if that's ever done on purpose, e.g. in tests.

trie/committer.go:
panic("commit did not hash down to a hashnode!")
}
return hn, nil
}
Reviewer (Member):

Myeah, I don't think there's much reason for this to exist. You can move the db check into commit and I would not care for the root hash node check (I mean, that's what commit is meant to do, no point in sanity checking that it works).

	if common.BytesToHash(newRoot) != rootHash {
		panic(fmt.Sprintf("Committed root %x != roothash %x", newRoot, rootHash))
	}
	t.root = newRoot
Reviewer (Member):

Same here, I don't think this level of paranoia is needed :D

@karalabe (Member) commented:

Re: panics, perhaps we can run this PR as is on a benchmarker and nuke and merge afterwards. Then you'll have one last sanity check, but it should really be fine.

@holiman (Contributor, Author) commented Jan 29, 2020

After a fast-sync

[Screenshot: Dual Geth - Grafana, 2020-01-29]

@holiman (Contributor, Author) commented Jan 30, 2020

For some reason, this PR appears to hold more memory than master.
[Screenshot: Dual Geth - Grafana, 2020-01-30]

However, master had a different CHT, which resulted in master being able to place a lot more data directly into ancients. I'll nuke and restart a fast sync (now that the PR is rebased).

@holiman (Contributor, Author) commented Jan 31, 2020

Regarding the memory held, here's the current chart:
[Screenshot: Dual Geth - Grafana, 2020-01-31]

This time, after sync, it's actually master that's holding more memory; however, it seems to correlate with an increase in egress:
[Screenshot: Dual Geth - Grafana, 2020-01-31]

So all in all, I don't think this PR changes memory behaviour.
