Fast storage optimization for queries and iterations #5
Conversation
mutable_tree.go (Outdated)
```go
// GetVersionedFast gets the value at the specified key and version. The returned value must not be
// modified, since it may point to data stored within IAVL. GetVersionedFast utilizes a more performant
// strategy for retrieving the value than GetVersioned, but falls back to the regular strategy if it fails.
func (tree *MutableTree) GetVersionedFast(key []byte, version int64) []byte {
```
Similar reasoning to this
```
@@ -0,0 +1,66 @@
root@ubuntu-s-1vcpu-1gb-nyc1-01:~/iavl# cat bench_fast.txt
```
Left here for reviewers, will remove before merge
Nice job on this! Also, thanks for committing all the benchmarks! The CacheHit query speedup is amazing! I haven't done a full review yet, but I was a bit surprised that CacheMiss was significantly slower. I looked through the code, and I think the existing benchmark is more accurately "query not guaranteed to be in tree" and is incorrectly named. One other thing we can do to make this faster: in the immutable tree, if the immutable tree version equals the node db's latest version, then a FastNode miss implies that the key isn't in that version, so there is no reason to do the full search. (That way, all queries against the latest state get the speedup.)
Thanks for the feedback. Yes, I agree with the naming and will change it to reflect that. Great suggestion on making it faster! I'll benchmark it again with the fix |
I implemented the suggested fix in the latest commit. It significantly improved all queries. However, updates and blocks are worse off now. The fix made me discover that I should have been updating the version of fast nodes on disk to be above the deleted version when calling a variation of the deletion logic. This is done to keep fast nodes on disk consistent with the live state. If we were to simply delete the fast node from disk on deletion, it would be impossible to implement your suggestion, as the live state would diverge from the fast-node state. The latest bench test got OOM-killed for the last run.
```go
		versionLastUpdatedAt: ver,
		value:                val,
	}

	return fastNode, nil
}

func (node *FastNode) encodedSize() int {
```
The same method on the regular node looks like the following:

```go
func (node *Node) encodedSize() int {
	n := 1 +
		encodeVarintSize(node.size) +
		encodeVarintSize(node.version) +
		encodeBytesSize(node.key)
	if node.isLeaf() {
		n += encodeBytesSize(node.value)
	} else {
		n += encodeBytesSize(node.leftHash) +
			encodeBytesSize(node.rightHash)
	}
	return n
}
```

I don't understand what that extra byte at the beginning is for. At first, I thought it might be used for a prefix, but it looks like `encodedSize` is only used for measuring the value. I don't think we need an extra byte for a null terminator, since we encode each field with its own terminator. Could someone explain what that extra byte is meant for, please?
TBH I have no idea either. If things are serializing and deserializing correctly without it, then it should be fine, I hope?
(It's very possible this was just always an error; this is a super under-documented code base with, until recently, a bus factor of one.)
```go
	return tree.Hash(), version, nil
}

func (tree *MutableTree) saveFastNodeVersion() error {
```
I'm planning to look into the following:
- whether removing before writing to the db is faster
- whether combining sorted removals with sorted additions is faster than doing them one after the other

If anyone knows what patterns are most efficient, or how I can optimize this better, please let me know.
```
@@ -11,7 +12,7 @@ type FastNode struct {
	key []byte
```
wdyt about removing `key`? Do you think we should keep it, or remove it to help reduce the committed data overhead?
I removed the key from being written as part of the fast node's value on disk. We still need to keep it as a member of the struct for retrieval from disk and for managing it in memory.
oh awesome, great trade-off taken!
```go
// MockDB is a mock of DB interface.
type MockDB struct {
```
What's MockDB / gomock, btw? I'm just not familiar.
Seems pretty cool, like it's a code generator for something that satisfies an interface, but you can easily edit it to tweak things?
It creates a mock implementation of an interface. Here, I created a mock of the database interface, `MockDB`. It allows controlling the behavior of the mock to enter any code branch for test coverage. I mostly used it to simulate DB errors in this PR, which were difficult to simulate with the mem DB that we normally use for testing.
testutils_test.go (Outdated)
```go
		increment *= -1
	}

	for startIdx < endIdx {
```
How does this work when !ascending? Doesn't startIdx = len(mirror) - 1 and endIdx = 0?
This was completely wrong. Thanks for catching that. Fixed now
…des and orphans, update unit tests
Co-authored-by: Dev Ojha <[email protected]>
…12)
- add leaf hash to fast node and unit test
- refactor get with index and get by index, fix migration in load version and lazy load version
- use Get in GetVersioned of mutable tree
- refactor non membership proof to use fast storage if available
- bench non-membership proof
- fix bench tests to work with the new changes
- add downgrade-reupgrade protection and unit test
- remove leaf hash from fast node
- resolve multithreading bug related to iterators not being closed
- clean up
- use correct tree in bench tests
- add cache to tree used to bench non membership proofs
- add bench tests for GetWithIndex and GetByIndex
- revert GetWithIndex and GetByIndex
- remove unused import
- unit test re-upgrade protection and fix small issues
- remove redundant setStorageVersion method
- fix bug with appending live state version to storage version and unit test
- add comment for setFastStorageVersionToBatch
- refactor and improve unit tests for reupgrade protection
- rename ndb's isFastStorageEnabled to hasUpgradedToFastStorage and add comments
- comment out new implementation for GetNonMembershipProof
- update comments in nodedb to reflect the difference between hasUpgradedToFastStorage and shouldForceFastStorageUpdate
- refactor nodedb tests
- downgrade tendermint to 0.34.14 - osmosis's latest cosmos sdk does not support 0.35.0
- fix bug where fast storage was not enabled when version 0 was attempted to be loaded
- implement unsaved fast iterator to be used in mutable tree (#16)
- expose isUpgradeable method on mutable tree and unit test
- go fmt
Force-pushed from c983c62 to f0f815e (compare)
Background
Link to the original spec
We are conceptualizing the fast cache as a direct key-value store for the latest state. For simplicity of the deployment/migration logic, our plan is to make this a secondary copy of the latest state in the database. (We already incur more egregious space overheads with the cosmos pruning strategy, so this is not that bad.) The improved data locality on disk should reduce the time needed to retrieve data for keys that are close to each other. As a result, iterating over the latest state in order should become extremely fast, since we do not need to read from random physical locations on disk.
Original Issues
#1 Setting & getting data for the Fast Cache
#3 Iteration over Fast Cache
#4 Migration code
Summary of Changes
IAVL is divided into two trees, `mutable_tree` and `immutable_tree`. Sets only happen on the mutable tree. Things that need to change and be investigated for getting and setting, and the fast node:

mutable_tree
- `GetVersioned`
- `Set`
- `Remove`
- `SaveVersion`
- `Iterate`
- `Iterator`
- `Get`
- `enableFastStorageAndCommit` and its variations
- `mstorage_version`, where `m` is a new prefix. If the version is lower than the `fastStorageVersionValue` threshold, migration is triggered.

immutable_tree
- `Get` and `GetWithIndex`: renamed `Get` to `GetWithIndex`. `GetWithIndex` always uses the default live-state traversal strategy.
- Added a new `Get` method. `Get` attempts to use the fast cache first and only falls back to the regular tree-traversal strategy if the fast cache is disabled or the tree is not of the latest version.
- `Iterator`

nodedb / fast_iterator
- Added a new fast iterator with prefix `f`, which stands for fast. Basically, all fast nodes are sorted on disk by key in ascending order, so we can simply traverse the disk, ensuring efficient hardware access.

testing
Old Benchmark
Date: 2022-01-22 12:33 AM PST
Branch: `dev/iavl_data_locality` with some modifications to the bench tests

Latest Benchmark
Date: 2022-01-22 10:15 AM PST
Branch: `roman/fast-node-get-set`
Benchmarks Interpretation
Highlighting the difference in performance from the latest benchmarks:
- Old branch: `dev/iavl_data_locality`
- New branch: `roman/fast-node-get-set`
- Initial size: 100,000 key-val pairs
- Block size: 100 keys
- Key length: 16 bytes
- Value length: 40 bytes
bytesQuery with no guarantee of the key being in the tree:
22354 ns/op
18046 ns/op
4938 ns/op
Query with the key guaranteed to be in the latest tree:
27137 ns/op
23126 ns/op
1684 ns/op
Iteration:
2285116100 ns/op
1716585400 ns/op
94702442 ns/op
Update:
run Set, if this is a try that is divisible by blockSize, attempt to SaveVersion and if the latest saved version number history exceeds 20, delete the oldest version
307266 ns/op
257683 ns/op
Block:
for block size, run Get and Set. At the end of the block, SaveVersion and if the latest saved version number history exceeds 20, delete the oldest version
40663600 ns/op
44907345 ns/op