Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fast storage optimization for queries and iterations #5

Merged
merged 73 commits into from
Feb 11, 2022

Conversation

p0mvn
Copy link
Member

@p0mvn p0mvn commented Jan 12, 2022

Background

Link to the original spec

We are conceptualizing the fast cache as this direct key-value store for the latest state. For simplicity of deployment/migration logic, our plan is to make this a secondary copy of the latest state in the database. (We do more egregious space overheads with the cosmos pruning strategy, so this is not that bad). The improved data locality on disk should reduce the time needed for retrieving data for keys that are close to each other. As a result, iterating over the latest state in order should become extremely fast since we do not need to read from random physical locations on a disk.

Original Issues
#1 Setting & getting data for the Fast Cache
#3 Iteration over Fast Cache
#4 Migration code

Summary of Changes
IAVL is divided into two trees, mutable_tree and immutable_tree. Sets only happen on the mutable tree.

Things that need to change and be investigated for getting and setting, and the fast node:

  • mutable tree

    • GetVersioned
      • Change the logic to check the FastNode cache first, then do the GetImmutable logic
    • Set
      • Cache unsaved fast nodes in memory, avoid persisting them to disk right away
      • Update unsaved removals if needed
    • Remove
      • Cache unsaved removals in memory, avoid persisting changes to disk
      • Update unsaved additions if needed
    • SaveVersion
      • Persist unsaved fast node additions and removals to disk
      • Sort by key before saving to db to ensure data locality
      • Potential optimizations:
    • Iterate
      • Now that we have some unsaved changes cached in the mutable tree, we cannot use the Iterate's implementation from the immutable tree. Introduce this new method.
      • Ensure that we iterate in the most efficient manner in sorted order by having 2 pointers:
        1. to the next element on disk
        2. to the next unsaved change in the mutable tree
      • Compare the values of the 2 pointers on each iteration and choose the next one
    • Iterator
      • Immutable tree is embedded into mutable. From the way composition works in Go, the mutable tree can access all methods and fields of the immutable. So we need to overwrite the implementation of the immutable tree and return an invalid iterator by default.
      • The iterator in the mutable tree is invalid because we cannot support updates and delayed iterations at the same time.
    • Get
      • For the same reason as in Iterate, we must check unsaved additions first before attempting to use the strategy employed by the immutable tree
    • enableFastStorageAndCommit and its variations
      • These are the methods that are used to perform automatic upgrades to fast storage if the system detects that fast cache is not used. This detection happens by checking the value of a special metadata key called mstorage_version where m is a new prefix. If the version is lower than the fastStorageVersionValue threshold - migration is triggered.
  • immutable_tree

    • Get and (GetWithIndex
      • renamed Get to GetWithIndex. GetWithIndex always uses the default live state traversal strategy
      • introduced Get method. Get attempts to use the fast cache first. Only fallbacks to regular tree traversal strategy if the fast cache is disabled or tree is not of the latest version
    • Iterator
      • returns either regular or updated fast iterator (see more below) depending on if fast storage is enabled and migration is complete
  • nodedb

    • updated and tested the underlying storage management logic
  • fast_iterator

    • introduced and tested a new iterator that binds directly to the database iterator by searching for keys that begin with the prefix f which stands for fast. Basically, all fast nodes are sorted on disk by key in ascending order so we can simply traverse the disk ensuring efficient hardware access.
  • testing

    • unit tests for mutable tree Get, Set, Save, Delete
    • unit tests for immutable - Get, Iterator - fast and slow
    • some unit tests for nodedb
    • updated randomized tests that function like integration tests
    • updated bench tests

Old Benchmark
Date: 2022-01-22 12:33 AM PST
Branch: dev/iavl_data_locality with some modifications to the bench tests

go version go1.17.6 linux/amd64

Init Tree took 78.75 MB
goos: linux
goarch: amd64
pkg: github.com/cosmos/iavl/benchmarks
cpu: Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
BenchmarkMedium/memdb-100000-100-16-40/query-no-in-tree-guarantee-slow-8         	   95841	     12315 ns/op	     592 B/op	      12 allocs/op
BenchmarkMedium/memdb-100000-100-16-40/query-hits-slow-8                         	   85990	     15533 ns/op	     760 B/op	      15 allocs/op
BenchmarkMedium/memdb-100000-100-16-40/iteration-slow-8                          	       2	 838458850 ns/op	88841084 B/op	 1600092 allocs/op
--- BENCH: BenchmarkMedium/memdb-100000-100-16-40/iteration-slow-8
    bench_test.go:109: completed 100000 iterations
    bench_test.go:109: completed 100000 iterations
    bench_test.go:109: completed 100000 iterations
BenchmarkMedium/memdb-100000-100-16-40/update-8                                  	   10000	    148915 ns/op	   27312 B/op	     335 allocs/op
BenchmarkMedium/memdb-100000-100-16-40/block-8                                   	      76	  16865728 ns/op	 2910568 B/op	   35668 allocs/op
Init Tree took 46.71 MB
BenchmarkMedium/goleveldb-100000-100-16-40/query-no-in-tree-guarantee-slow-8     	   55309	     22354 ns/op	    1550 B/op	      30 allocs/op
BenchmarkMedium/goleveldb-100000-100-16-40/query-hits-slow-8                     	   43566	     27137 ns/op	    2093 B/op	      39 allocs/op
BenchmarkMedium/goleveldb-100000-100-16-40/iteration-slow-8                      	       1	2285116100 ns/op	225813440 B/op	 3857215 allocs/op
--- BENCH: BenchmarkMedium/goleveldb-100000-100-16-40/iteration-slow-8
    bench_test.go:109: completed 100000 iterations
BenchmarkMedium/goleveldb-100000-100-16-40/update-8                              	    6194	    307266 ns/op	   40138 B/op	     406 allocs/op
BenchmarkMedium/goleveldb-100000-100-16-40/block-8                               	      28	  40663600 ns/op	 5150422 B/op	   53771 allocs/op
PASS
ok  	github.com/cosmos/iavl/benchmarks	25.797s

Latest Benchmark
Date: 2022-01-22 10:15 AM PST
Branch: roman/fast-node-get-set

go version go1.17.6 linux/amd64

Init Tree took 114.29 MB
goos: linux
goarch: amd64
pkg: github.com/cosmos/iavl/benchmarks
cpu: Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
BenchmarkMedium/memdb-100000-100-16-40/query-no-in-tree-guarantee-fast-8         	  672999	      1887 ns/op	     112 B/op	       4 allocs/op
BenchmarkMedium/memdb-100000-100-16-40/query-no-in-tree-guarantee-slow-8         	   95888	     11884 ns/op	     440 B/op	       8 allocs/op
BenchmarkMedium/memdb-100000-100-16-40/query-hits-fast-8                         	  891831	      1208 ns/op	      16 B/op	       0 allocs/op
BenchmarkMedium/memdb-100000-100-16-40/query-hits-slow-8                         	   79842	     15644 ns/op	     607 B/op	      11 allocs/op
BenchmarkMedium/memdb-100000-100-16-40/iteration-fast-8                          	      20	  63956090 ns/op	18400254 B/op	  300000 allocs/op
--- BENCH: BenchmarkMedium/memdb-100000-100-16-40/iteration-fast-8
    bench_test.go:117: completed 100000 iterations
    bench_test.go:117: completed 100000 iterations
    bench_test.go:117: completed 100000 iterations
    bench_test.go:117: completed 100000 iterations
    bench_test.go:117: completed 100000 iterations
    bench_test.go:117: completed 100000 iterations
    bench_test.go:117: completed 100000 iterations
    bench_test.go:117: completed 100000 iterations
    bench_test.go:117: completed 100000 iterations
    bench_test.go:117: completed 100000 iterations
	... [output truncated]
BenchmarkMedium/memdb-100000-100-16-40/iteration-slow-8                          	       2	 947611750 ns/op	88841044 B/op	 1600092 allocs/op
--- BENCH: BenchmarkMedium/memdb-100000-100-16-40/iteration-slow-8
    bench_test.go:117: completed 100000 iterations
    bench_test.go:117: completed 100000 iterations
    bench_test.go:117: completed 100000 iterations
BenchmarkMedium/memdb-100000-100-16-40/update-8                                  	    9198	    174306 ns/op	   27524 B/op	     342 allocs/op
BenchmarkMedium/memdb-100000-100-16-40/block-8                                   	      58	  19855266 ns/op	 2948779 B/op	   36495 allocs/op
Init Tree took 66.82 MB
BenchmarkMedium/goleveldb-100000-100-16-40/query-no-in-tree-guarantee-fast-8     	  228343	      4938 ns/op	     814 B/op	      16 allocs/op
BenchmarkMedium/goleveldb-100000-100-16-40/query-no-in-tree-guarantee-slow-8     	   59304	     18046 ns/op	    1420 B/op	      24 allocs/op
BenchmarkMedium/goleveldb-100000-100-16-40/query-hits-fast-8                     	  611349	      1684 ns/op	      93 B/op	       1 allocs/op
BenchmarkMedium/goleveldb-100000-100-16-40/query-hits-slow-8                     	   50778	     23126 ns/op	    2005 B/op	      34 allocs/op
BenchmarkMedium/goleveldb-100000-100-16-40/iteration-fast-8                      	      12	  94702442 ns/op	29327220 B/op	  522988 allocs/op
--- BENCH: BenchmarkMedium/goleveldb-100000-100-16-40/iteration-fast-8
    bench_test.go:117: completed 100000 iterations
    bench_test.go:117: completed 100000 iterations
    bench_test.go:117: completed 100000 iterations
    bench_test.go:117: completed 100000 iterations
    bench_test.go:117: completed 100000 iterations
    bench_test.go:117: completed 100000 iterations
    bench_test.go:117: completed 100000 iterations
    bench_test.go:117: completed 100000 iterations
    bench_test.go:117: completed 100000 iterations
    bench_test.go:117: completed 100000 iterations
	... [output truncated]
BenchmarkMedium/goleveldb-100000-100-16-40/iteration-slow-8                      	       1	1716585400 ns/op	235504072 B/op	 3998006 allocs/op
--- BENCH: BenchmarkMedium/goleveldb-100000-100-16-40/iteration-slow-8
    bench_test.go:117: completed 100000 iterations
BenchmarkMedium/goleveldb-100000-100-16-40/update-8                              	    8994	    257683 ns/op	   44702 B/op	     447 allocs/op
BenchmarkMedium/goleveldb-100000-100-16-40/block-8                               	      31	  44907345 ns/op	 6973362 B/op	   72924 allocs/op
PASS
ok  	github.com/cosmos/iavl/benchmarks	43.513s

Benchmarks Interpretation
Highlighting the difference in performance from the latest benchmarks:

Old branch is dev/iavl_data_locality
New branch is roman/fast-node-get-set

Initial size: 100,000 key-val pairs
Block size: 100 keys
Key length: 16 bytes
Value length: 40 bytes

Query with no guarantee of the key being in the tree:

  • Old: 22354 ns/op
  • New on regular logic: 18046 ns/op
  • New on fast logic: 4938 ns/op
  • New fast logic shows a 77% decrease in time relative to the old branch

Query with the key guaranteed to be in the latest tree:

  • Old: 27137 ns/op
  • New on regular logic: 23126 ns/op
  • New on fast logic: 1684 ns/op
  • New fast logic shows a 93% decrease in time relative to the old branch

Iteration:

  • Old: 2285116100 ns/op
  • New on old logic: 1716585400 ns/op
  • New on fast logic: 94702442 ns/op
  • New fast logic shows a 96% decrease in time relative to the old branch

Update:
run Set, if this is a try that is divisible by blockSize, attempt to SaveVersion and if the latest saved version number history exceeds 20, delete the oldest version

  • Old: 307266 ns/op
  • New: 257683 ns/op
  • New logic shows a 16% decrease in time relative to the old branch

Block:
for block size, run Get and Set. At the end of the block, SaveVersion and if the latest saved version number history exceeds 20, delete the oldest version

  • Old: 40663600 ns/op
  • New: 44907345 ns/op
  • New logic shows a 9% increase in time relative to the old branch

@p0mvn p0mvn linked an issue Jan 12, 2022 that may be closed by this pull request
10 tasks
fast_node.go Outdated Show resolved Hide resolved
immutable_tree.go Outdated Show resolved Hide resolved
mutable_tree.go Outdated
// GetVersionedFast gets the value at the specified key and version. The returned value must not be
// modified, since it may point to data stored within IAVL. GetVersionedFast utilizes a more performant
// strategy for retrieving the value than GetVersioned but falls back to regular strategy if fails.
func (tree *MutableTree) GetVersionedFast(key []byte, version int64) []byte {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar reasoning to this

@p0mvn p0mvn changed the title WIP: FastNode Get and Set FastNode Get and Set Jan 15, 2022
@p0mvn p0mvn marked this pull request as ready for review January 15, 2022 19:07
@@ -0,0 +1,66 @@
root@ubuntu-s-1vcpu-1gb-nyc1-01:~/iavl# cat bench_fast.txt
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left here for reviewers, will remove before merge

@ValarDragon
Copy link
Member

Nice job on this! Also thanks for committing all the benchmarks! The CacheHit query speedup is amazing!

Haven't done a full review yet, but I was a bit surprised by why CacheMiss was significantly slower. I looked through the code, and I think the existing benchmark is more accurately "query not guaranteed to be in tree", and is incorrectly named.

One other thing we can do to make this faster, is that in immutable tree, if immutable tree version = node db latest version, then FastNode Miss implies that the key isn't in that version, so no reason to do the full search. (That way all queries against latest state are getting the speedup)

@p0mvn
Copy link
Member Author

p0mvn commented Jan 15, 2022

Thanks for the feedback. Yes, I agree with the naming and will change it to reflect that.

Great suggestion on making it faster! I'll benchmark it again with the fix

@p0mvn
Copy link
Member Author

p0mvn commented Jan 16, 2022

I implemented the suggested fix with the latest commit. It has significantly improved all queries.

However, updates and blocks are worse off now. The fix made me discover that I should have been updating the version for fast nodes on disk to be above the deleted version when calling a variation of DeleteVersion instead of simply deleting them (Example).

This is done to keep fast nodes on the disk to be consistent with the live state. If we were to simply delete the fast node from disk on deletion, then it would be impossible to implement your suggestion as the live state would diverge from fast node state.

The latest bench test got OOM killed for the last update and block (I verified with dmesg). It still gives us a good idea in terms of performance and is attached here. Please let me know what you think...

@p0mvn p0mvn linked an issue Jan 20, 2022 that may be closed by this pull request
@p0mvn p0mvn changed the title FastNode Get and Set Fast Cache - Get, Set and Iteration Jan 20, 2022
@p0mvn p0mvn changed the title Fast Cache - Get, Set and Iteration Fast Cache - Get, Set, Iteration and Automatic Migration Jan 21, 2022
@p0mvn p0mvn linked an issue Jan 21, 2022 that may be closed by this pull request
2 tasks
versionLastUpdatedAt: ver,
value: val,
}

return fastNode, nil
}

func (node *FastNode) encodedSize() int {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same method of regular node here looks like the following:

func (node *Node) encodedSize() int {
	n := 1 +
		encodeVarintSize(node.size) +
		encodeVarintSize(node.version) +
		encodeBytesSize(node.key)
	if node.isLeaf() {
		n += encodeBytesSize(node.value)
	} else {
		n += encodeBytesSize(node.leftHash) +
			encodeBytesSize(node.rightHash)
	}
	return n
}

I don't understand what is that extra byte in the beginning for. At first, I though that it might be used for prefix but it looks like encodedSize is only used for measuring the value. I don't think we need an extra byte for the null terminator since we encode each field with its own terminator. Could someone explain what is that extra byte meant for, please?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH I have no idea either. If thing are serializing + deserializing correctly without it, then it should be fine I hope?

(Its very possible this was just always an error, super under-documented code base with up until recently a bus factor of one)

@p0mvn p0mvn marked this pull request as draft January 22, 2022 01:19

return tree.Hash(), version, nil
}

func (tree *MutableTree) saveFastNodeVersion() error {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm planning to look into the following:

  • if removing before writing to db is faster
  • if combining sorted removals with sorted additions is faster than doing them one after the other

If anyone knows, what patterns are the most efficient or how I can optimize this better, please let me know

@p0mvn p0mvn marked this pull request as ready for review January 22, 2022 19:18
@p0mvn p0mvn requested a review from ValarDragon January 22, 2022 19:18
fast_iterator.go Outdated Show resolved Hide resolved
@@ -11,7 +12,7 @@ type FastNode struct {
key []byte
Copy link
Member

@ValarDragon ValarDragon Jan 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wdyt about removing key? Think we should keep it or remove it to help reduce the committed data overhead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the key from being written as a value of the fast node on the disk. We still need to keep it as a member of the struct for retrieval from disk and managing it in memory

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh awesome, great trade-off taken!

Comment on lines +14 to +15
// MockDB is a mock of DB interface.
type MockDB struct {
Copy link
Member

@ValarDragon ValarDragon Jan 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whats mockdb / gomock btw? I'm just not familiar

Seems pretty cool / like its code generator for something that satisfies an interface, but you can easily edit to tweak things?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It creates a mock implementation of an interface. Here, I created a mock of the database interface - MockDb. It allows controlling the behavior of the mock to enter any code branch for test coverage. I mostly used it to simulate DB errors in this PR which were difficult to simulate with the mem DB that we normally use for testing

increment *= -1
}

for startIdx < endIdx {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this work when !ascending, doesn't startIdX = len(mirror) - 1, and endIdX = 0 ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was completely wrong. Thanks for catching that. Fixed now

p0mvn and others added 26 commits February 11, 2022 08:54
…12)

* add leaf hash to fast node and unit test

* refactor get with index and get by index, fix migration in load version and lazy load version

* use Get in GetVersioned of mutable tree

* refactor non membership proof to use fast storage if available

* bench non-membership proof

* fix bench tests to work with the new changes

* add downgrade-reupgrade protection and unit test

* remove leaf hash from fast node

* resolve multithreading bug related to iterators not being closed

* clean up

* use correct tree in bench tests

* add cache to tree used to bench non membership proofs

* add benc tests for GetWithIndex and GetByIndex

* revert GetWithIndex and GetByIndex

* remove unused import

* unit test re-upgrade protection and fix small issues

* remove redundant setStorageVersion method

* fix bug with appending to live stage version to storage version and nit test

* add comment for setFastStorageVersionToBatch

* refactor and improve unit tests for reupgrade protection

* rename ndb's isFastStorageEnabled to hasUpgradedToFastStorage and add comments

* comment out new implementation for GetNonMembershipProof

* update comments in nodedb to reflect the difference between hasUpgradedToFastStorage and shouldForceFastStorageUpdate

* refactor nodedb tests

* downgrade tendermint to 0.34.14 - osmosis's latest cosmos sdk does not support 0.35.0

* fix bug where fast storage was not enabled when version 0 was attempted to be loaded

* implement unsaved fast iterator to be used in mutable tree (#16)
* expose isUpgradeable method on mutable tree and unit test

* go fmt
@p0mvn p0mvn force-pushed the roman/fast-node-get-set branch from c983c62 to f0f815e Compare February 11, 2022 16:55
@p0mvn p0mvn merged commit 32f0265 into dev/iavl_data_locality Feb 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
4 participants