
IAVL panic with missing value #261

Closed
erikgrinaker opened this issue Jun 8, 2020 · 16 comments
Assignees: erikgrinaker
Labels: T:bug Type: Bug (Confirmed)

Comments

@erikgrinaker
Contributor

erikgrinaker commented Jun 8, 2020

A user is reporting the following panics with the SDK (0.38.4) and options pruning=everything:

I[2020-06-08|14:01:38.646] starting ABCI with Tendermint                module=main 
panic: Value missing for hash 190e17e5722f481b33536af9525e21abd9e9a02e8e4fed46e3921010899ab65c corresponding to nodeKey n�r/HSj�R^!���.�O�F����\
goroutine 1 [running]:
github.com/tendermint/iavl.(*nodeDB).GetNode(0xc000df6580, 0xc002a77f60, 0x20, 0x20, 0x0)
        /home/ubuntu/go/pkg/mod/github.com/tendermint/[email protected]/nodedb.go:109 +0x7bf

This does not happen when pruning is disabled.

We've seen similar panics related to concurrency, see e.g. #240.

@erikgrinaker erikgrinaker added T:bug Type: Bug (Confirmed) P:High labels Jun 8, 2020
@erikgrinaker erikgrinaker self-assigned this Jun 8, 2020
@erikgrinaker erikgrinaker changed the title IAVL panic on missing value IAVL panic with missing value Jun 8, 2020
@alexanderbez
Contributor

Turns out the user changed pruning strategies in-flight (via restart). While this behavior should be supported, I doubt it's functional; the failure is most likely due to the fact that we attempted to prune a version that did not exist (i.e. prune=nothing => prune=everything).

@zmanian
Member

zmanian commented Jun 9, 2020

Based on the latest comment in cosmos/cosmos-sdk#6370, the problem can be reproduced when pruning="everything" is set from genesis.

The expected behavior here is that if pruning is "everything", the ABCI application would replay every block from genesis against an empty IAVL tree to restore state to the current block height, and it would never look for a snapshot.

@zmanian
Member

zmanian commented Jun 9, 2020

I have some long-running prune=everything nodes from GoZ phase 3 that I haven't tried to restart yet; I'll do that to see if I can reproduce.

@erikgrinaker
Contributor Author

The expected behavior here is that if pruning is "everything", the ABCI application would replay every block from genesis against an empty IAVL tree to restore state to the current block height, and it would never look for a snapshot.

You mean this is the expected buggy behavior, not the desired behavior, right? With pruning=everything, I would expect the application to retain the latest version on disk, and start from that.

@erikgrinaker
Contributor Author

erikgrinaker commented Jun 10, 2020

I suspect this may be as simple as DeleteVersion() being unsafe. Running with pruning=everything means that we are deleting a version for every single height, and from what I can tell DeleteVersion() can leave incomplete data on unclean shutdown, for two reasons:

  • IAVL first deletes nodes, then the version root - this can lead to an incomplete version still being visible.

  • GoLevelDB batches may not be atomic, for small batches (<4 MB by default).

I'll try to verify this - it's a bit unclear exactly how this might lead to the observed crashes. The fixes should be fairly straightforward, but may have performance implications.
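
To make the two concerns concrete, here's a minimal sketch against plain GoLevelDB (github.com/syndtr/goleveldb), staging a deletion as a single fsynced batch. This is illustrative only, not how IAVL's nodeDB is actually structured, and the key prefixes are made up:

package main

import (
	"log"

	"github.com/syndtr/goleveldb/leveldb"
	"github.com/syndtr/goleveldb/leveldb/opt"
)

func main() {
	db, err := leveldb.OpenFile("example.db", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Stage the node deletions and the version root deletion in one batch, so that
	// (if the batch write is atomic) a crash cannot leave a version root pointing
	// at already-deleted nodes.
	batch := new(leveldb.Batch)
	batch.Delete([]byte("n/<node hash>")) // hypothetical keys, for illustration only
	batch.Delete([]byte("r/<version>"))

	// Sync forces an fsync for this write. Whether small batches are written
	// atomically is exactly the second concern above.
	if err := db.Write(batch, &opt.WriteOptions{Sync: true}); err != nil {
		log.Fatal(err)
	}
}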

@erikgrinaker
Contributor Author

erikgrinaker commented Jun 10, 2020

Confirmed: I'm seeing this with a default iavl.NewMutableTree using GoLevelDB and no IAVL pruning. Deleting the previous version after saving the current one (which is the same thing the SDK does for its own IAVL pruning) gives this:

Saved version 66
Deleted version 65
^Csignal: interrupt
$ go run ./main.go 
Loaded version 66
panic: Value missing for hash 0654c14468757f8c2afec9ab3cb2b38c3de8c612bb865439f0608a9b0dc65eea corresponding to nodeKey 6e0654c14468757f8c2afec9ab3cb2b38c3de8c612bb865439f0608a9b0dc65eea

goroutine 1 [running]:
github.com/tendermint/iavl.(*nodeDB).GetNode(0xc00018e300, 0xc000180f20, 0x20, 0x20, 0x0)
	/Users/erik/Projects/tendermint/iavl/nodedb.go:189 +0x7bf
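
For reference, the reproduction is roughly the following loop. This is a sketch assuming the v0.13-era API (e.g. NewMutableTree returning an error), and the database name and cache size are arbitrary. Run it, interrupt it with Ctrl-C, and run it again:

package main

import (
	"fmt"
	"log"

	"github.com/tendermint/iavl"
	dbm "github.com/tendermint/tm-db"
)

func main() {
	db, err := dbm.NewGoLevelDB("repro", ".")
	if err != nil {
		log.Fatal(err)
	}

	tree, err := iavl.NewMutableTree(db, 128) // error return assumed (v0.13-era API)
	if err != nil {
		log.Fatal(err)
	}
	version, err := tree.Load()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("Loaded version", version)

	for {
		// Overwrite the same key every version.
		tree.Set([]byte("key"), []byte(fmt.Sprintf("value-%v", tree.Version()+1)))
		_, version, err := tree.SaveVersion()
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println("Saved version", version)

		// Delete the previous version, mimicking the SDK's own IAVL pruning.
		if version > 1 {
			if err := tree.DeleteVersion(version - 1); err != nil {
				log.Fatal(err)
			}
			fmt.Println("Deleted version", version-1)
		}
	}
}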

@erikgrinaker
Contributor Author

Some more findings:

  • The problem occurs both with uncontrolled Ctrl-C abort and controlled os.Exit().

  • There is no problem if the program is kept running, producing and deleting new versions.

  • The problem occurs even when changing the GoLevelDB backend to use transactions for batches, enabling fsync, and explicitly closing the database before exiting.

  • The problem occurs with both RocksDB and BoltDB as well.

This appears to be a problem within IAVL itself, somehow related to in-memory state - I'm going to look closer at the orphan handling.

@alexanderbez
Contributor

alexanderbez commented Jun 10, 2020

Thanks for the great analysis @erikgrinaker! Are we confident this isn't an issue with the application's use of IAVL during Commit? During Commit we delete a version, as you've pointed out, and sending a SIGINT at that point can mess things up, right?

What if we performed the DeleteVersion calls after committing somehow? Either via defer, or by processing DeleteVersion during the next commit (i.e. delaying it by one height).
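
Something like the delay-by-one-height variant could look roughly like this; commitState and its fields are hypothetical application-side names, only SaveVersion and DeleteVersion are actual IAVL calls:

package app

import "github.com/tendermint/iavl"

// commitState is a hypothetical wrapper around the application's working tree.
type commitState struct {
	tree         *iavl.MutableTree
	pruneVersion int64 // version scheduled for deletion at the next commit
}

// Commit saves the working version and defers pruning of the previous version
// to the following commit, so an interrupt during this commit cannot interleave
// a SaveVersion with a DeleteVersion of the version it depends on.
func (cs *commitState) Commit() ([]byte, int64, error) {
	// Process the deletion scheduled by the previous commit.
	if cs.pruneVersion > 0 {
		if err := cs.tree.DeleteVersion(cs.pruneVersion); err != nil {
			return nil, 0, err
		}
		cs.pruneVersion = 0
	}

	hash, version, err := cs.tree.SaveVersion()
	if err != nil {
		return nil, 0, err
	}

	// Schedule the now-previous version for deletion one height later.
	if version > 1 {
		cs.pruneVersion = version - 1
	}
	return hash, version, nil
}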

@erikgrinaker
Contributor Author

I think there are multiple related problems here (including unclean shutdowns, as you point out).

However, even after eliminating all other error sources, there is an underlying problem with orphans in IAVL - it only occurs when replacing a key, not when writing new keys. I'm going to focus on that for now, then we can work our way up and fix any related problems.

I don't feel like the application should have to do anything special for this to be safe. We may want to add a Close() method to the tree, which the application should call before exiting, but IAVL should also do whatever it can to prevent corruption in a default configuration.

@erikgrinaker
Contributor Author

erikgrinaker commented Jun 11, 2020

The restarts were a red herring. I've been able to reproduce this by creating and deleting a sequence of versions with specific key/value pairs - added a test-case for it. I'm going through the orphan handling to identify the problem, but it's unfamiliar and non-trivial code so it may take a little while.

@erikgrinaker
Contributor Author

erikgrinaker commented Jun 11, 2020

This turned out to be caused by a subtle flaw in the test case - consider this snippet:

key := make([]byte, 16)
value := make([]byte, 16)
random.Read(key)
random.Read(value)
// if we hit an existing key, regenerate and set a new key
for tree.Set(key, value) {
    random.Read(key)
}

Notice that the inner random.Read(key) call reads into the same byte slice that was originally passed to Set(). The Set() call will have placed this slice into the IAVL node, and I think it has somehow made its way into the orphans list as well. So when we overwrite the key, we also change the nodes in the tree (giving them the wrong node position for their key), and I think this somehow also changes which orphan nodes are scheduled for removal, causing the node to be missing in the next version.

Changing the loop as follows makes the panic go away:

for tree.Set(key, value) {
    key = make([]byte, 16)
    random.Read(key)
}

On one hand I'm relieved that this was a problem with the test rather than IAVL, but on the other hand I'm back to square one for debugging this crash. I would not be surprised if the user crash is caused by a similar problem in the SDK, since this is a very easy mistake to make, and I don't believe this is even documented for IAVL (although it is for tm-db).

As a side-note, MemDB in tm-db 0.5.0 also has this behavior (given and returned keys and values are pointers to the actual database items, for performance). In MemDB 0.4.0, as an implementation detail, the key was copied due to a string-cast for use as a map key, while the value was stored as a pointer. It is quite possible that this change has triggered similar bugs, although MemDB should not be involved at all with pruning=everything for GoLevelDB-based IAVL trees.
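
For what it's worth, the defensive-copy pattern against tm-db looks like this (a sketch assuming the 0.5-era interface where Set and Get return errors; the point is the copy before handing slices to the store, not this particular backend):

package main

import (
	"fmt"
	"log"

	dbm "github.com/tendermint/tm-db"
)

func main() {
	db := dbm.NewMemDB()

	key := []byte("key-1")
	value := []byte("value-1")

	// Copy before handing the slices to the store, since the store may retain
	// the slices it is given rather than copying them.
	keyCopy := append([]byte(nil), key...)
	valueCopy := append([]byte(nil), value...)
	if err := db.Set(keyCopy, valueCopy); err != nil {
		log.Fatal(err)
	}

	// Mutating the original slices is now harmless: the stored entry is unaffected.
	key[len(key)-1] = '2'

	stored, err := db.Get([]byte("key-1"))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("stored value: %s\n", stored)
}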

I'm considering further approaches to debugging this.

@erikgrinaker
Contributor Author

erikgrinaker commented Jun 12, 2020

I've been able to reproduce this again, have written a test-case here:

func TestDeleteVersion_issue261(t *testing.T) {

The behavior is basically this:

Use KeepEvery: 5 and KeepRecent: 1. Whenever we save a version to disk (every 5 versions), delete the previously saved version (so after saving 10, delete 5). This follows the SDK's pruning behavior. Stop producing versions after version 12.

Open a new tree loading version 10, and write 12 new versions with the same key/value pairs as in the previous run. After version 15 is saved and version 10 is deleted, IAVL panics while generating version 16:

panic: Value missing for hash e61c06828e90b91404a9442f74d33f3b89f9d9fa93960df459a7ca313a99338b corresponding to nodeKey 6ee61c06828e90b91404a9442f74d33f3b89f9d9fa93960df459a7ca313a99338b [recovered]
	panic: Value missing for hash e61c06828e90b91404a9442f74d33f3b89f9d9fa93960df459a7ca313a99338b corresponding to nodeKey 6ee61c06828e90b91404a9442f74d33f3b89f9d9fa93960df459a7ca313a99338b

I don't think there are any flaws in the test case this time, but would appreciate a second pair of eyes @alexanderbez.

I tested this with the original 0.13.0 release as well (before the tm-db changes), and the panic is present there as well.
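
For reference, the pruning configuration in the test corresponds roughly to the setup below. The constructor and option names follow my recollection of the v0.13-era pruning API (PruningOptions, NewMutableTreeWithOpts), so treat the exact signatures as assumptions:

package main

import (
	"log"

	"github.com/tendermint/iavl"
	dbm "github.com/tendermint/tm-db"
)

func main() {
	snapshotDB := dbm.NewMemDB() // the real test uses an on-disk DB; MemDB keeps the sketch self-contained
	recentDB := dbm.NewMemDB()

	// KeepEvery=5, KeepRecent=1: persist every 5th version to disk and keep one
	// recent version in memory, matching the scenario described above.
	opts := iavl.PruningOptions(5, 1)
	tree, err := iavl.NewMutableTreeWithOpts(snapshotDB, recentDB, 128, opts)
	if err != nil {
		log.Fatal(err)
	}
	_ = tree
}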

@erikgrinaker
Contributor Author

erikgrinaker commented Jun 15, 2020

I've found the problem, and written a minimal test case:

func TestDeleteVersion_issue261_minimal(t *testing.T) {

It boils down to orphan entries being incorrectly persisted to disk for changes that were only made in memory. When the tree is then reloaded and different changes are made, these stale orphan entries remove nodes that are now in use, once the versions the orphans were visible in are deleted. Consider this history, with keepEvery=2 keepRecent=1:

  • Save version 1 (memory): set a=1 b=2
  • Save version 2 (disk): set c=3 d=4
  • Save version 3 (memory): set d=0 → orphan d=4 visible in versions <= 2 (persisted to disk)
  • Restart, load version 2 from disk
  • Save version 3 (memory): set a=0
  • Save version 4 (disk): set b=0
  • Delete version 2 (disk) → orphan for d=4 visible for version <= 2, no longer needed so remove node d=4
  • Load d=4 → panic, because the node no longer exists

This usually isn't a problem, because the replayed in-memory versions will be the same as before (since it's the same blocks). However, if the pruning settings have been changed, then these orphans can get removed before the change that orphaned them is made, causing this crash. There may be other corner-cases as well.
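
In code, the first half of that history looks roughly like this (a sketch that assumes a tree constructed with keepEvery=2 / keepRecent=1 as in the earlier snippet; the restart and replay steps are only described in comments):

package iavlrepro

import "github.com/tendermint/iavl"

// issue261Sketch replays the history described above; it is an illustration of
// the scenario, not the actual test case in the repository.
func issue261Sketch(tree *iavl.MutableTree) error {
	save := func(kvs map[string]string) error {
		for k, v := range kvs {
			tree.Set([]byte(k), []byte(v))
		}
		_, _, err := tree.SaveVersion()
		return err
	}

	// Version 1 (memory) and version 2 (disk).
	if err := save(map[string]string{"a": "1", "b": "2"}); err != nil {
		return err
	}
	if err := save(map[string]string{"c": "3", "d": "4"}); err != nil {
		return err
	}

	// Version 3 (memory): overwriting d orphans the d=4 node, and the orphan
	// entry is persisted to disk even though version 3 only lives in memory.
	if err := save(map[string]string{"d": "0"}); err != nil {
		return err
	}

	// Restart here and reload version 2 from disk, then replay different
	// versions 3 and 4 (set a=0, then b=0). Deleting version 2 afterwards
	// processes the stale d=4 orphan and removes a node that is still
	// referenced, so the next read of d panics with "Value missing for hash".
	return nil
}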

Will begin work on a fix.

@alexanderbez
Contributor

What is the proposed fix @erikgrinaker?

@erikgrinaker
Contributor Author

I don't know yet. We'll somehow have to avoid persisting the snapshot orphans to disk until a version is actually saved to disk. However, we can't use the recentBatch, since we keep fewer versions in memory than the interval between persisted versions, which means the recentDB has its own orphaning and pruning schedule.

One quick fix may be to maintain a separate batch for snapshot orphans that haven't been persisted yet. But I feel like the correct fix would be to change the pruning logic to remove KeepRecent, and to always keep all versions in memory since the last persisted version - that would also simplify a lot of the logic, and avoid many edge cases.

I will look closer tomorrow.

@erikgrinaker
Contributor Author

This was "fixed" by #274.
