IAVL panic with missing value #261
Turns out the user changed pruning strategies in-flight (via a restart). While this behavior should be supported, I doubt it's functional, most likely because we attempted to prune a version that did not exist (i.e. …).
Based on the latest comment in cosmos/cosmos-sdk#6370, the problem can be reproduced when … The expected behavior here is that if pruning is `everything`, the ABCI application replays every block from genesis against an empty IAVL tree to restore state to the current block height, and it never looks for a snapshot.
I have some long-running prune=everything nodes from GoZ phase 3 that I haven't tried to restart yet to see if I can reproduce this.
You mean this is the expected buggy behavior, not the desired behavior, right? With …
I suspect this may be as simple as …
I'll try to verify this; it's a bit unclear exactly how this might lead to the observed crashes. The fixes should be fairly straightforward, but may have performance implications.
Confirmed, I'm seeing this with a default …
Some more findings: …
This appears to be a problem within IAVL itself, somehow related to in-memory state. I'm going to look closer at the orphan handling.
Thanks for the great analysis @erikgrinaker! Are we confident this isn't an issue with the application's use of IAVL during …? What if we perform the …?
I think there are multiple related problems here (including unclean shutdowns, as you point out). However, even after eliminating all other error sources, there is an underlying problem with orphans in IAVL: it only occurs when replacing a key, not when writing new keys. I'm going to focus on that for now; then we can work our way up and fix any related problems. I don't feel like the application should have to do anything special for this to be safe. We may want to add a …
The restarts were a red herring. I've been able to reproduce this by creating and deleting a sequence of versions with specific key/value pairs, and have added a test case for it. I'm going through the orphan handling to identify the problem, but it's unfamiliar and non-trivial code, so it may take a little while.
This turned out to be caused by a subtle flaw in the test case. Consider this snippet:

```go
key := make([]byte, 16)
value := make([]byte, 16)
random.Read(key)
random.Read(value)
// if we hit an existing key, regenerate and set a new key
for tree.Set(key, value) {
	random.Read(key)
}
```

Notice that the inner `random.Read(key)` overwrites the same byte slice that was just handed to `tree.Set`, mutating data the tree still references. Changing the loop as follows makes the panic go away:

```go
for tree.Set(key, value) {
	key = make([]byte, 16)
	random.Read(key)
}
```

On one hand I'm relieved that this was a problem with the test rather than IAVL, but on the other hand I'm back to square one for debugging this crash. I would not be surprised if the user's crash is caused by a similar problem in the SDK, since this is a very easy mistake to make, and I don't believe this is even documented for IAVL (although it is for tm-db). As a side note, I'm considering further approaches to debugging this.
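Since this mistake is easy to make and apparently undocumented for IAVL, one way for calling code to rule it out is a defensive-copy wrapper around `Set`. This is only a sketch of that idea, not an IAVL API; the package and function names are illustrative, and it assumes (as the workaround above implies) that the tree may retain references to the slices it is given:

```go
package iavlutil

import "github.com/cosmos/iavl"

// setCopied copies key and value before handing them to the tree, so the
// caller can freely reuse or mutate its own slices afterwards without
// touching data the tree may still reference. It returns tree.Set's
// result (true if an existing key was updated).
func setCopied(tree *iavl.MutableTree, key, value []byte) bool {
	k := append([]byte(nil), key...)
	v := append([]byte(nil), value...)
	return tree.Set(k, v)
}
```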
I've been able to reproduce this again and have written a test case here: Line 270 in d554b88
The behavior is basically this: use …, then open a new tree, loading version 10, and write 12 new versions with the same key/value pairs as from the previous run. After version 15 is saved and version 10 is deleted, IAVL panics while generating version 16: …
I don't think there are any flaws in the test case this time, but I would appreciate a second pair of eyes @alexanderbez. I tested this with the original 0.13.0 release as well (before the tm-db changes), and the panic is present there too.
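For orientation, here is a rough, self-contained approximation of that flow. It is not the actual test case: the pruning settings from the report are elided above, so the tree constructor is left to the caller, an explicit `LoadVersion` stands in for the restart, and names like `reproduceSketch` are just for illustration.

```go
package iavlrepro

import (
	"fmt"

	"github.com/cosmos/iavl"
)

// reproduceSketch approximates the reported flow. openTree must open a
// tree over the same underlying database, with the same pruning settings,
// each time it is called (those settings are elided in the report).
func reproduceSketch(openTree func() (*iavl.MutableTree, error)) error {
	// Deterministic writes, so replayed versions always get the same
	// key/value pairs, the way block replay would.
	write := func(tree *iavl.MutableTree, version int) error {
		key := []byte(fmt.Sprintf("key-%d", version%4))
		value := []byte(fmt.Sprintf("value-%d", version))
		tree.Set(key, value)
		_, _, err := tree.SaveVersion()
		return err
	}

	// First run: write versions 1..10; under the pruning settings only
	// some of these end up on disk, the rest stay in memory.
	tree, err := openTree()
	if err != nil {
		return err
	}
	for v := 1; v <= 10; v++ {
		if err := write(tree, v); err != nil {
			return err
		}
	}

	// "Restart": open a fresh tree over the same database and load the
	// last persisted version (version 10 in the report).
	tree, err = openTree()
	if err != nil {
		return err
	}
	if _, err := tree.LoadVersion(10); err != nil {
		return err
	}

	// Replay 12 more versions. In the report, the panic fires while
	// generating version 16, right after version 15 is saved and
	// version 10 is deleted by the pruning schedule.
	for v := 11; v <= 22; v++ {
		if err := write(tree, v); err != nil {
			return err
		}
	}
	return nil
}
```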
I've found the problem and written a minimal test case: Line 336 in 60acfb1
It boils down to orphan entries being incorrectly persisted to disk for changes that only exist in memory. If the tree is then reloaded and different changes are made, those orphan entries remove nodes that are still in use once the versions the orphans were visible in are deleted. Consider this history, with keepEvery=2 keepRecent=1: …
This usually isn't a problem, because the replayed in-memory versions will be the same as before (since the same blocks are replayed). However, if the pruning settings have been changed, these orphans can get removed before the change that orphaned them is made, causing this crash. There may be other corner cases as well. I'll begin work on a fix.
What is the proposed fix, @erikgrinaker?
I don't know yet. We'll somehow have to avoid persisting the snapshot orphans to disk until a version is actually saved to disk. However, we can't use the recentBatch, since we keep fewer versions in memory than the interval between persisted versions, which means the recentDB has its own orphaning and pruning schedule. One quick fix may be to maintain a separate batch for snapshot orphans that haven't been persisted yet. But I feel like the correct fix would be to change the pruning logic to remove … I will look closer tomorrow.
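As a rough illustration of the quick-fix idea (buffering snapshot orphans that haven't been persisted yet), here is a sketch. The type and method names are hypothetical, not actual IAVL internals, and a real fix would have to hook into the tree's save and prune paths:

```go
package iavlsketch

// pendingOrphans buffers orphan records produced while a version only
// exists in memory, instead of writing them straight to the snapshot
// (disk) batch. They are flushed only when a version is actually
// persisted, so a restart cannot observe orphans for changes that never
// made it to disk.
type pendingOrphans struct {
	byHash map[string]int64 // orphaned node hash -> version that orphaned it
}

func newPendingOrphans() *pendingOrphans {
	return &pendingOrphans{byHash: make(map[string]int64)}
}

// add records an orphan in memory without touching disk.
func (p *pendingOrphans) add(hash []byte, orphanedAt int64) {
	p.byHash[string(hash)] = orphanedAt
}

// flush hands the buffered orphans to writeOrphan (e.g. a function that
// appends them to the snapshot batch) and clears the buffer. It should
// only be called when the corresponding version is really being written
// to disk.
func (p *pendingOrphans) flush(writeOrphan func(hash []byte, orphanedAt int64) error) error {
	for hash, version := range p.byHash {
		if err := writeOrphan([]byte(hash), version); err != nil {
			return err
		}
	}
	p.byHash = make(map[string]int64)
	return nil
}
```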
This was "fixed" by #274. |
A user is reporting the following panics with the SDK (0.38.4) and the option `pruning=everything`: …
This does not happen when pruning is disabled.
We've seen similar panics related to concurrency; see e.g. #240.