-
Notifications
You must be signed in to change notification settings - Fork 767
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Do not crash on block gap in displaced_leaves_after_finalizing
#4997
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for tagging!
substrate/client/db/src/lib.rs
Outdated
{ | ||
let displaced = | ||
blockchain.displaced_leaves_after_finalizing(a3_hash, a3_number).unwrap(); | ||
assert_eq!(displaced.displaced_leaves, vec![(a1_number, a1_hash)]); | ||
assert_eq!(displaced.displaced_blocks, vec![]); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This assumes a specific narrow case of warp sync: we download future blocks and then sync the gap. In Subspace's Snap sync we actually have a similar situation, but without gap sync afterwards. We have to clear the gap manually due to Substrate's hardcoded notion of warp sync, see #4607 and issue linked from there. Also currently we are only able to do a single jump from genesis to a future block and no more jumps after that, which is another annoying limitation in Substrate that we'd like to lift if possible.
Another case that seems to be missing here is this:
// g -> a1 -> a2 -> <unimported> -> a3 -> a4
// \-> b1 -> b2
If I understand proposed change correctly, it will remove both a2
and b2
leaves, while blocks themselves will remain in the database. Later when gap sync catches up, it'll connect to either a2
or b2
(or maybe some c3
even, I am not familiar with warp sync logic), which means there will be blocks that will essentially "leak" and never be removed from the database due to them not being part of any of the active leafs and thus never discovered as displaced blocks by displaced_leaves_after_finalizing
again.
The correct solution would be to simply delay the decision with two cases:
- for warp sync it is sufficient to simply delay the decision until gap sync fills unimported gap and then decide what to prune with information of the full finalized chain
- for other sync strategies like one used in Subspace we still need to delay the decision, but there will never be a decision, we can only wait for leaves that are 100% outside of pruning depth anyway, at which point the whole thing all the way down to genesis needs to be pruned unconditionally. I'm also wondering if it applies to warp sync if it is used on non-archival nodes (again, I'm not familiar with implementation, so let me know if warp sync is not applicable to non-archival nodes anyway).
For 2. above I think displaced_leaves_after_finalizing
should return these leafs with "unknown" state such that they can be acted upon based on pruning logic. And pruning logic will need to be extended to support 2. of course.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Defering the decision is fair. It is also what earlier implementations where doing. I changed the logic to not mark any leafs as displaced that have no current connection to the to-be-finalized block.
Regarding 2., I am not fully able to follow your explanation. I don't think it is in scope for this PR, which is only focusing on not making the node crash in case of finalization gap. This should restore feature parity to earlier implementations while maintaining the speed boosts from your improvements. We can then further iterate on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe not 100% in the scope, but definitely closely related. Simply imagine how would you prune blocks if gap is never synced after creation (because the chain doesn't need it).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not saying we should not do it. I just want to have it in a follow up PR. I will not be able to work in the next days and would like to have this merged sooner rather than later due to the crashing. Once I am back I can take another look at it.
if leaf_number == Zero::zero() { | ||
debug!(target: crate::LOG_TARGET, %leaf_hash, "Leaf is genesis hash, removing."); | ||
if finalized_block_number > leaf_number { | ||
result.displaced_leaves.push((leaf_number, leaf_hash)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't it also be added to displaced_blocks
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm so my idea here was that displaced_leaves is for removing entries from the leaves
, which is what we want. It is however not really a displaced branch where the genesis block itself needs to be pruned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then this needs to be delayed just like everything else until the gap is closed, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is again going back to my other topic that this shouldn't assume gap will necessarly be closed at any point in the future. This genesis block may end up being pruned instead, but we need to be able to know that it is a detached leaf in order to be able to do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, the correct generic way of handling this would be to delay the genesis block too. It comes however at the cost of performance of this method. If we don't remove the genesis block here, we will always have more than 2 leaves until the block gap is closed (if ever, as you say). So while the block gap is not closed, we will always iterate from the block-to-be-finalized to the first block after the gap. So by removing this genesis leaf here, we can avoid this iteration at the cost of leaking at most the genesis block, if the gap is never closed.
However, if we had the pruning handling of before-gap blocks outside the pruning window you described in the other comment, we would not have this problem. Currently I am leaning in favor of removing the genesis leaf handling "hack" here and implementing the pruning improvements rather soon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, it is not up to me to decide, but for completeness after each individual PR I'd probably still not mark genesis block as displaced. Either way I think we're on the same page here and if the change happens shortly afterwards then it is fine with me as well, just want to make sure it does happen.
blockchain.displaced_leaves_after_finalizing(a3_hash, a3_number).unwrap(); | ||
assert_eq!(blockchain.leaves().unwrap(), vec![a4_hash, genesis_hash]); | ||
assert_eq!(displaced.displaced_leaves, vec![(genesis_number, genesis_hash)]); | ||
assert_eq!(displaced.displaced_blocks, vec![]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think displaced blocks should be empty if there are displaced leaves. There should be at least the same block as displaced leaf, potentially more blocks depending on branch length. See displaced_leaves_after_finalizing_works
test where every time we have displaced leaf we also have corresponding block(s) displaced as well.
substrate/client/db/src/lib.rs
Outdated
{ | ||
let displaced = | ||
blockchain.displaced_leaves_after_finalizing(a3_hash, a3_number).unwrap(); | ||
assert_eq!(displaced.displaced_leaves, vec![(a1_number, a1_hash)]); | ||
assert_eq!(displaced.displaced_blocks, vec![]); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe not 100% in the scope, but definitely closely related. Simply imagine how would you prune blocks if gap is never synced after creation (because the chain doesn't need it).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes make sense to me and there is a clear path on where to go from here 👍
// Even though there is a disconnect, diplace should still detect | ||
// branches above the block gap. | ||
// /-> c4 | ||
// g -> a1 -> <unimported> -> a3 -> a4(f) | ||
// \-> b1 -> b2 ----------> b3 -> b4 -> b5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good example that made me think we could also probably detect that b5/b4/b3 are also clearly not going to make it and remove them changing leaf from b5 to b2, though that might be more tricky to implement than really worth it.
These test are really a nice way to understand the behavior.
.push_front(MinimalBlockMetadata::from(&self.header_metadata(finalized_block_hash)?)); | ||
let current_finalized = match self.header_metadata(finalized_block_hash) { | ||
Ok(metadata) => metadata, | ||
Err(Error::UnknownBlock(_)) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should never happen, as we import the warp sync target as finalized block. However, it should also not hurt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Afaict this can happen exactly when we import the finalized block after warp sync. It is imported and finalized at the same time. Not yet in the backend but we are directly finalizing it so this method is still called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, yeah okay :P
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dq: If we finalize b5 here, instead of a4(f), do we report a4 as displaced leaf?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! 👍 |
bot fmt |
@bkchr https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/6724423 was started for your command Comment |
We are migrating the command bot to be a GitHub Action |
@bkchr Command |
…itytech#4997) After the merge of paritytech#4922 we saw failing zombienet tests with the following error: ``` 2024-07-09 10:30:09 Error applying finality to block (0xb9e1d3d9cb2047fe61667e28a0963e0634a7b29781895bc9ca40c898027b4c09, 56685): UnknownBlock: Header was not found in the database: 0x0000000000000000000000000000000000000000000000000000000000000000 2024-07-09 10:30:09 GRANDPA voter error: could not complete a round on disk: UnknownBlock: Header was not found in the database: 0x0000000000000000000000000000000000000000000000000000000000000000 ``` [Example](https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/6662262) The crashing situation is warp-sync related. After warp syncing, it can happen that there are gaps in block ancestry where we don't have the header. At the same time, the genesis hash is in the set of leaves. In `displaced_leaves_after_finalizing` we then iterate from the finalized block backwards until we hit an unknown block, crashing the node. This PR makes the detection of displaced branches resilient against unknown block in the finalized block chain. cc @nazar-pc (github won't let me request a review from you) --------- Co-authored-by: Bastian Köcher <[email protected]> Co-authored-by: command-bot <>
…itytech#4997) After the merge of paritytech#4922 we saw failing zombienet tests with the following error: ``` 2024-07-09 10:30:09 Error applying finality to block (0xb9e1d3d9cb2047fe61667e28a0963e0634a7b29781895bc9ca40c898027b4c09, 56685): UnknownBlock: Header was not found in the database: 0x0000000000000000000000000000000000000000000000000000000000000000 2024-07-09 10:30:09 GRANDPA voter error: could not complete a round on disk: UnknownBlock: Header was not found in the database: 0x0000000000000000000000000000000000000000000000000000000000000000 ``` [Example](https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/6662262) The crashing situation is warp-sync related. After warp syncing, it can happen that there are gaps in block ancestry where we don't have the header. At the same time, the genesis hash is in the set of leaves. In `displaced_leaves_after_finalizing` we then iterate from the finalized block backwards until we hit an unknown block, crashing the node. This PR makes the detection of displaced branches resilient against unknown block in the finalized block chain. cc @nazar-pc (github won't let me request a review from you) --------- Co-authored-by: Bastian Köcher <[email protected]> Co-authored-by: command-bot <>
…itytech#4997) After the merge of paritytech#4922 we saw failing zombienet tests with the following error: ``` 2024-07-09 10:30:09 Error applying finality to block (0xb9e1d3d9cb2047fe61667e28a0963e0634a7b29781895bc9ca40c898027b4c09, 56685): UnknownBlock: Header was not found in the database: 0x0000000000000000000000000000000000000000000000000000000000000000 2024-07-09 10:30:09 GRANDPA voter error: could not complete a round on disk: UnknownBlock: Header was not found in the database: 0x0000000000000000000000000000000000000000000000000000000000000000 ``` [Example](https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/6662262) The crashing situation is warp-sync related. After warp syncing, it can happen that there are gaps in block ancestry where we don't have the header. At the same time, the genesis hash is in the set of leaves. In `displaced_leaves_after_finalizing` we then iterate from the finalized block backwards until we hit an unknown block, crashing the node. This PR makes the detection of displaced branches resilient against unknown block in the finalized block chain. cc @nazar-pc (github won't let me request a review from you) --------- Co-authored-by: Bastian Köcher <[email protected]> Co-authored-by: command-bot <>
After the merge of #4922 we saw failing zombienet tests with the following error:
Example
The crashing situation is warp-sync related. After warp syncing, it can happen that there are gaps in block ancestry where we don't have the header. At the same time, the genesis hash is in the set of leaves. In
displaced_leaves_after_finalizing
we then iterate from the finalized block backwards until we hit an unknown block, crashing the node.This PR makes the detection of displaced branches resilient against unknown block in the finalized block chain.
cc @nazar-pc (github won't let me request a review from you)