Rework chain pull to be concurrency friendly #1346
Conversation
Can I take this one for a spin, or is it not yet ready?
Yes, you can.
Testing now with "legacy", looks normal.
Testing in another network, and yes, it looks much "cleaner" (granted that the nodes may have changed and it is not the same situation, but still). Now, I went after panics related to
Attached is also the full rel8-chain_pool_root.log
The panic is due to another instance of out-of-order locking.
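(Editorial aside: a minimal sketch of the hazard being referred to, not jormungandr code; all names below are illustrative. When several tasks need the same pair of locks, every path must acquire them in one agreed global order; if one path locks A then B while another locks B then A, each task can end up holding one guard while waiting forever on the other.)

```rust
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // Two shared resources that several tasks update together.
    let headers = Arc::new(Mutex::new(Vec::<u64>::new()));
    let blocks = Arc::new(Mutex::new(Vec::<u64>::new()));

    let mut handles = Vec::new();
    for i in 0..4u64 {
        let (h, b) = (Arc::clone(&headers), Arc::clone(&blocks));
        handles.push(thread::spawn(move || {
            // Safe pattern: every path locks `headers` before `blocks`.
            // If some other path locked `blocks` first, two threads could
            // each hold one guard and block forever waiting on the other.
            let mut hg = h.lock().unwrap();
            let mut bg = b.lock().unwrap();
            hg.push(i);
            bg.push(i);
        }));
    }
    for t in handles {
        t.join().unwrap();
    }
}
```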
I've run tests with some private network topologies. At first glance there is little improvement in node synchronisation (the pass rate [4/11] is the same as before the change). Before the change, there were cases in which one out of 7 nodes was not in sync at all; in this run the difference was more subtle (2 or 3 blocks were off). I will update with more after deeper analysis.
Now that multiple chain pull tasks can pull the same, or mostly the same, header chain, and in the meantime between arriving headers some blocks in the chain may be received and applied, CandidateForest needs to account for this concurrency. Do this by only retiring candidate entries on GC, setting the state to Applied when the block is committed to storage. When asynchronous ChainAdvance processing sees such an entry at the current block's parent, it knows to drop the hashes it collected to fetch and re-land the chain from the current hash.

Fixes #1327
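(Editorial aside: a hedged sketch of the state machine described above. The variant and method names, `Applied`, `on_block_applied`, `garbage_collect`, `parent_is_applied`, are assumptions for illustration; the PR's actual CandidateForest differs in detail.)

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

type BlockHash = [u8; 32];

enum CandidateData {
    Header,           // header received, block body not yet fetched
    Block(Vec<u8>),   // full block cached, waiting on its parent
    Applied(Instant), // block committed to storage; kept until GC
}

struct CandidateForest {
    candidates: HashMap<BlockHash, CandidateData>,
}

impl CandidateForest {
    // When a block is committed to storage, the entry is not removed:
    // it is marked Applied so that concurrent chain-pull tasks that
    // still reference this hash can detect the commit.
    fn on_block_applied(&mut self, hash: BlockHash) {
        self.candidates.insert(hash, CandidateData::Applied(Instant::now()));
    }

    // Entries are only retired during garbage collection, after a grace
    // period long enough for in-flight pull tasks to finish.
    fn garbage_collect(&mut self, max_age: Duration) {
        let now = Instant::now();
        self.candidates.retain(|_, data| match data {
            CandidateData::Applied(at) => now.duration_since(*at) < max_age,
            _ => true,
        });
    }

    // A ChainAdvance task checks the parent of the block it is about to
    // land; if the parent is already Applied, the hashes collected so far
    // are stale and the chain should be re-landed from the current hash.
    fn parent_is_applied(&self, parent: &BlockHash) -> bool {
        matches!(self.candidates.get(parent), Some(CandidateData::Applied(_)))
    }
}
```

Keeping Applied entries around until GC, rather than deleting them on commit, is what lets a concurrent pull task distinguish "this parent was just applied" from "this parent is unknown".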
```
@@ -393,6 +434,9 @@ impl CandidateForest {
                debug_assert!(block.header().hash() == block_hash);
                Ok(())
            }
            CandidateData::Applied(_) => {
                panic!("caching block {} which is already in storage", block_hash)
```
Aren't you worried this will blow up in production in some cases? Can't we just "accept" the state and continue? If the block is already applied, then this is not necessarily cause for panic, right?
This should never happen, as blocks are checked against storage before being cached.
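(Editorial aside: a self-contained sketch of the invariant this reply relies on, with assumed names rather than the PR's actual API. Because callers filter out blocks that storage already contains before caching, hitting an Applied entry inside the caching path would mean a broken invariant, so the panic surfaces the logic error early instead of silently caching a stale block.)

```rust
use std::collections::HashMap;

type BlockHash = u64;

enum CandidateData {
    Pending(Vec<u8>), // block cached, waiting for its parent
    Applied,          // block already committed to storage
}

struct CandidateForest {
    candidates: HashMap<BlockHash, CandidateData>,
}

impl CandidateForest {
    fn cache_block(&mut self, hash: BlockHash, body: Vec<u8>) {
        // Callers have already checked storage, so an Applied entry here
        // means the check was bypassed somewhere; panic to surface the
        // logic error rather than re-caching an already-applied block.
        if matches!(self.candidates.get(&hash), Some(CandidateData::Applied)) {
            panic!("caching block {} which is already in storage", hash);
        }
        self.candidates.insert(hash, CandidateData::Pending(body));
    }
}
```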
This removes race conditions in the candidate forest, and handles out-of-order blocks cached due to lacking a parent.
Force-pushed from 7132afa to ce0af7f
I have added an improvement to remove a cause of the panics that @rinor has observed, and to properly handle out-of-order blocks.
Tested quickly under the same reproducible scenario, and one panic is still there. The fastest way to reproduce it is: