-
Notifications
You must be signed in to change notification settings - Fork 767
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
beefy: Tolerate pruned state on runtime API call #5197
Conversation
@@ -451,7 +451,8 @@ where | |||
state.set_best_grandpa(best_grandpa.clone()); | |||
// Overwrite persisted data with newly provided `min_block_delta`. | |||
state.set_min_block_delta(min_block_delta); | |||
debug!(target: LOG_TARGET, "🥩 Loading BEEFY voter state from db: {:?}.", state); | |||
debug!(target: LOG_TARGET, "🥩 Loading BEEFY voter state from db."); | |||
trace!(target: LOG_TARGET, "🥩 Loaded state: {:?}.", state); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is unrelated, but the state is quite big and causes log files to be multiple gigabytes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Now that the beefy worker is more resilient to the runtime call errors and not restarting that often, I wonder if it made things better from the memory perspective as well 🤔
I've seen some periodic spikes (up to 8mib nothing concerning, something to keep in mind) with beefy unbounded channels:
Err(api_error) => { | ||
// This can happen in case the block was already pruned. | ||
// Mostly after warp sync when finality notifications are piled up. | ||
debug!(target: LOG_TARGET, "🥩 Unable to check beefy genesis: {}", api_error); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if this is safe since this way we might miss consensus resets. And in this case, further worker processing might be incorrect. Thinking.
Longer term we plan to add a header log to handle this kind of situations. Something like this: paritytech/substrate#14765
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another option would be to store the beefy_genesis
in UnpinnedFinalityNotification
. We could retrieve it inside the transformer, when it should be available. Would that fix the issue ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another option would be to store the
beefy_genesis
inUnpinnedFinalityNotification
. We could retrieve it inside the transformer, when it should be available. Would that fix the issue ?
This could work 👍 .
Regarding missing the ConsensusReset. So eventually we will catch up to the latest blocks and will find that the beefy_genesis is indeed different to what we have stored. In that case we would trigger the consensus reset, just later as we reach the tip. Is that acceptable? What are the risks here? If not acceptable we could add beefy genesis in the UnpinnedFinalityNotification
as you proposed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm looked a bit on the code. I was thinking there might be problems with the authority set, or the payload, but it looks like this can't happen. I think it's ok. But I would also wait for @acatangiu 's input on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change itself is correct in what it does, but we might still have undiscovered corner cases that this will surface. I suggest to test it on Rococo where we have actually reset consensus and see how it behaves.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah okay I was not aware that this was done on rococo, will perform some tests there before merging
@@ -32,7 +32,7 @@ use crate::{ | |||
metrics::register_metrics, | |||
}; | |||
use futures::{stream::Fuse, FutureExt, StreamExt}; | |||
use log::{debug, error, info, warn}; | |||
use log::{debug, error, info, log_enabled, trace, warn, Level}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
are log_enabled
and Level
used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, will remove
The CI pipeline was cancelled due to failure one of the required jobs. |
2abd03e
* master: (51 commits) Remove unused feature gated code from the minimal template (#5237) make polkadot-parachain startup errors pretty (#5214) Coretime auto-renew (#4424) network/strategy: Backoff and ban overloaded peers to avoid submitting the same request multiple times (#5029) Fix frame crate usage doc (#5222) beefy: Tolerate pruned state on runtime API call (#5197) rpc: Enable ChainSpec for polkadot-parachain (#5205) Add an adapter for configuring AssetExchanger (#5130) Replace env_logger with sp_tracing (#5065) Adjust sync templates flow to use new release branch (#5182) litep2p/discovery: Publish authority records with external addresses only (#5176) Run UI tests in CI for some other crates (#5167) Remove `pallet::getter` usage from the pallet-balances (#4967) pallet-timestamp: `UnixTime::now` implementation logs error only if called at genesis (#5055) [CI] Cache try-runtime check (#5179) [Backport] version bumps and the prdocs reordering from stable2407 (#5178) [subsystem-benchmark] Update availability-distribution-regression-bench baseline after recent subsystem changes (#5180) Remove pallet::getter usage from proxy (#4963) Remove pallet::getter macro usage from pallet-election-provider-multi-phase (#4487) [email protected] (#5177) ...
While working on paritytech#5129 I noticed that after warp sync, nodes would print: ``` 2024-07-29 17:59:23.898 ERROR ⋮beefy: 🥩 Error: ConsensusReset. Restarting voter. ``` After some debugging I found that we enter the following loop: 1. Wait for beefy pallet to be available: Pallet is detected available directly after warp sync since we are at the tip. 2. Wait for headers from tip to beefy genesis to be available: During this time we don't process finality notifications, since we later want to inspect all the headers for authority set changes. 3. Gap sync finishes, route to beefy genesis is available. 4. The worker starts acting, tries to fetch beefy genesis block. It fails, since we are acting on old finality notifications where the state is already pruned. 5. Whole beefy subsystem is being restarted, loading the state from db again and iterating a lot of headers. This already happened before paritytech#5129.
While working on #5129 I noticed that after warp sync, nodes would print:
After some debugging I found that we enter the following loop:
This already happened before #5129.