-
Notifications
You must be signed in to change notification settings - Fork 767
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prepare PVFs if node is a validator in the next session #4791
Conversation
1298e1b
to
404847f
Compare
The CI pipeline was cancelled due to failure one of the required jobs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First pass overall looks good, left you some comments.
On top of that we need some tests as well that test the following invariants:
- The logic doesn't get triggered if we are in active set.
- Checks that when we enter that active set no compilation is triggered because we already compiled them.
let new_session_index = new_session_index(&mut sender, session_index, leaf.hash).await; | ||
if new_session_index.is_some() { | ||
session_index = new_session_index; | ||
already_prepared_code_hashes.clear(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why clearing here? Once we prepare it should be there until restart.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought the same, but chances are that by this point, the artifact might have been pruned because it had been unused for more than 24 hours. So clearing here is more foolproof than not clearing, but some better solution may also be considered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I supposed we should start from scratch on every new session. It’s just a mechanism to not overload the pvf host with unnecessary checking.
processed_code_hashes.push(code_hash); | ||
} | ||
|
||
if let Err(err) = validation_backend.heads_up(active_pvfs).await { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Considering HOST_MESSAGE_QUEUE_SIZE
is 10 can this block ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could be a problem if we stall the main loop here while we are active validator. Processing of any incoming validation requests will also be delayed by up to 100 (n_cores) validation code decompressions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ARAIR the whole subsystem has been moved to the blocking pool (#3122)
Yes, it would indeed be good to offload those decompressions to a separate task on a blocking pool; we discussed that. At that point, it seemed like firing cannons at sparrows. But in the context of #5012 it makes much more sense now.
Still, that's out of the scope of this issue. Raised #5071.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Returning to the original question. We don't do any work when we're an active validator. In other cases we allow to prepare only one PVF per block to not load the validator. It's plenty of time to prepare all PVFs during the session if a validator connected in the beginning but don't overload if it connected in the end.
let new_session_index = new_session_index(&mut sender, session_index, leaf.hash).await; | ||
if new_session_index.is_some() { | ||
session_index = new_session_index; | ||
already_prepared_code_hashes.clear(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought the same, but chances are that by this point, the artifact might have been pruned because it had been unused for more than 24 hours. So clearing here is more foolproof than not clearing, but some better solution may also be considered.
); | ||
return PreCheckOutcome::Invalid | ||
}; | ||
let executor_params = if let Ok(executor_params) = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good question indeed: What set of executor parameters should be used to compile the PVFs in advance? There may already be something new in pending_config
that will become effective in the next session, and all this in-advance compilation will be in vain in that case. That's a corner case, as changes to the executor parameters are extremely rare, so I do not insist it should be addressed in this very PR, but it's something that might probably be considered in the future. Generally, it's not the first time we're having hard times trying to predict next session's executor parameters: #694
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might not be so bad to precompile pvf anyway. Anyway, the validator will have to compile a new one in the next session. Or we can consider a smarter solution?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not 100% sure I'm not missing something, but it seems like we could use pending_config
from the configuration
pallet to get the executor environment params. That gives us two benefits:
- We're always getting params that will be effective in the future session;
- We get it directly from storage, thus saving a couple of runtime API calls.
@sandreim WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we already know that in session N we are gonna become a parachain validator. So, we need to prepare the artifacts using the same execution params from session N.
I think 1 is preferable as a proper interaface, rather than reading storage directly.
Alternatively, we could could introduce a new runtime API that returns the execution parameters for a given future session index. This should solve future problems with getting the next session executor params in one place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sandreim @s0me0ne-unkn0wn I think it can be a good job for an additional PR. What do you think if I move this conversation to an issue and start doing that after we merged this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After recalling how it works, I agree. That will require changing the runtime API, putting that into the staging API, then bumping its version, etc. Not an easy change. Raised #5080.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After a conversation with @sandreim:
- This concern is valid and not only for just connected validators but for all. See PVF: Consider re-preparing PVFs ahead of time if
PendingConfigs
changesExecutorParams
#4203 - It's OK to solve this issue separately.
@s0me0ne-unkn0wn what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm ok with implementing #4203 in a separate PR which applies to these changes as well.
Getting this one in the current release if we can still make it to backport there would be great. I'd want to wait for validators to upgrade to this fix before bumping Polkadot validator count to 400.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, I would also test with an integration test(zombienet), to make sure the changes have the desired effect.
continue; | ||
}; | ||
|
||
let pvf = match sp_maybe_compressed_blob::decompress( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this just keeps decompressing needlessly if we are a validator intermittently: at session 1, then at session 2 we are no longer, but we are again at session 3.
All that was prepared for session 1 will be reset as soon as we go into session 2 (already_prepared_code_hashes
is cleared). Is there a better way to check? Like query the PVF subsystem if it has it already ? Also I think the query needs to take into account the executor params which we are preparing with not just code hash.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's handle it a new PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's handle it a new PR.
Or maybe will not be even needed: #5071 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we still need to polish it. #5071 ticket talks about the other path, where we actually do candidate validation, here we just try to compile artifacts for blocks backed on chain. Given that, I expect decompression should be fast, the only thing concerning here is that we do it for all blocks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @AndreiEres. LGTM modulo the optimisations that were discussed , but you are already on them in other PRs.
Let's burn this in on our Kusama validators until the release comes out, just to get more data from real world sooner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
* master: Bump slotmap from 1.0.6 to 1.0.7 (#5096) feat: introduce pallet-parameters to Westend to parameterize inflation (#4938) Bump openssl from 0.10.64 to 0.10.66 (#5107) Bump lycheeverse/lychee-action from 1.9.1 to 1.10.0 (#5094) docs: remove the duplicate word (#5095) Prepare PVFs if node is a validator in the next session (#4791) Update parity publish (#5105)
) Closes paritytech#4324 - On every active leaf candidate-validation subsystem checks if the node is the next session authority. - If it is, it fetches backed candidates and prepares unknown PVFs. - We limit number of PVFs per block to not overload subsystem.
Closes #4324 - On every active leaf candidate-validation subsystem checks if the node is the next session authority. - If it is, it fetches backed candidates and prepares unknown PVFs. - We limit number of PVFs per block to not overload subsystem.
Co-authored-by: Oliver Tale-Yazdi <[email protected]> Co-authored-by: Jun Jiang <[email protected]>
Closes #4324