Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prepare PVFs if node is a validator in the next session #4791

Merged
merged 35 commits into from
Jul 22, 2024

Conversation

AndreiEres
Copy link
Contributor

@AndreiEres AndreiEres commented Jun 13, 2024

Closes #4324

  • On every active leaf candidate-validation subsystem checks if the node is the next session authority.
  • If it is, it fetches backed candidates and prepares unknown PVFs.
  • We limit number of PVFs per block to not overload subsystem.

@AndreiEres AndreiEres force-pushed the AndreiEres/prepare-pvf branch from 1298e1b to 404847f Compare June 14, 2024 08:10
@AndreiEres AndreiEres added T0-node This PR/Issue is related to the topic “node”. T8-polkadot This PR/Issue is related to/affects the Polkadot network. labels Jun 17, 2024
@paritytech-cicd-pr
Copy link

The CI pipeline was cancelled due to failure one of the required jobs.
Job name: cargo-clippy
Logs: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/6483094

@AndreiEres AndreiEres changed the title [WIP] Prepare PVFs in advance Prepare PVFs in advance Jun 17, 2024
@AndreiEres AndreiEres marked this pull request as ready for review June 17, 2024 15:39
Copy link
Contributor

@alexggh alexggh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass overall looks good, left you some comments.

On top of that we need some tests as well that test the following invariants:

  1. The logic doesn't get triggered if we are in active set.
  2. Checks that when we enter that active set no compilation is triggered because we already compiled them.

polkadot/node/core/candidate-validation/src/lib.rs Outdated Show resolved Hide resolved
polkadot/node/core/candidate-validation/src/lib.rs Outdated Show resolved Hide resolved
let new_session_index = new_session_index(&mut sender, session_index, leaf.hash).await;
if new_session_index.is_some() {
session_index = new_session_index;
already_prepared_code_hashes.clear();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why clearing here? Once we prepare it should be there until restart.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the same, but chances are that by this point, the artifact might have been pruned because it had been unused for more than 24 hours. So clearing here is more foolproof than not clearing, but some better solution may also be considered.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I supposed we should start from scratch on every new session. It’s just a mechanism to not overload the pvf host with unnecessary checking.

polkadot/node/core/candidate-validation/src/lib.rs Outdated Show resolved Hide resolved
polkadot/node/core/candidate-validation/src/lib.rs Outdated Show resolved Hide resolved
processed_code_hashes.push(code_hash);
}

if let Err(err) = validation_backend.heads_up(active_pvfs).await {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering HOST_MESSAGE_QUEUE_SIZE is 10 can this block ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be a problem if we stall the main loop here while we are active validator. Processing of any incoming validation requests will also be delayed by up to 100 (n_cores) validation code decompressions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ARAIR the whole subsystem has been moved to the blocking pool (#3122)
Yes, it would indeed be good to offload those decompressions to a separate task on a blocking pool; we discussed that. At that point, it seemed like firing cannons at sparrows. But in the context of #5012 it makes much more sense now.
Still, that's out of the scope of this issue. Raised #5071.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning to the original question. We don't do any work when we're an active validator. In other cases we allow to prepare only one PVF per block to not load the validator. It's plenty of time to prepare all PVFs during the session if a validator connected in the beginning but don't overload if it connected in the end.

let new_session_index = new_session_index(&mut sender, session_index, leaf.hash).await;
if new_session_index.is_some() {
session_index = new_session_index;
already_prepared_code_hashes.clear();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the same, but chances are that by this point, the artifact might have been pruned because it had been unused for more than 24 hours. So clearing here is more foolproof than not clearing, but some better solution may also be considered.

polkadot/node/core/candidate-validation/src/lib.rs Outdated Show resolved Hide resolved
);
return PreCheckOutcome::Invalid
};
let executor_params = if let Ok(executor_params) =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good question indeed: What set of executor parameters should be used to compile the PVFs in advance? There may already be something new in pending_config that will become effective in the next session, and all this in-advance compilation will be in vain in that case. That's a corner case, as changes to the executor parameters are extremely rare, so I do not insist it should be addressed in this very PR, but it's something that might probably be considered in the future. Generally, it's not the first time we're having hard times trying to predict next session's executor parameters: #694

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might not be so bad to precompile pvf anyway. Anyway, the validator will have to compile a new one in the next session. Or we can consider a smarter solution?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% sure I'm not missing something, but it seems like we could use pending_config from the configuration pallet to get the executor environment params. That gives us two benefits:

  1. We're always getting params that will be effective in the future session;
  2. We get it directly from storage, thus saving a couple of runtime API calls.

@sandreim WDYT?

Copy link
Contributor

@sandreim sandreim Jul 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we already know that in session N we are gonna become a parachain validator. So, we need to prepare the artifacts using the same execution params from session N.

I think 1 is preferable as a proper interaface, rather than reading storage directly.

Alternatively, we could could introduce a new runtime API that returns the execution parameters for a given future session index. This should solve future problems with getting the next session executor params in one place.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sandreim @s0me0ne-unkn0wn I think it can be a good job for an additional PR. What do you think if I move this conversation to an issue and start doing that after we merged this one?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After recalling how it works, I agree. That will require changing the runtime API, putting that into the staging API, then bumping its version, etc. Not an easy change. Raised #5080.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After a conversation with @sandreim:

@s0me0ne-unkn0wn what do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with implementing #4203 in a separate PR which applies to these changes as well.

Getting this one in the current release if we can still make it to backport there would be great. I'd want to wait for validators to upgrade to this fix before bumping Polkadot validator count to 400.

Copy link
Contributor

@alexggh alexggh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, I would also test with an integration test(zombienet), to make sure the changes have the desired effect.

@AndreiEres AndreiEres requested a review from a team as a code owner July 12, 2024 09:51
polkadot/node/core/candidate-validation/src/lib.rs Outdated Show resolved Hide resolved
continue;
};

let pvf = match sp_maybe_compressed_blob::decompress(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this just keeps decompressing needlessly if we are a validator intermittently: at session 1, then at session 2 we are no longer, but we are again at session 3.

All that was prepared for session 1 will be reset as soon as we go into session 2 (already_prepared_code_hashes is cleared). Is there a better way to check? Like query the PVF subsystem if it has it already ? Also I think the query needs to take into account the executor params which we are preparing with not just code hash.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's handle it a new PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's handle it a new PR.

Or maybe will not be even needed: #5071 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we still need to polish it. #5071 ticket talks about the other path, where we actually do candidate validation, here we just try to compile artifacts for blocks backed on chain. Given that, I expect decompression should be fast, the only thing concerning here is that we do it for all blocks.

Copy link
Contributor

@sandreim sandreim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @AndreiEres. LGTM modulo the optimisations that were discussed , but you are already on them in other PRs.

Let's burn this in on our Kusama validators until the release comes out, just to get more data from real world sooner.

Copy link
Contributor

@s0me0ne-unkn0wn s0me0ne-unkn0wn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@AndreiEres AndreiEres added this pull request to the merge queue Jul 22, 2024
Merged via the queue into master with commit 612c1bd Jul 22, 2024
158 of 162 checks passed
@AndreiEres AndreiEres deleted the AndreiEres/prepare-pvf branch July 22, 2024 17:04
ordian added a commit that referenced this pull request Jul 23, 2024
* master:
  Bump slotmap from 1.0.6 to 1.0.7 (#5096)
  feat: introduce pallet-parameters to Westend to parameterize inflation (#4938)
  Bump openssl from 0.10.64 to 0.10.66 (#5107)
  Bump lycheeverse/lychee-action from 1.9.1 to 1.10.0 (#5094)
  docs: remove the duplicate word (#5095)
  Prepare PVFs if node is a validator in the next session (#4791)
  Update parity publish (#5105)
TarekkMA pushed a commit to moonbeam-foundation/polkadot-sdk that referenced this pull request Aug 2, 2024
)

Closes paritytech#4324
- On every active leaf candidate-validation subsystem checks if the node
is the next session authority.
- If it is, it fetches backed candidates and prepares unknown PVFs.
- We limit number of PVFs per block to not overload subsystem.
AndreiEres added a commit that referenced this pull request Aug 5, 2024
Closes #4324
- On every active leaf candidate-validation subsystem checks if the node
is the next session authority.
- If it is, it fetches backed candidates and prepares unknown PVFs.
- We limit number of PVFs per block to not overload subsystem.
ggwpez added a commit that referenced this pull request Aug 9, 2024
Co-authored-by: Oliver Tale-Yazdi <[email protected]>
Co-authored-by: Jun Jiang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T0-node This PR/Issue is related to the topic “node”. T8-polkadot This PR/Issue is related to/affects the Polkadot network.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Validators entering the active set are slow on validation because PVF artifacts are not compiled
7 participants