-
Notifications
You must be signed in to change notification settings - Fork 768
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
elastic scaling: add core selector to cumulus #5372
Conversation
@@ -332,6 +332,10 @@ pub mod rpsr_digest { | |||
} | |||
} | |||
|
|||
/// The default claim queue offset to be used if it's not configured/accessible in the parachain | |||
/// runtime | |||
pub const DEFAULT_CLAIM_QUEUE_OFFSET: u8 = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
open question: should this be 0 or 1.
0 would preserve backwards compatibility with the current collators (they are all building right at the top of the claim queue).
1 would be what core-sharing parachains should use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1
should be the default value. With 0 they can't fully use the 2s of execution if the core is shared and can't find any reason why this would be useful in practice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the default needs to be configurable by the parachain ... or we derive it from async backing settings. In other words, we either can make a default that does the right thing for all chains or we should not provide one, but make it configurable with simple doc explaining when to use what. (async backing/6s block times yet or not?)
Can you please explain why we need this? AFAIR the parachain runtime does not really cares on which core it is build, it is just more about ensuring that someone is not sending a block build for core X to be included on core Y. |
IMO passing a pre-digest that contains the |
The RFC has more details about it. @sandreim may have more inside info but from my understanding: The parachain runtime will indirectly specify the core on which the candidate is valid on (to prevent the situation you mentioned). The way it specifies the core is by sending a UMP message with a sequence number (which can be the parachain block number) and a claim queue offset. The relay chain and the collator node will then use this info to determine the core index. This was done to avoid passing the claim queue snapshot into the parachain's inherent data. So indeed the parachain doesn't really care on which core it is built, but it needs to somehow make sure that it commits to one core. That's one reason why we don't specify the core in the UMP message. |
@@ -332,6 +332,10 @@ pub mod rpsr_digest { | |||
} | |||
} | |||
|
|||
/// The default claim queue offset to be used if it's not configured/accessible in the parachain | |||
/// runtime | |||
pub const DEFAULT_CLAIM_QUEUE_OFFSET: u8 = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1
should be the default value. With 0 they can't fully use the 2s of execution if the core is shared and can't find any reason why this would be useful in practice.
Can you detail how it would work ? The only way the runtime could know the exact core index would require to see the claim queue assignments and that is more expensive in terms of storage proof compared to the current proposal. |
The block author just provides a pre-digest. Like for example the BABE randomness or the AURA author index. The runtime can read it from there. From there it will be able to expose the information at validation. I don't see the need for having the parachain runtime know anything about the claim queue. |
I am not familiar with that code, but I will look into it. If we do this, does the collator actually have any control over the core index ? Also, how would the parachain runtime verify if the core index is correct ? |
I first thought it doesn't need, but we need to have at least the selector. I mean in the runtime doesn't need to care about the index as long as it outputs the correct index. I left some questions on the RFC. I think if they are answered, I have a better understanding. 👍 |
The CI pipeline was cancelled due to failure one of the required jobs. |
The CI pipeline was cancelled due to failure one of the required jobs. |
I am also having some problems fully wrapping my head around what the impact of a claim queue offset is.
Lets imagine an offset of 2. During authoring I check the claim queue at position 2 and see that I have cores 0, 1, 2. Now I submit 3 blocks at this relay parent. If for some reason, previous validators skipped a block, would the claims for core 1 at position 0 and 1 interfere with the candidate chain? Basically collators before me skipped some blocks and one of the blocks I submit on core 1 get an earlier claim but might depend on blocks I submitted on core 0 and 2? Or my mental model does not match what is actually happening? |
Collators don't submit to a specific core. They submit to the nth assigned core in the claim queue. So if the previous collators skipped authoring, you can use their slots in the claim queue. @sandreim correct me if I'm wrong |
That'd be a serious limit on elastic scaling. You can make a parachain with faster collators, but you cannot speed up validators so easily. I suppose the conversation moved on from there though. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM but left a few comments.
cumulus/client/consensus/aura/src/collators/slot_based/block_builder_task.rs
Outdated
Show resolved
Hide resolved
cumulus/client/consensus/aura/src/collators/slot_based/block_builder_task.rs
Outdated
Show resolved
Hide resolved
self.last_data = Some((relay_parent, data)); | ||
Ok(&mut self.last_data.as_mut().expect("last_data was just set above").1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels a bit clunky. Can we just check/update if needed before and then in this match we'd just return a mutable reference to the inner ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tried making this look nicer but this was the best I could :D
cumulus/client/consensus/aura/src/collators/slot_based/block_builder_task.rs
Outdated
Show resolved
Hide resolved
- name: rococo-parachain-runtime | ||
bump: minor | ||
- name: polkadot-parachain-bin | ||
bump: major |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this a major bump ? Anyone using the binary is broken ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's a major bump on the polkadot-parachain-lib because we add a new trait bound to a public trait.
When I initially created this prdoc the polkadot-parachain-lib was part of the bin crate so that's why I bumped the bin crate. Fixed it now
@@ -186,6 +191,25 @@ pub mod ump_constants { | |||
pub const MESSAGE_SIZE_FEE_BASE: FixedU128 = FixedU128::from_rational(1, 1000); // 0.001 | |||
} | |||
|
|||
/// Trait for selecting the next core to build the candidate for. | |||
pub trait SelectCore { | |||
fn select_core_for_child() -> (CoreSelector, ClaimQueueOffset); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why for_child
suffix ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so that it better expresses what this API returns. It doesn't return the core selector for the current block. It returns the core selector for the next block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks really good, only nits. I like the trait based config approach we have here now.
cumulus/client/consensus/aura/src/collators/slot_based/block_builder_task.rs
Outdated
Show resolved
Hide resolved
if !claimed_cores.insert(*core_index) { | ||
tracing::debug!( | ||
target: LOG_TARGET, | ||
"Core {:?} was already claimed at this relay chain slot", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is basically the condition that is described above when we have not enough cores scheduled to support the paras slot duration. (at least without merging of para blocks)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
cumulus/client/consensus/aura/src/collators/slot_based/block_builder_task.rs
Outdated
Show resolved
Hide resolved
@@ -363,4 +374,11 @@ where | |||
async fn version(&self, relay_parent: PHash) -> RelayChainResult<RuntimeVersion> { | |||
(**self).version(relay_parent).await | |||
} | |||
|
|||
async fn claim_queue( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the introduction of this can we remove availability_cores
from the interface? Did not check, but should not be used anymore now right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, we could. But is there a good reason for that? There's still information that the availability-cores API contains that you cannot get elsewhere: like when a core is occupied or not. Maybe this will come in handy at some point in the future.
Partially implements #5048
What's left to be implemented (in a follow-up PR):
View the RFC for more context: polkadot-fellows/RFCs#103