Collator Protocol Revamp #2888
Comments
Will there be a PoS mechanism for the selection of collators?
One avenue would be to explore which APIs/primitives we need to expose in order to make a PoS mechanism possible in the collator-selection subprotocol, along with which other types of collator-selection protocols can be achieved by exposing a specific set of APIs. Alternatively, we could allow submitting some code that performs arbitrary selection, e.g. a wasm blob. For example, a PoA collator-selection protocol (a collator must prove knowledge of the private key for one out of a fixed set of public keys) uses a subset of the APIs of a PoS collator-selection protocol (a collator must prove knowledge of the private key for one of the n highest-staked public keys, delegated or not). Therefore we can support PoA by exposing the primitives required to enable PoS, as the sketch below illustrates. I believe that delegated PoS might require an additional primitive over non-delegated PoS.
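To make the "expose PoS primitives, get PoA for free" point concrete, here is a minimal Rust sketch. Every name in it is hypothetical (the trait, types, and proof structs are illustrative, not existing Polkadot APIs); it only shows PoA eligibility as a degenerate case of a stake-based check.

```rust
/// Hypothetical sketch of a pluggable collator-selection primitive.
pub type CollatorId = [u8; 32];

/// A proof that a collator is eligible to author at a given slot.
pub trait SelectionProof {
    fn verify(&self, who: &CollatorId, slot: u64) -> bool;
}

/// PoA: eligibility is membership in a fixed authority set. A real
/// protocol would also check a signature proving key knowledge; here
/// we only model the membership check.
pub struct PoaProof {
    pub authorities: Vec<CollatorId>,
}

impl SelectionProof for PoaProof {
    fn verify(&self, who: &CollatorId, _slot: u64) -> bool {
        self.authorities.contains(who)
    }
}

/// PoS: eligibility is membership among the n highest-staked keys. A
/// fixed authority set is just a stake table with equal weights, which
/// is why exposing the PoS primitive covers the PoA case.
pub struct PosProof {
    pub stakes: Vec<(CollatorId, u128)>,
    pub n: usize,
}

impl SelectionProof for PosProof {
    fn verify(&self, who: &CollatorId, _slot: u64) -> bool {
        let mut sorted = self.stakes.clone();
        sorted.sort_by(|a, b| b.1.cmp(&a.1)); // highest stake first
        sorted.iter().take(self.n).any(|(id, _)| id == who)
    }
}
```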
Yes, we'd expect most parachains wind up being proof-of-stake. We do not need NPoS on parachains like the relay chain does, which simplifies the problem enormously. A priori, collators' slots being weighted by stake simplifies this further, but considerable middle ground exists between NPoS and collation slots being stake weighted. Babe has a natural stake-weighted variant, which provides a straightforward approach, but comes with minor downsides. We've never discussed making Aura stake weighted; some stake weighting makes sense, but maybe ad hoc middle grounds suffice here, à la at most x collator slots run by the x largest crowdfund contributors. In future, we envision Sassafras being simpler without stake weighting, although a natural albeit slow stake-weighting design exists. We'll triage the stake-weighted ring VRF SNARK higher than I'd previously planned, I guess. In all cases, there are nuances like pre-validation functions and allocating collator-validator connections, which this issue tracks. See paritytech/cumulus#388 (comment) and #2203 (comment)
In order to begin addressing this issue, @rphmeier and I discussed the following approach: As a first step, we plan to leave the collator_side of the collator-protocol unchanged. The overarching goal is to create a set of metrics associated with a CollatorId, adaptively manage these metrics in the CollatorProtocol subsystem, and subsequently curate a validator's connected collators based on A) the current parachain assignment and B) the fitness of each collator according to the local view of the validator (based on the associated metrics). In order to achieve this overarching goal, I would like to present the following proposal.

Current State:

Currently, peer reputation is managed via cost/benefit reputation changes (the exhaustive list appears below). In addition, the CollatorProtocol subsystem currently does not have any concept of which collators (identified by CollatorId) are attempting to connect; it only has access to the peer's PeerId.

Proposal:

Optimally, the validator would manage connections based on a collator's reputation as well as the currently assigned parachain (and the sequence of parachain assignments). One approach to designing such a collator fitness/reputation system is to keep track of key collator metrics for each CollatorId, e.g. last_submitted_collation, total_submitted_collations, avg_submitted_collations_per_hour, etc., and subsequently reward or punish a collator based on their behavior, e.g. sending an unexpected message. When managing incoming connections, the validator would prefer to connect to collators with better metrics and would (for instance) disconnect certain collators after each parachain assignment, e.g. disconnect collators for the parachain that was assigned previously. A few considerations in this regard:
However, this is an inherently difficult problem to address, since determining a threshold for when to disconnect a collator, or how to compare two different sets of metrics, is a problem that can only be addressed via heuristics. A different approach would be to maintain a probability distribution for each collator, skew the distribution based on the collator's parachain's distance from being assigned, and randomly sample the validator's peer set from it.

As a first step in this direction we should design a reputation system in the CollatorProtocol subsystem that keeps track of specific metrics and attempts to manage the peer set based on these metrics and the currently assigned parachain. Then we can revisit once we've achieved these goals and potentially explore probabilistic methods.

Design Proposal:

Metrics
The exhaustive list of COST (penalties) in validator_side.rs is: Unknown Message, Corrupt Message, Network Error, Request Time Out, Invalid Signature, Report Bad, Wrong Para, Unneeded Collator; the only BENEFIT is: Notify Good.

Metric persistence

We can begin by adding a member to collator_protocol::validator_side::State that keeps track of these metrics. Since we don't want to maintain metrics for every single collator we've ever seen, and only want to keep metrics for the collators we connect/disconnect to/from regularly, we would prefer a collection like an LRU cache over a simple HashMap.

Note: when it comes to maintaining metrics for a collator, we really only want to maintain metrics for collators we connect/disconnect often, so that we know not to connect to them once they've behaved nefariously, but prefer to connect to them before their parachain gets assigned otherwise. On the other hand, well-behaved collators that we connect to once and never disconnect needn't be tracked in the metrics unless we expect them to disconnect. Effectively, there exists a dichotomy between maintaining state for known collators and freeing up space for new collators in the reputation system. Long-term we can address this trade-off by introducing a database (possibly key-value). We would maintain state for the most interesting collators in the LRU cache and write to the DB the metrics for peers we haven't seen in a while or haven't disconnected in a while; effectively keeping in the LRU cache the collators that we connect/disconnect regularly, e.g. because they're a good collator on their parachain but not good enough to get a long-term connection. Therefore, I would suggest not implementing a database now and focusing on the LRU cache design. As the proposal matures, we can design for a database where we maintain state and leverage the LRU cache opportunistically in order to save cycles when we needn't hit the DB.

Additional Network APIs

Subsequently, we will need to expose additional networking data to the CollatorProtocol subsystem such that it may access more than just its current view of the peer set.

Progress based connection management

Finally, we will need to develop a very basic mechanism for disconnecting peers as the relay chain makes progress, regardless of these collators' reputation. One simple mechanism would be to create a weighted function that maps the metrics we track above to a score, and then scale that function's output for each collator based on the collator's parachain's "distance" from the next parachain being assigned.

Suggestion

The local metrics that the collator-protocol subsystem keeps track of will depend on the outcome of the candidate-selection subsystem. In order to avoid having to pass messages from candidate-selection to collator-protocol for the purpose of maintaining collator metrics, it is probably very beneficial to subsume the candidate-selection subsystem into the collator-protocol subsystem. Once these basic facilities exist in the collator-protocol, we should be able to easily adapt our adaptive connection mechanism to use learning methods. Effectively, the best way to map the metrics of a collator to a particular probability distribution would be to perform supervised learning, but it won't be easy to create training data for such a learning problem. Therefore, we will most likely need to explore adaptive or evolutionary methods that enable learning such a distribution.
The rule-based approach could, however, prove sufficiently reliable in practice that we needn't explore such sophisticated methods at all. A sketch of the rule-based design follows below.
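As a rough illustration of the design above (metrics keyed by CollatorId plus a weighted fitness function scaled by assignment distance), here is a minimal sketch. All names and weights are invented for illustration; a real version would sit behind an LRU cache in collator_protocol::validator_side::State as discussed, while this one uses a plain HashMap to stay dependency-free.

```rust
use std::collections::HashMap;

/// Per-collator metrics, as proposed above.
#[derive(Default, Clone)]
struct CollatorMetrics {
    last_submitted_collation: Option<u64>, // relay block number
    total_submitted_collations: u64,
    avg_submitted_collations_per_hour: f64,
    reputation: i64, // accumulated COST/BENEFIT deltas
}

struct ReputationStore {
    metrics: HashMap<[u8; 32], CollatorMetrics>, // keyed by CollatorId
}

impl ReputationStore {
    /// Weighted fitness: combine the metrics into one score, then scale
    /// by how close the collator's parachain is to being assigned
    /// (distance 0 = currently assigned). Weights are illustrative only.
    fn fitness(&self, collator: &[u8; 32], distance_to_assignment: u32) -> f64 {
        let m = match self.metrics.get(collator) {
            Some(m) => m,
            None => return 0.0,
        };
        let base = 1.0 * m.avg_submitted_collations_per_hour
            + 0.1 * m.total_submitted_collations as f64
            + 0.5 * m.reputation as f64;
        // Exponential decay with distance: a collator whose parachain is
        // about to be assigned is worth far more than one several
        // assignments away.
        base * 0.5f64.powi(distance_to_assignment as i32)
    }
}
```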
I'm dubious that positive reputation metrics scale well. If we've 1500 validators and 500-1500 parachains, then each validator visits each parachain maybe once per hour, and backs maybe 5 candidates per parachain per hour. These candidates originate from a uniform distribution on collators, so if we've 100-1000 collators per parachain then each validator learns quite slowly. It's possible 1000 overestimates the collators per parachain, but parathreads have no solid limit and proof-of-work is expressly unlimited. We thus have a high-churn uniform distribution on useful validator-collator connections. We could incentivize gossip within the parachains via hierarchical watermarking of candidate block seals, which then helps slow the churn on validator-collator connections, but only under economic assumptions, and it does not reduce the underlying parachain churn.

I'm afraid proof-of-work distracts us from more useful steps. We want parachains' proof-of-stake rules to be verifiable from headers alone, without either the state or block body. Aura block headers need a Merkle proof that shows sequencing. Babe block headers need a Merkle proof to their Babe VRF key. In future, Sassafras block headers would need a Merkle proof into the ring VRF output list. Assuming headers fully authorize collations, validators could more freely accept connections from collators but then maybe blackball collators that do nothing useful. I'd expect negative reputation would prove far more useful than positive reputation. We've a complex protocol designed largely around explicitly authorized nodes, with at least four places where proof-of-work acts like a denial-of-service attack. We know proof-of-work gets out of hand quickly, which adds risks everywhere; witness services like LayerCI, GitLab, TravisCI, Shippable, and sourcehut CI removing or restricting their free tiers.
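To put a back-of-envelope number on "learns quite slowly" (using the figures above, not a measurement): with about 5 backed candidates per parachain per hour drawn uniformly from, say, 1000 collators, a validator observes any particular collator with probability 5/1000 = 0.5% per hour, i.e. roughly one positive data point per collator every 200 hours. Positive reputation accumulated at that rate is stale long before it becomes statistically meaningful.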
Great points. If we want to use a uniform distribution among the most viable collators for each parachain assignment, then I believe reservoir sampling is our best option. If we are only worried about negative reinforcement, the problem is more constrained and will more than likely yield more fruitful results when applying heuristics/algorithms. Makes a lot of sense to focus only on negative behavior.
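For reference, a minimal sketch of reservoir sampling (Algorithm R): it maintains k uniformly sampled items from a stream of unknown length, which matches the "uniform distribution over viable collators" case where we don't know up front how many collators will advertise. The xorshift RNG is only there to keep the sketch free of external crates.

```rust
// Tiny xorshift PRNG so the sketch needs no external crates.
fn xorshift(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

/// Algorithm R: keep k items such that every element of the stream is
/// equally likely to end up in the reservoir.
fn reservoir_sample<T>(stream: impl Iterator<Item = T>, k: usize, seed: u64) -> Vec<T> {
    let mut rng = seed.max(1); // xorshift requires a nonzero state
    let mut reservoir: Vec<T> = Vec::with_capacity(k);
    for (i, item) in stream.enumerate() {
        if i < k {
            reservoir.push(item);
        } else {
            // Replace a random slot with probability k / (i + 1).
            let j = (xorshift(&mut rng) % (i as u64 + 1)) as usize;
            if j < k {
                reservoir[j] = item;
            }
        }
    }
    reservoir
}
```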
Are parachains' proof-of-stake block production criteria currently verifiable from headers alone? If not, that's the low-hanging fruit, although maybe that's someone else's problem, not yours, not sure. ;)
I believe they are not right now, but the plan is to make that happen as soon as we implement the PrePVF logic. @rphmeier mentioned that we would probably expose a function in addition to the PVF's existing validation entry point.

I think it would be worth considering BLS12-381 signatures for the PreVF in order to have (somewhat) constant-time signature verification on each block. We could (for instance) expose miller_loop and final_exp in the runtime and allow parachains to leverage those primitives in the PVF (or PreVF). We would subsume all possible signature schemes from ecdsa/eddsa but also potentially enable additional verifications, e.g. snarky signatures. Tbh I'm not a huge fan of BLS signatures due to slow signature verification, since you require a ~20-of-N multisig before aggregation outperforms serial verification. But for the specific PreVF case that we are talking about here, BLS signatures offer real advantages.
Overall, signing parachain blocks seems like the perfect situation to require BLS signatures while leaving all other signature schemes in the system completely alone.
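A sketch of what exposing miller_loop and final_exp might buy us, assuming signatures live in G1 and public keys in G2 (the "min-sig" variant). Every type and host function here is hypothetical, not an existing Substrate API; the code only restates the standard BLS verification equation e(sig, G2) == e(H(msg), pk) in terms of the two exposed primitives.

```rust
/// Hypothetical host interface; none of these functions exist in
/// Substrate today.
pub trait Bls12_381Host {
    type G1;
    type G2;
    type Gt: PartialEq;

    fn g2_generator(&self) -> Self::G2;
    fn g1_neg(&self, p: Self::G1) -> Self::G1;
    fn hash_to_g1(&self, msg: &[u8]) -> Self::G1;
    fn miller_loop(&self, p: Self::G1, q: Self::G2) -> Self::Gt;
    fn gt_mul(&self, a: Self::Gt, b: Self::Gt) -> Self::Gt;
    fn final_exp(&self, f: Self::Gt) -> Self::Gt;
    fn gt_one(&self) -> Self::Gt;
}

/// BLS verification rewritten as a single pairing product:
/// e(sig, G2) == e(H(msg), pk)  becomes
/// final_exp(miller_loop(sig, G2) * miller_loop(-H(msg), pk)) == 1.
pub fn bls_verify<H: Bls12_381Host>(
    host: &H,
    signature: H::G1,
    public_key: H::G2,
    message: &[u8],
) -> bool {
    let neg_hash = host.g1_neg(host.hash_to_g1(message));
    let acc = host.gt_mul(
        host.miller_loop(signature, host.g2_generator()),
        host.miller_loop(neg_hash, public_key),
    );
    host.final_exp(acc) == host.gt_one()
}
```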
We've BLS for a BEEFY post-finality layer, but that's for light clients, not here. We're only talking Merkle proofs, ed25519, sr25519, and sr25519's VRF for the crypto at the PrePVF layer, plus a Jubjub VRF almost exactly like the sr25519 VRF for Sassafras. We'll eventually use a ring VRF for Sassafras using the same Jubjub key, which requires two-ish Groth16 SNARKs. We'll actually sort outputs into a list on-chain, and then reference them using Merkle proofs, so they never appear inside the PrePVF.

I'm more concerned about Merkle proof sizes inside the header, meaning we'll optimize the state referenced from the header separately from the full parachain state. We'll churn this state on epoch boundaries when we swap validators and rerun the Aura/Sassafras sequencing. If properly minimized, Aura and Babe need a Merkle proof of size 16 log_2 num_collators, while Sassafras needs a Merkle proof of size 16 log_2 num_slots_per_epoch by making its public key implicit in the ring VRF output, or maybe that 16 is a 32 or something. We've discussed using fancy accumulators before, but I see no benefit here.

Actually, the fancy crypto I brought up above is the prepared batch signatures in schnorrkel. At present, these cannot be done recursively, but nothing really prevents recursion. If you can recurse into n layers of pairwise batch preparations, then a gossip message can accumulate n 64-byte watermarks that say "node x forwarded, node y forwarded, etc.". If optimized better, each layer costs maybe 70 ms extra CPU. In principle, these watermarks yield positive information about who actually forwards messages, except I do not trust this not to be gameable, so I'm reluctant to use it. It's maybe usable via tit-for-tat à la #2089 (comment) and https://github.com/w3f/research-security-issues/issues/30#issuecomment-768593265, not sure.
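For scale, taking those formulas at face value (with the stated caveat that the 16 might be a 32): with 1024 collators, the Aura/Babe proof comes to 16 · log_2(1024) = 160 bytes per header, and with 4096 slots per epoch, the Sassafras proof comes to 16 · 12 = 192 bytes, so either way the per-header overhead stays in the low hundreds of bytes.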
We discussed this and noticed the pre-collation proof could be sent before the block header, so we've two possible sequences when a collator opens a connection to a backing validator: the proof arrives before the block header, or after it.
And if it is not assigned anymore, but still holds a relevant block? |
I appreciate the generality of a pre-validation function, and find PoW a useful motivating example. I hope we won't have to confine parachain consensus to the box of finite authority sets. |
Yes, this makes sense, within whatever bandwidth limits we discover.
We think collators should send their pre-collation authorization before they've even constructed their collation block, so the validator does not disconnect them while they're doing the work.
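A toy sketch of the validator-side handling this implies: on a verified pre-collation authorization, the validator extends the peer's keep-alive deadline so the connection survives block construction. The message names, grace period, and structure are all invented for illustration.

```rust
use std::time::{Duration, Instant};

/// Hypothetical wire messages, not the real collator-protocol ones.
enum CollatorMessage {
    /// Sent first, before block construction, so the validator keeps the
    /// connection open while the collator does the work.
    PreCollationAuthorization { proof: Vec<u8> },
    Collation { pov: Vec<u8> },
}

struct PeerState {
    authorized: bool,
    keep_alive_until: Instant,
}

fn on_message(peer: &mut PeerState, msg: CollatorMessage, verify_proof: impl Fn(&[u8]) -> bool) {
    match msg {
        CollatorMessage::PreCollationAuthorization { proof } => {
            if verify_proof(&proof) {
                peer.authorized = true;
                // Grace period covering the expected block-building time.
                peer.keep_alive_until = Instant::now() + Duration::from_secs(12);
            }
        }
        CollatorMessage::Collation { pov } => {
            if peer.authorized {
                // Hand `pov` off to candidate validation here.
                let _ = pov;
            }
        }
    }
}
```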
Any reputation system turns PoW into a finite authority set. You might move the PoW to the pre-collation function, meaning doing work proves you're worthy of gossip, but nobody has ever explored this, and it still gives some finite authority set. If you want a "less finite" authority set to capture the "run a node, get paid" spirit, then a good solution might be balance-weighted Praos/Babe or balance-weighted Sassafras. I think every on-chain account could participate in block production using its minimum balance over the previous three epoch starts. In this way, your block producer set could be larger than the number of nodes who even participate in gossip. If you've a very small balance, then your node starts up at the beginning of the epoch and checks whether you've any collation assignments that epoch; if none, it sleeps, but if you have some collation assignments, it warp syncs so you can catch up and build them, or maybe sells your slots to someone online if that does not break anything. We'd never know how much balance participates, so proof-of-work's adjustment issues exist in balance-weighted Praos/Babe, and in the ring VRF phase of balance-weighted Sassafras.
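For concreteness, the Praos-style eligibility rule this describes: an account is a slot leader when its VRF output (normalized to [0, 1)) falls below the threshold φ_c(α) = 1 − (1 − c)^α, where α is the account's balance share (here, its minimum balance over the previous three epoch starts, divided by the total). A minimal f64 sketch of that check, not the real Babe code:

```rust
/// Balance-weighted Praos/Babe slot eligibility, approximated in f64
/// for illustration. `vrf_output_ratio` is the account's VRF output over
/// the slot divided by 2^256, so it lies in [0, 1); `c` is the active
/// slots coefficient.
fn is_slot_leader(vrf_output_ratio: f64, balance: u128, total_balance: u128, c: f64) -> bool {
    let share = balance as f64 / total_balance as f64;
    // Threshold grows with balance share: phi_c(alpha) = 1 - (1 - c)^alpha.
    let threshold = 1.0 - (1.0 - c).powf(share);
    vrf_output_ratio < threshold
}
```

Note that because the threshold depends only on the account's own share, accounts cannot gain by splitting or merging balances, which is the property that makes this rule workable when we never know how much balance actually participates.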
https://github.com/orgs/paritytech/projects/68
Collators in Polkadot are associated with some particular parachain and are responsible for creating blocks for that parachain as well as Proofs-of-Validity (PoV), which they send to the Polkadot validators temporarily assigned to the parachain.
From the perspective of Polkadot, the collator set for any particular parachain is unspecified. Any node can connect to a validator, claim to be a collator, and provide a collation. At the moment, parachains written in Cumulus do, in fact, allow any node to be a collator. In the near future, Cumulus will support parachains with collator-selection algorithms that specify who, specifically, is allowed to author a parachain block at a given point in time. However, we intend to support parachains with unspecified collator sets indefinitely, particularly for PoW parachains. This makes on-chain registration of the collator set difficult.
We need to overhaul the collator protocol, particularly the validator side, to avoid validators being overwhelmed by useless peers who claim to be collators for the validator's assigned parachain, but in fact provide nothing useful. This involves the introduction of some kind of local rating or reputation system where validators keep track of which peers have historically provided useful collations for each parachain and ensure that they accept or even possibly initiate connections to such peers.
To better support specific collator-selection algorithms such as Aura (round-robin) we've in the past discussed having a "pre-validation" function as part of the Parachain Validation Function (PVF). At the moment, the only thing validators can do with a PVF is validate an entire parablock, but it'd also make sense to have a second function for pre-validation which is analogous to verifying the parablock's header. The pre-validation function is an opaque abstraction for performing actions like proof-of-work verification or signature & credential verification on block headers. We should revisit this feature and determine how to introduce it to our PVF system.
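A sketch of what the extended PVF interface might look like, with hypothetical names (this is not the actual Polkadot PVF host API): full validation stays as-is, and a cheap header-level check is added alongside it.

```rust
/// Opaque inputs; in reality these would be the candidate's header-level
/// data and the full Proof-of-Validity.
pub struct HeaderProof(pub Vec<u8>);
pub struct PoV(pub Vec<u8>);

/// Hypothetical extended PVF interface.
pub trait ParachainValidation {
    /// Full validation of a candidate, as exists today.
    fn validate_block(&self, pov: &PoV) -> Result<(), ()>;

    /// Cheap pre-validation, analogous to verifying the parablock's
    /// header: proof-of-work, signature, or credential checks that can
    /// run before a validator commits bandwidth to the full PoV.
    fn pre_validate(&self, header: &HeaderProof) -> Result<(), ()>;
}
```

The design point is that `pre_validate` is opaque to Polkadot, so each parachain decides what "header validity" means (a PoW check, an Aura slot signature, a Babe VRF proof) without the relay chain needing to know.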
As a last pain point in our collation/seconding pipeline: currently, if a validator accepts a collation from a collator which turns out to be invalid, it does not proceed to select a new collation for the same parachain. We need to build retrying into the system and extend the `candidate-selection` subsystem to support it. `candidate-selection` could reasonably be merged into the `collator-protocol`, but does not need to be.
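A sketch of the retry behavior being asked for, with invented names: instead of giving up after the first invalid collation, candidate selection reports the misbehaving collator locally and keeps drawing advertisements for the same parachain.

```rust
/// A candidate offered for a given parachain; fields are illustrative.
struct Candidate {
    collator: [u8; 32], // CollatorId
    pov: Vec<u8>,
}

/// Keep fetching collations for the same parachain until one validates
/// or the advertisements run out; punish collators whose collations fail.
fn select_with_retry(
    mut fetch_next: impl FnMut() -> Option<Candidate>,
    validate: impl Fn(&Candidate) -> bool,
    mut report_bad: impl FnMut(&[u8; 32]),
) -> Option<Candidate> {
    while let Some(candidate) = fetch_next() {
        if validate(&candidate) {
            return Some(candidate); // second this candidate
        }
        // Punish the collator locally and try the next advertisement
        // instead of giving up on the parachain.
        report_bad(&candidate.collator);
    }
    None
}
```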