From f9de5dbd79de7b56493351eaf27d43567b7004fd Mon Sep 17 00:00:00 2001
From: Alin Dima
Date: Tue, 28 May 2024 11:15:50 +0300
Subject: [PATCH] Add availability-recovery from systematic chunks (#1644)

**Don't look at the commit history, it's confusing, as this branch is based on another branch that was merged**

Fixes #598
Also implements [RFC #47](https://github.com/polkadot-fellows/RFCs/pull/47)

## Description

- Availability-recovery now first attempts to request the systematic chunks for large POVs (these are the first ~n/3 chunks, which can recover the full data without the costly Reed-Solomon decoding process). If this fails for any reason, it falls back to recovering from all chunks. Additionally, backers are used as a backup for requesting the systematic chunks if the assigned validator is not offering the chunk (each backer is only used for one systematic chunk, to not overload them).
- Recovering from systematic chunks is much faster than recovering from regular chunks (4000% faster as measured on my Apple M2 Pro).
- Introduces a `ValidatorIndex` -> `ChunkIndex` mapping which is different for every core, in order to avoid querying only the first n/3 validators over and over again in the same session. The mapping is the one described in RFC 47.
- The mapping is feature-gated by the [NodeFeatures runtime API](https://github.com/paritytech/polkadot-sdk/pull/2177) so that it can only be enabled via a governance call once a sufficient majority of validators have upgraded their client. If the feature is not enabled, the mapping is the identity mapping and backwards compatibility is preserved.
- Adds a new chunk request protocol version (v2), which adds the `ChunkIndex` to the response. Depending on the caller, this is checked against the expected chunk index: av-distribution and systematic recovery check it, regular recovery does not. This is backwards compatible. First, a v2 request is attempted.
  If that fails during protocol negotiation, v1 is used.
- Systematic recovery is only attempted during approval-voting, where we have easy access to the core_index. For disputes and collator pov_recovery, regular chunk requests are used, just as before.

## Performance results

Some results from subsystem-bench:

with regular chunk recovery: CPU usage per block 39.82s
with recovery from backers: CPU usage per block 16.03s
with systematic recovery: CPU usage per block 19.07s

End-to-end results here: https://github.com/paritytech/polkadot-sdk/issues/598#issuecomment-1792007099

#### TODO:

- [x] [RFC #47](https://github.com/polkadot-fellows/RFCs/pull/47)
- [x] merge https://github.com/paritytech/polkadot-sdk/pull/2177 and rebase on top of those changes
- [x] merge https://github.com/paritytech/polkadot-sdk/pull/2771 and rebase
- [x] add tests
- [x] preliminary performance measure on Versi: see https://github.com/paritytech/polkadot-sdk/issues/598#issuecomment-1792007099
- [x] Rewrite the implementer's guide documentation
- [x] https://github.com/paritytech/polkadot-sdk/pull/3065
- [x] https://github.com/paritytech/zombienet/issues/1705 and fix zombienet tests
- [x] security audit
- [x] final versi test and performance measure

---------

Signed-off-by: alindima
Co-authored-by: Javier Viola
---
 .gitlab/pipeline/zombienet.yml | 2 +-
 .gitlab/pipeline/zombienet/polkadot.yml | 16 +
 Cargo.lock | 8 +-
 .../src/active_candidate_recovery.rs | 1 +
 .../relay-chain-minimal-node/src/lib.rs | 3 +
 cumulus/test/service/src/lib.rs | 5 +-
 polkadot/erasure-coding/Cargo.toml | 1 +
 polkadot/erasure-coding/benches/README.md | 6 +-
 .../benches/scaling_with_validators.rs | 36 +-
 polkadot/erasure-coding/src/lib.rs | 93 +
 polkadot/node/core/approval-voting/src/lib.rs | 16 +-
 .../node/core/approval-voting/src/tests.rs | 2 +-
 polkadot/node/core/av-store/src/lib.rs | 76 +-
 polkadot/node/core/av-store/src/tests.rs | 380 ++-
 polkadot/node/core/backing/src/lib.rs | 28 +-
 polkadot/node/core/backing/src/tests/mod.rs | 44 +-
 .../src/tests/prospective_parachains.rs | 8 +-
 .../node/core/bitfield-signing/src/lib.rs | 53 +-
 .../node/core/bitfield-signing/src/tests.rs | 4 +-
 .../src/participation/mod.rs | 1 +
 .../src/participation/tests.rs | 12 +-
 polkadot/node/jaeger/src/spans.rs | 8 +-
 .../availability-distribution/Cargo.toml | 2 +
 .../availability-distribution/src/error.rs | 8 +-
 .../availability-distribution/src/lib.rs | 41 +-
 .../src/requester/fetch_task/mod.rs | 131 +-
 .../src/requester/fetch_task/tests.rs | 291 ++-
 .../src/requester/mod.rs | 127 +-
 .../src/requester/session_cache.rs | 63 +-
 .../src/requester/tests.rs | 36 +-
 .../src/responder.rs | 124 +-
 .../src/tests/mock.rs | 26 +-
 .../src/tests/mod.rs | 121 +-
 .../src/tests/state.rs | 196 +-
 .../network/availability-recovery/Cargo.toml | 3 +-
 .../availability-recovery-regression-bench.rs | 4 +-
 .../availability-recovery/src/error.rs | 58 +-
 .../network/availability-recovery/src/lib.rs | 562 +++--
 .../availability-recovery/src/metrics.rs | 242 +-
 .../network/availability-recovery/src/task.rs | 861 -------
 .../availability-recovery/src/task/mod.rs | 197 ++
 .../src/task/strategy/chunks.rs | 335 +++
 .../src/task/strategy/full.rs | 174 ++
 .../src/task/strategy/mod.rs | 1558 ++++++++++++
 .../src/task/strategy/systematic.rs | 343 +++
 .../availability-recovery/src/tests.rs | 2140 ++++++++++++++---
 polkadot/node/network/bridge/src/tx/mod.rs | 10 +-
 .../protocol/src/request_response/mod.rs | 12 +-
 .../protocol/src/request_response/outgoing.rs | 36 +-
 .../protocol/src/request_response/v1.rs | 14 +-
 .../protocol/src/request_response/v2.rs | 62 +-
 polkadot/node/overseer/src/tests.rs | 1 +
 polkadot/node/primitives/src/lib.rs | 7 +-
 polkadot/node/service/src/lib.rs | 8 +-
 polkadot/node/service/src/overseer.rs | 24 +-
 polkadot/node/subsystem-bench/Cargo.toml | 1 +
 .../examples/availability_read.yaml | 8 +-
 .../src/lib/availability/mod.rs | 124 +-
 .../src/lib/availability/test_state.rs | 41 +-
 .../subsystem-bench/src/lib/mock/av_store.rs | 111 +-
 .../src/lib/mock/network_bridge.rs | 2 +-
 .../src/lib/mock/runtime_api.rs | 29 +-
 .../node/subsystem-bench/src/lib/network.rs | 8 +-
 .../node/subsystem-test-helpers/src/lib.rs | 4 +-
 polkadot/node/subsystem-types/Cargo.toml | 1 +
 polkadot/node/subsystem-types/src/errors.rs | 26 +-
 polkadot/node/subsystem-types/src/messages.rs | 11 +-
 polkadot/node/subsystem-util/Cargo.toml | 1 +
 .../subsystem-util/src/availability_chunks.rs | 227 ++
 polkadot/node/subsystem-util/src/lib.rs | 26 +-
 .../node/subsystem-util/src/runtime/error.rs | 2 +-
 .../node/subsystem-util/src/runtime/mod.rs | 11 +-
 polkadot/primitives/src/lib.rs | 40 +-
 polkadot/primitives/src/v7/mod.rs | 45 +-
 .../src/node/approval/approval-voting.md | 2 +-
 .../availability/availability-recovery.md | 249 +-
 .../src/types/overseer-protocol.md | 3 +
 .../functional/0013-enable-node-feature.js | 35 +
 .../0013-systematic-chunk-recovery.toml | 46 +
 .../0013-systematic-chunk-recovery.zndsl | 43 +
 ...-chunk-fetching-network-compatibility.toml | 48 +
 ...chunk-fetching-network-compatibility.zndsl | 53 +
 prdoc/pr_1644.prdoc | 59 +
 substrate/client/network/src/service.rs | 2 +-
 84 files changed, 7540 insertions(+), 2338 deletions(-)
 delete mode 100644 polkadot/node/network/availability-recovery/src/task.rs
 create mode 100644 polkadot/node/network/availability-recovery/src/task/mod.rs
 create mode 100644 polkadot/node/network/availability-recovery/src/task/strategy/chunks.rs
 create mode 100644 polkadot/node/network/availability-recovery/src/task/strategy/full.rs
 create mode 100644 polkadot/node/network/availability-recovery/src/task/strategy/mod.rs
 create mode 100644 polkadot/node/network/availability-recovery/src/task/strategy/systematic.rs
 create mode 100644 polkadot/node/subsystem-util/src/availability_chunks.rs
 create mode 100644 polkadot/zombienet_tests/functional/0013-enable-node-feature.js
 create mode 100644 polkadot/zombienet_tests/functional/0013-systematic-chunk-recovery.toml
 create mode 100644 polkadot/zombienet_tests/functional/0013-systematic-chunk-recovery.zndsl
 create mode 100644 polkadot/zombienet_tests/functional/0014-chunk-fetching-network-compatibility.toml
 create mode 100644 polkadot/zombienet_tests/functional/0014-chunk-fetching-network-compatibility.zndsl
 create mode 100644 prdoc/pr_1644.prdoc

diff --git a/.gitlab/pipeline/zombienet.yml b/.gitlab/pipeline/zombienet.yml
index 404b57b07c59f..7897e55e291bd 100644
--- a/.gitlab/pipeline/zombienet.yml
+++ b/.gitlab/pipeline/zombienet.yml
@@ -1,7 +1,7 @@
 .zombienet-refs:
   extends: .build-refs
   variables:
-    ZOMBIENET_IMAGE: "docker.io/paritytech/zombienet:v1.3.104"
+    ZOMBIENET_IMAGE: "docker.io/paritytech/zombienet:v1.3.105"
     PUSHGATEWAY_URL: "http://zombienet-prometheus-pushgateway.managed-monitoring:9091/metrics/job/zombie-metrics"
     DEBUG: "zombie,zombie::network-node,zombie::kube::client::logs"
 
diff --git a/.gitlab/pipeline/zombienet/polkadot.yml b/.gitlab/pipeline/zombienet/polkadot.yml
index a9f0eb9303371..b158cbe0b5aa3 100644
--- a/.gitlab/pipeline/zombienet/polkadot.yml
+++ b/.gitlab/pipeline/zombienet/polkadot.yml
@@ -183,6 +183,22 @@ zombienet-polkadot-functional-0012-spam-statement-distribution-requests:
       --local-dir="${LOCAL_DIR}/functional"
       --test="0012-spam-statement-distribution-requests.zndsl"
 
+zombienet-polkadot-functional-0013-systematic-chunk-recovery:
+  extends:
+    - .zombienet-polkadot-common
+  script:
+    - /home/nonroot/zombie-net/scripts/ci/run-test-local-env-manager.sh
+      --local-dir="${LOCAL_DIR}/functional"
+      --test="0013-systematic-chunk-recovery.zndsl"
+
+zombienet-polkadot-functional-0014-chunk-fetching-network-compatibility:
+  extends:
+    - .zombienet-polkadot-common
+  script:
+    - /home/nonroot/zombie-net/scripts/ci/run-test-local-env-manager.sh
+      --local-dir="${LOCAL_DIR}/functional"
+      --test="0014-chunk-fetching-network-compatibility.zndsl"
+
zombienet-polkadot-smoke-0001-parachains-smoke-test: extends: - .zombienet-polkadot-common diff --git a/Cargo.lock b/Cargo.lock index 3d6cbc9e83f91..6240d9db2ea6a 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -12625,6 +12625,7 @@ dependencies = [ "polkadot-primitives-test-helpers", "polkadot-subsystem-bench", "rand 0.8.5", + "rstest", "sc-network", "schnellru", "sp-core", @@ -12641,7 +12642,6 @@ version = "7.0.0" dependencies = [ "assert_matches", "async-trait", - "env_logger 0.11.3", "fatality", "futures", "futures-timer", @@ -12657,11 +12657,13 @@ dependencies = [ "polkadot-primitives-test-helpers", "polkadot-subsystem-bench", "rand 0.8.5", + "rstest", "sc-network", "schnellru", "sp-application-crypto", "sp-core", "sp-keyring", + "sp-tracing 16.0.0", "thiserror", "tokio", "tracing-gum", @@ -12789,6 +12791,7 @@ dependencies = [ "parity-scale-codec", "polkadot-node-primitives", "polkadot-primitives", + "quickcheck", "reed-solomon-novelpoly", "sp-core", "sp-trie", @@ -13435,6 +13438,7 @@ dependencies = [ "async-trait", "bitvec", "derive_more", + "fatality", "futures", "orchestra", "polkadot-node-jaeger", @@ -13477,6 +13481,7 @@ dependencies = [ "parity-scale-codec", "parking_lot 0.12.1", "pin-project", + "polkadot-erasure-coding", "polkadot-node-jaeger", "polkadot-node-metrics", "polkadot-node-network-protocol", @@ -14564,6 +14569,7 @@ dependencies = [ "sp-keystore", "sp-runtime", "sp-timestamp", + "strum 0.24.1", "substrate-prometheus-endpoint", "tokio", "tracing-gum", diff --git a/cumulus/client/pov-recovery/src/active_candidate_recovery.rs b/cumulus/client/pov-recovery/src/active_candidate_recovery.rs index 2c635320ff4ae..c41c543f04d1f 100644 --- a/cumulus/client/pov-recovery/src/active_candidate_recovery.rs +++ b/cumulus/client/pov-recovery/src/active_candidate_recovery.rs @@ -56,6 +56,7 @@ impl ActiveCandidateRecovery { candidate.receipt.clone(), candidate.session_index, None, + None, tx, ), "ActiveCandidateRecovery", diff --git 
a/cumulus/client/relay-chain-minimal-node/src/lib.rs b/cumulus/client/relay-chain-minimal-node/src/lib.rs index b84427c3a75a5..699393e2d48a7 100644 --- a/cumulus/client/relay-chain-minimal-node/src/lib.rs +++ b/cumulus/client/relay-chain-minimal-node/src/lib.rs @@ -285,5 +285,8 @@ fn build_request_response_protocol_receivers< let cfg = Protocol::ChunkFetchingV1.get_outbound_only_config::<_, Network>(request_protocol_names); config.add_request_response_protocol(cfg); + let cfg = + Protocol::ChunkFetchingV2.get_outbound_only_config::<_, Network>(request_protocol_names); + config.add_request_response_protocol(cfg); (collation_req_v1_receiver, collation_req_v2_receiver, available_data_req_receiver) } diff --git a/cumulus/test/service/src/lib.rs b/cumulus/test/service/src/lib.rs index f2a612803861c..6f8b9d19bb29b 100644 --- a/cumulus/test/service/src/lib.rs +++ b/cumulus/test/service/src/lib.rs @@ -152,7 +152,7 @@ impl RecoveryHandle for FailingRecoveryHandle { message: AvailabilityRecoveryMessage, origin: &'static str, ) { - let AvailabilityRecoveryMessage::RecoverAvailableData(ref receipt, _, _, _) = message; + let AvailabilityRecoveryMessage::RecoverAvailableData(ref receipt, _, _, _, _) = message; let candidate_hash = receipt.hash(); // For every 3rd block we immediately signal unavailability to trigger @@ -160,7 +160,8 @@ impl RecoveryHandle for FailingRecoveryHandle { if self.counter % 3 == 0 && self.failed_hashes.insert(candidate_hash) { tracing::info!(target: LOG_TARGET, ?candidate_hash, "Failing pov recovery."); - let AvailabilityRecoveryMessage::RecoverAvailableData(_, _, _, back_sender) = message; + let AvailabilityRecoveryMessage::RecoverAvailableData(_, _, _, _, back_sender) = + message; back_sender .send(Err(RecoveryError::Unavailable)) .expect("Return channel should work here."); diff --git a/polkadot/erasure-coding/Cargo.toml b/polkadot/erasure-coding/Cargo.toml index b230631f72b04..bf152e03be711 100644 --- a/polkadot/erasure-coding/Cargo.toml +++ 
b/polkadot/erasure-coding/Cargo.toml @@ -19,6 +19,7 @@ sp-trie = { path = "../../substrate/primitives/trie" } thiserror = { workspace = true } [dev-dependencies] +quickcheck = { version = "1.0.3", default-features = false } criterion = { version = "0.5.1", default-features = false, features = ["cargo_bench_support"] } [[bench]] diff --git a/polkadot/erasure-coding/benches/README.md b/polkadot/erasure-coding/benches/README.md index 94fca5400c610..20f79827d280b 100644 --- a/polkadot/erasure-coding/benches/README.md +++ b/polkadot/erasure-coding/benches/README.md @@ -7,7 +7,8 @@ cargo bench ## `scaling_with_validators` This benchmark evaluates the performance of constructing the chunks and the erasure root from PoV and -reconstructing the PoV from chunks. You can see the results of running this bench on 5950x below. +reconstructing the PoV from chunks (either from systematic chunks or regular chunks). +You can see the results of running this bench on 5950x below (only including recovery from regular chunks). Interestingly, with `10_000` chunks (validators) its slower than with `50_000` for both construction and reconstruction. ``` @@ -37,3 +38,6 @@ reconstruct/10000 time: [496.35 ms 505.17 ms 515.42 ms] reconstruct/50000 time: [276.56 ms 277.53 ms 278.58 ms] thrpt: [17.948 MiB/s 18.016 MiB/s 18.079 MiB/s] ``` + +Results from running on an Apple M2 Pro, systematic recovery is generally 40 times faster than +regular recovery, achieving 1 Gib/s. 
diff --git a/polkadot/erasure-coding/benches/scaling_with_validators.rs b/polkadot/erasure-coding/benches/scaling_with_validators.rs
index 759385bbdef4e..3d743faa4169b 100644
--- a/polkadot/erasure-coding/benches/scaling_with_validators.rs
+++ b/polkadot/erasure-coding/benches/scaling_with_validators.rs
@@ -53,12 +53,16 @@ fn construct_and_reconstruct_5mb_pov(c: &mut Criterion) {
 	}
 	group.finish();
 
-	let mut group = c.benchmark_group("reconstruct");
+	let mut group = c.benchmark_group("reconstruct_regular");
 	for n_validators in N_VALIDATORS {
 		let all_chunks = chunks(n_validators, &pov);
 
-		let mut c: Vec<_> = all_chunks.iter().enumerate().map(|(i, c)| (&c[..], i)).collect();
-		let last_chunks = c.split_off((c.len() - 1) * 2 / 3);
+		let chunks: Vec<_> = all_chunks
+			.iter()
+			.enumerate()
+			.take(polkadot_erasure_coding::recovery_threshold(n_validators).unwrap())
+			.map(|(i, c)| (&c[..], i))
+			.collect();
 
 		group.throughput(Throughput::Bytes(pov.len() as u64));
 		group.bench_with_input(
@@ -67,7 +71,31 @@ fn construct_and_reconstruct_5mb_pov(c: &mut Criterion) {
 			|b, &n| {
 				b.iter(|| {
 					let _pov: Vec<u8> =
-						polkadot_erasure_coding::reconstruct(n, last_chunks.clone()).unwrap();
+						polkadot_erasure_coding::reconstruct(n, chunks.clone()).unwrap();
+				});
+			},
+		);
+	}
+	group.finish();
+
+	let mut group = c.benchmark_group("reconstruct_systematic");
+	for n_validators in N_VALIDATORS {
+		let all_chunks = chunks(n_validators, &pov);
+
+		let chunks = all_chunks
+			.into_iter()
+			.take(polkadot_erasure_coding::systematic_recovery_threshold(n_validators).unwrap())
+			.collect::<Vec<_>>();
+
+		group.throughput(Throughput::Bytes(pov.len() as u64));
+		group.bench_with_input(
+			BenchmarkId::from_parameter(n_validators),
+			&n_validators,
+			|b, &n| {
+				b.iter(|| {
+					let _pov: Vec<u8> =
+						polkadot_erasure_coding::reconstruct_from_systematic(n, chunks.clone())
+							.unwrap();
 				});
 			},
 		);
diff --git a/polkadot/erasure-coding/src/lib.rs b/polkadot/erasure-coding/src/lib.rs
index e5155df4beba9..b354c3dac64ce 100644
--- a/polkadot/erasure-coding/src/lib.rs
+++ b/polkadot/erasure-coding/src/lib.rs
@@ -69,6 +69,9 @@ pub enum Error {
 	/// Bad payload in reconstructed bytes.
 	#[error("Reconstructed payload invalid")]
 	BadPayload,
+	/// Unable to decode reconstructed bytes.
+	#[error("Unable to decode reconstructed payload: {0}")]
+	Decode(#[source] parity_scale_codec::Error),
 	/// Invalid branch proof.
 	#[error("Invalid branch proof")]
 	InvalidBranchProof,
@@ -110,6 +113,14 @@ pub const fn recovery_threshold(n_validators: usize) -> Result<usize, Error> {
 	Ok(needed + 1)
 }
 
+/// Obtain the threshold of systematic chunks that should be enough to recover the data.
+///
+/// If the regular `recovery_threshold` is a power of two, then it returns the same value.
+/// Otherwise, it returns the next lower power of two.
+pub fn systematic_recovery_threshold(n_validators: usize) -> Result<usize, Error> {
+	code_params(n_validators).map(|params| params.k())
+}
+
 fn code_params(n_validators: usize) -> Result<CodeParams, Error> {
 	// we need to be able to reconstruct from 1/3 - eps
 
@@ -127,6 +138,41 @@ fn code_params(n_validators: usize) -> Result<CodeParams, Error> {
 	})
 }
 
+/// Reconstruct the v1 available data from the set of systematic chunks.
+///
+/// Provide a vector containing chunk data. If too few chunks are provided, recovery is not
+/// possible.
+pub fn reconstruct_from_systematic_v1(
+	n_validators: usize,
+	chunks: Vec<Vec<u8>>,
+) -> Result<AvailableData, Error> {
+	reconstruct_from_systematic(n_validators, chunks)
+}
+
+/// Reconstruct the available data from the set of systematic chunks.
+///
+/// Provide a vector containing the first k chunks in order. If too few chunks are provided,
+/// recovery is not possible.
+pub fn reconstruct_from_systematic<T: Decode>(
+	n_validators: usize,
+	chunks: Vec<Vec<u8>>,
+) -> Result<T, Error> {
+	let code_params = code_params(n_validators)?;
+	let k = code_params.k();
+
+	for chunk_data in chunks.iter().take(k) {
+		if chunk_data.len() % 2 != 0 {
+			return Err(Error::UnevenLength)
+		}
+	}
+
+	let bytes = code_params.make_encoder().reconstruct_from_systematic(
+		chunks.into_iter().take(k).map(|data| WrappedShard::new(data)).collect(),
+	)?;
+
+	Decode::decode(&mut &bytes[..]).map_err(|err| Error::Decode(err))
+}
+
 /// Obtain erasure-coded chunks for v1 `AvailableData`, one for each validator.
 ///
 /// Works only up to 65536 validators, and `n_validators` must be non-zero.
@@ -285,13 +331,41 @@ pub fn branch_hash(root: &H256, branch_nodes: &Proof, index: usize) -> Result

Self { + // Limit the POV len to 1 mib, otherwise the test will take forever + let pov_len = (u32::arbitrary(g) % (1024 * 1024)).max(2); + + let pov = (0..pov_len).map(|_| u8::arbitrary(g)).collect(); + + let pvd = PersistedValidationData { + parent_head: HeadData((0..u16::arbitrary(g)).map(|_| u8::arbitrary(g)).collect()), + relay_parent_number: u32::arbitrary(g), + relay_parent_storage_root: [u8::arbitrary(g); 32].into(), + max_pov_size: u32::arbitrary(g), + }; + + ArbitraryAvailableData(AvailableData { + pov: Arc::new(PoV { block_data: BlockData(pov) }), + validation_data: pvd, + }) + } + } + #[test] fn field_order_is_right_size() { assert_eq!(MAX_VALIDATORS, 65536); @@ -318,6 +392,25 @@ mod tests { assert_eq!(reconstructed, available_data); } + #[test] + fn round_trip_systematic_works() { + fn property(available_data: ArbitraryAvailableData, n_validators: u16) { + let n_validators = n_validators.max(2); + let kpow2 = systematic_recovery_threshold(n_validators as usize).unwrap(); + let chunks = obtain_chunks(n_validators as usize, &available_data.0).unwrap(); + assert_eq!( + reconstruct_from_systematic_v1( + n_validators as usize, + chunks.into_iter().take(kpow2).collect() + ) + .unwrap(), + available_data.0 + ); + } + + QuickCheck::new().quickcheck(property as fn(ArbitraryAvailableData, u16)) + } + #[test] fn reconstruct_does_not_panic_on_low_validator_count() { let reconstructed = reconstruct_v1(1, [].iter().cloned()); diff --git a/polkadot/node/core/approval-voting/src/lib.rs b/polkadot/node/core/approval-voting/src/lib.rs index b5ed92fa39c87..c667aee73613f 100644 --- a/polkadot/node/core/approval-voting/src/lib.rs +++ b/polkadot/node/core/approval-voting/src/lib.rs @@ -914,6 +914,7 @@ enum Action { candidate: CandidateReceipt, backing_group: GroupIndex, distribute_assignment: bool, + core_index: Option, }, NoteApprovedInChainSelection(Hash), IssueApproval(CandidateHash, ApprovalVoteRequest), @@ -1174,6 +1175,7 @@ async fn handle_actions( candidate, 
backing_group, distribute_assignment, + core_index, } => { // Don't launch approval work if the node is syncing. if let Mode::Syncing(_) = *mode { @@ -1230,6 +1232,7 @@ async fn handle_actions( block_hash, backing_group, executor_params, + core_index, &launch_approval_span, ) .await @@ -1467,6 +1470,7 @@ async fn distribution_messages_for_activation( candidate: candidate_entry.candidate_receipt().clone(), backing_group: approval_entry.backing_group(), distribute_assignment: false, + core_index: Some(*core_index), }); } }, @@ -3050,6 +3054,11 @@ async fn process_wakeup( "Launching approval work.", ); + let candidate_core_index = block_entry + .candidates() + .iter() + .find_map(|(core_index, h)| (h == &candidate_hash).then_some(*core_index)); + if let Some(claimed_core_indices) = get_assignment_core_indices(&indirect_cert.cert.kind, &candidate_hash, &block_entry) { @@ -3062,7 +3071,6 @@ async fn process_wakeup( true }; db.write_block_entry(block_entry.clone()); - actions.push(Action::LaunchApproval { claimed_candidate_indices, candidate_hash, @@ -3074,10 +3082,12 @@ async fn process_wakeup( candidate: candidate_receipt, backing_group, distribute_assignment, + core_index: candidate_core_index, }); }, Err(err) => { - // Never happens, it should only happen if no cores are claimed, which is a bug. + // Never happens, it should only happen if no cores are claimed, which is a + // bug. 
gum::warn!( target: LOG_TARGET, block_hash = ?relay_block, @@ -3133,6 +3143,7 @@ async fn launch_approval( block_hash: Hash, backing_group: GroupIndex, executor_params: ExecutorParams, + core_index: Option, span: &jaeger::Span, ) -> SubsystemResult> { let (a_tx, a_rx) = oneshot::channel(); @@ -3179,6 +3190,7 @@ async fn launch_approval( candidate.clone(), session_index, Some(backing_group), + core_index, a_tx, )) .await; diff --git a/polkadot/node/core/approval-voting/src/tests.rs b/polkadot/node/core/approval-voting/src/tests.rs index 312d805bbefb7..c3709de59e8bc 100644 --- a/polkadot/node/core/approval-voting/src/tests.rs +++ b/polkadot/node/core/approval-voting/src/tests.rs @@ -3330,7 +3330,7 @@ async fn recover_available_data(virtual_overseer: &mut VirtualOverseer) { assert_matches!( virtual_overseer.recv().await, AllMessages::AvailabilityRecovery( - AvailabilityRecoveryMessage::RecoverAvailableData(_, _, _, tx) + AvailabilityRecoveryMessage::RecoverAvailableData(_, _, _, _, tx) ) => { tx.send(Ok(available_data)).unwrap(); }, diff --git a/polkadot/node/core/av-store/src/lib.rs b/polkadot/node/core/av-store/src/lib.rs index 68db4686a9740..59a35a6a45a91 100644 --- a/polkadot/node/core/av-store/src/lib.rs +++ b/polkadot/node/core/av-store/src/lib.rs @@ -48,8 +48,10 @@ use polkadot_node_subsystem::{ }; use polkadot_node_subsystem_util as util; use polkadot_primitives::{ - BlockNumber, CandidateEvent, CandidateHash, CandidateReceipt, Hash, Header, ValidatorIndex, + BlockNumber, CandidateEvent, CandidateHash, CandidateReceipt, ChunkIndex, CoreIndex, Hash, + Header, NodeFeatures, ValidatorIndex, }; +use util::availability_chunks::availability_chunk_indices; mod metrics; pub use self::metrics::*; @@ -208,9 +210,9 @@ fn load_chunk( db: &Arc, config: &Config, candidate_hash: &CandidateHash, - chunk_index: ValidatorIndex, + validator_index: ValidatorIndex, ) -> Result, Error> { - let key = (CHUNK_PREFIX, candidate_hash, chunk_index).encode(); + let key = (CHUNK_PREFIX, 
candidate_hash, validator_index).encode(); query_inner(db, config.col_data, &key) } @@ -219,10 +221,10 @@ fn write_chunk( tx: &mut DBTransaction, config: &Config, candidate_hash: &CandidateHash, - chunk_index: ValidatorIndex, + validator_index: ValidatorIndex, erasure_chunk: &ErasureChunk, ) { - let key = (CHUNK_PREFIX, candidate_hash, chunk_index).encode(); + let key = (CHUNK_PREFIX, candidate_hash, validator_index).encode(); tx.put_vec(config.col_data, &key, erasure_chunk.encode()); } @@ -231,9 +233,9 @@ fn delete_chunk( tx: &mut DBTransaction, config: &Config, candidate_hash: &CandidateHash, - chunk_index: ValidatorIndex, + validator_index: ValidatorIndex, ) { - let key = (CHUNK_PREFIX, candidate_hash, chunk_index).encode(); + let key = (CHUNK_PREFIX, candidate_hash, validator_index).encode(); tx.delete(config.col_data, &key[..]); } @@ -1139,20 +1141,23 @@ fn process_message( Some(meta) => { let mut chunks = Vec::new(); - for (index, _) in meta.chunks_stored.iter().enumerate().filter(|(_, b)| **b) { + for (validator_index, _) in + meta.chunks_stored.iter().enumerate().filter(|(_, b)| **b) + { + let validator_index = ValidatorIndex(validator_index as _); let _timer = subsystem.metrics.time_get_chunk(); match load_chunk( &subsystem.db, &subsystem.config, &candidate, - ValidatorIndex(index as _), + validator_index, )? 
{ - Some(c) => chunks.push(c), + Some(c) => chunks.push((validator_index, c)), None => { gum::warn!( target: LOG_TARGET, ?candidate, - index, + ?validator_index, "No chunk found for set bit in meta" ); }, @@ -1169,11 +1174,17 @@ fn process_message( }); let _ = tx.send(a); }, - AvailabilityStoreMessage::StoreChunk { candidate_hash, chunk, tx } => { + AvailabilityStoreMessage::StoreChunk { candidate_hash, validator_index, chunk, tx } => { subsystem.metrics.on_chunks_received(1); let _timer = subsystem.metrics.time_store_chunk(); - match store_chunk(&subsystem.db, &subsystem.config, candidate_hash, chunk) { + match store_chunk( + &subsystem.db, + &subsystem.config, + candidate_hash, + validator_index, + chunk, + ) { Ok(true) => { let _ = tx.send(Ok(())); }, @@ -1191,6 +1202,8 @@ fn process_message( n_validators, available_data, expected_erasure_root, + core_index, + node_features, tx, } => { subsystem.metrics.on_chunks_received(n_validators as _); @@ -1203,6 +1216,8 @@ fn process_message( n_validators as _, available_data, expected_erasure_root, + core_index, + node_features, ); match res { @@ -1233,6 +1248,7 @@ fn store_chunk( db: &Arc, config: &Config, candidate_hash: CandidateHash, + validator_index: ValidatorIndex, chunk: ErasureChunk, ) -> Result { let mut tx = DBTransaction::new(); @@ -1242,12 +1258,12 @@ fn store_chunk( None => return Ok(false), // we weren't informed of this candidate by import events. }; - match meta.chunks_stored.get(chunk.index.0 as usize).map(|b| *b) { + match meta.chunks_stored.get(validator_index.0 as usize).map(|b| *b) { Some(true) => return Ok(true), // already stored. 
Some(false) => { - meta.chunks_stored.set(chunk.index.0 as usize, true); + meta.chunks_stored.set(validator_index.0 as usize, true); - write_chunk(&mut tx, config, &candidate_hash, chunk.index, &chunk); + write_chunk(&mut tx, config, &candidate_hash, validator_index, &chunk); write_meta(&mut tx, config, &candidate_hash, &meta); }, None => return Ok(false), // out of bounds. @@ -1257,6 +1273,7 @@ fn store_chunk( target: LOG_TARGET, ?candidate_hash, chunk_index = %chunk.index.0, + validator_index = %validator_index.0, "Stored chunk index for candidate.", ); @@ -1264,13 +1281,14 @@ fn store_chunk( Ok(true) } -// Ok(true) on success, Ok(false) on failure, and Err on internal error. fn store_available_data( subsystem: &AvailabilityStoreSubsystem, candidate_hash: CandidateHash, n_validators: usize, available_data: AvailableData, expected_erasure_root: Hash, + core_index: CoreIndex, + node_features: NodeFeatures, ) -> Result<(), Error> { let mut tx = DBTransaction::new(); @@ -1312,16 +1330,26 @@ fn store_available_data( drop(erasure_span); - let erasure_chunks = chunks.iter().zip(branches.map(|(proof, _)| proof)).enumerate().map( - |(index, (chunk, proof))| ErasureChunk { + let erasure_chunks: Vec<_> = chunks + .iter() + .zip(branches.map(|(proof, _)| proof)) + .enumerate() + .map(|(index, (chunk, proof))| ErasureChunk { chunk: chunk.clone(), proof, - index: ValidatorIndex(index as u32), - }, - ); + index: ChunkIndex(index as u32), + }) + .collect(); - for chunk in erasure_chunks { - write_chunk(&mut tx, &subsystem.config, &candidate_hash, chunk.index, &chunk); + let chunk_indices = availability_chunk_indices(Some(&node_features), n_validators, core_index)?; + for (validator_index, chunk_index) in chunk_indices.into_iter().enumerate() { + write_chunk( + &mut tx, + &subsystem.config, + &candidate_hash, + ValidatorIndex(validator_index as u32), + &erasure_chunks[chunk_index.0 as usize], + ); } meta.data_available = true; diff --git a/polkadot/node/core/av-store/src/tests.rs 
b/polkadot/node/core/av-store/src/tests.rs index 652bf2a3fda48..e87f7cc3b8d6c 100644 --- a/polkadot/node/core/av-store/src/tests.rs +++ b/polkadot/node/core/av-store/src/tests.rs @@ -18,6 +18,7 @@ use super::*; use assert_matches::assert_matches; use futures::{channel::oneshot, executor, future, Future}; +use util::availability_chunks::availability_chunk_index; use self::test_helpers::mock::new_leaf; use ::test_helpers::TestCandidateBuilder; @@ -31,7 +32,7 @@ use polkadot_node_subsystem::{ use polkadot_node_subsystem_test_helpers as test_helpers; use polkadot_node_subsystem_util::{database::Database, TimeoutExt}; use polkadot_primitives::{ - CandidateHash, CandidateReceipt, CoreIndex, GroupIndex, HeadData, Header, + node_features, CandidateHash, CandidateReceipt, CoreIndex, GroupIndex, HeadData, Header, PersistedValidationData, ValidatorId, }; use sp_keyring::Sr25519Keyring; @@ -272,8 +273,7 @@ fn runtime_api_error_does_not_stop_the_subsystem() { // but that's fine, we're still alive let (tx, rx) = oneshot::channel(); let candidate_hash = CandidateHash(Hash::repeat_byte(33)); - let validator_index = ValidatorIndex(5); - let query_chunk = AvailabilityStoreMessage::QueryChunk(candidate_hash, validator_index, tx); + let query_chunk = AvailabilityStoreMessage::QueryChunk(candidate_hash, 5.into(), tx); overseer_send(&mut virtual_overseer, query_chunk.into()).await; @@ -288,12 +288,13 @@ fn store_chunk_works() { test_harness(TestState::default(), store.clone(), |mut virtual_overseer| async move { let candidate_hash = CandidateHash(Hash::repeat_byte(33)); - let validator_index = ValidatorIndex(5); + let chunk_index = ChunkIndex(5); + let validator_index = ValidatorIndex(2); let n_validators = 10; let chunk = ErasureChunk { chunk: vec![1, 2, 3], - index: validator_index, + index: chunk_index, proof: Proof::try_from(vec![vec![3, 4, 5]]).unwrap(), }; @@ -314,8 +315,12 @@ fn store_chunk_works() { let (tx, rx) = oneshot::channel(); - let chunk_msg = - 
AvailabilityStoreMessage::StoreChunk { candidate_hash, chunk: chunk.clone(), tx }; + let chunk_msg = AvailabilityStoreMessage::StoreChunk { + candidate_hash, + validator_index, + chunk: chunk.clone(), + tx, + }; overseer_send(&mut virtual_overseer, chunk_msg).await; assert_eq!(rx.await.unwrap(), Ok(())); @@ -336,18 +341,23 @@ fn store_chunk_does_nothing_if_no_entry_already() { test_harness(TestState::default(), store.clone(), |mut virtual_overseer| async move { let candidate_hash = CandidateHash(Hash::repeat_byte(33)); - let validator_index = ValidatorIndex(5); + let chunk_index = ChunkIndex(5); + let validator_index = ValidatorIndex(2); let chunk = ErasureChunk { chunk: vec![1, 2, 3], - index: validator_index, + index: chunk_index, proof: Proof::try_from(vec![vec![3, 4, 5]]).unwrap(), }; let (tx, rx) = oneshot::channel(); - let chunk_msg = - AvailabilityStoreMessage::StoreChunk { candidate_hash, chunk: chunk.clone(), tx }; + let chunk_msg = AvailabilityStoreMessage::StoreChunk { + candidate_hash, + validator_index, + chunk: chunk.clone(), + tx, + }; overseer_send(&mut virtual_overseer, chunk_msg).await; assert_eq!(rx.await.unwrap(), Err(())); @@ -418,6 +428,8 @@ fn store_available_data_erasure_mismatch() { let candidate_hash = CandidateHash(Hash::repeat_byte(1)); let validator_index = ValidatorIndex(5); let n_validators = 10; + let core_index = CoreIndex(8); + let node_features = NodeFeatures::EMPTY; let pov = PoV { block_data: BlockData(vec![4, 5, 6]) }; @@ -431,6 +443,8 @@ fn store_available_data_erasure_mismatch() { candidate_hash, n_validators, available_data: available_data.clone(), + core_index, + node_features, tx, // A dummy erasure root should lead to failure. 
expected_erasure_root: Hash::default(), @@ -450,97 +464,183 @@ fn store_available_data_erasure_mismatch() { } #[test] -fn store_block_works() { - let store = test_store(); - let test_state = TestState::default(); - test_harness(test_state.clone(), store.clone(), |mut virtual_overseer| async move { - let candidate_hash = CandidateHash(Hash::repeat_byte(1)); - let validator_index = ValidatorIndex(5); - let n_validators = 10; - - let pov = PoV { block_data: BlockData(vec![4, 5, 6]) }; - - let available_data = AvailableData { - pov: Arc::new(pov), - validation_data: test_state.persisted_validation_data.clone(), - }; - let (tx, rx) = oneshot::channel(); - - let chunks = erasure::obtain_chunks_v1(10, &available_data).unwrap(); - let mut branches = erasure::branches(chunks.as_ref()); - - let block_msg = AvailabilityStoreMessage::StoreAvailableData { - candidate_hash, - n_validators, - available_data: available_data.clone(), - tx, - expected_erasure_root: branches.root(), - }; - - virtual_overseer.send(FromOrchestra::Communication { msg: block_msg }).await; - assert_eq!(rx.await.unwrap(), Ok(())); - - let pov = query_available_data(&mut virtual_overseer, candidate_hash).await.unwrap(); - assert_eq!(pov, available_data); - - let chunk = query_chunk(&mut virtual_overseer, candidate_hash, validator_index) - .await - .unwrap(); - - let branch = branches.nth(5).unwrap(); - let expected_chunk = ErasureChunk { - chunk: branch.1.to_vec(), - index: ValidatorIndex(5), - proof: Proof::try_from(branch.0).unwrap(), - }; - - assert_eq!(chunk, expected_chunk); - virtual_overseer - }); -} - -#[test] -fn store_pov_and_query_chunk_works() { - let store = test_store(); - let test_state = TestState::default(); - - test_harness(test_state.clone(), store.clone(), |mut virtual_overseer| async move { - let candidate_hash = CandidateHash(Hash::repeat_byte(1)); - let n_validators = 10; - - let pov = PoV { block_data: BlockData(vec![4, 5, 6]) }; - - let available_data = AvailableData { - pov: 
Arc::new(pov), - validation_data: test_state.persisted_validation_data.clone(), - }; - - let chunks_expected = - erasure::obtain_chunks_v1(n_validators as _, &available_data).unwrap(); - let branches = erasure::branches(chunks_expected.as_ref()); - - let (tx, rx) = oneshot::channel(); - let block_msg = AvailabilityStoreMessage::StoreAvailableData { - candidate_hash, - n_validators, - available_data, - tx, - expected_erasure_root: branches.root(), - }; - - virtual_overseer.send(FromOrchestra::Communication { msg: block_msg }).await; +fn store_pov_and_queries_work() { + // If the AvailabilityChunkMapping feature is not enabled, + // ValidatorIndex->ChunkIndex mapping should be 1:1 for all core indices. + { + let n_cores = 4; + for core_index in 0..n_cores { + let store = test_store(); + let test_state = TestState::default(); + let core_index = CoreIndex(core_index); + + test_harness(test_state.clone(), store.clone(), |mut virtual_overseer| async move { + let node_features = NodeFeatures::EMPTY; + let candidate_hash = CandidateHash(Hash::repeat_byte(1)); + let n_validators = 10; + + let pov = PoV { block_data: BlockData(vec![4, 5, 6]) }; + let available_data = AvailableData { + pov: Arc::new(pov), + validation_data: test_state.persisted_validation_data.clone(), + }; + + let chunks = erasure::obtain_chunks_v1(n_validators as _, &available_data).unwrap(); + + let branches = erasure::branches(chunks.as_ref()); + + let (tx, rx) = oneshot::channel(); + let block_msg = AvailabilityStoreMessage::StoreAvailableData { + candidate_hash, + n_validators, + available_data: available_data.clone(), + tx, + core_index, + expected_erasure_root: branches.root(), + node_features: node_features.clone(), + }; + + virtual_overseer.send(FromOrchestra::Communication { msg: block_msg }).await; + assert_eq!(rx.await.unwrap(), Ok(())); + + let pov: AvailableData = + query_available_data(&mut virtual_overseer, candidate_hash).await.unwrap(); + assert_eq!(pov, available_data); + + let 
query_all_chunks_res = query_all_chunks( + &mut virtual_overseer, + availability_chunk_indices( + Some(&node_features), + n_validators as usize, + core_index, + ) + .unwrap(), + candidate_hash, + ) + .await; + assert_eq!(query_all_chunks_res.len(), chunks.len()); + + let branches: Vec<_> = branches.collect(); + + for validator_index in 0..n_validators { + let chunk = query_chunk( + &mut virtual_overseer, + candidate_hash, + ValidatorIndex(validator_index as _), + ) + .await + .unwrap(); + let branch = &branches[validator_index as usize]; + let expected_chunk = ErasureChunk { + chunk: branch.1.to_vec(), + index: validator_index.into(), + proof: Proof::try_from(branch.0.clone()).unwrap(), + }; + assert_eq!(chunk, expected_chunk); + assert_eq!(chunk, query_all_chunks_res[validator_index as usize]); + } - assert_eq!(rx.await.unwrap(), Ok(())); + virtual_overseer + }); + } + } - for i in 0..n_validators { - let chunk = query_chunk(&mut virtual_overseer, candidate_hash, ValidatorIndex(i as _)) - .await - .unwrap(); + // If the AvailabilityChunkMapping feature is enabled, let's also test the + // ValidatorIndex -> ChunkIndex mapping. 
+ { + let n_cores = 4; + for core_index in 0..n_cores { + let store = test_store(); + let test_state = TestState::default(); + + test_harness(test_state.clone(), store.clone(), |mut virtual_overseer| async move { + let mut node_features = NodeFeatures::EMPTY; + let feature_bit = node_features::FeatureIndex::AvailabilityChunkMapping; + node_features.resize((feature_bit as u8 + 1) as usize, false); + node_features.set(feature_bit as u8 as usize, true); + + let candidate_hash = CandidateHash(Hash::repeat_byte(1)); + let n_validators = 10; + + let pov = PoV { block_data: BlockData(vec![4, 5, 6]) }; + let available_data = AvailableData { + pov: Arc::new(pov), + validation_data: test_state.persisted_validation_data.clone(), + }; + + let chunks = erasure::obtain_chunks_v1(n_validators as _, &available_data).unwrap(); + + let branches = erasure::branches(chunks.as_ref()); + let core_index = CoreIndex(core_index); + + let (tx, rx) = oneshot::channel(); + let block_msg = AvailabilityStoreMessage::StoreAvailableData { + candidate_hash, + n_validators, + available_data: available_data.clone(), + tx, + core_index, + expected_erasure_root: branches.root(), + node_features: node_features.clone(), + }; + + virtual_overseer.send(FromOrchestra::Communication { msg: block_msg }).await; + assert_eq!(rx.await.unwrap(), Ok(())); + + let pov: AvailableData = + query_available_data(&mut virtual_overseer, candidate_hash).await.unwrap(); + assert_eq!(pov, available_data); + + let query_all_chunks_res = query_all_chunks( + &mut virtual_overseer, + availability_chunk_indices( + Some(&node_features), + n_validators as usize, + core_index, + ) + .unwrap(), + candidate_hash, + ) + .await; + assert_eq!(query_all_chunks_res.len(), chunks.len()); + + let branches: Vec<_> = branches.collect(); + + for validator_index in 0..n_validators { + let chunk = query_chunk( + &mut virtual_overseer, + candidate_hash, + ValidatorIndex(validator_index as _), + ) + .await + .unwrap(); + let expected_chunk_index = 
availability_chunk_index( + Some(&node_features), + n_validators as usize, + core_index, + ValidatorIndex(validator_index), + ) + .unwrap(); + let branch = &branches[expected_chunk_index.0 as usize]; + let expected_chunk = ErasureChunk { + chunk: branch.1.to_vec(), + index: expected_chunk_index, + proof: Proof::try_from(branch.0.clone()).unwrap(), + }; + assert_eq!(chunk, expected_chunk); + assert_eq!( + &chunk, + query_all_chunks_res + .iter() + .find(|c| c.index == expected_chunk_index) + .unwrap() + ); + } - assert_eq!(chunk.chunk, chunks_expected[i as usize]); + virtual_overseer + }); } - virtual_overseer - }); + } } #[test] @@ -575,6 +675,8 @@ fn query_all_chunks_works() { n_validators, available_data, tx, + core_index: CoreIndex(1), + node_features: NodeFeatures::EMPTY, expected_erasure_root: branches.root(), }; @@ -598,7 +700,7 @@ fn query_all_chunks_works() { let chunk = ErasureChunk { chunk: vec![1, 2, 3], - index: ValidatorIndex(1), + index: ChunkIndex(1), proof: Proof::try_from(vec![vec![3, 4, 5]]).unwrap(), }; @@ -606,6 +708,7 @@ fn query_all_chunks_works() { let store_chunk_msg = AvailabilityStoreMessage::StoreChunk { candidate_hash: candidate_hash_2, chunk, + validator_index: ValidatorIndex(1), tx, }; @@ -615,29 +718,29 @@ fn query_all_chunks_works() { assert_eq!(rx.await.unwrap(), Ok(())); } - { - let (tx, rx) = oneshot::channel(); + let chunk_indices = + availability_chunk_indices(None, n_validators as usize, CoreIndex(0)).unwrap(); - let msg = AvailabilityStoreMessage::QueryAllChunks(candidate_hash_1, tx); - virtual_overseer.send(FromOrchestra::Communication { msg }).await; - assert_eq!(rx.await.unwrap().len(), n_validators as usize); - } - - { - let (tx, rx) = oneshot::channel(); - - let msg = AvailabilityStoreMessage::QueryAllChunks(candidate_hash_2, tx); - virtual_overseer.send(FromOrchestra::Communication { msg }).await; - assert_eq!(rx.await.unwrap().len(), 1); - } + assert_eq!( + query_all_chunks(&mut virtual_overseer, chunk_indices.clone(), 
candidate_hash_1) + .await + .len(), + n_validators as usize + ); - { - let (tx, rx) = oneshot::channel(); + assert_eq!( + query_all_chunks(&mut virtual_overseer, chunk_indices.clone(), candidate_hash_2) + .await + .len(), + 1 + ); + assert_eq!( + query_all_chunks(&mut virtual_overseer, chunk_indices.clone(), candidate_hash_3) + .await + .len(), + 0 + ); - let msg = AvailabilityStoreMessage::QueryAllChunks(candidate_hash_3, tx); - virtual_overseer.send(FromOrchestra::Communication { msg }).await; - assert_eq!(rx.await.unwrap().len(), 0); - } virtual_overseer }); } @@ -667,6 +770,8 @@ fn stored_but_not_included_data_is_pruned() { n_validators, available_data: available_data.clone(), tx, + node_features: NodeFeatures::EMPTY, + core_index: CoreIndex(1), expected_erasure_root: branches.root(), }; @@ -723,6 +828,8 @@ fn stored_data_kept_until_finalized() { n_validators, available_data: available_data.clone(), tx, + node_features: NodeFeatures::EMPTY, + core_index: CoreIndex(1), expected_erasure_root: branches.root(), }; @@ -998,6 +1105,8 @@ fn forkfullness_works() { n_validators, available_data: available_data_1.clone(), tx, + node_features: NodeFeatures::EMPTY, + core_index: CoreIndex(1), expected_erasure_root: branches.root(), }; @@ -1014,6 +1123,8 @@ fn forkfullness_works() { n_validators, available_data: available_data_2.clone(), tx, + node_features: NodeFeatures::EMPTY, + core_index: CoreIndex(1), expected_erasure_root: branches.root(), }; @@ -1126,6 +1237,25 @@ async fn query_chunk( rx.await.unwrap() } +async fn query_all_chunks( + virtual_overseer: &mut VirtualOverseer, + chunk_mapping: Vec<ChunkIndex>, + candidate_hash: CandidateHash, +) -> Vec<ErasureChunk> { + let (tx, rx) = oneshot::channel(); + + let msg = AvailabilityStoreMessage::QueryAllChunks(candidate_hash, tx); + virtual_overseer.send(FromOrchestra::Communication { msg }).await; + + let resp = rx.await.unwrap(); + resp.into_iter() + .map(|(val_idx, chunk)| { + assert_eq!(chunk.index, chunk_mapping[val_idx.0 as usize]); + chunk 
+ }) + .collect() +} + async fn has_all_chunks( virtual_overseer: &mut VirtualOverseer, candidate_hash: CandidateHash, @@ -1206,12 +1336,12 @@ fn query_chunk_size_works() { test_harness(TestState::default(), store.clone(), |mut virtual_overseer| async move { let candidate_hash = CandidateHash(Hash::repeat_byte(33)); - let validator_index = ValidatorIndex(5); + let chunk_index = ChunkIndex(5); let n_validators = 10; let chunk = ErasureChunk { chunk: vec![1, 2, 3], - index: validator_index, + index: chunk_index, proof: Proof::try_from(vec![vec![3, 4, 5]]).unwrap(), }; @@ -1232,8 +1362,12 @@ fn query_chunk_size_works() { let (tx, rx) = oneshot::channel(); - let chunk_msg = - AvailabilityStoreMessage::StoreChunk { candidate_hash, chunk: chunk.clone(), tx }; + let chunk_msg = AvailabilityStoreMessage::StoreChunk { + candidate_hash, + chunk: chunk.clone(), + tx, + validator_index: chunk_index.into(), + }; overseer_send(&mut virtual_overseer, chunk_msg).await; assert_eq!(rx.await.unwrap(), Ok(())); diff --git a/polkadot/node/core/backing/src/lib.rs b/polkadot/node/core/backing/src/lib.rs index a45edcbef52a9..2fa8ad29efe5f 100644 --- a/polkadot/node/core/backing/src/lib.rs +++ b/polkadot/node/core/backing/src/lib.rs @@ -210,6 +210,8 @@ struct PerRelayParentState { prospective_parachains_mode: ProspectiveParachainsMode, /// The hash of the relay parent on top of which this job is doing it's work. parent: Hash, + /// Session index. + session_index: SessionIndex, /// The `ParaId` assigned to the local validator at this relay parent. assigned_para: Option<ParaId>, /// The `CoreIndex` assigned to the local validator at this relay parent. 
@@ -534,6 +536,8 @@ async fn store_available_data( candidate_hash: CandidateHash, available_data: AvailableData, expected_erasure_root: Hash, + core_index: CoreIndex, + node_features: NodeFeatures, ) -> Result<(), Error> { let (tx, rx) = oneshot::channel(); // Important: the `av-store` subsystem will check if the erasure root of the `available_data` @@ -546,6 +550,8 @@ async fn store_available_data( n_validators, available_data, expected_erasure_root, + core_index, + node_features, tx, }) .await; @@ -569,6 +575,8 @@ async fn make_pov_available( candidate_hash: CandidateHash, validation_data: PersistedValidationData, expected_erasure_root: Hash, + core_index: CoreIndex, + node_features: NodeFeatures, ) -> Result<(), Error> { store_available_data( sender, @@ -576,6 +584,8 @@ async fn make_pov_available( candidate_hash, AvailableData { pov, validation_data }, expected_erasure_root, + core_index, + node_features, ) .await } @@ -646,6 +656,7 @@ struct BackgroundValidationParams { tx_command: mpsc::Sender<(Hash, ValidatedCandidateCommand)>, candidate: CandidateReceipt, relay_parent: Hash, + session_index: SessionIndex, persisted_validation_data: PersistedValidationData, pov: PoVData, n_validators: usize, @@ -657,12 +668,14 @@ async fn validate_and_make_available( impl overseer::CandidateBackingSenderTrait, impl Fn(BackgroundValidationResult) -> ValidatedCandidateCommand + Sync, >, + core_index: CoreIndex, ) -> Result<(), Error> { let BackgroundValidationParams { mut sender, mut tx_command, candidate, relay_parent, + session_index, persisted_validation_data, pov, n_validators, @@ -692,6 +705,10 @@ async fn validate_and_make_available( Err(e) => return Err(Error::UtilError(e)), }; + let node_features = request_node_features(relay_parent, session_index, &mut sender) + .await? 
+ .unwrap_or(NodeFeatures::EMPTY); + let pov = match pov { PoVData::Ready(pov) => pov, PoVData::FetchFromValidator { from_validator, candidate_hash, pov_hash } => @@ -747,6 +764,8 @@ async fn validate_and_make_available( candidate.hash(), validation_data.clone(), candidate.descriptor.erasure_root, + core_index, + node_features, ) .await; @@ -1191,6 +1210,7 @@ async fn construct_per_relay_parent_state( Ok(Some(PerRelayParentState { prospective_parachains_mode: mode, parent, + session_index, assigned_core, assigned_para, backed: HashSet::new(), @@ -1788,10 +1808,11 @@ async fn background_validate_and_make_available( >, ) -> Result<(), Error> { let candidate_hash = params.candidate.hash(); + let Some(core_index) = rp_state.assigned_core else { return Ok(()) }; if rp_state.awaiting_validation.insert(candidate_hash) { // spawn background task. let bg = async move { - if let Err(error) = validate_and_make_available(params).await { + if let Err(error) = validate_and_make_available(params, core_index).await { if let Error::BackgroundValidationMpsc(error) = error { gum::debug!( target: LOG_TARGET, @@ -1866,6 +1887,7 @@ async fn kick_off_validation_work( tx_command: background_validation_tx.clone(), candidate: attesting.candidate, relay_parent: rp_state.parent, + session_index: rp_state.session_index, persisted_validation_data, pov, n_validators: rp_state.table_context.validators.len(), @@ -2019,6 +2041,7 @@ async fn validate_and_second( tx_command: background_validation_tx.clone(), candidate: candidate.clone(), relay_parent: rp_state.parent, + session_index: rp_state.session_index, persisted_validation_data, pov: PoVData::Ready(pov), n_validators: rp_state.table_context.validators.len(), @@ -2084,8 +2107,7 @@ async fn handle_second_message( collation = ?candidate.descriptor().para_id, "Subsystem asked to second for para outside of our assignment", ); - - return Ok(()) + return Ok(()); } gum::debug!( diff --git a/polkadot/node/core/backing/src/tests/mod.rs 
b/polkadot/node/core/backing/src/tests/mod.rs index d1969e656db67..00f9e4cd8ff68 100644 --- a/polkadot/node/core/backing/src/tests/mod.rs +++ b/polkadot/node/core/backing/src/tests/mod.rs @@ -367,6 +367,15 @@ async fn assert_validation_requests( tx.send(Ok(Some(ExecutorParams::default()))).unwrap(); } ); + + assert_matches!( + virtual_overseer.recv().await, + AllMessages::RuntimeApi( + RuntimeApiMessage::Request(_, RuntimeApiRequest::NodeFeatures(sess_idx, tx)) + ) if sess_idx == 1 => { + tx.send(Ok(NodeFeatures::EMPTY)).unwrap(); + } + ); } async fn assert_validate_from_exhaustive( @@ -2084,7 +2093,7 @@ fn retry_works() { virtual_overseer.send(FromOrchestra::Communication { msg: statement }).await; // Not deterministic which message comes first: - for _ in 0u32..5 { + for _ in 0u32..6 { match virtual_overseer.recv().await { AllMessages::Provisioner(ProvisionerMessage::ProvisionableData( _, @@ -2115,6 +2124,12 @@ fn retry_works() { )) => { tx.send(Ok(Some(ExecutorParams::default()))).unwrap(); }, + AllMessages::RuntimeApi(RuntimeApiMessage::Request( + _, + RuntimeApiRequest::NodeFeatures(1, tx), + )) => { + tx.send(Ok(NodeFeatures::EMPTY)).unwrap(); + }, msg => { assert!(false, "Unexpected message: {:?}", msg); }, @@ -2662,32 +2677,7 @@ fn validator_ignores_statements_from_disabled_validators() { virtual_overseer.send(FromOrchestra::Communication { msg: statement_3 }).await; - assert_matches!( - virtual_overseer.recv().await, - AllMessages::RuntimeApi( - RuntimeApiMessage::Request(_, RuntimeApiRequest::ValidationCodeByHash(hash, tx)) - ) if hash == validation_code.hash() => { - tx.send(Ok(Some(validation_code.clone()))).unwrap(); - } - ); - - assert_matches!( - virtual_overseer.recv().await, - AllMessages::RuntimeApi( - RuntimeApiMessage::Request(_, RuntimeApiRequest::SessionIndexForChild(tx)) - ) => { - tx.send(Ok(1u32.into())).unwrap(); - } - ); - - assert_matches!( - virtual_overseer.recv().await, - AllMessages::RuntimeApi( - RuntimeApiMessage::Request(_, 
RuntimeApiRequest::SessionExecutorParams(sess_idx, tx)) - ) if sess_idx == 1 => { - tx.send(Ok(Some(ExecutorParams::default()))).unwrap(); - } - ); + assert_validation_requests(&mut virtual_overseer, validation_code.clone()).await; // Sending a `Statement::Seconded` for our assignment will start // validation process. The first thing requested is the PoV. diff --git a/polkadot/node/core/backing/src/tests/prospective_parachains.rs b/polkadot/node/core/backing/src/tests/prospective_parachains.rs index c93cf21ef7d8e..5ef3a3b15285c 100644 --- a/polkadot/node/core/backing/src/tests/prospective_parachains.rs +++ b/polkadot/node/core/backing/src/tests/prospective_parachains.rs @@ -1435,7 +1435,13 @@ fn concurrent_dependent_candidates() { )) => { tx.send(Ok(test_state.validator_groups.clone())).unwrap(); }, - + AllMessages::RuntimeApi(RuntimeApiMessage::Request( + _, + RuntimeApiRequest::NodeFeatures(sess_idx, tx), + )) => { + assert_eq!(sess_idx, 1); + tx.send(Ok(NodeFeatures::EMPTY)).unwrap(); + }, AllMessages::RuntimeApi(RuntimeApiMessage::Request( _parent, RuntimeApiRequest::AvailabilityCores(tx), diff --git a/polkadot/node/core/bitfield-signing/src/lib.rs b/polkadot/node/core/bitfield-signing/src/lib.rs index 89851c4a033b5..e3effb7949eae 100644 --- a/polkadot/node/core/bitfield-signing/src/lib.rs +++ b/polkadot/node/core/bitfield-signing/src/lib.rs @@ -27,15 +27,14 @@ use futures::{ FutureExt, }; use polkadot_node_subsystem::{ - errors::RuntimeApiError, jaeger, - messages::{ - AvailabilityStoreMessage, BitfieldDistributionMessage, RuntimeApiMessage, RuntimeApiRequest, - }, + messages::{AvailabilityStoreMessage, BitfieldDistributionMessage}, overseer, ActivatedLeaf, FromOrchestra, OverseerSignal, PerLeafSpan, SpawnedSubsystem, - SubsystemError, SubsystemResult, SubsystemSender, + SubsystemError, SubsystemResult, +}; +use polkadot_node_subsystem_util::{ + self as util, request_availability_cores, runtime::recv_runtime, Validator, }; -use 
polkadot_node_subsystem_util::{self as util, Validator}; use polkadot_primitives::{AvailabilityBitfield, CoreState, Hash, ValidatorIndex}; use sp_keystore::{Error as KeystoreError, KeystorePtr}; use std::{collections::HashMap, time::Duration}; @@ -69,7 +68,7 @@ pub enum Error { MpscSend(#[from] mpsc::SendError), #[error(transparent)] - Runtime(#[from] RuntimeApiError), + Runtime(#[from] util::runtime::Error), #[error("Keystore failed: {0:?}")] Keystore(KeystoreError), @@ -79,8 +78,8 @@ pub enum Error { /// for whether we have the availability chunk for our validator index. async fn get_core_availability( core: &CoreState, - validator_idx: ValidatorIndex, - sender: &Mutex<&mut impl SubsystemSender>, + validator_index: ValidatorIndex, + sender: &Mutex<&mut impl overseer::BitfieldSigningSenderTrait>, span: &jaeger::Span, ) -> Result { if let CoreState::Occupied(core) = core { @@ -90,14 +89,11 @@ async fn get_core_availability( sender .lock() .await - .send_message( - AvailabilityStoreMessage::QueryChunkAvailability( - core.candidate_hash, - validator_idx, - tx, - ) - .into(), - ) + .send_message(AvailabilityStoreMessage::QueryChunkAvailability( + core.candidate_hash, + validator_index, + tx, + )) .await; let res = rx.await.map_err(Into::into); @@ -116,25 +112,6 @@ async fn get_core_availability( } } -/// delegates to the v1 runtime API -async fn get_availability_cores( - relay_parent: Hash, - sender: &mut impl SubsystemSender, -) -> Result, Error> { - let (tx, rx) = oneshot::channel(); - sender - .send_message( - RuntimeApiMessage::Request(relay_parent, RuntimeApiRequest::AvailabilityCores(tx)) - .into(), - ) - .await; - match rx.await { - Ok(Ok(out)) => Ok(out), - Ok(Err(runtime_err)) => Err(runtime_err.into()), - Err(err) => Err(err.into()), - } -} - /// - get the list of core states from the runtime /// - for each core, concurrently determine chunk availability (see `get_core_availability`) /// - return the bitfield if there were no errors at any point in this 
process (otherwise, it's @@ -143,12 +120,12 @@ async fn construct_availability_bitfield( relay_parent: Hash, span: &jaeger::Span, validator_idx: ValidatorIndex, - sender: &mut impl SubsystemSender, + sender: &mut impl overseer::BitfieldSigningSenderTrait, ) -> Result { // get the set of availability cores from the runtime let availability_cores = { let _span = span.child("get-availability-cores"); - get_availability_cores(relay_parent, sender).await? + recv_runtime(request_availability_cores(relay_parent, sender).await).await? }; // Wrap the sender in a Mutex to share it between the futures. diff --git a/polkadot/node/core/bitfield-signing/src/tests.rs b/polkadot/node/core/bitfield-signing/src/tests.rs index 106ecc06b1569..0e61e6086d285 100644 --- a/polkadot/node/core/bitfield-signing/src/tests.rs +++ b/polkadot/node/core/bitfield-signing/src/tests.rs @@ -16,7 +16,7 @@ use super::*; use futures::{executor::block_on, pin_mut, StreamExt}; -use polkadot_node_subsystem::messages::AllMessages; +use polkadot_node_subsystem::messages::{AllMessages, RuntimeApiMessage, RuntimeApiRequest}; use polkadot_primitives::{CandidateHash, OccupiedCore}; use test_helpers::dummy_candidate_descriptor; @@ -64,7 +64,7 @@ fn construct_availability_bitfield_works() { AllMessages::AvailabilityStore( AvailabilityStoreMessage::QueryChunkAvailability(c_hash, vidx, tx), ) => { - assert_eq!(validator_index, vidx); + assert_eq!(validator_index, vidx.into()); tx.send(c_hash == hash_a).unwrap(); }, diff --git a/polkadot/node/core/dispute-coordinator/src/participation/mod.rs b/polkadot/node/core/dispute-coordinator/src/participation/mod.rs index 05ea7323af141..b58ce570f8fff 100644 --- a/polkadot/node/core/dispute-coordinator/src/participation/mod.rs +++ b/polkadot/node/core/dispute-coordinator/src/participation/mod.rs @@ -305,6 +305,7 @@ async fn participate( req.candidate_receipt().clone(), req.session(), None, + None, recover_available_data_tx, )) .await; diff --git 
a/polkadot/node/core/dispute-coordinator/src/participation/tests.rs b/polkadot/node/core/dispute-coordinator/src/participation/tests.rs index 367454115f0be..1316508e84cf8 100644 --- a/polkadot/node/core/dispute-coordinator/src/participation/tests.rs +++ b/polkadot/node/core/dispute-coordinator/src/participation/tests.rs @@ -132,7 +132,7 @@ pub async fn participation_missing_availability(ctx_handle: &mut VirtualOverseer assert_matches!( ctx_handle.recv().await, AllMessages::AvailabilityRecovery( - AvailabilityRecoveryMessage::RecoverAvailableData(_, _, _, tx) + AvailabilityRecoveryMessage::RecoverAvailableData(_, _, _, _, tx) ) => { tx.send(Err(RecoveryError::Unavailable)).unwrap(); }, @@ -151,7 +151,7 @@ async fn recover_available_data(virtual_overseer: &mut VirtualOverseer) { assert_matches!( virtual_overseer.recv().await, AllMessages::AvailabilityRecovery( - AvailabilityRecoveryMessage::RecoverAvailableData(_, _, _, tx) + AvailabilityRecoveryMessage::RecoverAvailableData(_, _, _, _, tx) ) => { tx.send(Ok(available_data)).unwrap(); }, @@ -195,7 +195,7 @@ fn same_req_wont_get_queued_if_participation_is_already_running() { assert_matches!( ctx_handle.recv().await, AllMessages::AvailabilityRecovery( - AvailabilityRecoveryMessage::RecoverAvailableData(_, _, _, tx) + AvailabilityRecoveryMessage::RecoverAvailableData(_, _, _, _, tx) ) => { tx.send(Err(RecoveryError::Unavailable)).unwrap(); }, @@ -260,7 +260,7 @@ fn reqs_get_queued_when_out_of_capacity() { { match ctx_handle.recv().await { AllMessages::AvailabilityRecovery( - AvailabilityRecoveryMessage::RecoverAvailableData(_, _, _, tx), + AvailabilityRecoveryMessage::RecoverAvailableData(_, _, _, _, tx), ) => { tx.send(Err(RecoveryError::Unavailable)).unwrap(); recover_available_data_msg_count += 1; @@ -346,7 +346,7 @@ fn cannot_participate_if_cannot_recover_available_data() { assert_matches!( ctx_handle.recv().await, AllMessages::AvailabilityRecovery( - AvailabilityRecoveryMessage::RecoverAvailableData(_, _, _, tx) + 
AvailabilityRecoveryMessage::RecoverAvailableData(_, _, _, _, tx) ) => { tx.send(Err(RecoveryError::Unavailable)).unwrap(); }, @@ -412,7 +412,7 @@ fn cast_invalid_vote_if_available_data_is_invalid() { assert_matches!( ctx_handle.recv().await, AllMessages::AvailabilityRecovery( - AvailabilityRecoveryMessage::RecoverAvailableData(_, _, _, tx) + AvailabilityRecoveryMessage::RecoverAvailableData(_, _, _, _, tx) ) => { tx.send(Err(RecoveryError::Invalid)).unwrap(); }, diff --git a/polkadot/node/jaeger/src/spans.rs b/polkadot/node/jaeger/src/spans.rs index 68fa57e2ca14f..fcee8be9a50f5 100644 --- a/polkadot/node/jaeger/src/spans.rs +++ b/polkadot/node/jaeger/src/spans.rs @@ -85,7 +85,9 @@ use parity_scale_codec::Encode; use polkadot_node_primitives::PoV; -use polkadot_primitives::{BlakeTwo256, CandidateHash, Hash, HashT, Id as ParaId, ValidatorIndex}; +use polkadot_primitives::{ + BlakeTwo256, CandidateHash, ChunkIndex, Hash, HashT, Id as ParaId, ValidatorIndex, +}; use sc_network_types::PeerId; use std::{fmt, sync::Arc}; @@ -338,8 +340,8 @@ impl Span { } #[inline(always)] - pub fn with_chunk_index(self, chunk_index: u32) -> Self { - self.with_string_tag("chunk-index", chunk_index) + pub fn with_chunk_index(self, chunk_index: ChunkIndex) -> Self { + self.with_string_tag("chunk-index", &chunk_index.0) } #[inline(always)] diff --git a/polkadot/node/network/availability-distribution/Cargo.toml b/polkadot/node/network/availability-distribution/Cargo.toml index 39e2985a88cfa..01b208421d793 100644 --- a/polkadot/node/network/availability-distribution/Cargo.toml +++ b/polkadot/node/network/availability-distribution/Cargo.toml @@ -19,6 +19,7 @@ polkadot-node-network-protocol = { path = "../protocol" } polkadot-node-subsystem = { path = "../../subsystem" } polkadot-node-subsystem-util = { path = "../../subsystem-util" } polkadot-node-primitives = { path = "../../primitives" } +sc-network = { path = "../../../../substrate/client/network" } sp-core = { path = 
"../../../../substrate/primitives/core", features = ["std"] } sp-keystore = { path = "../../../../substrate/primitives/keystore" } thiserror = { workspace = true } @@ -36,6 +37,7 @@ sc-network = { path = "../../../../substrate/client/network" } futures-timer = "3.0.2" assert_matches = "1.4.0" polkadot-primitives-test-helpers = { path = "../../../primitives/test-helpers" } +rstest = "0.18.2" polkadot-subsystem-bench = { path = "../../subsystem-bench" } diff --git a/polkadot/node/network/availability-distribution/src/error.rs b/polkadot/node/network/availability-distribution/src/error.rs index c547a1abbc276..72a809dd11408 100644 --- a/polkadot/node/network/availability-distribution/src/error.rs +++ b/polkadot/node/network/availability-distribution/src/error.rs @@ -49,7 +49,7 @@ pub enum Error { #[fatal] #[error("Oneshot for receiving response from Chain API got cancelled")] - ChainApiSenderDropped(#[source] oneshot::Canceled), + ChainApiSenderDropped(#[from] oneshot::Canceled), #[fatal] #[error("Retrieving response from Chain API unexpectedly failed with error: {0}")] @@ -82,6 +82,9 @@ pub enum Error { #[error("Given validator index could not be found in current session")] InvalidValidatorIndex, + + #[error("Erasure coding error: {0}")] + ErasureCoding(#[from] polkadot_erasure_coding::Error), } /// General result abbreviation type alias. @@ -104,7 +107,8 @@ pub fn log_error( JfyiError::InvalidValidatorIndex | JfyiError::NoSuchCachedSession { .. 
} | JfyiError::QueryAvailableDataResponseChannel(_) | - JfyiError::QueryChunkResponseChannel(_) => gum::warn!(target: LOG_TARGET, error = %jfyi, ctx), + JfyiError::QueryChunkResponseChannel(_) | + JfyiError::ErasureCoding(_) => gum::warn!(target: LOG_TARGET, error = %jfyi, ctx), JfyiError::FetchPoV(_) | JfyiError::SendResponse | JfyiError::NoSuchPoV | diff --git a/polkadot/node/network/availability-distribution/src/lib.rs b/polkadot/node/network/availability-distribution/src/lib.rs index c62ce1dd981a9..ec2c01f99b018 100644 --- a/polkadot/node/network/availability-distribution/src/lib.rs +++ b/polkadot/node/network/availability-distribution/src/lib.rs @@ -18,7 +18,9 @@ use futures::{future::Either, FutureExt, StreamExt, TryFutureExt}; use sp_keystore::KeystorePtr; -use polkadot_node_network_protocol::request_response::{v1, IncomingRequestReceiver}; +use polkadot_node_network_protocol::request_response::{ + v1, v2, IncomingRequestReceiver, ReqProtocolNames, +}; use polkadot_node_subsystem::{ jaeger, messages::AvailabilityDistributionMessage, overseer, FromOrchestra, OverseerSignal, SpawnedSubsystem, SubsystemError, @@ -41,7 +43,7 @@ mod pov_requester; /// Responding to erasure chunk requests: mod responder; -use responder::{run_chunk_receiver, run_pov_receiver}; +use responder::{run_chunk_receivers, run_pov_receiver}; mod metrics; /// Prometheus `Metrics` for availability distribution. @@ -58,6 +60,8 @@ pub struct AvailabilityDistributionSubsystem { runtime: RuntimeInfo, /// Receivers to receive messages from. recvs: IncomingRequestReceivers, + /// Mapping of the req-response protocols to the full protocol names. + req_protocol_names: ReqProtocolNames, /// Prometheus metrics. metrics: Metrics, } @@ -66,8 +70,10 @@ pub struct AvailabilityDistributionSubsystem { pub struct IncomingRequestReceivers { /// Receiver for incoming PoV requests. pub pov_req_receiver: IncomingRequestReceiver, - /// Receiver for incoming availability chunk requests. 
- pub chunk_req_receiver: IncomingRequestReceiver, + /// Receiver for incoming v1 availability chunk requests. + pub chunk_req_v1_receiver: IncomingRequestReceiver, + /// Receiver for incoming v2 availability chunk requests. + pub chunk_req_v2_receiver: IncomingRequestReceiver, } #[overseer::subsystem(AvailabilityDistribution, error=SubsystemError, prefix=self::overseer)] @@ -85,18 +91,27 @@ impl AvailabilityDistributionSubsystem { #[overseer::contextbounds(AvailabilityDistribution, prefix = self::overseer)] impl AvailabilityDistributionSubsystem { /// Create a new instance of the availability distribution. - pub fn new(keystore: KeystorePtr, recvs: IncomingRequestReceivers, metrics: Metrics) -> Self { + pub fn new( + keystore: KeystorePtr, + recvs: IncomingRequestReceivers, + req_protocol_names: ReqProtocolNames, + metrics: Metrics, + ) -> Self { let runtime = RuntimeInfo::new(Some(keystore)); - Self { runtime, recvs, metrics } + Self { runtime, recvs, req_protocol_names, metrics } } /// Start processing work as passed on from the Overseer. 
async fn run(self, mut ctx: Context) -> std::result::Result<(), FatalError> { - let Self { mut runtime, recvs, metrics } = self; + let Self { mut runtime, recvs, metrics, req_protocol_names } = self; let mut spans: HashMap = HashMap::new(); - let IncomingRequestReceivers { pov_req_receiver, chunk_req_receiver } = recvs; - let mut requester = Requester::new(metrics.clone()).fuse(); + let IncomingRequestReceivers { + pov_req_receiver, + chunk_req_v1_receiver, + chunk_req_v2_receiver, + } = recvs; + let mut requester = Requester::new(req_protocol_names, metrics.clone()).fuse(); let mut warn_freq = gum::Freq::new(); { @@ -109,7 +124,13 @@ impl AvailabilityDistributionSubsystem { ctx.spawn( "chunk-receiver", - run_chunk_receiver(sender, chunk_req_receiver, metrics.clone()).boxed(), + run_chunk_receivers( + sender, + chunk_req_v1_receiver, + chunk_req_v2_receiver, + metrics.clone(), + ) + .boxed(), ) .map_err(FatalError::SpawnTask)?; } diff --git a/polkadot/node/network/availability-distribution/src/requester/fetch_task/mod.rs b/polkadot/node/network/availability-distribution/src/requester/fetch_task/mod.rs index f478defcaa965..7bd36709bc5f3 100644 --- a/polkadot/node/network/availability-distribution/src/requester/fetch_task/mod.rs +++ b/polkadot/node/network/availability-distribution/src/requester/fetch_task/mod.rs @@ -22,10 +22,12 @@ use futures::{ FutureExt, SinkExt, }; +use parity_scale_codec::Decode; use polkadot_erasure_coding::branch_hash; use polkadot_node_network_protocol::request_response::{ outgoing::{OutgoingRequest, Recipient, RequestError, Requests}, - v1::{ChunkFetchingRequest, ChunkFetchingResponse}, + v1::{self, ChunkResponse}, + v2, }; use polkadot_node_primitives::ErasureChunk; use polkadot_node_subsystem::{ @@ -34,9 +36,10 @@ use polkadot_node_subsystem::{ overseer, }; use polkadot_primitives::{ - AuthorityDiscoveryId, BlakeTwo256, CandidateHash, GroupIndex, Hash, HashT, OccupiedCore, - SessionIndex, + AuthorityDiscoveryId, BlakeTwo256, 
CandidateHash, ChunkIndex, GroupIndex, Hash, HashT, + OccupiedCore, SessionIndex, }; +use sc_network::ProtocolName; use crate::{ error::{FatalError, Result}, @@ -111,8 +114,8 @@ struct RunningTask { /// This vector gets drained during execution of the task (it will be empty afterwards). group: Vec, - /// The request to send. - request: ChunkFetchingRequest, + /// The request to send. We can store it as either v1 or v2, they have the same payload. + request: v2::ChunkFetchingRequest, /// Root hash, for verifying the chunks validity. erasure_root: Hash, @@ -128,6 +131,16 @@ struct RunningTask { /// Span tracking the fetching of this chunk. span: jaeger::Span, + + /// Expected chunk index. We'll validate that the remote did send us the correct chunk (only + /// important for v2 requests). + chunk_index: ChunkIndex, + + /// Full protocol name for ChunkFetchingV1. + req_v1_protocol_name: ProtocolName, + + /// Full protocol name for ChunkFetchingV2. + req_v2_protocol_name: ProtocolName, } impl FetchTaskConfig { @@ -140,13 +153,17 @@ impl FetchTaskConfig { sender: mpsc::Sender, metrics: Metrics, session_info: &SessionInfo, + chunk_index: ChunkIndex, span: jaeger::Span, + req_v1_protocol_name: ProtocolName, + req_v2_protocol_name: ProtocolName, ) -> Self { let span = span .child("fetch-task-config") .with_trace_id(core.candidate_hash) .with_string_tag("leaf", format!("{:?}", leaf)) .with_validator_index(session_info.our_index) + .with_chunk_index(chunk_index) .with_uint_tag("group-index", core.group_responsible.0 as u64) .with_relay_parent(core.candidate_descriptor.relay_parent) .with_string_tag("pov-hash", format!("{:?}", core.candidate_descriptor.pov_hash)) @@ -165,7 +182,7 @@ impl FetchTaskConfig { group: session_info.validator_groups.get(core.group_responsible.0 as usize) .expect("The responsible group of a candidate should be available in the corresponding session. 
qed.") .clone(), - request: ChunkFetchingRequest { + request: v2::ChunkFetchingRequest { candidate_hash: core.candidate_hash, index: session_info.our_index, }, @@ -174,6 +191,9 @@ impl FetchTaskConfig { metrics, sender, span, + chunk_index, + req_v1_protocol_name, + req_v2_protocol_name }; FetchTaskConfig { live_in, prepared_running: Some(prepared_running) } } @@ -271,7 +291,8 @@ impl RunningTask { count += 1; let _chunk_fetch_span = span .child("fetch-chunk-request") - .with_chunk_index(self.request.index.0) + .with_validator_index(self.request.index) + .with_chunk_index(self.chunk_index) .with_stage(jaeger::Stage::AvailabilityDistribution); // Send request: let resp = match self @@ -296,11 +317,12 @@ impl RunningTask { drop(_chunk_fetch_span); let _chunk_recombine_span = span .child("recombine-chunk") - .with_chunk_index(self.request.index.0) + .with_validator_index(self.request.index) + .with_chunk_index(self.chunk_index) .with_stage(jaeger::Stage::AvailabilityDistribution); let chunk = match resp { - ChunkFetchingResponse::Chunk(resp) => resp.recombine_into_chunk(&self.request), - ChunkFetchingResponse::NoSuchChunk => { + Some(chunk) => chunk, + None => { gum::debug!( target: LOG_TARGET, validator = ?validator, @@ -320,11 +342,12 @@ impl RunningTask { drop(_chunk_recombine_span); let _chunk_validate_and_store_span = span .child("validate-and-store-chunk") - .with_chunk_index(self.request.index.0) + .with_validator_index(self.request.index) + .with_chunk_index(self.chunk_index) .with_stage(jaeger::Stage::AvailabilityDistribution); // Data genuine? 
- if !self.validate_chunk(&validator, &chunk) { + if !self.validate_chunk(&validator, &chunk, self.chunk_index) { bad_validators.push(validator); continue } @@ -350,7 +373,7 @@ impl RunningTask { validator: &AuthorityDiscoveryId, network_error_freq: &mut gum::Freq, canceled_freq: &mut gum::Freq, - ) -> std::result::Result { + ) -> std::result::Result, TaskError> { gum::trace!( target: LOG_TARGET, origin = ?validator, @@ -362,9 +385,13 @@ impl RunningTask { "Starting chunk request", ); - let (full_request, response_recv) = - OutgoingRequest::new(Recipient::Authority(validator.clone()), self.request); - let requests = Requests::ChunkFetchingV1(full_request); + let (full_request, response_recv) = OutgoingRequest::new_with_fallback( + Recipient::Authority(validator.clone()), + self.request, + // Fallback to v1, for backwards compatibility. + v1::ChunkFetchingRequest::from(self.request), + ); + let requests = Requests::ChunkFetching(full_request); self.sender .send(FromFetchTask::Message( @@ -378,7 +405,58 @@ impl RunningTask { .map_err(|_| TaskError::ShuttingDown)?; match response_recv.await { - Ok(resp) => Ok(resp), + Ok((bytes, protocol)) => match protocol { + _ if protocol == self.req_v2_protocol_name => + match v2::ChunkFetchingResponse::decode(&mut &bytes[..]) { + Ok(chunk_response) => Ok(Option::::from(chunk_response)), + Err(e) => { + gum::warn!( + target: LOG_TARGET, + origin = ?validator, + relay_parent = ?self.relay_parent, + group_index = ?self.group_index, + session_index = ?self.session_index, + chunk_index = ?self.request.index, + candidate_hash = ?self.request.candidate_hash, + err = ?e, + "Peer sent us invalid erasure chunk data (v2)" + ); + Err(TaskError::PeerError) + }, + }, + _ if protocol == self.req_v1_protocol_name => + match v1::ChunkFetchingResponse::decode(&mut &bytes[..]) { + Ok(chunk_response) => Ok(Option::::from(chunk_response) + .map(|c| c.recombine_into_chunk(&self.request.into()))), + Err(e) => { + gum::warn!( + target: LOG_TARGET, + 
origin = ?validator, + relay_parent = ?self.relay_parent, + group_index = ?self.group_index, + session_index = ?self.session_index, + chunk_index = ?self.request.index, + candidate_hash = ?self.request.candidate_hash, + err = ?e, + "Peer sent us invalid erasure chunk data" + ); + Err(TaskError::PeerError) + }, + }, + _ => { + gum::warn!( + target: LOG_TARGET, + origin = ?validator, + relay_parent = ?self.relay_parent, + group_index = ?self.group_index, + session_index = ?self.session_index, + chunk_index = ?self.request.index, + candidate_hash = ?self.request.candidate_hash, + "Peer sent us invalid erasure chunk data - unknown protocol" + ); + Err(TaskError::PeerError) + }, + }, Err(RequestError::InvalidResponse(err)) => { gum::warn!( target: LOG_TARGET, @@ -427,7 +505,23 @@ impl RunningTask { } } - fn validate_chunk(&self, validator: &AuthorityDiscoveryId, chunk: &ErasureChunk) -> bool { + fn validate_chunk( + &self, + validator: &AuthorityDiscoveryId, + chunk: &ErasureChunk, + expected_chunk_index: ChunkIndex, + ) -> bool { + if chunk.index != expected_chunk_index { + gum::warn!( + target: LOG_TARGET, + candidate_hash = ?self.request.candidate_hash, + origin = ?validator, + chunk_index = ?chunk.index, + expected_chunk_index = ?expected_chunk_index, + "Validator sent the wrong chunk", + ); + return false + } let anticipated_hash = match branch_hash(&self.erasure_root, chunk.proof(), chunk.index.0 as usize) { Ok(hash) => hash, @@ -459,6 +553,7 @@ impl RunningTask { AvailabilityStoreMessage::StoreChunk { candidate_hash: self.request.candidate_hash, chunk, + validator_index: self.request.index, tx, } .into(), diff --git a/polkadot/node/network/availability-distribution/src/requester/fetch_task/tests.rs b/polkadot/node/network/availability-distribution/src/requester/fetch_task/tests.rs index a5a81082e39ad..25fae37f725aa 100644 --- a/polkadot/node/network/availability-distribution/src/requester/fetch_task/tests.rs +++ 
b/polkadot/node/network/availability-distribution/src/requester/fetch_task/tests.rs @@ -24,21 +24,26 @@ use futures::{ task::{noop_waker, Context, Poll}, Future, FutureExt, StreamExt, }; +use rstest::rstest; use sc_network::{self as network, ProtocolName}; use sp_keyring::Sr25519Keyring; -use polkadot_node_network_protocol::request_response::{v1, Recipient}; +use polkadot_node_network_protocol::request_response::{ + v1::{self, ChunkResponse}, + Protocol, Recipient, ReqProtocolNames, +}; use polkadot_node_primitives::{BlockData, PoV, Proof}; use polkadot_node_subsystem::messages::AllMessages; -use polkadot_primitives::{CandidateHash, ValidatorIndex}; +use polkadot_primitives::{CandidateHash, ChunkIndex, ValidatorIndex}; use super::*; use crate::{metrics::Metrics, tests::mock::get_valid_chunk_data}; #[test] fn task_can_be_canceled() { - let (task, _rx) = get_test_running_task(); + let req_protocol_names = ReqProtocolNames::new(&Hash::repeat_byte(0xff), None); + let (task, _rx) = get_test_running_task(&req_protocol_names, 0.into(), 0.into()); let (handle, kill) = oneshot::channel(); std::mem::drop(handle); let running_task = task.run(kill); @@ -49,96 +54,130 @@ fn task_can_be_canceled() { } /// Make sure task won't accept a chunk that has is invalid. 
-#[test] -fn task_does_not_accept_invalid_chunk() { - let (mut task, rx) = get_test_running_task(); +#[rstest] +#[case(Protocol::ChunkFetchingV1)] +#[case(Protocol::ChunkFetchingV2)] +fn task_does_not_accept_invalid_chunk(#[case] protocol: Protocol) { + let req_protocol_names = ReqProtocolNames::new(&Hash::repeat_byte(0xff), None); + let chunk_index = ChunkIndex(1); + let validator_index = ValidatorIndex(0); + let (mut task, rx) = get_test_running_task(&req_protocol_names, validator_index, chunk_index); let validators = vec![Sr25519Keyring::Alice.public().into()]; task.group = validators; + let protocol_name = req_protocol_names.get_name(protocol); let test = TestRun { chunk_responses: { - let mut m = HashMap::new(); - m.insert( + [( Recipient::Authority(Sr25519Keyring::Alice.public().into()), - ChunkFetchingResponse::Chunk(v1::ChunkResponse { - chunk: vec![1, 2, 3], - proof: Proof::try_from(vec![vec![9, 8, 2], vec![2, 3, 4]]).unwrap(), - }), - ); - m + get_response( + protocol, + protocol_name.clone(), + Some(( + vec![1, 2, 3], + Proof::try_from(vec![vec![9, 8, 2], vec![2, 3, 4]]).unwrap(), + chunk_index, + )), + ), + )] + .into_iter() + .collect() }, valid_chunks: HashSet::new(), + req_protocol_names, }; test.run(task, rx); } -#[test] -fn task_stores_valid_chunk() { - let (mut task, rx) = get_test_running_task(); +#[rstest] +#[case(Protocol::ChunkFetchingV1)] +#[case(Protocol::ChunkFetchingV2)] +fn task_stores_valid_chunk(#[case] protocol: Protocol) { + let req_protocol_names = ReqProtocolNames::new(&Hash::repeat_byte(0xff), None); + // In order for protocol version 1 to work, the chunk index needs to be equal to the validator + // index. 
+ let chunk_index = ChunkIndex(0); + let validator_index = + if protocol == Protocol::ChunkFetchingV1 { ValidatorIndex(0) } else { ValidatorIndex(1) }; + let (mut task, rx) = get_test_running_task(&req_protocol_names, validator_index, chunk_index); + let validators = vec![Sr25519Keyring::Alice.public().into()]; let pov = PoV { block_data: BlockData(vec![45, 46, 47]) }; - let (root_hash, chunk) = get_valid_chunk_data(pov); + let (root_hash, chunk) = get_valid_chunk_data(pov, 10, chunk_index); task.erasure_root = root_hash; - task.request.index = chunk.index; - - let validators = vec![Sr25519Keyring::Alice.public().into()]; task.group = validators; + let protocol_name = req_protocol_names.get_name(protocol); let test = TestRun { chunk_responses: { - let mut m = HashMap::new(); - m.insert( + [( Recipient::Authority(Sr25519Keyring::Alice.public().into()), - ChunkFetchingResponse::Chunk(v1::ChunkResponse { - chunk: chunk.chunk.clone(), - proof: chunk.proof, - }), - ); - m - }, - valid_chunks: { - let mut s = HashSet::new(); - s.insert(chunk.chunk); - s + get_response( + protocol, + protocol_name.clone(), + Some((chunk.chunk.clone(), chunk.proof, chunk_index)), + ), + )] + .into_iter() + .collect() }, + valid_chunks: [(chunk.chunk)].into_iter().collect(), + req_protocol_names, }; test.run(task, rx); } -#[test] -fn task_does_not_accept_wrongly_indexed_chunk() { - let (mut task, rx) = get_test_running_task(); - let pov = PoV { block_data: BlockData(vec![45, 46, 47]) }; - let (root_hash, chunk) = get_valid_chunk_data(pov); - task.erasure_root = root_hash; - task.request.index = ValidatorIndex(chunk.index.0 + 1); +#[rstest] +#[case(Protocol::ChunkFetchingV1)] +#[case(Protocol::ChunkFetchingV2)] +fn task_does_not_accept_wrongly_indexed_chunk(#[case] protocol: Protocol) { + let req_protocol_names = ReqProtocolNames::new(&Hash::repeat_byte(0xff), None); + // In order for protocol version 1 to work, the chunk index needs to be equal to the validator + // index. 
+ let chunk_index = ChunkIndex(0); + let validator_index = + if protocol == Protocol::ChunkFetchingV1 { ValidatorIndex(0) } else { ValidatorIndex(1) }; + let (mut task, rx) = get_test_running_task(&req_protocol_names, validator_index, chunk_index); let validators = vec![Sr25519Keyring::Alice.public().into()]; + let pov = PoV { block_data: BlockData(vec![45, 46, 47]) }; + let (_, other_chunk) = get_valid_chunk_data(pov.clone(), 10, ChunkIndex(3)); + let (root_hash, chunk) = get_valid_chunk_data(pov, 10, ChunkIndex(0)); + task.erasure_root = root_hash; + task.request.index = chunk.index.into(); task.group = validators; + let protocol_name = req_protocol_names.get_name(protocol); let test = TestRun { chunk_responses: { - let mut m = HashMap::new(); - m.insert( + [( Recipient::Authority(Sr25519Keyring::Alice.public().into()), - ChunkFetchingResponse::Chunk(v1::ChunkResponse { - chunk: chunk.chunk.clone(), - proof: chunk.proof, - }), - ); - m + get_response( + protocol, + protocol_name.clone(), + Some((other_chunk.chunk.clone(), chunk.proof, other_chunk.index)), + ), + )] + .into_iter() + .collect() }, valid_chunks: HashSet::new(), + req_protocol_names, }; test.run(task, rx); } /// Task stores chunk, if there is at least one validator having a valid chunk. -#[test] -fn task_stores_valid_chunk_if_there_is_one() { - let (mut task, rx) = get_test_running_task(); +#[rstest] +#[case(Protocol::ChunkFetchingV1)] +#[case(Protocol::ChunkFetchingV2)] +fn task_stores_valid_chunk_if_there_is_one(#[case] protocol: Protocol) { + let req_protocol_names = ReqProtocolNames::new(&Hash::repeat_byte(0xff), None); + // In order for protocol version 1 to work, the chunk index needs to be equal to the validator + // index. 
+ let chunk_index = ChunkIndex(1); + let validator_index = + if protocol == Protocol::ChunkFetchingV1 { ValidatorIndex(1) } else { ValidatorIndex(2) }; + let (mut task, rx) = get_test_running_task(&req_protocol_names, validator_index, chunk_index); let pov = PoV { block_data: BlockData(vec![45, 46, 47]) }; - let (root_hash, chunk) = get_valid_chunk_data(pov); - task.erasure_root = root_hash; - task.request.index = chunk.index; let validators = [ // Only Alice has valid chunk - should succeed, even though she is tried last. @@ -151,37 +190,45 @@ fn task_stores_valid_chunk_if_there_is_one() { .iter() .map(|v| v.public().into()) .collect::>(); + + let (root_hash, chunk) = get_valid_chunk_data(pov, 10, chunk_index); + task.erasure_root = root_hash; task.group = validators; + let protocol_name = req_protocol_names.get_name(protocol); let test = TestRun { chunk_responses: { - let mut m = HashMap::new(); - m.insert( - Recipient::Authority(Sr25519Keyring::Alice.public().into()), - ChunkFetchingResponse::Chunk(v1::ChunkResponse { - chunk: chunk.chunk.clone(), - proof: chunk.proof, - }), - ); - m.insert( - Recipient::Authority(Sr25519Keyring::Bob.public().into()), - ChunkFetchingResponse::NoSuchChunk, - ); - m.insert( - Recipient::Authority(Sr25519Keyring::Charlie.public().into()), - ChunkFetchingResponse::Chunk(v1::ChunkResponse { - chunk: vec![1, 2, 3], - proof: Proof::try_from(vec![vec![9, 8, 2], vec![2, 3, 4]]).unwrap(), - }), - ); - - m - }, - valid_chunks: { - let mut s = HashSet::new(); - s.insert(chunk.chunk); - s + [ + ( + Recipient::Authority(Sr25519Keyring::Alice.public().into()), + get_response( + protocol, + protocol_name.clone(), + Some((chunk.chunk.clone(), chunk.proof, chunk_index)), + ), + ), + ( + Recipient::Authority(Sr25519Keyring::Bob.public().into()), + get_response(protocol, protocol_name.clone(), None), + ), + ( + Recipient::Authority(Sr25519Keyring::Charlie.public().into()), + get_response( + protocol, + protocol_name.clone(), + Some(( + vec![1, 2, 
3], + Proof::try_from(vec![vec![9, 8, 2], vec![2, 3, 4]]).unwrap(), + chunk_index, + )), + ), + ), + ] + .into_iter() + .collect() }, + valid_chunks: [(chunk.chunk)].into_iter().collect(), + req_protocol_names, }; test.run(task, rx); } @@ -189,14 +236,16 @@ fn task_stores_valid_chunk_if_there_is_one() { struct TestRun { /// Response to deliver for a given validator index. /// None means, answer with `NetworkError`. - chunk_responses: HashMap, + chunk_responses: HashMap, ProtocolName)>, /// Set of chunks that should be considered valid: valid_chunks: HashSet>, + /// Request protocol names + req_protocol_names: ReqProtocolNames, } impl TestRun { fn run(self, task: RunningTask, rx: mpsc::Receiver) { - sp_tracing::try_init_simple(); + sp_tracing::init_for_tests(); let mut rx = rx.fuse(); let task = task.run_inner().fuse(); futures::pin_mut!(task); @@ -240,20 +289,41 @@ impl TestRun { let mut valid_responses = 0; for req in reqs { let req = match req { - Requests::ChunkFetchingV1(req) => req, + Requests::ChunkFetching(req) => req, _ => panic!("Unexpected request"), }; let response = self.chunk_responses.get(&req.peer).ok_or(network::RequestFailure::Refused); - if let Ok(ChunkFetchingResponse::Chunk(resp)) = &response { - if self.valid_chunks.contains(&resp.chunk) { - valid_responses += 1; + if let Ok((resp, protocol)) = response { + let chunk = if protocol == + &self.req_protocol_names.get_name(Protocol::ChunkFetchingV1) + { + Into::>::into( + v1::ChunkFetchingResponse::decode(&mut &resp[..]).unwrap(), + ) + .map(|c| c.chunk) + } else if protocol == + &self.req_protocol_names.get_name(Protocol::ChunkFetchingV2) + { + Into::>::into( + v2::ChunkFetchingResponse::decode(&mut &resp[..]).unwrap(), + ) + .map(|c| c.chunk) + } else { + unreachable!() + }; + + if let Some(chunk) = chunk { + if self.valid_chunks.contains(&chunk) { + valid_responses += 1; + } } + + req.pending_response + .send(response.cloned()) + .expect("Sending response should succeed"); } - 
req.pending_response - .send(response.map(|r| (r.encode(), ProtocolName::from("")))) - .expect("Sending response should succeed"); } return (valid_responses == 0) && self.valid_chunks.is_empty() }, @@ -274,8 +344,12 @@ impl TestRun { } } -/// Get a `RunningTask` filled with dummy values. -fn get_test_running_task() -> (RunningTask, mpsc::Receiver) { +/// Get a `RunningTask` filled with (mostly) dummy values. +fn get_test_running_task( + req_protocol_names: &ReqProtocolNames, + validator_index: ValidatorIndex, + chunk_index: ChunkIndex, +) -> (RunningTask, mpsc::Receiver) { let (tx, rx) = mpsc::channel(0); ( @@ -283,16 +357,45 @@ fn get_test_running_task() -> (RunningTask, mpsc::Receiver) { session_index: 0, group_index: GroupIndex(0), group: Vec::new(), - request: ChunkFetchingRequest { + request: v2::ChunkFetchingRequest { candidate_hash: CandidateHash([43u8; 32].into()), - index: ValidatorIndex(0), + index: validator_index, }, erasure_root: Hash::repeat_byte(99), relay_parent: Hash::repeat_byte(71), sender: tx, metrics: Metrics::new_dummy(), span: jaeger::Span::Disabled, + req_v1_protocol_name: req_protocol_names.get_name(Protocol::ChunkFetchingV1), + req_v2_protocol_name: req_protocol_names.get_name(Protocol::ChunkFetchingV2), + chunk_index, }, rx, ) } + +/// Make a versioned ChunkFetchingResponse. 
+fn get_response( + protocol: Protocol, + protocol_name: ProtocolName, + chunk: Option<(Vec, Proof, ChunkIndex)>, +) -> (Vec, ProtocolName) { + ( + match protocol { + Protocol::ChunkFetchingV1 => if let Some((chunk, proof, _)) = chunk { + v1::ChunkFetchingResponse::Chunk(ChunkResponse { chunk, proof }) + } else { + v1::ChunkFetchingResponse::NoSuchChunk + } + .encode(), + Protocol::ChunkFetchingV2 => if let Some((chunk, proof, index)) = chunk { + v2::ChunkFetchingResponse::Chunk(ErasureChunk { chunk, index, proof }) + } else { + v2::ChunkFetchingResponse::NoSuchChunk + } + .encode(), + _ => unreachable!(), + }, + protocol_name, + ) +} diff --git a/polkadot/node/network/availability-distribution/src/requester/mod.rs b/polkadot/node/network/availability-distribution/src/requester/mod.rs index 97e80d696e7ef..efbdceb43bddc 100644 --- a/polkadot/node/network/availability-distribution/src/requester/mod.rs +++ b/polkadot/node/network/availability-distribution/src/requester/mod.rs @@ -18,10 +18,7 @@ //! availability. 
use std::{ - collections::{ - hash_map::{Entry, HashMap}, - hash_set::HashSet, - }, + collections::{hash_map::HashMap, hash_set::HashSet}, iter::IntoIterator, pin::Pin, }; @@ -32,13 +29,17 @@ use futures::{ Stream, }; +use polkadot_node_network_protocol::request_response::{v1, v2, IsRequest, ReqProtocolNames}; use polkadot_node_subsystem::{ jaeger, messages::{ChainApiMessage, RuntimeApiMessage}, overseer, ActivatedLeaf, ActiveLeavesUpdate, }; -use polkadot_node_subsystem_util::runtime::{get_occupied_cores, RuntimeInfo}; -use polkadot_primitives::{CandidateHash, Hash, OccupiedCore, SessionIndex}; +use polkadot_node_subsystem_util::{ + availability_chunks::availability_chunk_index, + runtime::{get_occupied_cores, RuntimeInfo}, +}; +use polkadot_primitives::{CandidateHash, CoreIndex, Hash, OccupiedCore, SessionIndex}; use super::{FatalError, Metrics, Result, LOG_TARGET}; @@ -77,6 +78,9 @@ pub struct Requester { /// Prometheus Metrics metrics: Metrics, + + /// Mapping of the req-response protocols to the full protocol names. + req_protocol_names: ReqProtocolNames, } #[overseer::contextbounds(AvailabilityDistribution, prefix = self::overseer)] @@ -88,9 +92,16 @@ impl Requester { /// /// You must feed it with `ActiveLeavesUpdate` via `update_fetching_heads` and make it progress /// by advancing the stream. - pub fn new(metrics: Metrics) -> Self { + pub fn new(req_protocol_names: ReqProtocolNames, metrics: Metrics) -> Self { let (tx, rx) = mpsc::channel(1); - Requester { fetches: HashMap::new(), session_cache: SessionCache::new(), tx, rx, metrics } + Requester { + fetches: HashMap::new(), + session_cache: SessionCache::new(), + tx, + rx, + metrics, + req_protocol_names, + } } /// Update heads that need availability distribution. 
@@ -197,56 +208,76 @@ impl Requester { runtime: &mut RuntimeInfo, leaf: Hash, leaf_session_index: SessionIndex, - cores: impl IntoIterator, + cores: impl IntoIterator, span: jaeger::Span, ) -> Result<()> { - for core in cores { + for (core_index, core) in cores { let mut span = span .child("check-fetch-candidate") .with_trace_id(core.candidate_hash) .with_string_tag("leaf", format!("{:?}", leaf)) .with_candidate(core.candidate_hash) .with_stage(jaeger::Stage::AvailabilityDistribution); - match self.fetches.entry(core.candidate_hash) { - Entry::Occupied(mut e) => + + if let Some(e) = self.fetches.get_mut(&core.candidate_hash) { // Just book keeping - we are already requesting that chunk: - { - span.add_string_tag("already-requested-chunk", "true"); - e.get_mut().add_leaf(leaf); - }, - Entry::Vacant(e) => { - span.add_string_tag("already-requested-chunk", "false"); - let tx = self.tx.clone(); - let metrics = self.metrics.clone(); - - let task_cfg = self - .session_cache - .with_session_info( - context, - runtime, - // We use leaf here, the relay_parent must be in the same session as - // the leaf. This is guaranteed by runtime which ensures that cores are - // cleared at session boundaries. At the same time, only leaves are - // guaranteed to be fetchable by the state trie. - leaf, - leaf_session_index, - |info| FetchTaskConfig::new(leaf, &core, tx, metrics, info, span), - ) - .await - .map_err(|err| { - gum::warn!( - target: LOG_TARGET, - error = ?err, - "Failed to spawn a fetch task" - ); - err + span.add_string_tag("already-requested-chunk", "true"); + e.add_leaf(leaf); + } else { + span.add_string_tag("already-requested-chunk", "false"); + let tx = self.tx.clone(); + let metrics = self.metrics.clone(); + + let session_info = self + .session_cache + .get_session_info( + context, + runtime, + // We use leaf here, the relay_parent must be in the same session as + // the leaf. 
This is guaranteed by runtime which ensures that cores are + // cleared at session boundaries. At the same time, only leaves are + // guaranteed to be fetchable by the state trie. + leaf, + leaf_session_index, + ) + .await + .map_err(|err| { + gum::warn!( + target: LOG_TARGET, + error = ?err, + "Failed to spawn a fetch task" + ); + err + })?; + + if let Some(session_info) = session_info { + let n_validators = + session_info.validator_groups.iter().fold(0usize, |mut acc, group| { + acc = acc.saturating_add(group.len()); + acc }); - - if let Ok(Some(task_cfg)) = task_cfg { - e.insert(FetchTask::start(task_cfg, context).await?); - } - // Not a validator, nothing to do. - }, + let chunk_index = availability_chunk_index( + session_info.node_features.as_ref(), + n_validators, + core_index, + session_info.our_index, + )?; + + let task_cfg = FetchTaskConfig::new( + leaf, + &core, + tx, + metrics, + session_info, + chunk_index, + span, + self.req_protocol_names.get_name(v1::ChunkFetchingRequest::PROTOCOL), + self.req_protocol_names.get_name(v2::ChunkFetchingRequest::PROTOCOL), + ); + + self.fetches + .insert(core.candidate_hash, FetchTask::start(task_cfg, context).await?); + } } } Ok(()) diff --git a/polkadot/node/network/availability-distribution/src/requester/session_cache.rs b/polkadot/node/network/availability-distribution/src/requester/session_cache.rs index 8a48e19c2827d..a762c262dba3e 100644 --- a/polkadot/node/network/availability-distribution/src/requester/session_cache.rs +++ b/polkadot/node/network/availability-distribution/src/requester/session_cache.rs @@ -20,8 +20,10 @@ use rand::{seq::SliceRandom, thread_rng}; use schnellru::{ByLength, LruMap}; use polkadot_node_subsystem::overseer; -use polkadot_node_subsystem_util::runtime::RuntimeInfo; -use polkadot_primitives::{AuthorityDiscoveryId, GroupIndex, Hash, SessionIndex, ValidatorIndex}; +use polkadot_node_subsystem_util::runtime::{request_node_features, RuntimeInfo}; +use polkadot_primitives::{ + 
AuthorityDiscoveryId, GroupIndex, Hash, NodeFeatures, SessionIndex, ValidatorIndex, +}; use crate::{ error::{Error, Result}, @@ -62,6 +64,9 @@ pub struct SessionInfo { /// /// `None`, if we are not in fact part of any group. pub our_group: Option, + + /// Node features. + pub node_features: Option, } /// Report of bad validators. @@ -87,39 +92,29 @@ impl SessionCache { } } - /// Tries to retrieve `SessionInfo` and calls `with_info` if successful. - /// + /// Tries to retrieve `SessionInfo`. /// If this node is not a validator, the function will return `None`. - /// - /// Use this function over any `fetch_session_info` if all you need is a reference to - /// `SessionInfo`, as it avoids an expensive clone. - pub async fn with_session_info( - &mut self, + pub async fn get_session_info<'a, Context>( + &'a mut self, ctx: &mut Context, runtime: &mut RuntimeInfo, parent: Hash, session_index: SessionIndex, - with_info: F, - ) -> Result> - where - F: FnOnce(&SessionInfo) -> R, - { - if let Some(o_info) = self.session_info_cache.get(&session_index) { - gum::trace!(target: LOG_TARGET, session_index, "Got session from lru"); - return Ok(Some(with_info(o_info))) + ) -> Result> { + gum::trace!(target: LOG_TARGET, session_index, "Calling `get_session_info`"); + + if self.session_info_cache.get(&session_index).is_none() { + if let Some(info) = + Self::query_info_from_runtime(ctx, runtime, parent, session_index).await? + { + gum::trace!(target: LOG_TARGET, session_index, "Storing session info in lru!"); + self.session_info_cache.insert(session_index, info); + } else { + return Ok(None) + } } - if let Some(info) = - self.query_info_from_runtime(ctx, runtime, parent, session_index).await? 
- { - gum::trace!(target: LOG_TARGET, session_index, "Calling `with_info`"); - let r = with_info(&info); - gum::trace!(target: LOG_TARGET, session_index, "Storing session info in lru!"); - self.session_info_cache.insert(session_index, info); - Ok(Some(r)) - } else { - Ok(None) - } + Ok(self.session_info_cache.get(&session_index).map(|i| &*i)) } /// Variant of `report_bad` that never fails, but just logs errors. @@ -171,7 +166,6 @@ impl SessionCache { /// /// Returns: `None` if not a validator. async fn query_info_from_runtime( - &self, ctx: &mut Context, runtime: &mut RuntimeInfo, relay_parent: Hash, @@ -181,6 +175,9 @@ impl SessionCache { .get_session_info_by_index(ctx.sender(), relay_parent, session_index) .await?; + let node_features = + request_node_features(relay_parent, session_index, ctx.sender()).await?; + let discovery_keys = info.session_info.discovery_keys.clone(); let mut validator_groups = info.session_info.validator_groups.clone(); @@ -208,7 +205,13 @@ impl SessionCache { }) .collect(); - let info = SessionInfo { validator_groups, our_index, session_index, our_group }; + let info = SessionInfo { + validator_groups, + our_index, + session_index, + our_group, + node_features, + }; return Ok(Some(info)) } return Ok(None) diff --git a/polkadot/node/network/availability-distribution/src/requester/tests.rs b/polkadot/node/network/availability-distribution/src/requester/tests.rs index 0dedd4f091acd..09567a8f87d32 100644 --- a/polkadot/node/network/availability-distribution/src/requester/tests.rs +++ b/polkadot/node/network/availability-distribution/src/requester/tests.rs @@ -14,21 +14,17 @@ // You should have received a copy of the GNU General Public License // along with Polkadot. If not, see . 
-use std::collections::HashMap; - -use std::future::Future; - use futures::FutureExt; +use std::{collections::HashMap, future::Future}; -use polkadot_node_network_protocol::jaeger; +use polkadot_node_network_protocol::{jaeger, request_response::ReqProtocolNames}; use polkadot_node_primitives::{BlockData, ErasureChunk, PoV}; -use polkadot_node_subsystem_test_helpers::mock::new_leaf; use polkadot_node_subsystem_util::runtime::RuntimeInfo; use polkadot_primitives::{ - BlockNumber, CoreState, ExecutorParams, GroupIndex, Hash, Id as ParaId, NodeFeatures, + BlockNumber, ChunkIndex, CoreState, ExecutorParams, GroupIndex, Hash, Id as ParaId, ScheduledCore, SessionIndex, SessionInfo, }; -use sp_core::traits::SpawnNamed; +use sp_core::{testing::TaskExecutor, traits::SpawnNamed}; use polkadot_node_subsystem::{ messages::{ @@ -38,19 +34,21 @@ use polkadot_node_subsystem::{ ActiveLeavesUpdate, SpawnGlue, }; use polkadot_node_subsystem_test_helpers::{ - make_subsystem_context, mock::make_ferdie_keystore, TestSubsystemContext, - TestSubsystemContextHandle, + make_subsystem_context, + mock::{make_ferdie_keystore, new_leaf}, + TestSubsystemContext, TestSubsystemContextHandle, }; -use sp_core::testing::TaskExecutor; - -use crate::tests::mock::{get_valid_chunk_data, make_session_info, OccupiedCoreBuilder}; +use crate::tests::{ + mock::{get_valid_chunk_data, make_session_info, OccupiedCoreBuilder}, + node_features_with_mapping_enabled, +}; use super::Requester; fn get_erasure_chunk() -> ErasureChunk { let pov = PoV { block_data: BlockData(vec![45, 46, 47]) }; - get_valid_chunk_data(pov).1 + get_valid_chunk_data(pov, 10, ChunkIndex(0)).1 } #[derive(Clone)] @@ -126,7 +124,7 @@ fn spawn_virtual_overseer( .expect("Receiver should be alive."); }, RuntimeApiRequest::NodeFeatures(_, tx) => { - tx.send(Ok(NodeFeatures::EMPTY)) + tx.send(Ok(node_features_with_mapping_enabled())) .expect("Receiver should be alive."); }, RuntimeApiRequest::AvailabilityCores(tx) => { @@ -146,6 +144,8 @@ fn 
spawn_virtual_overseer( group_responsible: GroupIndex(1), para_id, relay_parent: hash, + n_validators: 10, + chunk_index: ChunkIndex(0), } .build() .0, @@ -201,7 +201,8 @@ fn test_harness>( #[test] fn check_ancestry_lookup_in_same_session() { let test_state = TestState::new(); - let mut requester = Requester::new(Default::default()); + let mut requester = + Requester::new(ReqProtocolNames::new(&Hash::repeat_byte(0xff), None), Default::default()); let keystore = make_ferdie_keystore(); let mut runtime = RuntimeInfo::new(Some(keystore)); @@ -268,7 +269,8 @@ fn check_ancestry_lookup_in_same_session() { #[test] fn check_ancestry_lookup_in_different_sessions() { let mut test_state = TestState::new(); - let mut requester = Requester::new(Default::default()); + let mut requester = + Requester::new(ReqProtocolNames::new(&Hash::repeat_byte(0xff), None), Default::default()); let keystore = make_ferdie_keystore(); let mut runtime = RuntimeInfo::new(Some(keystore)); diff --git a/polkadot/node/network/availability-distribution/src/responder.rs b/polkadot/node/network/availability-distribution/src/responder.rs index 54b188f7f01fc..2c1885d277275 100644 --- a/polkadot/node/network/availability-distribution/src/responder.rs +++ b/polkadot/node/network/availability-distribution/src/responder.rs @@ -18,11 +18,12 @@ use std::sync::Arc; -use futures::channel::oneshot; +use futures::{channel::oneshot, select, FutureExt}; use fatality::Nested; +use parity_scale_codec::{Decode, Encode}; use polkadot_node_network_protocol::{ - request_response::{v1, IncomingRequest, IncomingRequestReceiver}, + request_response::{v1, v2, IncomingRequest, IncomingRequestReceiver, IsRequest}, UnifiedReputationChange as Rep, }; use polkadot_node_primitives::{AvailableData, ErasureChunk}; @@ -66,33 +67,66 @@ pub async fn run_pov_receiver( } /// Receiver task to be forked as a separate task to handle chunk requests. 
-pub async fn run_chunk_receiver( +pub async fn run_chunk_receivers( mut sender: Sender, - mut receiver: IncomingRequestReceiver, + mut receiver_v1: IncomingRequestReceiver, + mut receiver_v2: IncomingRequestReceiver, metrics: Metrics, ) where Sender: SubsystemSender, { + let make_resp_v1 = |chunk: Option| match chunk { + None => v1::ChunkFetchingResponse::NoSuchChunk, + Some(chunk) => v1::ChunkFetchingResponse::Chunk(chunk.into()), + }; + + let make_resp_v2 = |chunk: Option| match chunk { + None => v2::ChunkFetchingResponse::NoSuchChunk, + Some(chunk) => v2::ChunkFetchingResponse::Chunk(chunk.into()), + }; + loop { - match receiver.recv(|| vec![COST_INVALID_REQUEST]).await.into_nested() { - Ok(Ok(msg)) => { - answer_chunk_request_log(&mut sender, msg, &metrics).await; - }, - Err(fatal) => { - gum::debug!( - target: LOG_TARGET, - error = ?fatal, - "Shutting down chunk receiver." - ); - return - }, - Ok(Err(jfyi)) => { - gum::debug!( - target: LOG_TARGET, - error = ?jfyi, - "Error decoding incoming chunk request." - ); + select! { + res = receiver_v1.recv(|| vec![COST_INVALID_REQUEST]).fuse() => match res.into_nested() { + Ok(Ok(msg)) => { + answer_chunk_request_log(&mut sender, msg, make_resp_v1, &metrics).await; + }, + Err(fatal) => { + gum::debug!( + target: LOG_TARGET, + error = ?fatal, + "Shutting down chunk receiver." + ); + return + }, + Ok(Err(jfyi)) => { + gum::debug!( + target: LOG_TARGET, + error = ?jfyi, + "Error decoding incoming chunk request." + ); + } }, + res = receiver_v2.recv(|| vec![COST_INVALID_REQUEST]).fuse() => match res.into_nested() { + Ok(Ok(msg)) => { + answer_chunk_request_log(&mut sender, msg.into(), make_resp_v2, &metrics).await; + }, + Err(fatal) => { + gum::debug!( + target: LOG_TARGET, + error = ?fatal, + "Shutting down chunk receiver." + ); + return + }, + Ok(Err(jfyi)) => { + gum::debug!( + target: LOG_TARGET, + error = ?jfyi, + "Error decoding incoming chunk request." 
+ ); + } + } } } } @@ -124,15 +158,18 @@ pub async fn answer_pov_request_log( /// Variant of `answer_chunk_request` that does Prometheus metric and logging on errors. /// /// Any errors of `answer_request` will simply be logged. -pub async fn answer_chunk_request_log( +pub async fn answer_chunk_request_log( sender: &mut Sender, - req: IncomingRequest, + req: IncomingRequest, + make_response: MakeResp, metrics: &Metrics, -) -> () -where +) where + Req: IsRequest + Decode + Encode + Into, + Req::Response: Encode, Sender: SubsystemSender, + MakeResp: Fn(Option) -> Req::Response, { - let res = answer_chunk_request(sender, req).await; + let res = answer_chunk_request(sender, req, make_response).await; match res { Ok(result) => metrics.on_served_chunk(if result { SUCCEEDED } else { NOT_FOUND }), Err(err) => { @@ -177,39 +214,46 @@ where /// Answer an incoming chunk request by querying the av store. /// /// Returns: `Ok(true)` if chunk was found and served. -pub async fn answer_chunk_request( +pub async fn answer_chunk_request( sender: &mut Sender, - req: IncomingRequest, + req: IncomingRequest, + make_response: MakeResp, ) -> Result where Sender: SubsystemSender, + Req: IsRequest + Decode + Encode + Into, + Req::Response: Encode, + MakeResp: Fn(Option) -> Req::Response, { - let span = jaeger::Span::new(req.payload.candidate_hash, "answer-chunk-request"); + // V1 and V2 requests have the same payload, so decoding into either one will work. It's the + // responses that differ, hence the `MakeResp` generic. 
+ let payload: v1::ChunkFetchingRequest = req.payload.into(); + let span = jaeger::Span::new(payload.candidate_hash, "answer-chunk-request"); let _child_span = span .child("answer-chunk-request") - .with_trace_id(req.payload.candidate_hash) - .with_chunk_index(req.payload.index.0); + .with_trace_id(payload.candidate_hash) + .with_validator_index(payload.index); - let chunk = query_chunk(sender, req.payload.candidate_hash, req.payload.index).await?; + let chunk = query_chunk(sender, payload.candidate_hash, payload.index).await?; let result = chunk.is_some(); gum::trace!( target: LOG_TARGET, - hash = ?req.payload.candidate_hash, - index = ?req.payload.index, + hash = ?payload.candidate_hash, + index = ?payload.index, peer = ?req.peer, has_data = ?chunk.is_some(), "Serving chunk", ); - let response = match chunk { - None => v1::ChunkFetchingResponse::NoSuchChunk, - Some(chunk) => v1::ChunkFetchingResponse::Chunk(chunk.into()), - }; + let response = make_response(chunk); + + req.pending_response + .send_response(response) + .map_err(|_| JfyiError::SendResponse)?; - req.send_response(response).map_err(|_| JfyiError::SendResponse)?; Ok(result) } diff --git a/polkadot/node/network/availability-distribution/src/tests/mock.rs b/polkadot/node/network/availability-distribution/src/tests/mock.rs index 3df662fe546c0..b41c493a10721 100644 --- a/polkadot/node/network/availability-distribution/src/tests/mock.rs +++ b/polkadot/node/network/availability-distribution/src/tests/mock.rs @@ -23,9 +23,9 @@ use sp_keyring::Sr25519Keyring; use polkadot_erasure_coding::{branches, obtain_chunks_v1 as obtain_chunks}; use polkadot_node_primitives::{AvailableData, BlockData, ErasureChunk, PoV, Proof}; use polkadot_primitives::{ - CandidateCommitments, CandidateDescriptor, CandidateHash, CommittedCandidateReceipt, - GroupIndex, Hash, HeadData, Id as ParaId, IndexedVec, OccupiedCore, PersistedValidationData, - SessionInfo, ValidatorIndex, + CandidateCommitments, CandidateDescriptor, 
CandidateHash, ChunkIndex, + CommittedCandidateReceipt, GroupIndex, Hash, HeadData, Id as ParaId, IndexedVec, OccupiedCore, + PersistedValidationData, SessionInfo, ValidatorIndex, }; use polkadot_primitives_test_helpers::{ dummy_collator, dummy_collator_signature, dummy_hash, dummy_validation_code, @@ -75,13 +75,16 @@ pub struct OccupiedCoreBuilder { pub group_responsible: GroupIndex, pub para_id: ParaId, pub relay_parent: Hash, + pub n_validators: usize, + pub chunk_index: ChunkIndex, } impl OccupiedCoreBuilder { pub fn build(self) -> (OccupiedCore, (CandidateHash, ErasureChunk)) { let pov = PoV { block_data: BlockData(vec![45, 46, 47]) }; let pov_hash = pov.hash(); - let (erasure_root, chunk) = get_valid_chunk_data(pov.clone()); + let (erasure_root, chunk) = + get_valid_chunk_data(pov.clone(), self.n_validators, self.chunk_index); let candidate_receipt = TestCandidateBuilder { para_id: self.para_id, pov_hash, @@ -133,8 +136,11 @@ impl TestCandidateBuilder { } // Get chunk for index 0 -pub fn get_valid_chunk_data(pov: PoV) -> (Hash, ErasureChunk) { - let fake_validator_count = 10; +pub fn get_valid_chunk_data( + pov: PoV, + n_validators: usize, + chunk_index: ChunkIndex, +) -> (Hash, ErasureChunk) { let persisted = PersistedValidationData { parent_head: HeadData(vec![7, 8, 9]), relay_parent_number: Default::default(), @@ -142,17 +148,17 @@ pub fn get_valid_chunk_data(pov: PoV) -> (Hash, ErasureChunk) { relay_parent_storage_root: Default::default(), }; let available_data = AvailableData { validation_data: persisted, pov: Arc::new(pov) }; - let chunks = obtain_chunks(fake_validator_count, &available_data).unwrap(); + let chunks = obtain_chunks(n_validators, &available_data).unwrap(); let branches = branches(chunks.as_ref()); let root = branches.root(); let chunk = branches .enumerate() .map(|(index, (proof, chunk))| ErasureChunk { chunk: chunk.to_vec(), - index: ValidatorIndex(index as _), + index: ChunkIndex(index as _), proof: Proof::try_from(proof).unwrap(), }) - 
.next() - .expect("There really should be 10 chunks."); + .nth(chunk_index.0 as usize) + .expect("There really should be enough chunks."); (root, chunk) } diff --git a/polkadot/node/network/availability-distribution/src/tests/mod.rs b/polkadot/node/network/availability-distribution/src/tests/mod.rs index 214498979fb68..b30e11a293c8d 100644 --- a/polkadot/node/network/availability-distribution/src/tests/mod.rs +++ b/polkadot/node/network/availability-distribution/src/tests/mod.rs @@ -17,9 +17,12 @@ use std::collections::HashSet; use futures::{executor, future, Future}; +use rstest::rstest; -use polkadot_node_network_protocol::request_response::{IncomingRequest, ReqProtocolNames}; -use polkadot_primitives::{Block, CoreState, Hash}; +use polkadot_node_network_protocol::request_response::{ + IncomingRequest, Protocol, ReqProtocolNames, +}; +use polkadot_primitives::{node_features, Block, CoreState, Hash, NodeFeatures}; use sp_keystore::KeystorePtr; use polkadot_node_subsystem_test_helpers as test_helpers; @@ -35,67 +38,129 @@ pub(crate) mod mock; fn test_harness>( keystore: KeystorePtr, + req_protocol_names: ReqProtocolNames, test_fx: impl FnOnce(TestHarness) -> T, -) { - sp_tracing::try_init_simple(); +) -> std::result::Result<(), FatalError> { + sp_tracing::init_for_tests(); let pool = sp_core::testing::TaskExecutor::new(); let (context, virtual_overseer) = test_helpers::make_subsystem_context(pool.clone()); - let genesis_hash = Hash::repeat_byte(0xff); - let req_protocol_names = ReqProtocolNames::new(&genesis_hash, None); let (pov_req_receiver, pov_req_cfg) = IncomingRequest::get_config_receiver::< Block, sc_network::NetworkWorker, >(&req_protocol_names); - let (chunk_req_receiver, chunk_req_cfg) = IncomingRequest::get_config_receiver::< + let (chunk_req_v1_receiver, chunk_req_v1_cfg) = IncomingRequest::get_config_receiver::< + Block, + sc_network::NetworkWorker, + >(&req_protocol_names); + let (chunk_req_v2_receiver, chunk_req_v2_cfg) = 
IncomingRequest::get_config_receiver::< Block, sc_network::NetworkWorker, >(&req_protocol_names); let subsystem = AvailabilityDistributionSubsystem::new( keystore, - IncomingRequestReceivers { pov_req_receiver, chunk_req_receiver }, + IncomingRequestReceivers { pov_req_receiver, chunk_req_v1_receiver, chunk_req_v2_receiver }, + req_protocol_names, Default::default(), ); let subsystem = subsystem.run(context); - let test_fut = test_fx(TestHarness { virtual_overseer, pov_req_cfg, chunk_req_cfg, pool }); + let test_fut = test_fx(TestHarness { + virtual_overseer, + pov_req_cfg, + chunk_req_v1_cfg, + chunk_req_v2_cfg, + pool, + }); futures::pin_mut!(test_fut); futures::pin_mut!(subsystem); - executor::block_on(future::join(test_fut, subsystem)).1.unwrap(); + executor::block_on(future::join(test_fut, subsystem)).1 +} + +pub fn node_features_with_mapping_enabled() -> NodeFeatures { + let mut node_features = NodeFeatures::new(); + node_features.resize(node_features::FeatureIndex::AvailabilityChunkMapping as usize + 1, false); + node_features.set(node_features::FeatureIndex::AvailabilityChunkMapping as u8 as usize, true); + node_features } /// Simple basic check, whether the subsystem works as expected. /// /// Exceptional cases are tested as unit tests in `fetch_task`. 
-#[test] -fn check_basic() { - let state = TestState::default(); - test_harness(state.keystore.clone(), move |harness| state.run(harness)); +#[rstest] +#[case(NodeFeatures::EMPTY, Protocol::ChunkFetchingV1)] +#[case(NodeFeatures::EMPTY, Protocol::ChunkFetchingV2)] +#[case(node_features_with_mapping_enabled(), Protocol::ChunkFetchingV1)] +#[case(node_features_with_mapping_enabled(), Protocol::ChunkFetchingV2)] +fn check_basic(#[case] node_features: NodeFeatures, #[case] chunk_resp_protocol: Protocol) { + let req_protocol_names = ReqProtocolNames::new(&Hash::repeat_byte(0xff), None); + let state = + TestState::new(node_features.clone(), req_protocol_names.clone(), chunk_resp_protocol); + + if node_features == node_features_with_mapping_enabled() && + chunk_resp_protocol == Protocol::ChunkFetchingV1 + { + // For this specific case, chunk fetching is not possible, because the ValidatorIndex is not + // equal to the ChunkIndex and the peer does not send back the actual ChunkIndex. + let _ = test_harness(state.keystore.clone(), req_protocol_names, move |harness| { + state.run_assert_timeout(harness) + }); + } else { + test_harness(state.keystore.clone(), req_protocol_names, move |harness| state.run(harness)) + .unwrap(); + } } /// Check whether requester tries all validators in group. 
-#[test] -fn check_fetch_tries_all() { - let mut state = TestState::default(); +#[rstest] +#[case(NodeFeatures::EMPTY, Protocol::ChunkFetchingV1)] +#[case(NodeFeatures::EMPTY, Protocol::ChunkFetchingV2)] +#[case(node_features_with_mapping_enabled(), Protocol::ChunkFetchingV1)] +#[case(node_features_with_mapping_enabled(), Protocol::ChunkFetchingV2)] +fn check_fetch_tries_all( + #[case] node_features: NodeFeatures, + #[case] chunk_resp_protocol: Protocol, +) { + let req_protocol_names = ReqProtocolNames::new(&Hash::repeat_byte(0xff), None); + let mut state = + TestState::new(node_features.clone(), req_protocol_names.clone(), chunk_resp_protocol); for (_, v) in state.chunks.iter_mut() { // 4 validators in group, so this should still succeed: v.push(None); v.push(None); v.push(None); } - test_harness(state.keystore.clone(), move |harness| state.run(harness)); + + if node_features == node_features_with_mapping_enabled() && + chunk_resp_protocol == Protocol::ChunkFetchingV1 + { + // For this specific case, chunk fetching is not possible, because the ValidatorIndex is not + // equal to the ChunkIndex and the peer does not send back the actual ChunkIndex. + let _ = test_harness(state.keystore.clone(), req_protocol_names, move |harness| { + state.run_assert_timeout(harness) + }); + } else { + test_harness(state.keystore.clone(), req_protocol_names, move |harness| state.run(harness)) + .unwrap(); + } } /// Check whether requester tries all validators in group /// /// Check that requester will retry the fetch on error on the next block still pending /// availability. 
-#[test] -fn check_fetch_retry() { - let mut state = TestState::default(); +#[rstest] +#[case(NodeFeatures::EMPTY, Protocol::ChunkFetchingV1)] +#[case(NodeFeatures::EMPTY, Protocol::ChunkFetchingV2)] +#[case(node_features_with_mapping_enabled(), Protocol::ChunkFetchingV1)] +#[case(node_features_with_mapping_enabled(), Protocol::ChunkFetchingV2)] +fn check_fetch_retry(#[case] node_features: NodeFeatures, #[case] chunk_resp_protocol: Protocol) { + let req_protocol_names = ReqProtocolNames::new(&Hash::repeat_byte(0xff), None); + let mut state = + TestState::new(node_features.clone(), req_protocol_names.clone(), chunk_resp_protocol); state .cores .insert(state.relay_chain[2], state.cores.get(&state.relay_chain[1]).unwrap().clone()); @@ -126,5 +191,17 @@ fn check_fetch_retry() { v.push(None); v.push(None); } - test_harness(state.keystore.clone(), move |harness| state.run(harness)); + + if node_features == node_features_with_mapping_enabled() && + chunk_resp_protocol == Protocol::ChunkFetchingV1 + { + // For this specific case, chunk fetching is not possible, because the ValidatorIndex is not + // equal to the ChunkIndex and the peer does not send back the actual ChunkIndex. 
+ let _ = test_harness(state.keystore.clone(), req_protocol_names, move |harness| { + state.run_assert_timeout(harness) + }); + } else { + test_harness(state.keystore.clone(), req_protocol_names, move |harness| state.run(harness)) + .unwrap(); + } } diff --git a/polkadot/node/network/availability-distribution/src/tests/state.rs b/polkadot/node/network/availability-distribution/src/tests/state.rs index 93411511e763a..ecc3eefbf3da3 100644 --- a/polkadot/node/network/availability-distribution/src/tests/state.rs +++ b/polkadot/node/network/availability-distribution/src/tests/state.rs @@ -19,9 +19,9 @@ use std::{ time::Duration, }; -use network::ProtocolName; +use network::{request_responses::OutgoingResponse, ProtocolName, RequestFailure}; use polkadot_node_subsystem_test_helpers::TestSubsystemContextHandle; -use polkadot_node_subsystem_util::TimeoutExt; +use polkadot_node_subsystem_util::{availability_chunks::availability_chunk_index, TimeoutExt}; use futures::{ channel::{mpsc, oneshot}, @@ -35,7 +35,7 @@ use sp_core::{testing::TaskExecutor, traits::SpawnNamed}; use sp_keystore::KeystorePtr; use polkadot_node_network_protocol::request_response::{ - v1, IncomingRequest, OutgoingRequest, Requests, + v1, v2, IncomingRequest, OutgoingRequest, Protocol, ReqProtocolNames, Requests, }; use polkadot_node_primitives::ErasureChunk; use polkadot_node_subsystem::{ @@ -47,8 +47,8 @@ use polkadot_node_subsystem::{ }; use polkadot_node_subsystem_test_helpers as test_helpers; use polkadot_primitives::{ - CandidateHash, CoreState, ExecutorParams, GroupIndex, Hash, Id as ParaId, NodeFeatures, - ScheduledCore, SessionInfo, ValidatorIndex, + CandidateHash, ChunkIndex, CoreIndex, CoreState, ExecutorParams, GroupIndex, Hash, + Id as ParaId, NodeFeatures, ScheduledCore, SessionInfo, ValidatorIndex, }; use test_helpers::mock::{make_ferdie_keystore, new_leaf}; @@ -59,7 +59,8 @@ type VirtualOverseer = test_helpers::TestSubsystemContextHandle>, pub keystore: KeystorePtr, + pub node_features: 
NodeFeatures, + pub chunk_response_protocol: Protocol, + pub req_protocol_names: ReqProtocolNames, + pub our_chunk_index: ChunkIndex, } -impl Default for TestState { - fn default() -> Self { +impl TestState { + /// Initialize a default test state. + pub fn new( + node_features: NodeFeatures, + req_protocol_names: ReqProtocolNames, + chunk_response_protocol: Protocol, + ) -> Self { let relay_chain: Vec<_> = (1u8..10).map(Hash::repeat_byte).collect(); let chain_a = ParaId::from(1); let chain_b = ParaId::from(2); @@ -97,6 +107,14 @@ impl Default for TestState { let session_info = make_session_info(); + let our_chunk_index = availability_chunk_index( + Some(&node_features), + session_info.validators.len(), + CoreIndex(1), + ValidatorIndex(0), + ) + .unwrap(); + let (cores, chunks) = { let mut cores = HashMap::new(); let mut chunks = HashMap::new(); @@ -123,6 +141,8 @@ impl Default for TestState { group_responsible: GroupIndex(i as _), para_id: *para_id, relay_parent: *relay_parent, + n_validators: session_info.validators.len(), + chunk_index: our_chunk_index, } .build(); (CoreState::Occupied(core), chunk) @@ -132,8 +152,8 @@ impl Default for TestState { // Skip chunks for our own group (won't get fetched): let mut chunks_other_groups = p_chunks.into_iter(); chunks_other_groups.next(); - for (validator_index, chunk) in chunks_other_groups { - chunks.insert((validator_index, chunk.index), vec![Some(chunk)]); + for (candidate, chunk) in chunks_other_groups { + chunks.insert((candidate, ValidatorIndex(0)), vec![Some(chunk)]); } } (cores, chunks) @@ -145,18 +165,27 @@ impl Default for TestState { session_info, cores, keystore, + node_features, + chunk_response_protocol, + req_protocol_names, + our_chunk_index, } } -} -impl TestState { /// Run, but fail after some timeout. pub async fn run(self, harness: TestHarness) { // Make sure test won't run forever. 
- let f = self.run_inner(harness).timeout(Duration::from_secs(10)); + let f = self.run_inner(harness).timeout(Duration::from_secs(5)); assert!(f.await.is_some(), "Test ran into timeout"); } + /// Run, and assert an expected timeout. + pub async fn run_assert_timeout(self, harness: TestHarness) { + // Make sure test won't run forever. + let f = self.run_inner(harness).timeout(Duration::from_secs(5)); + assert!(f.await.is_none(), "Test should have run into timeout"); + } + /// Run tests with the given mock values in `TestState`. /// /// This will simply advance through the simulated chain and examines whether the subsystem @@ -214,15 +243,41 @@ impl TestState { )) => { for req in reqs { // Forward requests: - let in_req = to_incoming_req(&harness.pool, req); - harness - .chunk_req_cfg - .inbound_queue - .as_mut() - .unwrap() - .send(in_req.into_raw()) - .await - .unwrap(); + match self.chunk_response_protocol { + Protocol::ChunkFetchingV1 => { + let in_req = to_incoming_req_v1( + &harness.pool, + req, + self.req_protocol_names.get_name(Protocol::ChunkFetchingV1), + ); + + harness + .chunk_req_v1_cfg + .inbound_queue + .as_mut() + .unwrap() + .send(in_req.into_raw()) + .await + .unwrap(); + }, + Protocol::ChunkFetchingV2 => { + let in_req = to_incoming_req_v2( + &harness.pool, + req, + self.req_protocol_names.get_name(Protocol::ChunkFetchingV2), + ); + + harness + .chunk_req_v2_cfg + .inbound_queue + .as_mut() + .unwrap() + .send(in_req.into_raw()) + .await + .unwrap(); + }, + _ => panic!("Unexpected protocol"), + } } }, AllMessages::AvailabilityStore(AvailabilityStoreMessage::QueryChunk( @@ -240,13 +295,16 @@ impl TestState { AllMessages::AvailabilityStore(AvailabilityStoreMessage::StoreChunk { candidate_hash, chunk, + validator_index, tx, .. }) => { assert!( - self.valid_chunks.contains(&(candidate_hash, chunk.index)), + self.valid_chunks.contains(&(candidate_hash, validator_index)), "Only valid chunks should ever get stored." 
); + assert_eq!(self.our_chunk_index, chunk.index); + tx.send(Ok(())).expect("Receiver is expected to be alive"); gum::trace!(target: LOG_TARGET, "'Stored' fetched chunk."); remaining_stores -= 1; @@ -265,14 +323,15 @@ impl TestState { tx.send(Ok(Some(ExecutorParams::default()))) .expect("Receiver should be alive."); }, - RuntimeApiRequest::NodeFeatures(_, si_tx) => { - si_tx.send(Ok(NodeFeatures::EMPTY)).expect("Receiver should be alive."); - }, RuntimeApiRequest::AvailabilityCores(tx) => { gum::trace!(target: LOG_TARGET, cores= ?self.cores[&hash], hash = ?hash, "Sending out cores for hash"); tx.send(Ok(self.cores[&hash].clone())) .expect("Receiver should still be alive"); }, + RuntimeApiRequest::NodeFeatures(_, tx) => { + tx.send(Ok(self.node_features.clone())) + .expect("Receiver should still be alive"); + }, _ => { panic!("Unexpected runtime request: {:?}", req); }, @@ -286,7 +345,10 @@ impl TestState { .unwrap_or_default(); response_channel.send(Ok(ancestors)).expect("Receiver is expected to be alive"); }, - _ => {}, + + _ => { + panic!("Received unexpected message") + }, } } @@ -310,30 +372,47 @@ async fn overseer_recv(rx: &mut mpsc::UnboundedReceiver) -> AllMess rx.next().await.expect("Test subsystem no longer live") } -fn to_incoming_req( +fn to_incoming_req_v1( executor: &TaskExecutor, outgoing: Requests, + protocol_name: ProtocolName, ) -> IncomingRequest { match outgoing { - Requests::ChunkFetchingV1(OutgoingRequest { payload, pending_response, .. 
}) => { - let (tx, rx): (oneshot::Sender, oneshot::Receiver<_>) = - oneshot::channel(); - executor.spawn( - "message-forwarding", - None, - async { - let response = rx.await; - let payload = response.expect("Unexpected canceled request").result; - pending_response - .send( - payload - .map_err(|_| network::RequestFailure::Refused) - .map(|r| (r, ProtocolName::from(""))), - ) - .expect("Sending response is expected to work"); - } - .boxed(), - ); + Requests::ChunkFetching(OutgoingRequest { + pending_response, + fallback_request: Some((fallback_request, fallback_protocol)), + .. + }) => { + assert_eq!(fallback_protocol, Protocol::ChunkFetchingV1); + + let tx = spawn_message_forwarding(executor, protocol_name, pending_response); + + IncomingRequest::new( + // We don't really care: + network::PeerId::random().into(), + fallback_request, + tx, + ) + }, + _ => panic!("Unexpected request!"), + } +} + +fn to_incoming_req_v2( + executor: &TaskExecutor, + outgoing: Requests, + protocol_name: ProtocolName, +) -> IncomingRequest { + match outgoing { + Requests::ChunkFetching(OutgoingRequest { + payload, + pending_response, + fallback_request: Some((_, fallback_protocol)), + .. 
+ }) => { + assert_eq!(fallback_protocol, Protocol::ChunkFetchingV1); + + let tx = spawn_message_forwarding(executor, protocol_name, pending_response); IncomingRequest::new( // We don't really care: @@ -345,3 +424,26 @@ fn to_incoming_req( _ => panic!("Unexpected request!"), } } + +fn spawn_message_forwarding( + executor: &TaskExecutor, + protocol_name: ProtocolName, + pending_response: oneshot::Sender, ProtocolName), RequestFailure>>, +) -> oneshot::Sender { + let (tx, rx): (oneshot::Sender, oneshot::Receiver<_>) = + oneshot::channel(); + executor.spawn( + "message-forwarding", + None, + async { + let response = rx.await; + let payload = response.expect("Unexpected canceled request").result; + pending_response + .send(payload.map_err(|_| RequestFailure::Refused).map(|r| (r, protocol_name))) + .expect("Sending response is expected to work"); + } + .boxed(), + ); + + tx +} diff --git a/polkadot/node/network/availability-recovery/Cargo.toml b/polkadot/node/network/availability-recovery/Cargo.toml index eb503f502b298..1c2b5f4968ad2 100644 --- a/polkadot/node/network/availability-recovery/Cargo.toml +++ b/polkadot/node/network/availability-recovery/Cargo.toml @@ -30,10 +30,11 @@ sc-network = { path = "../../../../substrate/client/network" } [dev-dependencies] assert_matches = "1.4.0" -env_logger = "0.11" futures-timer = "3.0.2" +rstest = "0.18.2" log = { workspace = true, default-features = true } +sp-tracing = { path = "../../../../substrate/primitives/tracing" } sp-core = { path = "../../../../substrate/primitives/core" } sp-keyring = { path = "../../../../substrate/primitives/keyring" } sp-application-crypto = { path = "../../../../substrate/primitives/application-crypto" } diff --git a/polkadot/node/network/availability-recovery/benches/availability-recovery-regression-bench.rs b/polkadot/node/network/availability-recovery/benches/availability-recovery-regression-bench.rs index d36b898ea159b..e5a8f1eb7c913 100644 --- 
a/polkadot/node/network/availability-recovery/benches/availability-recovery-regression-bench.rs +++ b/polkadot/node/network/availability-recovery/benches/availability-recovery-regression-bench.rs @@ -23,7 +23,7 @@ use polkadot_subsystem_bench::{ availability::{ - benchmark_availability_read, prepare_test, DataAvailabilityReadOptions, + benchmark_availability_read, prepare_test, DataAvailabilityReadOptions, Strategy, TestDataAvailability, TestState, }, configuration::TestConfiguration, @@ -37,7 +37,7 @@ const BENCH_COUNT: usize = 10; fn main() -> Result<(), String> { let mut messages = vec![]; - let options = DataAvailabilityReadOptions { fetch_from_backers: true }; + let options = DataAvailabilityReadOptions { strategy: Strategy::FullFromBackers }; let mut config = TestConfiguration::default(); config.num_blocks = 3; config.generate_pov_sizes(); diff --git a/polkadot/node/network/availability-recovery/src/error.rs b/polkadot/node/network/availability-recovery/src/error.rs index 47277a521b81e..eaec4cbc9d942 100644 --- a/polkadot/node/network/availability-recovery/src/error.rs +++ b/polkadot/node/network/availability-recovery/src/error.rs @@ -16,20 +16,34 @@ //! The `Error` and `Result` types used by the subsystem. +use crate::LOG_TARGET; +use fatality::{fatality, Nested}; use futures::channel::oneshot; -use thiserror::Error; +use polkadot_node_network_protocol::request_response::incoming; +use polkadot_node_subsystem::{RecoveryError, SubsystemError}; +use polkadot_primitives::Hash; /// Error type used by the Availability Recovery subsystem. -#[derive(Debug, Error)] +#[fatality(splitable)] pub enum Error { - #[error(transparent)] - Subsystem(#[from] polkadot_node_subsystem::SubsystemError), + #[fatal] + #[error("Spawning subsystem task failed: {0}")] + SpawnTask(#[source] SubsystemError), + + /// Receiving subsystem message from overseer failed. 
+ #[fatal] + #[error("Receiving message from overseer failed: {0}")] + SubsystemReceive(#[source] SubsystemError), + #[fatal] #[error("failed to query full data from store")] CanceledQueryFullData(#[source] oneshot::Canceled), - #[error("failed to query session info")] - CanceledSessionInfo(#[source] oneshot::Canceled), + #[error("`SessionInfo` is `None` at {0}")] + SessionInfoUnavailable(Hash), + + #[error("failed to query node features from runtime")] + RequestNodeFeatures(#[source] polkadot_node_subsystem_util::runtime::Error), #[error("failed to send response")] CanceledResponseSender, @@ -40,8 +54,38 @@ pub enum Error { #[error(transparent)] Erasure(#[from] polkadot_erasure_coding::Error), + #[fatal] #[error(transparent)] - Util(#[from] polkadot_node_subsystem_util::Error), + Oneshot(#[from] oneshot::Canceled), + + #[fatal(forward)] + #[error("Error during recovery: {0}")] + Recovery(#[from] RecoveryError), + + #[fatal(forward)] + #[error("Retrieving next incoming request failed: {0}")] + IncomingRequest(#[from] incoming::Error), } pub type Result<T> = std::result::Result<T, Error>; + +/// Utility for eating top level errors and log them. +/// +/// We basically always want to try and continue on error, unless the error is fatal for the entire +/// subsystem. +pub fn log_error(result: Result<()>) -> std::result::Result<(), FatalError> { + match result.into_nested()? { + Ok(()) => Ok(()), + Err(jfyi) => { + jfyi.log(); + Ok(()) + }, + } +} + +impl JfyiError { + /// Log a `JfyiError`.
+ pub fn log(self) { + gum::warn!(target: LOG_TARGET, "{}", self); + } +} diff --git a/polkadot/node/network/availability-recovery/src/lib.rs b/polkadot/node/network/availability-recovery/src/lib.rs index b836870cd8afc..167125f987ab8 100644 --- a/polkadot/node/network/availability-recovery/src/lib.rs +++ b/polkadot/node/network/availability-recovery/src/lib.rs @@ -19,7 +19,7 @@ #![warn(missing_docs)] use std::{ - collections::{HashMap, VecDeque}, + collections::{BTreeMap, VecDeque}, iter::Iterator, num::NonZeroUsize, pin::Pin, @@ -34,31 +34,41 @@ use futures::{ stream::{FuturesUnordered, StreamExt}, task::{Context, Poll}, }; +use sc_network::ProtocolName; use schnellru::{ByLength, LruMap}; -use task::{FetchChunks, FetchChunksParams, FetchFull, FetchFullParams}; +use task::{ + FetchChunks, FetchChunksParams, FetchFull, FetchFullParams, FetchSystematicChunks, + FetchSystematicChunksParams, +}; -use fatality::Nested; use polkadot_erasure_coding::{ - branch_hash, branches, obtain_chunks_v1, recovery_threshold, Error as ErasureEncodingError, + branches, obtain_chunks_v1, recovery_threshold, systematic_recovery_threshold, + Error as ErasureEncodingError, }; use task::{RecoveryParams, RecoveryStrategy, RecoveryTask}; +use error::{log_error, Error, FatalError, Result}; use polkadot_node_network_protocol::{ - request_response::{v1 as request_v1, IncomingRequestReceiver}, + request_response::{ + v1 as request_v1, v2 as request_v2, IncomingRequestReceiver, IsRequest, ReqProtocolNames, + }, UnifiedReputationChange as Rep, }; -use polkadot_node_primitives::{AvailableData, ErasureChunk}; +use polkadot_node_primitives::AvailableData; use polkadot_node_subsystem::{ errors::RecoveryError, jaeger, messages::{AvailabilityRecoveryMessage, AvailabilityStoreMessage}, overseer, ActiveLeavesUpdate, FromOrchestra, OverseerSignal, SpawnedSubsystem, - SubsystemContext, SubsystemError, SubsystemResult, + SubsystemContext, SubsystemError, +}; +use polkadot_node_subsystem_util::{ + 
availability_chunks::availability_chunk_indices, + runtime::{ExtendedSessionInfo, RuntimeInfo}, }; -use polkadot_node_subsystem_util::request_session_info; use polkadot_primitives::{ - BlakeTwo256, BlockNumber, CandidateHash, CandidateReceipt, GroupIndex, Hash, HashT, - SessionIndex, SessionInfo, ValidatorIndex, + node_features, BlockNumber, CandidateHash, CandidateReceipt, ChunkIndex, CoreIndex, GroupIndex, + Hash, SessionIndex, ValidatorIndex, }; mod error; @@ -70,6 +80,8 @@ pub use metrics::Metrics; #[cfg(test)] mod tests; +type RecoveryResult = std::result::Result; + const LOG_TARGET: &str = "parachain::availability-recovery"; // Size of the LRU cache where we keep recovered data. @@ -85,13 +97,27 @@ pub const FETCH_CHUNKS_THRESHOLD: usize = 4 * 1024 * 1024; #[derive(Clone, PartialEq)] /// The strategy we use to recover the PoV. pub enum RecoveryStrategyKind { - /// We always try the backing group first, then fallback to validator chunks. - BackersFirstAlways, /// We try the backing group first if PoV size is lower than specified, then fallback to /// validator chunks. BackersFirstIfSizeLower(usize), + /// We try the backing group first if PoV size is lower than specified, then fallback to + /// systematic chunks. Regular chunk recovery as a last resort. + BackersFirstIfSizeLowerThenSystematicChunks(usize), + + /// The following variants are only helpful for integration tests. + /// + /// We always try the backing group first, then fallback to validator chunks. + #[allow(dead_code)] + BackersFirstAlways, /// We always recover using validator chunks. + #[allow(dead_code)] ChunksAlways, + /// First try the backing group. Then systematic chunks. + #[allow(dead_code)] + BackersThenSystematicChunks, + /// Always recover using systematic chunks, fall back to regular chunks. + #[allow(dead_code)] + SystematicChunks, } /// The Availability Recovery Subsystem. 
@@ -109,11 +135,15 @@ pub struct AvailabilityRecoverySubsystem { metrics: Metrics, /// The type of check to perform after available data was recovered. post_recovery_check: PostRecoveryCheck, + /// Full protocol name for ChunkFetchingV1. + req_v1_protocol_name: ProtocolName, + /// Full protocol name for ChunkFetchingV2. + req_v2_protocol_name: ProtocolName, } #[derive(Clone, PartialEq, Debug)] /// The type of check to perform after available data was recovered. -pub enum PostRecoveryCheck { +enum PostRecoveryCheck { /// Reencode the data and check erasure root. For validators. Reencode, /// Only check the pov hash. For collators only. @@ -121,56 +151,18 @@ pub enum PostRecoveryCheck { } /// Expensive erasure coding computations that we want to run on a blocking thread. -pub enum ErasureTask { +enum ErasureTask { /// Reconstructs `AvailableData` from chunks given `n_validators`. Reconstruct( usize, - HashMap, - oneshot::Sender>, + BTreeMap>, + oneshot::Sender>, ), /// Re-encode `AvailableData` into erasure chunks in order to verify the provided root hash of /// the Merkle tree. Reencode(usize, Hash, AvailableData, oneshot::Sender>), } -const fn is_unavailable( - received_chunks: usize, - requesting_chunks: usize, - unrequested_validators: usize, - threshold: usize, -) -> bool { - received_chunks + requesting_chunks + unrequested_validators < threshold -} - -/// Check validity of a chunk. 
-fn is_chunk_valid(params: &RecoveryParams, chunk: &ErasureChunk) -> bool { - let anticipated_hash = - match branch_hash(&params.erasure_root, chunk.proof(), chunk.index.0 as usize) { - Ok(hash) => hash, - Err(e) => { - gum::debug!( - target: LOG_TARGET, - candidate_hash = ?params.candidate_hash, - validator_index = ?chunk.index, - error = ?e, - "Invalid Merkle proof", - ); - return false - }, - }; - let erasure_chunk_hash = BlakeTwo256::hash(&chunk.chunk); - if anticipated_hash != erasure_chunk_hash { - gum::debug!( - target: LOG_TARGET, - candidate_hash = ?params.candidate_hash, - validator_index = ?chunk.index, - "Merkle proof mismatch" - ); - return false - } - true -} - /// Re-encode the data into erasure chunks in order to verify /// the root hash of the provided Merkle tree, which is built /// on-top of the encoded chunks. @@ -214,12 +206,12 @@ fn reconstructed_data_matches_root( /// Accumulate all awaiting sides for some particular `AvailableData`. struct RecoveryHandle { candidate_hash: CandidateHash, - remote: RemoteHandle>, - awaiting: Vec>>, + remote: RemoteHandle, + awaiting: Vec>, } impl Future for RecoveryHandle { - type Output = Option<(CandidateHash, Result)>; + type Output = Option<(CandidateHash, RecoveryResult)>; fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll { let mut indices_to_remove = Vec::new(); @@ -273,7 +265,7 @@ enum CachedRecovery { impl CachedRecovery { /// Convert back to `Result` to deliver responses. 
- fn into_result(self) -> Result { + fn into_result(self) -> RecoveryResult { match self { Self::Valid(d) => Ok(d), Self::Invalid => Err(RecoveryError::Invalid), @@ -281,9 +273,9 @@ impl CachedRecovery { } } -impl TryFrom> for CachedRecovery { +impl TryFrom for CachedRecovery { type Error = (); - fn try_from(o: Result) -> Result { + fn try_from(o: RecoveryResult) -> std::result::Result { match o { Ok(d) => Ok(Self::Valid(d)), Err(RecoveryError::Invalid) => Ok(Self::Invalid), @@ -305,6 +297,9 @@ struct State { /// An LRU cache of recently recovered data. availability_lru: LruMap, + + /// Cached runtime info. + runtime_info: RuntimeInfo, } impl Default for State { @@ -313,6 +308,7 @@ impl Default for State { ongoing_recoveries: FuturesUnordered::new(), live_block: (0, Hash::default()), availability_lru: LruMap::new(ByLength::new(LRU_SIZE)), + runtime_info: RuntimeInfo::new(None), } } } @@ -329,9 +325,10 @@ impl AvailabilityRecoverySubsystem { } /// Handles a signal from the overseer. -async fn handle_signal(state: &mut State, signal: OverseerSignal) -> SubsystemResult { +/// Returns true if subsystem receives a deadly signal. +async fn handle_signal(state: &mut State, signal: OverseerSignal) -> bool { match signal { - OverseerSignal::Conclude => Ok(true), + OverseerSignal::Conclude => true, OverseerSignal::ActiveLeaves(ActiveLeavesUpdate { activated, .. 
}) => { // if activated is non-empty, set state.live_block to the highest block in `activated` if let Some(activated) = activated { @@ -340,9 +337,9 @@ async fn handle_signal(state: &mut State, signal: OverseerSignal) -> SubsystemRe } } - Ok(false) + false }, - OverseerSignal::BlockFinalized(_, _) => Ok(false), + OverseerSignal::BlockFinalized(_, _) => false, } } @@ -351,27 +348,11 @@ async fn handle_signal(state: &mut State, signal: OverseerSignal) -> SubsystemRe async fn launch_recovery_task( state: &mut State, ctx: &mut Context, - session_info: SessionInfo, - receipt: CandidateReceipt, - response_sender: oneshot::Sender>, - metrics: &Metrics, + response_sender: oneshot::Sender, recovery_strategies: VecDeque::Sender>>>, - bypass_availability_store: bool, - post_recovery_check: PostRecoveryCheck, -) -> error::Result<()> { - let candidate_hash = receipt.hash(); - let params = RecoveryParams { - validator_authority_keys: session_info.discovery_keys.clone(), - n_validators: session_info.validators.len(), - threshold: recovery_threshold(session_info.validators.len())?, - candidate_hash, - erasure_root: receipt.descriptor.erasure_root, - metrics: metrics.clone(), - bypass_availability_store, - post_recovery_check, - pov_hash: receipt.descriptor.pov_hash, - }; - + params: RecoveryParams, +) -> Result<()> { + let candidate_hash = params.candidate_hash; let recovery_task = RecoveryTask::new(ctx.sender().clone(), params, recovery_strategies); let (remote, remote_handle) = recovery_task.run().remote_handle(); @@ -382,15 +363,8 @@ async fn launch_recovery_task( awaiting: vec![response_sender], }); - if let Err(e) = ctx.spawn("recovery-task", Box::pin(remote)) { - gum::warn!( - target: LOG_TARGET, - err = ?e, - "Failed to spawn a recovery task", - ); - } - - Ok(()) + ctx.spawn("recovery-task", Box::pin(remote)) + .map_err(|err| Error::SpawnTask(err)) } /// Handles an availability recovery request. 
@@ -401,13 +375,16 @@ async fn handle_recover( receipt: CandidateReceipt, session_index: SessionIndex, backing_group: Option, - response_sender: oneshot::Sender>, + response_sender: oneshot::Sender, metrics: &Metrics, erasure_task_tx: futures::channel::mpsc::Sender, recovery_strategy_kind: RecoveryStrategyKind, bypass_availability_store: bool, post_recovery_check: PostRecoveryCheck, -) -> error::Result<()> { + maybe_core_index: Option, + req_v1_protocol_name: ProtocolName, + req_v2_protocol_name: ProtocolName, +) -> Result<()> { let candidate_hash = receipt.hash(); let span = jaeger::Span::new(candidate_hash, "availability-recovery") @@ -416,14 +393,7 @@ async fn handle_recover( if let Some(result) = state.availability_lru.get(&candidate_hash).cloned().map(|v| v.into_result()) { - if let Err(e) = response_sender.send(result) { - gum::warn!( - target: LOG_TARGET, - err = ?e, - "Error responding with an availability recovery result", - ); - } - return Ok(()) + return response_sender.send(result).map_err(|_| Error::CanceledResponseSender) } if let Some(i) = @@ -434,100 +404,182 @@ async fn handle_recover( } let _span = span.child("not-cached"); - let session_info = request_session_info(state.live_block.1, session_index, ctx.sender()) - .await - .await - .map_err(error::Error::CanceledSessionInfo)??; + let session_info_res = state + .runtime_info + .get_session_info_by_index(ctx.sender(), state.live_block.1, session_index) + .await; let _span = span.child("session-info-ctx-received"); - match session_info { - Some(session_info) => { + match session_info_res { + Ok(ExtendedSessionInfo { session_info, node_features, .. 
}) => { + let mut backer_group = None; + let n_validators = session_info.validators.len(); + let systematic_threshold = systematic_recovery_threshold(n_validators)?; let mut recovery_strategies: VecDeque< Box::Sender>>, - > = VecDeque::with_capacity(2); + > = VecDeque::with_capacity(3); if let Some(backing_group) = backing_group { if let Some(backing_validators) = session_info.validator_groups.get(backing_group) { let mut small_pov_size = true; - if let RecoveryStrategyKind::BackersFirstIfSizeLower(fetch_chunks_threshold) = - recovery_strategy_kind - { - // Get our own chunk size to get an estimate of the PoV size. - let chunk_size: Result, error::Error> = - query_chunk_size(ctx, candidate_hash).await; - if let Ok(Some(chunk_size)) = chunk_size { - let pov_size_estimate = - chunk_size.saturating_mul(session_info.validators.len()) / 3; - small_pov_size = pov_size_estimate < fetch_chunks_threshold; - - gum::trace!( - target: LOG_TARGET, - ?candidate_hash, - pov_size_estimate, - fetch_chunks_threshold, - enabled = small_pov_size, - "Prefer fetch from backing group", - ); - } else { - // we have a POV limit but were not able to query the chunk size, so - // don't use the backing group. - small_pov_size = false; - } + match recovery_strategy_kind { + RecoveryStrategyKind::BackersFirstIfSizeLower(fetch_chunks_threshold) | + RecoveryStrategyKind::BackersFirstIfSizeLowerThenSystematicChunks( + fetch_chunks_threshold, + ) => { + // Get our own chunk size to get an estimate of the PoV size. 
+ let chunk_size: Result> = + query_chunk_size(ctx, candidate_hash).await; + if let Ok(Some(chunk_size)) = chunk_size { + let pov_size_estimate = chunk_size * systematic_threshold; + small_pov_size = pov_size_estimate < fetch_chunks_threshold; + + if small_pov_size { + gum::trace!( + target: LOG_TARGET, + ?candidate_hash, + pov_size_estimate, + fetch_chunks_threshold, + "Prefer fetch from backing group", + ); + } + } else { + // we have a POV limit but were not able to query the chunk size, so + // don't use the backing group. + small_pov_size = false; + } + }, + _ => {}, }; match (&recovery_strategy_kind, small_pov_size) { (RecoveryStrategyKind::BackersFirstAlways, _) | - (RecoveryStrategyKind::BackersFirstIfSizeLower(_), true) => recovery_strategies.push_back( - Box::new(FetchFull::new(FetchFullParams { - validators: backing_validators.to_vec(), - erasure_task_tx: erasure_task_tx.clone(), - })), - ), + (RecoveryStrategyKind::BackersFirstIfSizeLower(_), true) | + ( + RecoveryStrategyKind::BackersFirstIfSizeLowerThenSystematicChunks(_), + true, + ) | + (RecoveryStrategyKind::BackersThenSystematicChunks, _) => + recovery_strategies.push_back(Box::new(FetchFull::new( + FetchFullParams { validators: backing_validators.to_vec() }, + ))), _ => {}, }; + + backer_group = Some(backing_validators); + } + } + + let chunk_mapping_enabled = if let Some(&true) = node_features + .get(usize::from(node_features::FeatureIndex::AvailabilityChunkMapping as u8)) + .as_deref() + { + true + } else { + false + }; + + // We can only attempt systematic recovery if we received the core index of the + // candidate and chunk mapping is enabled. 
+ if let Some(core_index) = maybe_core_index { + if matches!( + recovery_strategy_kind, + RecoveryStrategyKind::BackersThenSystematicChunks | + RecoveryStrategyKind::SystematicChunks | + RecoveryStrategyKind::BackersFirstIfSizeLowerThenSystematicChunks(_) + ) && chunk_mapping_enabled + { + let chunk_indices = + availability_chunk_indices(Some(node_features), n_validators, core_index)?; + + let chunk_indices: VecDeque<_> = chunk_indices + .iter() + .enumerate() + .map(|(v_index, c_index)| { + ( + *c_index, + ValidatorIndex( + u32::try_from(v_index) + .expect("validator count should not exceed u32"), + ), + ) + }) + .collect(); + + // Only get the validators according to the threshold. + let validators = chunk_indices + .clone() + .into_iter() + .filter(|(c_index, _)| { + usize::try_from(c_index.0) + .expect("usize is at least u32 bytes on all modern targets.") < + systematic_threshold + }) + .collect(); + + recovery_strategies.push_back(Box::new(FetchSystematicChunks::new( + FetchSystematicChunksParams { + validators, + backers: backer_group.map(|v| v.to_vec()).unwrap_or_else(|| vec![]), + }, + ))); } } recovery_strategies.push_back(Box::new(FetchChunks::new(FetchChunksParams { n_validators: session_info.validators.len(), - erasure_task_tx, }))); + let session_info = session_info.clone(); + + let n_validators = session_info.validators.len(); + launch_recovery_task( state, ctx, - session_info, - receipt, response_sender, - metrics, recovery_strategies, - bypass_availability_store, - post_recovery_check, + RecoveryParams { + validator_authority_keys: session_info.discovery_keys.clone(), + n_validators, + threshold: recovery_threshold(n_validators)?, + systematic_threshold, + candidate_hash, + erasure_root: receipt.descriptor.erasure_root, + metrics: metrics.clone(), + bypass_availability_store, + post_recovery_check, + pov_hash: receipt.descriptor.pov_hash, + req_v1_protocol_name, + req_v2_protocol_name, + chunk_mapping_enabled, + erasure_task_tx, + }, ) .await }, - 
None => { - gum::warn!(target: LOG_TARGET, "SessionInfo is `None` at {:?}", state.live_block); + Err(_) => { response_sender .send(Err(RecoveryError::Unavailable)) - .map_err(|_| error::Error::CanceledResponseSender)?; - Ok(()) + .map_err(|_| Error::CanceledResponseSender)?; + + Err(Error::SessionInfoUnavailable(state.live_block.1)) }, } } -/// Queries a chunk from av-store. +/// Queries the full `AvailableData` from av-store. #[overseer::contextbounds(AvailabilityRecovery, prefix = self::overseer)] async fn query_full_data( ctx: &mut Context, candidate_hash: CandidateHash, -) -> error::Result> { +) -> Result> { let (tx, rx) = oneshot::channel(); ctx.send_message(AvailabilityStoreMessage::QueryAvailableData(candidate_hash, tx)) .await; - rx.await.map_err(error::Error::CanceledQueryFullData) + rx.await.map_err(Error::CanceledQueryFullData) } /// Queries a chunk from av-store. @@ -535,12 +587,12 @@ async fn query_full_data( async fn query_chunk_size( ctx: &mut Context, candidate_hash: CandidateHash, -) -> error::Result> { +) -> Result> { let (tx, rx) = oneshot::channel(); ctx.send_message(AvailabilityStoreMessage::QueryChunkSize(candidate_hash, tx)) .await; - rx.await.map_err(error::Error::CanceledQueryFullData) + rx.await.map_err(Error::CanceledQueryFullData) } #[overseer::contextbounds(AvailabilityRecovery, prefix = self::overseer)] @@ -551,6 +603,7 @@ impl AvailabilityRecoverySubsystem { pub fn for_collator( fetch_chunks_threshold: Option, req_receiver: IncomingRequestReceiver, + req_protocol_names: &ReqProtocolNames, metrics: Metrics, ) -> Self { Self { @@ -561,58 +614,67 @@ impl AvailabilityRecoverySubsystem { post_recovery_check: PostRecoveryCheck::PovHash, req_receiver, metrics, + req_v1_protocol_name: req_protocol_names + .get_name(request_v1::ChunkFetchingRequest::PROTOCOL), + req_v2_protocol_name: req_protocol_names + .get_name(request_v2::ChunkFetchingRequest::PROTOCOL), } } - /// Create a new instance of `AvailabilityRecoverySubsystem` which starts with a 
fast path to - /// request data from backers. - pub fn with_fast_path( - req_receiver: IncomingRequestReceiver, - metrics: Metrics, - ) -> Self { - Self { - recovery_strategy_kind: RecoveryStrategyKind::BackersFirstAlways, - bypass_availability_store: false, - post_recovery_check: PostRecoveryCheck::Reencode, - req_receiver, - metrics, - } - } - - /// Create a new instance of `AvailabilityRecoverySubsystem` which requests only chunks - pub fn with_chunks_only( + /// Create an optimised new instance of `AvailabilityRecoverySubsystem` suitable for validator + /// nodes, which: + /// - for small POVs (over the `fetch_chunks_threshold` or the + /// `CONSERVATIVE_FETCH_CHUNKS_THRESHOLD`), it attempts full recovery from backers, if backing + /// group supplied. + /// - for large POVs, attempts systematic recovery, if core_index supplied and + /// AvailabilityChunkMapping node feature is enabled. + /// - as a last resort, attempt regular chunk recovery from all validators. + pub fn for_validator( + fetch_chunks_threshold: Option, req_receiver: IncomingRequestReceiver, + req_protocol_names: &ReqProtocolNames, metrics: Metrics, ) -> Self { Self { - recovery_strategy_kind: RecoveryStrategyKind::ChunksAlways, + recovery_strategy_kind: + RecoveryStrategyKind::BackersFirstIfSizeLowerThenSystematicChunks( + fetch_chunks_threshold.unwrap_or(CONSERVATIVE_FETCH_CHUNKS_THRESHOLD), + ), bypass_availability_store: false, post_recovery_check: PostRecoveryCheck::Reencode, req_receiver, metrics, + req_v1_protocol_name: req_protocol_names + .get_name(request_v1::ChunkFetchingRequest::PROTOCOL), + req_v2_protocol_name: req_protocol_names + .get_name(request_v2::ChunkFetchingRequest::PROTOCOL), } } - /// Create a new instance of `AvailabilityRecoverySubsystem` which requests chunks if PoV is - /// above a threshold. - pub fn with_chunks_if_pov_large( - fetch_chunks_threshold: Option, + /// Customise the recovery strategy kind + /// Currently only useful for tests. 
+ #[cfg(any(test, feature = "subsystem-benchmarks"))] + pub fn with_recovery_strategy_kind( req_receiver: IncomingRequestReceiver, + req_protocol_names: &ReqProtocolNames, metrics: Metrics, + recovery_strategy_kind: RecoveryStrategyKind, ) -> Self { Self { - recovery_strategy_kind: RecoveryStrategyKind::BackersFirstIfSizeLower( - fetch_chunks_threshold.unwrap_or(CONSERVATIVE_FETCH_CHUNKS_THRESHOLD), - ), + recovery_strategy_kind, bypass_availability_store: false, post_recovery_check: PostRecoveryCheck::Reencode, req_receiver, metrics, + req_v1_protocol_name: req_protocol_names + .get_name(request_v1::ChunkFetchingRequest::PROTOCOL), + req_v2_protocol_name: req_protocol_names + .get_name(request_v2::ChunkFetchingRequest::PROTOCOL), } } /// Starts the inner subsystem loop. - pub async fn run(self, mut ctx: Context) -> SubsystemResult<()> { + pub async fn run(self, mut ctx: Context) -> std::result::Result<(), FatalError> { let mut state = State::default(); let Self { mut req_receiver, @@ -620,6 +682,8 @@ impl AvailabilityRecoverySubsystem { recovery_strategy_kind, bypass_availability_store, post_recovery_check, + req_v1_protocol_name, + req_v2_protocol_name, } = self; let (erasure_task_tx, erasure_task_rx) = futures::channel::mpsc::channel(16); @@ -655,53 +719,44 @@ impl AvailabilityRecoverySubsystem { loop { let recv_req = req_receiver.recv(|| vec![COST_INVALID_REQUEST]).fuse(); pin_mut!(recv_req); - futures::select! { + let res = futures::select! 
{ erasure_task = erasure_task_rx.next() => { match erasure_task { Some(task) => { - let send_result = to_pool + to_pool .next() .expect("Pool size is `NonZeroUsize`; qed") .send(task) .await - .map_err(|_| RecoveryError::ChannelClosed); - - if let Err(err) = send_result { - gum::warn!( - target: LOG_TARGET, - ?err, - "Failed to send erasure coding task", - ); - } + .map_err(|_| RecoveryError::ChannelClosed) }, None => { - gum::debug!( - target: LOG_TARGET, - "Erasure task channel closed", - ); - - return Err(SubsystemError::with_origin("availability-recovery", RecoveryError::ChannelClosed)) + Err(RecoveryError::ChannelClosed) } - } + }.map_err(Into::into) } - v = ctx.recv().fuse() => { - match v? { - FromOrchestra::Signal(signal) => if handle_signal( - &mut state, - signal, - ).await? { - gum::debug!(target: LOG_TARGET, "subsystem concluded"); - return Ok(()); - } - FromOrchestra::Communication { msg } => { - match msg { - AvailabilityRecoveryMessage::RecoverAvailableData( - receipt, - session_index, - maybe_backing_group, - response_sender, - ) => { - if let Err(e) = handle_recover( + signal = ctx.recv().fuse() => { + match signal { + Ok(signal) => { + match signal { + FromOrchestra::Signal(signal) => if handle_signal( + &mut state, + signal, + ).await { + gum::debug!(target: LOG_TARGET, "subsystem concluded"); + return Ok(()); + } else { + Ok(()) + }, + FromOrchestra::Communication { + msg: AvailabilityRecoveryMessage::RecoverAvailableData( + receipt, + session_index, + maybe_backing_group, + maybe_core_index, + response_sender, + ) + } => handle_recover( &mut state, &mut ctx, receipt, @@ -712,21 +767,18 @@ impl AvailabilityRecoverySubsystem { erasure_task_tx.clone(), recovery_strategy_kind.clone(), bypass_availability_store, - post_recovery_check.clone() - ).await { - gum::warn!( - target: LOG_TARGET, - err = ?e, - "Error handling a recovery request", - ); - } - } + post_recovery_check.clone(), + maybe_core_index, + req_v1_protocol_name.clone(), + 
req_v2_protocol_name.clone(), + ).await } - } + }, + Err(e) => Err(Error::SubsystemReceive(e)) } } in_req = recv_req => { - match in_req.into_nested().map_err(|fatal| SubsystemError::with_origin("availability-recovery", fatal))? { + match in_req { Ok(req) => { if bypass_availability_store { gum::debug!( @@ -734,40 +786,42 @@ impl AvailabilityRecoverySubsystem { "Skipping request to availability-store.", ); let _ = req.send_response(None.into()); - continue - } - match query_full_data(&mut ctx, req.payload.candidate_hash).await { - Ok(res) => { - let _ = req.send_response(res.into()); - } - Err(e) => { - gum::debug!( - target: LOG_TARGET, - err = ?e, - "Failed to query available data.", - ); - - let _ = req.send_response(None.into()); + Ok(()) + } else { + match query_full_data(&mut ctx, req.payload.candidate_hash).await { + Ok(res) => { + let _ = req.send_response(res.into()); + Ok(()) + } + Err(e) => { + let _ = req.send_response(None.into()); + Err(e) + } } } } - Err(jfyi) => { - gum::debug!( - target: LOG_TARGET, - error = ?jfyi, - "Decoding incoming request failed" - ); - continue - } + Err(e) => Err(Error::IncomingRequest(e)) } } output = state.ongoing_recoveries.select_next_some() => { + let mut res = Ok(()); if let Some((candidate_hash, result)) = output { + if let Err(ref e) = result { + res = Err(Error::Recovery(e.clone())); + } + if let Ok(recovery) = CachedRecovery::try_from(result) { state.availability_lru.insert(candidate_hash, recovery); } } + + res } + }; + + // Only bubble up fatal errors, but log all of them. 
+ if let Err(e) = res { + log_error(Err(e))?; } } } @@ -835,7 +889,13 @@ async fn erasure_task_thread( Some(ErasureTask::Reconstruct(n_validators, chunks, sender)) => { let _ = sender.send(polkadot_erasure_coding::reconstruct_v1( n_validators, - chunks.values().map(|c| (&c.chunk[..], c.index.0 as usize)), + chunks.iter().map(|(c_index, chunk)| { + ( + &chunk[..], + usize::try_from(c_index.0) + .expect("usize is at least u32 bytes on all modern targets."), + ) + }), )); }, Some(ErasureTask::Reencode(n_validators, root, available_data, sender)) => { diff --git a/polkadot/node/network/availability-recovery/src/metrics.rs b/polkadot/node/network/availability-recovery/src/metrics.rs index 9f4cddc57e43a..4e269df55027b 100644 --- a/polkadot/node/network/availability-recovery/src/metrics.rs +++ b/polkadot/node/network/availability-recovery/src/metrics.rs @@ -14,9 +14,13 @@ // You should have received a copy of the GNU General Public License // along with Polkadot. If not, see . +use polkadot_node_subsystem::prometheus::HistogramVec; use polkadot_node_subsystem_util::metrics::{ self, - prometheus::{self, Counter, CounterVec, Histogram, Opts, PrometheusError, Registry, U64}, + prometheus::{ + self, prometheus::HistogramTimer, Counter, CounterVec, Histogram, Opts, PrometheusError, + Registry, U64, + }, }; /// Availability Distribution metrics. @@ -28,26 +32,61 @@ struct MetricsInner { /// Number of sent chunk requests. /// /// Gets incremented on each sent chunk requests. - chunk_requests_issued: Counter, + /// + /// Split by chunk type: + /// - `regular_chunks` + /// - `systematic_chunks` + chunk_requests_issued: CounterVec, + /// Total number of bytes recovered /// /// Gets incremented on each successful recovery recovered_bytes_total: Counter, + /// A counter for finished chunk requests. /// - /// Split by result: + /// Split by the chunk type (`regular_chunks` or `systematic_chunks`) + /// + /// Also split by result: /// - `no_such_chunk` ... 
peer did not have the requested chunk /// - `timeout` ... request timed out. - /// - `network_error` ... Some networking issue except timeout + /// - `error` ... Some networking issue except timeout /// - `invalid` ... Chunk was received, but not valid. /// - `success` chunk_requests_finished: CounterVec, + /// A counter for successful chunk requests, split by the network protocol version. + chunk_request_protocols: CounterVec, + + /// Number of sent available data requests. + full_data_requests_issued: Counter, + + /// Counter for finished available data requests. + /// + /// Split by the result type: + /// + /// - `no_such_data` ... peer did not have the requested data + /// - `timeout` ... request timed out. + /// - `error` ... Some networking issue except timeout + /// - `invalid` ... data was received, but not valid. + /// - `success` + full_data_requests_finished: CounterVec, + /// The duration of request to response. - time_chunk_request: Histogram, + /// + /// Split by chunk type (`regular_chunks` or `systematic_chunks`). + time_chunk_request: HistogramVec, /// The duration between the pure recovery and verification. - time_erasure_recovery: Histogram, + /// + /// Split by recovery type (`regular_chunks`, `systematic_chunks` or `full_from_backers`). + time_erasure_recovery: HistogramVec, + + /// How much time it takes to reconstruct the available data from chunks. + /// + /// Split by chunk type (`regular_chunks` or `systematic_chunks`), as the algorithms are + /// different. + time_erasure_reconstruct: HistogramVec, /// How much time it takes to re-encode the data into erasure chunks in order to verify /// the root hash of the provided Merkle tree. See `reconstructed_data_matches_root`. @@ -58,6 +97,10 @@ struct MetricsInner { time_full_recovery: Histogram, /// Number of full recoveries that have been finished one way or the other. + /// + /// Split by recovery `strategy_type` (`full_from_backers, systematic_chunks, regular_chunks, + /// all`). 
`all` is used for failed recoveries that tried all available strategies. + /// Also split by `result` type. full_recoveries_finished: CounterVec, /// Number of full recoveries that have been started on this subsystem. @@ -73,87 +116,175 @@ impl Metrics { Metrics(None) } - /// Increment counter on fetched labels. - pub fn on_chunk_request_issued(&self) { + /// Increment counter for chunk requests. + pub fn on_chunk_request_issued(&self, chunk_type: &str) { if let Some(metrics) = &self.0 { - metrics.chunk_requests_issued.inc() + metrics.chunk_requests_issued.with_label_values(&[chunk_type]).inc() + } + } + + /// Increment counter for full data requests. + pub fn on_full_request_issued(&self) { + if let Some(metrics) = &self.0 { + metrics.full_data_requests_issued.inc() } } /// A chunk request timed out. - pub fn on_chunk_request_timeout(&self) { + pub fn on_chunk_request_timeout(&self, chunk_type: &str) { + if let Some(metrics) = &self.0 { + metrics + .chunk_requests_finished + .with_label_values(&[chunk_type, "timeout"]) + .inc() + } + } + + /// A full data request timed out. + pub fn on_full_request_timeout(&self) { if let Some(metrics) = &self.0 { - metrics.chunk_requests_finished.with_label_values(&["timeout"]).inc() + metrics.full_data_requests_finished.with_label_values(&["timeout"]).inc() } } /// A chunk request failed because validator did not have its chunk. - pub fn on_chunk_request_no_such_chunk(&self) { + pub fn on_chunk_request_no_such_chunk(&self, chunk_type: &str) { + if let Some(metrics) = &self.0 { + metrics + .chunk_requests_finished + .with_label_values(&[chunk_type, "no_such_chunk"]) + .inc() + } + } + + /// A full data request failed because the validator did not have it. 
+ pub fn on_full_request_no_such_data(&self) { if let Some(metrics) = &self.0 { - metrics.chunk_requests_finished.with_label_values(&["no_such_chunk"]).inc() + metrics.full_data_requests_finished.with_label_values(&["no_such_data"]).inc() } } /// A chunk request failed for some non timeout related network error. - pub fn on_chunk_request_error(&self) { + pub fn on_chunk_request_error(&self, chunk_type: &str) { if let Some(metrics) = &self.0 { - metrics.chunk_requests_finished.with_label_values(&["error"]).inc() + metrics.chunk_requests_finished.with_label_values(&[chunk_type, "error"]).inc() + } + } + + /// A full data request failed for some non timeout related network error. + pub fn on_full_request_error(&self) { + if let Some(metrics) = &self.0 { + metrics.full_data_requests_finished.with_label_values(&["error"]).inc() } } /// A chunk request succeeded, but was not valid. - pub fn on_chunk_request_invalid(&self) { + pub fn on_chunk_request_invalid(&self, chunk_type: &str) { if let Some(metrics) = &self.0 { - metrics.chunk_requests_finished.with_label_values(&["invalid"]).inc() + metrics + .chunk_requests_finished + .with_label_values(&[chunk_type, "invalid"]) + .inc() + } + } + + /// A full data request succeeded, but was not valid. + pub fn on_full_request_invalid(&self) { + if let Some(metrics) = &self.0 { + metrics.full_data_requests_finished.with_label_values(&["invalid"]).inc() } } /// A chunk request succeeded. - pub fn on_chunk_request_succeeded(&self) { + pub fn on_chunk_request_succeeded(&self, chunk_type: &str) { if let Some(metrics) = &self.0 { - metrics.chunk_requests_finished.with_label_values(&["success"]).inc() + metrics + .chunk_requests_finished + .with_label_values(&[chunk_type, "success"]) + .inc() + } + } + + /// A chunk response was received on the v1 protocol. 
+ pub fn on_chunk_response_v1(&self) { + if let Some(metrics) = &self.0 { + metrics.chunk_request_protocols.with_label_values(&["v1"]).inc() + } + } + + /// A chunk response was received on the v2 protocol. + pub fn on_chunk_response_v2(&self) { + if let Some(metrics) = &self.0 { + metrics.chunk_request_protocols.with_label_values(&["v2"]).inc() + } + } + + /// A full data request succeeded. + pub fn on_full_request_succeeded(&self) { + if let Some(metrics) = &self.0 { + metrics.full_data_requests_finished.with_label_values(&["success"]).inc() } } /// Get a timer to time request/response duration. - pub fn time_chunk_request(&self) -> Option { - self.0.as_ref().map(|metrics| metrics.time_chunk_request.start_timer()) + pub fn time_chunk_request(&self, chunk_type: &str) -> Option { + self.0.as_ref().map(|metrics| { + metrics.time_chunk_request.with_label_values(&[chunk_type]).start_timer() + }) } /// Get a timer to time erasure code recover. - pub fn time_erasure_recovery(&self) -> Option { - self.0.as_ref().map(|metrics| metrics.time_erasure_recovery.start_timer()) + pub fn time_erasure_recovery(&self, chunk_type: &str) -> Option { + self.0.as_ref().map(|metrics| { + metrics.time_erasure_recovery.with_label_values(&[chunk_type]).start_timer() + }) + } + + /// Get a timer for available data reconstruction. + pub fn time_erasure_reconstruct(&self, chunk_type: &str) -> Option { + self.0.as_ref().map(|metrics| { + metrics.time_erasure_reconstruct.with_label_values(&[chunk_type]).start_timer() + }) } /// Get a timer to time chunk encoding. - pub fn time_reencode_chunks(&self) -> Option { + pub fn time_reencode_chunks(&self) -> Option { self.0.as_ref().map(|metrics| metrics.time_reencode_chunks.start_timer()) } /// Get a timer to measure the time of the complete recovery process. 
- pub fn time_full_recovery(&self) -> Option { + pub fn time_full_recovery(&self) -> Option { self.0.as_ref().map(|metrics| metrics.time_full_recovery.start_timer()) } /// A full recovery succeeded. - pub fn on_recovery_succeeded(&self, bytes: usize) { + pub fn on_recovery_succeeded(&self, strategy_type: &str, bytes: usize) { if let Some(metrics) = &self.0 { - metrics.full_recoveries_finished.with_label_values(&["success"]).inc(); + metrics + .full_recoveries_finished + .with_label_values(&["success", strategy_type]) + .inc(); metrics.recovered_bytes_total.inc_by(bytes as u64) } } /// A full recovery failed (data not available). - pub fn on_recovery_failed(&self) { + pub fn on_recovery_failed(&self, strategy_type: &str) { if let Some(metrics) = &self.0 { - metrics.full_recoveries_finished.with_label_values(&["failure"]).inc() + metrics + .full_recoveries_finished + .with_label_values(&["failure", strategy_type]) + .inc() } } /// A full recovery failed (data was recovered, but invalid). - pub fn on_recovery_invalid(&self) { + pub fn on_recovery_invalid(&self, strategy_type: &str) { if let Some(metrics) = &self.0 { - metrics.full_recoveries_finished.with_label_values(&["invalid"]).inc() + metrics + .full_recoveries_finished + .with_label_values(&["invalid", strategy_type]) + .inc() } } @@ -169,9 +300,17 @@ impl metrics::Metrics for Metrics { fn try_register(registry: &Registry) -> Result { let metrics = MetricsInner { chunk_requests_issued: prometheus::register( + CounterVec::new( + Opts::new("polkadot_parachain_availability_recovery_chunk_requests_issued", + "Total number of issued chunk requests."), + &["type"] + )?, + registry, + )?, + full_data_requests_issued: prometheus::register( Counter::new( - "polkadot_parachain_availability_recovery_chunk_requests_issued", - "Total number of issued chunk requests.", + "polkadot_parachain_availability_recovery_full_data_requests_issued", + "Total number of issued full data requests.", )?, registry, )?, @@ -188,22 +327,49 @@ 
impl metrics::Metrics for Metrics { "polkadot_parachain_availability_recovery_chunk_requests_finished", "Total number of chunk requests finished.", ), + &["result", "type"], + )?, + registry, + )?, + chunk_request_protocols: prometheus::register( + CounterVec::new( + Opts::new( + "polkadot_parachain_availability_recovery_chunk_request_protocols", + "Total number of successful chunk requests, mapped by the protocol version (v1 or v2).", + ), + &["protocol"], + )?, + registry, + )?, + full_data_requests_finished: prometheus::register( + CounterVec::new( + Opts::new( + "polkadot_parachain_availability_recovery_full_data_requests_finished", + "Total number of full data requests finished.", + ), &["result"], )?, registry, )?, time_chunk_request: prometheus::register( - prometheus::Histogram::with_opts(prometheus::HistogramOpts::new( + prometheus::HistogramVec::new(prometheus::HistogramOpts::new( "polkadot_parachain_availability_recovery_time_chunk_request", "Time spent waiting for a response to a chunk request", - ))?, + ), &["type"])?, registry, )?, time_erasure_recovery: prometheus::register( - prometheus::Histogram::with_opts(prometheus::HistogramOpts::new( + prometheus::HistogramVec::new(prometheus::HistogramOpts::new( "polkadot_parachain_availability_recovery_time_erasure_recovery", "Time spent to recover the erasure code and verify the merkle root by re-encoding as erasure chunks", - ))?, + ), &["type"])?, + registry, + )?, + time_erasure_reconstruct: prometheus::register( + prometheus::HistogramVec::new(prometheus::HistogramOpts::new( + "polkadot_parachain_availability_recovery_time_erasure_reconstruct", + "Time spent to reconstruct the data from chunks", + ), &["type"])?, registry, )?, time_reencode_chunks: prometheus::register( @@ -226,7 +392,7 @@ impl metrics::Metrics for Metrics { "polkadot_parachain_availability_recovery_recoveries_finished", "Total number of recoveries that finished.", ), - &["result"], + &["result", "strategy_type"], )?, registry, )?, diff 
--git a/polkadot/node/network/availability-recovery/src/task.rs b/polkadot/node/network/availability-recovery/src/task.rs deleted file mode 100644 index c300c221da5c6..0000000000000 --- a/polkadot/node/network/availability-recovery/src/task.rs +++ /dev/null @@ -1,861 +0,0 @@ -// Copyright (C) Parity Technologies (UK) Ltd. -// This file is part of Polkadot. - -// Polkadot is free software: you can redistribute it and/or modify -// it under the terms of the GNU General Public License as published by -// the Free Software Foundation, either version 3 of the License, or -// (at your option) any later version. - -// Polkadot is distributed in the hope that it will be useful, -// but WITHOUT ANY WARRANTY; without even the implied warranty of -// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the -// GNU General Public License for more details. - -// You should have received a copy of the GNU General Public License -// along with Polkadot. If not, see . - -//! Recovery task and associated strategies. 
- -#![warn(missing_docs)] - -use crate::{ - futures_undead::FuturesUndead, is_chunk_valid, is_unavailable, metrics::Metrics, ErasureTask, - PostRecoveryCheck, LOG_TARGET, -}; -use futures::{channel::oneshot, SinkExt}; -use parity_scale_codec::Encode; -#[cfg(not(test))] -use polkadot_node_network_protocol::request_response::CHUNK_REQUEST_TIMEOUT; -use polkadot_node_network_protocol::request_response::{ - self as req_res, outgoing::RequestError, OutgoingRequest, Recipient, Requests, -}; -use polkadot_node_primitives::{AvailableData, ErasureChunk}; -use polkadot_node_subsystem::{ - messages::{AvailabilityStoreMessage, NetworkBridgeTxMessage}, - overseer, RecoveryError, -}; -use polkadot_primitives::{AuthorityDiscoveryId, CandidateHash, Hash, ValidatorIndex}; -use rand::seq::SliceRandom; -use sc_network::{IfDisconnected, OutboundFailure, RequestFailure}; -use std::{ - collections::{HashMap, VecDeque}, - time::Duration, -}; - -// How many parallel recovery tasks should be running at once. -const N_PARALLEL: usize = 50; - -/// Time after which we consider a request to have failed -/// -/// and we should try more peers. Note in theory the request times out at the network level, -/// measurements have shown, that in practice requests might actually take longer to fail in -/// certain occasions. (The very least, authority discovery is not part of the timeout.) -/// -/// For the time being this value is the same as the timeout on the networking layer, but as this -/// timeout is more soft than the networking one, it might make sense to pick different values as -/// well. -#[cfg(not(test))] -const TIMEOUT_START_NEW_REQUESTS: Duration = CHUNK_REQUEST_TIMEOUT; -#[cfg(test)] -const TIMEOUT_START_NEW_REQUESTS: Duration = Duration::from_millis(100); - -#[async_trait::async_trait] -/// Common trait for runnable recovery strategies. -pub trait RecoveryStrategy: Send { - /// Main entry point of the strategy. 
- async fn run( - &mut self, - state: &mut State, - sender: &mut Sender, - common_params: &RecoveryParams, - ) -> Result; - - /// Return the name of the strategy for logging purposes. - fn display_name(&self) -> &'static str; -} - -/// Recovery parameters common to all strategies in a `RecoveryTask`. -pub struct RecoveryParams { - /// Discovery ids of `validators`. - pub validator_authority_keys: Vec, - - /// Number of validators. - pub n_validators: usize, - - /// The number of chunks needed. - pub threshold: usize, - - /// A hash of the relevant candidate. - pub candidate_hash: CandidateHash, - - /// The root of the erasure encoding of the candidate. - pub erasure_root: Hash, - - /// Metrics to report. - pub metrics: Metrics, - - /// Do not request data from availability-store. Useful for collators. - pub bypass_availability_store: bool, - - /// The type of check to perform after available data was recovered. - pub post_recovery_check: PostRecoveryCheck, - - /// The blake2-256 hash of the PoV. - pub pov_hash: Hash, -} - -/// Intermediate/common data that must be passed between `RecoveryStrategy`s belonging to the -/// same `RecoveryTask`. -pub struct State { - /// Chunks received so far. - received_chunks: HashMap, -} - -impl State { - fn new() -> Self { - Self { received_chunks: HashMap::new() } - } - - fn insert_chunk(&mut self, validator: ValidatorIndex, chunk: ErasureChunk) { - self.received_chunks.insert(validator, chunk); - } - - fn chunk_count(&self) -> usize { - self.received_chunks.len() - } - - /// Retrieve the local chunks held in the av-store (either 0 or 1). - async fn populate_from_av_store( - &mut self, - params: &RecoveryParams, - sender: &mut Sender, - ) -> Vec { - let (tx, rx) = oneshot::channel(); - sender - .send_message(AvailabilityStoreMessage::QueryAllChunks(params.candidate_hash, tx)) - .await; - - match rx.await { - Ok(chunks) => { - // This should either be length 1 or 0. 
If we had the whole data, - // we wouldn't have reached this stage. - let chunk_indices: Vec<_> = chunks.iter().map(|c| c.index).collect(); - - for chunk in chunks { - if is_chunk_valid(params, &chunk) { - gum::trace!( - target: LOG_TARGET, - candidate_hash = ?params.candidate_hash, - validator_index = ?chunk.index, - "Found valid chunk on disk" - ); - self.insert_chunk(chunk.index, chunk); - } else { - gum::error!( - target: LOG_TARGET, - "Loaded invalid chunk from disk! Disk/Db corruption _very_ likely - please fix ASAP!" - ); - }; - } - - chunk_indices - }, - Err(oneshot::Canceled) => { - gum::warn!( - target: LOG_TARGET, - candidate_hash = ?params.candidate_hash, - "Failed to reach the availability store" - ); - - vec![] - }, - } - } - - /// Launch chunk requests in parallel, according to the parameters. - async fn launch_parallel_chunk_requests( - &mut self, - params: &RecoveryParams, - sender: &mut Sender, - desired_requests_count: usize, - validators: &mut VecDeque, - requesting_chunks: &mut FuturesUndead< - Result, (ValidatorIndex, RequestError)>, - >, - ) where - Sender: overseer::AvailabilityRecoverySenderTrait, - { - let candidate_hash = &params.candidate_hash; - let already_requesting_count = requesting_chunks.len(); - - let mut requests = Vec::with_capacity(desired_requests_count - already_requesting_count); - - while requesting_chunks.len() < desired_requests_count { - if let Some(validator_index) = validators.pop_back() { - let validator = params.validator_authority_keys[validator_index.0 as usize].clone(); - gum::trace!( - target: LOG_TARGET, - ?validator, - ?validator_index, - ?candidate_hash, - "Requesting chunk", - ); - - // Request data.
- let raw_request = req_res::v1::ChunkFetchingRequest { - candidate_hash: params.candidate_hash, - index: validator_index, - }; - - let (req, res) = OutgoingRequest::new(Recipient::Authority(validator), raw_request); - requests.push(Requests::ChunkFetchingV1(req)); - - params.metrics.on_chunk_request_issued(); - let timer = params.metrics.time_chunk_request(); - - requesting_chunks.push(Box::pin(async move { - let _timer = timer; - match res.await { - Ok(req_res::v1::ChunkFetchingResponse::Chunk(chunk)) => - Ok(Some(chunk.recombine_into_chunk(&raw_request))), - Ok(req_res::v1::ChunkFetchingResponse::NoSuchChunk) => Ok(None), - Err(e) => Err((validator_index, e)), - } - })); - } else { - break - } - } - - sender - .send_message(NetworkBridgeTxMessage::SendRequests( - requests, - IfDisconnected::TryConnect, - )) - .await; - } - - /// Wait for a sufficient amount of chunks to reconstruct according to the provided `params`. - async fn wait_for_chunks( - &mut self, - params: &RecoveryParams, - validators: &mut VecDeque, - requesting_chunks: &mut FuturesUndead< - Result, (ValidatorIndex, RequestError)>, - >, - can_conclude: impl Fn(usize, usize, usize, &RecoveryParams, usize) -> bool, - ) -> (usize, usize) { - let metrics = &params.metrics; - - let mut total_received_responses = 0; - let mut error_count = 0; - - // Wait for all current requests to conclude or time-out, or until we reach enough chunks. - // We also declare requests undead, once `TIMEOUT_START_NEW_REQUESTS` is reached and will - // return in that case for `launch_parallel_requests` to fill up slots again.
- while let Some(request_result) = - requesting_chunks.next_with_timeout(TIMEOUT_START_NEW_REQUESTS).await - { - total_received_responses += 1; - - match request_result { - Ok(Some(chunk)) => - if is_chunk_valid(params, &chunk) { - metrics.on_chunk_request_succeeded(); - gum::trace!( - target: LOG_TARGET, - candidate_hash = ?params.candidate_hash, - validator_index = ?chunk.index, - "Received valid chunk", - ); - self.insert_chunk(chunk.index, chunk); - } else { - metrics.on_chunk_request_invalid(); - error_count += 1; - }, - Ok(None) => { - metrics.on_chunk_request_no_such_chunk(); - error_count += 1; - }, - Err((validator_index, e)) => { - error_count += 1; - - gum::trace!( - target: LOG_TARGET, - candidate_hash= ?params.candidate_hash, - err = ?e, - ?validator_index, - "Failure requesting chunk", - ); - - match e { - RequestError::InvalidResponse(_) => { - metrics.on_chunk_request_invalid(); - - gum::debug!( - target: LOG_TARGET, - candidate_hash = ?params.candidate_hash, - err = ?e, - ?validator_index, - "Chunk fetching response was invalid", - ); - }, - RequestError::NetworkError(err) => { - // No debug logs on general network errors - that became very spammy - // occasionally. - if let RequestFailure::Network(OutboundFailure::Timeout) = err { - metrics.on_chunk_request_timeout(); - } else { - metrics.on_chunk_request_error(); - } - - validators.push_front(validator_index); - }, - RequestError::Canceled(_) => { - metrics.on_chunk_request_error(); - - validators.push_front(validator_index); - }, - } - }, - } - - // Stop waiting for requests when we either can already recover the data - // or have gotten firm 'No' responses from enough validators. 
- if can_conclude( - validators.len(), - requesting_chunks.total_len(), - self.chunk_count(), - params, - error_count, - ) { - gum::debug!( - target: LOG_TARGET, - candidate_hash = ?params.candidate_hash, - received_chunks_count = ?self.chunk_count(), - requested_chunks_count = ?requesting_chunks.len(), - threshold = ?params.threshold, - "Can conclude availability for a candidate", - ); - break - } - } - - (total_received_responses, error_count) - } -} - -/// A stateful reconstruction of availability data in reference to -/// a candidate hash. -pub struct RecoveryTask { - sender: Sender, - params: RecoveryParams, - strategies: VecDeque>>, - state: State, -} - -impl RecoveryTask -where - Sender: overseer::AvailabilityRecoverySenderTrait, -{ - /// Instantiate a new recovery task. - pub fn new( - sender: Sender, - params: RecoveryParams, - strategies: VecDeque>>, - ) -> Self { - Self { sender, params, strategies, state: State::new() } - } - - async fn in_availability_store(&mut self) -> Option { - if !self.params.bypass_availability_store { - let (tx, rx) = oneshot::channel(); - self.sender - .send_message(AvailabilityStoreMessage::QueryAvailableData( - self.params.candidate_hash, - tx, - )) - .await; - - match rx.await { - Ok(Some(data)) => return Some(data), - Ok(None) => {}, - Err(oneshot::Canceled) => { - gum::warn!( - target: LOG_TARGET, - candidate_hash = ?self.params.candidate_hash, - "Failed to reach the availability store", - ) - }, - } - } - - None - } - - /// Run this recovery task to completion. It will loop through the configured strategies - /// in-order and return whenever the first one recovers the full `AvailableData`. 
- pub async fn run(mut self) -> Result { - if let Some(data) = self.in_availability_store().await { - return Ok(data) - } - - self.params.metrics.on_recovery_started(); - - let _timer = self.params.metrics.time_full_recovery(); - - while let Some(mut current_strategy) = self.strategies.pop_front() { - gum::debug!( - target: LOG_TARGET, - candidate_hash = ?self.params.candidate_hash, - "Starting `{}` strategy", - current_strategy.display_name(), - ); - - let res = current_strategy.run(&mut self.state, &mut self.sender, &self.params).await; - - match res { - Err(RecoveryError::Unavailable) => - if self.strategies.front().is_some() { - gum::debug!( - target: LOG_TARGET, - candidate_hash = ?self.params.candidate_hash, - "Recovery strategy `{}` did not conclude. Trying the next one.", - current_strategy.display_name(), - ); - continue - }, - Err(err) => { - match &err { - RecoveryError::Invalid => self.params.metrics.on_recovery_invalid(), - _ => self.params.metrics.on_recovery_failed(), - } - return Err(err) - }, - Ok(data) => { - self.params.metrics.on_recovery_succeeded(data.encoded_size()); - return Ok(data) - }, - } - } - - // We have no other strategies to try. - gum::warn!( - target: LOG_TARGET, - candidate_hash = ?self.params.candidate_hash, - "Recovery of available data failed.", - ); - self.params.metrics.on_recovery_failed(); - - Err(RecoveryError::Unavailable) - } -} - -/// `RecoveryStrategy` that sequentially tries to fetch the full `AvailableData` from -/// already-connected validators in the configured validator set. -pub struct FetchFull { - params: FetchFullParams, -} - -pub struct FetchFullParams { - /// Validators that will be used for fetching the data. - pub validators: Vec, - /// Channel to the erasure task handler. - pub erasure_task_tx: futures::channel::mpsc::Sender, -} - -impl FetchFull { - /// Create a new `FetchFull` recovery strategy. 
- pub fn new(mut params: FetchFullParams) -> Self { - params.validators.shuffle(&mut rand::thread_rng()); - Self { params } - } -} - -#[async_trait::async_trait] -impl RecoveryStrategy for FetchFull { - fn display_name(&self) -> &'static str { - "Full recovery from backers" - } - - async fn run( - &mut self, - _: &mut State, - sender: &mut Sender, - common_params: &RecoveryParams, - ) -> Result { - loop { - // Pop the next validator, and proceed to next fetch_chunks_task if we're out. - let validator_index = - self.params.validators.pop().ok_or_else(|| RecoveryError::Unavailable)?; - - // Request data. - let (req, response) = OutgoingRequest::new( - Recipient::Authority( - common_params.validator_authority_keys[validator_index.0 as usize].clone(), - ), - req_res::v1::AvailableDataFetchingRequest { - candidate_hash: common_params.candidate_hash, - }, - ); - - sender - .send_message(NetworkBridgeTxMessage::SendRequests( - vec![Requests::AvailableDataFetchingV1(req)], - IfDisconnected::ImmediateError, - )) - .await; - - match response.await { - Ok(req_res::v1::AvailableDataFetchingResponse::AvailableData(data)) => { - let maybe_data = match common_params.post_recovery_check { - PostRecoveryCheck::Reencode => { - let (reencode_tx, reencode_rx) = oneshot::channel(); - self.params - .erasure_task_tx - .send(ErasureTask::Reencode( - common_params.n_validators, - common_params.erasure_root, - data, - reencode_tx, - )) - .await - .map_err(|_| RecoveryError::ChannelClosed)?; - - reencode_rx.await.map_err(|_| RecoveryError::ChannelClosed)? 
- }, - PostRecoveryCheck::PovHash => - (data.pov.hash() == common_params.pov_hash).then_some(data), - }; - - match maybe_data { - Some(data) => { - gum::trace!( - target: LOG_TARGET, - candidate_hash = ?common_params.candidate_hash, - "Received full data", - ); - - return Ok(data) - }, - None => { - gum::debug!( - target: LOG_TARGET, - candidate_hash = ?common_params.candidate_hash, - ?validator_index, - "Invalid data response", - ); - - // it doesn't help to report the peer with req/res. - // we'll try the next backer. - }, - }; - }, - Ok(req_res::v1::AvailableDataFetchingResponse::NoSuchData) => {}, - Err(e) => gum::debug!( - target: LOG_TARGET, - candidate_hash = ?common_params.candidate_hash, - ?validator_index, - err = ?e, - "Error fetching full available data." - ), - } - } - } -} - -/// `RecoveryStrategy` that requests chunks from validators, in parallel. -pub struct FetchChunks { - /// How many requests have been unsuccessful so far. - error_count: usize, - /// Total number of responses that have been received, including failed ones. - total_received_responses: usize, - /// Collection of in-flight requests. - requesting_chunks: FuturesUndead, (ValidatorIndex, RequestError)>>, - /// A random shuffling of the validators which indicates the order in which we connect to the - /// validators and request the chunk from them. - validators: VecDeque, - /// Channel to the erasure task handler. - erasure_task_tx: futures::channel::mpsc::Sender, -} - -/// Parameters specific to the `FetchChunks` strategy. -pub struct FetchChunksParams { - /// Total number of validators. - pub n_validators: usize, - /// Channel to the erasure task handler. - pub erasure_task_tx: futures::channel::mpsc::Sender, -} - -impl FetchChunks { - /// Instantiate a new strategy. 
- pub fn new(params: FetchChunksParams) -> Self { - let mut shuffling: Vec<_> = (0..params.n_validators) - .map(|i| ValidatorIndex(i.try_into().expect("number of validators must fit in a u32"))) - .collect(); - shuffling.shuffle(&mut rand::thread_rng()); - - Self { - error_count: 0, - total_received_responses: 0, - requesting_chunks: FuturesUndead::new(), - validators: shuffling.into(), - erasure_task_tx: params.erasure_task_tx, - } - } - - fn is_unavailable( - unrequested_validators: usize, - in_flight_requests: usize, - chunk_count: usize, - threshold: usize, - ) -> bool { - is_unavailable(chunk_count, in_flight_requests, unrequested_validators, threshold) - } - - /// Desired number of parallel requests. - /// - /// For the given threshold (total required number of chunks) get the desired number of - /// requests we want to have running in parallel at this time. - fn get_desired_request_count(&self, chunk_count: usize, threshold: usize) -> usize { - // Upper bound for parallel requests. - // We want to limit this, so requests can be processed within the timeout and we limit the - // following feedback loop: - // 1. Requests fail due to timeout - // 2. We request more chunks to make up for it - // 3. Bandwidth is spread out even more, so we get even more timeouts - // 4. We request more chunks to make up for it ... - let max_requests_boundary = std::cmp::min(N_PARALLEL, threshold); - // How many chunks are still needed? - let remaining_chunks = threshold.saturating_sub(chunk_count); - // What is the current error rate, so we can make up for it? 
- let inv_error_rate = - self.total_received_responses.checked_div(self.error_count).unwrap_or(0); - // Actual number of requests we want to have in flight in parallel: - std::cmp::min( - max_requests_boundary, - remaining_chunks + remaining_chunks.checked_div(inv_error_rate).unwrap_or(0), - ) - } - - async fn attempt_recovery( - &mut self, - state: &mut State, - common_params: &RecoveryParams, - ) -> Result { - let recovery_duration = common_params.metrics.time_erasure_recovery(); - - // Send request to reconstruct available data from chunks. - let (avilable_data_tx, available_data_rx) = oneshot::channel(); - self.erasure_task_tx - .send(ErasureTask::Reconstruct( - common_params.n_validators, - // Safe to leave an empty vec in place, as we're stopping the recovery process if - // this reconstruct fails. - std::mem::take(&mut state.received_chunks), - avilable_data_tx, - )) - .await - .map_err(|_| RecoveryError::ChannelClosed)?; - - let available_data_response = - available_data_rx.await.map_err(|_| RecoveryError::ChannelClosed)?; - - match available_data_response { - Ok(data) => { - let maybe_data = match common_params.post_recovery_check { - PostRecoveryCheck::Reencode => { - // Send request to re-encode the chunks and check merkle root. 
- let (reencode_tx, reencode_rx) = oneshot::channel(); - self.erasure_task_tx - .send(ErasureTask::Reencode( - common_params.n_validators, - common_params.erasure_root, - data, - reencode_tx, - )) - .await - .map_err(|_| RecoveryError::ChannelClosed)?; - - reencode_rx.await.map_err(|_| RecoveryError::ChannelClosed)?.or_else(|| { - gum::trace!( - target: LOG_TARGET, - candidate_hash = ?common_params.candidate_hash, - erasure_root = ?common_params.erasure_root, - "Data recovery error - root mismatch", - ); - None - }) - }, - PostRecoveryCheck::PovHash => - (data.pov.hash() == common_params.pov_hash).then_some(data).or_else(|| { - gum::trace!( - target: LOG_TARGET, - candidate_hash = ?common_params.candidate_hash, - pov_hash = ?common_params.pov_hash, - "Data recovery error - PoV hash mismatch", - ); - None - }), - }; - - if let Some(data) = maybe_data { - gum::trace!( - target: LOG_TARGET, - candidate_hash = ?common_params.candidate_hash, - erasure_root = ?common_params.erasure_root, - "Data recovery from chunks complete", - ); - - Ok(data) - } else { - recovery_duration.map(|rd| rd.stop_and_discard()); - - Err(RecoveryError::Invalid) - } - }, - Err(err) => { - recovery_duration.map(|rd| rd.stop_and_discard()); - gum::trace!( - target: LOG_TARGET, - candidate_hash = ?common_params.candidate_hash, - erasure_root = ?common_params.erasure_root, - ?err, - "Data recovery error ", - ); - - Err(RecoveryError::Invalid) - }, - } - } -} - -#[async_trait::async_trait] -impl RecoveryStrategy for FetchChunks { - fn display_name(&self) -> &'static str { - "Fetch chunks" - } - - async fn run( - &mut self, - state: &mut State, - sender: &mut Sender, - common_params: &RecoveryParams, - ) -> Result { - // First query the store for any chunks we've got. 
- if !common_params.bypass_availability_store { - let local_chunk_indices = state.populate_from_av_store(common_params, sender).await; - self.validators.retain(|i| !local_chunk_indices.contains(i)); - } - - // No need to query the validators that have the chunks we already received. - self.validators.retain(|i| !state.received_chunks.contains_key(i)); - - loop { - // If received_chunks has more than threshold entries, attempt to recover the data. - // If that fails, or a re-encoding of it doesn't match the expected erasure root, - // return Err(RecoveryError::Invalid). - // Do this before requesting any chunks because we may have enough of them coming from - // past RecoveryStrategies. - if state.chunk_count() >= common_params.threshold { - return self.attempt_recovery(state, common_params).await - } - - if Self::is_unavailable( - self.validators.len(), - self.requesting_chunks.total_len(), - state.chunk_count(), - common_params.threshold, - ) { - gum::debug!( - target: LOG_TARGET, - candidate_hash = ?common_params.candidate_hash, - erasure_root = ?common_params.erasure_root, - received = %state.chunk_count(), - requesting = %self.requesting_chunks.len(), - total_requesting = %self.requesting_chunks.total_len(), - n_validators = %common_params.n_validators, - "Data recovery from chunks is not possible", - ); - - return Err(RecoveryError::Unavailable) - } - - let desired_requests_count = - self.get_desired_request_count(state.chunk_count(), common_params.threshold); - let already_requesting_count = self.requesting_chunks.len(); - gum::debug!( - target: LOG_TARGET, - ?common_params.candidate_hash, - ?desired_requests_count, - error_count= ?self.error_count, - total_received = ?self.total_received_responses, - threshold = ?common_params.threshold, - ?already_requesting_count, - "Requesting availability chunks for a candidate", - ); - state - .launch_parallel_chunk_requests( - common_params, - sender, - desired_requests_count, - &mut self.validators, - &mut 
self.requesting_chunks, - ) - .await; - - let (total_responses, error_count) = state - .wait_for_chunks( - common_params, - &mut self.validators, - &mut self.requesting_chunks, - |unrequested_validators, reqs, chunk_count, params, _error_count| { - chunk_count >= params.threshold || - Self::is_unavailable( - unrequested_validators, - reqs, - chunk_count, - params.threshold, - ) - }, - ) - .await; - - self.total_received_responses += total_responses; - self.error_count += error_count; - } - } -} - -#[cfg(test)] -mod tests { - use super::*; - use polkadot_erasure_coding::recovery_threshold; - - #[test] - fn parallel_request_calculation_works_as_expected() { - let num_validators = 100; - let threshold = recovery_threshold(num_validators).unwrap(); - let (erasure_task_tx, _erasure_task_rx) = futures::channel::mpsc::channel(16); - - let mut fetch_chunks_task = - FetchChunks::new(FetchChunksParams { n_validators: 100, erasure_task_tx }); - assert_eq!(fetch_chunks_task.get_desired_request_count(0, threshold), threshold); - fetch_chunks_task.error_count = 1; - fetch_chunks_task.total_received_responses = 1; - // We saturate at threshold (34): - assert_eq!(fetch_chunks_task.get_desired_request_count(0, threshold), threshold); - - fetch_chunks_task.total_received_responses = 2; - // With given error rate - still saturating: - assert_eq!(fetch_chunks_task.get_desired_request_count(1, threshold), threshold); - fetch_chunks_task.total_received_responses += 8; - // error rate: 1/10 - // remaining chunks needed: threshold (34) - 9 - // expected: 24 * (1+ 1/10) = (next greater integer) = 27 - assert_eq!(fetch_chunks_task.get_desired_request_count(9, threshold), 27); - fetch_chunks_task.error_count = 0; - // With error count zero - we should fetch exactly as needed: - assert_eq!(fetch_chunks_task.get_desired_request_count(10, threshold), threshold - 10); - } -} diff --git a/polkadot/node/network/availability-recovery/src/task/mod.rs 
b/polkadot/node/network/availability-recovery/src/task/mod.rs new file mode 100644 index 0000000000000..800a82947d6f3 --- /dev/null +++ b/polkadot/node/network/availability-recovery/src/task/mod.rs @@ -0,0 +1,197 @@ +// Copyright (C) Parity Technologies (UK) Ltd. +// This file is part of Polkadot. + +// Polkadot is free software: you can redistribute it and/or modify +// it under the terms of the GNU General Public License as published by +// the Free Software Foundation, either version 3 of the License, or +// (at your option) any later version. + +// Polkadot is distributed in the hope that it will be useful, +// but WITHOUT ANY WARRANTY; without even the implied warranty of +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +// GNU General Public License for more details. + +// You should have received a copy of the GNU General Public License +// along with Polkadot. If not, see . + +//! Main recovery task logic. Runs recovery strategies. + +#![warn(missing_docs)] + +mod strategy; + +pub use self::strategy::{ + FetchChunks, FetchChunksParams, FetchFull, FetchFullParams, FetchSystematicChunks, + FetchSystematicChunksParams, RecoveryStrategy, State, +}; + +#[cfg(test)] +pub use self::strategy::{REGULAR_CHUNKS_REQ_RETRY_LIMIT, SYSTEMATIC_CHUNKS_REQ_RETRY_LIMIT}; + +use crate::{metrics::Metrics, ErasureTask, PostRecoveryCheck, LOG_TARGET}; + +use parity_scale_codec::Encode; +use polkadot_node_primitives::AvailableData; +use polkadot_node_subsystem::{messages::AvailabilityStoreMessage, overseer, RecoveryError}; +use polkadot_primitives::{AuthorityDiscoveryId, CandidateHash, Hash}; +use sc_network::ProtocolName; + +use futures::channel::{mpsc, oneshot}; +use std::collections::VecDeque; + +/// Recovery parameters common to all strategies in a `RecoveryTask`. +#[derive(Clone)] +pub struct RecoveryParams { + /// Discovery ids of `validators`. + pub validator_authority_keys: Vec, + + /// Number of validators. 
+ pub n_validators: usize, + + /// The number of regular chunks needed. + pub threshold: usize, + + /// The number of systematic chunks needed. + pub systematic_threshold: usize, + + /// A hash of the relevant candidate. + pub candidate_hash: CandidateHash, + + /// The root of the erasure encoding of the candidate. + pub erasure_root: Hash, + + /// Metrics to report. + pub metrics: Metrics, + + /// Do not request data from availability-store. Useful for collators. + pub bypass_availability_store: bool, + + /// The type of check to perform after available data was recovered. + pub post_recovery_check: PostRecoveryCheck, + + /// The blake2-256 hash of the PoV. + pub pov_hash: Hash, + + /// Protocol name for ChunkFetchingV1. + pub req_v1_protocol_name: ProtocolName, + + /// Protocol name for ChunkFetchingV2. + pub req_v2_protocol_name: ProtocolName, + + /// Whether or not chunk mapping is enabled. + pub chunk_mapping_enabled: bool, + + /// Channel to the erasure task handler. + pub erasure_task_tx: mpsc::Sender, +} + +/// A stateful reconstruction of availability data in reference to +/// a candidate hash. +pub struct RecoveryTask { + sender: Sender, + params: RecoveryParams, + strategies: VecDeque>>, + state: State, +} + +impl RecoveryTask +where + Sender: overseer::AvailabilityRecoverySenderTrait, +{ + /// Instantiate a new recovery task. 
+ pub fn new( + sender: Sender, + params: RecoveryParams, + strategies: VecDeque>>, + ) -> Self { + Self { sender, params, strategies, state: State::new() } + } + + async fn in_availability_store(&mut self) -> Option { + if !self.params.bypass_availability_store { + let (tx, rx) = oneshot::channel(); + self.sender + .send_message(AvailabilityStoreMessage::QueryAvailableData( + self.params.candidate_hash, + tx, + )) + .await; + + match rx.await { + Ok(Some(data)) => return Some(data), + Ok(None) => {}, + Err(oneshot::Canceled) => { + gum::warn!( + target: LOG_TARGET, + candidate_hash = ?self.params.candidate_hash, + "Failed to reach the availability store", + ) + }, + } + } + + None + } + + /// Run this recovery task to completion. It will loop through the configured strategies + /// in-order and return whenever the first one recovers the full `AvailableData`. + pub async fn run(mut self) -> Result { + if let Some(data) = self.in_availability_store().await { + return Ok(data) + } + + self.params.metrics.on_recovery_started(); + + let _timer = self.params.metrics.time_full_recovery(); + + while let Some(current_strategy) = self.strategies.pop_front() { + let display_name = current_strategy.display_name(); + let strategy_type = current_strategy.strategy_type(); + + gum::debug!( + target: LOG_TARGET, + candidate_hash = ?self.params.candidate_hash, + "Starting `{}` strategy", + display_name + ); + + let res = current_strategy.run(&mut self.state, &mut self.sender, &self.params).await; + + match res { + Err(RecoveryError::Unavailable) => + if self.strategies.front().is_some() { + gum::debug!( + target: LOG_TARGET, + candidate_hash = ?self.params.candidate_hash, + "Recovery strategy `{}` did not conclude. 
Trying the next one.", + display_name + ); + continue + }, + Err(err) => { + match &err { + RecoveryError::Invalid => + self.params.metrics.on_recovery_invalid(strategy_type), + _ => self.params.metrics.on_recovery_failed(strategy_type), + } + return Err(err) + }, + Ok(data) => { + self.params.metrics.on_recovery_succeeded(strategy_type, data.encoded_size()); + return Ok(data) + }, + } + } + + // We have no other strategies to try. + gum::warn!( + target: LOG_TARGET, + candidate_hash = ?self.params.candidate_hash, + "Recovery of available data failed.", + ); + + self.params.metrics.on_recovery_failed("all"); + + Err(RecoveryError::Unavailable) + } +} diff --git a/polkadot/node/network/availability-recovery/src/task/strategy/chunks.rs b/polkadot/node/network/availability-recovery/src/task/strategy/chunks.rs new file mode 100644 index 0000000000000..b6376a5b543ed --- /dev/null +++ b/polkadot/node/network/availability-recovery/src/task/strategy/chunks.rs @@ -0,0 +1,335 @@ +// Copyright (C) Parity Technologies (UK) Ltd. +// This file is part of Polkadot. + +// Polkadot is free software: you can redistribute it and/or modify +// it under the terms of the GNU General Public License as published by +// the Free Software Foundation, either version 3 of the License, or +// (at your option) any later version. + +// Polkadot is distributed in the hope that it will be useful, +// but WITHOUT ANY WARRANTY; without even the implied warranty of +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +// GNU General Public License for more details. + +// You should have received a copy of the GNU General Public License +// along with Polkadot. If not, see . 
+ +use crate::{ + futures_undead::FuturesUndead, + task::{ + strategy::{ + do_post_recovery_check, is_unavailable, OngoingRequests, N_PARALLEL, + REGULAR_CHUNKS_REQ_RETRY_LIMIT, + }, + RecoveryParams, State, + }, + ErasureTask, RecoveryStrategy, LOG_TARGET, +}; + +use polkadot_node_primitives::AvailableData; +use polkadot_node_subsystem::{overseer, RecoveryError}; +use polkadot_primitives::ValidatorIndex; + +use futures::{channel::oneshot, SinkExt}; +use rand::seq::SliceRandom; +use std::collections::VecDeque; + +/// Parameters specific to the `FetchChunks` strategy. +pub struct FetchChunksParams { + pub n_validators: usize, +} + +/// `RecoveryStrategy` that requests chunks from validators, in parallel. +pub struct FetchChunks { + /// How many requests have been unsuccessful so far. + error_count: usize, + /// Total number of responses that have been received, including failed ones. + total_received_responses: usize, + /// A shuffled array of validator indices. + validators: VecDeque, + /// Collection of in-flight requests. + requesting_chunks: OngoingRequests, +} + +impl FetchChunks { + /// Instantiate a new strategy. + pub fn new(params: FetchChunksParams) -> Self { + // Shuffle the validators to make sure that we don't request chunks from the same + // validators over and over. + let mut validators: VecDeque = + (0..params.n_validators).map(|i| ValidatorIndex(i as u32)).collect(); + validators.make_contiguous().shuffle(&mut rand::thread_rng()); + + Self { + error_count: 0, + total_received_responses: 0, + validators, + requesting_chunks: FuturesUndead::new(), + } + } + + fn is_unavailable( + unrequested_validators: usize, + in_flight_requests: usize, + chunk_count: usize, + threshold: usize, + ) -> bool { + is_unavailable(chunk_count, in_flight_requests, unrequested_validators, threshold) + } + + /// Desired number of parallel requests. 
+ /// + /// For the given threshold (total required number of chunks) get the desired number of + /// requests we want to have running in parallel at this time. + fn get_desired_request_count(&self, chunk_count: usize, threshold: usize) -> usize { + // Upper bound for parallel requests. + // We want to limit this, so requests can be processed within the timeout and we limit the + // following feedback loop: + // 1. Requests fail due to timeout + // 2. We request more chunks to make up for it + // 3. Bandwidth is spread out even more, so we get even more timeouts + // 4. We request more chunks to make up for it ... + let max_requests_boundary = std::cmp::min(N_PARALLEL, threshold); + // How many chunks are still needed? + let remaining_chunks = threshold.saturating_sub(chunk_count); + // What is the current error rate, so we can make up for it? + let inv_error_rate = + self.total_received_responses.checked_div(self.error_count).unwrap_or(0); + // Actual number of requests we want to have in flight in parallel: + std::cmp::min( + max_requests_boundary, + remaining_chunks + remaining_chunks.checked_div(inv_error_rate).unwrap_or(0), + ) + } + + async fn attempt_recovery( + &mut self, + state: &mut State, + common_params: &RecoveryParams, + ) -> Result { + let recovery_duration = common_params + .metrics + .time_erasure_recovery(RecoveryStrategy::::strategy_type(self)); + + // Send request to reconstruct available data from chunks. + let (avilable_data_tx, available_data_rx) = oneshot::channel(); + + let mut erasure_task_tx = common_params.erasure_task_tx.clone(); + erasure_task_tx + .send(ErasureTask::Reconstruct( + common_params.n_validators, + // Safe to leave an empty vec in place, as we're stopping the recovery process if + // this reconstruct fails. 
+ std::mem::take(&mut state.received_chunks) + .into_iter() + .map(|(c_index, chunk)| (c_index, chunk.chunk)) + .collect(), + avilable_data_tx, + )) + .await + .map_err(|_| RecoveryError::ChannelClosed)?; + + let available_data_response = + available_data_rx.await.map_err(|_| RecoveryError::ChannelClosed)?; + + match available_data_response { + // Attempt post-recovery check. + Ok(data) => do_post_recovery_check(common_params, data) + .await + .map_err(|e| { + recovery_duration.map(|rd| rd.stop_and_discard()); + e + }) + .map(|data| { + gum::trace!( + target: LOG_TARGET, + candidate_hash = ?common_params.candidate_hash, + erasure_root = ?common_params.erasure_root, + "Data recovery from chunks complete", + ); + data + }), + Err(err) => { + recovery_duration.map(|rd| rd.stop_and_discard()); + gum::debug!( + target: LOG_TARGET, + candidate_hash = ?common_params.candidate_hash, + erasure_root = ?common_params.erasure_root, + ?err, + "Data recovery error", + ); + + Err(RecoveryError::Invalid) + }, + } + } +} + +#[async_trait::async_trait] +impl RecoveryStrategy for FetchChunks { + fn display_name(&self) -> &'static str { + "Fetch chunks" + } + + fn strategy_type(&self) -> &'static str { + "regular_chunks" + } + + async fn run( + mut self: Box, + state: &mut State, + sender: &mut Sender, + common_params: &RecoveryParams, + ) -> Result { + // First query the store for any chunks we've got. + if !common_params.bypass_availability_store { + let local_chunk_indices = state.populate_from_av_store(common_params, sender).await; + self.validators.retain(|validator_index| { + !local_chunk_indices.iter().any(|(v_index, _)| v_index == validator_index) + }); + } + + // No need to query the validators that have the chunks we already received or that we know + // don't have the data from previous strategies. 
+ self.validators.retain(|v_index| { + !state.received_chunks.values().any(|c| v_index == &c.validator_index) && + state.can_retry_request( + &(common_params.validator_authority_keys[v_index.0 as usize].clone(), *v_index), + REGULAR_CHUNKS_REQ_RETRY_LIMIT, + ) + }); + + // Safe to `take` here, as we're consuming `self` anyway and we're not using the + // `validators` field in other methods. + let mut validators_queue: VecDeque<_> = std::mem::take(&mut self.validators) + .into_iter() + .map(|validator_index| { + ( + common_params.validator_authority_keys[validator_index.0 as usize].clone(), + validator_index, + ) + }) + .collect(); + + loop { + // If received_chunks has more than threshold entries, attempt to recover the data. + // If that fails, or a re-encoding of it doesn't match the expected erasure root, + // return Err(RecoveryError::Invalid). + // Do this before requesting any chunks because we may have enough of them coming from + // past RecoveryStrategies. + if state.chunk_count() >= common_params.threshold { + return self.attempt_recovery::(state, common_params).await + } + + if Self::is_unavailable( + validators_queue.len(), + self.requesting_chunks.total_len(), + state.chunk_count(), + common_params.threshold, + ) { + gum::debug!( + target: LOG_TARGET, + candidate_hash = ?common_params.candidate_hash, + erasure_root = ?common_params.erasure_root, + received = %state.chunk_count(), + requesting = %self.requesting_chunks.len(), + total_requesting = %self.requesting_chunks.total_len(), + n_validators = %common_params.n_validators, + "Data recovery from chunks is not possible", + ); + + return Err(RecoveryError::Unavailable) + } + + let desired_requests_count = + self.get_desired_request_count(state.chunk_count(), common_params.threshold); + let already_requesting_count = self.requesting_chunks.len(); + gum::debug!( + target: LOG_TARGET, + ?common_params.candidate_hash, + ?desired_requests_count, + error_count= ?self.error_count, + total_received = 
?self.total_received_responses, + threshold = ?common_params.threshold, + ?already_requesting_count, + "Requesting availability chunks for a candidate", + ); + + let strategy_type = RecoveryStrategy::::strategy_type(&*self); + + state + .launch_parallel_chunk_requests( + strategy_type, + common_params, + sender, + desired_requests_count, + &mut validators_queue, + &mut self.requesting_chunks, + ) + .await; + + let (total_responses, error_count) = state + .wait_for_chunks( + strategy_type, + common_params, + REGULAR_CHUNKS_REQ_RETRY_LIMIT, + &mut validators_queue, + &mut self.requesting_chunks, + &mut vec![], + |unrequested_validators, + in_flight_reqs, + chunk_count, + _systematic_chunk_count| { + chunk_count >= common_params.threshold || + Self::is_unavailable( + unrequested_validators, + in_flight_reqs, + chunk_count, + common_params.threshold, + ) + }, + ) + .await; + + self.total_received_responses += total_responses; + self.error_count += error_count; + } + } +} + +#[cfg(test)] +mod tests { + use super::*; + use polkadot_erasure_coding::recovery_threshold; + + #[test] + fn test_get_desired_request_count() { + let n_validators = 100; + let threshold = recovery_threshold(n_validators).unwrap(); + + let mut fetch_chunks_task = FetchChunks::new(FetchChunksParams { n_validators }); + assert_eq!(fetch_chunks_task.get_desired_request_count(0, threshold), threshold); + fetch_chunks_task.error_count = 1; + fetch_chunks_task.total_received_responses = 1; + // We saturate at threshold (34): + assert_eq!(fetch_chunks_task.get_desired_request_count(0, threshold), threshold); + + // We saturate at the parallel limit. 
+ assert_eq!(fetch_chunks_task.get_desired_request_count(0, N_PARALLEL + 2), N_PARALLEL); + + fetch_chunks_task.total_received_responses = 2; + // With given error rate - still saturating: + assert_eq!(fetch_chunks_task.get_desired_request_count(1, threshold), threshold); + fetch_chunks_task.total_received_responses = 10; + // error rate: 1/10 + // remaining chunks needed: threshold (34) - 9 + // expected: 24 * (1+ 1/10) = (next greater integer) = 27 + assert_eq!(fetch_chunks_task.get_desired_request_count(9, threshold), 27); + // We saturate at the parallel limit. + assert_eq!(fetch_chunks_task.get_desired_request_count(9, N_PARALLEL + 9), N_PARALLEL); + + fetch_chunks_task.error_count = 0; + // With error count zero - we should fetch exactly as needed: + assert_eq!(fetch_chunks_task.get_desired_request_count(10, threshold), threshold - 10); + } +} diff --git a/polkadot/node/network/availability-recovery/src/task/strategy/full.rs b/polkadot/node/network/availability-recovery/src/task/strategy/full.rs new file mode 100644 index 0000000000000..1d7fbe8ea3c8d --- /dev/null +++ b/polkadot/node/network/availability-recovery/src/task/strategy/full.rs @@ -0,0 +1,174 @@ +// Copyright (C) Parity Technologies (UK) Ltd. +// This file is part of Polkadot. + +// Polkadot is free software: you can redistribute it and/or modify +// it under the terms of the GNU General Public License as published by +// the Free Software Foundation, either version 3 of the License, or +// (at your option) any later version. + +// Polkadot is distributed in the hope that it will be useful, +// but WITHOUT ANY WARRANTY; without even the implied warranty of +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +// GNU General Public License for more details. + +// You should have received a copy of the GNU General Public License +// along with Polkadot. If not, see . 
+ +use crate::{ + task::{RecoveryParams, RecoveryStrategy, State}, + ErasureTask, PostRecoveryCheck, LOG_TARGET, +}; + +use polkadot_node_network_protocol::request_response::{ + self as req_res, outgoing::RequestError, OutgoingRequest, Recipient, Requests, +}; +use polkadot_node_primitives::AvailableData; +use polkadot_node_subsystem::{messages::NetworkBridgeTxMessage, overseer, RecoveryError}; +use polkadot_primitives::ValidatorIndex; +use sc_network::{IfDisconnected, OutboundFailure, RequestFailure}; + +use futures::{channel::oneshot, SinkExt}; +use rand::seq::SliceRandom; + +/// Parameters specific to the `FetchFull` strategy. +pub struct FetchFullParams { + /// Validators that will be used for fetching the data. + pub validators: Vec, +} + +/// `RecoveryStrategy` that sequentially tries to fetch the full `AvailableData` from +/// already-connected validators in the configured validator set. +pub struct FetchFull { + params: FetchFullParams, +} + +impl FetchFull { + /// Create a new `FetchFull` recovery strategy. + pub fn new(mut params: FetchFullParams) -> Self { + params.validators.shuffle(&mut rand::thread_rng()); + Self { params } + } +} + +#[async_trait::async_trait] +impl RecoveryStrategy for FetchFull { + fn display_name(&self) -> &'static str { + "Full recovery from backers" + } + + fn strategy_type(&self) -> &'static str { + "full_from_backers" + } + + async fn run( + mut self: Box, + _: &mut State, + sender: &mut Sender, + common_params: &RecoveryParams, + ) -> Result { + let strategy_type = RecoveryStrategy::::strategy_type(&*self); + + loop { + // Pop the next validator. + let validator_index = + self.params.validators.pop().ok_or_else(|| RecoveryError::Unavailable)?; + + // Request data. 
+ let (req, response) = OutgoingRequest::new( + Recipient::Authority( + common_params.validator_authority_keys[validator_index.0 as usize].clone(), + ), + req_res::v1::AvailableDataFetchingRequest { + candidate_hash: common_params.candidate_hash, + }, + ); + + sender + .send_message(NetworkBridgeTxMessage::SendRequests( + vec![Requests::AvailableDataFetchingV1(req)], + IfDisconnected::ImmediateError, + )) + .await; + + common_params.metrics.on_full_request_issued(); + + match response.await { + Ok(req_res::v1::AvailableDataFetchingResponse::AvailableData(data)) => { + let recovery_duration = + common_params.metrics.time_erasure_recovery(strategy_type); + let maybe_data = match common_params.post_recovery_check { + PostRecoveryCheck::Reencode => { + let (reencode_tx, reencode_rx) = oneshot::channel(); + let mut erasure_task_tx = common_params.erasure_task_tx.clone(); + + erasure_task_tx + .send(ErasureTask::Reencode( + common_params.n_validators, + common_params.erasure_root, + data, + reencode_tx, + )) + .await + .map_err(|_| RecoveryError::ChannelClosed)?; + + reencode_rx.await.map_err(|_| RecoveryError::ChannelClosed)? + }, + PostRecoveryCheck::PovHash => + (data.pov.hash() == common_params.pov_hash).then_some(data), + }; + + match maybe_data { + Some(data) => { + gum::trace!( + target: LOG_TARGET, + candidate_hash = ?common_params.candidate_hash, + "Received full data", + ); + + common_params.metrics.on_full_request_succeeded(); + return Ok(data) + }, + None => { + common_params.metrics.on_full_request_invalid(); + recovery_duration.map(|rd| rd.stop_and_discard()); + + gum::debug!( + target: LOG_TARGET, + candidate_hash = ?common_params.candidate_hash, + ?validator_index, + "Invalid data response", + ); + + // it doesn't help to report the peer with req/res. + // we'll try the next backer. 
+ }, + } + }, + Ok(req_res::v1::AvailableDataFetchingResponse::NoSuchData) => { + common_params.metrics.on_full_request_no_such_data(); + }, + Err(e) => { + match &e { + RequestError::Canceled(_) => common_params.metrics.on_full_request_error(), + RequestError::InvalidResponse(_) => + common_params.metrics.on_full_request_invalid(), + RequestError::NetworkError(req_failure) => { + if let RequestFailure::Network(OutboundFailure::Timeout) = req_failure { + common_params.metrics.on_full_request_timeout(); + } else { + common_params.metrics.on_full_request_error(); + } + }, + }; + gum::debug!( + target: LOG_TARGET, + candidate_hash = ?common_params.candidate_hash, + ?validator_index, + err = ?e, + "Error fetching full available data." + ); + }, + } + } + } +} diff --git a/polkadot/node/network/availability-recovery/src/task/strategy/mod.rs b/polkadot/node/network/availability-recovery/src/task/strategy/mod.rs new file mode 100644 index 0000000000000..fb31ff6aa7792 --- /dev/null +++ b/polkadot/node/network/availability-recovery/src/task/strategy/mod.rs @@ -0,0 +1,1558 @@ +// Copyright (C) Parity Technologies (UK) Ltd. +// This file is part of Polkadot. + +// Polkadot is free software: you can redistribute it and/or modify +// it under the terms of the GNU General Public License as published by +// the Free Software Foundation, either version 3 of the License, or +// (at your option) any later version. + +// Polkadot is distributed in the hope that it will be useful, +// but WITHOUT ANY WARRANTY; without even the implied warranty of +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +// GNU General Public License for more details. + +// You should have received a copy of the GNU General Public License +// along with Polkadot. If not, see . + +//! Recovery strategies. 
+ +mod chunks; +mod full; +mod systematic; + +pub use self::{ + chunks::{FetchChunks, FetchChunksParams}, + full::{FetchFull, FetchFullParams}, + systematic::{FetchSystematicChunks, FetchSystematicChunksParams}, +}; +use crate::{ + futures_undead::FuturesUndead, ErasureTask, PostRecoveryCheck, RecoveryParams, LOG_TARGET, +}; + +use futures::{channel::oneshot, SinkExt}; +use parity_scale_codec::Decode; +use polkadot_erasure_coding::branch_hash; +#[cfg(not(test))] +use polkadot_node_network_protocol::request_response::CHUNK_REQUEST_TIMEOUT; +use polkadot_node_network_protocol::request_response::{ + self as req_res, outgoing::RequestError, OutgoingRequest, Recipient, Requests, +}; +use polkadot_node_primitives::{AvailableData, ErasureChunk}; +use polkadot_node_subsystem::{ + messages::{AvailabilityStoreMessage, NetworkBridgeTxMessage}, + overseer, RecoveryError, +}; +use polkadot_primitives::{AuthorityDiscoveryId, BlakeTwo256, ChunkIndex, HashT, ValidatorIndex}; +use sc_network::{IfDisconnected, OutboundFailure, ProtocolName, RequestFailure}; +use std::{ + collections::{BTreeMap, HashMap, VecDeque}, + time::Duration, +}; + +// How many parallel chunk fetching requests should be running at once. +const N_PARALLEL: usize = 50; + +/// Time after which we consider a request to have failed +/// +/// and we should try more peers. Note in theory the request times out at the network level, +/// measurements have shown, that in practice requests might actually take longer to fail in +/// certain occasions. (The very least, authority discovery is not part of the timeout.) +/// +/// For the time being this value is the same as the timeout on the networking layer, but as this +/// timeout is more soft than the networking one, it might make sense to pick different values as +/// well. 
+#[cfg(not(test))] +const TIMEOUT_START_NEW_REQUESTS: Duration = CHUNK_REQUEST_TIMEOUT; +#[cfg(test)] +const TIMEOUT_START_NEW_REQUESTS: Duration = Duration::from_millis(100); + +/// The maximum number of times systematic chunk recovery will try making a request for a given +/// (validator,chunk) pair, if the error was not fatal. Added so that we don't get stuck in an +/// infinite retry loop. +pub const SYSTEMATIC_CHUNKS_REQ_RETRY_LIMIT: u32 = 2; +/// The maximum number of times regular chunk recovery will try making a request for a given +/// (validator,chunk) pair, if the error was not fatal. Added so that we don't get stuck in an +/// infinite retry loop. +pub const REGULAR_CHUNKS_REQ_RETRY_LIMIT: u32 = 5; + +// Helpful type alias for tracking ongoing chunk requests. +type OngoingRequests = FuturesUndead<( + AuthorityDiscoveryId, + ValidatorIndex, + Result<(Option, ProtocolName), RequestError>, +)>; + +const fn is_unavailable( + received_chunks: usize, + requesting_chunks: usize, + unrequested_validators: usize, + threshold: usize, +) -> bool { + received_chunks + requesting_chunks + unrequested_validators < threshold +} + +/// Check validity of a chunk. +fn is_chunk_valid(params: &RecoveryParams, chunk: &ErasureChunk) -> bool { + let anticipated_hash = + match branch_hash(&params.erasure_root, chunk.proof(), chunk.index.0 as usize) { + Ok(hash) => hash, + Err(e) => { + gum::debug!( + target: LOG_TARGET, + candidate_hash = ?params.candidate_hash, + chunk_index = ?chunk.index, + error = ?e, + "Invalid Merkle proof", + ); + return false + }, + }; + let erasure_chunk_hash = BlakeTwo256::hash(&chunk.chunk); + if anticipated_hash != erasure_chunk_hash { + gum::debug!( + target: LOG_TARGET, + candidate_hash = ?params.candidate_hash, + chunk_index = ?chunk.index, + "Merkle proof mismatch" + ); + return false + } + true +} + +/// Perform the validity checks after recovery.
+async fn do_post_recovery_check( + params: &RecoveryParams, + data: AvailableData, +) -> Result { + let mut erasure_task_tx = params.erasure_task_tx.clone(); + match params.post_recovery_check { + PostRecoveryCheck::Reencode => { + // Send request to re-encode the chunks and check merkle root. + let (reencode_tx, reencode_rx) = oneshot::channel(); + erasure_task_tx + .send(ErasureTask::Reencode( + params.n_validators, + params.erasure_root, + data, + reencode_tx, + )) + .await + .map_err(|_| RecoveryError::ChannelClosed)?; + + reencode_rx.await.map_err(|_| RecoveryError::ChannelClosed)?.ok_or_else(|| { + gum::trace!( + target: LOG_TARGET, + candidate_hash = ?params.candidate_hash, + erasure_root = ?params.erasure_root, + "Data recovery error - root mismatch", + ); + RecoveryError::Invalid + }) + }, + PostRecoveryCheck::PovHash => { + let pov = data.pov.clone(); + (pov.hash() == params.pov_hash).then_some(data).ok_or_else(|| { + gum::trace!( + target: LOG_TARGET, + candidate_hash = ?params.candidate_hash, + expected_pov_hash = ?params.pov_hash, + actual_pov_hash = ?pov.hash(), + "Data recovery error - PoV hash mismatch", + ); + RecoveryError::Invalid + }) + }, + } +} + +#[async_trait::async_trait] +/// Common trait for runnable recovery strategies. +pub trait RecoveryStrategy: Send { + /// Main entry point of the strategy. + async fn run( + mut self: Box, + state: &mut State, + sender: &mut Sender, + common_params: &RecoveryParams, + ) -> Result; + + /// Return the name of the strategy for logging purposes. + fn display_name(&self) -> &'static str; + + /// Return the strategy type for use as a metric label. + fn strategy_type(&self) -> &'static str; +} + +/// Utility type used for recording the result of requesting a chunk from a validator. +enum ErrorRecord { + NonFatal(u32), + Fatal, +} + +/// Helper struct used for the `received_chunks` mapping. 
+/// Compared to `ErasureChunk`, it doesn't need to hold the `ChunkIndex` (because it's the key used +/// for the map) and proof, but needs to hold the `ValidatorIndex` instead. +struct Chunk { + /// The erasure-encoded chunk of data belonging to the candidate block. + chunk: Vec, + /// The validator index that corresponds to this chunk. Not always the same as the chunk index. + validator_index: ValidatorIndex, +} + +/// Intermediate/common data that must be passed between `RecoveryStrategy`s belonging to the +/// same `RecoveryTask`. +pub struct State { + /// Chunks received so far. + /// This MUST be a `BTreeMap` in order for systematic recovery to work (the algorithm assumes + /// that chunks are ordered by their index). If we ever switch this to some non-ordered + /// collection, we need to add a sort step to the systematic recovery. + received_chunks: BTreeMap, + + /// A record of errors returned when requesting a chunk from a validator. + recorded_errors: HashMap<(AuthorityDiscoveryId, ValidatorIndex), ErrorRecord>, +} + +impl State { + pub fn new() -> Self { + Self { received_chunks: BTreeMap::new(), recorded_errors: HashMap::new() } + } + + fn insert_chunk(&mut self, chunk_index: ChunkIndex, chunk: Chunk) { + self.received_chunks.insert(chunk_index, chunk); + } + + fn chunk_count(&self) -> usize { + self.received_chunks.len() + } + + fn systematic_chunk_count(&self, systematic_threshold: usize) -> usize { + self.received_chunks + .range(ChunkIndex(0)..ChunkIndex(systematic_threshold as u32)) + .count() + } + + fn record_error_fatal( + &mut self, + authority_id: AuthorityDiscoveryId, + validator_index: ValidatorIndex, + ) { + self.recorded_errors.insert((authority_id, validator_index), ErrorRecord::Fatal); + } + + fn record_error_non_fatal( + &mut self, + authority_id: AuthorityDiscoveryId, + validator_index: ValidatorIndex, + ) { + self.recorded_errors + .entry((authority_id, validator_index)) + .and_modify(|record| { + if let ErrorRecord::NonFatal(ref mut 
count) = record { + *count = count.saturating_add(1); + } + }) + .or_insert(ErrorRecord::NonFatal(1)); + } + + fn can_retry_request( + &self, + key: &(AuthorityDiscoveryId, ValidatorIndex), + retry_threshold: u32, + ) -> bool { + match self.recorded_errors.get(key) { + None => true, + Some(entry) => match entry { + ErrorRecord::Fatal => false, + ErrorRecord::NonFatal(count) if *count < retry_threshold => true, + ErrorRecord::NonFatal(_) => false, + }, + } + } + + /// Retrieve the local chunks held in the av-store (should be either 0 or 1). + async fn populate_from_av_store( + &mut self, + params: &RecoveryParams, + sender: &mut Sender, + ) -> Vec<(ValidatorIndex, ChunkIndex)> { + let (tx, rx) = oneshot::channel(); + sender + .send_message(AvailabilityStoreMessage::QueryAllChunks(params.candidate_hash, tx)) + .await; + + match rx.await { + Ok(chunks) => { + // This should either be length 1 or 0. If we had the whole data, + // we wouldn't have reached this stage. + let chunk_indices: Vec<_> = chunks + .iter() + .map(|(validator_index, chunk)| (*validator_index, chunk.index)) + .collect(); + + for (validator_index, chunk) in chunks { + if is_chunk_valid(params, &chunk) { + gum::trace!( + target: LOG_TARGET, + candidate_hash = ?params.candidate_hash, + chunk_index = ?chunk.index, + "Found valid chunk on disk" + ); + self.insert_chunk( + chunk.index, + Chunk { chunk: chunk.chunk, validator_index }, + ); + } else { + gum::error!( + target: LOG_TARGET, + "Loaded invalid chunk from disk! Disk/Db corruption _very_ likely - please fix ASAP!" + ); + }; + } + + chunk_indices + }, + Err(oneshot::Canceled) => { + gum::warn!( + target: LOG_TARGET, + candidate_hash = ?params.candidate_hash, + "Failed to reach the availability store" + ); + + vec![] + }, + } + } + + /// Launch chunk requests in parallel, according to the parameters. 
+ async fn launch_parallel_chunk_requests( + &mut self, + strategy_type: &str, + params: &RecoveryParams, + sender: &mut Sender, + desired_requests_count: usize, + validators: &mut VecDeque<(AuthorityDiscoveryId, ValidatorIndex)>, + requesting_chunks: &mut OngoingRequests, + ) where + Sender: overseer::AvailabilityRecoverySenderTrait, + { + let candidate_hash = params.candidate_hash; + let already_requesting_count = requesting_chunks.len(); + + let to_launch = desired_requests_count - already_requesting_count; + let mut requests = Vec::with_capacity(to_launch); + + gum::trace!( + target: LOG_TARGET, + ?candidate_hash, + "Attempting to launch {} requests", + to_launch + ); + + while requesting_chunks.len() < desired_requests_count { + if let Some((authority_id, validator_index)) = validators.pop_back() { + gum::trace!( + target: LOG_TARGET, + ?authority_id, + ?validator_index, + ?candidate_hash, + "Requesting chunk", + ); + + // Request data. + let raw_request_v2 = + req_res::v2::ChunkFetchingRequest { candidate_hash, index: validator_index }; + let raw_request_v1 = req_res::v1::ChunkFetchingRequest::from(raw_request_v2); + + let (req, res) = OutgoingRequest::new_with_fallback( + Recipient::Authority(authority_id.clone()), + raw_request_v2, + raw_request_v1, + ); + requests.push(Requests::ChunkFetching(req)); + + params.metrics.on_chunk_request_issued(strategy_type); + let timer = params.metrics.time_chunk_request(strategy_type); + let v1_protocol_name = params.req_v1_protocol_name.clone(); + let v2_protocol_name = params.req_v2_protocol_name.clone(); + + let chunk_mapping_enabled = params.chunk_mapping_enabled; + let authority_id_clone = authority_id.clone(); + + requesting_chunks.push(Box::pin(async move { + let _timer = timer; + let res = match res.await { + Ok((bytes, protocol)) => + if v2_protocol_name == protocol { + match req_res::v2::ChunkFetchingResponse::decode(&mut &bytes[..]) { + Ok(req_res::v2::ChunkFetchingResponse::Chunk(chunk)) => + 
Ok((Some(chunk.into()), protocol)), + Ok(req_res::v2::ChunkFetchingResponse::NoSuchChunk) => + Ok((None, protocol)), + Err(e) => Err(RequestError::InvalidResponse(e)), + } + } else if v1_protocol_name == protocol { + // V1 protocol version must not be used when chunk mapping node + // feature is enabled, because we can't know the real index of the + // returned chunk. + // This case should never be reached as long as the + // `AvailabilityChunkMapping` feature is only enabled after the + // v1 version is removed. Still, log this. + if chunk_mapping_enabled { + gum::info!( + target: LOG_TARGET, + ?candidate_hash, + authority_id = ?authority_id_clone, + "Another validator is responding on /req_chunk/1 protocol while the availability chunk \ + mapping feature is enabled in the runtime. All validators must switch to /req_chunk/2." + ); + } + + match req_res::v1::ChunkFetchingResponse::decode(&mut &bytes[..]) { + Ok(req_res::v1::ChunkFetchingResponse::Chunk(chunk)) => Ok(( + Some(chunk.recombine_into_chunk(&raw_request_v1)), + protocol, + )), + Ok(req_res::v1::ChunkFetchingResponse::NoSuchChunk) => + Ok((None, protocol)), + Err(e) => Err(RequestError::InvalidResponse(e)), + } + } else { + Err(RequestError::NetworkError(RequestFailure::UnknownProtocol)) + }, + + Err(e) => Err(e), + }; + + (authority_id, validator_index, res) + })); + } else { + break + } + } + + if requests.len() != 0 { + sender + .send_message(NetworkBridgeTxMessage::SendRequests( + requests, + IfDisconnected::TryConnect, + )) + .await; + } + } + + /// Wait for a sufficient amount of chunks to reconstruct according to the provided `params`. + async fn wait_for_chunks( + &mut self, + strategy_type: &str, + params: &RecoveryParams, + retry_threshold: u32, + validators: &mut VecDeque<(AuthorityDiscoveryId, ValidatorIndex)>, + requesting_chunks: &mut OngoingRequests, + // If supplied, these validators will be used as a backup for requesting chunks. They + // should hold all chunks. 
Each of them will only be used to query one chunk. + backup_validators: &mut Vec<AuthorityDiscoveryId>, + // Function that returns `true` when this strategy can conclude. Either if we got enough + // chunks or if it's impossible. + mut can_conclude: impl FnMut( + // Number of validators left in the queue + usize, + // Number of in flight requests + usize, + // Number of valid chunks received so far + usize, + // Number of valid systematic chunks received so far + usize, + ) -> bool, + ) -> (usize, usize) { + let metrics = &params.metrics; + + let mut total_received_responses = 0; + let mut error_count = 0; + + // Wait for all current requests to conclude or time-out, or until we reach enough chunks. + // We also declare requests undead, once `TIMEOUT_START_NEW_REQUESTS` is reached and will + // return in that case for `launch_parallel_chunk_requests` to fill up slots again. + while let Some(res) = requesting_chunks.next_with_timeout(TIMEOUT_START_NEW_REQUESTS).await + { + total_received_responses += 1; + + let (authority_id, validator_index, request_result) = res; + + let mut is_error = false; + + match request_result { + Ok((maybe_chunk, protocol)) => { + match protocol { + name if name == params.req_v1_protocol_name => + params.metrics.on_chunk_response_v1(), + name if name == params.req_v2_protocol_name => + params.metrics.on_chunk_response_v2(), + _ => {}, + } + + match maybe_chunk { + Some(chunk) => + if is_chunk_valid(params, &chunk) { + metrics.on_chunk_request_succeeded(strategy_type); + gum::trace!( + target: LOG_TARGET, + candidate_hash = ?params.candidate_hash, + ?authority_id, + ?validator_index, + "Received valid chunk", + ); + self.insert_chunk( + chunk.index, + Chunk { chunk: chunk.chunk, validator_index }, + ); + } else { + metrics.on_chunk_request_invalid(strategy_type); + error_count += 1; + // Record that we got an invalid chunk so that subsequent strategies + // don't try requesting this again.
+ self.record_error_fatal(authority_id.clone(), validator_index); + is_error = true; + }, + None => { + metrics.on_chunk_request_no_such_chunk(strategy_type); + gum::trace!( + target: LOG_TARGET, + candidate_hash = ?params.candidate_hash, + ?authority_id, + ?validator_index, + "Validator did not have the chunk", + ); + error_count += 1; + // Record that the validator did not have this chunk so that subsequent + // strategies don't try requesting this again. + self.record_error_fatal(authority_id.clone(), validator_index); + is_error = true; + }, + } + }, + Err(err) => { + error_count += 1; + + gum::trace!( + target: LOG_TARGET, + candidate_hash= ?params.candidate_hash, + ?err, + ?authority_id, + ?validator_index, + "Failure requesting chunk", + ); + + is_error = true; + + match err { + RequestError::InvalidResponse(_) => { + metrics.on_chunk_request_invalid(strategy_type); + + gum::debug!( + target: LOG_TARGET, + candidate_hash = ?params.candidate_hash, + ?err, + ?authority_id, + ?validator_index, + "Chunk fetching response was invalid", + ); + + // Record that we got an invalid chunk so that this or + // subsequent strategies don't try requesting this again. + self.record_error_fatal(authority_id.clone(), validator_index); + }, + RequestError::NetworkError(err) => { + // No debug logs on general network errors - that became very + // spammy occasionally. + if let RequestFailure::Network(OutboundFailure::Timeout) = err { + metrics.on_chunk_request_timeout(strategy_type); + } else { + metrics.on_chunk_request_error(strategy_type); + } + + // Record that we got a non-fatal error so that this or + // subsequent strategies will retry requesting this only a + // limited number of times. 
+ self.record_error_non_fatal(authority_id.clone(), validator_index); + }, + RequestError::Canceled(_) => { + metrics.on_chunk_request_error(strategy_type); + + // Record that we got a non-fatal error so that this or + // subsequent strategies will retry requesting this only a + // limited number of times. + self.record_error_non_fatal(authority_id.clone(), validator_index); + }, + } + }, + } + + if is_error { + // First, see if we can retry the request. + if self.can_retry_request(&(authority_id.clone(), validator_index), retry_threshold) + { + validators.push_front((authority_id, validator_index)); + } else { + // Otherwise, try requesting from a backer as a backup, if we've not already + // requested the same chunk from it. + + let position = backup_validators.iter().position(|v| { + !self.recorded_errors.contains_key(&(v.clone(), validator_index)) + }); + if let Some(position) = position { + // Use swap_remove because it's faster and we don't care about order here. + let backer = backup_validators.swap_remove(position); + validators.push_front((backer, validator_index)); + } + } + } + + if can_conclude( + validators.len(), + requesting_chunks.total_len(), + self.chunk_count(), + self.systematic_chunk_count(params.systematic_threshold), + ) { + gum::debug!( + target: LOG_TARGET, + validators_len = validators.len(), + candidate_hash = ?params.candidate_hash, + received_chunks_count = ?self.chunk_count(), + requested_chunks_count = ?requesting_chunks.len(), + threshold = ?params.threshold, + "Can conclude availability recovery strategy", + ); + break + } + } + + (total_received_responses, error_count) + } +} + +#[cfg(test)] +mod tests { + use super::*; + use crate::{tests::*, Metrics, RecoveryStrategy, RecoveryTask}; + use assert_matches::assert_matches; + use futures::{ + channel::mpsc::{self, UnboundedReceiver}, + executor, future, Future, FutureExt, StreamExt, + }; + use parity_scale_codec::Error as DecodingError; + use 
polkadot_erasure_coding::{recovery_threshold, systematic_recovery_threshold}; + use polkadot_node_network_protocol::request_response::Protocol; + use polkadot_node_primitives::{BlockData, PoV}; + use polkadot_node_subsystem::{AllMessages, TimeoutExt}; + use polkadot_node_subsystem_test_helpers::{ + derive_erasure_chunks_with_proofs_and_root, sender_receiver, TestSubsystemSender, + }; + use polkadot_primitives::{CandidateHash, HeadData, PersistedValidationData}; + use polkadot_primitives_test_helpers::dummy_hash; + use sp_keyring::Sr25519Keyring; + use std::sync::Arc; + + const TIMEOUT: Duration = Duration::from_secs(1); + + impl Default for RecoveryParams { + fn default() -> Self { + let validators = vec![ + Sr25519Keyring::Ferdie, + Sr25519Keyring::Alice.into(), + Sr25519Keyring::Bob.into(), + Sr25519Keyring::Charlie, + Sr25519Keyring::Dave, + Sr25519Keyring::One, + Sr25519Keyring::Two, + ]; + let (erasure_task_tx, _erasure_task_rx) = mpsc::channel(10); + + Self { + validator_authority_keys: validator_authority_id(&validators), + n_validators: validators.len(), + threshold: recovery_threshold(validators.len()).unwrap(), + systematic_threshold: systematic_recovery_threshold(validators.len()).unwrap(), + candidate_hash: CandidateHash(dummy_hash()), + erasure_root: dummy_hash(), + metrics: Metrics::new_dummy(), + bypass_availability_store: false, + post_recovery_check: PostRecoveryCheck::Reencode, + pov_hash: dummy_hash(), + req_v1_protocol_name: "/req_chunk/1".into(), + req_v2_protocol_name: "/req_chunk/2".into(), + chunk_mapping_enabled: true, + erasure_task_tx, + } + } + } + + impl RecoveryParams { + fn create_chunks(&mut self) -> Vec { + let available_data = dummy_available_data(); + let (chunks, erasure_root) = derive_erasure_chunks_with_proofs_and_root( + self.n_validators, + &available_data, + |_, _| {}, + ); + + self.erasure_root = erasure_root; + self.pov_hash = available_data.pov.hash(); + + chunks + } + } + + fn dummy_available_data() -> AvailableData { + 
let validation_data = PersistedValidationData { + parent_head: HeadData(vec![7, 8, 9]), + relay_parent_number: Default::default(), + max_pov_size: 1024, + relay_parent_storage_root: Default::default(), + }; + + AvailableData { + validation_data, + pov: Arc::new(PoV { block_data: BlockData(vec![42; 64]) }), + } + } + + fn test_harness, TestFut: Future>( + receiver_future: impl FnOnce(UnboundedReceiver) -> RecvFut, + test: impl FnOnce(TestSubsystemSender) -> TestFut, + ) { + let (sender, receiver) = sender_receiver(); + + let test_fut = test(sender); + let receiver_future = receiver_future(receiver); + + futures::pin_mut!(test_fut); + futures::pin_mut!(receiver_future); + + executor::block_on(future::join(test_fut, receiver_future)).1 + } + + #[test] + fn test_recorded_errors() { + let retry_threshold = 2; + let mut state = State::new(); + + let alice = Sr25519Keyring::Alice.public(); + let bob = Sr25519Keyring::Bob.public(); + let eve = Sr25519Keyring::Eve.public(); + + assert!(state.can_retry_request(&(alice.into(), 0.into()), retry_threshold)); + assert!(state.can_retry_request(&(alice.into(), 0.into()), 0)); + state.record_error_non_fatal(alice.into(), 0.into()); + assert!(state.can_retry_request(&(alice.into(), 0.into()), retry_threshold)); + state.record_error_non_fatal(alice.into(), 0.into()); + assert!(!state.can_retry_request(&(alice.into(), 0.into()), retry_threshold)); + state.record_error_non_fatal(alice.into(), 0.into()); + assert!(!state.can_retry_request(&(alice.into(), 0.into()), retry_threshold)); + + assert!(state.can_retry_request(&(alice.into(), 0.into()), 5)); + + state.record_error_fatal(bob.into(), 1.into()); + assert!(!state.can_retry_request(&(bob.into(), 1.into()), retry_threshold)); + state.record_error_non_fatal(bob.into(), 1.into()); + assert!(!state.can_retry_request(&(bob.into(), 1.into()), retry_threshold)); + + assert!(state.can_retry_request(&(eve.into(), 4.into()), 0)); + assert!(state.can_retry_request(&(eve.into(), 4.into()), 
retry_threshold)); + } + + #[test] + fn test_populate_from_av_store() { + let params = RecoveryParams::default(); + + // Failed to reach the av store + { + let params = params.clone(); + let candidate_hash = params.candidate_hash; + let mut state = State::new(); + + test_harness( + |mut receiver: UnboundedReceiver| async move { + assert_matches!( + receiver.next().timeout(TIMEOUT).await.unwrap().unwrap(), + AllMessages::AvailabilityStore(AvailabilityStoreMessage::QueryAllChunks(hash, tx)) => { + assert_eq!(hash, candidate_hash); + drop(tx); + }); + }, + |mut sender| async move { + let local_chunk_indices = + state.populate_from_av_store(&params, &mut sender).await; + + assert_eq!(state.chunk_count(), 0); + assert_eq!(local_chunk_indices.len(), 0); + }, + ); + } + + // Found invalid chunk + { + let mut params = params.clone(); + let candidate_hash = params.candidate_hash; + let mut state = State::new(); + let chunks = params.create_chunks(); + + test_harness( + |mut receiver: UnboundedReceiver| async move { + assert_matches!( + receiver.next().timeout(TIMEOUT).await.unwrap().unwrap(), + AllMessages::AvailabilityStore(AvailabilityStoreMessage::QueryAllChunks(hash, tx)) => { + assert_eq!(hash, candidate_hash); + let mut chunk = chunks[0].clone(); + chunk.index = 3.into(); + tx.send(vec![(2.into(), chunk)]).unwrap(); + }); + }, + |mut sender| async move { + let local_chunk_indices = + state.populate_from_av_store(&params, &mut sender).await; + + assert_eq!(state.chunk_count(), 0); + assert_eq!(local_chunk_indices.len(), 1); + }, + ); + } + + // Found valid chunk + { + let mut params = params.clone(); + let candidate_hash = params.candidate_hash; + let mut state = State::new(); + let chunks = params.create_chunks(); + + test_harness( + |mut receiver: UnboundedReceiver| async move { + assert_matches!( + receiver.next().timeout(TIMEOUT).await.unwrap().unwrap(), + AllMessages::AvailabilityStore(AvailabilityStoreMessage::QueryAllChunks(hash, tx)) => { + assert_eq!(hash,
candidate_hash); + tx.send(vec![(4.into(), chunks[1].clone())]).unwrap(); + }); + }, + |mut sender| async move { + let local_chunk_indices = + state.populate_from_av_store(&params, &mut sender).await; + + assert_eq!(state.chunk_count(), 1); + assert_eq!(local_chunk_indices.len(), 1); + }, + ); + } + } + + #[test] + fn test_launch_parallel_chunk_requests() { + let params = RecoveryParams::default(); + let alice: AuthorityDiscoveryId = Sr25519Keyring::Alice.public().into(); + let bob: AuthorityDiscoveryId = Sr25519Keyring::Bob.public().into(); + let eve: AuthorityDiscoveryId = Sr25519Keyring::Eve.public().into(); + + // No validators to request from. + { + let params = params.clone(); + let mut state = State::new(); + let mut ongoing_reqs = OngoingRequests::new(); + let mut validators = VecDeque::new(); + + test_harness( + |mut receiver: UnboundedReceiver| async move { + // Shouldn't send any requests. + assert!(receiver.next().timeout(TIMEOUT).await.unwrap().is_none()); + }, + |mut sender| async move { + state + .launch_parallel_chunk_requests( + "regular", + &params, + &mut sender, + 3, + &mut validators, + &mut ongoing_reqs, + ) + .await; + + assert_eq!(ongoing_reqs.total_len(), 0); + }, + ); + } + + // Has validators but no need to request more. + { + let params = params.clone(); + let mut state = State::new(); + let mut ongoing_reqs = OngoingRequests::new(); + let mut validators = VecDeque::new(); + validators.push_back((alice.clone(), ValidatorIndex(1))); + + test_harness( + |mut receiver: UnboundedReceiver| async move { + // Shouldn't send any requests. + assert!(receiver.next().timeout(TIMEOUT).await.unwrap().is_none()); + }, + |mut sender| async move { + state + .launch_parallel_chunk_requests( + "regular", + &params, + &mut sender, + 0, + &mut validators, + &mut ongoing_reqs, + ) + .await; + + assert_eq!(ongoing_reqs.total_len(), 0); + }, + ); + } + + // Has validators but no need to request more.
+ { + let params = params.clone(); + let mut state = State::new(); + let mut ongoing_reqs = OngoingRequests::new(); + ongoing_reqs.push(async { todo!() }.boxed()); + ongoing_reqs.soft_cancel(); + let mut validators = VecDeque::new(); + validators.push_back((alice.clone(), ValidatorIndex(1))); + + test_harness( + |mut receiver: UnboundedReceiver| async move { + // Shouldn't send any requests. + assert!(receiver.next().timeout(TIMEOUT).await.unwrap().is_none()); + }, + |mut sender| async move { + state + .launch_parallel_chunk_requests( + "regular", + &params, + &mut sender, + 0, + &mut validators, + &mut ongoing_reqs, + ) + .await; + + assert_eq!(ongoing_reqs.total_len(), 1); + assert_eq!(ongoing_reqs.len(), 0); + }, + ); + } + + // Needs to request more. + { + let params = params.clone(); + let mut state = State::new(); + let mut ongoing_reqs = OngoingRequests::new(); + ongoing_reqs.push(async { todo!() }.boxed()); + ongoing_reqs.soft_cancel(); + ongoing_reqs.push(async { todo!() }.boxed()); + let mut validators = VecDeque::new(); + validators.push_back((alice.clone(), 0.into())); + validators.push_back((bob, 1.into())); + validators.push_back((eve, 2.into())); + + test_harness( + |mut receiver: UnboundedReceiver| async move { + assert_matches!( + receiver.next().timeout(TIMEOUT).await.unwrap().unwrap(), + AllMessages::NetworkBridgeTx(NetworkBridgeTxMessage::SendRequests(requests, _)) if requests.len() +== 3 ); + }, + |mut sender| async move { + state + .launch_parallel_chunk_requests( + "regular", + &params, + &mut sender, + 10, + &mut validators, + &mut ongoing_reqs, + ) + .await; + + assert_eq!(ongoing_reqs.total_len(), 5); + assert_eq!(ongoing_reqs.len(), 4); + }, + ); + } + + // Check network protocol versioning.
+ { + let params = params.clone(); + let mut state = State::new(); + let mut ongoing_reqs = OngoingRequests::new(); + let mut validators = VecDeque::new(); + validators.push_back((alice, 0.into())); + + test_harness( + |mut receiver: UnboundedReceiver| async move { + match receiver.next().timeout(TIMEOUT).await.unwrap().unwrap() { + AllMessages::NetworkBridgeTx(NetworkBridgeTxMessage::SendRequests( + mut requests, + _, + )) => { + assert_eq!(requests.len(), 1); + // By default, we should use the new protocol version with a fallback on + // the older one. + let (protocol, request) = requests.remove(0).encode_request(); + assert_eq!(protocol, Protocol::ChunkFetchingV2); + assert_eq!( + request.fallback_request.unwrap().1, + Protocol::ChunkFetchingV1 + ); + }, + _ => unreachable!(), + } + }, + |mut sender| async move { + state + .launch_parallel_chunk_requests( + "regular", + &params, + &mut sender, + 10, + &mut validators, + &mut ongoing_reqs, + ) + .await; + + assert_eq!(ongoing_reqs.total_len(), 1); + assert_eq!(ongoing_reqs.len(), 1); + }, + ); + } + } + + #[test] + fn test_wait_for_chunks() { + let params = RecoveryParams::default(); + let retry_threshold = 2; + + // No ongoing requests. + { + let params = params.clone(); + let mut state = State::new(); + let mut ongoing_reqs = OngoingRequests::new(); + let mut validators = VecDeque::new(); + + test_harness( + |mut receiver: UnboundedReceiver| async move { + // Shouldn't send any requests. + assert!(receiver.next().timeout(TIMEOUT).await.unwrap().is_none()); + }, + |_| async move { + let (total_responses, error_count) = state + .wait_for_chunks( + "regular", + &params, + retry_threshold, + &mut validators, + &mut ongoing_reqs, + &mut vec![], + |_, _, _, _| false, + ) + .await; + assert_eq!(total_responses, 0); + assert_eq!(error_count, 0); + assert_eq!(state.chunk_count(), 0); + }, + ); + } + + // Complex scenario.
+ { + let mut params = params.clone(); + let chunks = params.create_chunks(); + let mut state = State::new(); + let mut ongoing_reqs = OngoingRequests::new(); + ongoing_reqs.push( + future::ready(( + params.validator_authority_keys[0].clone(), + 0.into(), + Ok((Some(chunks[0].clone()), "".into())), + )) + .boxed(), + ); + ongoing_reqs.soft_cancel(); + ongoing_reqs.push( + future::ready(( + params.validator_authority_keys[1].clone(), + 1.into(), + Ok((Some(chunks[1].clone()), "".into())), + )) + .boxed(), + ); + ongoing_reqs.push( + future::ready(( + params.validator_authority_keys[2].clone(), + 2.into(), + Ok((None, "".into())), + )) + .boxed(), + ); + ongoing_reqs.push( + future::ready(( + params.validator_authority_keys[3].clone(), + 3.into(), + Err(RequestError::from(DecodingError::from("err"))), + )) + .boxed(), + ); + ongoing_reqs.push( + future::ready(( + params.validator_authority_keys[4].clone(), + 4.into(), + Err(RequestError::NetworkError(RequestFailure::NotConnected)), + )) + .boxed(), + ); + + let mut validators: VecDeque<_> = (5..params.n_validators as u32) + .map(|i| (params.validator_authority_keys[i as usize].clone(), i.into())) + .collect(); + validators.push_back(( + Sr25519Keyring::AliceStash.public().into(), + ValidatorIndex(params.n_validators as u32), + )); + + test_harness( + |mut receiver: UnboundedReceiver| async move { + // Shouldn't send any requests. 
+ assert!(receiver.next().timeout(TIMEOUT).await.unwrap().is_none()); + }, + |_| async move { + let (total_responses, error_count) = state + .wait_for_chunks( + "regular", + &params, + retry_threshold, + &mut validators, + &mut ongoing_reqs, + &mut vec![], + |_, _, _, _| false, + ) + .await; + assert_eq!(total_responses, 5); + assert_eq!(error_count, 3); + assert_eq!(state.chunk_count(), 2); + + let mut expected_validators: VecDeque<_> = (4..params.n_validators as u32) + .map(|i| (params.validator_authority_keys[i as usize].clone(), i.into())) + .collect(); + expected_validators.push_back(( + Sr25519Keyring::AliceStash.public().into(), + ValidatorIndex(params.n_validators as u32), + )); + + assert_eq!(validators, expected_validators); + + // This time we'll go over the recoverable error threshold. + ongoing_reqs.push( + future::ready(( + params.validator_authority_keys[4].clone(), + 4.into(), + Err(RequestError::NetworkError(RequestFailure::NotConnected)), + )) + .boxed(), + ); + + let (total_responses, error_count) = state + .wait_for_chunks( + "regular", + &params, + retry_threshold, + &mut validators, + &mut ongoing_reqs, + &mut vec![], + |_, _, _, _| false, + ) + .await; + assert_eq!(total_responses, 1); + assert_eq!(error_count, 1); + assert_eq!(state.chunk_count(), 2); + + validators.pop_front(); + let mut expected_validators: VecDeque<_> = (5..params.n_validators as u32) + .map(|i| (params.validator_authority_keys[i as usize].clone(), i.into())) + .collect(); + expected_validators.push_back(( + Sr25519Keyring::AliceStash.public().into(), + ValidatorIndex(params.n_validators as u32), + )); + + assert_eq!(validators, expected_validators); + + // Check that can_conclude returning true terminates the loop.
+ let (total_responses, error_count) = state + .wait_for_chunks( + "regular", + &params, + retry_threshold, + &mut validators, + &mut ongoing_reqs, + &mut vec![], + |_, _, _, _| true, + ) + .await; + assert_eq!(total_responses, 0); + assert_eq!(error_count, 0); + assert_eq!(state.chunk_count(), 2); + + assert_eq!(validators, expected_validators); + }, + ); + } + + // Complex scenario with backups in the backing group. + { + let mut params = params.clone(); + let chunks = params.create_chunks(); + let mut state = State::new(); + let mut ongoing_reqs = OngoingRequests::new(); + ongoing_reqs.push( + future::ready(( + params.validator_authority_keys[0].clone(), + 0.into(), + Ok((Some(chunks[0].clone()), "".into())), + )) + .boxed(), + ); + ongoing_reqs.soft_cancel(); + ongoing_reqs.push( + future::ready(( + params.validator_authority_keys[1].clone(), + 1.into(), + Ok((Some(chunks[1].clone()), "".into())), + )) + .boxed(), + ); + ongoing_reqs.push( + future::ready(( + params.validator_authority_keys[2].clone(), + 2.into(), + Ok((None, "".into())), + )) + .boxed(), + ); + ongoing_reqs.push( + future::ready(( + params.validator_authority_keys[3].clone(), + 3.into(), + Err(RequestError::from(DecodingError::from("err"))), + )) + .boxed(), + ); + ongoing_reqs.push( + future::ready(( + params.validator_authority_keys[4].clone(), + 4.into(), + Err(RequestError::NetworkError(RequestFailure::NotConnected)), + )) + .boxed(), + ); + + let mut validators: VecDeque<_> = (5..params.n_validators as u32) + .map(|i| (params.validator_authority_keys[i as usize].clone(), i.into())) + .collect(); + validators.push_back(( + Sr25519Keyring::Eve.public().into(), + ValidatorIndex(params.n_validators as u32), + )); + + let mut backup_backers = vec![ + params.validator_authority_keys[2].clone(), + params.validator_authority_keys[0].clone(), + params.validator_authority_keys[4].clone(), + params.validator_authority_keys[3].clone(), + Sr25519Keyring::AliceStash.public().into(), +
Sr25519Keyring::BobStash.public().into(), + ]; + + test_harness( + |mut receiver: UnboundedReceiver| async move { + // Shouldn't send any requests. + assert!(receiver.next().timeout(TIMEOUT).await.unwrap().is_none()); + }, + |_| async move { + let (total_responses, error_count) = state + .wait_for_chunks( + "regular", + &params, + retry_threshold, + &mut validators, + &mut ongoing_reqs, + &mut backup_backers, + |_, _, _, _| false, + ) + .await; + assert_eq!(total_responses, 5); + assert_eq!(error_count, 3); + assert_eq!(state.chunk_count(), 2); + + let mut expected_validators: VecDeque<_> = (5..params.n_validators as u32) + .map(|i| (params.validator_authority_keys[i as usize].clone(), i.into())) + .collect(); + expected_validators.push_back(( + Sr25519Keyring::Eve.public().into(), + ValidatorIndex(params.n_validators as u32), + )); + // We picked a backer as a backup for chunks 2 and 3. + expected_validators + .push_front((params.validator_authority_keys[0].clone(), 2.into())); + expected_validators + .push_front((params.validator_authority_keys[2].clone(), 3.into())); + expected_validators + .push_front((params.validator_authority_keys[4].clone(), 4.into())); + + assert_eq!(validators, expected_validators); + + // This time we'll go over the recoverable error threshold for chunk 4.
+ ongoing_reqs.push( + future::ready(( + params.validator_authority_keys[4].clone(), + 4.into(), + Err(RequestError::NetworkError(RequestFailure::NotConnected)), + )) + .boxed(), + ); + + validators.pop_front(); + + let (total_responses, error_count) = state + .wait_for_chunks( + "regular", + &params, + retry_threshold, + &mut validators, + &mut ongoing_reqs, + &mut backup_backers, + |_, _, _, _| false, + ) + .await; + assert_eq!(total_responses, 1); + assert_eq!(error_count, 1); + assert_eq!(state.chunk_count(), 2); + + expected_validators.pop_front(); + expected_validators + .push_front((Sr25519Keyring::AliceStash.public().into(), 4.into())); + + assert_eq!(validators, expected_validators); + }, + ); + } + } + + #[test] + fn test_recovery_strategy_run() { + let params = RecoveryParams::default(); + + struct GoodStrategy; + #[async_trait::async_trait] + impl RecoveryStrategy for GoodStrategy { + fn display_name(&self) -> &'static str { + "GoodStrategy" + } + + fn strategy_type(&self) -> &'static str { + "good_strategy" + } + + async fn run( + mut self: Box, + _state: &mut State, + _sender: &mut Sender, + _common_params: &RecoveryParams, + ) -> Result { + Ok(dummy_available_data()) + } + } + + struct UnavailableStrategy; + #[async_trait::async_trait] + impl RecoveryStrategy + for UnavailableStrategy + { + fn display_name(&self) -> &'static str { + "UnavailableStrategy" + } + + fn strategy_type(&self) -> &'static str { + "unavailable_strategy" + } + + async fn run( + mut self: Box, + _state: &mut State, + _sender: &mut Sender, + _common_params: &RecoveryParams, + ) -> Result { + Err(RecoveryError::Unavailable) + } + } + + struct InvalidStrategy; + #[async_trait::async_trait] + impl RecoveryStrategy + for InvalidStrategy + { + fn display_name(&self) -> &'static str { + "InvalidStrategy" + } + + fn strategy_type(&self) -> &'static str { + "invalid_strategy" + } + + async fn run( + mut self: Box, + _state: &mut State, + _sender: &mut Sender, + _common_params:
&RecoveryParams, + ) -> Result { + Err(RecoveryError::Invalid) + } + } + + // No recovery strategies. + { + let mut params = params.clone(); + let strategies = VecDeque::new(); + params.bypass_availability_store = true; + + test_harness( + |mut receiver: UnboundedReceiver| async move { + // Shouldn't send any requests. + assert!(receiver.next().timeout(TIMEOUT).await.unwrap().is_none()); + }, + |sender| async move { + let task = RecoveryTask::new(sender, params, strategies); + + assert_eq!(task.run().await.unwrap_err(), RecoveryError::Unavailable); + }, + ); + } + + // If we have the data in av-store, returns early. + { + let params = params.clone(); + let strategies = VecDeque::new(); + let candidate_hash = params.candidate_hash; + + test_harness( + |mut receiver: UnboundedReceiver| async move { + assert_matches!( + receiver.next().timeout(TIMEOUT).await.unwrap().unwrap(), + AllMessages::AvailabilityStore(AvailabilityStoreMessage::QueryAvailableData(hash, tx)) => { + assert_eq!(hash, candidate_hash); + tx.send(Some(dummy_available_data())).unwrap(); + }); + }, + |sender| async move { + let task = RecoveryTask::new(sender, params, strategies); + + assert_eq!(task.run().await.unwrap(), dummy_available_data()); + }, + ); + } + + // Strategy returning `RecoveryError::Invalid`` will short-circuit the entire task. + { + let mut params = params.clone(); + params.bypass_availability_store = true; + let mut strategies: VecDeque>> = + VecDeque::new(); + strategies.push_back(Box::new(InvalidStrategy)); + strategies.push_back(Box::new(GoodStrategy)); + + test_harness( + |mut receiver: UnboundedReceiver| async move { + // Shouldn't send any requests. + assert!(receiver.next().timeout(TIMEOUT).await.unwrap().is_none()); + }, + |sender| async move { + let task = RecoveryTask::new(sender, params, strategies); + + assert_eq!(task.run().await.unwrap_err(), RecoveryError::Invalid); + }, + ); + } + + // Strategy returning `Unavailable` will fall back to the next one. 
+ { + let params = params.clone(); + let candidate_hash = params.candidate_hash; + let mut strategies: VecDeque>> = + VecDeque::new(); + strategies.push_back(Box::new(UnavailableStrategy)); + strategies.push_back(Box::new(GoodStrategy)); + + test_harness( + |mut receiver: UnboundedReceiver| async move { + assert_matches!( + receiver.next().timeout(TIMEOUT).await.unwrap().unwrap(), + AllMessages::AvailabilityStore(AvailabilityStoreMessage::QueryAvailableData(hash, tx)) => { + assert_eq!(hash, candidate_hash); + tx.send(Some(dummy_available_data())).unwrap(); + }); + }, + |sender| async move { + let task = RecoveryTask::new(sender, params, strategies); + + assert_eq!(task.run().await.unwrap(), dummy_available_data()); + }, + ); + } + + // More complex scenario. + { + let params = params.clone(); + let candidate_hash = params.candidate_hash; + let mut strategies: VecDeque>> = + VecDeque::new(); + strategies.push_back(Box::new(UnavailableStrategy)); + strategies.push_back(Box::new(UnavailableStrategy)); + strategies.push_back(Box::new(GoodStrategy)); + strategies.push_back(Box::new(InvalidStrategy)); + + test_harness( + |mut receiver: UnboundedReceiver| async move { + assert_matches!( + receiver.next().timeout(TIMEOUT).await.unwrap().unwrap(), + AllMessages::AvailabilityStore(AvailabilityStoreMessage::QueryAvailableData(hash, tx)) => { + assert_eq!(hash, candidate_hash); + tx.send(Some(dummy_available_data())).unwrap(); + }); + }, + |sender| async move { + let task = RecoveryTask::new(sender, params, strategies); + + assert_eq!(task.run().await.unwrap(), dummy_available_data()); + }, + ); + } + } + + #[test] + fn test_is_unavailable() { + assert_eq!(is_unavailable(0, 0, 0, 0), false); + assert_eq!(is_unavailable(2, 2, 2, 0), false); + // Already reached the threshold. 
+ assert_eq!(is_unavailable(3, 0, 10, 3), false); + assert_eq!(is_unavailable(3, 2, 0, 3), false); + assert_eq!(is_unavailable(3, 2, 10, 3), false); + // It's still possible to reach the threshold + assert_eq!(is_unavailable(0, 0, 10, 3), false); + assert_eq!(is_unavailable(0, 0, 3, 3), false); + assert_eq!(is_unavailable(1, 1, 1, 3), false); + // Not possible to reach the threshold + assert_eq!(is_unavailable(0, 0, 0, 3), true); + assert_eq!(is_unavailable(2, 3, 2, 10), true); + } +} diff --git a/polkadot/node/network/availability-recovery/src/task/strategy/systematic.rs b/polkadot/node/network/availability-recovery/src/task/strategy/systematic.rs new file mode 100644 index 0000000000000..677bc2d1375aa --- /dev/null +++ b/polkadot/node/network/availability-recovery/src/task/strategy/systematic.rs @@ -0,0 +1,343 @@ +// Copyright (C) Parity Technologies (UK) Ltd. +// This file is part of Polkadot. + +// Polkadot is free software: you can redistribute it and/or modify +// it under the terms of the GNU General Public License as published by +// the Free Software Foundation, either version 3 of the License, or +// (at your option) any later version. + +// Polkadot is distributed in the hope that it will be useful, +// but WITHOUT ANY WARRANTY; without even the implied warranty of +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +// GNU General Public License for more details. + +// You should have received a copy of the GNU General Public License +// along with Polkadot. If not, see <http://www.gnu.org/licenses/>.
+
+use crate::{
+	futures_undead::FuturesUndead,
+	task::{
+		strategy::{
+			do_post_recovery_check, is_unavailable, OngoingRequests, N_PARALLEL,
+			SYSTEMATIC_CHUNKS_REQ_RETRY_LIMIT,
+		},
+		RecoveryParams, RecoveryStrategy, State,
+	},
+	LOG_TARGET,
+};
+
+use polkadot_node_primitives::AvailableData;
+use polkadot_node_subsystem::{overseer, RecoveryError};
+use polkadot_primitives::{ChunkIndex, ValidatorIndex};
+
+use std::collections::VecDeque;
+
+/// Parameters needed for fetching systematic chunks.
+pub struct FetchSystematicChunksParams {
+	/// Validators that hold the systematic chunks.
+	pub validators: Vec<(ChunkIndex, ValidatorIndex)>,
+	/// Validators in the backing group, to be used as a backup for requesting systematic chunks.
+	pub backers: Vec,
+}
+
+/// `RecoveryStrategy` that attempts to recover the systematic chunks from the validators that
+/// hold them, in order to bypass the erasure code reconstruction step, which is costly.
+pub struct FetchSystematicChunks {
+	/// Systematic recovery threshold.
+	threshold: usize,
+	/// Validators that hold the systematic chunks.
+	validators: Vec<(ChunkIndex, ValidatorIndex)>,
+	/// Backers to be used as a backup.
+	backers: Vec,
+	/// Collection of in-flight requests.
+	requesting_chunks: OngoingRequests,
+}
+
+impl FetchSystematicChunks {
+	/// Instantiate a new systematic chunks strategy.
+	pub fn new(params: FetchSystematicChunksParams) -> Self {
+		Self {
+			threshold: params.validators.len(),
+			validators: params.validators,
+			backers: params.backers,
+			requesting_chunks: FuturesUndead::new(),
+		}
+	}
+
+	fn is_unavailable(
+		unrequested_validators: usize,
+		in_flight_requests: usize,
+		systematic_chunk_count: usize,
+		threshold: usize,
+	) -> bool {
+		is_unavailable(
+			systematic_chunk_count,
+			in_flight_requests,
+			unrequested_validators,
+			threshold,
+		)
+	}
+
+	/// Desired number of parallel requests.
+	///
+	/// For the given threshold (total required number of chunks) get the desired number of
+	/// requests we want to have running in parallel at this time.
+	fn get_desired_request_count(&self, chunk_count: usize, threshold: usize) -> usize {
+		// Upper bound for parallel requests.
+		let max_requests_boundary = std::cmp::min(N_PARALLEL, threshold);
+		// How many chunks are still needed?
+		let remaining_chunks = threshold.saturating_sub(chunk_count);
+		// Actual number of requests we want to have in flight in parallel:
+		// We don't have to make up for any error rate, as an error fetching a systematic chunk
+		// results in failure of the entire strategy.
+		std::cmp::min(max_requests_boundary, remaining_chunks)
+	}
+
+	async fn attempt_systematic_recovery(
+		&mut self,
+		state: &mut State,
+		common_params: &RecoveryParams,
+	) -> Result {
+		let strategy_type = RecoveryStrategy::::strategy_type(self);
+		let recovery_duration = common_params.metrics.time_erasure_recovery(strategy_type);
+		let reconstruct_duration = common_params.metrics.time_erasure_reconstruct(strategy_type);
+		let chunks = state
+			.received_chunks
+			.range(
+				ChunkIndex(0)..
+					ChunkIndex(
+						u32::try_from(self.threshold)
+							.expect("validator count should not exceed u32"),
+					),
+			)
+			.map(|(_, chunk)| chunk.chunk.clone())
+			.collect::>();
+
+		let available_data = polkadot_erasure_coding::reconstruct_from_systematic_v1(
+			common_params.n_validators,
+			chunks,
+		);
+
+		match available_data {
+			Ok(data) => {
+				drop(reconstruct_duration);
+
+				// Attempt post-recovery check.
+				do_post_recovery_check(common_params, data)
+					.await
+					.map_err(|e| {
+						recovery_duration.map(|rd| rd.stop_and_discard());
+						e
+					})
+					.map(|data| {
+						gum::trace!(
+							target: LOG_TARGET,
+							candidate_hash = ?common_params.candidate_hash,
+							erasure_root = ?common_params.erasure_root,
+							"Data recovery from systematic chunks complete",
+						);
+						data
+					})
+			},
+			Err(err) => {
+				reconstruct_duration.map(|rd| rd.stop_and_discard());
+				recovery_duration.map(|rd| rd.stop_and_discard());
+
+				gum::debug!(
+					target: LOG_TARGET,
+					candidate_hash = ?common_params.candidate_hash,
+					erasure_root = ?common_params.erasure_root,
+					?err,
+					"Systematic data recovery error",
+				);
+
+				Err(RecoveryError::Invalid)
+			},
+		}
+	}
+}
+
+#[async_trait::async_trait]
+impl RecoveryStrategy
+	for FetchSystematicChunks
+{
+	fn display_name(&self) -> &'static str {
+		"Fetch systematic chunks"
+	}
+
+	fn strategy_type(&self) -> &'static str {
+		"systematic_chunks"
+	}
+
+	async fn run(
+		mut self: Box,
+		state: &mut State,
+		sender: &mut Sender,
+		common_params: &RecoveryParams,
+	) -> Result {
+		// First query the store for any chunks we've got.
+		if !common_params.bypass_availability_store {
+			let local_chunk_indices = state.populate_from_av_store(common_params, sender).await;
+
+			for (_, our_c_index) in &local_chunk_indices {
+				// If we are among the systematic validators but hold an invalid chunk, we cannot
+				// perform the systematic recovery. Fall through to the next strategy.
+				if self.validators.iter().any(|(c_index, _)| c_index == our_c_index) &&
+					!state.received_chunks.contains_key(our_c_index)
+				{
+					gum::debug!(
+						target: LOG_TARGET,
+						candidate_hash = ?common_params.candidate_hash,
+						erasure_root = ?common_params.erasure_root,
+						requesting = %self.requesting_chunks.len(),
+						total_requesting = %self.requesting_chunks.total_len(),
+						n_validators = %common_params.n_validators,
+						chunk_index = ?our_c_index,
+						"Systematic chunk recovery is not possible. We are among the systematic validators but hold an invalid chunk",
+					);
+					return Err(RecoveryError::Unavailable)
+				}
+			}
+		}
+
+		// No need to query the validators that have the chunks we already received or that we know
+		// don't have the data from previous strategies.
+		self.validators.retain(|(c_index, v_index)| {
+			!state.received_chunks.contains_key(c_index) &&
+				state.can_retry_request(
+					&(common_params.validator_authority_keys[v_index.0 as usize].clone(), *v_index),
+					SYSTEMATIC_CHUNKS_REQ_RETRY_LIMIT,
+				)
+		});
+
+		let mut systematic_chunk_count = state
+			.received_chunks
+			.range(ChunkIndex(0)..ChunkIndex(self.threshold as u32))
+			.count();
+
+		// Safe to `take` here, as we're consuming `self` anyway and we're not using the
+		// `validators` or `backers` fields in other methods.
+		let mut validators_queue: VecDeque<_> = std::mem::take(&mut self.validators)
+			.into_iter()
+			.map(|(_, validator_index)| {
+				(
+					common_params.validator_authority_keys[validator_index.0 as usize].clone(),
+					validator_index,
+				)
+			})
+			.collect();
+		let mut backers: Vec<_> = std::mem::take(&mut self.backers)
+			.into_iter()
+			.map(|validator_index| {
+				common_params.validator_authority_keys[validator_index.0 as usize].clone()
+			})
+			.collect();
+
+		loop {
+			// If received_chunks has `systematic_chunk_threshold` entries, attempt to recover the
+			// data.
+			if systematic_chunk_count >= self.threshold {
+				return self.attempt_systematic_recovery::(state, common_params).await
+			}
+
+			if Self::is_unavailable(
+				validators_queue.len(),
+				self.requesting_chunks.total_len(),
+				systematic_chunk_count,
+				self.threshold,
+			) {
+				gum::debug!(
+					target: LOG_TARGET,
+					candidate_hash = ?common_params.candidate_hash,
+					erasure_root = ?common_params.erasure_root,
+					%systematic_chunk_count,
+					requesting = %self.requesting_chunks.len(),
+					total_requesting = %self.requesting_chunks.total_len(),
+					n_validators = %common_params.n_validators,
+					systematic_threshold = ?self.threshold,
+					"Data recovery from systematic chunks is not possible",
+				);
+
+				return Err(RecoveryError::Unavailable)
+			}
+
+			let desired_requests_count =
+				self.get_desired_request_count(systematic_chunk_count, self.threshold);
+			let already_requesting_count = self.requesting_chunks.len();
+			gum::debug!(
+				target: LOG_TARGET,
+				?common_params.candidate_hash,
+				?desired_requests_count,
+				total_received = ?systematic_chunk_count,
+				systematic_threshold = ?self.threshold,
+				?already_requesting_count,
+				"Requesting systematic availability chunks for a candidate",
+			);
+
+			let strategy_type = RecoveryStrategy::::strategy_type(&*self);
+
+			state
+				.launch_parallel_chunk_requests(
+					strategy_type,
+					common_params,
+					sender,
+					desired_requests_count,
+					&mut validators_queue,
+					&mut self.requesting_chunks,
+				)
+				.await;
+
+			let _ = state
+				.wait_for_chunks(
+					strategy_type,
+					common_params,
+					SYSTEMATIC_CHUNKS_REQ_RETRY_LIMIT,
+					&mut validators_queue,
+					&mut self.requesting_chunks,
+					&mut backers,
+					|unrequested_validators,
+					 in_flight_reqs,
+					 // Don't use this chunk count, as it may contain non-systematic chunks.
+					 _chunk_count,
+					 new_systematic_chunk_count| {
+						systematic_chunk_count = new_systematic_chunk_count;
+
+						let is_unavailable = Self::is_unavailable(
+							unrequested_validators,
+							in_flight_reqs,
+							systematic_chunk_count,
+							self.threshold,
+						);
+
+						systematic_chunk_count >= self.threshold || is_unavailable
+					},
+				)
+				.await;
+		}
+	}
+}
+
+#[cfg(test)]
+mod tests {
+	use super::*;
+	use polkadot_erasure_coding::systematic_recovery_threshold;
+
+	#[test]
+	fn test_get_desired_request_count() {
+		let num_validators = 100;
+		let threshold = systematic_recovery_threshold(num_validators).unwrap();
+
+		let systematic_chunks_task = FetchSystematicChunks::new(FetchSystematicChunksParams {
+			validators: vec![(1.into(), 1.into()); num_validators],
+			backers: vec![],
+		});
+		assert_eq!(systematic_chunks_task.get_desired_request_count(0, threshold), threshold);
+		assert_eq!(systematic_chunks_task.get_desired_request_count(5, threshold), threshold - 5);
+		assert_eq!(
+			systematic_chunks_task.get_desired_request_count(num_validators * 2, threshold),
+			0
+		);
+		assert_eq!(systematic_chunks_task.get_desired_request_count(0, N_PARALLEL * 2), N_PARALLEL);
+		assert_eq!(systematic_chunks_task.get_desired_request_count(N_PARALLEL, N_PARALLEL + 2), 2);
+	}
+}
diff --git a/polkadot/node/network/availability-recovery/src/tests.rs b/polkadot/node/network/availability-recovery/src/tests.rs
index 6049a5a5c3a2e..d0a4a2d8b60e8 100644
--- a/polkadot/node/network/availability-recovery/src/tests.rs
+++ b/polkadot/node/network/availability-recovery/src/tests.rs
@@ -14,38 +14,133 @@
 // You should have received a copy of the GNU General Public License
 // along with Polkadot. If not, see .
-use std::{sync::Arc, time::Duration}; +use crate::task::{REGULAR_CHUNKS_REQ_RETRY_LIMIT, SYSTEMATIC_CHUNKS_REQ_RETRY_LIMIT}; + +use super::*; +use std::{result::Result, sync::Arc, time::Duration}; use assert_matches::assert_matches; use futures::{executor, future}; use futures_timer::Delay; +use rstest::rstest; use parity_scale_codec::Encode; use polkadot_node_network_protocol::request_response::{ - self as req_res, v1::AvailableDataFetchingRequest, IncomingRequest, Protocol, Recipient, - ReqProtocolNames, Requests, + self as req_res, + v1::{AvailableDataFetchingRequest, ChunkResponse}, + IncomingRequest, Protocol, Recipient, ReqProtocolNames, Requests, }; -use polkadot_node_subsystem_test_helpers::derive_erasure_chunks_with_proofs_and_root; - -use super::*; -use sc_network::{IfDisconnected, OutboundFailure, ProtocolName, RequestFailure}; - -use polkadot_node_primitives::{BlockData, PoV, Proof}; +use polkadot_node_primitives::{BlockData, ErasureChunk, PoV, Proof}; use polkadot_node_subsystem::messages::{ AllMessages, NetworkBridgeTxMessage, RuntimeApiMessage, RuntimeApiRequest, }; use polkadot_node_subsystem_test_helpers::{ - make_subsystem_context, mock::new_leaf, TestSubsystemContextHandle, + derive_erasure_chunks_with_proofs_and_root, make_subsystem_context, mock::new_leaf, + TestSubsystemContextHandle, }; use polkadot_node_subsystem_util::TimeoutExt; use polkadot_primitives::{ - AuthorityDiscoveryId, Block, Hash, HeadData, IndexedVec, PersistedValidationData, ValidatorId, + node_features, AuthorityDiscoveryId, Block, ExecutorParams, Hash, HeadData, IndexedVec, + NodeFeatures, PersistedValidationData, SessionInfo, ValidatorId, }; use polkadot_primitives_test_helpers::{dummy_candidate_receipt, dummy_hash}; +use sc_network::{IfDisconnected, OutboundFailure, ProtocolName, RequestFailure}; +use sp_keyring::Sr25519Keyring; type VirtualOverseer = TestSubsystemContextHandle; +// Implement some helper constructors for the AvailabilityRecoverySubsystem + +/// Create a 
new instance of `AvailabilityRecoverySubsystem` which starts with a fast path to +/// request data from backers. +fn with_fast_path( + req_receiver: IncomingRequestReceiver, + req_protocol_names: &ReqProtocolNames, + metrics: Metrics, +) -> AvailabilityRecoverySubsystem { + AvailabilityRecoverySubsystem::with_recovery_strategy_kind( + req_receiver, + req_protocol_names, + metrics, + RecoveryStrategyKind::BackersFirstAlways, + ) +} + +/// Create a new instance of `AvailabilityRecoverySubsystem` which requests only chunks +fn with_chunks_only( + req_receiver: IncomingRequestReceiver, + req_protocol_names: &ReqProtocolNames, + metrics: Metrics, +) -> AvailabilityRecoverySubsystem { + AvailabilityRecoverySubsystem::with_recovery_strategy_kind( + req_receiver, + req_protocol_names, + metrics, + RecoveryStrategyKind::ChunksAlways, + ) +} + +/// Create a new instance of `AvailabilityRecoverySubsystem` which requests chunks if PoV is +/// above a threshold. +fn with_chunks_if_pov_large( + req_receiver: IncomingRequestReceiver, + req_protocol_names: &ReqProtocolNames, + metrics: Metrics, +) -> AvailabilityRecoverySubsystem { + AvailabilityRecoverySubsystem::with_recovery_strategy_kind( + req_receiver, + req_protocol_names, + metrics, + RecoveryStrategyKind::BackersFirstIfSizeLower(FETCH_CHUNKS_THRESHOLD), + ) +} + +/// Create a new instance of `AvailabilityRecoverySubsystem` which requests systematic chunks if +/// PoV is above a threshold. +fn with_systematic_chunks_if_pov_large( + req_receiver: IncomingRequestReceiver, + req_protocol_names: &ReqProtocolNames, + metrics: Metrics, +) -> AvailabilityRecoverySubsystem { + AvailabilityRecoverySubsystem::for_validator( + Some(FETCH_CHUNKS_THRESHOLD), + req_receiver, + req_protocol_names, + metrics, + ) +} + +/// Create a new instance of `AvailabilityRecoverySubsystem` which first requests full data +/// from backers, with a fallback to recover from systematic chunks. 
+fn with_fast_path_then_systematic_chunks( + req_receiver: IncomingRequestReceiver, + req_protocol_names: &ReqProtocolNames, + metrics: Metrics, +) -> AvailabilityRecoverySubsystem { + AvailabilityRecoverySubsystem::with_recovery_strategy_kind( + req_receiver, + req_protocol_names, + metrics, + RecoveryStrategyKind::BackersThenSystematicChunks, + ) +} + +/// Create a new instance of `AvailabilityRecoverySubsystem` which first attempts to request +/// systematic chunks, with a fallback to requesting regular chunks. +fn with_systematic_chunks( + req_receiver: IncomingRequestReceiver, + req_protocol_names: &ReqProtocolNames, + metrics: Metrics, +) -> AvailabilityRecoverySubsystem { + AvailabilityRecoverySubsystem::with_recovery_strategy_kind( + req_receiver, + req_protocol_names, + metrics, + RecoveryStrategyKind::SystematicChunks, + ) +} + // Deterministic genesis hash for protocol names const GENESIS_HASH: Hash = Hash::repeat_byte(0xff); @@ -61,14 +156,11 @@ fn request_receiver( receiver.0 } -fn test_harness>( +fn test_harness>( subsystem: AvailabilityRecoverySubsystem, - test: impl FnOnce(VirtualOverseer) -> T, + test: impl FnOnce(VirtualOverseer) -> Fut, ) { - let _ = env_logger::builder() - .is_test(true) - .filter(Some("polkadot_availability_recovery"), log::LevelFilter::Trace) - .try_init(); + sp_tracing::init_for_tests(); let pool = sp_core::testing::TaskExecutor::new(); @@ -138,8 +230,6 @@ async fn overseer_recv( msg } -use sp_keyring::Sr25519Keyring; - #[derive(Debug)] enum Has { No, @@ -163,27 +253,127 @@ struct TestState { validators: Vec, validator_public: IndexedVec, validator_authority_id: Vec, + validator_groups: IndexedVec>, current: Hash, candidate: CandidateReceipt, session_index: SessionIndex, + core_index: CoreIndex, + node_features: NodeFeatures, persisted_validation_data: PersistedValidationData, available_data: AvailableData, - chunks: Vec, - invalid_chunks: Vec, + chunks: IndexedVec, + invalid_chunks: IndexedVec, } impl TestState { + fn 
new(node_features: NodeFeatures) -> Self { + let validators = vec![ + Sr25519Keyring::Ferdie, // <- this node, role: validator + Sr25519Keyring::Alice, + Sr25519Keyring::Bob, + Sr25519Keyring::Charlie, + Sr25519Keyring::Dave, + Sr25519Keyring::One, + Sr25519Keyring::Two, + ]; + + let validator_public = validator_pubkeys(&validators); + let validator_authority_id = validator_authority_id(&validators); + let validator_groups = vec![ + vec![1.into(), 0.into(), 3.into(), 4.into()], + vec![5.into(), 6.into()], + vec![2.into()], + ]; + + let current = Hash::repeat_byte(1); + + let mut candidate = dummy_candidate_receipt(dummy_hash()); + + let session_index = 10; + + let persisted_validation_data = PersistedValidationData { + parent_head: HeadData(vec![7, 8, 9]), + relay_parent_number: Default::default(), + max_pov_size: 1024, + relay_parent_storage_root: Default::default(), + }; + + let pov = PoV { block_data: BlockData(vec![42; 64]) }; + + let available_data = AvailableData { + validation_data: persisted_validation_data.clone(), + pov: Arc::new(pov), + }; + + let core_index = CoreIndex(2); + + let (chunks, erasure_root) = derive_erasure_chunks_with_proofs_and_root( + validators.len(), + &available_data, + |_, _| {}, + ); + let chunks = map_chunks(chunks, &node_features, validators.len(), core_index); + + // Mess around: + let invalid_chunks = chunks + .iter() + .cloned() + .map(|mut chunk| { + if chunk.chunk.len() >= 2 && chunk.chunk[0] != chunk.chunk[1] { + chunk.chunk[0] = chunk.chunk[1]; + } else if chunk.chunk.len() >= 1 { + chunk.chunk[0] = !chunk.chunk[0]; + } else { + chunk.proof = Proof::dummy_proof(); + } + chunk + }) + .collect(); + debug_assert_ne!(chunks, invalid_chunks); + + candidate.descriptor.erasure_root = erasure_root; + candidate.descriptor.relay_parent = Hash::repeat_byte(10); + candidate.descriptor.pov_hash = Hash::repeat_byte(3); + + Self { + validators, + validator_public, + validator_authority_id, + validator_groups: IndexedVec::>::try_from( + 
validator_groups, + ) + .unwrap(), + current, + candidate, + session_index, + core_index, + node_features, + persisted_validation_data, + available_data, + chunks, + invalid_chunks, + } + } + + fn with_empty_node_features() -> Self { + Self::new(NodeFeatures::EMPTY) + } + fn threshold(&self) -> usize { recovery_threshold(self.validators.len()).unwrap() } + fn systematic_threshold(&self) -> usize { + systematic_recovery_threshold(self.validators.len()).unwrap() + } + fn impossibility_threshold(&self) -> usize { self.validators.len() - self.threshold() + 1 } - async fn test_runtime_api(&self, virtual_overseer: &mut VirtualOverseer) { + async fn test_runtime_api_session_info(&self, virtual_overseer: &mut VirtualOverseer) { assert_matches!( overseer_recv(virtual_overseer).await, AllMessages::RuntimeApi(RuntimeApiMessage::Request( @@ -199,8 +389,7 @@ impl TestState { tx.send(Ok(Some(SessionInfo { validators: self.validator_public.clone(), discovery_keys: self.validator_authority_id.clone(), - // all validators in the same group. 
- validator_groups: IndexedVec::>::from(vec![(0..self.validators.len()).map(|i| ValidatorIndex(i as _)).collect()]), + validator_groups: self.validator_groups.clone(), assignment_keys: vec![], n_cores: 0, zeroth_delay_tranche_width: 0, @@ -214,6 +403,38 @@ impl TestState { }))).unwrap(); } ); + assert_matches!( + overseer_recv(virtual_overseer).await, + AllMessages::RuntimeApi(RuntimeApiMessage::Request( + relay_parent, + RuntimeApiRequest::SessionExecutorParams( + session_index, + tx, + ) + )) => { + assert_eq!(relay_parent, self.current); + assert_eq!(session_index, self.session_index); + + tx.send(Ok(Some(ExecutorParams::new()))).unwrap(); + } + ); + } + + async fn test_runtime_api_node_features(&self, virtual_overseer: &mut VirtualOverseer) { + assert_matches!( + overseer_recv(virtual_overseer).await, + AllMessages::RuntimeApi(RuntimeApiMessage::Request( + _relay_parent, + RuntimeApiRequest::NodeFeatures( + _, + tx, + ) + )) => { + tx.send(Ok( + self.node_features.clone() + )).unwrap(); + } + ); } async fn respond_to_available_data_query( @@ -239,16 +460,19 @@ impl TestState { async fn respond_to_query_all_request( &self, virtual_overseer: &mut VirtualOverseer, - send_chunk: impl Fn(usize) -> bool, + send_chunk: impl Fn(ValidatorIndex) -> bool, ) { assert_matches!( overseer_recv(virtual_overseer).await, AllMessages::AvailabilityStore( AvailabilityStoreMessage::QueryAllChunks(_, tx) ) => { - let v = self.chunks.iter() - .filter(|c| send_chunk(c.index.0 as usize)) - .cloned() + let v = self.chunks.iter().enumerate() + .filter_map(|(val_idx, c)| if send_chunk(ValidatorIndex(val_idx as u32)) { + Some((ValidatorIndex(val_idx as u32), c.clone())) + } else { + None + }) .collect(); let _ = tx.send(v); @@ -259,16 +483,19 @@ impl TestState { async fn respond_to_query_all_request_invalid( &self, virtual_overseer: &mut VirtualOverseer, - send_chunk: impl Fn(usize) -> bool, + send_chunk: impl Fn(ValidatorIndex) -> bool, ) { assert_matches!( 
overseer_recv(virtual_overseer).await, AllMessages::AvailabilityStore( AvailabilityStoreMessage::QueryAllChunks(_, tx) ) => { - let v = self.invalid_chunks.iter() - .filter(|c| send_chunk(c.index.0 as usize)) - .cloned() + let v = self.invalid_chunks.iter().enumerate() + .filter_map(|(val_idx, c)| if send_chunk(ValidatorIndex(val_idx as u32)) { + Some((ValidatorIndex(val_idx as u32), c.clone())) + } else { + None + }) .collect(); let _ = tx.send(v); @@ -276,14 +503,16 @@ impl TestState { ) } - async fn test_chunk_requests( + async fn test_chunk_requests_inner( &self, req_protocol_names: &ReqProtocolNames, candidate_hash: CandidateHash, virtual_overseer: &mut VirtualOverseer, n: usize, - who_has: impl Fn(usize) -> Has, - ) -> Vec, ProtocolName), RequestFailure>>> { + mut who_has: impl FnMut(ValidatorIndex) -> Has, + systematic_recovery: bool, + protocol: Protocol, + ) -> Vec, ProtocolName), RequestFailure>>> { // arbitrary order. let mut i = 0; let mut senders = Vec::new(); @@ -301,13 +530,19 @@ impl TestState { i += 1; assert_matches!( req, - Requests::ChunkFetchingV1(req) => { + Requests::ChunkFetching(req) => { assert_eq!(req.payload.candidate_hash, candidate_hash); - let validator_index = req.payload.index.0 as usize; + let validator_index = req.payload.index; + let chunk = self.chunks.get(validator_index).unwrap().clone(); + + if systematic_recovery { + assert!(chunk.index.0 as usize <= self.systematic_threshold(), "requested non-systematic chunk"); + } + let available_data = match who_has(validator_index) { Has::No => Ok(None), - Has::Yes => Ok(Some(self.chunks[validator_index].clone().into())), + Has::Yes => Ok(Some(chunk)), Has::NetworkError(e) => Err(e), Has::DoesNotReturn => { senders.push(req.pending_response); @@ -315,11 +550,29 @@ impl TestState { } }; - let _ = req.pending_response.send( + req.pending_response.send( available_data.map(|r| - (req_res::v1::ChunkFetchingResponse::from(r).encode(), req_protocol_names.get_name(Protocol::ChunkFetchingV1)) + 
( + match protocol { + Protocol::ChunkFetchingV1 => + match r { + None => req_res::v1::ChunkFetchingResponse::NoSuchChunk, + Some(c) => req_res::v1::ChunkFetchingResponse::Chunk( + ChunkResponse { + chunk: c.chunk, + proof: c.proof + } + ) + }.encode(), + Protocol::ChunkFetchingV2 => + req_res::v2::ChunkFetchingResponse::from(r).encode(), + + _ => unreachable!() + }, + req_protocol_names.get_name(protocol) + ) ) - ); + ).unwrap(); } ) } @@ -329,16 +582,61 @@ impl TestState { senders } + async fn test_chunk_requests( + &self, + req_protocol_names: &ReqProtocolNames, + candidate_hash: CandidateHash, + virtual_overseer: &mut VirtualOverseer, + n: usize, + who_has: impl FnMut(ValidatorIndex) -> Has, + systematic_recovery: bool, + ) -> Vec, ProtocolName), RequestFailure>>> { + self.test_chunk_requests_inner( + req_protocol_names, + candidate_hash, + virtual_overseer, + n, + who_has, + systematic_recovery, + Protocol::ChunkFetchingV2, + ) + .await + } + + // Use legacy network protocol version. + async fn test_chunk_requests_v1( + &self, + req_protocol_names: &ReqProtocolNames, + candidate_hash: CandidateHash, + virtual_overseer: &mut VirtualOverseer, + n: usize, + who_has: impl FnMut(ValidatorIndex) -> Has, + systematic_recovery: bool, + ) -> Vec, ProtocolName), RequestFailure>>> { + self.test_chunk_requests_inner( + req_protocol_names, + candidate_hash, + virtual_overseer, + n, + who_has, + systematic_recovery, + Protocol::ChunkFetchingV1, + ) + .await + } + async fn test_full_data_requests( &self, req_protocol_names: &ReqProtocolNames, candidate_hash: CandidateHash, virtual_overseer: &mut VirtualOverseer, who_has: impl Fn(usize) -> Has, - ) -> Vec, ProtocolName), RequestFailure>>> { + group_index: GroupIndex, + ) -> Vec, ProtocolName), RequestFailure>>> { let mut senders = Vec::new(); - for _ in 0..self.validators.len() { - // Receive a request for a chunk. 
+ let expected_validators = self.validator_groups.get(group_index).unwrap(); + for _ in 0..expected_validators.len() { + // Receive a request for the full `AvailableData`. assert_matches!( overseer_recv(virtual_overseer).await, AllMessages::NetworkBridgeTx( @@ -357,6 +655,7 @@ impl TestState { .iter() .position(|a| Recipient::Authority(a.clone()) == req.peer) .unwrap(); + assert!(expected_validators.contains(&ValidatorIndex(validator_index as u32))); let available_data = match who_has(validator_index) { Has::No => Ok(None), @@ -387,95 +686,67 @@ impl TestState { } } +impl Default for TestState { + fn default() -> Self { + // Enable the chunk mapping node feature. + let mut node_features = NodeFeatures::new(); + node_features + .resize(node_features::FeatureIndex::AvailabilityChunkMapping as usize + 1, false); + node_features + .set(node_features::FeatureIndex::AvailabilityChunkMapping as u8 as usize, true); + + Self::new(node_features) + } +} + fn validator_pubkeys(val_ids: &[Sr25519Keyring]) -> IndexedVec { val_ids.iter().map(|v| v.public().into()).collect() } -fn validator_authority_id(val_ids: &[Sr25519Keyring]) -> Vec { +pub fn validator_authority_id(val_ids: &[Sr25519Keyring]) -> Vec { val_ids.iter().map(|v| v.public().into()).collect() } -impl Default for TestState { - fn default() -> Self { - let validators = vec![ - Sr25519Keyring::Ferdie, // <- this node, role: validator - Sr25519Keyring::Alice, - Sr25519Keyring::Bob, - Sr25519Keyring::Charlie, - Sr25519Keyring::Dave, - ]; - - let validator_public = validator_pubkeys(&validators); - let validator_authority_id = validator_authority_id(&validators); - - let current = Hash::repeat_byte(1); - - let mut candidate = dummy_candidate_receipt(dummy_hash()); - - let session_index = 10; - - let persisted_validation_data = PersistedValidationData { - parent_head: HeadData(vec![7, 8, 9]), - relay_parent_number: Default::default(), - max_pov_size: 1024, - relay_parent_storage_root: Default::default(), - }; - - let pov = 
PoV { block_data: BlockData(vec![42; 64]) }; - - let available_data = AvailableData { - validation_data: persisted_validation_data.clone(), - pov: Arc::new(pov), - }; - - let (chunks, erasure_root) = derive_erasure_chunks_with_proofs_and_root( - validators.len(), - &available_data, - |_, _| {}, - ); - // Mess around: - let invalid_chunks = chunks - .iter() - .cloned() - .map(|mut chunk| { - if chunk.chunk.len() >= 2 && chunk.chunk[0] != chunk.chunk[1] { - chunk.chunk[0] = chunk.chunk[1]; - } else if chunk.chunk.len() >= 1 { - chunk.chunk[0] = !chunk.chunk[0]; - } else { - chunk.proof = Proof::dummy_proof(); - } - chunk - }) - .collect(); - debug_assert_ne!(chunks, invalid_chunks); - - candidate.descriptor.erasure_root = erasure_root; - candidate.descriptor.relay_parent = Hash::repeat_byte(10); - - Self { - validators, - validator_public, - validator_authority_id, - current, - candidate, - session_index, - persisted_validation_data, - available_data, - chunks, - invalid_chunks, - } - } +/// Map the chunks to the validators according to the availability chunk mapping algorithm. 
+fn map_chunks( + chunks: Vec, + node_features: &NodeFeatures, + n_validators: usize, + core_index: CoreIndex, +) -> IndexedVec { + let chunk_indices = + availability_chunk_indices(Some(node_features), n_validators, core_index).unwrap(); + + (0..n_validators) + .map(|val_idx| chunks[chunk_indices[val_idx].0 as usize].clone()) + .collect::>() + .into() } -#[test] -fn availability_is_recovered_from_chunks_if_no_group_provided() { +#[rstest] +#[case(true)] +#[case(false)] +fn availability_is_recovered_from_chunks_if_no_group_provided(#[case] systematic_recovery: bool) { let test_state = TestState::default(); let req_protocol_names = ReqProtocolNames::new(&GENESIS_HASH, None); - let subsystem = AvailabilityRecoverySubsystem::with_fast_path( - request_receiver(&req_protocol_names), - Metrics::new_dummy(), - ); + let (subsystem, threshold) = match systematic_recovery { + true => ( + with_fast_path_then_systematic_chunks( + request_receiver(&req_protocol_names), + &req_protocol_names, + Metrics::new_dummy(), + ), + test_state.systematic_threshold(), + ), + false => ( + with_fast_path( + request_receiver(&req_protocol_names), + &req_protocol_names, + Metrics::new_dummy(), + ), + test_state.threshold(), + ), + }; test_harness(subsystem, |mut virtual_overseer| async move { overseer_signal( @@ -495,12 +766,15 @@ fn availability_is_recovered_from_chunks_if_no_group_provided() { test_state.candidate.clone(), test_state.session_index, None, + Some(test_state.core_index), tx, ), ) .await; - test_state.test_runtime_api(&mut virtual_overseer).await; + test_state.test_runtime_api_session_info(&mut virtual_overseer).await; + + test_state.test_runtime_api_node_features(&mut virtual_overseer).await; let candidate_hash = test_state.candidate.hash(); @@ -512,8 +786,9 @@ fn availability_is_recovered_from_chunks_if_no_group_provided() { &req_protocol_names, candidate_hash, &mut virtual_overseer, - test_state.threshold(), + threshold, |_| Has::Yes, + systematic_recovery, ) .await; @@ 
-533,16 +808,31 @@ fn availability_is_recovered_from_chunks_if_no_group_provided() { new_candidate.clone(), test_state.session_index, None, + Some(test_state.core_index), tx, ), ) .await; - test_state.test_runtime_api(&mut virtual_overseer).await; - test_state.respond_to_available_data_query(&mut virtual_overseer, false).await; test_state.respond_to_query_all_request(&mut virtual_overseer, |_| false).await; + if systematic_recovery { + test_state + .test_chunk_requests( + &req_protocol_names, + new_candidate.hash(), + &mut virtual_overseer, + threshold, + |_| Has::No, + systematic_recovery, + ) + .await; + test_state.respond_to_query_all_request(&mut virtual_overseer, |_| false).await; + } + + // Even if the recovery is systematic, we'll always fall back to regular recovery, so keep + // this around. test_state .test_chunk_requests( &req_protocol_names, @@ -550,6 +840,7 @@ fn availability_is_recovered_from_chunks_if_no_group_provided() { &mut virtual_overseer, test_state.impossibility_threshold(), |_| Has::No, + false, ) .await; @@ -559,15 +850,33 @@ fn availability_is_recovered_from_chunks_if_no_group_provided() { }); } -#[test] -fn availability_is_recovered_from_chunks_even_if_backing_group_supplied_if_chunks_only() { - let test_state = TestState::default(); +#[rstest] +#[case(true)] +#[case(false)] +fn availability_is_recovered_from_chunks_even_if_backing_group_supplied_if_chunks_only( + #[case] systematic_recovery: bool, +) { let req_protocol_names = ReqProtocolNames::new(&GENESIS_HASH, None); - let subsystem = AvailabilityRecoverySubsystem::with_chunks_only( - request_receiver(&req_protocol_names), - Metrics::new_dummy(), - ); - + let test_state = TestState::default(); + let (subsystem, threshold) = match systematic_recovery { + true => ( + with_systematic_chunks( + request_receiver(&req_protocol_names), + &req_protocol_names, + Metrics::new_dummy(), + ), + test_state.systematic_threshold(), + ), + false => ( + with_chunks_only( + 
request_receiver(&req_protocol_names), + &req_protocol_names, + Metrics::new_dummy(), + ), + test_state.threshold(), + ), + }; + test_harness(subsystem, |mut virtual_overseer| async move { overseer_signal( &mut virtual_overseer, @@ -586,12 +895,15 @@ fn availability_is_recovered_from_chunks_even_if_backing_group_supplied_if_chunk test_state.candidate.clone(), test_state.session_index, Some(GroupIndex(0)), + Some(test_state.core_index), tx, ), ) .await; - test_state.test_runtime_api(&mut virtual_overseer).await; + test_state.test_runtime_api_session_info(&mut virtual_overseer).await; + + test_state.test_runtime_api_node_features(&mut virtual_overseer).await; let candidate_hash = test_state.candidate.hash(); @@ -603,8 +915,9 @@ fn availability_is_recovered_from_chunks_even_if_backing_group_supplied_if_chunk &req_protocol_names, candidate_hash, &mut virtual_overseer, - test_state.threshold(), + threshold, |_| Has::Yes, + systematic_recovery, ) .await; @@ -623,41 +936,80 @@ fn availability_is_recovered_from_chunks_even_if_backing_group_supplied_if_chunk AvailabilityRecoveryMessage::RecoverAvailableData( new_candidate.clone(), test_state.session_index, - None, + Some(GroupIndex(1)), + Some(test_state.core_index), tx, ), ) .await; - test_state.test_runtime_api(&mut virtual_overseer).await; - test_state.respond_to_available_data_query(&mut virtual_overseer, false).await; test_state.respond_to_query_all_request(&mut virtual_overseer, |_| false).await; - test_state - .test_chunk_requests( - &req_protocol_names, - new_candidate.hash(), - &mut virtual_overseer, - test_state.impossibility_threshold(), - |_| Has::No, - ) - .await; + if systematic_recovery { + test_state + .test_chunk_requests( + &req_protocol_names, + new_candidate.hash(), + &mut virtual_overseer, + threshold * SYSTEMATIC_CHUNKS_REQ_RETRY_LIMIT as usize, + |_| Has::No, + systematic_recovery, + ) + .await; + test_state.respond_to_query_all_request(&mut virtual_overseer, |_| false).await; + // Even if the 
recovery is systematic, we'll always fall back to regular recovery, so + // keep this around. + test_state + .test_chunk_requests( + &req_protocol_names, + new_candidate.hash(), + &mut virtual_overseer, + test_state.impossibility_threshold() - threshold, + |_| Has::No, + false, + ) + .await; + + // A request times out with `Unavailable` error. + assert_eq!(rx.await.unwrap().unwrap_err(), RecoveryError::Unavailable); + } else { + test_state + .test_chunk_requests( + &req_protocol_names, + new_candidate.hash(), + &mut virtual_overseer, + test_state.impossibility_threshold(), + |_| Has::No, + false, + ) + .await; - // A request times out with `Unavailable` error. - assert_eq!(rx.await.unwrap().unwrap_err(), RecoveryError::Unavailable); + // A request times out with `Unavailable` error. + assert_eq!(rx.await.unwrap().unwrap_err(), RecoveryError::Unavailable); + } virtual_overseer }); } -#[test] -fn bad_merkle_path_leads_to_recovery_error() { - let mut test_state = TestState::default(); +#[rstest] +#[case(true)] +#[case(false)] +fn bad_merkle_path_leads_to_recovery_error(#[case] systematic_recovery: bool) { let req_protocol_names = ReqProtocolNames::new(&GENESIS_HASH, None); - let subsystem = AvailabilityRecoverySubsystem::with_fast_path( - request_receiver(&req_protocol_names), - Metrics::new_dummy(), - ); + let mut test_state = TestState::default(); + let subsystem = match systematic_recovery { + true => with_systematic_chunks( + request_receiver(&req_protocol_names), + &req_protocol_names, + Metrics::new_dummy(), + ), + false => with_chunks_only( + request_receiver(&req_protocol_names), + &req_protocol_names, + Metrics::new_dummy(), + ), + }; test_harness(subsystem, |mut virtual_overseer| async move { overseer_signal( @@ -677,25 +1029,40 @@ fn bad_merkle_path_leads_to_recovery_error() { test_state.candidate.clone(), test_state.session_index, None, + Some(test_state.core_index), tx, ), ) .await; - test_state.test_runtime_api(&mut virtual_overseer).await; + 
test_state.test_runtime_api_session_info(&mut virtual_overseer).await; + + test_state.test_runtime_api_node_features(&mut virtual_overseer).await; let candidate_hash = test_state.candidate.hash(); // Create some faulty chunks. - test_state.chunks[0].chunk = vec![0; 32]; - test_state.chunks[1].chunk = vec![1; 32]; - test_state.chunks[2].chunk = vec![2; 32]; - test_state.chunks[3].chunk = vec![3; 32]; - test_state.chunks[4].chunk = vec![4; 32]; + for chunk in test_state.chunks.iter_mut() { + chunk.chunk = vec![0; 32]; + } test_state.respond_to_available_data_query(&mut virtual_overseer, false).await; test_state.respond_to_query_all_request(&mut virtual_overseer, |_| false).await; + if systematic_recovery { + test_state + .test_chunk_requests( + &req_protocol_names, + candidate_hash, + &mut virtual_overseer, + test_state.systematic_threshold(), + |_| Has::No, + systematic_recovery, + ) + .await; + test_state.respond_to_query_all_request(&mut virtual_overseer, |_| false).await; + } + test_state .test_chunk_requests( &req_protocol_names, @@ -703,6 +1070,7 @@ fn bad_merkle_path_leads_to_recovery_error() { &mut virtual_overseer, test_state.impossibility_threshold(), |_| Has::Yes, + false, ) .await; @@ -712,14 +1080,24 @@ fn bad_merkle_path_leads_to_recovery_error() { }); } -#[test] -fn wrong_chunk_index_leads_to_recovery_error() { +#[rstest] +#[case(true)] +#[case(false)] +fn wrong_chunk_index_leads_to_recovery_error(#[case] systematic_recovery: bool) { let mut test_state = TestState::default(); let req_protocol_names = ReqProtocolNames::new(&GENESIS_HASH, None); - let subsystem = AvailabilityRecoverySubsystem::with_fast_path( - request_receiver(&req_protocol_names), - Metrics::new_dummy(), - ); + let subsystem = match systematic_recovery { + true => with_systematic_chunks( + request_receiver(&req_protocol_names), + &req_protocol_names, + Metrics::new_dummy(), + ), + false => with_chunks_only( + request_receiver(&req_protocol_names), + &req_protocol_names, + 
Metrics::new_dummy(), + ), + }; test_harness(subsystem, |mut virtual_overseer| async move { overseer_signal( @@ -739,32 +1117,55 @@ fn wrong_chunk_index_leads_to_recovery_error() { test_state.candidate.clone(), test_state.session_index, None, + Some(test_state.core_index), tx, ), ) .await; - test_state.test_runtime_api(&mut virtual_overseer).await; + test_state.test_runtime_api_session_info(&mut virtual_overseer).await; - let candidate_hash = test_state.candidate.hash(); + test_state.test_runtime_api_node_features(&mut virtual_overseer).await; - // These chunks should fail the index check as they don't have the correct index for - // validator. - test_state.chunks[1] = test_state.chunks[0].clone(); - test_state.chunks[2] = test_state.chunks[0].clone(); - test_state.chunks[3] = test_state.chunks[0].clone(); - test_state.chunks[4] = test_state.chunks[0].clone(); + let candidate_hash = test_state.candidate.hash(); test_state.respond_to_available_data_query(&mut virtual_overseer, false).await; test_state.respond_to_query_all_request(&mut virtual_overseer, |_| false).await; + // Chunks should fail the index check as they don't have the correct index. + + // *(test_state.chunks.get_mut(0.into()).unwrap()) = + // test_state.chunks.get(1.into()).unwrap().clone(); + let first_chunk = test_state.chunks.get(0.into()).unwrap().clone(); + for c_index in 1..test_state.chunks.len() { + *(test_state.chunks.get_mut(ValidatorIndex(c_index as u32)).unwrap()) = + first_chunk.clone(); + } + + if systematic_recovery { + test_state + .test_chunk_requests( + &req_protocol_names, + candidate_hash, + &mut virtual_overseer, + test_state.systematic_threshold(), + |_| Has::Yes, + // We set this to false, as we know we will be requesting the wrong indices. 
+ false, + ) + .await; + + test_state.respond_to_query_all_request(&mut virtual_overseer, |_| false).await; + } + test_state .test_chunk_requests( &req_protocol_names, candidate_hash, &mut virtual_overseer, - test_state.impossibility_threshold(), - |_| Has::No, + test_state.chunks.len() - 1, + |_| Has::Yes, + false, ) .await; @@ -774,14 +1175,30 @@ fn wrong_chunk_index_leads_to_recovery_error() { }); } -#[test] -fn invalid_erasure_coding_leads_to_invalid_error() { +#[rstest] +#[case(true)] +#[case(false)] +fn invalid_erasure_coding_leads_to_invalid_error(#[case] systematic_recovery: bool) { let mut test_state = TestState::default(); let req_protocol_names = ReqProtocolNames::new(&GENESIS_HASH, None); - let subsystem = AvailabilityRecoverySubsystem::with_fast_path( - request_receiver(&req_protocol_names), - Metrics::new_dummy(), - ); + let (subsystem, threshold) = match systematic_recovery { + true => ( + with_fast_path_then_systematic_chunks( + request_receiver(&req_protocol_names), + &req_protocol_names, + Metrics::new_dummy(), + ), + test_state.systematic_threshold(), + ), + false => ( + with_fast_path( + request_receiver(&req_protocol_names), + &req_protocol_names, + Metrics::new_dummy(), + ), + test_state.threshold(), + ), + }; test_harness(subsystem, |mut virtual_overseer| async move { let pov = PoV { block_data: BlockData(vec![69; 64]) }; @@ -795,7 +1212,12 @@ fn invalid_erasure_coding_leads_to_invalid_error() { |i, chunk| *chunk = vec![i as u8; 32], ); - test_state.chunks = bad_chunks; + test_state.chunks = map_chunks( + bad_chunks, + &test_state.node_features, + test_state.validators.len(), + test_state.core_index, + ); test_state.candidate.descriptor.erasure_root = bad_erasure_root; let candidate_hash = test_state.candidate.hash(); @@ -817,12 +1239,15 @@ fn invalid_erasure_coding_leads_to_invalid_error() { test_state.candidate.clone(), test_state.session_index, None, + Some(test_state.core_index), tx, ), ) .await; - test_state.test_runtime_api(&mut 
virtual_overseer).await; + test_state.test_runtime_api_session_info(&mut virtual_overseer).await; + + test_state.test_runtime_api_node_features(&mut virtual_overseer).await; test_state.respond_to_available_data_query(&mut virtual_overseer, false).await; test_state.respond_to_query_all_request(&mut virtual_overseer, |_| false).await; @@ -832,8 +1257,9 @@ fn invalid_erasure_coding_leads_to_invalid_error() { &req_protocol_names, candidate_hash, &mut virtual_overseer, - test_state.threshold(), + threshold, |_| Has::Yes, + systematic_recovery, ) .await; @@ -843,12 +1269,74 @@ fn invalid_erasure_coding_leads_to_invalid_error() { }); } +#[test] +fn invalid_pov_hash_leads_to_invalid_error() { + let mut test_state = TestState::default(); + let req_protocol_names = ReqProtocolNames::new(&GENESIS_HASH, None); + let subsystem = AvailabilityRecoverySubsystem::for_collator( + None, + request_receiver(&req_protocol_names), + &req_protocol_names, + Metrics::new_dummy(), + ); + + test_harness(subsystem, |mut virtual_overseer| async move { + let pov = PoV { block_data: BlockData(vec![69; 64]) }; + + test_state.candidate.descriptor.pov_hash = pov.hash(); + + let candidate_hash = test_state.candidate.hash(); + + overseer_signal( + &mut virtual_overseer, + OverseerSignal::ActiveLeaves(ActiveLeavesUpdate::start_work(new_leaf( + test_state.current, + 1, + ))), + ) + .await; + + let (tx, rx) = oneshot::channel(); + + overseer_send( + &mut virtual_overseer, + AvailabilityRecoveryMessage::RecoverAvailableData( + test_state.candidate.clone(), + test_state.session_index, + None, + Some(test_state.core_index), + tx, + ), + ) + .await; + + test_state.test_runtime_api_session_info(&mut virtual_overseer).await; + + test_state.test_runtime_api_node_features(&mut virtual_overseer).await; + + test_state + .test_chunk_requests( + &req_protocol_names, + candidate_hash, + &mut virtual_overseer, + test_state.threshold(), + |_| Has::Yes, + false, + ) + .await; + + 
assert_eq!(rx.await.unwrap().unwrap_err(), RecoveryError::Invalid); + virtual_overseer + }); +} + #[test] fn fast_path_backing_group_recovers() { let test_state = TestState::default(); let req_protocol_names = ReqProtocolNames::new(&GENESIS_HASH, None); - let subsystem = AvailabilityRecoverySubsystem::with_fast_path( + let subsystem = with_fast_path( request_receiver(&req_protocol_names), + &req_protocol_names, Metrics::new_dummy(), ); @@ -870,12 +1358,14 @@ fn fast_path_backing_group_recovers() { test_state.candidate.clone(), test_state.session_index, Some(GroupIndex(0)), + Some(test_state.core_index), tx, ), ) .await; - test_state.test_runtime_api(&mut virtual_overseer).await; + test_state.test_runtime_api_session_info(&mut virtual_overseer).await; + test_state.test_runtime_api_node_features(&mut virtual_overseer).await; let candidate_hash = test_state.candidate.hash(); @@ -892,6 +1382,7 @@ fn fast_path_backing_group_recovers() { candidate_hash, &mut virtual_overseer, who_has, + GroupIndex(0), ) .await; @@ -901,15 +1392,47 @@ fn fast_path_backing_group_recovers() { }); } -#[test] -fn recovers_from_only_chunks_if_pov_large() { - let test_state = TestState::default(); +#[rstest] +#[case(true, false)] +#[case(false, true)] +#[case(false, false)] +fn recovers_from_only_chunks_if_pov_large( + #[case] systematic_recovery: bool, + #[case] for_collator: bool, +) { + let mut test_state = TestState::default(); let req_protocol_names = ReqProtocolNames::new(&GENESIS_HASH, None); - let subsystem = AvailabilityRecoverySubsystem::with_chunks_if_pov_large( - Some(FETCH_CHUNKS_THRESHOLD), - request_receiver(&req_protocol_names), - Metrics::new_dummy(), - ); + let (subsystem, threshold) = match (systematic_recovery, for_collator) { + (true, false) => ( + with_systematic_chunks_if_pov_large( + request_receiver(&req_protocol_names), + &req_protocol_names, + Metrics::new_dummy(), + ), + test_state.systematic_threshold(), + ), + (false, false) => ( + with_chunks_if_pov_large( + 
request_receiver(&req_protocol_names), + &req_protocol_names, + Metrics::new_dummy(), + ), + test_state.threshold(), + ), + (false, true) => { + test_state.candidate.descriptor.pov_hash = test_state.available_data.pov.hash(); + ( + AvailabilityRecoverySubsystem::for_collator( + None, + request_receiver(&req_protocol_names), + &req_protocol_names, + Metrics::new_dummy(), + ), + test_state.threshold(), + ) + }, + (_, _) => unreachable!(), + }; test_harness(subsystem, |mut virtual_overseer| async move { overseer_signal( @@ -929,12 +1452,15 @@ fn recovers_from_only_chunks_if_pov_large() { test_state.candidate.clone(), test_state.session_index, Some(GroupIndex(0)), + Some(test_state.core_index), tx, ), ) .await; - test_state.test_runtime_api(&mut virtual_overseer).await; + test_state.test_runtime_api_session_info(&mut virtual_overseer).await; + + test_state.test_runtime_api_node_features(&mut virtual_overseer).await; let candidate_hash = test_state.candidate.hash(); @@ -947,16 +1473,19 @@ fn recovers_from_only_chunks_if_pov_large() { } ); - test_state.respond_to_available_data_query(&mut virtual_overseer, false).await; - test_state.respond_to_query_all_request(&mut virtual_overseer, |_| false).await; + if !for_collator { + test_state.respond_to_available_data_query(&mut virtual_overseer, false).await; + test_state.respond_to_query_all_request(&mut virtual_overseer, |_| false).await; + } test_state .test_chunk_requests( &req_protocol_names, candidate_hash, &mut virtual_overseer, - test_state.threshold(), + threshold, |_| Has::Yes, + systematic_recovery, ) .await; @@ -975,14 +1504,13 @@ fn recovers_from_only_chunks_if_pov_large() { AvailabilityRecoveryMessage::RecoverAvailableData( new_candidate.clone(), test_state.session_index, - Some(GroupIndex(0)), + Some(GroupIndex(1)), + Some(test_state.core_index), tx, ), ) .await; - test_state.test_runtime_api(&mut virtual_overseer).await; - assert_matches!( overseer_recv(&mut virtual_overseer).await, 
AllMessages::AvailabilityStore( @@ -992,18 +1520,48 @@ fn recovers_from_only_chunks_if_pov_large() { } ); - test_state.respond_to_available_data_query(&mut virtual_overseer, false).await; - test_state.respond_to_query_all_request(&mut virtual_overseer, |_| false).await; + if !for_collator { + test_state.respond_to_available_data_query(&mut virtual_overseer, false).await; + test_state.respond_to_query_all_request(&mut virtual_overseer, |_| false).await; + } - test_state - .test_chunk_requests( - &req_protocol_names, - new_candidate.hash(), - &mut virtual_overseer, - test_state.impossibility_threshold(), - |_| Has::No, - ) - .await; + if systematic_recovery { + test_state + .test_chunk_requests( + &req_protocol_names, + new_candidate.hash(), + &mut virtual_overseer, + test_state.systematic_threshold() * SYSTEMATIC_CHUNKS_REQ_RETRY_LIMIT as usize, + |_| Has::No, + systematic_recovery, + ) + .await; + if !for_collator { + test_state.respond_to_query_all_request(&mut virtual_overseer, |_| false).await; + } + // Even if the recovery is systematic, we'll always fall back to regular recovery. + test_state + .test_chunk_requests( + &req_protocol_names, + new_candidate.hash(), + &mut virtual_overseer, + test_state.impossibility_threshold() - threshold, + |_| Has::No, + false, + ) + .await; + } else { + test_state + .test_chunk_requests( + &req_protocol_names, + new_candidate.hash(), + &mut virtual_overseer, + test_state.impossibility_threshold(), + |_| Has::No, + false, + ) + .await; + } // A request times out with `Unavailable` error. 
assert_eq!(rx.await.unwrap().unwrap_err(), RecoveryError::Unavailable); @@ -1011,15 +1569,40 @@ fn recovers_from_only_chunks_if_pov_large() { }); } -#[test] -fn fast_path_backing_group_recovers_if_pov_small() { - let test_state = TestState::default(); +#[rstest] +#[case(true, false)] +#[case(false, true)] +#[case(false, false)] +fn fast_path_backing_group_recovers_if_pov_small( + #[case] systematic_recovery: bool, + #[case] for_collator: bool, +) { + let mut test_state = TestState::default(); let req_protocol_names = ReqProtocolNames::new(&GENESIS_HASH, None); - let subsystem = AvailabilityRecoverySubsystem::with_chunks_if_pov_large( - Some(FETCH_CHUNKS_THRESHOLD), - request_receiver(&req_protocol_names), - Metrics::new_dummy(), - ); + + let subsystem = match (systematic_recovery, for_collator) { + (true, false) => with_systematic_chunks_if_pov_large( + request_receiver(&req_protocol_names), + &req_protocol_names, + Metrics::new_dummy(), + ), + + (false, false) => with_chunks_if_pov_large( + request_receiver(&req_protocol_names), + &req_protocol_names, + Metrics::new_dummy(), + ), + (false, true) => { + test_state.candidate.descriptor.pov_hash = test_state.available_data.pov.hash(); + AvailabilityRecoverySubsystem::for_collator( + None, + request_receiver(&req_protocol_names), + &req_protocol_names, + Metrics::new_dummy(), + ) + }, + (_, _) => unreachable!(), + }; test_harness(subsystem, |mut virtual_overseer| async move { overseer_signal( @@ -1039,12 +1622,15 @@ fn fast_path_backing_group_recovers_if_pov_small() { test_state.candidate.clone(), test_state.session_index, Some(GroupIndex(0)), + Some(test_state.core_index), tx, ), ) .await; - test_state.test_runtime_api(&mut virtual_overseer).await; + test_state.test_runtime_api_session_info(&mut virtual_overseer).await; + + test_state.test_runtime_api_node_features(&mut virtual_overseer).await; let candidate_hash = test_state.candidate.hash(); @@ -1062,7 +1648,9 @@ fn fast_path_backing_group_recovers_if_pov_small() { 
} ); - test_state.respond_to_available_data_query(&mut virtual_overseer, false).await; + if !for_collator { + test_state.respond_to_available_data_query(&mut virtual_overseer, false).await; + } test_state .test_full_data_requests( @@ -1070,6 +1658,7 @@ fn fast_path_backing_group_recovers_if_pov_small() { candidate_hash, &mut virtual_overseer, who_has, + GroupIndex(0), ) .await; @@ -1079,14 +1668,31 @@ fn fast_path_backing_group_recovers_if_pov_small() { }); } -#[test] -fn no_answers_in_fast_path_causes_chunk_requests() { +#[rstest] +#[case(true)] +#[case(false)] +fn no_answers_in_fast_path_causes_chunk_requests(#[case] systematic_recovery: bool) { let test_state = TestState::default(); let req_protocol_names = ReqProtocolNames::new(&GENESIS_HASH, None); - let subsystem = AvailabilityRecoverySubsystem::with_fast_path( - request_receiver(&req_protocol_names), - Metrics::new_dummy(), - ); + + let (subsystem, threshold) = match systematic_recovery { + true => ( + with_fast_path_then_systematic_chunks( + request_receiver(&req_protocol_names), + &req_protocol_names, + Metrics::new_dummy(), + ), + test_state.systematic_threshold(), + ), + false => ( + with_fast_path( + request_receiver(&req_protocol_names), + &req_protocol_names, + Metrics::new_dummy(), + ), + test_state.threshold(), + ), + }; test_harness(subsystem, |mut virtual_overseer| async move { overseer_signal( @@ -1106,12 +1712,15 @@ fn no_answers_in_fast_path_causes_chunk_requests() { test_state.candidate.clone(), test_state.session_index, Some(GroupIndex(0)), + Some(test_state.core_index), tx, ), ) .await; - test_state.test_runtime_api(&mut virtual_overseer).await; + test_state.test_runtime_api_session_info(&mut virtual_overseer).await; + + test_state.test_runtime_api_node_features(&mut virtual_overseer).await; let candidate_hash = test_state.candidate.hash(); @@ -1129,6 +1738,7 @@ fn no_answers_in_fast_path_causes_chunk_requests() { candidate_hash, &mut virtual_overseer, who_has, + GroupIndex(0), ) .await; @@ 
-1139,8 +1749,9 @@ fn no_answers_in_fast_path_causes_chunk_requests() { &req_protocol_names, candidate_hash, &mut virtual_overseer, - test_state.threshold(), + threshold, |_| Has::Yes, + systematic_recovery, ) .await; @@ -1150,14 +1761,25 @@ fn no_answers_in_fast_path_causes_chunk_requests() { }); } -#[test] -fn task_canceled_when_receivers_dropped() { +#[rstest] +#[case(true)] +#[case(false)] +fn task_canceled_when_receivers_dropped(#[case] systematic_recovery: bool) { let test_state = TestState::default(); let req_protocol_names = ReqProtocolNames::new(&GENESIS_HASH, None); - let subsystem = AvailabilityRecoverySubsystem::with_chunks_only( - request_receiver(&req_protocol_names), - Metrics::new_dummy(), - ); + + let subsystem = match systematic_recovery { + true => with_systematic_chunks( + request_receiver(&req_protocol_names), + &req_protocol_names, + Metrics::new_dummy(), + ), + false => with_chunks_only( + request_receiver(&req_protocol_names), + &req_protocol_names, + Metrics::new_dummy(), + ), + }; test_harness(subsystem, |mut virtual_overseer| async move { overseer_signal( @@ -1177,12 +1799,15 @@ fn task_canceled_when_receivers_dropped() { test_state.candidate.clone(), test_state.session_index, None, + Some(test_state.core_index), tx, ), ) .await; - test_state.test_runtime_api(&mut virtual_overseer).await; + test_state.test_runtime_api_session_info(&mut virtual_overseer).await; + + test_state.test_runtime_api_node_features(&mut virtual_overseer).await; for _ in 0..test_state.validators.len() { match virtual_overseer.recv().timeout(TIMEOUT).await { @@ -1195,14 +1820,24 @@ fn task_canceled_when_receivers_dropped() { }); } -#[test] -fn chunks_retry_until_all_nodes_respond() { +#[rstest] +#[case(true)] +#[case(false)] +fn chunks_retry_until_all_nodes_respond(#[case] systematic_recovery: bool) { let test_state = TestState::default(); let req_protocol_names = ReqProtocolNames::new(&GENESIS_HASH, None); - let subsystem = 
AvailabilityRecoverySubsystem::with_chunks_only( - request_receiver(&req_protocol_names), - Metrics::new_dummy(), - ); + let subsystem = match systematic_recovery { + true => with_systematic_chunks( + request_receiver(&req_protocol_names), + &req_protocol_names, + Metrics::new_dummy(), + ), + false => with_chunks_only( + request_receiver(&req_protocol_names), + &req_protocol_names, + Metrics::new_dummy(), + ), + }; test_harness(subsystem, |mut virtual_overseer| async move { overseer_signal( @@ -1221,30 +1856,51 @@ fn chunks_retry_until_all_nodes_respond() { AvailabilityRecoveryMessage::RecoverAvailableData( test_state.candidate.clone(), test_state.session_index, - Some(GroupIndex(0)), + None, + Some(test_state.core_index), tx, ), ) .await; - test_state.test_runtime_api(&mut virtual_overseer).await; + test_state.test_runtime_api_session_info(&mut virtual_overseer).await; + + test_state.test_runtime_api_node_features(&mut virtual_overseer).await; let candidate_hash = test_state.candidate.hash(); test_state.respond_to_available_data_query(&mut virtual_overseer, false).await; test_state.respond_to_query_all_request(&mut virtual_overseer, |_| false).await; + if systematic_recovery { + for _ in 0..SYSTEMATIC_CHUNKS_REQ_RETRY_LIMIT { + test_state + .test_chunk_requests( + &req_protocol_names, + candidate_hash, + &mut virtual_overseer, + test_state.systematic_threshold(), + |_| Has::timeout(), + true, + ) + .await; + } + test_state.respond_to_query_all_request(&mut virtual_overseer, |_| false).await; + } + test_state .test_chunk_requests( &req_protocol_names, candidate_hash, &mut virtual_overseer, - test_state.validators.len() - test_state.threshold(), + test_state.impossibility_threshold(), |_| Has::timeout(), + false, ) .await; - // we get to go another round! + // We get to go another round! Actually, we get to go `REGULAR_CHUNKS_REQ_RETRY_LIMIT` + // number of times. 
test_state .test_chunk_requests( &req_protocol_names, @@ -1252,21 +1908,23 @@ fn chunks_retry_until_all_nodes_respond() { &mut virtual_overseer, test_state.impossibility_threshold(), |_| Has::No, + false, ) .await; - // Recovered data should match the original one. + // Recovery is impossible. assert_eq!(rx.await.unwrap().unwrap_err(), RecoveryError::Unavailable); virtual_overseer }); } #[test] -fn not_returning_requests_wont_stall_retrieval() { +fn network_bridge_not_returning_responses_wont_stall_retrieval() { let test_state = TestState::default(); let req_protocol_names = ReqProtocolNames::new(&GENESIS_HASH, None); - let subsystem = AvailabilityRecoverySubsystem::with_chunks_only( + let subsystem = with_chunks_only( request_receiver(&req_protocol_names), + &req_protocol_names, Metrics::new_dummy(), ); @@ -1288,12 +1946,15 @@ fn not_returning_requests_wont_stall_retrieval() { test_state.candidate.clone(), test_state.session_index, Some(GroupIndex(0)), + Some(test_state.core_index), tx, ), ) .await; - test_state.test_runtime_api(&mut virtual_overseer).await; + test_state.test_runtime_api_session_info(&mut virtual_overseer).await; + + test_state.test_runtime_api_node_features(&mut virtual_overseer).await; let candidate_hash = test_state.candidate.hash(); @@ -1311,6 +1972,7 @@ fn not_returning_requests_wont_stall_retrieval() { &mut virtual_overseer, not_returning_count, |_| Has::DoesNotReturn, + false, ) .await; @@ -1322,6 +1984,7 @@ fn not_returning_requests_wont_stall_retrieval() { // Should start over: test_state.validators.len() + 3, |_| Has::timeout(), + false, ) .await; @@ -1333,6 +1996,7 @@ fn not_returning_requests_wont_stall_retrieval() { &mut virtual_overseer, test_state.threshold(), |_| Has::Yes, + false, ) .await; @@ -1342,14 +2006,24 @@ fn not_returning_requests_wont_stall_retrieval() { }); } -#[test] -fn all_not_returning_requests_still_recovers_on_return() { +#[rstest] +#[case(true)] +#[case(false)] +fn 
all_not_returning_requests_still_recovers_on_return(#[case] systematic_recovery: bool) { let test_state = TestState::default(); let req_protocol_names = ReqProtocolNames::new(&GENESIS_HASH, None); - let subsystem = AvailabilityRecoverySubsystem::with_chunks_only( - request_receiver(&req_protocol_names), - Metrics::new_dummy(), - ); + let subsystem = match systematic_recovery { + true => with_systematic_chunks( + request_receiver(&req_protocol_names), + &req_protocol_names, + Metrics::new_dummy(), + ), + false => with_chunks_only( + request_receiver(&req_protocol_names), + &req_protocol_names, + Metrics::new_dummy(), + ), + }; test_harness(subsystem, |mut virtual_overseer| async move { overseer_signal( @@ -1368,46 +2042,64 @@ fn all_not_returning_requests_still_recovers_on_return() { AvailabilityRecoveryMessage::RecoverAvailableData( test_state.candidate.clone(), test_state.session_index, - Some(GroupIndex(0)), + None, + Some(test_state.core_index), tx, ), ) .await; - test_state.test_runtime_api(&mut virtual_overseer).await; + test_state.test_runtime_api_session_info(&mut virtual_overseer).await; + + test_state.test_runtime_api_node_features(&mut virtual_overseer).await; let candidate_hash = test_state.candidate.hash(); test_state.respond_to_available_data_query(&mut virtual_overseer, false).await; test_state.respond_to_query_all_request(&mut virtual_overseer, |_| false).await; + let n = if systematic_recovery { + test_state.systematic_threshold() + } else { + test_state.validators.len() + }; let senders = test_state .test_chunk_requests( &req_protocol_names, candidate_hash, &mut virtual_overseer, - test_state.validators.len(), + n, |_| Has::DoesNotReturn, + systematic_recovery, ) .await; future::join( async { Delay::new(Duration::from_millis(10)).await; - // Now retrieval should be able to recover. + // Now retrieval should be able to progress. 
std::mem::drop(senders); }, - test_state.test_chunk_requests( - &req_protocol_names, - candidate_hash, - &mut virtual_overseer, - // Should start over: - test_state.validators.len() + 3, - |_| Has::timeout(), - ), + async { + test_state + .test_chunk_requests( + &req_protocol_names, + candidate_hash, + &mut virtual_overseer, + // Should start over: + n, + |_| Has::timeout(), + systematic_recovery, + ) + .await + }, ) .await; + if systematic_recovery { + test_state.respond_to_query_all_request(&mut virtual_overseer, |_| false).await; + } + // we get to go another round! test_state .test_chunk_requests( @@ -1416,6 +2108,7 @@ fn all_not_returning_requests_still_recovers_on_return() { &mut virtual_overseer, test_state.threshold(), |_| Has::Yes, + false, ) .await; @@ -1425,14 +2118,24 @@ fn all_not_returning_requests_still_recovers_on_return() { }); } -#[test] -fn returns_early_if_we_have_the_data() { +#[rstest] +#[case(true)] +#[case(false)] +fn returns_early_if_we_have_the_data(#[case] systematic_recovery: bool) { let test_state = TestState::default(); let req_protocol_names = ReqProtocolNames::new(&GENESIS_HASH, None); - let subsystem = AvailabilityRecoverySubsystem::with_chunks_only( - request_receiver(&req_protocol_names), - Metrics::new_dummy(), - ); + let subsystem = match systematic_recovery { + true => with_systematic_chunks( + request_receiver(&req_protocol_names), + &req_protocol_names, + Metrics::new_dummy(), + ), + false => with_chunks_only( + request_receiver(&req_protocol_names), + &req_protocol_names, + Metrics::new_dummy(), + ), + }; test_harness(subsystem, |mut virtual_overseer| async move { overseer_signal( @@ -1452,12 +2155,15 @@ fn returns_early_if_we_have_the_data() { test_state.candidate.clone(), test_state.session_index, None, + Some(test_state.core_index), tx, ), ) .await; - test_state.test_runtime_api(&mut virtual_overseer).await; + test_state.test_runtime_api_session_info(&mut virtual_overseer).await; + + 
test_state.test_runtime_api_node_features(&mut virtual_overseer).await; test_state.respond_to_available_data_query(&mut virtual_overseer, true).await; assert_eq!(rx.await.unwrap().unwrap(), test_state.available_data); @@ -1466,11 +2172,12 @@ fn returns_early_if_we_have_the_data() { } #[test] -fn does_not_query_local_validator() { +fn returns_early_if_present_in_the_subsystem_cache() { let test_state = TestState::default(); let req_protocol_names = ReqProtocolNames::new(&GENESIS_HASH, None); - let subsystem = AvailabilityRecoverySubsystem::with_chunks_only( + let subsystem = with_fast_path( request_receiver(&req_protocol_names), + &req_protocol_names, Metrics::new_dummy(), ); @@ -1491,36 +2198,222 @@ fn does_not_query_local_validator() { AvailabilityRecoveryMessage::RecoverAvailableData( test_state.candidate.clone(), test_state.session_index, - None, + Some(GroupIndex(0)), + Some(test_state.core_index), tx, ), ) .await; - test_state.test_runtime_api(&mut virtual_overseer).await; - test_state.respond_to_available_data_query(&mut virtual_overseer, false).await; - test_state.respond_to_query_all_request(&mut virtual_overseer, |i| i == 0).await; + test_state.test_runtime_api_session_info(&mut virtual_overseer).await; + + test_state.test_runtime_api_node_features(&mut virtual_overseer).await; let candidate_hash = test_state.candidate.hash(); + let who_has = |i| match i { + 3 => Has::Yes, + _ => Has::No, + }; + + test_state.respond_to_available_data_query(&mut virtual_overseer, false).await; + test_state - .test_chunk_requests( + .test_full_data_requests( &req_protocol_names, candidate_hash, &mut virtual_overseer, - test_state.validators.len(), - |i| if i == 0 { panic!("requested from local validator") } else { Has::timeout() }, + who_has, + GroupIndex(0), ) .await; - // second round, make sure it uses the local chunk. - test_state - .test_chunk_requests( + // Recovered data should match the original one. 
+ assert_eq!(rx.await.unwrap().unwrap(), test_state.available_data); + + // A second recovery for the same candidate will return early as it'll be present in the + // cache. + let (tx, rx) = oneshot::channel(); + overseer_send( + &mut virtual_overseer, + AvailabilityRecoveryMessage::RecoverAvailableData( + test_state.candidate.clone(), + test_state.session_index, + Some(GroupIndex(0)), + Some(test_state.core_index), + tx, + ), + ) + .await; + assert_eq!(rx.await.unwrap().unwrap(), test_state.available_data); + + virtual_overseer + }); +} + +#[rstest] +#[case(true)] +#[case(false)] +fn does_not_query_local_validator(#[case] systematic_recovery: bool) { + let test_state = TestState::default(); + let req_protocol_names = ReqProtocolNames::new(&GENESIS_HASH, None); + let (subsystem, threshold) = match systematic_recovery { + true => ( + with_systematic_chunks( + request_receiver(&req_protocol_names), + &req_protocol_names, + Metrics::new_dummy(), + ), + test_state.systematic_threshold(), + ), + false => ( + with_chunks_only( + request_receiver(&req_protocol_names), + &req_protocol_names, + Metrics::new_dummy(), + ), + test_state.threshold(), + ), + }; + + test_harness(subsystem, |mut virtual_overseer| async move { + overseer_signal( + &mut virtual_overseer, + OverseerSignal::ActiveLeaves(ActiveLeavesUpdate::start_work(new_leaf( + test_state.current, + 1, + ))), + ) + .await; + + let (tx, rx) = oneshot::channel(); + + overseer_send( + &mut virtual_overseer, + AvailabilityRecoveryMessage::RecoverAvailableData( + test_state.candidate.clone(), + test_state.session_index, + None, + Some(test_state.core_index), + tx, + ), + ) + .await; + + test_state.test_runtime_api_session_info(&mut virtual_overseer).await; + + test_state.test_runtime_api_node_features(&mut virtual_overseer).await; + test_state.respond_to_available_data_query(&mut virtual_overseer, false).await; + test_state + .respond_to_query_all_request(&mut virtual_overseer, |i| i.0 == 0) + .await; + + let 
candidate_hash = test_state.candidate.hash(); + + // second round, make sure it uses the local chunk. + test_state + .test_chunk_requests( + &req_protocol_names, + candidate_hash, + &mut virtual_overseer, + threshold - 1, + |i| if i.0 == 0 { panic!("requested from local validator") } else { Has::Yes }, + systematic_recovery, + ) + .await; + + assert_eq!(rx.await.unwrap().unwrap(), test_state.available_data); + virtual_overseer + }); +} + +#[rstest] +#[case(true)] +#[case(false)] +fn invalid_local_chunk(#[case] systematic_recovery: bool) { + let test_state = TestState::default(); + let req_protocol_names = ReqProtocolNames::new(&GENESIS_HASH, None); + let subsystem = match systematic_recovery { + true => with_systematic_chunks( + request_receiver(&req_protocol_names), + &req_protocol_names, + Metrics::new_dummy(), + ), + false => with_chunks_only( + request_receiver(&req_protocol_names), + &req_protocol_names, + Metrics::new_dummy(), + ), + }; + + test_harness(subsystem, |mut virtual_overseer| async move { + overseer_signal( + &mut virtual_overseer, + OverseerSignal::ActiveLeaves(ActiveLeavesUpdate::start_work(new_leaf( + test_state.current, + 1, + ))), + ) + .await; + + let (tx, rx) = oneshot::channel(); + + overseer_send( + &mut virtual_overseer, + AvailabilityRecoveryMessage::RecoverAvailableData( + test_state.candidate.clone(), + test_state.session_index, + None, + Some(test_state.core_index), + tx, + ), + ) + .await; + + test_state.test_runtime_api_session_info(&mut virtual_overseer).await; + + test_state.test_runtime_api_node_features(&mut virtual_overseer).await; + test_state.respond_to_available_data_query(&mut virtual_overseer, false).await; + + let validator_index_for_first_chunk = test_state + .chunks + .iter() + .enumerate() + .find_map(|(val_idx, chunk)| if chunk.index.0 == 0 { Some(val_idx) } else { None }) + .unwrap() as u32; + + test_state + .respond_to_query_all_request_invalid(&mut virtual_overseer, |i| { + i.0 == validator_index_for_first_chunk + 
}) + .await; + + let candidate_hash = test_state.candidate.hash(); + + // If systematic recovery detects invalid local chunk, it'll directly go to regular + // recovery, if we were the one holding an invalid chunk. + if systematic_recovery { + test_state + .respond_to_query_all_request_invalid(&mut virtual_overseer, |i| { + i.0 == validator_index_for_first_chunk + }) + .await; + } + + test_state + .test_chunk_requests( &req_protocol_names, candidate_hash, &mut virtual_overseer, - test_state.threshold() - 1, - |i| if i == 0 { panic!("requested from local validator") } else { Has::Yes }, + test_state.threshold(), + |i| { + if i.0 == validator_index_for_first_chunk { + panic!("requested from local validator") + } else { + Has::Yes + } + }, + false, ) .await; @@ -1530,14 +2423,439 @@ fn does_not_query_local_validator() { } #[test] -fn invalid_local_chunk_is_ignored() { +fn systematic_chunks_are_not_requested_again_in_regular_recovery() { + // Run this test multiple times, as the order in which requests are made is random and we want + // to make sure that we catch regressions. 
+ for _ in 0..TestState::default().chunks.len() { + let test_state = TestState::default(); + let req_protocol_names = ReqProtocolNames::new(&GENESIS_HASH, None); + let subsystem = with_systematic_chunks( + request_receiver(&req_protocol_names), + &req_protocol_names, + Metrics::new_dummy(), + ); + + test_harness(subsystem, |mut virtual_overseer| async move { + overseer_signal( + &mut virtual_overseer, + OverseerSignal::ActiveLeaves(ActiveLeavesUpdate::start_work(new_leaf( + test_state.current, + 1, + ))), + ) + .await; + + let (tx, rx) = oneshot::channel(); + + overseer_send( + &mut virtual_overseer, + AvailabilityRecoveryMessage::RecoverAvailableData( + test_state.candidate.clone(), + test_state.session_index, + None, + Some(test_state.core_index), + tx, + ), + ) + .await; + + test_state.test_runtime_api_session_info(&mut virtual_overseer).await; + + test_state.test_runtime_api_node_features(&mut virtual_overseer).await; + test_state.respond_to_available_data_query(&mut virtual_overseer, false).await; + test_state.respond_to_query_all_request(&mut virtual_overseer, |_| false).await; + + let validator_index_for_first_chunk = test_state + .chunks + .iter() + .enumerate() + .find_map(|(val_idx, chunk)| if chunk.index.0 == 0 { Some(val_idx) } else { None }) + .unwrap() as u32; + + test_state + .test_chunk_requests( + &req_protocol_names, + test_state.candidate.hash(), + &mut virtual_overseer, + test_state.systematic_threshold(), + |i| if i.0 == validator_index_for_first_chunk { Has::No } else { Has::Yes }, + true, + ) + .await; + + // Falls back to regular recovery, since one validator returned a fatal error. 
+ test_state.respond_to_query_all_request(&mut virtual_overseer, |_| false).await; + + test_state + .test_chunk_requests( + &req_protocol_names, + test_state.candidate.hash(), + &mut virtual_overseer, + 1, + |i| { + if (test_state.chunks.get(i).unwrap().index.0 as usize) < + test_state.systematic_threshold() + { + panic!("Already requested") + } else { + Has::Yes + } + }, + false, + ) + .await; + + assert_eq!(rx.await.unwrap().unwrap(), test_state.available_data); + virtual_overseer + }); + } +} + +#[rstest] +#[case(true, true)] +#[case(true, false)] +#[case(false, true)] +#[case(false, false)] +fn chunk_indices_are_mapped_to_different_validators( + #[case] systematic_recovery: bool, + #[case] mapping_enabled: bool, +) { + let req_protocol_names = ReqProtocolNames::new(&GENESIS_HASH, None); + let test_state = match mapping_enabled { + true => TestState::default(), + false => TestState::with_empty_node_features(), + }; + let subsystem = match systematic_recovery { + true => with_systematic_chunks( + request_receiver(&req_protocol_names), + &req_protocol_names, + Metrics::new_dummy(), + ), + false => with_chunks_only( + request_receiver(&req_protocol_names), + &req_protocol_names, + Metrics::new_dummy(), + ), + }; + + test_harness(subsystem, |mut virtual_overseer| async move { + overseer_signal( + &mut virtual_overseer, + OverseerSignal::ActiveLeaves(ActiveLeavesUpdate::start_work(new_leaf( + test_state.current, + 1, + ))), + ) + .await; + + let (tx, _rx) = oneshot::channel(); + + overseer_send( + &mut virtual_overseer, + AvailabilityRecoveryMessage::RecoverAvailableData( + test_state.candidate.clone(), + test_state.session_index, + None, + Some(test_state.core_index), + tx, + ), + ) + .await; + + test_state.test_runtime_api_session_info(&mut virtual_overseer).await; + + test_state.test_runtime_api_node_features(&mut virtual_overseer).await; + + test_state.respond_to_available_data_query(&mut virtual_overseer, false).await; + 
test_state.respond_to_query_all_request(&mut virtual_overseer, |_| false).await; + + let mut chunk_indices: Vec<(u32, u32)> = vec![]; + + assert_matches!( + overseer_recv(&mut virtual_overseer).await, + AllMessages::NetworkBridgeTx( + NetworkBridgeTxMessage::SendRequests( + requests, + _if_disconnected, + ) + ) => { + for req in requests { + assert_matches!( + req, + Requests::ChunkFetching(req) => { + assert_eq!(req.payload.candidate_hash, test_state.candidate.hash()); + + let validator_index = req.payload.index; + let chunk_index = test_state.chunks.get(validator_index).unwrap().index; + + if systematic_recovery && mapping_enabled { + assert!((chunk_index.0 as usize) <= test_state.systematic_threshold(), "requested non-systematic chunk"); + } + + chunk_indices.push((chunk_index.0, validator_index.0)); + } + ) + } + } + ); + + if mapping_enabled { + assert!(!chunk_indices.iter().any(|(c_index, v_index)| c_index == v_index)); + } else { + assert!(chunk_indices.iter().all(|(c_index, v_index)| c_index == v_index)); + } + + virtual_overseer + }); +} + +#[rstest] +#[case(true, false)] +#[case(false, true)] +#[case(false, false)] +fn number_of_request_retries_is_bounded( + #[case] systematic_recovery: bool, + #[case] should_fail: bool, +) { + let mut test_state = TestState::default(); + let req_protocol_names = ReqProtocolNames::new(&GENESIS_HASH, None); + // We need the number of validators to be evenly divisible by the threshold for this test to be + // easier to write. 
+ let n_validators = 6; + test_state.validators.truncate(n_validators); + test_state.validator_authority_id.truncate(n_validators); + let mut temp = test_state.validator_public.to_vec(); + temp.truncate(n_validators); + test_state.validator_public = temp.into(); + + let (chunks, erasure_root) = derive_erasure_chunks_with_proofs_and_root( + n_validators, + &test_state.available_data, + |_, _| {}, + ); + test_state.chunks = + map_chunks(chunks, &test_state.node_features, n_validators, test_state.core_index); + test_state.candidate.descriptor.erasure_root = erasure_root; + + let (subsystem, retry_limit) = match systematic_recovery { + false => ( + with_chunks_only( + request_receiver(&req_protocol_names), + &req_protocol_names, + Metrics::new_dummy(), + ), + REGULAR_CHUNKS_REQ_RETRY_LIMIT, + ), + true => ( + with_systematic_chunks( + request_receiver(&req_protocol_names), + &req_protocol_names, + Metrics::new_dummy(), + ), + SYSTEMATIC_CHUNKS_REQ_RETRY_LIMIT, + ), + }; + + test_harness(subsystem, |mut virtual_overseer| async move { + overseer_signal( + &mut virtual_overseer, + OverseerSignal::ActiveLeaves(ActiveLeavesUpdate::start_work(new_leaf( + test_state.current, + 1, + ))), + ) + .await; + + let (tx, rx) = oneshot::channel(); + + overseer_send( + &mut virtual_overseer, + AvailabilityRecoveryMessage::RecoverAvailableData( + test_state.candidate.clone(), + test_state.session_index, + None, + Some(test_state.core_index), + tx, + ), + ) + .await; + + test_state.test_runtime_api_session_info(&mut virtual_overseer).await; + + test_state.test_runtime_api_node_features(&mut virtual_overseer).await; + test_state.respond_to_available_data_query(&mut virtual_overseer, false).await; + test_state.respond_to_query_all_request(&mut virtual_overseer, |_| false).await; + + let validator_count_per_iteration = if systematic_recovery { + test_state.systematic_threshold() + } else { + test_state.chunks.len() + }; + + // Network errors are considered non-fatal but should be retried a 
limited number of times. + for _ in 1..retry_limit { + test_state + .test_chunk_requests( + &req_protocol_names, + test_state.candidate.hash(), + &mut virtual_overseer, + validator_count_per_iteration, + |_| Has::timeout(), + systematic_recovery, + ) + .await; + } + + if should_fail { + test_state + .test_chunk_requests( + &req_protocol_names, + test_state.candidate.hash(), + &mut virtual_overseer, + validator_count_per_iteration, + |_| Has::timeout(), + systematic_recovery, + ) + .await; + + assert_eq!(rx.await.unwrap().unwrap_err(), RecoveryError::Unavailable); + } else { + test_state + .test_chunk_requests( + &req_protocol_names, + test_state.candidate.hash(), + &mut virtual_overseer, + test_state.threshold(), + |_| Has::Yes, + systematic_recovery, + ) + .await; + + assert_eq!(rx.await.unwrap().unwrap(), test_state.available_data); + } + + virtual_overseer + }); +} + +#[test] +fn systematic_recovery_retries_from_backers() { let test_state = TestState::default(); let req_protocol_names = ReqProtocolNames::new(&GENESIS_HASH, None); - let subsystem = AvailabilityRecoverySubsystem::with_chunks_only( + let subsystem = with_systematic_chunks( request_receiver(&req_protocol_names), + &req_protocol_names, Metrics::new_dummy(), ); + test_harness(subsystem, |mut virtual_overseer| async move { + overseer_signal( + &mut virtual_overseer, + OverseerSignal::ActiveLeaves(ActiveLeavesUpdate::start_work(new_leaf( + test_state.current, + 1, + ))), + ) + .await; + + let (tx, rx) = oneshot::channel(); + let group_index = GroupIndex(2); + let group_size = test_state.validator_groups.get(group_index).unwrap().len(); + + overseer_send( + &mut virtual_overseer, + AvailabilityRecoveryMessage::RecoverAvailableData( + test_state.candidate.clone(), + test_state.session_index, + Some(group_index), + Some(test_state.core_index), + tx, + ), + ) + .await; + + test_state.test_runtime_api_session_info(&mut virtual_overseer).await; + + test_state.test_runtime_api_node_features(&mut 
virtual_overseer).await; + test_state.respond_to_available_data_query(&mut virtual_overseer, false).await; + test_state.respond_to_query_all_request(&mut virtual_overseer, |_| false).await; + + let mut cnt = 0; + + test_state + .test_chunk_requests( + &req_protocol_names, + test_state.candidate.hash(), + &mut virtual_overseer, + test_state.systematic_threshold(), + |_| { + let res = if cnt < group_size { Has::timeout() } else { Has::Yes }; + cnt += 1; + res + }, + true, + ) + .await; + + // Exhaust retries. + for _ in 0..(SYSTEMATIC_CHUNKS_REQ_RETRY_LIMIT - 1) { + test_state + .test_chunk_requests( + &req_protocol_names, + test_state.candidate.hash(), + &mut virtual_overseer, + group_size, + |_| Has::No, + true, + ) + .await; + } + + // Now, final chance is to try from a backer. + test_state + .test_chunk_requests( + &req_protocol_names, + test_state.candidate.hash(), + &mut virtual_overseer, + group_size, + |_| Has::Yes, + true, + ) + .await; + + assert_eq!(rx.await.unwrap().unwrap(), test_state.available_data); + virtual_overseer + }); +} + +#[rstest] +#[case(true)] +#[case(false)] +fn test_legacy_network_protocol_with_mapping_disabled(#[case] systematic_recovery: bool) { + // In this case, when the mapping is disabled, recovery will work with both v2 and v1 requests, + // under the assumption that ValidatorIndex is always equal to ChunkIndex. However, systematic + // recovery will not be possible, it will fall back to regular recovery. 
+ let test_state = TestState::with_empty_node_features(); + let req_protocol_names = ReqProtocolNames::new(&GENESIS_HASH, None); + let (subsystem, threshold) = match systematic_recovery { + true => ( + with_systematic_chunks( + request_receiver(&req_protocol_names), + &req_protocol_names, + Metrics::new_dummy(), + ), + test_state.systematic_threshold(), + ), + false => ( + with_fast_path( + request_receiver(&req_protocol_names), + &req_protocol_names, + Metrics::new_dummy(), + ), + test_state.threshold(), + ), + }; + test_harness(subsystem, |mut virtual_overseer| async move { overseer_signal( &mut virtual_overseer, @@ -1556,30 +2874,250 @@ fn invalid_local_chunk_is_ignored() { test_state.candidate.clone(), test_state.session_index, None, + Some(test_state.core_index), tx, ), ) .await; - test_state.test_runtime_api(&mut virtual_overseer).await; + test_state.test_runtime_api_session_info(&mut virtual_overseer).await; + + test_state.test_runtime_api_node_features(&mut virtual_overseer).await; + + let candidate_hash = test_state.candidate.hash(); + test_state.respond_to_available_data_query(&mut virtual_overseer, false).await; + test_state.respond_to_query_all_request(&mut virtual_overseer, |_| false).await; + test_state - .respond_to_query_all_request_invalid(&mut virtual_overseer, |i| i == 0) + .test_chunk_requests_v1( + &req_protocol_names, + candidate_hash, + &mut virtual_overseer, + threshold, + |_| Has::Yes, + false, + ) .await; + // Recovered data should match the original one. + assert_eq!(rx.await.unwrap().unwrap(), test_state.available_data); + virtual_overseer + }); +} + +#[rstest] +#[case(true)] +#[case(false)] +fn test_legacy_network_protocol_with_mapping_enabled(#[case] systematic_recovery: bool) { + // In this case, when the mapping is enabled, we MUST only use v2. Recovery should fail for v1. 
+ let test_state = TestState::default(); + let req_protocol_names = ReqProtocolNames::new(&GENESIS_HASH, None); + let (subsystem, threshold) = match systematic_recovery { + true => ( + with_systematic_chunks( + request_receiver(&req_protocol_names), + &req_protocol_names, + Metrics::new_dummy(), + ), + test_state.systematic_threshold(), + ), + false => ( + with_fast_path( + request_receiver(&req_protocol_names), + &req_protocol_names, + Metrics::new_dummy(), + ), + test_state.threshold(), + ), + }; + + test_harness(subsystem, |mut virtual_overseer| async move { + overseer_signal( + &mut virtual_overseer, + OverseerSignal::ActiveLeaves(ActiveLeavesUpdate::start_work(new_leaf( + test_state.current, + 1, + ))), + ) + .await; + + let (tx, rx) = oneshot::channel(); + + overseer_send( + &mut virtual_overseer, + AvailabilityRecoveryMessage::RecoverAvailableData( + test_state.candidate.clone(), + test_state.session_index, + None, + Some(test_state.core_index), + tx, + ), + ) + .await; + + test_state.test_runtime_api_session_info(&mut virtual_overseer).await; + + test_state.test_runtime_api_node_features(&mut virtual_overseer).await; + let candidate_hash = test_state.candidate.hash(); + test_state.respond_to_available_data_query(&mut virtual_overseer, false).await; + test_state.respond_to_query_all_request(&mut virtual_overseer, |_| false).await; + + if systematic_recovery { + test_state + .test_chunk_requests_v1( + &req_protocol_names, + candidate_hash, + &mut virtual_overseer, + threshold, + |_| Has::Yes, + systematic_recovery, + ) + .await; + + // Systematic recovery failed, trying regular recovery. 
+ test_state.respond_to_query_all_request(&mut virtual_overseer, |_| false).await; + } + + test_state + .test_chunk_requests_v1( + &req_protocol_names, + candidate_hash, + &mut virtual_overseer, + test_state.validators.len() - test_state.threshold(), + |_| Has::Yes, + false, + ) + .await; + + assert_eq!(rx.await.unwrap().unwrap_err(), RecoveryError::Unavailable); + virtual_overseer + }); +} + +#[test] +fn test_systematic_recovery_skipped_if_no_core_index() { + let test_state = TestState::default(); + let req_protocol_names = ReqProtocolNames::new(&GENESIS_HASH, None); + let subsystem = with_systematic_chunks( + request_receiver(&req_protocol_names), + &req_protocol_names, + Metrics::new_dummy(), + ); + + test_harness(subsystem, |mut virtual_overseer| async move { + overseer_signal( + &mut virtual_overseer, + OverseerSignal::ActiveLeaves(ActiveLeavesUpdate::start_work(new_leaf( + test_state.current, + 1, + ))), + ) + .await; + + let (tx, rx) = oneshot::channel(); + + overseer_send( + &mut virtual_overseer, + AvailabilityRecoveryMessage::RecoverAvailableData( + test_state.candidate.clone(), + test_state.session_index, + None, + None, + tx, + ), + ) + .await; + + test_state.test_runtime_api_session_info(&mut virtual_overseer).await; + + test_state.test_runtime_api_node_features(&mut virtual_overseer).await; + + let candidate_hash = test_state.candidate.hash(); + + test_state.respond_to_available_data_query(&mut virtual_overseer, false).await; + test_state.respond_to_query_all_request(&mut virtual_overseer, |_| false).await; + + // Systematic recovery not possible without core index, falling back to regular recovery. 
test_state .test_chunk_requests( &req_protocol_names, candidate_hash, &mut virtual_overseer, - test_state.threshold() - 1, - |i| if i == 0 { panic!("requested from local validator") } else { Has::Yes }, + test_state.validators.len() - test_state.threshold(), + |_| Has::No, + false, ) .await; - assert_eq!(rx.await.unwrap().unwrap(), test_state.available_data); + // Make it fail, in order to assert that indeed regular recovery was attempted. If it were + // systematic recovery, we would have had one more attempt for regular reconstruction. + assert_eq!(rx.await.unwrap().unwrap_err(), RecoveryError::Unavailable); + virtual_overseer + }); +} + +#[test] +fn test_systematic_recovery_skipped_if_mapping_disabled() { + let test_state = TestState::with_empty_node_features(); + let req_protocol_names = ReqProtocolNames::new(&GENESIS_HASH, None); + let subsystem = AvailabilityRecoverySubsystem::for_validator( + None, + request_receiver(&req_protocol_names), + &req_protocol_names, + Metrics::new_dummy(), + ); + + test_harness(subsystem, |mut virtual_overseer| async move { + overseer_signal( + &mut virtual_overseer, + OverseerSignal::ActiveLeaves(ActiveLeavesUpdate::start_work(new_leaf( + test_state.current, + 1, + ))), + ) + .await; + + let (tx, rx) = oneshot::channel(); + + overseer_send( + &mut virtual_overseer, + AvailabilityRecoveryMessage::RecoverAvailableData( + test_state.candidate.clone(), + test_state.session_index, + None, + Some(test_state.core_index), + tx, + ), + ) + .await; + + test_state.test_runtime_api_session_info(&mut virtual_overseer).await; + + test_state.test_runtime_api_node_features(&mut virtual_overseer).await; + + let candidate_hash = test_state.candidate.hash(); + + test_state.respond_to_available_data_query(&mut virtual_overseer, false).await; + test_state.respond_to_query_all_request(&mut virtual_overseer, |_| false).await; + + // Systematic recovery is not possible with the chunk mapping disabled, falling back to regular recovery.
+ test_state + .test_chunk_requests( + &req_protocol_names, + candidate_hash, + &mut virtual_overseer, + test_state.validators.len() - test_state.threshold(), + |_| Has::No, + false, + ) + .await; + + // Make it fail, in order to assert that indeed regular recovery was attempted. If it were + // systematic recovery, we would have had one more attempt for regular reconstruction. + assert_eq!(rx.await.unwrap().unwrap_err(), RecoveryError::Unavailable); virtual_overseer }); } diff --git a/polkadot/node/network/bridge/src/tx/mod.rs b/polkadot/node/network/bridge/src/tx/mod.rs index d5be6f01c3373..7b6dea748572b 100644 --- a/polkadot/node/network/bridge/src/tx/mod.rs +++ b/polkadot/node/network/bridge/src/tx/mod.rs @@ -301,7 +301,15 @@ where for req in reqs { match req { - Requests::ChunkFetchingV1(_) => metrics.on_message("chunk_fetching_v1"), + Requests::ChunkFetching(ref req) => { + // This is not the actual request that will succeed, as we don't know yet + // what that will be. It's only the primary request we tried. + if req.fallback_request.is_some() { + metrics.on_message("chunk_fetching_v2") + } else { + metrics.on_message("chunk_fetching_v1") + } + }, Requests::AvailableDataFetchingV1(_) => metrics.on_message("available_data_fetching_v1"), Requests::CollationFetchingV1(_) => metrics.on_message("collation_fetching_v1"), diff --git a/polkadot/node/network/protocol/src/request_response/mod.rs b/polkadot/node/network/protocol/src/request_response/mod.rs index cab02bb88a00b..fe06593bd7a0f 100644 --- a/polkadot/node/network/protocol/src/request_response/mod.rs +++ b/polkadot/node/network/protocol/src/request_response/mod.rs @@ -98,6 +98,10 @@ pub enum Protocol { /// Protocol for requesting candidates with attestations in statement distribution /// when async backing is enabled. AttestedCandidateV2, + + /// Protocol for chunk fetching version 2, used by availability distribution and availability + /// recovery. 
+ ChunkFetchingV2, } /// Minimum bandwidth we expect for validators - 500Mbit/s is the recommendation, so approximately @@ -209,7 +213,7 @@ impl Protocol { let name = req_protocol_names.get_name(self); let legacy_names = self.get_legacy_name().into_iter().map(Into::into).collect(); match self { - Protocol::ChunkFetchingV1 => N::request_response_config( + Protocol::ChunkFetchingV1 | Protocol::ChunkFetchingV2 => N::request_response_config( name, legacy_names, 1_000, @@ -292,7 +296,7 @@ impl Protocol { // times (due to network delays), 100 seems big enough to accommodate for "bursts", // assuming we can service requests relatively quickly, which would need to be measured // as well. - Protocol::ChunkFetchingV1 => 100, + Protocol::ChunkFetchingV1 | Protocol::ChunkFetchingV2 => 100, // 10 seems reasonable, considering group sizes of max 10 validators. Protocol::CollationFetchingV1 | Protocol::CollationFetchingV2 => 10, // 10 seems reasonable, considering group sizes of max 10 validators. @@ -362,6 +366,7 @@ impl Protocol { // Introduced after legacy names became legacy. 
Protocol::AttestedCandidateV2 => None, Protocol::CollationFetchingV2 => None, + Protocol::ChunkFetchingV2 => None, } } } @@ -412,6 +417,7 @@ impl ReqProtocolNames { }; let short_name = match protocol { + // V1: Protocol::ChunkFetchingV1 => "/req_chunk/1", Protocol::CollationFetchingV1 => "/req_collation/1", Protocol::PoVFetchingV1 => "/req_pov/1", @@ -419,8 +425,10 @@ impl ReqProtocolNames { Protocol::StatementFetchingV1 => "/req_statement/1", Protocol::DisputeSendingV1 => "/send_dispute/1", + // V2: Protocol::CollationFetchingV2 => "/req_collation/2", Protocol::AttestedCandidateV2 => "/req_attested_candidate/2", + Protocol::ChunkFetchingV2 => "/req_chunk/2", }; format!("{}{}", prefix, short_name).into() diff --git a/polkadot/node/network/protocol/src/request_response/outgoing.rs b/polkadot/node/network/protocol/src/request_response/outgoing.rs index 96ef4a6ab25dc..f578c4ffded34 100644 --- a/polkadot/node/network/protocol/src/request_response/outgoing.rs +++ b/polkadot/node/network/protocol/src/request_response/outgoing.rs @@ -30,7 +30,7 @@ use super::{v1, v2, IsRequest, Protocol}; #[derive(Debug)] pub enum Requests { /// Request an availability chunk from a node. - ChunkFetchingV1(OutgoingRequest<v1::ChunkFetchingRequest>), + ChunkFetching(OutgoingRequest<v2::ChunkFetchingRequest, v1::ChunkFetchingRequest>), /// Fetch a collation from a collator which previously announced it. CollationFetchingV1(OutgoingRequest<v1::CollationFetchingRequest>), /// Fetch a PoV from a validator which previously sent out a seconded statement. @@ -59,7 +59,7 @@ impl Requests { /// contained in the `enum`. pub fn encode_request(self) -> (Protocol, OutgoingRequest<Vec<u8>>) { match self { - Self::ChunkFetchingV1(r) => r.encode_request(), + Self::ChunkFetching(r) => r.encode_request(), Self::CollationFetchingV1(r) => r.encode_request(), Self::CollationFetchingV2(r) => r.encode_request(), Self::PoVFetchingV1(r) => r.encode_request(), @@ -164,24 +164,20 @@ where /// /// Returns a raw `Vec<u8>` response over the channel.
Use the associated `ProtocolName` to know /// which request was the successful one and appropriately decode the response. - // WARNING: This is commented for now because it's not used yet. - // If you need it, make sure to test it. You may need to enable the V1 substream upgrade - // protocol, unless libp2p was in the meantime updated to a version that fixes the problem - // described in https://github.com/libp2p/rust-libp2p/issues/5074 - // pub fn new_with_fallback( - // peer: Recipient, - // payload: Req, - // fallback_request: FallbackReq, - // ) -> (Self, impl Future<Output = OutgoingResult<(Vec<u8>, ProtocolName)>>) { - // let (tx, rx) = oneshot::channel(); - // let r = Self { - // peer, - // payload, - // pending_response: tx, - // fallback_request: Some((fallback_request, FallbackReq::PROTOCOL)), - // }; - // (r, async { Ok(rx.await??) }) - // } + pub fn new_with_fallback( + peer: Recipient, + payload: Req, + fallback_request: FallbackReq, + ) -> (Self, impl Future<Output = OutgoingResult<(Vec<u8>, ProtocolName)>>) { + let (tx, rx) = oneshot::channel(); + let r = Self { + peer, + payload, + pending_response: tx, + fallback_request: Some((fallback_request, FallbackReq::PROTOCOL)), + }; + (r, async { Ok(rx.await??) }) + } /// Encode a request into a `Vec<u8>`. /// diff --git a/polkadot/node/network/protocol/src/request_response/v1.rs b/polkadot/node/network/protocol/src/request_response/v1.rs index 60eecb69f7389..c503c6e4df03b 100644 --- a/polkadot/node/network/protocol/src/request_response/v1.rs +++ b/polkadot/node/network/protocol/src/request_response/v1.rs @@ -33,7 +33,8 @@ use super::{IsRequest, Protocol}; pub struct ChunkFetchingRequest { /// Hash of candidate we want a chunk for. pub candidate_hash: CandidateHash, - /// The index of the chunk to fetch. + /// The validator index we are requesting from. This must be identical to the index of the + /// chunk we'll receive. For v2, this may not be the case.
pub index: ValidatorIndex, } @@ -57,6 +58,15 @@ impl From<Option<ChunkResponse>> for ChunkFetchingResponse { } } +impl From<ChunkFetchingResponse> for Option<ChunkResponse> { + fn from(x: ChunkFetchingResponse) -> Self { + match x { + ChunkFetchingResponse::Chunk(c) => Some(c), + ChunkFetchingResponse::NoSuchChunk => None, + } + } +} + /// Skimmed down variant of `ErasureChunk`. /// /// Instead of transmitting a full `ErasureChunk` we transmit `ChunkResponse` in @@ -80,7 +90,7 @@ impl From<ErasureChunk> for ChunkResponse { impl ChunkResponse { /// Re-build an `ErasureChunk` from response and request. pub fn recombine_into_chunk(self, req: &ChunkFetchingRequest) -> ErasureChunk { - ErasureChunk { chunk: self.chunk, proof: self.proof, index: req.index } + ErasureChunk { chunk: self.chunk, proof: self.proof, index: req.index.into() } } } diff --git a/polkadot/node/network/protocol/src/request_response/v2.rs b/polkadot/node/network/protocol/src/request_response/v2.rs index 6b90c579237fb..7e1a2d989168c 100644 --- a/polkadot/node/network/protocol/src/request_response/v2.rs +++ b/polkadot/node/network/protocol/src/request_response/v2.rs @@ -18,12 +18,13 @@ use parity_scale_codec::{Decode, Encode}; +use polkadot_node_primitives::ErasureChunk; use polkadot_primitives::{ CandidateHash, CommittedCandidateReceipt, Hash, Id as ParaId, PersistedValidationData, - UncheckedSignedStatement, + UncheckedSignedStatement, ValidatorIndex, }; -use super::{IsRequest, Protocol}; +use super::{v1, IsRequest, Protocol}; use crate::v2::StatementFilter; /// Request a candidate with statements. @@ -78,3 +79,60 @@ impl IsRequest for CollationFetchingRequest { type Response = CollationFetchingResponse; const PROTOCOL: Protocol = Protocol::CollationFetchingV2; } + +/// Request an availability chunk. +#[derive(Debug, Copy, Clone, Encode, Decode)] +pub struct ChunkFetchingRequest { + /// Hash of candidate we want a chunk for. + pub candidate_hash: CandidateHash, + /// The validator index we are requesting from.
This may not be identical to the index of the + /// chunk we'll receive. It's up to the caller to decide whether they need to validate they got + /// the chunk they were expecting. + pub index: ValidatorIndex, +} + +/// Receive a requested erasure chunk. +#[derive(Debug, Clone, Encode, Decode)] +pub enum ChunkFetchingResponse { + /// The requested chunk data. + #[codec(index = 0)] + Chunk(ErasureChunk), + /// Node was not in possession of the requested chunk. + #[codec(index = 1)] + NoSuchChunk, +} + +impl From<Option<ErasureChunk>> for ChunkFetchingResponse { + fn from(x: Option<ErasureChunk>) -> Self { + match x { + Some(c) => ChunkFetchingResponse::Chunk(c), + None => ChunkFetchingResponse::NoSuchChunk, + } + } +} + +impl From<ChunkFetchingResponse> for Option<ErasureChunk> { + fn from(x: ChunkFetchingResponse) -> Self { + match x { + ChunkFetchingResponse::Chunk(c) => Some(c), + ChunkFetchingResponse::NoSuchChunk => None, + } + } +} + +impl From<v1::ChunkFetchingRequest> for ChunkFetchingRequest { + fn from(v1::ChunkFetchingRequest { candidate_hash, index }: v1::ChunkFetchingRequest) -> Self { + Self { candidate_hash, index } + } +} + +impl From<ChunkFetchingRequest> for v1::ChunkFetchingRequest { + fn from(ChunkFetchingRequest { candidate_hash, index }: ChunkFetchingRequest) -> Self { + Self { candidate_hash, index } + } +} + +impl IsRequest for ChunkFetchingRequest { + type Response = ChunkFetchingResponse; + const PROTOCOL: Protocol = Protocol::ChunkFetchingV2; +} diff --git a/polkadot/node/overseer/src/tests.rs b/polkadot/node/overseer/src/tests.rs index 55a6bdb74ba73..87484914ef975 100644 --- a/polkadot/node/overseer/src/tests.rs +++ b/polkadot/node/overseer/src/tests.rs @@ -856,6 +856,7 @@ fn test_availability_recovery_msg() -> AvailabilityRecoveryMessage { dummy_candidate_receipt(dummy_hash()), Default::default(), None, + None, sender, ) } diff --git a/polkadot/node/primitives/src/lib.rs b/polkadot/node/primitives/src/lib.rs index 67930f8735c84..5f007bc8d67d9 100644 --- a/polkadot/node/primitives/src/lib.rs +++ b/polkadot/node/primitives/src/lib.rs @@ -30,13 +30,14 @@ use
parity_scale_codec::{Decode, Encode, Error as CodecError, Input}; use serde::{de, Deserialize, Deserializer, Serialize, Serializer}; use polkadot_primitives::{ - BlakeTwo256, BlockNumber, CandidateCommitments, CandidateHash, CollatorPair, + BlakeTwo256, BlockNumber, CandidateCommitments, CandidateHash, ChunkIndex, CollatorPair, CommittedCandidateReceipt, CompactStatement, CoreIndex, EncodeAs, Hash, HashT, HeadData, Id as ParaId, PersistedValidationData, SessionIndex, Signed, UncheckedSigned, ValidationCode, - ValidationCodeHash, ValidatorIndex, MAX_CODE_SIZE, MAX_POV_SIZE, + ValidationCodeHash, MAX_CODE_SIZE, MAX_POV_SIZE, }; pub use sp_consensus_babe::{ AllowedSlots as BabeAllowedSlots, BabeEpochConfiguration, Epoch as BabeEpoch, + Randomness as BabeRandomness, }; pub use polkadot_parachain_primitives::primitives::{ @@ -639,7 +640,7 @@ pub struct ErasureChunk { /// The erasure-encoded chunk of data belonging to the candidate block. pub chunk: Vec, /// The index of this erasure-encoded chunk of data. - pub index: ValidatorIndex, + pub index: ChunkIndex, /// Proof for this chunk's branch in the Merkle tree. 
pub proof: Proof, } diff --git a/polkadot/node/service/src/lib.rs b/polkadot/node/service/src/lib.rs index f50b9770b4182..7c9b9e05d62c3 100644 --- a/polkadot/node/service/src/lib.rs +++ b/polkadot/node/service/src/lib.rs @@ -915,7 +915,10 @@ pub fn new_full< let (pov_req_receiver, cfg) = IncomingRequest::get_config_receiver::<_, Network>(&req_protocol_names); net_config.add_request_response_protocol(cfg); - let (chunk_req_receiver, cfg) = + let (chunk_req_v1_receiver, cfg) = + IncomingRequest::get_config_receiver::<_, Network>(&req_protocol_names); + net_config.add_request_response_protocol(cfg); + let (chunk_req_v2_receiver, cfg) = IncomingRequest::get_config_receiver::<_, Network>(&req_protocol_names); net_config.add_request_response_protocol(cfg); @@ -1000,7 +1003,8 @@ pub fn new_full< candidate_validation_config, availability_config: AVAILABILITY_CONFIG, pov_req_receiver, - chunk_req_receiver, + chunk_req_v1_receiver, + chunk_req_v2_receiver, statement_req_receiver, candidate_req_v2_receiver, approval_voting_config, diff --git a/polkadot/node/service/src/overseer.rs b/polkadot/node/service/src/overseer.rs index 175a77e1c5f6d..6f35718cd18f2 100644 --- a/polkadot/node/service/src/overseer.rs +++ b/polkadot/node/service/src/overseer.rs @@ -119,8 +119,10 @@ pub struct ExtendedOverseerGenArgs { pub availability_config: AvailabilityConfig, /// POV request receiver. pub pov_req_receiver: IncomingRequestReceiver, - /// Erasure chunks request receiver. - pub chunk_req_receiver: IncomingRequestReceiver, + /// Erasure chunk request v1 receiver. + pub chunk_req_v1_receiver: IncomingRequestReceiver, + /// Erasure chunk request v2 receiver. + pub chunk_req_v2_receiver: IncomingRequestReceiver, /// Receiver for incoming large statement requests. pub statement_req_receiver: IncomingRequestReceiver, /// Receiver for incoming candidate requests. 
@@ -163,7 +165,8 @@ pub fn validator_overseer_builder( candidate_validation_config, availability_config, pov_req_receiver, - chunk_req_receiver, + chunk_req_v1_receiver, + chunk_req_v2_receiver, statement_req_receiver, candidate_req_v2_receiver, approval_voting_config, @@ -226,7 +229,7 @@ where network_service.clone(), authority_discovery_service.clone(), network_bridge_metrics.clone(), - req_protocol_names, + req_protocol_names.clone(), peerset_protocol_names.clone(), notification_sinks.clone(), )) @@ -241,12 +244,18 @@ where )) .availability_distribution(AvailabilityDistributionSubsystem::new( keystore.clone(), - IncomingRequestReceivers { pov_req_receiver, chunk_req_receiver }, + IncomingRequestReceivers { + pov_req_receiver, + chunk_req_v1_receiver, + chunk_req_v2_receiver, + }, + req_protocol_names.clone(), Metrics::register(registry)?, )) - .availability_recovery(AvailabilityRecoverySubsystem::with_chunks_if_pov_large( + .availability_recovery(AvailabilityRecoverySubsystem::for_validator( fetch_chunks_threshold, available_data_req_receiver, + &req_protocol_names, Metrics::register(registry)?, )) .availability_store(AvailabilityStoreSubsystem::new( @@ -412,7 +421,7 @@ where network_service.clone(), authority_discovery_service.clone(), network_bridge_metrics.clone(), - req_protocol_names, + req_protocol_names.clone(), peerset_protocol_names.clone(), notification_sinks.clone(), )) @@ -429,6 +438,7 @@ where .availability_recovery(AvailabilityRecoverySubsystem::for_collator( None, available_data_req_receiver, + &req_protocol_names, Metrics::register(registry)?, )) .availability_store(DummySubsystem) diff --git a/polkadot/node/subsystem-bench/Cargo.toml b/polkadot/node/subsystem-bench/Cargo.toml index 21eaed832c4b9..ebd9322e9f74a 100644 --- a/polkadot/node/subsystem-bench/Cargo.toml +++ b/polkadot/node/subsystem-bench/Cargo.toml @@ -89,6 +89,7 @@ paste = "1.0.14" orchestra = { version = "0.3.5", default-features = false, features = ["futures_channel"] } pyroscope = 
"0.5.7" pyroscope_pprofrs = "0.2.7" +strum = { version = "0.24", features = ["derive"] } [features] default = [] diff --git a/polkadot/node/subsystem-bench/examples/availability_read.yaml b/polkadot/node/subsystem-bench/examples/availability_read.yaml index 82355b0e2973a..263a6988242e2 100644 --- a/polkadot/node/subsystem-bench/examples/availability_read.yaml +++ b/polkadot/node/subsystem-bench/examples/availability_read.yaml @@ -1,8 +1,8 @@ TestConfiguration: # Test 1 - objective: !DataAvailabilityRead - fetch_from_backers: true - n_validators: 300 + strategy: FullFromBackers + n_validators: 500 n_cores: 20 min_pov_size: 5120 max_pov_size: 5120 @@ -16,7 +16,7 @@ TestConfiguration: # Test 2 - objective: !DataAvailabilityRead - fetch_from_backers: true + strategy: FullFromBackers n_validators: 500 n_cores: 20 min_pov_size: 5120 @@ -31,7 +31,7 @@ TestConfiguration: # Test 3 - objective: !DataAvailabilityRead - fetch_from_backers: true + strategy: FullFromBackers n_validators: 1000 n_cores: 20 min_pov_size: 5120 diff --git a/polkadot/node/subsystem-bench/src/lib/availability/mod.rs b/polkadot/node/subsystem-bench/src/lib/availability/mod.rs index f7d65589565ba..955a8fbac2e9a 100644 --- a/polkadot/node/subsystem-bench/src/lib/availability/mod.rs +++ b/polkadot/node/subsystem-bench/src/lib/availability/mod.rs @@ -17,12 +17,14 @@ use crate::{ availability::av_store_helpers::new_av_store, dummy_builder, - environment::{TestEnvironment, TestEnvironmentDependencies, GENESIS_HASH}, + environment::{TestEnvironment, TestEnvironmentDependencies}, mock::{ - av_store::{self, MockAvailabilityStore, NetworkAvailabilityState}, + av_store::{MockAvailabilityStore, NetworkAvailabilityState}, chain_api::{ChainApiState, MockChainApi}, network_bridge::{self, MockNetworkBridgeRx, MockNetworkBridgeTx}, - runtime_api::{self, MockRuntimeApi, MockRuntimeApiCoreState}, + runtime_api::{ + node_features_with_chunk_mapping_enabled, MockRuntimeApi, MockRuntimeApiCoreState, + }, 
AlwaysSupportsParachains, }, network::new_network, @@ -30,16 +32,17 @@ use crate::{ }; use colored::Colorize; use futures::{channel::oneshot, stream::FuturesUnordered, StreamExt}; + use parity_scale_codec::Encode; use polkadot_availability_bitfield_distribution::BitfieldDistribution; use polkadot_availability_distribution::{ AvailabilityDistributionSubsystem, IncomingRequestReceivers, }; -use polkadot_availability_recovery::AvailabilityRecoverySubsystem; +use polkadot_availability_recovery::{AvailabilityRecoverySubsystem, RecoveryStrategyKind}; use polkadot_node_core_av_store::AvailabilityStoreSubsystem; use polkadot_node_metrics::metrics::Metrics; use polkadot_node_network_protocol::{ - request_response::{IncomingRequest, ReqProtocolNames}, + request_response::{v1, v2, IncomingRequest}, OurView, }; use polkadot_node_subsystem::{ @@ -51,12 +54,13 @@ use polkadot_node_subsystem_types::{ Span, }; use polkadot_overseer::{metrics::Metrics as OverseerMetrics, Handle as OverseerHandle}; -use polkadot_primitives::{Block, GroupIndex, Hash}; +use polkadot_primitives::{Block, CoreIndex, GroupIndex, Hash}; use sc_network::request_responses::{IncomingRequest as RawIncomingRequest, ProtocolConfig}; +use std::{ops::Sub, sync::Arc, time::Instant}; +use strum::Display; use sc_service::SpawnTaskHandle; use serde::{Deserialize, Serialize}; -use std::{ops::Sub, sync::Arc, time::Instant}; pub use test_state::TestState; mod av_store_helpers; @@ -64,15 +68,26 @@ mod test_state; const LOG_TARGET: &str = "subsystem-bench::availability"; +#[derive(clap::ValueEnum, Clone, Copy, Debug, PartialEq, Serialize, Deserialize, Display)] +#[value(rename_all = "kebab-case")] +#[strum(serialize_all = "kebab-case")] +pub enum Strategy { + /// Regular random chunk recovery. This is also the fallback for the next strategies. + Chunks, + /// Recovery from systematic chunks. Much faster than regular chunk recovery because it avoids + /// doing the reed-solomon reconstruction. 
+ Systematic, + /// Fetch the full availability data from backers first. Saves CPU as we don't need to + /// re-construct from chunks. Typically this is only faster if nodes have enough bandwidth. + FullFromBackers, +} + #[derive(Debug, Clone, Serialize, Deserialize, clap::Parser)] #[clap(rename_all = "kebab-case")] #[allow(missing_docs)] pub struct DataAvailabilityReadOptions { - #[clap(short, long, default_value_t = false)] - /// Turbo boost AD Read by fetching the full availability datafrom backers first. Saves CPU as - /// we don't need to re-construct from chunks. Typically this is only faster if nodes have - /// enough bandwidth. - pub fetch_from_backers: bool, + #[clap(short, long, default_value_t = Strategy::Systematic)] + pub strategy: Strategy, } pub enum TestDataAvailability { @@ -84,7 +99,7 @@ fn build_overseer_for_availability_read( spawn_task_handle: SpawnTaskHandle, runtime_api: MockRuntimeApi, av_store: MockAvailabilityStore, - network_bridge: (MockNetworkBridgeTx, MockNetworkBridgeRx), + (network_bridge_tx, network_bridge_rx): (MockNetworkBridgeTx, MockNetworkBridgeRx), availability_recovery: AvailabilityRecoverySubsystem, dependencies: &TestEnvironmentDependencies, ) -> (Overseer, AlwaysSupportsParachains>, OverseerHandle) { @@ -95,8 +110,8 @@ fn build_overseer_for_availability_read( let builder = dummy .replace_runtime_api(|_| runtime_api) .replace_availability_store(|_| av_store) - .replace_network_bridge_tx(|_| network_bridge.0) - .replace_network_bridge_rx(|_| network_bridge.1) + .replace_network_bridge_tx(|_| network_bridge_tx) + .replace_network_bridge_rx(|_| network_bridge_rx) .replace_availability_recovery(|_| availability_recovery); let (overseer, raw_handle) = @@ -109,7 +124,7 @@ fn build_overseer_for_availability_read( fn build_overseer_for_availability_write( spawn_task_handle: SpawnTaskHandle, runtime_api: MockRuntimeApi, - network_bridge: (MockNetworkBridgeTx, MockNetworkBridgeRx), + (network_bridge_tx, network_bridge_rx): 
(MockNetworkBridgeTx, MockNetworkBridgeRx), availability_distribution: AvailabilityDistributionSubsystem, chain_api: MockChainApi, availability_store: AvailabilityStoreSubsystem, @@ -123,8 +138,8 @@ fn build_overseer_for_availability_write( let builder = dummy .replace_runtime_api(|_| runtime_api) .replace_availability_store(|_| availability_store) - .replace_network_bridge_tx(|_| network_bridge.0) - .replace_network_bridge_rx(|_| network_bridge.1) + .replace_network_bridge_tx(|_| network_bridge_tx) + .replace_network_bridge_rx(|_| network_bridge_rx) .replace_chain_api(|_| chain_api) .replace_bitfield_distribution(|_| bitfield_distribution) // This is needed to test own chunk recovery for `n_cores`. @@ -142,10 +157,14 @@ pub fn prepare_test( with_prometheus_endpoint: bool, ) -> (TestEnvironment, Vec) { let dependencies = TestEnvironmentDependencies::default(); + let availability_state = NetworkAvailabilityState { candidate_hashes: state.candidate_hashes.clone(), + candidate_hash_to_core_index: state.candidate_hash_to_core_index.clone(), available_data: state.available_data.clone(), chunks: state.chunks.clone(), + chunk_indices: state.chunk_indices.clone(), + req_protocol_names: state.req_protocol_names.clone(), }; let mut req_cfgs = Vec::new(); @@ -153,20 +172,31 @@ pub fn prepare_test( let (collation_req_receiver, collation_req_cfg) = IncomingRequest::get_config_receiver::< Block, sc_network::NetworkWorker, - >(&ReqProtocolNames::new(GENESIS_HASH, None)); + >(&state.req_protocol_names); req_cfgs.push(collation_req_cfg); let (pov_req_receiver, pov_req_cfg) = IncomingRequest::get_config_receiver::< Block, sc_network::NetworkWorker, - >(&ReqProtocolNames::new(GENESIS_HASH, None)); - - let (chunk_req_receiver, chunk_req_cfg) = IncomingRequest::get_config_receiver::< - Block, - sc_network::NetworkWorker, - >(&ReqProtocolNames::new(GENESIS_HASH, None)); + >(&state.req_protocol_names); req_cfgs.push(pov_req_cfg); + let (chunk_req_v1_receiver, chunk_req_v1_cfg) = + 
IncomingRequest::::get_config_receiver::< + Block, + sc_network::NetworkWorker, + >(&state.req_protocol_names); + + // We won't use v1 chunk fetching requests, but we need to keep the inbound queue alive. + // Otherwise, av-distribution subsystem will terminate. + std::mem::forget(chunk_req_v1_cfg); + + let (chunk_req_v2_receiver, chunk_req_v2_cfg) = + IncomingRequest::::get_config_receiver::< + Block, + sc_network::NetworkWorker, + >(&state.req_protocol_names); + let (network, network_interface, network_receiver) = new_network( &state.config, &dependencies, @@ -180,9 +210,9 @@ pub fn prepare_test( state.test_authorities.clone(), ); let network_bridge_rx = - network_bridge::MockNetworkBridgeRx::new(network_receiver, Some(chunk_req_cfg)); + network_bridge::MockNetworkBridgeRx::new(network_receiver, Some(chunk_req_v2_cfg)); - let runtime_api = runtime_api::MockRuntimeApi::new( + let runtime_api = MockRuntimeApi::new( state.config.clone(), state.test_authorities.clone(), state.candidate_receipts.clone(), @@ -194,24 +224,34 @@ pub fn prepare_test( let (overseer, overseer_handle) = match &mode { TestDataAvailability::Read(options) => { - let use_fast_path = options.fetch_from_backers; - - let subsystem = if use_fast_path { - AvailabilityRecoverySubsystem::with_fast_path( + let subsystem = match options.strategy { + Strategy::FullFromBackers => + AvailabilityRecoverySubsystem::with_recovery_strategy_kind( + collation_req_receiver, + &state.req_protocol_names, + Metrics::try_register(&dependencies.registry).unwrap(), + RecoveryStrategyKind::BackersFirstAlways, + ), + Strategy::Chunks => AvailabilityRecoverySubsystem::with_recovery_strategy_kind( collation_req_receiver, + &state.req_protocol_names, Metrics::try_register(&dependencies.registry).unwrap(), - ) - } else { - AvailabilityRecoverySubsystem::with_chunks_only( + RecoveryStrategyKind::ChunksAlways, + ), + Strategy::Systematic => AvailabilityRecoverySubsystem::with_recovery_strategy_kind( collation_req_receiver, + 
&state.req_protocol_names, Metrics::try_register(&dependencies.registry).unwrap(), - ) + RecoveryStrategyKind::SystematicChunks, + ), }; // Use a mocked av-store. - let av_store = av_store::MockAvailabilityStore::new( + let av_store = MockAvailabilityStore::new( state.chunks.clone(), + state.chunk_indices.clone(), state.candidate_hashes.clone(), + state.candidate_hash_to_core_index.clone(), ); build_overseer_for_availability_read( @@ -226,7 +266,12 @@ pub fn prepare_test( TestDataAvailability::Write => { let availability_distribution = AvailabilityDistributionSubsystem::new( state.test_authorities.keyring.keystore(), - IncomingRequestReceivers { pov_req_receiver, chunk_req_receiver }, + IncomingRequestReceivers { + pov_req_receiver, + chunk_req_v1_receiver, + chunk_req_v2_receiver, + }, + state.req_protocol_names.clone(), Metrics::try_register(&dependencies.registry).unwrap(), ); @@ -296,6 +341,7 @@ pub async fn benchmark_availability_read( Some(GroupIndex( candidate_num as u32 % (std::cmp::max(5, config.n_cores) / 5) as u32, )), + Some(*state.candidate_hash_to_core_index.get(&candidate.hash()).unwrap()), tx, ), ); @@ -341,7 +387,7 @@ pub async fn benchmark_availability_write( env.metrics().set_n_cores(config.n_cores); gum::info!(target: LOG_TARGET, "Seeding availability store with candidates ..."); - for backed_candidate in state.backed_candidates.clone() { + for (core_index, backed_candidate) in state.backed_candidates.clone().into_iter().enumerate() { let candidate_index = *state.candidate_hashes.get(&backed_candidate.hash()).unwrap(); let available_data = state.available_data[candidate_index].clone(); let (tx, rx) = oneshot::channel(); @@ -352,6 +398,8 @@ pub async fn benchmark_availability_write( available_data, expected_erasure_root: backed_candidate.descriptor().erasure_root, tx, + core_index: CoreIndex(core_index as u32), + node_features: node_features_with_chunk_mapping_enabled(), }, )) .await; diff --git 
a/polkadot/node/subsystem-bench/src/lib/availability/test_state.rs b/polkadot/node/subsystem-bench/src/lib/availability/test_state.rs index c328ffedf916e..5d443734bb387 100644 --- a/polkadot/node/subsystem-bench/src/lib/availability/test_state.rs +++ b/polkadot/node/subsystem-bench/src/lib/availability/test_state.rs @@ -14,22 +14,28 @@ // You should have received a copy of the GNU General Public License // along with Polkadot. If not, see . -use crate::configuration::{TestAuthorities, TestConfiguration}; +use crate::{ + configuration::{TestAuthorities, TestConfiguration}, + environment::GENESIS_HASH, + mock::runtime_api::node_features_with_chunk_mapping_enabled, +}; use bitvec::bitvec; use colored::Colorize; use itertools::Itertools; use parity_scale_codec::Encode; use polkadot_node_network_protocol::{ - request_response::v1::ChunkFetchingRequest, Versioned, VersionedValidationProtocol, + request_response::{v2::ChunkFetchingRequest, ReqProtocolNames}, + Versioned, VersionedValidationProtocol, }; use polkadot_node_primitives::{AvailableData, BlockData, ErasureChunk, PoV}; use polkadot_node_subsystem_test_helpers::{ derive_erasure_chunks_with_proofs_and_root, mock::new_block_import_info, }; +use polkadot_node_subsystem_util::availability_chunks::availability_chunk_indices; use polkadot_overseer::BlockInfo; use polkadot_primitives::{ - AvailabilityBitfield, BlockNumber, CandidateHash, CandidateReceipt, Hash, HeadData, Header, - PersistedValidationData, Signed, SigningContext, ValidatorIndex, + AvailabilityBitfield, BlockNumber, CandidateHash, CandidateReceipt, ChunkIndex, CoreIndex, + Hash, HeadData, Header, PersistedValidationData, Signed, SigningContext, ValidatorIndex, }; use polkadot_primitives_test_helpers::{dummy_candidate_receipt, dummy_hash}; use sp_core::H256; @@ -49,14 +55,20 @@ pub struct TestState { pub pov_size_to_candidate: HashMap, // Map from generated candidate hashes to candidate index in `available_data` and `chunks`. 
pub candidate_hashes: HashMap, + // Map from candidate hash to occupied core index. + pub candidate_hash_to_core_index: HashMap, // Per candidate index receipts. pub candidate_receipt_templates: Vec, // Per candidate index `AvailableData` pub available_data: Vec, - // Per candiadte index chunks + // Per candidate index chunks pub chunks: Vec>, + // Per-core ValidatorIndex -> ChunkIndex mapping + pub chunk_indices: Vec>, // Per relay chain block - candidate backed by our backing group pub backed_candidates: Vec, + // Request protocol names + pub req_protocol_names: ReqProtocolNames, // Relay chain block infos pub block_infos: Vec, // Chung fetching requests for backed candidates @@ -89,6 +101,9 @@ impl TestState { candidate_receipts: Default::default(), block_headers: Default::default(), test_authorities: config.generate_authorities(), + req_protocol_names: ReqProtocolNames::new(GENESIS_HASH, None), + chunk_indices: Default::default(), + candidate_hash_to_core_index: Default::default(), }; // we use it for all candidates. @@ -99,6 +114,17 @@ impl TestState { relay_parent_storage_root: Default::default(), }; + test_state.chunk_indices = (0..config.n_cores) + .map(|core_index| { + availability_chunk_indices( + Some(&node_features_with_chunk_mapping_enabled()), + config.n_validators, + CoreIndex(core_index as u32), + ) + .unwrap() + }) + .collect(); + // For each unique pov we create a candidate receipt. 
for (index, pov_size) in config.pov_sizes().iter().cloned().unique().enumerate() { gum::info!(target: LOG_TARGET, index, pov_size, "{}", "Generating template candidate".bright_blue()); @@ -167,6 +193,11 @@ impl TestState { // Store the new candidate in the state test_state.candidate_hashes.insert(candidate_receipt.hash(), candidate_index); + let core_index = (index % config.n_cores) as u32; + test_state + .candidate_hash_to_core_index + .insert(candidate_receipt.hash(), core_index.into()); + gum::debug!(target: LOG_TARGET, candidate_hash = ?candidate_receipt.hash(), "new candidate"); candidate_receipt diff --git a/polkadot/node/subsystem-bench/src/lib/mock/av_store.rs b/polkadot/node/subsystem-bench/src/lib/mock/av_store.rs index a035bf0189776..14ec4ccb4c32a 100644 --- a/polkadot/node/subsystem-bench/src/lib/mock/av_store.rs +++ b/polkadot/node/subsystem-bench/src/lib/mock/av_store.rs @@ -20,7 +20,7 @@ use crate::network::{HandleNetworkMessage, NetworkMessage}; use futures::{channel::oneshot, FutureExt}; use parity_scale_codec::Encode; use polkadot_node_network_protocol::request_response::{ - v1::{AvailableDataFetchingResponse, ChunkFetchingResponse, ChunkResponse}, + v1::AvailableDataFetchingResponse, v2::ChunkFetchingResponse, Protocol, ReqProtocolNames, Requests, }; use polkadot_node_primitives::{AvailableData, ErasureChunk}; @@ -28,13 +28,14 @@ use polkadot_node_subsystem::{ messages::AvailabilityStoreMessage, overseer, SpawnedSubsystem, SubsystemError, }; use polkadot_node_subsystem_types::OverseerSignal; -use polkadot_primitives::CandidateHash; -use sc_network::ProtocolName; +use polkadot_primitives::{CandidateHash, ChunkIndex, CoreIndex, ValidatorIndex}; use std::collections::HashMap; pub struct AvailabilityStoreState { candidate_hashes: HashMap, chunks: Vec>, + chunk_indices: Vec>, + candidate_hash_to_core_index: HashMap, } const LOG_TARGET: &str = "subsystem-bench::av-store-mock"; @@ -43,9 +44,12 @@ const LOG_TARGET: &str = 
"subsystem-bench::av-store-mock"; /// used in a test. #[derive(Clone)] pub struct NetworkAvailabilityState { + pub req_protocol_names: ReqProtocolNames, pub candidate_hashes: HashMap, pub available_data: Vec, pub chunks: Vec>, + pub chunk_indices: Vec>, + pub candidate_hash_to_core_index: HashMap, } // Implement access to the state. @@ -58,7 +62,7 @@ impl HandleNetworkMessage for NetworkAvailabilityState { ) -> Option { match message { NetworkMessage::RequestFromNode(peer, request) => match request { - Requests::ChunkFetchingV1(outgoing_request) => { + Requests::ChunkFetching(outgoing_request) => { gum::debug!(target: LOG_TARGET, request = ?outgoing_request, "Received `RequestFromNode`"); let validator_index: usize = outgoing_request.payload.index.0 as usize; let candidate_hash = outgoing_request.payload.candidate_hash; @@ -69,11 +73,22 @@ impl HandleNetworkMessage for NetworkAvailabilityState { .expect("candidate was generated previously; qed"); gum::warn!(target: LOG_TARGET, ?candidate_hash, candidate_index, "Candidate mapped to index"); - let chunk: ChunkResponse = - self.chunks.get(*candidate_index).unwrap()[validator_index].clone().into(); + let candidate_chunks = self.chunks.get(*candidate_index).unwrap(); + let chunk_indices = self + .chunk_indices + .get( + self.candidate_hash_to_core_index.get(&candidate_hash).unwrap().0 + as usize, + ) + .unwrap(); + + let chunk = candidate_chunks + .get(chunk_indices.get(validator_index).unwrap().0 as usize) + .unwrap(); + let response = Ok(( - ChunkFetchingResponse::from(Some(chunk)).encode(), - ProtocolName::Static("dummy"), + ChunkFetchingResponse::from(Some(chunk.clone())).encode(), + self.req_protocol_names.get_name(Protocol::ChunkFetchingV2), )); if let Err(err) = outgoing_request.pending_response.send(response) { @@ -94,7 +109,7 @@ impl HandleNetworkMessage for NetworkAvailabilityState { let response = Ok(( AvailableDataFetchingResponse::from(Some(available_data)).encode(), - ProtocolName::Static("dummy"), + 
self.req_protocol_names.get_name(Protocol::AvailableDataFetchingV1), )); outgoing_request .pending_response @@ -119,16 +134,25 @@ pub struct MockAvailabilityStore { impl MockAvailabilityStore { pub fn new( chunks: Vec>, + chunk_indices: Vec>, candidate_hashes: HashMap, + candidate_hash_to_core_index: HashMap, ) -> MockAvailabilityStore { - Self { state: AvailabilityStoreState { chunks, candidate_hashes } } + Self { + state: AvailabilityStoreState { + chunks, + candidate_hashes, + chunk_indices, + candidate_hash_to_core_index, + }, + } } async fn respond_to_query_all_request( &self, candidate_hash: CandidateHash, - send_chunk: impl Fn(usize) -> bool, - tx: oneshot::Sender>, + send_chunk: impl Fn(ValidatorIndex) -> bool, + tx: oneshot::Sender>, ) { let candidate_index = self .state @@ -137,15 +161,27 @@ impl MockAvailabilityStore { .expect("candidate was generated previously; qed"); gum::debug!(target: LOG_TARGET, ?candidate_hash, candidate_index, "Candidate mapped to index"); - let v = self - .state - .chunks - .get(*candidate_index) - .unwrap() - .iter() - .filter(|c| send_chunk(c.index.0 as usize)) - .cloned() - .collect(); + let n_validators = self.state.chunks[0].len(); + let candidate_chunks = self.state.chunks.get(*candidate_index).unwrap(); + let core_index = self.state.candidate_hash_to_core_index.get(&candidate_hash).unwrap(); + // We'll likely only send our chunk, so use capacity 1. 
+ let mut v = Vec::with_capacity(1); + + for validator_index in 0..n_validators { + if !send_chunk(ValidatorIndex(validator_index as u32)) { + continue; + } + let chunk_index = self + .state + .chunk_indices + .get(core_index.0 as usize) + .unwrap() + .get(validator_index) + .unwrap(); + + let chunk = candidate_chunks.get(chunk_index.0 as usize).unwrap().clone(); + v.push((ValidatorIndex(validator_index as u32), chunk.clone())); + } let _ = tx.send(v); } @@ -182,8 +218,12 @@ impl MockAvailabilityStore { AvailabilityStoreMessage::QueryAllChunks(candidate_hash, tx) => { // We always have our own chunk. gum::debug!(target: LOG_TARGET, candidate_hash = ?candidate_hash, "Responding to QueryAllChunks"); - self.respond_to_query_all_request(candidate_hash, |index| index == 0, tx) - .await; + self.respond_to_query_all_request( + candidate_hash, + |index| index == 0.into(), + tx, + ) + .await; }, AvailabilityStoreMessage::QueryChunkSize(candidate_hash, tx) => { gum::debug!(target: LOG_TARGET, candidate_hash = ?candidate_hash, "Responding to QueryChunkSize"); @@ -195,12 +235,29 @@ impl MockAvailabilityStore { .expect("candidate was generated previously; qed"); gum::debug!(target: LOG_TARGET, ?candidate_hash, candidate_index, "Candidate mapped to index"); - let chunk_size = - self.state.chunks.get(*candidate_index).unwrap()[0].encoded_size(); + let chunk_size = self + .state + .chunks + .get(*candidate_index) + .unwrap() + .first() + .unwrap() + .encoded_size(); let _ = tx.send(Some(chunk_size)); }, - AvailabilityStoreMessage::StoreChunk { candidate_hash, chunk, tx } => { - gum::debug!(target: LOG_TARGET, chunk_index = ?chunk.index ,candidate_hash = ?candidate_hash, "Responding to StoreChunk"); + AvailabilityStoreMessage::StoreChunk { + candidate_hash, + chunk, + tx, + validator_index, + } => { + gum::debug!( + target: LOG_TARGET, + chunk_index = ?chunk.index, + validator_index = ?validator_index, + candidate_hash = ?candidate_hash, + "Responding to StoreChunk" + ); let _ = 
tx.send(Ok(())); }, _ => { diff --git a/polkadot/node/subsystem-bench/src/lib/mock/network_bridge.rs b/polkadot/node/subsystem-bench/src/lib/mock/network_bridge.rs index 10508f456a48f..d70953926d130 100644 --- a/polkadot/node/subsystem-bench/src/lib/mock/network_bridge.rs +++ b/polkadot/node/subsystem-bench/src/lib/mock/network_bridge.rs @@ -37,7 +37,7 @@ use sc_network::{request_responses::ProtocolConfig, RequestFailure}; const LOG_TARGET: &str = "subsystem-bench::network-bridge"; const ALLOWED_PROTOCOLS: &[&str] = &[ - "/ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff/req_chunk/1", + "/ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff/req_chunk/2", "/ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff/req_attested_candidate/2", ]; diff --git a/polkadot/node/subsystem-bench/src/lib/mock/runtime_api.rs b/polkadot/node/subsystem-bench/src/lib/mock/runtime_api.rs index 9788a1123ec03..be9dbd55cb6f9 100644 --- a/polkadot/node/subsystem-bench/src/lib/mock/runtime_api.rs +++ b/polkadot/node/subsystem-bench/src/lib/mock/runtime_api.rs @@ -26,9 +26,9 @@ use polkadot_node_subsystem::{ }; use polkadot_node_subsystem_types::OverseerSignal; use polkadot_primitives::{ - AsyncBackingParams, CandidateEvent, CandidateReceipt, CoreState, GroupIndex, GroupRotationInfo, - IndexedVec, NodeFeatures, OccupiedCore, ScheduledCore, SessionIndex, SessionInfo, - ValidatorIndex, + node_features, AsyncBackingParams, CandidateEvent, CandidateReceipt, CoreState, GroupIndex, + GroupRotationInfo, IndexedVec, NodeFeatures, OccupiedCore, ScheduledCore, SessionIndex, + SessionInfo, ValidatorIndex, }; use sp_consensus_babe::Epoch as BabeEpoch; use sp_core::H256; @@ -41,6 +41,8 @@ const LOG_TARGET: &str = "subsystem-bench::runtime-api-mock"; pub struct RuntimeApiState { // All authorities in the test, authorities: TestAuthorities, + // Node features state in the runtime + node_features: NodeFeatures, // Candidate hashes per block candidate_hashes: HashMap>, 
// Included candidates per bock @@ -76,6 +78,9 @@ impl MockRuntimeApi { session_index: SessionIndex, core_state: MockRuntimeApiCoreState, ) -> MockRuntimeApi { + // Enable chunk mapping feature to make systematic av-recovery possible. + let node_features = node_features_with_chunk_mapping_enabled(); + Self { state: RuntimeApiState { authorities, @@ -83,6 +88,7 @@ impl MockRuntimeApi { included_candidates, babe_epoch, session_index, + node_features, }, config, core_state, @@ -168,15 +174,15 @@ impl MockRuntimeApi { }, RuntimeApiMessage::Request( _block_hash, - RuntimeApiRequest::SessionExecutorParams(_session_index, sender), + RuntimeApiRequest::NodeFeatures(_session_index, sender), ) => { - let _ = sender.send(Ok(Some(Default::default()))); + let _ = sender.send(Ok(self.state.node_features.clone())); }, RuntimeApiMessage::Request( - _request, - RuntimeApiRequest::NodeFeatures(_session_index, sender), + _block_hash, + RuntimeApiRequest::SessionExecutorParams(_session_index, sender), ) => { - let _ = sender.send(Ok(NodeFeatures::EMPTY)); + let _ = sender.send(Ok(Some(Default::default()))); }, RuntimeApiMessage::Request( _block_hash, @@ -292,3 +298,10 @@ impl MockRuntimeApi { } } } + +pub fn node_features_with_chunk_mapping_enabled() -> NodeFeatures { + let mut node_features = NodeFeatures::new(); + node_features.resize(node_features::FeatureIndex::AvailabilityChunkMapping as usize + 1, false); + node_features.set(node_features::FeatureIndex::AvailabilityChunkMapping as u8 as usize, true); + node_features +} diff --git a/polkadot/node/subsystem-bench/src/lib/network.rs b/polkadot/node/subsystem-bench/src/lib/network.rs index 9686f456b9e65..775f881eaad84 100644 --- a/polkadot/node/subsystem-bench/src/lib/network.rs +++ b/polkadot/node/subsystem-bench/src/lib/network.rs @@ -1016,7 +1016,7 @@ pub trait RequestExt { impl RequestExt for Requests { fn authority_id(&self) -> Option<&AuthorityDiscoveryId> { match self { - Requests::ChunkFetchingV1(request) => { + 
Requests::ChunkFetching(request) => { if let Recipient::Authority(authority_id) = &request.peer { Some(authority_id) } else { @@ -1052,7 +1052,7 @@ impl RequestExt for Requests { fn into_response_sender(self) -> ResponseSender { match self { - Requests::ChunkFetchingV1(outgoing_request) => outgoing_request.pending_response, + Requests::ChunkFetching(outgoing_request) => outgoing_request.pending_response, Requests::AvailableDataFetchingV1(outgoing_request) => outgoing_request.pending_response, _ => unimplemented!("unsupported request type"), @@ -1062,7 +1062,7 @@ impl RequestExt for Requests { /// Swaps the `ResponseSender` and returns the previous value. fn swap_response_sender(&mut self, new_sender: ResponseSender) -> ResponseSender { match self { - Requests::ChunkFetchingV1(outgoing_request) => + Requests::ChunkFetching(outgoing_request) => std::mem::replace(&mut outgoing_request.pending_response, new_sender), Requests::AvailableDataFetchingV1(outgoing_request) => std::mem::replace(&mut outgoing_request.pending_response, new_sender), @@ -1075,7 +1075,7 @@ impl RequestExt for Requests { /// Returns the size in bytes of the request payload. 
fn size(&self) -> usize { match self { - Requests::ChunkFetchingV1(outgoing_request) => outgoing_request.payload.encoded_size(), + Requests::ChunkFetching(outgoing_request) => outgoing_request.payload.encoded_size(), Requests::AvailableDataFetchingV1(outgoing_request) => outgoing_request.payload.encoded_size(), Requests::AttestedCandidateV2(outgoing_request) => diff --git a/polkadot/node/subsystem-test-helpers/src/lib.rs b/polkadot/node/subsystem-test-helpers/src/lib.rs index 6c1ac86c4507b..375121c374637 100644 --- a/polkadot/node/subsystem-test-helpers/src/lib.rs +++ b/polkadot/node/subsystem-test-helpers/src/lib.rs @@ -25,7 +25,7 @@ use polkadot_node_subsystem::{ SubsystemError, SubsystemResult, TrySendError, }; use polkadot_node_subsystem_util::TimeoutExt; -use polkadot_primitives::{Hash, ValidatorIndex}; +use polkadot_primitives::{ChunkIndex, Hash}; use futures::{channel::mpsc, poll, prelude::*}; use parking_lot::Mutex; @@ -487,7 +487,7 @@ pub fn derive_erasure_chunks_with_proofs_and_root( .enumerate() .map(|(index, (proof, chunk))| ErasureChunk { chunk: chunk.to_vec(), - index: ValidatorIndex(index as _), + index: ChunkIndex(index as _), proof: Proof::try_from(proof).unwrap(), }) .collect::>(); diff --git a/polkadot/node/subsystem-types/Cargo.toml b/polkadot/node/subsystem-types/Cargo.toml index 93dd43c5dbfc4..e03fc60a1fd73 100644 --- a/polkadot/node/subsystem-types/Cargo.toml +++ b/polkadot/node/subsystem-types/Cargo.toml @@ -11,6 +11,7 @@ workspace = true [dependencies] derive_more = "0.99.17" +fatality = "0.1.1" futures = "0.3.30" polkadot-primitives = { path = "../../primitives" } polkadot-node-primitives = { path = "../primitives" } diff --git a/polkadot/node/subsystem-types/src/errors.rs b/polkadot/node/subsystem-types/src/errors.rs index 44136362a69ef..b8e70641243ea 100644 --- a/polkadot/node/subsystem-types/src/errors.rs +++ b/polkadot/node/subsystem-types/src/errors.rs @@ -18,6 +18,7 @@ use crate::JaegerError; use ::orchestra::OrchestraError as 
OverseerError; +use fatality::fatality; /// A description of an error causing the runtime API request to be unservable. #[derive(thiserror::Error, Debug, Clone)] @@ -68,32 +69,21 @@ impl core::fmt::Display for ChainApiError { impl std::error::Error for ChainApiError {} /// An error that may happen during Availability Recovery process. -#[derive(PartialEq, Debug, Clone)] +#[derive(PartialEq, Clone)] +#[fatality(splitable)] +#[allow(missing_docs)] pub enum RecoveryError { - /// A chunk is recovered but is invalid. + #[error("Invalid data")] Invalid, - /// A requested chunk is unavailable. + #[error("Data is unavailable")] Unavailable, - /// Erasure task channel closed, usually means node is shutting down. + #[fatal] + #[error("Erasure task channel closed")] ChannelClosed, } -impl std::fmt::Display for RecoveryError { - fn fmt(&self, f: &mut core::fmt::Formatter) -> Result<(), core::fmt::Error> { - let msg = match self { - RecoveryError::Invalid => "Invalid", - RecoveryError::Unavailable => "Unavailable", - RecoveryError::ChannelClosed => "ChannelClosed", - }; - - write!(f, "{}", msg) - } -} - -impl std::error::Error for RecoveryError {} - /// An error type that describes faults that may happen /// /// These are: diff --git a/polkadot/node/subsystem-types/src/messages.rs b/polkadot/node/subsystem-types/src/messages.rs index 2a54b3aed301e..722a97989bce0 100644 --- a/polkadot/node/subsystem-types/src/messages.rs +++ b/polkadot/node/subsystem-types/src/messages.rs @@ -480,6 +480,8 @@ pub enum AvailabilityRecoveryMessage { CandidateReceipt, SessionIndex, Option, // Optional backing group to request from first. + Option, /* A `CoreIndex` needs to be specified for the recovery process to + * prefer systematic chunk recovery. */ oneshot::Sender>, ), } @@ -515,7 +517,7 @@ pub enum AvailabilityStoreMessage { QueryChunkSize(CandidateHash, oneshot::Sender>), /// Query all chunks that we have for the given candidate hash. 
- QueryAllChunks(CandidateHash, oneshot::Sender>), + QueryAllChunks(CandidateHash, oneshot::Sender>), /// Query whether an `ErasureChunk` exists within the AV Store. /// @@ -530,6 +532,8 @@ pub enum AvailabilityStoreMessage { StoreChunk { /// A hash of the candidate this chunk belongs to. candidate_hash: CandidateHash, + /// Validator index. May not be equal to the chunk index. + validator_index: ValidatorIndex, /// The chunk itself. chunk: ErasureChunk, /// Sending side of the channel to send result to. @@ -549,6 +553,11 @@ pub enum AvailabilityStoreMessage { available_data: AvailableData, /// Erasure root we expect to get after chunking. expected_erasure_root: Hash, + /// Core index where the candidate was backed. + core_index: CoreIndex, + /// Node features at the candidate relay parent. Used for computing the validator->chunk + /// mapping. + node_features: NodeFeatures, /// Sending side of the channel to send result to. tx: oneshot::Sender>, }, diff --git a/polkadot/node/subsystem-util/Cargo.toml b/polkadot/node/subsystem-util/Cargo.toml index 219ea4d3f57d0..9259ca94f0735 100644 --- a/polkadot/node/subsystem-util/Cargo.toml +++ b/polkadot/node/subsystem-util/Cargo.toml @@ -24,6 +24,7 @@ gum = { package = "tracing-gum", path = "../gum" } derive_more = "0.99.17" schnellru = "0.2.1" +erasure-coding = { package = "polkadot-erasure-coding", path = "../../erasure-coding" } polkadot-node-subsystem = { path = "../subsystem" } polkadot-node-subsystem-types = { path = "../subsystem-types" } polkadot-node-jaeger = { path = "../jaeger" } diff --git a/polkadot/node/subsystem-util/src/availability_chunks.rs b/polkadot/node/subsystem-util/src/availability_chunks.rs new file mode 100644 index 0000000000000..45168e4512e15 --- /dev/null +++ b/polkadot/node/subsystem-util/src/availability_chunks.rs @@ -0,0 +1,227 @@ +// Copyright (C) Parity Technologies (UK) Ltd. +// This file is part of Polkadot. 
+ +// Polkadot is free software: you can redistribute it and/or modify +// it under the terms of the GNU General Public License as published by +// the Free Software Foundation, either version 3 of the License, or +// (at your option) any later version. + +// Polkadot is distributed in the hope that it will be useful, +// but WITHOUT ANY WARRANTY; without even the implied warranty of +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +// GNU General Public License for more details. + +// You should have received a copy of the GNU General Public License +// along with Polkadot. If not, see . + +use erasure_coding::systematic_recovery_threshold; +use polkadot_primitives::{node_features, ChunkIndex, CoreIndex, NodeFeatures, ValidatorIndex}; + +/// Compute the per-validator availability chunk index. +/// WARNING: THIS FUNCTION IS CRITICAL TO PARACHAIN CONSENSUS. +/// Any modification to the output of the function needs to be coordinated via the runtime. +/// It's best to use minimal/no external dependencies. +pub fn availability_chunk_index( + maybe_node_features: Option<&NodeFeatures>, + n_validators: usize, + core_index: CoreIndex, + validator_index: ValidatorIndex, +) -> Result { + if let Some(features) = maybe_node_features { + if let Some(&true) = features + .get(usize::from(node_features::FeatureIndex::AvailabilityChunkMapping as u8)) + .as_deref() + { + let systematic_threshold = systematic_recovery_threshold(n_validators)? as u32; + let core_start_pos = core_index.0 * systematic_threshold; + + return Ok(ChunkIndex((core_start_pos + validator_index.0) % n_validators as u32)) + } + } + + Ok(validator_index.into()) +} + +/// Compute the per-core availability chunk indices. Returns a Vec which maps ValidatorIndex to +/// ChunkIndex for a given availability core index +/// WARNING: THIS FUNCTION IS CRITICAL TO PARACHAIN CONSENSUS. +/// Any modification to the output of the function needs to be coordinated via the +/// runtime. 
It's best to use minimal/no external dependencies. +pub fn availability_chunk_indices( + maybe_node_features: Option<&NodeFeatures>, + n_validators: usize, + core_index: CoreIndex, +) -> Result, erasure_coding::Error> { + let identity = (0..n_validators).map(|index| ChunkIndex(index as u32)); + if let Some(features) = maybe_node_features { + if let Some(&true) = features + .get(usize::from(node_features::FeatureIndex::AvailabilityChunkMapping as u8)) + .as_deref() + { + let systematic_threshold = systematic_recovery_threshold(n_validators)? as u32; + let core_start_pos = core_index.0 * systematic_threshold; + + return Ok(identity + .into_iter() + .cycle() + .skip(core_start_pos as usize) + .take(n_validators) + .collect()) + } + } + + Ok(identity.collect()) +} + +#[cfg(test)] +mod tests { + use super::*; + use std::collections::HashSet; + + pub fn node_features_with_mapping_enabled() -> NodeFeatures { + let mut node_features = NodeFeatures::new(); + node_features + .resize(node_features::FeatureIndex::AvailabilityChunkMapping as usize + 1, false); + node_features + .set(node_features::FeatureIndex::AvailabilityChunkMapping as u8 as usize, true); + node_features + } + + pub fn node_features_with_other_bits_enabled() -> NodeFeatures { + let mut node_features = NodeFeatures::new(); + node_features.resize(node_features::FeatureIndex::FirstUnassigned as usize + 1, true); + node_features + .set(node_features::FeatureIndex::AvailabilityChunkMapping as u8 as usize, false); + node_features + } + + #[test] + fn test_availability_chunk_indices() { + let n_validators = 20u32; + let n_cores = 15u32; + + // If the mapping feature is not enabled, it should always be the identity vector. 
+ { + for node_features in + [None, Some(NodeFeatures::EMPTY), Some(node_features_with_other_bits_enabled())] + { + for core_index in 0..n_cores { + let indices = availability_chunk_indices( + node_features.as_ref(), + n_validators as usize, + CoreIndex(core_index), + ) + .unwrap(); + + for validator_index in 0..n_validators { + assert_eq!( + indices[validator_index as usize], + availability_chunk_index( + node_features.as_ref(), + n_validators as usize, + CoreIndex(core_index), + ValidatorIndex(validator_index) + ) + .unwrap() + ) + } + + assert_eq!( + indices, + (0..n_validators).map(|i| ChunkIndex(i)).collect::>() + ); + } + } + } + + // Test when mapping feature is enabled. + { + let node_features = node_features_with_mapping_enabled(); + let mut previous_indices = None; + + for core_index in 0..n_cores { + let indices = availability_chunk_indices( + Some(&node_features), + n_validators as usize, + CoreIndex(core_index), + ) + .unwrap(); + + for validator_index in 0..n_validators { + assert_eq!( + indices[validator_index as usize], + availability_chunk_index( + Some(&node_features), + n_validators as usize, + CoreIndex(core_index), + ValidatorIndex(validator_index) + ) + .unwrap() + ) + } + + // Check that it's not equal to the previous core's indices. + if let Some(previous_indices) = previous_indices { + assert_ne!(previous_indices, indices); + } + + previous_indices = Some(indices.clone()); + + // Check that it's indeed a permutation. + assert_eq!( + (0..n_validators).map(|i| ChunkIndex(i)).collect::>(), + indices.into_iter().collect::>() + ); + } + } + } + + #[test] + // This is just a dummy test that checks the mapping against some hardcoded outputs, to prevent + // accidental changes to the algorithms. 
+ fn prevent_changes_to_mapping() { + let n_validators = 7; + let node_features = node_features_with_mapping_enabled(); + + assert_eq!( + availability_chunk_indices(Some(&node_features), n_validators, CoreIndex(0)) + .unwrap() + .into_iter() + .map(|i| i.0) + .collect::>(), + vec![0, 1, 2, 3, 4, 5, 6] + ); + assert_eq!( + availability_chunk_indices(Some(&node_features), n_validators, CoreIndex(1)) + .unwrap() + .into_iter() + .map(|i| i.0) + .collect::>(), + vec![2, 3, 4, 5, 6, 0, 1] + ); + assert_eq!( + availability_chunk_indices(Some(&node_features), n_validators, CoreIndex(2)) + .unwrap() + .into_iter() + .map(|i| i.0) + .collect::>(), + vec![4, 5, 6, 0, 1, 2, 3] + ); + assert_eq!( + availability_chunk_indices(Some(&node_features), n_validators, CoreIndex(3)) + .unwrap() + .into_iter() + .map(|i| i.0) + .collect::>(), + vec![6, 0, 1, 2, 3, 4, 5] + ); + assert_eq!( + availability_chunk_indices(Some(&node_features), n_validators, CoreIndex(4)) + .unwrap() + .into_iter() + .map(|i| i.0) + .collect::>(), + vec![1, 2, 3, 4, 5, 6, 0] + ); + } +} diff --git a/polkadot/node/subsystem-util/src/lib.rs b/polkadot/node/subsystem-util/src/lib.rs index b93818070a183..d371b699b9eb9 100644 --- a/polkadot/node/subsystem-util/src/lib.rs +++ b/polkadot/node/subsystem-util/src/lib.rs @@ -25,17 +25,15 @@ #![warn(missing_docs)] +pub use overseer::{ + gen::{OrchestraError as OverseerError, Timeout}, + Subsystem, TimeoutExt, +}; use polkadot_node_subsystem::{ errors::{RuntimeApiError, SubsystemError}, messages::{RuntimeApiMessage, RuntimeApiRequest, RuntimeApiSender}, overseer, SubsystemSender, }; -use polkadot_primitives::{async_backing::BackingState, slashing, CoreIndex, ExecutorParams}; - -pub use overseer::{ - gen::{OrchestraError as OverseerError, Timeout}, - Subsystem, TimeoutExt, -}; pub use polkadot_node_metrics::{metrics, Metronome}; @@ -43,11 +41,12 @@ use futures::channel::{mpsc, oneshot}; use parity_scale_codec::Encode; use polkadot_primitives::{ - AsyncBackingParams, 
AuthorityDiscoveryId, CandidateEvent, CandidateHash, - CommittedCandidateReceipt, CoreState, EncodeAs, GroupIndex, GroupRotationInfo, Hash, - Id as ParaId, OccupiedCoreAssumption, PersistedValidationData, ScrapedOnChainVotes, - SessionIndex, SessionInfo, Signed, SigningContext, ValidationCode, ValidationCodeHash, - ValidatorId, ValidatorIndex, ValidatorSignature, + async_backing::BackingState, slashing, AsyncBackingParams, AuthorityDiscoveryId, + CandidateEvent, CandidateHash, CommittedCandidateReceipt, CoreIndex, CoreState, EncodeAs, + ExecutorParams, GroupIndex, GroupRotationInfo, Hash, Id as ParaId, OccupiedCoreAssumption, + PersistedValidationData, ScrapedOnChainVotes, SessionIndex, SessionInfo, Signed, + SigningContext, ValidationCode, ValidationCodeHash, ValidatorId, ValidatorIndex, + ValidatorSignature, }; pub use rand; use sp_application_crypto::AppCrypto; @@ -60,17 +59,18 @@ use std::{ use thiserror::Error; use vstaging::get_disabled_validators_with_fallback; +pub use determine_new_blocks::determine_new_blocks; pub use metered; pub use polkadot_node_network_protocol::MIN_GOSSIP_PEERS; -pub use determine_new_blocks::determine_new_blocks; - /// These reexports are required so that external crates can use the `delegated_subsystem` macro /// properly. pub mod reexports { pub use polkadot_overseer::gen::{SpawnedSubsystem, Spawner, Subsystem, SubsystemContext}; } +/// Helpers for the validator->chunk index mapping. +pub mod availability_chunks; /// A utility for managing the implicit view of the relay-chain derived from active /// leaves and the minimum allowed relay-parents that parachain candidates can have /// and be backed in those leaves' children. 
diff --git a/polkadot/node/subsystem-util/src/runtime/error.rs b/polkadot/node/subsystem-util/src/runtime/error.rs index 8751693b078a6..1111b119e95f5 100644 --- a/polkadot/node/subsystem-util/src/runtime/error.rs +++ b/polkadot/node/subsystem-util/src/runtime/error.rs @@ -28,7 +28,7 @@ pub enum Error { /// Runtime API subsystem is down, which means we're shutting down. #[fatal] #[error("Runtime request got canceled")] - RuntimeRequestCanceled(oneshot::Canceled), + RuntimeRequestCanceled(#[from] oneshot::Canceled), /// Some request to the runtime failed. /// For example if we prune a block we're requesting info about. diff --git a/polkadot/node/subsystem-util/src/runtime/mod.rs b/polkadot/node/subsystem-util/src/runtime/mod.rs index 714384b32e37b..214c58a8e88f7 100644 --- a/polkadot/node/subsystem-util/src/runtime/mod.rs +++ b/polkadot/node/subsystem-util/src/runtime/mod.rs @@ -31,8 +31,8 @@ use polkadot_node_subsystem::{ use polkadot_node_subsystem_types::UnpinHandle; use polkadot_primitives::{ node_features::FeatureIndex, slashing, AsyncBackingParams, CandidateEvent, CandidateHash, - CoreState, EncodeAs, ExecutorParams, GroupIndex, GroupRotationInfo, Hash, IndexedVec, - NodeFeatures, OccupiedCore, ScrapedOnChainVotes, SessionIndex, SessionInfo, Signed, + CoreIndex, CoreState, EncodeAs, ExecutorParams, GroupIndex, GroupRotationInfo, Hash, + IndexedVec, NodeFeatures, OccupiedCore, ScrapedOnChainVotes, SessionIndex, SessionInfo, Signed, SigningContext, UncheckedSigned, ValidationCode, ValidationCodeHash, ValidatorId, ValidatorIndex, LEGACY_MIN_BACKING_VOTES, }; @@ -348,7 +348,7 @@ where pub async fn get_occupied_cores( sender: &mut Sender, relay_parent: Hash, -) -> Result> +) -> Result> where Sender: overseer::SubsystemSender, { @@ -356,9 +356,10 @@ where Ok(cores .into_iter() - .filter_map(|core_state| { + .enumerate() + .filter_map(|(core_index, core_state)| { if let CoreState::Occupied(occupied) = core_state { - Some(occupied) + Some((CoreIndex(core_index as u32), 
occupied)) } else { None } diff --git a/polkadot/primitives/src/lib.rs b/polkadot/primitives/src/lib.rs index 01f393086a668..061794ca06d1b 100644 --- a/polkadot/primitives/src/lib.rs +++ b/polkadot/primitives/src/lib.rs @@ -41,26 +41,26 @@ pub use v7::{ ApprovalVotingParams, AssignmentId, AsyncBackingParams, AuthorityDiscoveryId, AvailabilityBitfield, BackedCandidate, Balance, BlakeTwo256, Block, BlockId, BlockNumber, CandidateCommitments, CandidateDescriptor, CandidateEvent, CandidateHash, CandidateIndex, - CandidateReceipt, CheckedDisputeStatementSet, CheckedMultiDisputeStatementSet, CollatorId, - CollatorSignature, CommittedCandidateReceipt, CompactStatement, ConsensusLog, CoreIndex, - CoreState, DisputeState, DisputeStatement, DisputeStatementSet, DownwardMessage, EncodeAs, - ExecutorParam, ExecutorParamError, ExecutorParams, ExecutorParamsHash, ExecutorParamsPrepHash, - ExplicitDisputeStatement, GroupIndex, GroupRotationInfo, Hash, HashT, HeadData, Header, - HorizontalMessages, HrmpChannelId, Id, InboundDownwardMessage, InboundHrmpMessage, IndexedVec, - InherentData, InvalidDisputeStatementKind, Moment, MultiDisputeStatementSet, NodeFeatures, - Nonce, OccupiedCore, OccupiedCoreAssumption, OutboundHrmpMessage, ParathreadClaim, - ParathreadEntry, PersistedValidationData, PvfCheckStatement, PvfExecKind, PvfPrepKind, - RuntimeMetricLabel, RuntimeMetricLabelValue, RuntimeMetricLabelValues, RuntimeMetricLabels, - RuntimeMetricOp, RuntimeMetricUpdate, ScheduledCore, ScrapedOnChainVotes, SessionIndex, - SessionInfo, Signature, Signed, SignedAvailabilityBitfield, SignedAvailabilityBitfields, - SignedStatement, SigningContext, Slot, UncheckedSigned, UncheckedSignedAvailabilityBitfield, - UncheckedSignedAvailabilityBitfields, UncheckedSignedStatement, UpgradeGoAhead, - UpgradeRestriction, UpwardMessage, ValidDisputeStatementKind, ValidationCode, - ValidationCodeHash, ValidatorId, ValidatorIndex, ValidatorSignature, ValidityAttestation, - ValidityError, 
ASSIGNMENT_KEY_TYPE_ID, LEGACY_MIN_BACKING_VOTES, LOWEST_PUBLIC_ID, - MAX_CODE_SIZE, MAX_HEAD_DATA_SIZE, MAX_POV_SIZE, MIN_CODE_SIZE, - ON_DEMAND_DEFAULT_QUEUE_MAX_SIZE, ON_DEMAND_MAX_QUEUE_MAX_SIZE, PARACHAINS_INHERENT_IDENTIFIER, - PARACHAIN_KEY_TYPE_ID, + CandidateReceipt, CheckedDisputeStatementSet, CheckedMultiDisputeStatementSet, ChunkIndex, + CollatorId, CollatorSignature, CommittedCandidateReceipt, CompactStatement, ConsensusLog, + CoreIndex, CoreState, DisputeState, DisputeStatement, DisputeStatementSet, DownwardMessage, + EncodeAs, ExecutorParam, ExecutorParamError, ExecutorParams, ExecutorParamsHash, + ExecutorParamsPrepHash, ExplicitDisputeStatement, GroupIndex, GroupRotationInfo, Hash, HashT, + HeadData, Header, HorizontalMessages, HrmpChannelId, Id, InboundDownwardMessage, + InboundHrmpMessage, IndexedVec, InherentData, InvalidDisputeStatementKind, Moment, + MultiDisputeStatementSet, NodeFeatures, Nonce, OccupiedCore, OccupiedCoreAssumption, + OutboundHrmpMessage, ParathreadClaim, ParathreadEntry, PersistedValidationData, + PvfCheckStatement, PvfExecKind, PvfPrepKind, RuntimeMetricLabel, RuntimeMetricLabelValue, + RuntimeMetricLabelValues, RuntimeMetricLabels, RuntimeMetricOp, RuntimeMetricUpdate, + ScheduledCore, ScrapedOnChainVotes, SessionIndex, SessionInfo, Signature, Signed, + SignedAvailabilityBitfield, SignedAvailabilityBitfields, SignedStatement, SigningContext, Slot, + UncheckedSigned, UncheckedSignedAvailabilityBitfield, UncheckedSignedAvailabilityBitfields, + UncheckedSignedStatement, UpgradeGoAhead, UpgradeRestriction, UpwardMessage, + ValidDisputeStatementKind, ValidationCode, ValidationCodeHash, ValidatorId, ValidatorIndex, + ValidatorSignature, ValidityAttestation, ValidityError, ASSIGNMENT_KEY_TYPE_ID, + LEGACY_MIN_BACKING_VOTES, LOWEST_PUBLIC_ID, MAX_CODE_SIZE, MAX_HEAD_DATA_SIZE, MAX_POV_SIZE, + MIN_CODE_SIZE, ON_DEMAND_DEFAULT_QUEUE_MAX_SIZE, ON_DEMAND_MAX_QUEUE_MAX_SIZE, + PARACHAINS_INHERENT_IDENTIFIER, PARACHAIN_KEY_TYPE_ID, }; 
#[cfg(feature = "std")] diff --git a/polkadot/primitives/src/v7/mod.rs b/polkadot/primitives/src/v7/mod.rs index 8a059408496c0..fb8406aece690 100644 --- a/polkadot/primitives/src/v7/mod.rs +++ b/polkadot/primitives/src/v7/mod.rs @@ -117,6 +117,34 @@ pub trait TypeIndex { #[cfg_attr(feature = "std", derive(Serialize, Deserialize, Hash))] pub struct ValidatorIndex(pub u32); +/// Index of an availability chunk. +/// +/// The underlying type is identical to `ValidatorIndex`, because +/// the number of chunks will always be equal to the number of validators. +/// However, the chunk index held by a validator may not always be equal to its `ValidatorIndex`, so +/// we use a separate type to make code easier to read. +#[derive(Eq, Ord, PartialEq, PartialOrd, Copy, Clone, Encode, Decode, TypeInfo, RuntimeDebug)] +#[cfg_attr(feature = "std", derive(Serialize, Deserialize, Hash))] +pub struct ChunkIndex(pub u32); + +impl From for ValidatorIndex { + fn from(c_index: ChunkIndex) -> Self { + ValidatorIndex(c_index.0) + } +} + +impl From for ChunkIndex { + fn from(v_index: ValidatorIndex) -> Self { + ChunkIndex(v_index.0) + } +} + +impl From for ChunkIndex { + fn from(n: u32) -> Self { + ChunkIndex(n) + } +} + // We should really get https://github.com/paritytech/polkadot/issues/2403 going .. impl From for ValidatorIndex { fn from(n: u32) -> Self { @@ -1787,6 +1815,14 @@ where self.0.get(index.type_index()) } + /// Returns a mutable reference to an element indexed using `K`. + pub fn get_mut(&mut self, index: K) -> Option<&mut V> + where + K: TypeIndex, + { + self.0.get_mut(index.type_index()) + } + /// Returns number of elements in vector. pub fn len(&self) -> usize { self.0.len() @@ -1989,6 +2025,7 @@ pub mod node_features { /// A feature index used to identify a bit into the node_features array stored /// in the HostConfiguration. #[repr(u8)] + #[derive(Clone, Copy)] pub enum FeatureIndex { /// Tells if tranch0 assignments could be sent in a single certificate. 
/// Reserved for: `` @@ -1997,10 +2034,16 @@ pub mod node_features { /// The value stored there represents the assumed core index where the candidates /// are backed. This is needed for the elastic scaling MVP. ElasticScalingMVP = 1, + /// Tells if the chunk mapping feature is enabled. + /// Enables the implementation of + /// [RFC-47](https://github.com/polkadot-fellows/RFCs/blob/main/text/0047-assignment-of-availability-chunks.md). + /// Must not be enabled unless all validators and collators have stopped using `req_chunk` + /// protocol version 1. If it is enabled, validators can start systematic chunk recovery. + AvailabilityChunkMapping = 2, /// First unassigned feature bit. /// Every time a new feature flag is assigned it should take this value. /// and this should be incremented. - FirstUnassigned = 2, + FirstUnassigned = 3, } } diff --git a/polkadot/roadmap/implementers-guide/src/node/approval/approval-voting.md b/polkadot/roadmap/implementers-guide/src/node/approval/approval-voting.md index 345b3d2e69704..9b4082c49e2f0 100644 --- a/polkadot/roadmap/implementers-guide/src/node/approval/approval-voting.md +++ b/polkadot/roadmap/implementers-guide/src/node/approval/approval-voting.md @@ -396,7 +396,7 @@ On receiving an `ApprovedAncestor(Hash, BlockNumber, response_channel)`: * Requires `(SessionIndex, SessionInfo, CandidateReceipt, ValidatorIndex, backing_group, block_hash, candidate_index)` * Extract the public key of the `ValidatorIndex` from the `SessionInfo` for the session. * Issue an `AvailabilityRecoveryMessage::RecoverAvailableData(candidate, session_index, Some(backing_group), - response_sender)` +Some(core_index), response_sender)` * Load the historical validation code of the parachain by dispatching a `RuntimeApiRequest::ValidationCodeByHash(descriptor.validation_code_hash)` against the state of `block_hash`. 
* Spawn a background task with a clone of `background_tx` diff --git a/polkadot/roadmap/implementers-guide/src/node/availability/availability-recovery.md b/polkadot/roadmap/implementers-guide/src/node/availability/availability-recovery.md index c57c4589244e7..5b756080becc0 100644 --- a/polkadot/roadmap/implementers-guide/src/node/availability/availability-recovery.md +++ b/polkadot/roadmap/implementers-guide/src/node/availability/availability-recovery.md @@ -1,84 +1,108 @@ # Availability Recovery -This subsystem is the inverse of the [Availability Distribution](availability-distribution.md) subsystem: validators -will serve the availability chunks kept in the availability store to nodes who connect to them. And the subsystem will -also implement the other side: the logic for nodes to connect to validators, request availability pieces, and -reconstruct the `AvailableData`. +This subsystem is responsible for recovering the data made available via the +[Availability Distribution](availability-distribution.md) subsystem, necessary for candidate validation during the +approval/disputes processes. It is also used by collators to recover PoVs in adversarial scenarios +where the other collators of the para are censoring blocks. -This version of the availability recovery subsystem is based off of direct connections to validators. In order to -recover any given `AvailableData`, we must recover at least `f + 1` pieces from validators of the session. Thus, we will -connect to and query randomly chosen validators until we have received `f + 1` pieces. +According to the Polkadot protocol, in order to recover any given `AvailableData`, we generally must recover at least +`f + 1` pieces from validators of the session. Thus, we should connect to and query randomly chosen validators until we +have received `f + 1` pieces.
+ +In practice, there are various optimisations implemented in this subsystem which avoid querying all chunks from +different validators and/or avoid doing the chunk reconstruction altogether. ## Protocol -`PeerSet`: `Validation` +This version of the availability recovery subsystem is based only on request-response network protocols. Input: -* `NetworkBridgeUpdate(update)` -* `AvailabilityRecoveryMessage::RecoverAvailableData(candidate, session, backing_group, response)` +* `AvailabilityRecoveryMessage::RecoverAvailableData(candidate, session, backing_group, core_index, response)` Output: -* `NetworkBridge::SendValidationMessage` -* `NetworkBridge::ReportPeer` -* `AvailabilityStore::QueryChunk` +* `NetworkBridgeMessage::SendRequests` +* `AvailabilityStoreMessage::QueryAllChunks` +* `AvailabilityStoreMessage::QueryAvailableData` +* `AvailabilityStoreMessage::QueryChunkSize` + ## Functionality -We hold a state which tracks the currently ongoing recovery tasks, as well as which request IDs correspond to which -task. A recovery task is a structure encapsulating all recovery tasks with the network necessary to recover the -available data in respect to one candidate. +We hold a state which tracks the currently ongoing recovery tasks. A `RecoveryTask` is a structure encapsulating all +network tasks needed in order to recover the available data in respect to a candidate. + +Each `RecoveryTask` has a collection of ordered recovery strategies to try. ```rust +/// Subsystem state. struct State { - /// Each recovery is implemented as an independent async task, and the handles only supply information about the result. - ongoing_recoveries: FuturesUnordered, - /// A recent block hash for which state should be available. - live_block_hash: Hash, - // An LRU cache of recently recovered data. - availability_lru: LruMap>, + /// Each recovery task is implemented as its own async task, + /// and these handles are for communicating with them. 
+ ongoing_recoveries: FuturesUnordered, + /// A recent block hash for which state should be available. + live_block: (BlockNumber, Hash), + /// An LRU cache of recently recovered data. + availability_lru: LruMap, + /// Cached runtime info. + runtime_info: RuntimeInfo, } -/// This is a future, which concludes either when a response is received from the recovery tasks, -/// or all the `awaiting` channels have closed. -struct RecoveryHandle { - candidate_hash: CandidateHash, - interaction_response: RemoteHandle, - awaiting: Vec>>, -} - -struct Unavailable; -struct Concluded(CandidateHash, Result); - -struct RecoveryTaskParams { - validator_authority_keys: Vec, - validators: Vec, - // The number of pieces needed. - threshold: usize, - candidate_hash: Hash, - erasure_root: Hash, +struct RecoveryParams { + /// Discovery ids of `validators`. + pub validator_authority_keys: Vec, + /// Number of validators. + pub n_validators: usize, + /// The number of regular chunks needed. + pub threshold: usize, + /// The number of systematic chunks needed. + pub systematic_threshold: usize, + /// A hash of the relevant candidate. + pub candidate_hash: CandidateHash, + /// The root of the erasure encoding of the candidate. + pub erasure_root: Hash, + /// Metrics to report. + pub metrics: Metrics, + /// Do not request data from availability-store. Useful for collators. + pub bypass_availability_store: bool, + /// The type of check to perform after available data was recovered. + pub post_recovery_check: PostRecoveryCheck, + /// The blake2-256 hash of the PoV. + pub pov_hash: Hash, + /// Protocol name for ChunkFetchingV1. + pub req_v1_protocol_name: ProtocolName, + /// Protocol name for ChunkFetchingV2. + pub req_v2_protocol_name: ProtocolName, + /// Whether or not chunk mapping is enabled. + pub chunk_mapping_enabled: bool, + /// Channel to the erasure task handler. 
+ pub erasure_task_tx: mpsc::Sender, } -enum RecoveryTask { - RequestFromBackers { - // a random shuffling of the validators from the backing group which indicates the order - // in which we connect to them and request the chunk. - shuffled_backers: Vec, - } - RequestChunksFromValidators { - // a random shuffling of the validators which indicates the order in which we connect to the validators and - // request the chunk from them. - shuffling: Vec, - received_chunks: Map, - requesting_chunks: FuturesUnordered>, - } +pub struct RecoveryTask { + sender: Sender, + params: RecoveryParams, + strategies: VecDeque>>, + state: task::State, } -struct RecoveryTask { - to_subsystems: SubsystemSender, - params: RecoveryTaskParams, - source: Source, +#[async_trait::async_trait] +/// Common trait for runnable recovery strategies. +pub trait RecoveryStrategy: Send { + /// Main entry point of the strategy. + async fn run( + mut self: Box, + state: &mut task::State, + sender: &mut Sender, + common_params: &RecoveryParams, + ) -> Result; + + /// Return the name of the strategy for logging purposes. + fn display_name(&self) -> &'static str; + + /// Return the strategy type for use as a metric label. + fn strategy_type(&self) -> &'static str; } ``` @@ -90,68 +114,71 @@ Ignore `BlockFinalized` signals. On `Conclude`, shut down the subsystem. -#### `AvailabilityRecoveryMessage::RecoverAvailableData(receipt, session, Option, response)` +#### `AvailabilityRecoveryMessage::RecoverAvailableData(...)` -1. Check the `availability_lru` for the candidate and return the data if so. -1. Check if there is already an recovery handle for the request. If so, add the response handle to it. +1. Check the `availability_lru` for the candidate and return the data if present. +1. Check if there is already a recovery handle for the request. If so, add the response handle to it. 1. 
Otherwise, load the session info for the given session under the state of `live_block_hash`, and initiate a recovery - task with *`launch_recovery_task`*. Add a recovery handle to the state and add the response channel to it. + task with `launch_recovery_task`. Add a recovery handle to the state and add the response channel to it. 1. If the session info is not available, return `RecoveryError::Unavailable` on the response channel. ### Recovery logic -#### `launch_recovery_task(session_index, session_info, candidate_receipt, candidate_hash, Option)` +#### `handle_recover(...) -> Result<()>` -1. Compute the threshold from the session info. It should be `f + 1`, where `n = 3f + k`, where `k in {1, 2, 3}`, and - `n` is the number of validators. -1. Set the various fields of `RecoveryParams` based on the validator lists in `session_info` and information about the - candidate. -1. If the `backing_group_index` is `Some`, start in the `RequestFromBackers` phase with a shuffling of the backing group - validator indices and a `None` requesting value. -1. Otherwise, start in the `RequestChunksFromValidators` source with `received_chunks`,`requesting_chunks`, and - `next_shuffling` all empty. -1. Set the `to_subsystems` sender to be equal to a clone of the `SubsystemContext`'s sender. -1. Initialize `received_chunks` to an empty set, as well as `requesting_chunks`. +Instantiate the appropriate `RecoveryStrategy`es, based on the subsystem configuration, params and session info. +Call `launch_recovery_task()`. -Launch the source as a background task running `run(recovery_task)`. +#### `launch_recovery_task(state, ctx, response_sender, recovery_strategies, params) -> Result<()>` -#### `run(recovery_task) -> Result` +Create the `RecoveryTask` and launch it as a background task running `recovery_task.run()`. -```rust -// How many parallel requests to have going at once. 
-const N_PARALLEL: usize = 50; -``` +#### `recovery_task.run(mut self) -> Result` + +* Loop: + * Pop a strategy from the queue. If none are left, return `RecoveryError::Unavailable`. + * Run the strategy. + * If the strategy returned successfully or returned `RecoveryError::Invalid`, break the loop. + +### Recovery strategies + +#### `FetchFull` + +This strategy tries requesting the full available data from the validators in the backing group to +which the node is already connected. They are tried one by one in a random order. +It is very performant if there's enough network bandwidth and the backing group is not overloaded. +The costly Reed-Solomon reconstruction is not needed. + +#### `FetchSystematicChunks` + +Very similar to `FetchChunks` below but requests from the validators that hold the systematic chunks, so that we avoid +Reed-Solomon reconstruction. Only possible if `node_features::FeatureIndex::AvailabilityChunkMapping` is enabled and +the `core_index` is supplied (currently only for recoveries triggered by approval voting). + +More info in +[RFC-47](https://github.com/polkadot-fellows/RFCs/blob/main/text/0047-assignment-of-availability-chunks.md). + +#### `FetchChunks` + +The least performant strategy but also the most comprehensive one. It's the only one that cannot fail under the +Byzantine threshold assumption, so it's always added as the last one in the `recovery_strategies` queue. + +Performs parallel chunk requests to validators. Once enough chunks have been received, do the reconstruction. +In the worst case, all validators will be tried. + +### Default recovery strategy configuration + +#### For validators + +If the estimated available data size is smaller than a configured constant (currently 1 MiB for Polkadot or 4 MiB for +other networks), try doing `FetchFull` first. +Next, if the preconditions described in `FetchSystematicChunks` above are met, try systematic recovery. +As a last resort, do `FetchChunks`.
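The validator ordering and the `recovery_task.run()` loop described above can be sketched roughly as follows. This is a simplified illustration, not the subsystem's code: `Strategy`, `Config`, `validator_strategies` and `run_strategies` are hypothetical names, the real strategies are boxed `RecoveryStrategy` trait objects that run asynchronously, and the backing-group lookup needed by `FetchFull` is left out.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-ins for the three strategies named in the guide.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Strategy {
    FetchFull,
    FetchSystematicChunks,
    FetchChunks,
}

#[derive(Debug, PartialEq, Eq)]
enum RecoveryError {
    Invalid,
    Unavailable,
}

/// Illustrative inputs only; the field names are assumptions.
struct Config {
    estimated_data_size: usize,
    /// Size limit below which `FetchFull` is tried first (e.g. 1 MiB or 4 MiB).
    fetch_full_threshold: usize,
    chunk_mapping_enabled: bool,
    core_index: Option<u32>,
}

/// Build the strategy queue in the order described under "For validators".
fn validator_strategies(cfg: &Config) -> VecDeque<Strategy> {
    let mut queue = VecDeque::new();
    if cfg.estimated_data_size < cfg.fetch_full_threshold {
        queue.push_back(Strategy::FetchFull);
    }
    // Systematic recovery needs both the chunk mapping feature and a core index.
    if cfg.chunk_mapping_enabled && cfg.core_index.is_some() {
        queue.push_back(Strategy::FetchSystematicChunks);
    }
    // Always the last resort.
    queue.push_back(Strategy::FetchChunks);
    queue
}

/// Pop strategies until one succeeds or reports invalid data.
/// `run_one` is a placeholder for actually executing a strategy.
fn run_strategies<F>(
    mut queue: VecDeque<Strategy>,
    mut run_one: F,
) -> Result<Vec<u8>, RecoveryError>
where
    F: FnMut(Strategy) -> Result<Vec<u8>, RecoveryError>,
{
    while let Some(strategy) = queue.pop_front() {
        match run_one(strategy) {
            Ok(data) => return Ok(data),
            // Invalid data is final: trying another strategy cannot fix it.
            Err(RecoveryError::Invalid) => return Err(RecoveryError::Invalid),
            // Otherwise fall through to the next strategy in the queue.
            Err(RecoveryError::Unavailable) => continue,
        }
    }
    Err(RecoveryError::Unavailable)
}
```

In this sketch a strategy that reports `Unavailable` simply hands over to the next one in the queue, which is why `FetchChunks` is always pushed last.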
+ +#### For collators + +Collators currently only use `FetchChunks`, as they only attempt recoveries in rare scenarios. -* Request `AvailabilityStoreMessage::QueryAvailableData`. If it exists, return that. -* If the task contains `RequestFromBackers` - * Loop: - * If the `requesting_pov` is `Some`, poll for updates on it. If it concludes, set `requesting_pov` to `None`. - * If the `requesting_pov` is `None`, take the next backer off the `shuffled_backers`. - * If the backer is `Some`, issue a `NetworkBridgeMessage::Requests` with a network request for the - `AvailableData` and wait for the response. - * If it concludes with a `None` result, return to beginning. - * If it concludes with available data, attempt a re-encoding. - * If it has the correct erasure-root, break and issue a `Ok(available_data)`. - * If it has an incorrect erasure-root, return to beginning. - * Send the result to each member of `awaiting`. - * If the backer is `None`, set the source to `RequestChunksFromValidators` with a random shuffling of validators - and empty `received_chunks`, and `requesting_chunks` and break the loop. - -* If the task contains `RequestChunksFromValidators`: - * Request `AvailabilityStoreMessage::QueryAllChunks`. For each chunk that exists, add it to `received_chunks` and - remote the validator from `shuffling`. - * Loop: - * If `received_chunks + requesting_chunks + shuffling` lengths are less than the threshold, break and return - `Err(Unavailable)`. - * Poll for new updates from `requesting_chunks`. Check merkle proofs of any received chunks. If the request simply - fails due to network issues, insert into the front of `shuffling` to be retried. - * If `received_chunks` has more than `threshold` entries, attempt to recover the data. - * If that fails, return `Err(RecoveryError::Invalid)` - * If correct: - * If re-encoding produces an incorrect erasure-root, break and issue a `Err(RecoveryError::Invalid)`. 
- * break and issue `Ok(available_data)` - * Send the result to each member of `awaiting`. - * While there are fewer than `N_PARALLEL` entries in `requesting_chunks`, - * Pop the next item from `shuffling`. If it's empty and `requesting_chunks` is empty, return - `Err(RecoveryError::Unavailable)`. - * Issue a `NetworkBridgeMessage::Requests` and wait for the response in `requesting_chunks`. +Moreover, the recovery task is specially configured to not attempt requesting data from the local availability-store +(because it doesn't exist) and to not re-encode the data after a successful recovery (because it's an expensive check +that is not needed; checking the pov_hash is enough for collators). diff --git a/polkadot/roadmap/implementers-guide/src/types/overseer-protocol.md b/polkadot/roadmap/implementers-guide/src/types/overseer-protocol.md index e011afb97089a..c82d89d2d8799 100644 --- a/polkadot/roadmap/implementers-guide/src/types/overseer-protocol.md +++ b/polkadot/roadmap/implementers-guide/src/types/overseer-protocol.md @@ -238,6 +238,9 @@ enum AvailabilityRecoveryMessage { CandidateReceipt, SessionIndex, Option, // Backing validator group to request the data directly from. + Option, /* A `CoreIndex` needs to be specified for the recovery process to + * prefer systematic chunk recovery. This is the core that the candidate + * was occupying while pending availability.
*/ ResponseChannel>, ), } diff --git a/polkadot/zombienet_tests/functional/0013-enable-node-feature.js b/polkadot/zombienet_tests/functional/0013-enable-node-feature.js new file mode 100644 index 0000000000000..5fe2e38dad7d4 --- /dev/null +++ b/polkadot/zombienet_tests/functional/0013-enable-node-feature.js @@ -0,0 +1,35 @@ +async function run(nodeName, networkInfo, index) { + const { wsUri, userDefinedTypes } = networkInfo.nodesByName[nodeName]; + const api = await zombie.connect(wsUri, userDefinedTypes); + + await zombie.util.cryptoWaitReady(); + + // account to submit tx + const keyring = new zombie.Keyring({ type: "sr25519" }); + const alice = keyring.addFromUri("//Alice"); + + await new Promise(async (resolve, reject) => { + const unsub = await api.tx.sudo + .sudo(api.tx.configuration.setNodeFeature(Number(index), true)) + .signAndSend(alice, ({ status, isError }) => { + if (status.isInBlock) { + console.log( + `Transaction included at blockhash ${status.asInBlock}`, + ); + } else if (status.isFinalized) { + console.log( + `Transaction finalized at blockHash ${status.asFinalized}`, + ); + unsub(); + return resolve(); + } else if (isError) { + console.log(`Transaction error`); + reject(`Transaction error`); + } + }); + }); + + return 0; +} + +module.exports = { run }; diff --git a/polkadot/zombienet_tests/functional/0013-systematic-chunk-recovery.toml b/polkadot/zombienet_tests/functional/0013-systematic-chunk-recovery.toml new file mode 100644 index 0000000000000..67925a3d3a7c6 --- /dev/null +++ b/polkadot/zombienet_tests/functional/0013-systematic-chunk-recovery.toml @@ -0,0 +1,46 @@ +[settings] +timeout = 1000 +bootnode = true + +[relaychain.genesis.runtimeGenesis.patch.configuration.config.scheduler_params] + max_validators_per_core = 2 + +[relaychain.genesis.runtimeGenesis.patch.configuration.config] + needed_approvals = 4 + +[relaychain] +default_image = "{{ZOMBIENET_INTEGRATION_TEST_IMAGE}}" +chain = "rococo-local" +default_command = "polkadot" + 
+[relaychain.default_resources] +limits = { memory = "4G", cpu = "2" } +requests = { memory = "2G", cpu = "1" } + + [[relaychain.nodes]] + name = "alice" + validator = "true" + + [[relaychain.node_groups]] + name = "validator" + count = 3 + args = ["-lparachain=debug,parachain::availability-recovery=trace,parachain::availability-distribution=trace"] + +{% for id in range(2000,2002) %} +[[parachains]] +id = {{id}} +addToGenesis = true +cumulus_based = true +chain = "glutton-westend-local-{{id}}" + [parachains.genesis.runtimeGenesis.patch.glutton] + compute = "50000000" + storage = "2500000000" + trashDataCount = 5120 + + [parachains.collator] + name = "collator" + image = "{{CUMULUS_IMAGE}}" + command = "polkadot-parachain" + args = ["-lparachain=debug"] + +{% endfor %} diff --git a/polkadot/zombienet_tests/functional/0013-systematic-chunk-recovery.zndsl b/polkadot/zombienet_tests/functional/0013-systematic-chunk-recovery.zndsl new file mode 100644 index 0000000000000..e9e5a429e2a2c --- /dev/null +++ b/polkadot/zombienet_tests/functional/0013-systematic-chunk-recovery.zndsl @@ -0,0 +1,43 @@ +Description: Systematic chunk recovery is used if the chunk mapping feature is enabled. +Network: ./0013-systematic-chunk-recovery.toml +Creds: config + +# Check authority status. +alice: reports node_roles is 4 +validator: reports node_roles is 4 + +# Ensure parachains are registered. +validator: parachain 2000 is registered within 60 seconds +validator: parachain 2001 is registered within 60 seconds + +# Ensure parachains made progress and approval checking works. 
+validator: parachain 2000 block height is at least 15 within 600 seconds +validator: parachain 2001 block height is at least 15 within 600 seconds + +validator: reports substrate_block_height{status="finalized"} is at least 30 within 400 seconds + +validator: reports polkadot_parachain_approval_checking_finality_lag < 3 + +validator: reports polkadot_parachain_approvals_no_shows_total < 3 within 100 seconds + +# Ensure we used regular chunk recovery and that there are no failed recoveries. +validator: count of log lines containing "Data recovery from chunks complete" is at least 10 within 300 seconds +validator: count of log lines containing "Data recovery from systematic chunks complete" is 0 within 10 seconds +validator: count of log lines containing "Data recovery from systematic chunks is not possible" is 0 within 10 seconds +validator: count of log lines containing "Data recovery from chunks is not possible" is 0 within 10 seconds +validator: reports polkadot_parachain_availability_recovery_recoveries_finished{result="failure"} is 0 within 10 seconds + +# Enable the chunk mapping feature +alice: js-script ./0013-enable-node-feature.js with "2" return is 0 within 600 seconds + +validator: reports substrate_block_height{status="finalized"} is at least 60 within 400 seconds + +validator: reports polkadot_parachain_approval_checking_finality_lag < 3 + +validator: reports polkadot_parachain_approvals_no_shows_total < 3 within 100 seconds + +# Ensure we used systematic chunk recovery and that there are no failed recoveries. 
+validator: count of log lines containing "Data recovery from systematic chunks complete" is at least 10 within 300 seconds +validator: count of log lines containing "Data recovery from systematic chunks is not possible" is 0 within 10 seconds +validator: count of log lines containing "Data recovery from chunks is not possible" is 0 within 10 seconds +validator: reports polkadot_parachain_availability_recovery_recoveries_finished{result="failure"} is 0 within 10 seconds diff --git a/polkadot/zombienet_tests/functional/0014-chunk-fetching-network-compatibility.toml b/polkadot/zombienet_tests/functional/0014-chunk-fetching-network-compatibility.toml new file mode 100644 index 0000000000000..881abab64fd07 --- /dev/null +++ b/polkadot/zombienet_tests/functional/0014-chunk-fetching-network-compatibility.toml @@ -0,0 +1,48 @@ +[settings] +timeout = 1000 +bootnode = true + +[relaychain.genesis.runtimeGenesis.patch.configuration.config.scheduler_params] + max_validators_per_core = 2 + +[relaychain.genesis.runtimeGenesis.patch.configuration.config] + needed_approvals = 4 + +[relaychain] +default_image = "{{ZOMBIENET_INTEGRATION_TEST_IMAGE}}" +chain = "rococo-local" +default_command = "polkadot" + +[relaychain.default_resources] +limits = { memory = "4G", cpu = "2" } +requests = { memory = "2G", cpu = "1" } + + [[relaychain.node_groups]] + # Use an image that doesn't speak /req_chunk/2 protocol. 
+ image = "{{POLKADOT_IMAGE}}:master-bde0bbe5" + name = "old" + count = 2 + args = ["-lparachain=debug,parachain::availability-recovery=trace,parachain::availability-distribution=trace"] + + [[relaychain.node_groups]] + name = "new" + count = 2 + args = ["-lparachain=debug,parachain::availability-recovery=trace,parachain::availability-distribution=trace,sub-libp2p=trace"] + +{% for id in range(2000,2002) %} +[[parachains]] +id = {{id}} +addToGenesis = true +cumulus_based = true +chain = "glutton-westend-local-{{id}}" + [parachains.genesis.runtimeGenesis.patch.glutton] + compute = "50000000" + storage = "2500000000" + trashDataCount = 5120 + + [parachains.collator] + name = "collator" + image = "{{CUMULUS_IMAGE}}" + args = ["-lparachain=debug"] + +{% endfor %} diff --git a/polkadot/zombienet_tests/functional/0014-chunk-fetching-network-compatibility.zndsl b/polkadot/zombienet_tests/functional/0014-chunk-fetching-network-compatibility.zndsl new file mode 100644 index 0000000000000..2ac5012db668d --- /dev/null +++ b/polkadot/zombienet_tests/functional/0014-chunk-fetching-network-compatibility.zndsl @@ -0,0 +1,53 @@ +Description: Validators preserve backwards compatibility with peers speaking an older version of the /req_chunk protocol +Network: ./0014-chunk-fetching-network-compatibility.toml +Creds: config + +# Check authority status. +new: reports node_roles is 4 +old: reports node_roles is 4 + +# Ensure parachains are registered. +new: parachain 2000 is registered within 60 seconds +old: parachain 2000 is registered within 60 seconds +old: parachain 2001 is registered within 60 seconds +new: parachain 2001 is registered within 60 seconds + +# Ensure parachains made progress and approval checking works. 
+new: parachain 2000 block height is at least 15 within 600 seconds +old: parachain 2000 block height is at least 15 within 600 seconds +new: parachain 2001 block height is at least 15 within 600 seconds +old: parachain 2001 block height is at least 15 within 600 seconds + +new: reports substrate_block_height{status="finalized"} is at least 30 within 400 seconds +old: reports substrate_block_height{status="finalized"} is at least 30 within 400 seconds + +new: reports polkadot_parachain_approval_checking_finality_lag < 3 +old: reports polkadot_parachain_approval_checking_finality_lag < 3 + +new: reports polkadot_parachain_approvals_no_shows_total < 3 within 10 seconds +old: reports polkadot_parachain_approvals_no_shows_total < 3 within 10 seconds + +# Ensure that there are no failed recoveries. +new: count of log lines containing "Data recovery from chunks complete" is at least 10 within 300 seconds +old: count of log lines containing "Data recovery from chunks complete" is at least 10 within 300 seconds +new: count of log lines containing "Data recovery from chunks is not possible" is 0 within 10 seconds +old: count of log lines containing "Data recovery from chunks is not possible" is 0 within 10 seconds +new: reports polkadot_parachain_availability_recovery_recoveries_finished{result="failure"} is 0 within 10 seconds +old: reports polkadot_parachain_availability_recovery_recoveries_finished{result="failure"} is 0 within 10 seconds + +# Ensure we used the fallback network request. +new: log line contains "Trying the fallback protocol" within 100 seconds + +# Ensure systematic recovery was not used. 
+old: count of log lines containing "Data recovery from systematic chunks complete" is 0 within 10 seconds +new: count of log lines containing "Data recovery from systematic chunks complete" is 0 within 10 seconds + +# Ensure availability-distribution worked fine +new: reports polkadot_parachain_fetched_chunks_total{success="succeeded"} is at least 10 within 400 seconds +old: reports polkadot_parachain_fetched_chunks_total{success="succeeded"} is at least 10 within 400 seconds + +new: reports polkadot_parachain_fetched_chunks_total{success="failed"} is 0 within 10 seconds +old: reports polkadot_parachain_fetched_chunks_total{success="failed"} is 0 within 10 seconds + +new: reports polkadot_parachain_fetched_chunks_total{success="not-found"} is 0 within 10 seconds +old: reports polkadot_parachain_fetched_chunks_total{success="not-found"} is 0 within 10 seconds diff --git a/prdoc/pr_1644.prdoc b/prdoc/pr_1644.prdoc new file mode 100644 index 0000000000000..cc43847fa09b2 --- /dev/null +++ b/prdoc/pr_1644.prdoc @@ -0,0 +1,59 @@ +title: Add availability-recovery from systematic chunks + +doc: + - audience: Node Operator + description: | + Implements https://github.com/polkadot-fellows/RFCs/pull/47. This optimisation is guarded by a configuration bit in + the runtime and will only be enabled once a supermajority of the validators have upgraded to this version. + It's strongly advised to upgrade to this version. + - audience: Node Dev + description: | + Implements https://github.com/polkadot-fellows/RFCs/pull/47 and adds the logic for availability recovery from systematic chunks. + The /req_chunk/1 req-response protocol is now considered deprecated in favour of /req_chunk/2. Systematic recovery is guarded + by a configuration bit in the runtime (bit with index 2 of the node_features field from the HostConfiguration) + and must not be enabled until all (or almost all) validators have upgraded to the node version that includes + this PR. 
+ +crates: + - name: sc-network + bump: minor + - name: polkadot-primitives + bump: minor + - name: cumulus-client-pov-recovery + bump: none + - name: polkadot-overseer + bump: none + - name: polkadot-node-primitives + bump: major + - name: polkadot-erasure-coding + bump: major + - name: polkadot-node-jaeger + bump: major + - name: polkadot-node-subsystem-types + bump: major + - name: polkadot-node-network-protocol + bump: major + - name: polkadot-service + bump: major + - name: polkadot-node-subsystem-util + bump: major + - name: polkadot-availability-distribution + bump: major + - name: polkadot-availability-recovery + bump: major + - name: polkadot-node-core-approval-voting + bump: minor + - name: polkadot-node-core-av-store + bump: major + - name: polkadot-network-bridge + bump: minor + - name: polkadot-node-core-backing + bump: none + - name: polkadot-node-core-bitfield-signing + bump: none + - name: polkadot-node-core-dispute-coordinator + bump: none + - name: cumulus-relay-chain-minimal-node + bump: minor + - name: polkadot + bump: minor diff --git a/substrate/client/network/src/service.rs b/substrate/client/network/src/service.rs index 1aaa63191a811..27de12bc1ec9a 100644 --- a/substrate/client/network/src/service.rs +++ b/substrate/client/network/src/service.rs @@ -592,7 +592,7 @@ where crate::MAX_CONNECTIONS_ESTABLISHED_INCOMING, )), ) - .substream_upgrade_protocol_override(upgrade::Version::V1Lazy) + .substream_upgrade_protocol_override(upgrade::Version::V1) .notify_handler_buffer_size(NonZeroUsize::new(32).expect("32 != 0; qed")) // NOTE: 24 is somewhat arbitrary and should be tuned in the future if necessary. // See