Retry approval on availability failure if the check is still needed #6807
Conversation
Signed-off-by: Alexandru Gheorghe <[email protected]>
This pull request has been mentioned on Polkadot Forum. There might be relevant details there: https://forum.polkadot.network/t/2025-11-25-kusama-parachains-spammening-aftermath/11108/1
Aside: JAM puts validators' IPs etc. on-chain. We could figure out whether on-chain is better? Ideally we'd want on-chain for equivocation defences anyways.
Can this somehow interfere with normal-mode operation?
Signed-off-by: Alexandru Gheorghe <[email protected]>
Signed-off-by: Alexandru Gheorghe <[email protected]>
LGTM
Alternatively, we could change `polkadot-sdk/polkadot/node/network/availability-recovery/src/task/strategy/full.rs` (line 89 in ca78179) to try to connect if we are disconnected. This doesn't have more granular retry functionality, but it is a much simpler change.
We could do this, but I doubt it will make a difference. If the strategy of fetching from backers fails, we fall back to chunk fetching. Alex is saying that:
If it were just an issue of not trying to connect, then the chunk recovery should have worked
Agree, we should explore this. This would also solve the problem with restarts and an empty DHT cache.
Approving it, but it feels like a workaround for an unreliable/slow authority discovery, which we should fix and then remove this workaround.
I did discuss with @lexnv speeding up authority-discovery by saving the cache to disk so it is available on restart. However, even with speedy authority discovery, I still think this would be a good thing to have for robustness: it is a fallback for when networking calls fail for various other reasons, and network calls are expected to fail every now and then.
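For illustration, a minimal sketch of the "persist the cache to disk" idea, assuming a simple text file keyed by authority id; this is not the actual authority-discovery code, and the types are placeholders:

```rust
use std::{collections::HashMap, fs, io, path::Path};

/// Authority identifier -> multiaddresses last seen for it (illustrative types).
type AddrCache = HashMap<String, Vec<String>>;

/// Persist the cache so it can be reused right after a restart, before the
/// DHT has been re-walked.
fn save_cache(path: &Path, cache: &AddrCache) -> io::Result<()> {
    let mut out = String::new();
    for (authority, addrs) in cache {
        out.push_str(authority);
        for addr in addrs {
            out.push(' ');
            out.push_str(addr);
        }
        out.push('\n');
    }
    fs::write(path, out)
}

/// Best-effort load; falls back to an empty cache (i.e. normal discovery)
/// if the file is missing or malformed.
fn load_cache(path: &Path) -> AddrCache {
    let Ok(contents) = fs::read_to_string(path) else {
        return AddrCache::new();
    };
    contents
        .lines()
        .filter_map(|line| {
            let mut parts = line.split_whitespace().map(str::to_owned);
            let authority = parts.next()?;
            Some((authority, parts.collect()))
        })
        .collect()
}
```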
LGTM! @ordian proposed an alternative solution which should live in AD, but I agree that this fix adds some robustness on top of AD.
Off the top of my head, a better place to implement PoV retry would be the availability-recovery subsystem. This would be superior in terms of latency (it's push vs pull/poll), as this subsystem could keep track of all the PoVs that failed to be retrieved and retry fetching chunks immediately, as soon as peers connect (or apply even more complex strategies, like speculative availability).
The only change needed in approval voting would be to notify av-recovery that it's no longer interested in some PoV.
WDYT?
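For illustration only, a rough sketch of what that push-based bookkeeping could look like on the availability-recovery side; the types and method names below are assumptions, not the actual polkadot-sdk subsystem messages:

```rust
use std::collections::HashSet;

/// Illustrative stand-in for the real candidate hash type.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct CandidateHash([u8; 32]);

#[derive(Default)]
struct PendingRecoveries {
    /// PoVs that failed to be recovered and should be retried.
    pending: HashSet<CandidateHash>,
}

impl PendingRecoveries {
    /// Called when a recovery attempt fails: remember the candidate for later.
    fn on_recovery_failed(&mut self, candidate: CandidateHash) {
        self.pending.insert(candidate);
    }

    /// Called by approval-voting when the candidate no longer needs checking
    /// (e.g. the block containing it was approved anyway).
    fn no_longer_interested(&mut self, candidate: &CandidateHash) {
        self.pending.remove(candidate);
    }

    /// Called when new peers connect ("push" instead of polling on a timer):
    /// drain everything still pending and retry it immediately.
    fn on_peers_connected(&mut self) -> Vec<CandidateHash> {
        self.pending.drain().collect()
    }
}
```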
core_index,
session_index,
attempts_remaining: retry.attempts_remaining - 1,
backoff: retry.backoff,
As time passes, the chances of connecting to enough peers increase. Wouldn't it make sense to decrease the back-off as the retry count increases? This would help approve candidates faster.
Reconnecting to peers is on the order of minutes, so we wouldn't gain much by reducing the backoff. Also, with backoffs you usually want to increase the delay as the number of attempts grows, because you don't want to end up in a situation where many failed attempts start stampeding, which makes things worse. However, we can't increase it here because of this: #6807 (comment), so I think 1 min is an acceptable compromise.
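A minimal sketch of the retry bookkeeping being discussed, with a constant (non-growing, non-shrinking) backoff and a bounded number of attempts; the names are illustrative and may differ from the PR's actual code:

```rust
use std::time::Duration;

#[derive(Clone, Debug)]
struct RetryState {
    attempts_remaining: u32,
    backoff: Duration,
}

impl RetryState {
    fn new(max_attempts: u32, backoff: Duration) -> Self {
        Self { attempts_remaining: max_attempts, backoff }
    }

    /// If any attempts are left, return the delay to wait before the next try
    /// together with the decremented state; otherwise give up.
    fn next_attempt(&self) -> Option<(Duration, RetryState)> {
        if self.attempts_remaining == 0 {
            return None;
        }
        let next = RetryState {
            attempts_remaining: self.attempts_remaining - 1,
            // Constant backoff: neither exponential growth nor shrinking.
            backoff: self.backoff,
        };
        Some((next.backoff, next))
    }
}
```

With something like a one-minute constant backoff, this scheduling reflects the compromise described above.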
I don't think this is particularly better. Latency of transmitting messages between subsystems should be negligible.
Yes, but why is this a bad thing? If PoV recovery fails, the dispute-coordinator currently gives up, and it would be more robust to use a retry mechanism.
My proposal doesn't change the concerns of av-recovery, but makes it more robust. It should be easy to pass the retry params in the …
Signed-off-by: Alexandru Gheorghe <[email protected]>
Alright, I looked a bit into this, and moving it into availability-recovery wouldn't be a straightforward task. The reason is that the implementers of …
I don't think that would be better, easier, or less risky to understand and implement than the currently proposed approach, but it would give us the benefit that disputes could also opt in to use it. Given that the current PR improves the situation, I would be inclined to keep the current approach rather than invest in getting this knob into availability-recovery. Let me know what you think!
All GitHub workflows were cancelled due to the failure of one of the required jobs.
The only major downside of the current approach is that on each retry we re-fetch the same chunks as in the previous attempt, so in the worst case we'd be fetching the PoV roughly 16 times, which increases the bandwidth cost.
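For illustration, a hypothetical way the re-fetching could be avoided: carry the chunks recovered so far across attempts, so a retry only asks for the validator indices that are still missing. The types below are assumptions, not the availability-recovery API:

```rust
use std::collections::HashMap;

type ValidatorIndex = u32;
type Chunk = Vec<u8>;

struct RecoveryAttempt {
    /// Chunks already fetched in previous attempts, keyed by validator index.
    fetched: HashMap<ValidatorIndex, Chunk>,
    /// How many chunks are needed to reconstruct the PoV.
    threshold: usize,
}

impl RecoveryAttempt {
    /// Validators we still need to ask on the next retry.
    fn still_missing(&self, all_validators: &[ValidatorIndex]) -> Vec<ValidatorIndex> {
        all_validators
            .iter()
            .copied()
            .filter(|v| !self.fetched.contains_key(v))
            .collect()
    }

    /// Whether enough chunks have been collected to attempt reconstruction.
    fn can_reconstruct(&self) -> bool {
        self.fetched.len() >= self.threshold
    }
}
```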
We could modify the … I'm not sure, however, if this would end up being easier to implement or reason about. Btw, we're doing a similar thing in collators for pov-recovery.
Another reason why we can't isolate this logic just in availability-recovery is this: …
Given that I would want to have this retry in approval-voting sooner rather than later, I'm also in favour of filing an issue for this generic mechanism and merging the PR, considering the downsides are not that critical in my opinion.
Signed-off-by: Alexandru Gheorghe <[email protected]>
Sounds good, let's go 🚀
Signed-off-by: Alexandru Gheorghe <[email protected]>
LGTM! 👍
Created backport PR for `stable2407`.
Please cherry-pick the changes locally and resolve any conflicts.
git fetch origin backport-6807-to-stable2407
git worktree add --checkout .worktree/backport-6807-to-stable2407 backport-6807-to-stable2407
cd .worktree/backport-6807-to-stable2407
git reset --hard HEAD^
git cherry-pick -x 6878ba1f399b628cf456ad3abfe72f2553422e1f
git push --force-with-lease
Created backport PR for `stable2409`.
Please cherry-pick the changes locally and resolve any conflicts.
git fetch origin backport-6807-to-stable2409
git worktree add --checkout .worktree/backport-6807-to-stable2409 backport-6807-to-stable2409
cd .worktree/backport-6807-to-stable2409
git reset --hard HEAD^
git cherry-pick -x 6878ba1f399b628cf456ad3abfe72f2553422e1f
git push --force-with-lease
Successfully created backport PR for
…6807) Recovering the PoV can fail in situations where the node has just restarted and the DHT topology hasn't been fully discovered yet, so the current node can't connect to most of its peers. This is bad because gossiping the assignment requires being connected to only a few peers, so the assignment is distributed, but since we can't approve the candidate, other nodes will see this as a no-show. This becomes worse in the scenario where a lot of nodes restart at the same time: you end up with a lot of no-shows in the network that are never covered, and in that case it makes sense for nodes to retry approving the candidate at a later point in time, and to retry several times, if the block containing the candidate wasn't approved. ## TODO - [x] Add a subsystem test. Signed-off-by: Alexandru Gheorghe <[email protected]> (cherry picked from commit 6878ba1)
Backport #6807 into `stable2409` from alexggh. See the [documentation](https://github.com/paritytech/polkadot-sdk/blob/master/docs/BACKPORT.md) on how to use this bot. Signed-off-by: Alexandru Gheorghe <[email protected]> Co-authored-by: Alexandru Gheorghe <[email protected]> Co-authored-by: Alexandru Gheorghe <[email protected]>
Backport #6807 into `stable2412` from alexggh. See the [documentation](https://github.com/paritytech/polkadot-sdk/blob/master/docs/BACKPORT.md) on how to use this bot. Co-authored-by: Alexandru Gheorghe <[email protected]>
Backport #6807 into `stable2407` from alexggh. See the [documentation](https://github.com/paritytech/polkadot-sdk/blob/master/docs/BACKPORT.md) on how to use this bot. Signed-off-by: Alexandru Gheorghe <[email protected]> Co-authored-by: Alexandru Gheorghe <[email protected]> Co-authored-by: Alexandru Gheorghe <[email protected]>
* master: (33 commits)
  Implement `pallet-asset-rewards` (#3926)
  [pallet-revive] Add host function `to_account_id` (#7091)
  [pallet-revive] Remove revive events (#7164)
  [pallet-revive] Remove debug buffer (#7163)
  litep2p: Provide partial results to speedup GetRecord queries (#7099)
  [pallet-revive] Bump asset-hub westend spec version (#7176)
  Remove 0 as a special case in gas/storage meters (#6890)
  [pallet-revive] Fix `caller_is_root` return value (#7086)
  req-resp/litep2p: Reject inbound requests from banned peers (#7158)
  Add "run to block" tools (#7109)
  Fix reversed error message in DispatchInfo (#7170)
  approval-voting: Make importing of duplicate assignment idempotent (#6971)
  Parachains: Use relay chain slot for velocity measurement (#6825)
  PRDOC: Document `validate: false` (#7117)
  xcm: convert properly assets in xcmpayment apis (#7134)
  CI: Only format umbrella crate during umbrella check (#7139)
  approval-voting: Fix sending of assignments after restart (#6973)
  Retry approval on availability failure if the check is still needed (#6807)
  [pallet-revive-eth-rpc] persist eth transaction hash (#6836)
  litep2p: Sufix litep2p to the identify agent version for visibility (#7133)
  ...