Retry approval on availability failure if the check is still needed #6807
Conversation
Signed-off-by: Alexandru Gheorghe <[email protected]>
This pull request has been mentioned on Polkadot Forum. There might be relevant details there: https://forum.polkadot.network/t/2025-11-25-kusama-parachains-spammening-aftermath/11108/1
Aside: JAM puts validators' IPs etc. on-chain. We could figure out whether on-chain is better? Ideally we'd want on-chain for equivocation defences anyways.
Can this somehow interfere with normal-mode operation?
Signed-off-by: Alexandru Gheorghe <[email protected]>
Signed-off-by: Alexandru Gheorghe <[email protected]>
LGTM
Alternatively, we could change `polkadot-sdk/polkadot/node/network/availability-recovery/src/task/strategy/full.rs` (line 89 in ca78179) to try to connect if we are disconnected. This doesn't have more granular retry functionality, but it is a much simpler change.
We could do this, but I doubt it will make a difference. If the strategy of fetching from backers fails, we fall back to chunk fetching. Alex is saying that:
If it were just an issue of not trying to connect, then the chunk recovery should have worked
Agree, we should explore this. This would also solve the problem with restarts and an empty DHT cache.
Approving it, but it feels like a workaround for an unreliable/slow authority discovery, which we should fix and then remove this workaround.
I did discuss with @lexnv speeding up authority-discovery by saving the cache to disk so it is available on restart. However, even with speedy authority discovery, I still think this would be a good thing to have for robustness: it is a fallback for when networking calls fail for various other reasons, and network calls are expected to fail every now and then.
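For illustration, a minimal sketch of the "persist the cache to disk" idea, assuming a simple text file keyed by authority id; this is not the actual authority-discovery code, and the types are placeholders:

```rust
use std::{collections::HashMap, fs, io, path::Path};

/// Authority identifier -> multiaddresses last seen for it (illustrative types).
type AddrCache = HashMap<String, Vec<String>>;

/// Persist the cache so it can be reused right after a restart, before the
/// DHT has been re-walked.
fn save_cache(path: &Path, cache: &AddrCache) -> io::Result<()> {
    let mut out = String::new();
    for (authority, addrs) in cache {
        out.push_str(authority);
        for addr in addrs {
            out.push(' ');
            out.push_str(addr);
        }
        out.push('\n');
    }
    fs::write(path, out)
}

/// Best-effort load; falls back to an empty cache (i.e. normal discovery)
/// if the file is missing or malformed.
fn load_cache(path: &Path) -> AddrCache {
    let Ok(contents) = fs::read_to_string(path) else {
        return AddrCache::new();
    };
    contents
        .lines()
        .filter_map(|line| {
            let mut parts = line.split_whitespace().map(str::to_owned);
            let authority = parts.next()?;
            Some((authority, parts.collect()))
        })
        .collect()
}
```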
LGTM! @ordian proposed an alternative solution which should live in AD, but I agree that this fix adds some robustness on top of AD.
Off the top of my head, a better place to implement PoV retry would be the availability-recovery subsystem. This would be superior in terms of latency (it's push vs pull/poll), as this subsystem could keep track of all the PoVs that failed to be retrieved and retry fetching chunks immediately, as soon as peers connect (or apply even more complex strategies, like speculative availability).
The only change needed in approval voting would be to notify av-recovery that it's no longer interested in some PoV.
WDYT?
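For illustration only, a rough sketch of what that push-based bookkeeping could look like on the availability-recovery side; the types and method names below are assumptions, not the actual polkadot-sdk subsystem messages:

```rust
use std::collections::HashSet;

/// Illustrative stand-in for the real candidate hash type.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct CandidateHash([u8; 32]);

#[derive(Default)]
struct PendingRecoveries {
    /// PoVs that failed to be recovered and should be retried.
    pending: HashSet<CandidateHash>,
}

impl PendingRecoveries {
    /// Called when a recovery attempt fails: remember the candidate for later.
    fn on_recovery_failed(&mut self, candidate: CandidateHash) {
        self.pending.insert(candidate);
    }

    /// Called by approval-voting when the candidate no longer needs checking
    /// (e.g. the block containing it was approved anyway).
    fn no_longer_interested(&mut self, candidate: &CandidateHash) {
        self.pending.remove(candidate);
    }

    /// Called when new peers connect ("push" instead of polling on a timer):
    /// drain everything still pending and retry it immediately.
    fn on_peers_connected(&mut self) -> Vec<CandidateHash> {
        self.pending.drain().collect()
    }
}
```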
core_index,
session_index,
attempts_remaining: retry.attempts_remaining - 1,
backoff: retry.backoff,
As time passes, the chances of connecting to enough peers increase. Wouldn't it make sense to decrease the back-off as the retry count increases? This would help approve candidates faster.
Reconnecting to peers is on the order of minutes, so we wouldn't gain much by reducing the backoff. Also, with backoffs you usually want to increase the delay as the number of attempts grows, because you don't want to end up in a situation where many failed attempts start stampeding, which makes things worse. However, we can't increase it here because of this: #6807 (comment), so I think 1 min is an acceptable compromise.
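A minimal sketch of the retry bookkeeping being discussed, with a constant (non-growing, non-shrinking) backoff and a bounded number of attempts; the names are illustrative and may differ from the PR's actual code:

```rust
use std::time::Duration;

#[derive(Clone, Debug)]
struct RetryState {
    attempts_remaining: u32,
    backoff: Duration,
}

impl RetryState {
    fn new(max_attempts: u32, backoff: Duration) -> Self {
        Self { attempts_remaining: max_attempts, backoff }
    }

    /// If any attempts are left, return the delay to wait before the next try
    /// together with the decremented state; otherwise give up.
    fn next_attempt(&self) -> Option<(Duration, RetryState)> {
        if self.attempts_remaining == 0 {
            return None;
        }
        let next = RetryState {
            attempts_remaining: self.attempts_remaining - 1,
            // Constant backoff: neither exponential growth nor shrinking.
            backoff: self.backoff,
        };
        Some((next.backoff, next))
    }
}
```

With something like a one-minute constant backoff, this scheduling reflects the compromise described above.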
I don't think this is particularly better. Latency of transmitting messages between subsystems should be negligible.
Yes, but why is this a bad thing? If PoV recovery fails, the dispute-coordinator currently gives up, and it would be more robust to use a retry mechanism.
My proposal doesn't change the concerns of av-recovery, but makes it more robust. It should be easy to pass the retry params in the …
Signed-off-by: Alexandru Gheorghe <[email protected]>
Alright, I looked a bit into this, and moving it into availability-recovery wouldn't be a straightforward task. The reason is that the implementers of …
I don't think that would be better, easier, or less risky to understand and implement than the currently proposed approach, but it would give us the benefit that disputes could also opt in to use it. Given that the current PR improves the situation, I would be inclined to keep the current approach rather than invest in getting this knob into availability-recovery. Let me know what you think!
All GitHub workflows were cancelled due to the failure of one of the required jobs.
The only major downside of the current approach is that on each retry we re-fetch the same chunks as in the previous attempt, so in the worst case we'd be fetching the PoV roughly 16 times, which increases the bandwidth cost.
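For illustration, a hypothetical way the re-fetching could be avoided: carry the chunks recovered so far across attempts, so a retry only asks for the validator indices that are still missing. The types below are assumptions, not the availability-recovery API:

```rust
use std::collections::HashMap;

type ValidatorIndex = u32;
type Chunk = Vec<u8>;

struct RecoveryAttempt {
    /// Chunks already fetched in previous attempts, keyed by validator index.
    fetched: HashMap<ValidatorIndex, Chunk>,
    /// How many chunks are needed to reconstruct the PoV.
    threshold: usize,
}

impl RecoveryAttempt {
    /// Validators we still need to ask on the next retry.
    fn still_missing(&self, all_validators: &[ValidatorIndex]) -> Vec<ValidatorIndex> {
        all_validators
            .iter()
            .copied()
            .filter(|v| !self.fetched.contains_key(v))
            .collect()
    }

    /// Whether enough chunks have been collected to attempt reconstruction.
    fn can_reconstruct(&self) -> bool {
        self.fetched.len() >= self.threshold
    }
}
```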
We could modify the … I'm not sure, however, if this would end up being easier to implement or reason about. Btw, we're doing a similar thing in collators for pov-recovery.
Another reason why we can't isolate this logic just in availability-recovery is this: …
Given that I would want to have this retry in approval-voting sooner rather than later, I'm also in favour of filing an issue for this generic mechanism and merging the PR, considering the downsides are not that critical in my opinion.
Signed-off-by: Alexandru Gheorghe <[email protected]>
Sounds good, let's go 🚀
Signed-off-by: Alexandru Gheorghe <[email protected]>
LGTM! 👍
Created backport PR for `stable2407`.
Please cherry-pick the changes locally and resolve any conflicts.
git fetch origin backport-6807-to-stable2407
git worktree add --checkout .worktree/backport-6807-to-stable2407 backport-6807-to-stable2407
cd .worktree/backport-6807-to-stable2407
git reset --hard HEAD^
git cherry-pick -x 6878ba1f399b628cf456ad3abfe72f2553422e1f
git push --force-with-lease
Created backport PR for `stable2409`.
Please cherry-pick the changes locally and resolve any conflicts.
git fetch origin backport-6807-to-stable2409
git worktree add --checkout .worktree/backport-6807-to-stable2409 backport-6807-to-stable2409
cd .worktree/backport-6807-to-stable2409
git reset --hard HEAD^
git cherry-pick -x 6878ba1f399b628cf456ad3abfe72f2553422e1f
git push --force-with-lease
Successfully created backport PR for
…6807) Recovering the PoV can fail in situations where the node has just restarted and the DHT topology hasn't been fully discovered yet, so the current node can't connect to most of its peers. This is bad because gossiping the assignment requires being connected to only a few peers, so the assignment is distributed, but since we can't approve the candidate, other nodes will see this as a no-show. This becomes worse in the scenario where a lot of nodes restart at the same time: you end up with a lot of no-shows in the network that are never covered, and in that case it makes sense for nodes to retry approving the candidate at a later point in time, and to retry several times, if the block containing the candidate wasn't approved. ## TODO - [x] Add a subsystem test. Signed-off-by: Alexandru Gheorghe <[email protected]> (cherry picked from commit 6878ba1)
Backport #6807 into `stable2409` from alexggh. See the [documentation](https://github.com/paritytech/polkadot-sdk/blob/master/docs/BACKPORT.md) on how to use this bot. Signed-off-by: Alexandru Gheorghe <[email protected]> Co-authored-by: Alexandru Gheorghe <[email protected]> Co-authored-by: Alexandru Gheorghe <[email protected]>
Backport #6807 into `stable2412` from alexggh. See the [documentation](https://github.com/paritytech/polkadot-sdk/blob/master/docs/BACKPORT.md) on how to use this bot. Co-authored-by: Alexandru Gheorghe <[email protected]>
Backport #6807 into `stable2407` from alexggh. See the [documentation](https://github.com/paritytech/polkadot-sdk/blob/master/docs/BACKPORT.md) on how to use this bot. Signed-off-by: Alexandru Gheorghe <[email protected]> Co-authored-by: Alexandru Gheorghe <[email protected]> Co-authored-by: Alexandru Gheorghe <[email protected]>
* master: (33 commits)
  Implement `pallet-asset-rewards` (#3926)
  [pallet-revive] Add host function `to_account_id` (#7091)
  [pallet-revive] Remove revive events (#7164)
  [pallet-revive] Remove debug buffer (#7163)
  litep2p: Provide partial results to speedup GetRecord queries (#7099)
  [pallet-revive] Bump asset-hub westend spec version (#7176)
  Remove 0 as a special case in gas/storage meters (#6890)
  [pallet-revive] Fix `caller_is_root` return value (#7086)
  req-resp/litep2p: Reject inbound requests from banned peers (#7158)
  Add "run to block" tools (#7109)
  Fix reversed error message in DispatchInfo (#7170)
  approval-voting: Make importing of duplicate assignment idempotent (#6971)
  Parachains: Use relay chain slot for velocity measurement (#6825)
  PRDOC: Document `validate: false` (#7117)
  xcm: convert properly assets in xcmpayment apis (#7134)
  CI: Only format umbrella crate during umbrella check (#7139)
  approval-voting: Fix sending of assignments after restart (#6973)
  Retry approval on availability failure if the check is still needed (#6807)
  [pallet-revive-eth-rpc] persist eth transaction hash (#6836)
  litep2p: Sufix litep2p to the identify agent version for visibility (#7133)
  ...