PoV reclaim underestimates proof size #5229

s0me0ne-unkn0wn · 2024-08-04T09:25:35Z

The PoV reclaim signed extension calculates the actual storage proof size as a difference between the recorded proof sizes in the extension's post_dispatch and pre_dispatch calls. Then, the delta between the real storage proof size and benchmarked size is calculated, and the runtime-registered weight is corrected by that delta.

The problem is that after Executive calls into the Applyable::apply but before the PoV reclaim's pre_dispatch, many things happen that also affect the recorded proof size. Namely, pre_dispatch hooks of other signed extensions get called.

The CheckNonce extension, which updates the nonce and stores the account data in the Account map, had the most significant influence on the proof size I observed. Other things that are not so influential but worth mentioning are the writes to AllExtrinsicLen and BlockWeight itself by the CheckWeight extension.

The PoV reclaim, not being aware of those storage proof size increments, decreases proof size in the runtime too much. To give a concrete example, for a transaction that actually consumed 3769 bytes of proof (measured as a difference between recorded proof sizes before and after calling into apply), the runtime incremented proof size by a benchmarked value of 3731 bytes, and then 1758 of them were reclaimed by the PoV reclaim, resulting in 1973 bytes of proof increase being registered by the runtime. So 3769 was registered by the recorder and 1973 by the runtime, a huge difference.

The result is the wrong proof size registered by the runtime, which may, above all else, be a security problem.

The following graph shows how proof size increases as registered by the node and by the runtime with increase of the number of transactions.

CC @skunert @bkchr @eskimor

The text was updated successfully, but these errors were encountered:

bkchr · 2024-08-04T10:20:52Z

Nice find!

https://github.com/paritytech/polkadot-sdk/blob/master/cumulus%2Fprimitives%2Fstorage-weight-reclaim%2Fsrc%2Flib.rs#L193

Here we need to add currenr.proof_size = max(current.proof_size, post_dispatch_proof_size). This should prevent this issue.

s0me0ne-unkn0wn · 2024-08-04T10:30:44Z

I'm also wondering if just moving the StorageWeightReclaim to the top of the extensions list is the mitigation. After some tests, it seems it is. I see an opposite bias, though, but not that big, and it's safer to be biased in that direction anyway. Namely, a random block stops authoring having 3_510_188 recorded at node and 3_929_255 registered by the runtime. Most probably comes from some storage operations in the post_dispatch of other extensions.

s0me0ne-unkn0wn · 2024-08-04T10:40:12Z

Here we need to add currenr.proof_size = max(current.proof_size, post_dispatch_proof_size). This should prevent this issue.

Hmmm, so you propose to just stick to the recorded weight, effectively discarding runtime's own calculations? Interesting, I wonder if it would interfere with the unspent weight which is refunded by CheckWeight later (just thinking aloud, I don't know how unspent works and I never seen it being non-zero).

skunert · 2024-08-04T11:42:54Z

Here we need to add currenr.proof_size = max(current.proof_size, post_dispatch_proof_size). This should prevent this issue.

This is a good way to alleviate the issue. However we should add the BlockLength that has been recorded until now too.

The only problem of this patch is that this is that if we run into this case we lose track of the weight that has been reserved for on_finalize. This should for sure be integrated as a safe guard, but its not the end of this story.

I'm also wondering if just moving the StorageWeightReclaim to the top of the extensions list is the mitigation. After some tests, it seems it is. I see an opposite bias, though, but not that big, and it's safer to be biased in that direction anyway. Namely, a random block stops authoring having 3_510_188 recorded at node and 3_929_255 registered by the runtime. Most probably comes from some storage operations in the post_dispatch of other extensions.

While this is true for the specific case we were testing, I don't think its safe to make this general assumption that post_dispatch of SignedExtensions consumes less proof size.

so you propose to just stick to the recorded weight

Only for the case where we run into the condition though. For the general case we should not just set the runtime storage item to the proof size, because on_initialize returns the weight for itself plus the weight for on_finalize.

To fix this problem properly I can currently think of two ways:

Integrate the reclaim into frame-executive, making sure that the proof size respects the whole execution, including all SignedExtensions. I don't think much of this approach because the reclaim is parachain specific and optional.
Add an option for SignedExtensions that act as a "wrapper". So that their pre_dispatch is executed first and their post_dispatch is executed last. This would require some research on what would be the best way to implement.

s0me0ne-unkn0wn · 2024-08-04T12:16:26Z

Add an option for SignedExtensions that act as a "wrapper". So that their pre_dispatch is executed first and their post_dispatch is executed last. This would require some research on what would be the best way to implement.

One more option is to have two paired extensions, one implementing pre_dispatch and residing on the top of the list and another one with post_dispatch at the bottom. That design has the advantage of not affecting the current architecture in any way, but I cannot think out of my head of any means for them to share the state. AFAIK, signed extensions are not supposed to talk to each other.

bkchr · 2024-08-04T14:16:45Z

Add an option for SignedExtensions that act as a "wrapper". So that their pre_dispatch is executed first and their post_dispatch is executed last. This would require some research on what would be the best way to implement.

No, this is not what I meant. What I meant is that storage reclaim wraps all other extensions. This way it has control over pre and post dispatch.

BTW, while thinking about this. The costs for the extensions should be included in the base extrinsic weight(cc @ggwpez). Maybe this wasn't updated for the chain you tested this with @s0me0ne-unkn0wn? Or maybe that is just not added to the extrinsic weight directly.
In the future we will have each extension being benchmarked separately, which should help here as well.

gui1117 · 2024-08-05T07:14:09Z

Indeed signed extensions weight should be taken into account by base extrinsic weight in theory.

For future transaction extensions with weight we can do wrapping extension or 2 extensions. Both are doable and fine to me. I draft an implementation of wrapping extension here #5234

skunert · 2024-08-05T08:30:38Z

Indeed signed extensions weight should be taken into account by base extrinsic weight in theory.

Do we assume a number of accounts for this? Afaik @s0me0ne-unkn0wn was filling the state with tons of accounts for these tests, maybe recalculation of this weight would be necessary?

EDIT:

Only in theory, seems like base extrinsic weight is

pub const ExtrinsicBaseWeight: Weight = Weight::from_parts(constants::WEIGHT_REF_TIME_PER_NANOS, 0)
		.saturating_mul(125_000);

everywhere.

s0me0ne-unkn0wn · 2024-08-05T08:46:08Z

Yes, indeed, this test is done with 256k accounts in genesis to check, among other things, how the state size affects throughput. As far as I can see, the rococo-parachain used in this test uses no-op extrinsic base weight from frame-support with a proof size of zero. But nevertheless, having it as a constant doesn't look right to me. It means we're not ready to scale the network. Should every parachain hold a referendum to update that constant every time the number of existing accounts grows significantly? And somebody should notice that increase, make the change, and propose a referendum. Sounds like manual-mode scaling. I believe it should be much more automagical.

s0me0ne-unkn0wn · 2024-08-05T08:51:57Z

@gui1117 oh, I absolutely love how you first say "yep, it's doable" and then pull a 1200-line PR out of your pocket 🤣

skunert · 2024-08-05T08:56:48Z

Yes, indeed, this test is done with 256k accounts in genesis to check, among other things, how the state size affects throughput. As far as I can see, the rococo-parachain used in this test uses no-op extrinsic base weight from frame-support with a proof size of zero. But nevertheless, having it as a constant doesn't look right to me. It means we're not ready to scale the network. Should every parachain hold a referendum to update that constant every time the number of existing accounts grows significantly? And somebody should notice that increase, make the change, and propose a referendum. Sounds like manual-mode scaling. I believe it should be much more automagical.

Absolutely should be automatic, just trying to identify the involved parts. I think the wrapper draft will be the way to go.

gui1117 · 2024-08-05T11:05:13Z

@gui1117 oh, I absolutely love how you first say "yep, it's doable" and then pull a 1200-line PR out of your pocket 🤣

I was already working on it coincidentally for transaction extensions PR :-)

kianenigma · 2024-08-06T16:12:00Z

very interesting discovery @s0me0ne-unkn0wn!

Prelude: let's take a moment and appreciate in advance that with PVM, someday, we can get rid of this.

If the solution of this lies in the order of signed extension, or somehow them being wrapped, I have proposed in general moving some signed extensions that we think are "mandatory" to a fixed list, and using it in templates. This will be a strong signal from us that we think this is how one should configure their runtime.

polkadot-sdk/substrate/frame/src/lib.rs

Lines 293 to 304 in 11fdff1

    
           		/// Default set of signed extensions exposed from the `frame_system`. 
        
           		/// 
        
           		/// crucially, this does NOT contain any tx-payment extension. 
        
           		pub type SystemSignedExtensionsOf<T> = ( 
        
           			frame_system::CheckNonZeroSender<T>, 
        
           			frame_system::CheckSpecVersion<T>, 
        
           			frame_system::CheckTxVersion<T>, 
        
           			frame_system::CheckGenesis<T>, 
        
           			frame_system::CheckEra<T>, 
        
           			frame_system::CheckNonce<T>, 
        
           			frame_system::CheckWeight<T>, 
        
           		);

to scale the network. Should every parachain hold a referendum to update that constant every time the number of existing accounts grows significantly?

Yes, this is the vision, but it is a pretty cumbersome process as you can imagine, and worst of all it is not well documented.

Truth is, benchmarking is a race that we are losing right now, or we are barely holding onto. As in, even in RC, I believe the state is bigger (don't quote me on this) than our "assumption" of benchmarking. As noted in #5223 (comment), I think moving towards PVM is hopefully the right direction.

There is another issue with increasing the assumption as well: That you pessimistically reduce throughput, esp. if the benchmarking code is written poorly, and there is not enough manual refunds in the code. With a good PoV reclaim we can be a bit bolder here and confidently increase the static weights, as they will get reclaimed.

As I commented in

gui1117 · 2024-08-07T11:33:08Z

There might be a bug in reclaim which explains this situation: #5273

basically proof size is understimated if pre dispatch info overestimate the proof size and post dispatch info underestimage the proof size at the same time.

ggwpez · 2024-08-07T16:30:31Z

The extrinsic base weight benchmarking is indeed not looking at the PoV weight. The logic currently is to build an empty block and measure that. The problem is that the block is build in the client - not the runtime. Therefore it is not compatible with the omni-bencher.

But it could be possible to extend it at least for the polkadot non-omni node.

In the future we will have each extension being benchmarked separately, which should help here as well.

Yea the TE merge request adds this, then we could just add that here:

polkadot-sdk/bridges/chains/chain-bridge-hub-cumulus/src/lib.rs

Line 93 in a633e95

weights.base_extrinsic = ExtrinsicBaseWeight::get();

ggwpez · 2024-08-08T10:35:06Z

I changed the overhead benchmarking to also include the proof when building the block until we have the TE benchmarks: #5283

It could help to ensure that we dont underestimate the proof size upfront.

@skunert

The code do reduce or increase the weight by comparing `benchmarked_weight` and `consumed_weight`. But `benchmarked_weight` is the pre dispatch weight. not the post dispatch weight that is actually written into the block weight by `CheckWeight`. So in case the consumed weight was: `pre dispatch weight > consumed weight > post dispatch weight` then the reclaim code was reducing the block weight instead of increasing it. Might explain this issue even better #5229 @skunert @s0me0ne-unkn0wn

I propose to move PoV-reclaim to the next release. It can happen that the amount to reclaim is overestimated paritytech/polkadot-sdk#5229. Improvements (paritytech/polkadot-sdk#5281, paritytech/polkadot-sdk#5273) and a proper fix (paritytech/polkadot-sdk#5234) are in the works, but should be tested again. I expect the chances that the system-chains run into this scenario to be low, but it does not look like activity is so high that the inclusion of reclaim is urgent. --------- Co-authored-by: Bastian Köcher <[email protected]>

@gui1117

This PR adds an additional defensive check to the reclaim SE. Since it can happen that we miss some storage accesses on other SEs pre-dispatch, we should double check that the bookkeeping of the runtime stays ahead of the node-side pov-size. If we discover a mismatch and the node-side pov-size is indeed higher, we should set the runtime bookkeeping to the node-side value. In cases such as #5229, we would stop including extrinsics and not run `on_idle` at least. cc @gui1117 --------- Co-authored-by: command-bot <>

@skunert

The code do reduce or increase the weight by comparing `benchmarked_weight` and `consumed_weight`. But `benchmarked_weight` is the pre dispatch weight. not the post dispatch weight that is actually written into the block weight by `CheckWeight`. So in case the consumed weight was: `pre dispatch weight > consumed weight > post dispatch weight` then the reclaim code was reducing the block weight instead of increasing it. Might explain this issue even better #5229 @skunert @s0me0ne-unkn0wn (cherry picked from commit 862860e)

@gui1117

This PR adds an additional defensive check to the reclaim SE. Since it can happen that we miss some storage accesses on other SEs pre-dispatch, we should double check that the bookkeeping of the runtime stays ahead of the node-side pov-size. If we discover a mismatch and the node-side pov-size is indeed higher, we should set the runtime bookkeeping to the node-side value. In cases such as #5229, we would stop including extrinsics and not run `on_idle` at least. cc @gui1117 --------- Co-authored-by: command-bot <> (cherry picked from commit 055eb53)

@skunert

The code do reduce or increase the weight by comparing `benchmarked_weight` and `consumed_weight`. But `benchmarked_weight` is the pre dispatch weight. not the post dispatch weight that is actually written into the block weight by `CheckWeight`. So in case the consumed weight was: `pre dispatch weight > consumed weight > post dispatch weight` then the reclaim code was reducing the block weight instead of increasing it. Might explain this issue even better #5229 @skunert @s0me0ne-unkn0wn (cherry picked from commit 862860e)

@gui1117

This PR adds an additional defensive check to the reclaim SE. Since it can happen that we miss some storage accesses on other SEs pre-dispatch, we should double check that the bookkeeping of the runtime stays ahead of the node-side pov-size. If we discover a mismatch and the node-side pov-size is indeed higher, we should set the runtime bookkeeping to the node-side value. In cases such as #5229, we would stop including extrinsics and not run `on_idle` at least. cc @gui1117 --------- Co-authored-by: command-bot <> (cherry picked from commit 055eb53)

skunert · 2024-08-16T15:38:48Z

Small write-up for future reference and for others who want to reason about this bug. Tests by @s0me0ne-unkn0wn have shown that #5281 and #5273 have indeed improved the situation significantly. The runtime now correctly rejects transactions very close to the specified limit. However, there are still some considerations to keep in mind.

Impact of #5281

In general, when we detect that the node-side storage proof is already larger than what we have recorded in the runtime, we set it to the node-side proof. At some point, the runtime BlockWeight will reach its limit. After that, we will reject the next transaction due to insufficient proof weight. However, since we set the runtime-side proof weight once to the node value, the runtime loses track of how much weight was reserved for on_finalize. The weight consumed by on_finalize will be added on top, potentially exceeding the size limit.

There are safeguards in place that make a real-world impact of this unlikely:

We still have the conservative limit of 2.5 MB set on the node side.
Every proof-size related result from the node side is uncompressed. So, even if we slightly exceed the limit, the resulting PoV will be compressed afterward.

Impact of the Extrinsic Composition

I was initially surprised by the significant impact shown in the initial graph. On further thought, it makes sense. The tests were TPS style, with lots of balance transfers. With a small extrinsic that only reads around two storage items, the impact of a missed CheckNonce::pre_dispatch is high. If you have many extrinsics that touch more storage items, the discrepancy will be lower.

Impact of SignedExtension Ordering

In the PoV-reclaim enablement guide, we recommend placing the StorageWeightReclaim SignedExtension last in the list. It sounds counterintuitive at first, but perhaps we should recommend placing it first in the list. The post_dispatch items of most SEs touch storage items that have been read before. Therefore, it should be better to include pre_dispatch hooks in the reclaim scope rather than the post_dispatch hooks. Storage is read for the first time in pre_dispatch whereas post_dispatch of many SEs modify items that have been read before. I think of this as just a rule-of-thumb, because it does not need to be true for all SEs and is not enforceable anyway. So low impact.

@skunert

…ech#5273) The code do reduce or increase the weight by comparing `benchmarked_weight` and `consumed_weight`. But `benchmarked_weight` is the pre dispatch weight. not the post dispatch weight that is actually written into the block weight by `CheckWeight`. So in case the consumed weight was: `pre dispatch weight > consumed weight > post dispatch weight` then the reclaim code was reducing the block weight instead of increasing it. Might explain this issue even better paritytech#5229 @skunert @s0me0ne-unkn0wn

s0me0ne-unkn0wn added I2-bug The node fails to follow expected behavior. I10-unconfirmed Issue might be valid, but it's not yet known. labels Aug 4, 2024

s0me0ne-unkn0wn linked a pull request Aug 6, 2024 that will close this issue

WIP: implement StorageWeightReclaim as wrapping transaction extension with weight #5234

Closed

1 task

gui1117 mentioned this issue Aug 7, 2024

StorageWeightReclaim: Fix issue when underestimating refund. #5273

Merged

This was referenced Aug 8, 2024

StorageWeightReclaim: set to node pov size if higher #5281

Merged

Remove pov-reclaim from kusama system-chains polkadot-fellows/runtimes#423

Merged

skunert mentioned this issue Oct 24, 2024

Enable PoV reclaim in all system chains polkadot-fellows/runtimes#494

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PoV reclaim underestimates proof size #5229

PoV reclaim underestimates proof size #5229

s0me0ne-unkn0wn commented Aug 4, 2024

bkchr commented Aug 4, 2024

s0me0ne-unkn0wn commented Aug 4, 2024

s0me0ne-unkn0wn commented Aug 4, 2024

skunert commented Aug 4, 2024

s0me0ne-unkn0wn commented Aug 4, 2024

bkchr commented Aug 4, 2024

gui1117 commented Aug 5, 2024 •

edited

Loading

skunert commented Aug 5, 2024 •

edited

Loading

s0me0ne-unkn0wn commented Aug 5, 2024

s0me0ne-unkn0wn commented Aug 5, 2024

skunert commented Aug 5, 2024

gui1117 commented Aug 5, 2024

kianenigma commented Aug 6, 2024

gui1117 commented Aug 7, 2024 •

edited

Loading

ggwpez commented Aug 7, 2024

ggwpez commented Aug 8, 2024

skunert commented Aug 16, 2024

PoV reclaim underestimates proof size #5229

PoV reclaim underestimates proof size #5229

Comments

s0me0ne-unkn0wn commented Aug 4, 2024

bkchr commented Aug 4, 2024

s0me0ne-unkn0wn commented Aug 4, 2024

s0me0ne-unkn0wn commented Aug 4, 2024

skunert commented Aug 4, 2024

s0me0ne-unkn0wn commented Aug 4, 2024

bkchr commented Aug 4, 2024

gui1117 commented Aug 5, 2024 • edited Loading

skunert commented Aug 5, 2024 • edited Loading

s0me0ne-unkn0wn commented Aug 5, 2024

s0me0ne-unkn0wn commented Aug 5, 2024

skunert commented Aug 5, 2024

gui1117 commented Aug 5, 2024

kianenigma commented Aug 6, 2024

gui1117 commented Aug 7, 2024 • edited Loading

ggwpez commented Aug 7, 2024

ggwpez commented Aug 8, 2024

skunert commented Aug 16, 2024

Impact of #5281

Impact of the Extrinsic Composition

Impact of SignedExtension Ordering

gui1117 commented Aug 5, 2024 •

edited

Loading

skunert commented Aug 5, 2024 •

edited

Loading

gui1117 commented Aug 7, 2024 •

edited

Loading