-
Notifications
You must be signed in to change notification settings - Fork 767
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PoV reclaim underestimates proof size #5229
Comments
Nice find! Here we need to add |
I'm also wondering if just moving the |
Hmmm, so you propose to just stick to the recorded weight, effectively discarding runtime's own calculations? Interesting, I wonder if it would interfere with the |
This is a good way to alleviate the issue. However we should add the The only problem of this patch is that this is that if we run into this case we lose track of the weight that has been reserved for
While this is true for the specific case we were testing, I don't think its safe to make this general assumption that
Only for the case where we run into the condition though. For the general case we should not just set the runtime storage item to the proof size, because To fix this problem properly I can currently think of two ways:
|
One more option is to have two paired extensions, one implementing |
No, this is not what I meant. What I meant is that storage reclaim wraps all other extensions. This way it has control over pre and post dispatch. BTW, while thinking about this. The costs for the extensions should be included in the base extrinsic weight(cc @ggwpez). Maybe this wasn't updated for the chain you tested this with @s0me0ne-unkn0wn? Or maybe that is just not added to the extrinsic weight directly. |
Indeed signed extensions weight should be taken into account by base extrinsic weight in theory. For future transaction extensions with weight we can do wrapping extension or 2 extensions. Both are doable and fine to me. I draft an implementation of wrapping extension here #5234 |
Do we assume a number of accounts for this? Afaik @s0me0ne-unkn0wn was filling the state with tons of accounts for these tests, maybe recalculation of this weight would be necessary? EDIT: Only in theory, seems like base extrinsic weight is
everywhere. |
Yes, indeed, this test is done with 256k accounts in genesis to check, among other things, how the state size affects throughput. As far as I can see, the |
@gui1117 oh, I absolutely love how you first say "yep, it's doable" and then pull a 1200-line PR out of your pocket 🤣 |
Absolutely should be automatic, just trying to identify the involved parts. I think the wrapper draft will be the way to go. |
I was already working on it coincidentally for transaction extensions PR :-) |
very interesting discovery @s0me0ne-unkn0wn! Prelude: let's take a moment and appreciate in advance that with PVM, someday, we can get rid of this. If the solution of this lies in the order of signed extension, or somehow them being wrapped, I have proposed in general moving some signed extensions that we think are "mandatory" to a fixed list, and using it in templates. This will be a strong signal from us that we think this is how one should configure their runtime. polkadot-sdk/substrate/frame/src/lib.rs Lines 293 to 304 in 11fdff1
Yes, this is the vision, but it is a pretty cumbersome process as you can imagine, and worst of all it is not well documented. Truth is, benchmarking is a race that we are losing right now, or we are barely holding onto. As in, even in RC, I believe the state is bigger (don't quote me on this) than our "assumption" of benchmarking. As noted in #5223 (comment), I think moving towards PVM is hopefully the right direction. There is another issue with increasing the assumption as well: That you pessimistically reduce throughput, esp. if the benchmarking code is written poorly, and there is not enough manual refunds in the code. With a good PoV reclaim we can be a bit bolder here and confidently increase the static weights, as they will get reclaimed. As I commented in |
There might be a bug in reclaim which explains this situation: #5273 basically proof size is understimated if pre dispatch info overestimate the proof size and post dispatch info underestimage the proof size at the same time. |
The extrinsic base weight benchmarking is indeed not looking at the PoV weight. The logic currently is to build an empty block and measure that. The problem is that the block is build in the client - not the runtime. Therefore it is not compatible with the omni-bencher. But it could be possible to extend it at least for the polkadot non-omni node.
Yea the TE merge request adds this, then we could just add that here:
|
I changed the overhead benchmarking to also include the proof when building the block until we have the TE benchmarks: #5283 It could help to ensure that we dont underestimate the proof size upfront. |
The code do reduce or increase the weight by comparing `benchmarked_weight` and `consumed_weight`. But `benchmarked_weight` is the pre dispatch weight. not the post dispatch weight that is actually written into the block weight by `CheckWeight`. So in case the consumed weight was: `pre dispatch weight > consumed weight > post dispatch weight` then the reclaim code was reducing the block weight instead of increasing it. Might explain this issue even better #5229 @skunert @s0me0ne-unkn0wn
I propose to move PoV-reclaim to the next release. It can happen that the amount to reclaim is overestimated paritytech/polkadot-sdk#5229. Improvements (paritytech/polkadot-sdk#5281, paritytech/polkadot-sdk#5273) and a proper fix (paritytech/polkadot-sdk#5234) are in the works, but should be tested again. I expect the chances that the system-chains run into this scenario to be low, but it does not look like activity is so high that the inclusion of reclaim is urgent. --------- Co-authored-by: Bastian Köcher <[email protected]>
This PR adds an additional defensive check to the reclaim SE. Since it can happen that we miss some storage accesses on other SEs pre-dispatch, we should double check that the bookkeeping of the runtime stays ahead of the node-side pov-size. If we discover a mismatch and the node-side pov-size is indeed higher, we should set the runtime bookkeeping to the node-side value. In cases such as #5229, we would stop including extrinsics and not run `on_idle` at least. cc @gui1117 --------- Co-authored-by: command-bot <>
The code do reduce or increase the weight by comparing `benchmarked_weight` and `consumed_weight`. But `benchmarked_weight` is the pre dispatch weight. not the post dispatch weight that is actually written into the block weight by `CheckWeight`. So in case the consumed weight was: `pre dispatch weight > consumed weight > post dispatch weight` then the reclaim code was reducing the block weight instead of increasing it. Might explain this issue even better #5229 @skunert @s0me0ne-unkn0wn (cherry picked from commit 862860e)
This PR adds an additional defensive check to the reclaim SE. Since it can happen that we miss some storage accesses on other SEs pre-dispatch, we should double check that the bookkeeping of the runtime stays ahead of the node-side pov-size. If we discover a mismatch and the node-side pov-size is indeed higher, we should set the runtime bookkeeping to the node-side value. In cases such as #5229, we would stop including extrinsics and not run `on_idle` at least. cc @gui1117 --------- Co-authored-by: command-bot <> (cherry picked from commit 055eb53)
The code do reduce or increase the weight by comparing `benchmarked_weight` and `consumed_weight`. But `benchmarked_weight` is the pre dispatch weight. not the post dispatch weight that is actually written into the block weight by `CheckWeight`. So in case the consumed weight was: `pre dispatch weight > consumed weight > post dispatch weight` then the reclaim code was reducing the block weight instead of increasing it. Might explain this issue even better #5229 @skunert @s0me0ne-unkn0wn (cherry picked from commit 862860e)
This PR adds an additional defensive check to the reclaim SE. Since it can happen that we miss some storage accesses on other SEs pre-dispatch, we should double check that the bookkeeping of the runtime stays ahead of the node-side pov-size. If we discover a mismatch and the node-side pov-size is indeed higher, we should set the runtime bookkeeping to the node-side value. In cases such as #5229, we would stop including extrinsics and not run `on_idle` at least. cc @gui1117 --------- Co-authored-by: command-bot <> (cherry picked from commit 055eb53)
Small write-up for future reference and for others who want to reason about this bug. Tests by @s0me0ne-unkn0wn have shown that #5281 and #5273 have indeed improved the situation significantly. The runtime now correctly rejects transactions very close to the specified limit. However, there are still some considerations to keep in mind. Impact of #5281In general, when we detect that the node-side storage proof is already larger than what we have recorded in the runtime, we set it to the node-side proof. At some point, the runtime BlockWeight will reach its limit. After that, we will reject the next transaction due to insufficient proof weight. However, since we set the runtime-side proof weight once to the node value, the runtime loses track of how much weight was reserved for on_finalize. The weight consumed by on_finalize will be added on top, potentially exceeding the size limit. There are safeguards in place that make a real-world impact of this unlikely:
Impact of the Extrinsic CompositionI was initially surprised by the significant impact shown in the initial graph. On further thought, it makes sense. The tests were TPS style, with lots of balance transfers. With a small extrinsic that only reads around two storage items, the impact of a missed Impact of SignedExtension OrderingIn the PoV-reclaim enablement guide, we recommend placing the |
…ech#5273) The code do reduce or increase the weight by comparing `benchmarked_weight` and `consumed_weight`. But `benchmarked_weight` is the pre dispatch weight. not the post dispatch weight that is actually written into the block weight by `CheckWeight`. So in case the consumed weight was: `pre dispatch weight > consumed weight > post dispatch weight` then the reclaim code was reducing the block weight instead of increasing it. Might explain this issue even better paritytech#5229 @skunert @s0me0ne-unkn0wn
The PoV reclaim signed extension calculates the actual storage proof size as a difference between the recorded proof sizes in the extension's
post_dispatch
andpre_dispatch
calls. Then, the delta between the real storage proof size and benchmarked size is calculated, and the runtime-registered weight is corrected by that delta.The problem is that after
Executive
calls into theApplyable::apply
but before the PoV reclaim'spre_dispatch
, many things happen that also affect the recorded proof size. Namely,pre_dispatch
hooks of other signed extensions get called.The
CheckNonce
extension, which updates the nonce and stores the account data in theAccount
map, had the most significant influence on the proof size I observed. Other things that are not so influential but worth mentioning are the writes toAllExtrinsicLen
andBlockWeight
itself by theCheckWeight
extension.The PoV reclaim, not being aware of those storage proof size increments, decreases proof size in the runtime too much. To give a concrete example, for a transaction that actually consumed 3769 bytes of proof (measured as a difference between recorded proof sizes before and after calling into
apply
), the runtime incremented proof size by a benchmarked value of 3731 bytes, and then 1758 of them were reclaimed by the PoV reclaim, resulting in 1973 bytes of proof increase being registered by the runtime. So 3769 was registered by the recorder and 1973 by the runtime, a huge difference.The result is the wrong proof size registered by the runtime, which may, above all else, be a security problem.
The following graph shows how proof size increases as registered by the node and by the runtime with increase of the number of transactions.
CC @skunert @bkchr @eskimor
The text was updated successfully, but these errors were encountered: