-
Notifications
You must be signed in to change notification settings - Fork 138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Throttle bug fixes + req refactors #565
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heyoo
It's very important to define precise invariant(s) and property(s) of your ideal system before you start implementing a solution to get there. There are problems cropping up because the invariants defined so far are not precise.
The invariant is here
interchain-security/docs/throttle.md
Lines 81 to 101 in b28e176
## Throttling Invariant | |
Using on-chain params and the sub protocol defined, slash packet throttling is implemented such that the following invariant is maintained (in addition to those already defined in the CCV spec). | |
For the following invariant to hold, these points must be true: | |
- The final slashed validator does not have more than `SlashMeterReplenishFraction` of total voting power on the provider. | |
- `SlashMeterReplenishFraction` is large enough to avoid rounding errors. | |
- `SlashMeterReplenishPeriod` is sufficiently longer than the time it takes to produce a block. | |
Invariant: | |
> The time it takes to jail `X`% of the provider validator set from consumer initiated slash requests will be greater than or equal to `(X * SlashMeterReplenishPeriod / SlashMeterReplenishFraction) - 2 * SlashMeterReplenishPeriod` | |
Intuition: If jailings begin when the slash meter is full, then `SlashMeterReplenishFraction` of the provider validator set can be jailed immediately. The remaining jailings are only applied when the slash meter is positive in value, so the time it takes to jail the remaining `X - SlashMeterReplenishFraction` of the provider validator set is `(X - SlashMeterReplenishFraction) * SlashMeterReplenishPeriod / SlashMeterReplenishFraction`. However, the final validator could be jailed during the final replenishment period, with the meter being very small in value (causing it to go negative after jailing). So we subtract another `SlashMeterReplenishPeriod` term in the invariant to account for this. | |
Note this invariant could be adjusted with different slash meter protocols, but the current scheme is the simplest to implement and understand. | |
This invariant is useful because it allows us to reason about the time it takes to jail a certain percentage of the provider validator set from consumer initiated slash requests. For example, if `SlashMeterReplenishFraction` is set to 0.06, then it takes no less than 4 replenishment periods to jail 33% of the provider validator set on the Cosmos Hub. Note that as of writing this on 11/29/22, the Cosmos Hub does not have a validator with more than 6% of total voting power. | |
Note also that 4 replenishment period is a worst case scenario that depends on well crafted attack timings. |
but it's not sensible in the current form
The time it takes to jail
X
% of the provider validator set from consumer initiated slash requests ...
because the validator set is always changing every block. So the sentence doesn't really make sense: which validator set is being talked about?
I bring this up because development without invariants leads to many problems. The core issue is that you cannot judge a solution/implementation without invariants. Moreover, you cannot describe what the code is supposed to be doing, except in a hand wavy way.
I bring this up here because, in this PR, changes are being made which affect the slash guarantees in subtle ways, and it cannot be said if they're correct or not, because the goal (the invariant) isn't defined.
To be concrete, with a problem here. One of the issues being addressed is the one from this comment
which says that setting replenishFraction=1
is not sufficient to avoid packet delays. But this code does not address that AFAIU.
Note that inactive validators are not the same as unbonded valiators. All unbonded validators are inactive, but not all inactive validators are unbonded. Some inactive validators are unbonding, and they must be slashable. Slash packets targeting active and inactive (but not unbonded) validators can all arrive at the same time and they must all be handled as per whatever invariant.
I think it's very important to precisely write down the throttle invariants, and then work from there. We can work together on doing that!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving because needed in goc-december testnet release.
Integration tests will be added in some other PR.
Everything passes locally, including |
Responding to @danwt
The problems that have came up so far seem like normal bugs that're a part of the development process. E2e/unit tests are limited, this is why we decided not to merge the slash packet throttling feature into main until we either had diff tests passing, or throttle specific integration tests passing (both of which are capable of finding this critical bug). Finding bugs now is a good thing imo and I welcome such efforts!
Imo this is not the case. There is a lack of finalized spec rn, but that does not mean we're throwing invariants out the window.
You have a valid concern about the invariant I've defined in This PR does address #462 (comment). See
None of the functionality you've written in this paragraph should be violated by slash packet throttling. If it is violated, that's a bug we need to fix. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Just some suggestions on simplifying the code. Also, why are the GH actions failing?
ctx.BlockTime(), // recv time | ||
chainID, // consumer chain id that sent the packet | ||
packet.Sequence, // IBC sequence number of the packet | ||
providerConsAddr)) // Provider consensus address of val to be slashed | ||
|
||
// Queue slash packet data in the same (consumer chain specific) queue as vsc matured packet data, | ||
// to enforce order of handling between the two packet types. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not update the packet data to include the providerConsAddr
and avoid calling GetProviderAddrFromConsumerAddr
later on?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not update the packet data
By this, I believe you're referring to the SlashPacketData
type. This is the protobuf generated type that is sent over the wire from consumer to provider. Adding a new field to this type (that's only used on the provider) seems unnecessary to me. However, if you feel this is important, I can refactor the base branch once the merge conflicts are solved. I'm hesitant to do much more refactoring in this PR due to how disconnected from main it is at this time :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking more like updating SlashPacketData.Validator.Address
to the provider address obtained from GetProviderAddrFromConsumerAddr
. But it's not that important.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah gotcha, that's a reasonable change, done in 1f920b4
var valPower int64 | ||
if !found || val.IsJailed() { | ||
// If validator is not found, or found but jailed, it's power is 0. This path is explicitly defined since the | ||
// staking keeper's LastValidatorPower values are not updated till the staking keeper's endblocker. | ||
valPower = 0 | ||
} else { | ||
valPower = k.stakingKeeper.GetLastValidatorPower(ctx, val.GetOperator()) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be missing something but doesn't this change allow validator to be slashed all the way to 0 in a single block?
If a consumer sends many slash packets for the same validator?
Besides that, I don't think this addressing the problem in this comment
To state the problem more clearly:
We want it to be possible to set the slashFraction or other params so that the throttle is 'invisible'. Ie. the behavior of the system is exactly as in the current spec,
This means, we do not want any slash packets to be held up/delayed (still talking about when fraction=1, replenishPeriod = 1 second ect).
Consider the following
Imagine there are n validators total on the provider, with k<n active. Imagine the actives are v1,v2..vk and the inactives are vk+1,..vn
A consumer sends slash packets for every single provider validator v1..,vn and they are all delivered to the provider in the same block.
It is possible, that all of the slashes for the active validators v1..vk are handled first. This will set the meter to 0 so the slashes for vk+1,.. vn will be blocked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this design we can slash up to but not including 100% of the "initial val set" instantaneously when the replenish fraction is 1. If we want to include 100%, its a one line inequality change. I like 100% not being allowed, but I'm not super attached to that idea
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change we discussed today is included in 777a3f6, thank you for the efforts and feedback. Even with disagreements at times, the dialogue ended up useful and productive! It will be cool to get this feature properly specced and thoroughly tested before prod
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and before I forget,
I might be missing something but doesn't this change allow validator to be slashed all the way to 0 in a single block?
Yes this is possible with or without slash packet throttling. Does #544 address your concern? If not, might be useful to add onto that issue or create a new one
I think this has to do with some oddities of how GH handles CI for a branch off a branch that's disconnected from main 😆, I just ran |
commit 01b13d3 Author: Shawn Marshall-Spitzbart <[email protected]> Date: Thu Dec 8 18:34:49 2022 -0800 weird commit a759c07 Author: Shawn Marshall-Spitzbart <[email protected]> Date: Thu Dec 8 13:33:34 2022 -0800 Throttle bug fixes + req refactors (#565) * fixes * Update throttle.go * fix tests * set replenish frac = 1.0 for all test runs * rm unmarshal func * req refactors * small lint fix * comment adjustment * found <-> jailed order swap * 100% change
* wip * Update module.go * wip * tests pass * Update relay.go * Update relay.go * Update relay.go * Update relay.go * wip * still wip * panic with too many packets * Update relay.go * wip * checkpoint * Update throttle.go * make relay.go closer to main * merge fixes * Update expected_keepers.go * Update throttle_test.go * Update throttle.go * second queue is now implemented * smalls * Update throttle.go * removed pointer silliness * Update throttle_test.go * test improvements * small * where it's called * Update relay.go * comments n stuff * Update throttle.go * Update throttle.go * comments * wip * callbacks * cleans * mas tests * Update README.md * less diff * mas * keys * wip * changes * Update params.go * size constraints * Update keys_test.go * Update throttle_test.go * wip * on recv new behavior and test * on recv slash packet * so close * clean * Update slashing.go * cleans * wip * params * wip * tests * changes * mas * Update README.md * Update README.md * Update README.md * sorry for the friday night emails * Update instance_test.go * wip * wip * wip * wip * wip * wip * wip * wip * works * Update generic_setup.go * Update setup.go * smol * small * path to ccv chan setup * todos * Update setup.go * Create debug_test.go * democ * Update debug_test.go * setup all ccv channels * bump to main * another bump, missed one * fix after merge main * fixes * Update slashing.go * expired client tests * checks * fixed the stuff * smol * changes * updates * cleans * clean * todo * fixes * cleans * Update slashing.go * Update slashing.go * mod tidy * fix build errs * test fixes * Update slashing.go * Update throttle.go * base rounding * mas unit tests * updates * comments * comments * cleans * clean * small * sin gas * more understandable logic * crypto rand * utils * one e2e * helpers * Update slashing.go * helpers * cleans * shiz works * tcs * smalls * un mas * allowance changing test * smol * smol * e2e tests are done * lets try this * Key Assignment -> goc-december (#527) * add MsgAssignConsumerKey * add MsgAssignConsumerKey * fix package name * add keys * add keeper methods for key assignment * handle MsgAssignConsumerKey * map addresses in slash requests * prune old consumer addresses * move AssignConsumerKey logic to keeper * update consumer initial valset * add ApplyKeyAssignmentToValUpdates * fix client creation * do not check init valset on consumer * clean state on val removal * fix TestAssignConsensusKeyForConsumerChain * delete on val removal * remove reverse mapping on val removal * remove pending key assignment in EndBlock * add query endpoints add summary of indexes change ConsumerValidatorByVscID to ConsumerAddrsToPrune * Refactor AssignConsumerKey for clarity (IMO) * finish key assignment genesis code- untested * FIxed mocks compile issue - not sure if it works right though. * add test for init and export genesis * set after get in AssignConsumerKey * enable AssignConsumerKey to be called twice * remove key assignment on chain removal * apply some review comments * fix bug: two validator with same consumer key * rename key: ConsumerValidatorsByVscIDBytePrefix -> ConsumerAddrsToPruneBytePrefix * PendingKeyAssignment -> KeyAssignmentReplacements * msg.ProviderAddr is a validator addr * fix: key assignment genesis tests (#517) * Fix consumer init genesis test * fix provider genesis tests * fix key assignement handler * fix linter * fix merge conflict * fix ProviderValidatorAddress * remove unused expectation Co-authored-by: Marius Poke <[email protected]> * add key assignment CRUD operations unit tests (#516) * test val consumer key related CRUD * test val consumer addr related CRUD * test pending key assignments related CRUD * refactor after review session * refactor after review session * add prune key CRUD tests * renamings in testfiles * improve KeyAssignmentReplacement set and get * remove ApplyKeyAssignmentToInitialValset (redundant) * add invariant to docstring of AppendConsumerAddrsToPrune * fix address conversion * adding e2e tests * fix linter * add queries; setup integration tests (#519) * add queries; setup integration testse * test key assignment before chain start * fix state queries; refactor * rm extra comment * rm unused action field * bump voting times in all tests * add provider address query to tests * Adds some very basic random testing and unit tests (#522) * Adds imports * Does multi iterations: fails! * Handle errs * checkpoint debug * Pre introduce dynamic mock * Issue seems to be resolved * Removes prints in key asisgn * Removes debug, pre reintroduce all test features * Fix some magic numbers, bring back prune check * Pre rework initial assignments * Refactor and tidyup * Better docs, clarity, org Co-authored-by: Daniel <[email protected]> * Enable key assignment testing for all e2e tests (#524) * split CCVTestSuite.setupCallback in two * pre-assign keys for all vals of first consumer * fix linter * remove TestConsumerGenesis Co-authored-by: mpoke <[email protected]> Co-authored-by: Simon Noetzlin <[email protected]> Co-authored-by: MSalopek <[email protected]> Co-authored-by: Daniel T <[email protected]> Co-authored-by: Daniel <[email protected]> * Fix errors in merge commit, comment out failing TestRelayAndApplySlashPacket test * add simon's test fixes * Update state.go * last replenish time -> last full time * readme * increment i * shared method * iteration change * Meter allowance lockstep (#553) changes * log * requested key format changes (#560) changes * Throttle garbage collection (#557) * changes * Update proposal.go * add log * Update throttle_test.go * fixes * update invariant * Throttle bug fixes + req refactors (#565) * fixes * Update throttle.go * fix tests * set replenish frac = 1.0 for all test runs * rm unmarshal func * req refactors * small lint fix * comment adjustment * found <-> jailed order swap * 100% change * weird * merge fixes to build * tests now pass * name change * better ordering tests * remove integration test diff * avoid double call to address mapping * Update throttle.go * 0 included in iteration start * naming refactors * Update keys_test.go * more refactors * clarify allowance terminology * update doc with explanation on min value * md clarification * Update throttle.md * swap replenish order * add max limit note * #533 Adds normal operation diff testing * reb * reb * Del unused Co-authored-by: Daniel <[email protected]> * progress save * cleans * name change * Bugfix (#605) * Circuit breaker refactor (#606) * quick fix * small key correction * smol * don't store time length * use big endian, shawn you dingus Co-authored-by: Jehan <[email protected]> Co-authored-by: mpoke <[email protected]> Co-authored-by: Simon Noetzlin <[email protected]> Co-authored-by: MSalopek <[email protected]> Co-authored-by: Daniel T <[email protected]> Co-authored-by: Daniel <[email protected]> Co-authored-by: Jehan Tremback <[email protected]>
Description
This PR solves two bugs found from diff tests on #462, thank you for finding these @danwt! It also closes an issue raised about the integration of key assignment and throttling.
Bug 1: #462 (comment)
Bug 2: #462 (comment)
Issue fixed: #538
Linked issues
Closes #538
Type of change
How was the feature tested?
Other information
This will need to be cherry-picked into
goc-december
andcircuit-breaker