txHandler: batch transactions to the verifier #4621
Conversation
Looks good to me. I'd be satisfied with any other changes being a follow-on cleanup PR later.
}

func (handler *TxHandler) droppedTxnWatcher() {
    for unverified := range handler.streamVerifierDropped {
can't see where streamVerifierDropped is closed... so this goroutine never exits.
Yes, this goroutine never exits. It starts when the channels are opened, and since the channels are never closed and never re-opened, the goroutine never exits either.
Do you think this is a problem?
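For context, a minimal sketch (illustrative names, not the actual txHandler fields) showing that a goroutine ranging over a channel exits only once that channel is closed:

package main

import "fmt"

// droppedWatcher mimics a watcher goroutine that ranges over a channel.
// It exits only when the producer side closes the channel.
func droppedWatcher(dropped <-chan string, done chan<- struct{}) {
    for txn := range dropped { // terminates once dropped is closed and drained
        fmt.Println("dropped:", txn)
    }
    close(done)
}

func main() {
    dropped := make(chan string, 2)
    done := make(chan struct{})
    go droppedWatcher(dropped, done)

    dropped <- "txn-1"
    dropped <- "txn-2"
    close(dropped) // without this close, the watcher goroutine never exits
    <-done
}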
@@ -128,9 +130,14 @@ func (c *txSaltedCache) start(ctx context.Context, refreshInterval time.Duration
    c.moreSalt()
}

func (c *txSaltedCache) WaitForStop() { |
WaitForStop is exported, so make start exported as well.
data/transactions/verify/txn.go
Outdated
numSigCategories++
}

if numSigCategories == 0 { |
We already have these checks in stxnCoreChecks. It looks like the data flows do not intersect, but there is a very particular kind of code duplication here. So I suggest: 1) extract the prologue of this method and of stxnCoreChecks into a separate function, and 2) use the same errors: errTxnSigHasNoSig vs errSignedTxnHasNoSig, and errTxnSigNotWellFormed vs errSignedTxnMaxOneSig.
if numberOfBatchableSigsInGroup == 0 {
    err := sv.addVerificationTaskToThePoolNow([]*UnverifiedElement{stx})
    if err != nil {
        return
This terminates the worker: an enqueuing error breaks the entire verification loop. In the old code, errors from EnqueueBacklog were simply ignored.
I understand that the current code only gets a context-cancelled error here, but this might evolve and then this code will break. I suggest at least logging this situation, and investigating whether the backlog ctx cancellation is on the same code path as txHandler stopping (and StreamVerifier stopping).
EnqueueBacklog will return an error when the exec pool ctx is canceled, or when the ctx of the task being enqueued is canceled. The former happens when the node shuts down, and the latter when the txHandler/StreamVerifier is stopped and its ctx is canceled.
There are tests covering both cases. I added logging for this event, and added a check to the tests to make sure the event is properly logged.
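A minimal sketch of that behavior, under the assumption that the enqueue call fails only on context cancellation (illustrative names, not the real EnqueueBacklog/StreamVerifier API): log the failure so the shutdown path stays visible, then stop the worker.

package main

import (
    "context"
    "errors"
    "log"
)

// enqueue stands in for an exec-pool enqueue: it fails when either the pool
// context or the task context is canceled (node shutdown / verifier stop).
func enqueue(poolCtx, taskCtx context.Context, task func()) error {
    select {
    case <-poolCtx.Done():
        return errors.New("exec pool shutting down")
    case <-taskCtx.Done():
        return errors.New("verifier stopped")
    default:
        go task()
        return nil
    }
}

// submitOrLog logs the enqueue failure instead of exiting silently, so an
// unexpected cancellation path remains visible in the logs.
func submitOrLog(poolCtx, taskCtx context.Context, task func()) {
    if err := enqueue(poolCtx, taskCtx, task); err != nil {
        log.Printf("stream verifier: could not enqueue verification task: %v", err)
    }
}

func main() {
    poolCtx, cancel := context.WithCancel(context.Background())
    cancel() // simulate node shutdown
    submitOrLog(poolCtx, context.Background(), func() {})
}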
data/transactions/verify/txn.go
Outdated
hasMsig := false
hasLogicSig := false
// checkTxnSigTypeCounts checks the number of signature types and reports an error in case of a violation
func checkTxnSigTypeCounts(s *transactions.SignedTxn) (isStateProofTxn bool, err *TxGroupError) {
How about returning numSigs int, sigOrTxnType sigOrTxnType and letting the caller decide on the error? Where:
type sigOrTxnType int
const regularSig sigOrTxnType = 1
const multiSig sigOrTxnType = 2
const logicSig sigOrTxnType = 3
const stateProofTxn sigOrTxnType = 4
Then the caller could keep the very same error condition checks as now.
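One possible shape of that suggestion, as a self-contained sketch (signedTxnSketch and its boolean fields are simplified stand-ins for the real transactions.SignedTxn, not the actual implementation):

package main

import "errors"

// signedTxnSketch is a simplified stand-in for transactions.SignedTxn;
// the boolean fields replace the real Sig/Msig/Lsig comparisons.
type signedTxnSketch struct {
    hasSig, hasMsig, hasLogicSig, isStateProof bool
}

type sigOrTxnType int

const (
    regularSig    sigOrTxnType = 1
    multiSig      sigOrTxnType = 2
    logicSig      sigOrTxnType = 3
    stateProofTxn sigOrTxnType = 4
)

// checkTxnSigTypeCounts, reshaped per the suggestion: classify the signature
// (or txn) type, count the categories, and let the caller decide on the error.
func checkTxnSigTypeCounts(s *signedTxnSketch) (numSigs int, t sigOrTxnType) {
    if s.isStateProof {
        return 0, stateProofTxn
    }
    if s.hasSig {
        numSigs++
        t = regularSig
    }
    if s.hasMsig {
        numSigs++
        t = multiSig
    }
    if s.hasLogicSig {
        numSigs++
        t = logicSig
    }
    return numSigs, t
}

func main() {
    // The caller keeps the same error conditions as before.
    numSigs, t := checkTxnSigTypeCounts(&signedTxnSketch{hasSig: true})
    switch {
    case t == stateProofTxn:
        // state proof txns carry no signature to verify
    case numSigs == 0:
        panic(errors.New("signedtxn has no sig"))
    case numSigs > 1:
        panic(errors.New("signedtxn should only have one of Sig or Msig or LogicSig"))
    }
}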
It is more complex and we don't have a use case where two callers are interested in reporting different errors.
But in fact you are re-doing the 64-byte comparisons for both sig and msig. Maybe this is not slower than an int comparison on modern processors, not sure.
Btw, the second use
isStateProofTxn, err := checkTxnSigTypeCounts(stx)
if err != nil || isStateProofTxn {
if err != nil {
return 0, err
}
return 0, nil
}
could be simplified to
isStateProofTxn, err := checkTxnSigTypeCounts(stx)
if err != nil || isStateProofTxn {
return 0, err
}
This is a tricky one. When returning err (instead of nil) while err == nil, the caller receives err != nil. I could not understand why.
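One likely explanation (an assumption, not confirmed in this thread) is the classic Go typed-nil gotcha: a nil pointer of a concrete type such as *TxGroupError becomes non-nil once it is returned through the error interface. A minimal sketch:

package main

import "fmt"

// TxGroupErrorSketch stands in for a concrete error type returned as a pointer.
type TxGroupErrorSketch struct{ msg string }

func (e *TxGroupErrorSketch) Error() string { return e.msg }

// check returns a typed nil pointer when there is nothing wrong.
func check() *TxGroupErrorSketch { return nil }

// caller passes the result through the error interface.
func caller() error {
    err := check()
    return err // nil *TxGroupErrorSketch becomes a non-nil error interface value
}

func main() {
    if err := caller(); err != nil {
        // This branch is taken even though the underlying pointer is nil,
        // because the interface carries a concrete type with a nil value.
        fmt.Printf("err != nil: %T\n", err)
    }
}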
But in fact you are re-doing the 64-byte comparisons for both sig and msig. Maybe this is not slower than an int comparison on modern processors, not sure.
Convinced.
Please fix some
One last remark on unused error values.
data/transactions/verify/txn.go
Outdated
var errSignedTxnHasNoSig = errors.New("signedtxn has no sig")
var errSignedTxnMaxOneSig = errors.New("signedtxn should only have one of Sig or Msig or LogicSig")
These two are only used in tests, remove?
This PR delivers two main features:
1. Batching incoming transactions for crypto verification
2. Streaming capabilities of transactions to the verifier
Why (1):
Batched transactions are more efficient to verify because of the cryptographic efficiency of batch verification. Batching is already used for verifying the txns in a block and the sigs in an Msig. With this work, it also becomes possible for incoming txns.
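To illustrate the accumulate-then-verify pattern, here is a self-contained sketch (the batchVerifier type and its methods are illustrative, not the project's crypto API; the final step here just loops over standard ed25519, whereas the real code delegates to a batched primitive such as crypto_sign_ed25519_open_batch):

package main

import (
    "crypto/ed25519"
    "crypto/rand"
    "fmt"
)

// batchVerifier sketches the idea: collect (publicKey, message, signature)
// triples and verify them together instead of one call per signature.
type batchVerifier struct {
    pks  []ed25519.PublicKey
    msgs [][]byte
    sigs [][]byte
}

func (b *batchVerifier) enqueue(pk ed25519.PublicKey, msg, sig []byte) {
    b.pks = append(b.pks, pk)
    b.msgs = append(b.msgs, msg)
    b.sigs = append(b.sigs, sig)
}

// verify falls back to one-by-one verification to stay self-contained; a real
// batch verifier performs the whole batch in one cheaper cryptographic operation.
func (b *batchVerifier) verify() bool {
    for i := range b.sigs {
        if !ed25519.Verify(b.pks[i], b.msgs[i], b.sigs[i]) {
            return false
        }
    }
    return true
}

func main() {
    bv := &batchVerifier{}
    for i := 0; i < 3; i++ {
        pk, sk, _ := ed25519.GenerateKey(rand.Reader)
        msg := []byte(fmt.Sprintf("txn-%d", i))
        bv.enqueue(pk, msg, ed25519.Sign(sk, msg))
    }
    fmt.Println("batch ok:", bv.verify())
}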
Why (2):
The batch verifier should use the execution pool to keep the number of goroutines performing the heavy lifting in check. This approach is also followed in PaysetGroups for verifying the txns in a block. The transaction handler cannot block until a transaction is verified; for that reason, the exec pool is also used in txHandler.
The handler and the verifier cannot both use the exec pool, as this may lead to a deadlock. For this reason, the streaming approach is adopted here.
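A minimal sketch of the streaming shape (all names, the batch size, and the wait time below are illustrative, not the actual StreamVerifier parameters): the handler only sends into a channel and returns, while a single goroutine drains the channel, accumulates a batch, and submits it, so the handler never blocks on pool capacity.

package main

import (
    "context"
    "fmt"
    "time"
)

// streamToPool drains `in`, accumulates up to batchSize items (or until the
// timer fires), and hands each batch to submit. The sender never blocks on
// the pool; only this goroutine interacts with it.
func streamToPool(ctx context.Context, in <-chan string, submit func([]string), batchSize int, wait time.Duration) {
    batch := make([]string, 0, batchSize)
    timer := time.NewTimer(wait)
    defer timer.Stop()
    flush := func() {
        if len(batch) > 0 {
            submit(batch)
            batch = make([]string, 0, batchSize)
        }
        timer.Reset(wait)
    }
    for {
        select {
        case <-ctx.Done():
            return
        case txn, ok := <-in:
            if !ok {
                flush()
                return
            }
            batch = append(batch, txn)
            if len(batch) >= batchSize {
                flush()
            }
        case <-timer.C:
            flush()
        }
    }
}

func main() {
    in := make(chan string, 16)
    go func() {
        for i := 0; i < 5; i++ {
            in <- fmt.Sprintf("txn-%d", i)
        }
        close(in)
    }()
    streamToPool(context.Background(), in, func(b []string) {
        fmt.Println("submit batch of", len(b))
    }, 4, 20*time.Millisecond)
}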
Other considerations and future work:
The signatures inside a LogicSig (Msig or Sig) are not added to the main batch; instead, they are verified in their own separate batch. This is a limitation, and performance can be improved by adding them to the top-level batch. That requires decoupling the txn validity checks from the signature verification, and will be addressed in a different PR.
In the crypto_sign_ed25519_open_batch implementation, the validity of every signature is checked one by one in case the batch verification fails. This can be improved for all workflows:
a. In case we don't care which sig fails and we reject the whole batch (Msig, block validation), we are doing the extra work of verifying each sig one by one when the batch fails.
b. In case we care which txn sig failed, as with batch verification of the incoming txns, the verification can be done by checking a logarithmic number of the sigs instead of checking all of them linearly (see the sketch after this list).
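A sketch of the bisection idea from (b), using a toy verifyBatch stand-in rather than real signatures: when a batch fails, split it and recurse only into the failing halves, so locating a few bad signatures costs on the order of k*log(n) batch calls instead of n individual checks.

package main

import "fmt"

// findInvalid returns the "signatures" that make batches fail. verifyBatch is
// a stand-in for a real batched ed25519 verification.
func findInvalid(sigs []int, verifyBatch func([]int) bool) []int {
    if len(sigs) == 0 || verifyBatch(sigs) {
        return nil
    }
    if len(sigs) == 1 {
        return []int{sigs[0]} // a failing batch of one is the invalid signature itself
    }
    mid := len(sigs) / 2
    bad := findInvalid(sigs[:mid], verifyBatch)
    return append(bad, findInvalid(sigs[mid:], verifyBatch)...)
}

func main() {
    // Toy model: a "signature" is invalid when it is negative.
    sigs := []int{1, 2, -3, 4, 5, 6, -7, 8}
    verifyBatch := func(b []int) bool {
        for _, s := range b {
            if s < 0 {
                return false
            }
        }
        return true
    }
    fmt.Println("invalid:", findInvalid(sigs, verifyBatch)) // [-3 -7]
}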
Benchmark results:
The benchmarks are implemented on the current master, and the same benchmarks are ported here. That is, the benchmarks reported here measure the same workflows with the same test code, except for some adaptations, since the backlog worker (blw) has some changes in how transactions are passed from the handler to the blw and out of it.
(inv) indicates the probability of a transaction group having an invalid signature.
Note: the benchmarks feed transactions directly to the verifier. Benchmarks were also added that use the backlog worker to evaluate the drop rate, but those results are not reported here; they are not refined enough to yield meaningful measurements.
Single transaction with single signature
A single transaction with a single signature is the case that benefits most from this project. When invalid signatures are rare, the results below show a ~3x improvement. With a high rate of invalid signatures, the gap gets smaller because the C code retries every signature when the batch fails. The fix for this is a separate project, as mentioned in the future work above.
Before:
The signatures in a single txn group or in a multisig were already batched before this project, so improvements for those cases will be limited, if any. However, it is important to make sure there is no regression.
Txn Groups
The txn group sizes are random, ranging from a single transaction to the maximum allowed size, uniformly distributed.
Improvements are observed when the invalid signature rate is low. This is mainly due to batching together smaller txn groups.
Before:
Msig
The msig sizes tested are 255, 64, and 16. The results look random, mainly because the two versions perform similarly and slight random variations pick a different winner each time. The exception is the smallest category, namely msigs with 16 signatures and a low invalid rate, where we see some considerable improvement.
Before:
Msig TxnGroups
As expected, no significant difference here, since txn groups with many signatures per transaction were already batched, and the new code adds nothing of value here. However, this shows that the new code introduces no regression.
Before:
More Multisig with 4, 8, 16, 16 sig msigs
Before:
and Msig TxnGroups with 4, 8, 16, 16 sig msigs
Before: