-
Notifications
You must be signed in to change notification settings - Fork 11.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Do not wait any longer once quorum threshold is reached in handle_certificate calls #4606
Conversation
2e1a6fd
to
c20e66a
Compare
This was intended behavior - we want the certificate to be executed on as many validators as possible, so we continue past the point of quorum. This is in contrast to handle_transaction, where we want to stop immediately after quorum, since you don't get any bonus points for have extra signatures in your cert. |
The certificate is still executed like the test proves but we don't need to wait for it on the client, no? |
That's because all the authorities in your test immediately process the request, so the timeout never comes into play. But if there was any trouble sending the request in the first place, then other authorities may not even receive the request if we give up early. That said, after thinking about it some more, this may be ok. I initially thought that your comment about running the handlers in separate tasks was a red herring, because when the authority aggregator code was written we hadn't even yet realized that early client disconnects would interrupt the request handler. So, that wasn't the original justification, but it may actually be pretty good justification after all. I agree that it is the case that any validators that at least receive the request will finish, so waiting for more than 2f+1 to reply may not be necessary. @lxfind what do you think? |
Yeah I think this is the right thing to do. |
Maybe a reasonable sleep based delay will be better? I'm open to suggestions. |
Yeah sleep should help. I wonder why this is causing checkpoint tests to fail |
|
||
use tokio::time::Instant; | ||
|
||
pub async fn init_network_authorities( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how's this different from init_local_authorities
below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one will spin up authorities which will exchange messages over network. The local one is updated inline. I wanted to simulate how it'll be in prod and hence used network authorities.
IMO those functions are less relevant once gateway is gone, but let's keep their cleaning up to another PR |
0a73242
to
1a88e82
Compare
I also moved |
1a88e82
to
a45b305
Compare
We continue the quorum round even after we have more than threshold stake with a reduced timeout which is probably because we don't want to terminate the connection causing validators to throw away whatever processing they have going on. This PR fixes the behavior. This may have been true before but we now spawn a tokio task on every server call which means the execution completes even if we terminate the connection. Added a test to confirm this behavior. There are other two places in authority_aggregator we use ContinueWithTimeout() which I am not fully certain if will benefit from this change (and hence not touching those) as we also seem to update authorities based on their response of object version they have present locally.