Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not wait any longer once quorum threshold is reached in handle_certificate calls #4606

Merged
merged 1 commit into from
Sep 15, 2022

Conversation

sadhansood
Copy link
Contributor

@sadhansood sadhansood commented Sep 13, 2022

We continue the quorum round even after we have more than threshold stake with a reduced timeout which is probably because we don't want to terminate the connection causing validators to throw away whatever processing they have going on. This PR fixes the behavior. This may have been true before but we now spawn a tokio task on every server call which means the execution completes even if we terminate the connection. Added a test to confirm this behavior. There are other two places in authority_aggregator we use ContinueWithTimeout() which I am not fully certain if will benefit from this change (and hence not touching those) as we also seem to update authorities based on their response of object version they have present locally.

@sadhansood sadhansood force-pushed the sadhan/fix_auth_agg_timeout branch 3 times, most recently from 2e1a6fd to c20e66a Compare September 13, 2022 20:51
@sadhansood sadhansood marked this pull request as ready for review September 13, 2022 20:52
@mystenmark
Copy link
Contributor

This was intended behavior - we want the certificate to be executed on as many validators as possible, so we continue past the point of quorum. This is in contrast to handle_transaction, where we want to stop immediately after quorum, since you don't get any bonus points for have extra signatures in your cert.

@sadhansood
Copy link
Contributor Author

The certificate is still executed like the test proves but we don't need to wait for it on the client, no?

@mystenmark
Copy link
Contributor

The certificate is still executed like the test proves but we don't need to wait for it on the client, no?

That's because all the authorities in your test immediately process the request, so the timeout never comes into play. But if there was any trouble sending the request in the first place, then other authorities may not even receive the request if we give up early.

That said, after thinking about it some more, this may be ok. I initially thought that your comment about running the handlers in separate tasks was a red herring, because when the authority aggregator code was written we hadn't even yet realized that early client disconnects would interrupt the request handler. So, that wasn't the original justification, but it may actually be pretty good justification after all.

I agree that it is the case that any validators that at least receive the request will finish, so waiting for more than 2f+1 to reply may not be necessary.

@lxfind what do you think?

@lxfind
Copy link
Contributor

lxfind commented Sep 14, 2022

Yeah I think this is the right thing to do.
My only question is whether this test is reliable? i.e. is one yield guaranteed for the server side to finish execution?

@sadhansood
Copy link
Contributor Author

Yeah I think this is the right thing to do. My only question is whether this test is reliable? i.e. is one yield guaranteed for the server side to finish execution?

Maybe a reasonable sleep based delay will be better? I'm open to suggestions.

@lxfind
Copy link
Contributor

lxfind commented Sep 14, 2022

Yeah sleep should help. I wonder why this is causing checkpoint tests to fail


use tokio::time::Instant;

pub async fn init_network_authorities(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how's this different from init_local_authorities below?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one will spin up authorities which will exchange messages over network. The local one is updated inline. I wanted to simulate how it'll be in prod and hence used network authorities.

@longbowlu
Copy link
Contributor

longbowlu commented Sep 14, 2022

There are other two places in authority_aggregator we use ContinueWithTimeout() which I am not fully certain if will benefit from this change (and hence not touching those) as we also seem to update authorities based on their response of object version they have present locally.

IMO those functions are less relevant once gateway is gone, but let's keep their cleaning up to another PR

@sadhansood sadhansood force-pushed the sadhan/fix_auth_agg_timeout branch 4 times, most recently from 0a73242 to 1a88e82 Compare September 15, 2022 00:23
@sadhansood
Copy link
Contributor Author

sadhansood commented Sep 15, 2022

I also moved sync_certificate_to_authority to run in a tokio task so that we don't abort the syncing process mid way once we have enough votes. Note: this function is invoked when client identifies and decides to catch up a lagging authority. This may have been another reason why we had such a long post quorum timeout. Without this, client may never (or rarely) catch up lagging authorities (and just rely on checkpoints + gossip)

@sadhansood sadhansood force-pushed the sadhan/fix_auth_agg_timeout branch from 1a88e82 to a45b305 Compare September 15, 2022 06:48
@sadhansood sadhansood merged commit 18280f5 into main Sep 15, 2022
@sadhansood sadhansood deleted the sadhan/fix_auth_agg_timeout branch September 15, 2022 15:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants