[optqs] bug fixes and perf improvements #15452

Open · wants to merge 12 commits into main from balaji/optqs-batch

Conversation

@ibalajiarun (Contributor) commented Dec 2, 2024

Description

This PR includes the following fixes:

  • [optqs] Support fetching batches from QC signers
    In OptQS, batches can only be fetched from the consensus proposer and the batch proposer until the block gets a QC. After the QC forms, batches should be fetched from the QC signers as well to guarantee progress (see the sketch after this list).
  • [optqs] Ability to update responders on an in-flight fetch
    Builds on the previous commit to update the responders of an in-flight fetch without waiting for the previous fetch to complete.
  • [optqs] Ignore stale proposals due to fetch lag
    If a validator fetches on the critical path of a proposal and then votes, it may already hold a SyncInfo for that round because the other validators have moved on. In such a case, the validator should ignore the proposal instead of voting. This issue came up in logs because some validators couldn't verify the Vote message: the vote round and the SyncInfo QC round were the same.
  • Enables OptQS on forge
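
To make the first fix concrete, here is a minimal sketch of the responder-selection rule; Peer and fetch_responders are hypothetical stand-ins, not the actual consensus types:

    // Sketch only: before a block has a QC, just the consensus proposer and
    // the batch proposer are known to hold the batch; once a QC forms, its
    // signers are valid fetch targets too, which is what guarantees progress.
    type Peer = u64; // stand-in for the validator identity type

    fn fetch_responders(proposer: Peer, batch_author: Peer, qc_signers: Option<&[Peer]>) -> Vec<Peer> {
        let mut responders = vec![proposer, batch_author];
        if let Some(signers) = qc_signers {
            responders.extend_from_slice(signers);
        }
        responders
    }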

trunk-io bot commented Dec 2, 2024

⏱️ 9h 40m total CI duration on this PR
Slowest 15 Jobs · Cumulative Duration · Recent Runs
test-target-determinator 1h 29m 🟩🟩🟩🟩🟩 (+16 more)
execution-performance / single-node-performance 1h 8m 🟩🟩🟩
forge-realistic-env-graceful-overload / forge 56m 🟥🟥
rust-cargo-deny 38m 🟩🟩🟩🟩🟩 (+17 more)
check-dynamic-deps 24m 🟩🟩🟩🟩🟩 (+18 more)
rust-move-tests 14m 🟩
rust-move-tests 13m 🟩
rust-move-tests 13m 🟩
rust-move-tests 13m 🟩
rust-move-tests 13m 🟩
rust-move-tests 13m
rust-move-tests 13m 🟩
rust-move-tests 13m 🟩
rust-move-tests 13m 🟩
rust-move-tests 13m 🟩

@ibalajiarun force-pushed the balaji/optqs-batch branch 2 times, most recently from 4f486f1 to b55b5f0 on December 2, 2024 23:22
@ibalajiarun added the CICD:run-forge-e2e-perf (Run the e2e perf forge only) label on Dec 2, 2024

@ibalajiarun changed the title from Balaji/optqs batch to [optqs] support fetching batches from QC signers on Dec 2, 2024

@ibalajiarun marked this pull request as ready for review on December 3, 2024 01:45

@ibalajiarun ibalajiarun force-pushed the balaji/optqs-batch branch 2 times, most recently from 3dab484 to be1a1b2 Compare December 3, 2024 19:12

// NB: this might append the same signers multiple times, but this
// should be very rare and has no negative effects.
for responders in existing_responders {
    responders.lock().append(&mut signers.clone());
}

Contributor:
it feels awkward, every retry will append duplicate signers into the responders?

Contributor Author (ibalajiarun):

Yeah, the alternative is very verbose: I'd need to check whether the vector is a subset and then add only the new entries efficiently. There's no harm in re-adding, though. It's appended at the end, so it's effectively cycling over the original vector.
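
As a hedged illustration of the verbose alternative (hypothetical helper, not the PR's code), deduplicating on append could look like this, assuming the responder list is behind a std::sync::Mutex:

    use std::sync::Mutex;

    type Peer = u64; // stand-in for the validator identity type

    // Append only the signers not already tracked, so repeated retries
    // cannot grow the responder list beyond the validator set.
    fn extend_unique(responders: &Mutex<Vec<Peer>>, new_signers: &[Peer]) {
        let mut guard = responders.lock().unwrap();
        for signer in new_signers {
            if !guard.contains(signer) {
                guard.push(*signer);
            }
        }
    }

For responder sets bounded by the validator set, the linear contains() check is cheap; that is the cost being weighed here against simply appending duplicates.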

Contributor:

I thought the vector would grow infinitely with the iterations, but it seems it can at most duplicate once?

Contributor Author (ibalajiarun):

It can grow if the fetch keeps failing forever.

Contributor:

I thought every iteration starts with the QC signers + batch signers? Or am I missing something?

                    let mut signers = signers.clone();
                    signers.append(&mut summary.signers(ordered_authors));

Contributor:

The code reads as very confusing to me. I see that batches_and_responders is re-created, but I don't see data_ptr.responders getting the same treatment, so data_ptr.responders seems to keep growing?

Contributor Author (ibalajiarun):

🤦🏽‍♂️ yeah you are right. I am not replacing data_ptr.responders. Agree, it is confusing.

Contributor Author (ibalajiarun):

Maybe I should just re-issue the request instead of trying to send the updated peers via the Arc<Mutex<>>, and accept that there might be duplicate fetches on the QS side. That should be rare anyhow, and the code would be much cleaner.

Contributor Author (ibalajiarun):

Orthogonally, maybe we should make the QS fetcher smarter and consolidate fetches of the same batch instead of issuing duplicate concurrent fetches.

Contributor Author (ibalajiarun):

Okay, I am updating the responders properly now too. It's still not straightforward code. The real fix is to change batch_requester and BatchReaderImpl to efficiently handle duplicate get requests; that way the payload manager code can be very minimal, just fetching without tracking previous requests. I can do that in the next PR, wdyt?
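
For what that consolidation might look like, a minimal sketch follows (hypothetical names and channel choice; not the actual batch_requester/BatchReaderImpl API). It keys in-flight fetches by batch digest so duplicate get requests piggyback on the existing fetch instead of issuing a new one:

    use std::collections::HashMap;
    use std::sync::{Arc, Mutex};
    use tokio::sync::broadcast;

    type Digest = [u8; 32];
    type Payload = Vec<u8>;

    #[derive(Clone, Default)]
    struct FetchCoordinator {
        // In-flight fetches keyed by batch digest.
        inflight: Arc<Mutex<HashMap<Digest, broadcast::Sender<Payload>>>>,
    }

    impl FetchCoordinator {
        async fn get(&self, digest: Digest) -> Payload {
            let mut rx = {
                let mut map = self.inflight.lock().unwrap();
                if let Some(tx) = map.get(&digest) {
                    // Duplicate request: subscribe to the fetch already in flight.
                    tx.subscribe()
                } else {
                    let (tx, rx) = broadcast::channel(1);
                    map.insert(digest, tx.clone());
                    let inflight = self.inflight.clone();
                    tokio::spawn(async move {
                        let payload = fetch_from_network(digest).await;
                        // Remove before sending so late arrivals start a fresh fetch.
                        inflight.lock().unwrap().remove(&digest);
                        let _ = tx.send(payload);
                    });
                    rx
                }
            };
            rx.recv().await.expect("fetch task dropped")
        }
    }

    async fn fetch_from_network(_digest: Digest) -> Payload {
        Vec::new() // placeholder for the real network fetch
    }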

self.payload_manager.get_transactions(block).await?;

let mut block_voters = None;
let (txns, max_txns_from_block_to_execute) = loop {

Contributor:

I'm not sure how this works? It reads like it always calls get_transactions with None.

It creates the two initial futures first (which always use None), then selects on the result and the QC; regardless of when the result comes back, it returns, so it'll never use block_voters to call get_transactions again?

Contributor Author (ibalajiarun):

Line 58 sets block_voters to Some() and continues to the next iteration of the loop. If block_qc_fut returns a QC, we extract the voters bitvec and use it for the next iteration. We only break if get_transactions returns a result.

Contributor:

oh it starts the new iteration with new get_transactions futures 🤦‍♂️

Contributor:

I figured out what confused me: the new iteration of the loop. Maybe it's less confusing to write it like this?

select! {
  qc => {
    get_transaction()?
  }
  txns = get_transactions => txns
}

Contributor Author (ibalajiarun):

I see, yeah. Updated the code.
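
For readers following along, a minimal runnable sketch of the resulting shape (toy types and hypothetical names; the real code lives in the PR's payload manager path): race the plain fetch against the block's QC future, and if the QC lands first, re-issue the fetch restricted to the QC voters:

    use tokio::sync::oneshot;

    type Voters = Vec<u8>; // toy stand-in for the QC voter bitvec

    // Placeholder for payload_manager.get_transactions(block) with an
    // optional voter filter.
    async fn fetch_txns(_voters: Option<Voters>) -> Vec<u64> {
        vec![1, 2, 3]
    }

    async fn get_txns_racing_qc(qc_rx: oneshot::Receiver<Voters>) -> Vec<u64> {
        tokio::select! {
            txns = fetch_txns(None) => txns,
            Ok(voters) = qc_rx => {
                // The QC arrived first: retry, fetching from the QC voters.
                fetch_txns(Some(voters)).await
            }
        }
    }

    #[tokio::main]
    async fn main() {
        // No QC in this toy run, so the plain fetch wins the race.
        let (_qc_tx, qc_rx) = oneshot::channel::<Voters>();
        println!("{:?}", get_txns_racing_qc(qc_rx).await);
    }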


✅ Forge suite realistic_env_max_load success on 421d5ced5cfa91af0e63c60241c0eef3086ab0d2

two traffics test: inner traffic : committed: 14683.39 txn/s, latency: 2707.42 ms, (p50: 2700 ms, p70: 2700, p90: 3000 ms, p99: 3300 ms), latency samples: 5582940
two traffics test : committed: 99.99 txn/s, latency: 1351.68 ms, (p50: 1300 ms, p70: 1400, p90: 1500 ms, p99: 4600 ms), latency samples: 1720
Latency breakdown for phase 0: ["MempoolToBlockCreation: max: 1.607, avg: 1.535", "ConsensusProposalToOrdered: max: 0.304, avg: 0.292", "ConsensusOrderedToCommit: max: 0.312, avg: 0.305", "ConsensusProposalToCommit: max: 0.602, avg: 0.598"]
Max non-epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.74s no progress at version 38175 (avg 0.20s) [limit 15].
Max epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.99s no progress at version 6317888 (avg 0.77s) [limit 16].
Test Ok


✅ Forge suite compat success on 6593fb81261f25490ffddc2252a861c994234c2a ==> 421d5ced5cfa91af0e63c60241c0eef3086ab0d2

Compatibility test results for 6593fb81261f25490ffddc2252a861c994234c2a ==> 421d5ced5cfa91af0e63c60241c0eef3086ab0d2 (PR)
1. Check liveness of validators at old version: 6593fb81261f25490ffddc2252a861c994234c2a
compatibility::simple-validator-upgrade::liveness-check : committed: 17097.08 txn/s, latency: 1981.39 ms, (p50: 1800 ms, p70: 1900, p90: 2500 ms, p99: 4200 ms), latency samples: 570240
2. Upgrading first Validator to new version: 421d5ced5cfa91af0e63c60241c0eef3086ab0d2
compatibility::simple-validator-upgrade::single-validator-upgrading : committed: 7696.11 txn/s, latency: 3889.39 ms, (p50: 4500 ms, p70: 4700, p90: 4800 ms, p99: 4800 ms), latency samples: 143900
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 7571.00 txn/s, latency: 4408.35 ms, (p50: 4800 ms, p70: 4900, p90: 4900 ms, p99: 5300 ms), latency samples: 258600
3. Upgrading rest of first batch to new version: 421d5ced5cfa91af0e63c60241c0eef3086ab0d2
compatibility::simple-validator-upgrade::half-validator-upgrading : committed: 7471.83 txn/s, latency: 4009.43 ms, (p50: 4700 ms, p70: 5000, p90: 5100 ms, p99: 5200 ms), latency samples: 135060
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 7574.82 txn/s, latency: 4427.45 ms, (p50: 4800 ms, p70: 4900, p90: 5000 ms, p99: 5400 ms), latency samples: 255360
4. upgrading second batch to new version: 421d5ced5cfa91af0e63c60241c0eef3086ab0d2
compatibility::simple-validator-upgrade::rest-validator-upgrading : committed: 12437.02 txn/s, latency: 2350.07 ms, (p50: 2600 ms, p70: 2800, p90: 2900 ms, p99: 2900 ms), latency samples: 215980
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 12237.60 txn/s, latency: 2674.85 ms, (p50: 2800 ms, p70: 2900, p90: 3100 ms, p99: 3200 ms), latency samples: 396040
5. check swarm health
Compatibility test for 6593fb81261f25490ffddc2252a861c994234c2a ==> 421d5ced5cfa91af0e63c60241c0eef3086ab0d2 passed
Test Ok
