[optqs] bug fixes and perf improvements #15452

Open · wants to merge 12 commits into main from balaji/optqs-batch

Conversation

@ibalajiarun (Contributor) commented Dec 2, 2024

Description

This PR includes the following fixes:

  • [optqs] Support fetching batches from QC signers
    In OptQS, batches can only be fetched from the consensus proposer and the batch proposer until the block gets a QC. After the QC forms, batches should be fetched from the QC signers as well to guarantee progress (see the sketch after this list).
  • [optqs] Ability to update responders on an in-flight fetch
    Builds on the previous commit to update the responders of an in-flight fetch without waiting for the previous fetch to complete.
  • [optqs] Ignore stale proposals due to fetch lag
    If a validator fetches on the critical path of a proposal and then votes, it may already hold a SyncInfo for that round because the other validators have moved on. In such a case, the validator should ignore the proposal instead of voting. This issue came up in logs because some validators couldn't verify the Vote message: the vote round and the SyncInfo QC round were the same.
  • Enables OptQS on forge
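
To make the first fix concrete, here is a minimal sketch of the responder-selection rule; Peer and fetch_responders are hypothetical stand-ins, not the actual consensus types:

    // Sketch only: before a block has a QC, just the consensus proposer and
    // the batch proposer are known to hold the batch; once a QC forms, its
    // signers are valid fetch targets too, which is what guarantees progress.
    type Peer = u64; // stand-in for the validator identity type

    fn fetch_responders(proposer: Peer, batch_author: Peer, qc_signers: Option<&[Peer]>) -> Vec<Peer> {
        let mut responders = vec![proposer, batch_author];
        if let Some(signers) = qc_signers {
            responders.extend_from_slice(signers);
        }
        responders
    }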

trunk-io bot commented Dec 2, 2024

⏱️ 9h 40m total CI duration on this PR
Slowest 15 Jobs · Cumulative Duration · Recent Runs
test-target-determinator 1h 29m 🟩🟩🟩🟩🟩 (+16 more)
execution-performance / single-node-performance 1h 8m 🟩🟩🟩
forge-realistic-env-graceful-overload / forge 56m 🟥🟥
rust-cargo-deny 38m 🟩🟩🟩🟩🟩 (+17 more)
check-dynamic-deps 24m 🟩🟩🟩🟩🟩 (+18 more)
rust-move-tests 14m 🟩
rust-move-tests 13m 🟩
rust-move-tests 13m 🟩
rust-move-tests 13m 🟩
rust-move-tests 13m 🟩
rust-move-tests 13m
rust-move-tests 13m 🟩
rust-move-tests 13m 🟩
rust-move-tests 13m 🟩
rust-move-tests 13m 🟩

@ibalajiarun force-pushed the balaji/optqs-batch branch 2 times, most recently from 4f486f1 to b55b5f0 on December 2, 2024 23:22
@ibalajiarun added the CICD:run-forge-e2e-perf (Run the e2e perf forge only) label on Dec 2, 2024

@ibalajiarun changed the title from Balaji/optqs batch to [optqs] support fetching batches from QC signers on Dec 2, 2024

@ibalajiarun marked this pull request as ready for review on December 3, 2024 01:45

@ibalajiarun ibalajiarun force-pushed the balaji/optqs-batch branch 2 times, most recently from 3dab484 to be1a1b2 Compare December 3, 2024 19:12

// NB: this might append the same signers multiple times, but this
// should be very rare and has no negative effects.
for responders in existing_responders {
    responders.lock().append(&mut signers.clone());
}

Contributor:
it feels awkward, every retry will append duplicate signers into the responders?

Contributor Author (ibalajiarun):

Yeah, the alternative is very verbose: I'd need to check whether the vector is a subset and then add only the new entries efficiently. There's no harm in re-adding, though. It's appended at the end, so it's effectively cycling over the original vector.
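
As a hedged illustration of the verbose alternative (hypothetical helper, not the PR's code), deduplicating on append could look like this, assuming the responder list is behind a std::sync::Mutex:

    use std::sync::Mutex;

    type Peer = u64; // stand-in for the validator identity type

    // Append only the signers not already tracked, so repeated retries
    // cannot grow the responder list beyond the validator set.
    fn extend_unique(responders: &Mutex<Vec<Peer>>, new_signers: &[Peer]) {
        let mut guard = responders.lock().unwrap();
        for signer in new_signers {
            if !guard.contains(signer) {
                guard.push(*signer);
            }
        }
    }

For responder sets bounded by the validator set, the linear contains() check is cheap; that is the cost being weighed here against simply appending duplicates.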

Contributor:

I thought the vector would grow infinitely with the iterations, but it seems it can at most duplicate once?

Contributor Author (ibalajiarun):

It can grow if the fetch keeps failing forever.

Contributor:

I thought every iteration starts with the QC signers + batch signers? Or am I missing something?

                    let mut signers = signers.clone();
                    signers.append(&mut summary.signers(ordered_authors));

Contributor:

The code reads as very confusing to me. I see that batches_and_responders is re-created, but I don't see data_ptr.responders getting the same treatment, so data_ptr.responders seems to keep growing?

Contributor Author (ibalajiarun):

🤦🏽‍♂️ yeah you are right. I am not replacing data_ptr.responders. Agree, it is confusing.

Contributor Author (ibalajiarun):

Maybe I should just re-issue the request instead of trying to send the updated peers via the Arc<Mutex<>>, and accept that there might be duplicate fetches on the QS side. That should be rare anyhow, and the code would be much cleaner.

Contributor Author (ibalajiarun):

Orthogonally, maybe we should make the QS fetcher smarter and consolidate fetches of the same batch instead of issuing duplicate concurrent fetches.

Contributor Author (ibalajiarun):

Okay, I am updating the responders properly now too. It's still not straightforward code. The real fix is to change batch_requester and BatchReaderImpl to efficiently handle duplicate get requests; that way the payload manager code can be very minimal, just fetching without tracking previous requests. I can do that in the next PR, wdyt?
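
For what that consolidation might look like, a minimal sketch follows (hypothetical names and channel choice; not the actual batch_requester/BatchReaderImpl API). It keys in-flight fetches by batch digest so duplicate get requests piggyback on the existing fetch instead of issuing a new one:

    use std::collections::HashMap;
    use std::sync::{Arc, Mutex};
    use tokio::sync::broadcast;

    type Digest = [u8; 32];
    type Payload = Vec<u8>;

    #[derive(Clone, Default)]
    struct FetchCoordinator {
        // In-flight fetches keyed by batch digest.
        inflight: Arc<Mutex<HashMap<Digest, broadcast::Sender<Payload>>>>,
    }

    impl FetchCoordinator {
        async fn get(&self, digest: Digest) -> Payload {
            let mut rx = {
                let mut map = self.inflight.lock().unwrap();
                if let Some(tx) = map.get(&digest) {
                    // Duplicate request: subscribe to the fetch already in flight.
                    tx.subscribe()
                } else {
                    let (tx, rx) = broadcast::channel(1);
                    map.insert(digest, tx.clone());
                    let inflight = self.inflight.clone();
                    tokio::spawn(async move {
                        let payload = fetch_from_network(digest).await;
                        // Remove before sending so late arrivals start a fresh fetch.
                        inflight.lock().unwrap().remove(&digest);
                        let _ = tx.send(payload);
                    });
                    rx
                }
            };
            rx.recv().await.expect("fetch task dropped")
        }
    }

    async fn fetch_from_network(_digest: Digest) -> Payload {
        Vec::new() // placeholder for the real network fetch
    }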

self.payload_manager.get_transactions(block).await?;

let mut block_voters = None;
let (txns, max_txns_from_block_to_execute) = loop {

Contributor:

I'm not sure how this works? It reads like it always calls get_transactions with None.

It creates the two initial futures first (which always use None), then selects on the result and the QC; regardless of when the result comes back, it returns, so it'll never use block_voters to call get_transactions again?

Contributor Author (ibalajiarun):

Line 58 sets block_voters to Some() and continues to the next iteration of the loop. If block_qc_fut returns a QC, we extract the voters bitvec and use it for the next iteration. We only break if get_transactions returns a result.

Contributor:

oh it starts the new iteration with new get_transactions futures 🤦‍♂️

Contributor:

I figured out what confused me: the new iteration of the loop. Maybe it's less confusing to write it like this?

select! {
  qc => {
    get_transaction()?
  }
  txns = get_transactions => txns
}

Contributor Author (ibalajiarun):

I see, yeah. Updated the code.
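
For readers following along, a minimal runnable sketch of the resulting shape (toy types and hypothetical names; the real code lives in the PR's payload manager path): race the plain fetch against the block's QC future, and if the QC lands first, re-issue the fetch restricted to the QC voters:

    use tokio::sync::oneshot;

    type Voters = Vec<u8>; // toy stand-in for the QC voter bitvec

    // Placeholder for payload_manager.get_transactions(block) with an
    // optional voter filter.
    async fn fetch_txns(_voters: Option<Voters>) -> Vec<u64> {
        vec![1, 2, 3]
    }

    async fn get_txns_racing_qc(qc_rx: oneshot::Receiver<Voters>) -> Vec<u64> {
        tokio::select! {
            txns = fetch_txns(None) => txns,
            Ok(voters) = qc_rx => {
                // The QC arrived first: retry, fetching from the QC voters.
                fetch_txns(Some(voters)).await
            }
        }
    }

    #[tokio::main]
    async fn main() {
        // No QC in this toy run, so the plain fetch wins the race.
        let (_qc_tx, qc_rx) = oneshot::channel::<Voters>();
        println!("{:?}", get_txns_racing_qc(qc_rx).await);
    }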


✅ Forge suite realistic_env_max_load success on 421d5ced5cfa91af0e63c60241c0eef3086ab0d2

two traffics test: inner traffic : committed: 14683.39 txn/s, latency: 2707.42 ms, (p50: 2700 ms, p70: 2700, p90: 3000 ms, p99: 3300 ms), latency samples: 5582940
two traffics test : committed: 99.99 txn/s, latency: 1351.68 ms, (p50: 1300 ms, p70: 1400, p90: 1500 ms, p99: 4600 ms), latency samples: 1720
Latency breakdown for phase 0: ["MempoolToBlockCreation: max: 1.607, avg: 1.535", "ConsensusProposalToOrdered: max: 0.304, avg: 0.292", "ConsensusOrderedToCommit: max: 0.312, avg: 0.305", "ConsensusProposalToCommit: max: 0.602, avg: 0.598"]
Max non-epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.74s no progress at version 38175 (avg 0.20s) [limit 15].
Max epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.99s no progress at version 6317888 (avg 0.77s) [limit 16].
Test Ok


✅ Forge suite compat success on 6593fb81261f25490ffddc2252a861c994234c2a ==> 421d5ced5cfa91af0e63c60241c0eef3086ab0d2

Compatibility test results for 6593fb81261f25490ffddc2252a861c994234c2a ==> 421d5ced5cfa91af0e63c60241c0eef3086ab0d2 (PR)
1. Check liveness of validators at old version: 6593fb81261f25490ffddc2252a861c994234c2a
compatibility::simple-validator-upgrade::liveness-check : committed: 17097.08 txn/s, latency: 1981.39 ms, (p50: 1800 ms, p70: 1900, p90: 2500 ms, p99: 4200 ms), latency samples: 570240
2. Upgrading first Validator to new version: 421d5ced5cfa91af0e63c60241c0eef3086ab0d2
compatibility::simple-validator-upgrade::single-validator-upgrading : committed: 7696.11 txn/s, latency: 3889.39 ms, (p50: 4500 ms, p70: 4700, p90: 4800 ms, p99: 4800 ms), latency samples: 143900
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 7571.00 txn/s, latency: 4408.35 ms, (p50: 4800 ms, p70: 4900, p90: 4900 ms, p99: 5300 ms), latency samples: 258600
3. Upgrading rest of first batch to new version: 421d5ced5cfa91af0e63c60241c0eef3086ab0d2
compatibility::simple-validator-upgrade::half-validator-upgrading : committed: 7471.83 txn/s, latency: 4009.43 ms, (p50: 4700 ms, p70: 5000, p90: 5100 ms, p99: 5200 ms), latency samples: 135060
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 7574.82 txn/s, latency: 4427.45 ms, (p50: 4800 ms, p70: 4900, p90: 5000 ms, p99: 5400 ms), latency samples: 255360
4. upgrading second batch to new version: 421d5ced5cfa91af0e63c60241c0eef3086ab0d2
compatibility::simple-validator-upgrade::rest-validator-upgrading : committed: 12437.02 txn/s, latency: 2350.07 ms, (p50: 2600 ms, p70: 2800, p90: 2900 ms, p99: 2900 ms), latency samples: 215980
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 12237.60 txn/s, latency: 2674.85 ms, (p50: 2800 ms, p70: 2900, p90: 3100 ms, p99: 3200 ms), latency samples: 396040
5. check swarm health
Compatibility test for 6593fb81261f25490ffddc2252a861c994234c2a ==> 421d5ced5cfa91af0e63c60241c0eef3086ab0d2 passed
Test Ok
