-
Notifications
You must be signed in to change notification settings - Fork 11.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add consensus time metrics on SUI side #4584
Conversation
This diff adds two metrics (a) `sequencing_certificate_latency`: histogram for consensus adapter that measures latency of transaction sequencing. This should be a key metric that shows how much consensus contributes to overall transaction latency. Normally in the stress test tool we should see `shared_object_latency = owned_object_latency + sequencing_certificate_latency`. Metric that existed before, `sequencing_certificate_control_delay` is a proxy metric(it indicates consensus timeout derived from the current latency) and is hard to use directly. (b) `handle_consensus_duration_mcs` is a metric that can be used to derive the utilization of the `handle_consensus_transaction` task. The metrics is simply a sum of execution times. Based on this metric we can build a graph `rate(handle_consensus_duration_mcs) / 1_000_000`, this graph will show 'utilization' of the consensus task on sui side. If value here gets close to 1.0 it means that consensus is generating tasks faster than sui can assign shared object locks. We will need to add something similar to (b) on the narwhal side to include batch waiter etc time.
@@ -351,6 +358,9 @@ impl ConsensusAdapter { | |||
metrics | |||
.sequencing_certificate_control_delay | |||
.set(new_delay as i64); | |||
metrics | |||
.sequencing_certificate_latency |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't past just the amount of time to send a transaction to consensus?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(replied to Lu below)
metrics | ||
.sequencing_certificate_latency | ||
.observe(past_ms as f64); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I noticed that metric after I added mine. There is slight difference between them(one is pure consensus vs another adds some crypto checks), so I think we can keep both.
sequencing_certificate_latency: register_histogram_with_registry!( | ||
"sequencing_certificate_latency", | ||
"The latency for certificate sequencing.", | ||
registry, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should define buckets here until #4601 is merged (I will try to read today), the default values won't tell us much useful insight
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Let's see if we can merge better histograms today, if not i'll make PR with better buckets [this one got merged via automerge]
When handling user transaction coming from narwhal we currently do two things **before setting shared** locks: (1) Verify signatures on the transactions (2) Verify if transaction were already processed If you compare [consensus utilization](http://grafana.shared.internal.sui.io:3000/goto/e9vXAbV4k?orgId=1) and [txn verification utilization in consensus](http://grafana.shared.internal.sui.io:3000/goto/OeXu0x44z?orgId=1) you can see that essentially 99% of the consensus task is spent in signature verification. (Details on how utilization counters work are in #4584) This PR swaps those two steps - we simply do not verify signatures on the transaction that was already processed. Because signature verification dominates time in consensus task, this change will significantly increase consensus task throughput. Given that currently each node submits transaction to consensus this effect will be massive - for example **for private testnet this will increase shared txn throughput by 20x**. (We can confirm this by looking at consensus utilization after deploying this fix) We should still look into not submitting txn from each node to consensus, but this fix is still useful regardless of that.
…ed (#4949) When handling user transaction coming from narwhal we currently do two things **before setting shared** locks: (1) Verify signatures on the transactions (2) Verify if transaction were already processed If you compare [consensus utilization](http://grafana.shared.internal.sui.io:3000/goto/e9vXAbV4k?orgId=1) and [txn verification utilization in consensus](http://grafana.shared.internal.sui.io:3000/goto/OeXu0x44z?orgId=1) you can see that essentially 99% of the consensus task is spent in signature verification. (Details on how utilization counters work are in #4584) This PR swaps those two steps - we simply do not verify signatures on the transaction that was already processed. Because signature verification dominates time in consensus task, this change will significantly increase consensus task throughput. Given that currently each node submits transaction to consensus this effect will be massive - for example **for private testnet this will increase shared txn throughput by 20x**. (We can confirm this by looking at consensus utilization after deploying this fix) We should still look into not submitting txn from each node to consensus, but this fix is still useful regardless of that.
This diff adds few metrics
(a)
sequencing_certificate_latency
: histogram for consensus adapter that measures latency of transaction sequencing. This should be a key metric that shows how much consensus contributes to overall transaction latency. Normally in the stress test tool we should seeshared_object_latency = owned_object_latency + sequencing_certificate_latency
. Metric that existed before,sequencing_certificate_control_delay
is a proxy metric(it indicates consensus timeout derived from the current latency) and is hard to use directly.(b)
handle_consensus_duration_mcs
is a metric that can be used to derive the utilization of thehandle_consensus_transaction
task. The metrics is simply a sum of execution times. Based on this metric we can build a graphrate(handle_consensus_duration_mcs) / 1_000_000
, this graph will show 'utilization' of the consensus task on sui side. If value here gets close to 1.0 it means that consensus is sequencing transactions faster than sui can process them(e.g. assign shared object locks or process checkpoints).(c)
verify_narwhal_transaction_duration_mcs
similarly will show what percentage of consensus transaction execution is spent on signature verificationWe will need to add something similar to (b) on the narwhal side to include batch waiter etc time.