Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add consensus time metrics on SUI side #4584

Merged
merged 1 commit into from
Sep 14, 2022
Merged

Add consensus time metrics on SUI side #4584

merged 1 commit into from
Sep 14, 2022

Conversation

andll
Copy link
Contributor

@andll andll commented Sep 12, 2022

This diff adds few metrics
(a) sequencing_certificate_latency: histogram for consensus adapter that measures latency of transaction sequencing. This should be a key metric that shows how much consensus contributes to overall transaction latency. Normally in the stress test tool we should see shared_object_latency = owned_object_latency + sequencing_certificate_latency. Metric that existed before, sequencing_certificate_control_delay is a proxy metric(it indicates consensus timeout derived from the current latency) and is hard to use directly.

(b) handle_consensus_duration_mcs is a metric that can be used to derive the utilization of the handle_consensus_transaction task. The metrics is simply a sum of execution times. Based on this metric we can build a graph rate(handle_consensus_duration_mcs) / 1_000_000, this graph will show 'utilization' of the consensus task on sui side. If value here gets close to 1.0 it means that consensus is sequencing transactions faster than sui can process them(e.g. assign shared object locks or process checkpoints).

(c) verify_narwhal_transaction_duration_mcs similarly will show what percentage of consensus transaction execution is spent on signature verification

We will need to add something similar to (b) on the narwhal side to include batch waiter etc time.

@andll andll enabled auto-merge (squash) September 12, 2022 21:26
This diff adds two metrics
(a) `sequencing_certificate_latency`: histogram for consensus adapter that measures latency of transaction sequencing. This should be a key metric that shows how much consensus contributes to overall transaction latency. Normally in the stress test tool we should see `shared_object_latency = owned_object_latency + sequencing_certificate_latency`.
Metric that existed before, `sequencing_certificate_control_delay` is a proxy metric(it indicates consensus timeout derived from the current latency) and is hard to use directly.

(b) `handle_consensus_duration_mcs` is a metric that can be used to derive the utilization of the `handle_consensus_transaction` task. The metrics is simply a sum of execution times. Based on this metric we can build a graph `rate(handle_consensus_duration_mcs) / 1_000_000`, this graph will show 'utilization' of the consensus task on sui side. If value here gets close to 1.0 it means that consensus is generating tasks faster than sui can assign shared object locks.

We will need to add something similar to (b) on the narwhal side to include batch waiter etc time.
@@ -351,6 +358,9 @@ impl ConsensusAdapter {
metrics
.sequencing_certificate_control_delay
.set(new_delay as i64);
metrics
.sequencing_certificate_latency
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't past just the amount of time to send a transaction to consensus?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(replied to Lu below)

Comment on lines +361 to +363
metrics
.sequencing_certificate_latency
.observe(past_ms as f64);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I noticed that metric after I added mine. There is slight difference between them(one is pure consensus vs another adds some crypto checks), so I think we can keep both.

sequencing_certificate_latency: register_histogram_with_registry!(
"sequencing_certificate_latency",
"The latency for certificate sequencing.",
registry,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should define buckets here until #4601 is merged (I will try to read today), the default values won't tell us much useful insight

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Let's see if we can merge better histograms today, if not i'll make PR with better buckets [this one got merged via automerge]

@andll andll merged commit 5a09dee into main Sep 14, 2022
@andll andll deleted the andrey-22 branch September 14, 2022 20:20
andll added a commit that referenced this pull request Oct 3, 2022
When handling user transaction coming from narwhal we currently do two things **before setting shared** locks:

(1) Verify signatures on the transactions
(2) Verify if transaction were already processed

If you compare [consensus utilization](http://grafana.shared.internal.sui.io:3000/goto/e9vXAbV4k?orgId=1) and [txn verification utilization in consensus](http://grafana.shared.internal.sui.io:3000/goto/OeXu0x44z?orgId=1) you can see that essentially 99% of the consensus task is spent in signature verification. (Details on how utilization counters work are in #4584)

This PR swaps those two steps - we simply do not verify signatures on the transaction that was already processed.

Because signature verification dominates time in consensus task, this change will significantly increase consensus task throughput.

Given that currently each node submits transaction to consensus this effect will be massive - for example **for private testnet this will increase shared txn throughput by 20x**. (We can confirm this by looking at consensus utilization after deploying this fix)

We should still look into not submitting txn from each node to consensus, but this fix is still useful regardless of that.
andll added a commit that referenced this pull request Oct 3, 2022
…ed (#4949)

When handling user transaction coming from narwhal we currently do two things **before setting shared** locks:

(1) Verify signatures on the transactions
(2) Verify if transaction were already processed

If you compare [consensus utilization](http://grafana.shared.internal.sui.io:3000/goto/e9vXAbV4k?orgId=1) and [txn verification utilization in consensus](http://grafana.shared.internal.sui.io:3000/goto/OeXu0x44z?orgId=1) you can see that essentially 99% of the consensus task is spent in signature verification. (Details on how utilization counters work are in #4584)

This PR swaps those two steps - we simply do not verify signatures on the transaction that was already processed.

Because signature verification dominates time in consensus task, this change will significantly increase consensus task throughput.

Given that currently each node submits transaction to consensus this effect will be massive - for example **for private testnet this will increase shared txn throughput by 20x**. (We can confirm this by looking at consensus utilization after deploying this fix)

We should still look into not submitting txn from each node to consensus, but this fix is still useful regardless of that.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants