[move] Benchmarking historical transactions #15329
Conversation
Nice work!
Left a bunch of comments to consider.
```toml
[package]
name = "aptos-replay-benchmark"
version = "0.1.0"
description = "A tool to replay and locally benchmark on-chain transactions."
```
Is it possible to add any tests at all (that also run in the CI), so that we know if other changes cause this tool to break?
I think this

```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn verify_tool() {
        use clap::CommandFactory;
        Command::command().debug_assert();
    }
}
```

allows us to test commands exhaustively. For CI testing, the plan is to actually use this tool on CI to replay txns and benchmark performance (to track regressions, etc., similar to the single-node-performance suite we have now), so once that is in place, the tool will be "tested" because we use it. Will add a TODO for now, as I will probably add a few CLI options to download and read transactions/state overrides.
Would it make sense to also add an option to profile gas locally, so that one can look at gas cost distribution and trace in more depth (similar to https://aptos.dev/en/build/cli/working-with-move-contracts/local-simulation-benchmarking-and-gas-profiling#overview and #15304)?
Yes, I think our tooling would benefit from more unified CLI features (e.g., here, comparison-e2e-testing can re-use `Diff`s, etc.). For gas it is a bit tricky here because the gas profiler is created inside the VM, so the block executor does not have any context (unless we create a special entry point).
In the future I would say we want to unify this, log gas/sec in addition to pure time, etc.
This is great!
I'd add a README.md file to `aptos-replay-benchmark`.
Also, we had a discussion before about no longer using the `aptos-` prefix for `aptos-move` subdirectories (as it's just noise); the consistency is already broken, so we can continue with non-`aptos-`-prefixed ones.
```rust
/// Holds configuration for running the benchmarks and measuring the time taken.
pub struct BenchmarkRunner {
    concurrency_levels: Vec<usize>,
    num_repeats: usize,
```
You might want to have `num_warmup_repeats` and `num_measured_repeats` here for more flexible measurements.
Good point - I think also for warm-up txns? We probably want to execute A to B, and then B to C, and measure B to C only, so we simulate a "cached" environment.
Yes, that's what I was suggesting: first `num_warmup_repeats` (A to B), then the measured ones (B+1 to C). We can then simulate a cached environment, as well as a fully cold one. We might also want the warm-up count to be large enough for CPUs with speed stepping...
Added that in a slightly different way: the user can specify how many blocks to skip. This way we avoid weird cases where versions are not block boundaries (which we probably should enforce).
I guess for good benchmarks we do want to disable dynamic frequency scaling anyway.
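For illustration, here is a minimal sketch of the warm-up/measured split discussed above; all field and type names are hypothetical stand-ins, not what the PR actually implements:

```rust
use std::time::Instant;

// Stub standing in for the PR's block type; illustrative only.
struct Block;

impl Block {
    fn run(&self, _concurrency_level: usize) {
        // Execute all transactions in the block.
    }
}

// Hypothetical split of `num_repeats` into warm-up and measured phases.
pub struct BenchmarkRunner {
    concurrency_levels: Vec<usize>,
    num_warmup_repeats: usize,
    num_measured_repeats: usize,
}

impl BenchmarkRunner {
    pub fn bench(&self, blocks: &[Block]) {
        for &concurrency_level in &self.concurrency_levels {
            // Warm-up runs populate caches (modules, environment) and give
            // frequency-scaling CPUs time to settle; they are not timed.
            for _ in 0..self.num_warmup_repeats {
                blocks.iter().for_each(|b| b.run(concurrency_level));
            }
            // Only these runs contribute to the reported numbers.
            for _ in 0..self.num_measured_repeats {
                let start = Instant::now();
                blocks.iter().for_each(|b| b.run(concurrency_level));
                println!(
                    "concurrency {}: {} ms",
                    concurrency_level,
                    start.elapsed().as_millis()
                );
            }
        }
    }
}
```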
```rust
let start_time = Instant::now();
block.run(&executor, concurrency_level);
let time = start_time.elapsed().as_millis();
println!(
```
optional - maybe log instead of println!
The reason why I avoid logs is that they are very verbose on the parallel execution/API/storage side: we log when speculative logs are flushed, when Block-STM finishes execution, etc. - hence the default Error log level.
Do you mean log to a file? Ideally, we can have an option to save results in a CSV so we can re-use them, but printing seems good enough, as we can have a script invoking the executable and piping stdout into a file in the right format.
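For what it's worth, a tiny sketch of the stdout-as-CSV idea; the binary name and column layout below are made up:

```rust
// Illustrative only: emit one CSV row per measurement, so that
// `replay-benchmark ... > results.csv` (hypothetical invocation)
// captures everything in a reusable format.
fn report_csv(concurrency_level: usize, repeat: usize, time_ms: u128) {
    // columns: concurrency,repeat,time_ms
    println!("{},{},{}", concurrency_level, repeat, time_ms);
}
```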
Yeah, ideally our logger could have specialized streams, but alas it doesn't.
great work
```rust
pub(crate) fn get_state_override(
    &self,
    state_view: &impl StateView,
) -> HashMap<StateKey, StateValue> {
```
I think one can argue that we should return a pair of state override key and value, since it's always exactly one. A HashMap makes this aspect confusing, and the whole override could potentially include more things than features, I suppose (now or later).
I plan to extend it beyond features, hence the hashmap. Ideally, reading from a path where we previously generated and saved overrides for state keys, e.g., framework code.
Gas feature version is also interesting, but it affects the behaviour more. The plan is to allow users to not consider outputs as different if the diffs only contain gas-payment-related changes.
If we aren't going to extend it immediately, it makes sense to have a restricted API and then extend it with that PR, but feel free to keep it as is - I don't think it's too important here. My main concern was that using the same type for partial and complete overrides can be confusing, and is better to avoid as long as possible.
What do you mean by partial / complete override?
Imagine we have multiple pieces that can override the state. Say I had two functions, for different use-cases/reasons, and one of them only produced a single key-value pair. It seems safer to return (key, value) from that API, as there is then no chance to confuse that (partial) override with the final collection that's used to override (the HashMap). But if there are also APIs that return similar collections, etc., the distinction becomes not that clear or useful.
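A small sketch of the distinction being argued for, with stub types standing in for the real `StateKey`/`StateValue`:

```rust
use std::collections::HashMap;

// Stubs standing in for the real aptos-types; illustrative only.
#[derive(PartialEq, Eq, Hash)]
struct StateKey;
struct StateValue;

// A producer of a single override is unambiguous about its shape...
fn features_override() -> (StateKey, StateValue) {
    (StateKey, StateValue)
}

// ...while only the final, assembled override is a collection.
fn assemble_overrides(parts: Vec<(StateKey, StateValue)>) -> HashMap<StateKey, StateValue> {
    parts.into_iter().collect()
}
```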
Ah, I see! Let's land this - I am about to create a follow-up anyway to read overrides from state, so that we can override the framework, and I will rework some of those APIs. Basically, I think we can have options to (see the sketch after this list):

- Override features (CLI)
- Override gas feature version (CLI)
- Override compiled modules (point to a compiled Move package) (CLI)
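A rough sketch of what those options could look like with a clap derive; every flag name here is hypothetical, not what the follow-up actually adds:

```rust
use clap::Parser;
use std::path::PathBuf;

/// Hypothetical override flags; illustrative only.
#[derive(Parser)]
struct OverrideArgs {
    /// Features to force-enable, by name.
    #[clap(long)]
    enable_features: Vec<String>,
    /// Features to force-disable, by name.
    #[clap(long)]
    disable_features: Vec<String>,
    /// Gas feature version to run with instead of the on-chain one.
    #[clap(long)]
    gas_feature_version: Option<u64>,
    /// Paths to compiled Move packages whose modules override on-chain code.
    #[clap(long)]
    override_packages: Vec<PathBuf>,
}
```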
Go go
Description
This PR introduces a tool to benchmark past transactions (and correctly, unlike the existing `aptos-debugger move execute-past-transactions ...`).

Summary
The Aptos debugger tool uses an RPC under the hood when running transactions, which means state view access latencies can be huge. Also, it only executes transactions in blocks, without shared caches across blocks like the real executor has.
This PR introduces a better tool to benchmark execution of past transactions. The user has to provide the first and last versions of the interval that needs to be benchmarked. The tool partitions transactions in the specified closed interval into blocks, and runs all blocks end-to-end, measuring the overall time. During this run, the executor is shared (and so are the environment and module caches).
There is no commit here, only execution time. For each block, we maintain the state (a read-set estimated from one run before the benchmarks) on top of which it should run. And so, the benchmark just runs the blocks in sequence on top of their initial states (outputs are only used for comparison against the on-chain data).
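A rough sketch of this flow (every type and function below is an illustrative stub, not the tool's actual API):

```rust
use std::time::Instant;

// Stubs standing in for the real types; everything here is illustrative.
struct Txn;
struct Block(Vec<Txn>);
struct State; // per-block initial state, estimated from a pre-run's read-set
struct Output;
struct SharedExecutor; // shared across blocks: environment and module caches

impl SharedExecutor {
    fn execute(&self, _block: &Block, _state: &State) -> Output {
        Output
    }
}

fn fetch_transactions(_begin: u64, _end: u64) -> Vec<Txn> { vec![] }
fn partition_into_blocks(txns: Vec<Txn>) -> Vec<Block> { vec![Block(txns)] }
fn estimate_read_set(_block: &Block) -> State { State }
fn compare_with_onchain(_output: &Output) { /* diff against on-chain outputs */ }

fn benchmark(begin_version: u64, end_version: u64) {
    // Partition the closed version interval into blocks.
    let blocks = partition_into_blocks(fetch_transactions(begin_version, end_version));

    // One pre-run per block estimates its read-set; those key-value pairs
    // become the fixed initial state each block is later run on top of.
    let states: Vec<State> = blocks.iter().map(estimate_read_set).collect();

    // Run blocks in sequence on a shared executor; only time is recorded,
    // nothing is committed, and outputs are only compared with on-chain data.
    let executor = SharedExecutor;
    for (block, state) in blocks.iter().zip(&states) {
        let start = Instant::now();
        let output = executor.execute(block, state);
        println!("block time: {} ms", start.elapsed().as_millis());
        compare_with_onchain(&output);
    }
}
```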
The tool allows one to override configs to experiment with new features, or with what execution would look like without some features. For now, we support:

- enabling features
- disabling features

In the future, we can add more overrides: gas schedule, modules, etc.
It also computes the diffs between the expected and the new (overridden) outputs, e.g.:
Example
For example, say we have a new feature, e.g., `ENABLE_LOADER_V2`. We can benchmark how historical transactions perform with this flag on and off.

The flag is off by default:

With the flag on:

Great - we can now quantify the effect of the feature on runtime.
Other related changes
- Changed the logger's `Level` to be usable from the CLI. The behaviour should be the same.
- Replaced `BlockAptosVM::execute_block` with `AptosVMBlockExecutor::new().execute_block` where possible (benchmark, debugger), so that we use the high-level wrapper and not the inner type.
Manually running benchmarks.
Key Areas to Review
N/A; probably checking that the logger's `Level` is still correct.