Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Assembler: merge adjacent basic blocks #1454

Merged
merged 28 commits into from
Aug 20, 2024
Merged

Conversation

plafer
Copy link
Contributor

@plafer plafer commented Aug 14, 2024

Closes #1429

The current solution leaves the old unused basic blocks in the MastForest. I could not think of a simple/clean way to avoid this. We now run dead code elimination as well.

@plafer plafer requested review from bitwalker and bobbinth August 14, 2024 21:17
@bobbinth
Copy link
Contributor

Thank you! Not a review yet - but one thought that came to mind: I wonder if it may be better to do this as a separate pass once MastForest has been fully built. It would be less efficient, but this is probably not super important as compilation is not on a "critical path".

At the same time, it would make the code more modular and potentially may yield better results. Also, we can extend this approach to the unused node elimination. That is, we first build the MAST, then do a separate pass to merge contiguous basic blocks, then do another pass to eliminate unused notes. In the future, this can be extended with additional passes for various optimization purposes.

@plafer
Copy link
Contributor Author

plafer commented Aug 15, 2024

I thought about it a bit more, and it's simpler to keep the current implementation, and add a dead code elimination pass that will remove all unused nodes. Otherwise, it's quite complex to merge basic blocks once you have the MastForest, since you need to undo the tree of JOINs that we create to merge basic blocks, create a new one from the new set of basic blocks, and then correctly update all references to old MastNodeIds.

I'll put this back into draft to implement a dead code elimination pass.

@plafer plafer marked this pull request as draft August 15, 2024 12:40
@plafer plafer marked this pull request as ready for review August 15, 2024 17:41
@plafer plafer force-pushed the plafer-1429-merge-adjacent-blocks branch from e38cde2 to f9c14cc Compare August 15, 2024 17:43
@plafer
Copy link
Contributor Author

plafer commented Aug 15, 2024

Added dead code elimination, which makes the assembler 5x slower (or more). I will investigate if there are some easy optimizations I can make, but we should also consider making dead code elimination optional (if only to make tests run faster)

UPDATE: Actually the program_compilation benchmark is virtually unchanged. I observed the ~5x slowdown from running proptests - no longer sure why this is happening

UPDATE: pushed an optimization where we only run dead code elimination if we detect any dead code. now the program_compilation benchmark is exactly unchanged. I also flamegraphed the clz_proptest test, and 80% of the time is spent hashing as a result of deserialization (#1391).

@plafer plafer force-pushed the plafer-1429-merge-adjacent-blocks branch 2 times, most recently from c33a67a to c00fef2 Compare August 15, 2024 18:35
@plafer plafer force-pushed the plafer-1429-merge-adjacent-blocks branch from c00fef2 to aaf0212 Compare August 16, 2024 11:04
Copy link
Contributor

@bobbinth bobbinth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Looks good! I left some comments inline, but also several general comments:

  1. It seems like the size of the serialized standard library went up from about 250KB to 1.2MB. I was expecting it it to go down - so, curious why that happened.
  2. Do you know how this affected the runtime of our usual BLAKE3 example? I'm mostly curious how the cycle counts changed.
  3. I think we probably should update the program_compilation benchmark. I think right now it benchmarks primarily the Assembler::add_library() method since the program we are compiling is very simple (basically, it results in an empty MAST forest). I think instead, we should benchmark compilation of a specific module in the standard library (e.g., maybe the same sha256 module) - or maybe we can benchmark compilation of the entire standard library.

assembly/src/assembler/basic_block_builder.rs Show resolved Hide resolved
assembly/src/assembler/instruction/mod.rs Show resolved Hide resolved
core/src/mast/node/mod.rs Outdated Show resolved Hide resolved
assembly/src/assembler/mod.rs Outdated Show resolved Hide resolved
assembly/src/assembler/mod.rs Outdated Show resolved Hide resolved
assembly/src/assembler/mod.rs Outdated Show resolved Hide resolved
assembly/src/assembler/dead_code_elimination.rs Outdated Show resolved Hide resolved
assembly/src/assembler/dead_code_elimination.rs Outdated Show resolved Hide resolved
assembly/src/assembler/dead_code_elimination.rs Outdated Show resolved Hide resolved
@bobbinth
Copy link
Contributor

Also - let's rebase this on the main branch. We can merge it there and then release it as v0.10.4 - this way, we'll realize the benefit of this PR in miden-base immediately.

@bobbinth
Copy link
Contributor

It seems like the size of the serialized standard library went up from about 250KB to 1.2MB. I was expecting it it to go down - so, curious why that happened.

Actually, thinking about this more - the reason for size increase could be due to inlining of procedures. Though, 5x increase seems quite high.

@plafer plafer force-pushed the plafer-1429-merge-adjacent-blocks branch from 15ec461 to 2e4b889 Compare August 19, 2024 19:55
@plafer plafer force-pushed the plafer-1429-merge-adjacent-blocks branch from 2e4b889 to a71c792 Compare August 19, 2024 20:03
@plafer
Copy link
Contributor Author

plafer commented Aug 19, 2024

Benchmark of MIDEN_LOG=info RAYON_NUM_THREADS=16 ./target/optimized/miden example --recursive blake3 -n 100:

branch: next

proving time: 12470 ms

============================================================
Generated a program to compute 100-th iteration of BLAKE3 1-to-1 hash; expected result: [4210649924, 4239425932, 2583891669, 2278324621, 1697424527, 1323302812, 3062448259, 2695025053]
--------------------------------
INFO     prove_program [ 12.5s | 3.77% / 100.00% ]
INFO     ┝━ execute_program [ 335ms | 2.69% ]
INFO     ┝━ i [info]: Generated execution trace of 70 columns and 1048576 steps (49% padded) in 334 ms
INFO     ┝━ build_domain [ 10.6ms | 0.08% ] trace_length: 1048576 | lde_domain_size: 8388608
INFO     ┝━ commit_to_main_trace_segment [ 6.64s | 0.00% / 53.25% ]
INFO     │  ┝━ extend_execution_trace [ 5.51s | 44.19% ] num_cols: 70 | blowup: 8
INFO     │  ┕━ compute_execution_trace_commitment [ 1.13s | 9.07% ] tree_depth: 23
INFO     ┝━ commit_to_aux_trace_segment [ 836ms | 0.00% / 6.70% ]
INFO     │  ┝━ extend_execution_trace [ 497ms | 3.99% ] num_cols: 7 | blowup: 8
INFO     │  ┕━ compute_execution_trace_commitment [ 338ms | 2.71% ] tree_depth: 23
INFO     ┝━ evaluate_constraints [ 2.42s | 19.41% ] ce_domain_size: 8388608
INFO     ┝━ commit_to_constraint_evaluations [ 1.19s | 0.00% / 9.58% ]
INFO     │  ┝━ build_composition_poly_columns [ 124ms | 0.99% ] num_columns: 8
INFO     │  ┝━ evaluate_composition_poly_columns [ 782ms | 6.27% ]
INFO     │  ┕━ compute_constraint_evaluation_commitment [ 289ms | 2.31% ] tree_depth: 23
INFO     ┝━ build_deep_composition_poly [ 371ms | 2.97% ]
INFO     ┝━ evaluate_deep_composition_poly [ 71.1ms | 0.57% ]
INFO     ┝━ compute_fri_layers [ 95.9ms | 0.77% ] num_layers: 4
INFO     ┝━ determine_query_positions [ 2.11ms | 0.02% ] grinding_factor: 16 | num_positions: 27
INFO     ┕━ build_proof_object [ 23.6ms | 0.19% ]
--------------------------------
Executed program in 12470 ms
Stack outputs: [4210649924, 4239425932, 2583891669, 2278324621, 1697424527, 1323302812, 3062448259, 2695025053]
Execution proof size: 100 KB
Execution proof security: 96 bits
--------------------------------
INFO     verify_program [ 1.05ms | 100.00% ]
Execution verified in 1 ms

branch: this one

proving time: 5593 ms

Generated a program to compute 100-th iteration of BLAKE3 1-to-1 hash; expected result: [4210649924, 4239425932, 2583891669, 2278324621, 1697424527, 1323302812, 3062448259, 2695025053]
--------------------------------
INFO     prove_program [ 5.59s | 4.10% / 100.00% ]
INFO     ┝━ execute_program [ 234ms | 4.18% ]
INFO     ┝━ i [info]: Generated execution trace of 70 columns and 524288 steps (8% padded) in 233 ms
INFO     ┝━ build_domain [ 5.24ms | 0.09% ] trace_length: 524288 | lde_domain_size: 4194304
INFO     ┝━ commit_to_main_trace_segment [ 2.68s | 0.00% / 47.91% ]
INFO     │  ┝━ extend_execution_trace [ 2.15s | 38.49% ] num_cols: 70 | blowup: 8
INFO     │  ┕━ compute_execution_trace_commitment [ 527ms | 9.42% ] tree_depth: 22
INFO     ┝━ commit_to_aux_trace_segment [ 413ms | 0.00% / 7.38% ]
INFO     │  ┝━ extend_execution_trace [ 253ms | 4.52% ] num_cols: 7 | blowup: 8
INFO     │  ┕━ compute_execution_trace_commitment [ 160ms | 2.86% ] tree_depth: 22
INFO     ┝━ evaluate_constraints [ 1.18s | 21.07% ] ce_domain_size: 4194304
INFO     ┝━ commit_to_constraint_evaluations [ 564ms | 0.00% / 10.09% ]
INFO     │  ┝━ build_composition_poly_columns [ 64.8ms | 1.16% ] num_columns: 8
INFO     │  ┝━ evaluate_composition_poly_columns [ 352ms | 6.30% ]
INFO     │  ┕━ compute_constraint_evaluation_commitment [ 147ms | 2.63% ] tree_depth: 22
INFO     ┝━ build_deep_composition_poly [ 201ms | 3.60% ]
INFO     ┝━ evaluate_deep_composition_poly [ 33.2ms | 0.59% ]
INFO     ┝━ compute_fri_layers [ 39.9ms | 0.71% ] num_layers: 4
INFO     ┝━ determine_query_positions [ 271µs | 0.00% ] grinding_factor: 16 | num_positions: 27
INFO     ┕━ build_proof_object [ 15.0ms | 0.27% ]
--------------------------------
Executed program in 5593 ms
Stack outputs: [4210649924, 4239425932, 2583891669, 2278324621, 1697424527, 1323302812, 3062448259, 2695025053]
Execution proof size: 93 KB
Execution proof security: 96 bits
--------------------------------
INFO     verify_program [ 1.01ms | 100.00% ]
Execution verified in 1 ms

Copy link
Contributor

@bobbinth bobbinth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Thank you! I left some comments inline - but they are all pretty small.

assembly/src/assembler/mast_forest_builder.rs Outdated Show resolved Hide resolved
assembly/src/assembler/mast_forest_builder.rs Outdated Show resolved Hide resolved
assembly/src/assembler/mast_forest_builder.rs Outdated Show resolved Hide resolved
core/src/mast/mod.rs Outdated Show resolved Hide resolved
core/src/mast/mod.rs Outdated Show resolved Hide resolved
core/src/mast/mod.rs Outdated Show resolved Hide resolved
core/src/mast/mod.rs Outdated Show resolved Hide resolved
core/src/mast/mod.rs Outdated Show resolved Hide resolved
@bobbinth
Copy link
Contributor

Oh - I think we should also update program_compilation benchmark to see how this affected the assembler performance.

@plafer
Copy link
Contributor Author

plafer commented Aug 20, 2024

Oh - I think we should also update program_compilation benchmark to see how this affected the assembler performance.

I ended up creating a new benchmark under the miden-stdlib package instead. After looking more into it, I think we should keep benchmarks in the package which they are benchmarking. Otherwise, in this case, I would have had to resort to some hacks - e.g. I use the CARGO_MANIFEST_DIR environment variable to find the directory which contains the masm files for the standard library (under asm/). But in the miden-vm package, this environment variable points to the wrong directory.

And also I think it makes things cleaner - we will find all the benchmarks related to the standard library in the standard library's package, as opposed to having all the miden benchmarks in the same package.

So IMO, in another PR, we should move the current benchmarks in the miden-vm package in the package which they are benchmarking.

Benchmark results

On my machine, it took 115ms to compile the standard library.

@plafer plafer requested a review from bobbinth August 20, 2024 12:50
Copy link
Contributor

@bitwalker bitwalker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Contributor

@bobbinth bobbinth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Thank you! I left two tiny nits inline.

Regarding benchmark locations: I actually think having all of them in a single crate is convenient (we always know where to look for them). miden-vm crate works well for this because all crates "come together" there.

But let's keep the new benchmark as you have and we can later decided whether we move it into miden-vm or move others out of there.

/// Builds a tree of `JOIN` operations to combine the provided MAST node IDs.
pub fn join_nodes(
&mut self,
mast_node_ids: Vec<MastNodeId>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor nit: here and in other functions, I'd probably name this parameter just node_ids as the mast part is kind of implied.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

use criterion::{criterion_group, criterion_main, Criterion};

fn stdlib_compilation(c: &mut Criterion) {
let mut group = c.benchmark_group("stdlib");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I would rename the group name to "compile_stdlib".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@bobbinth
Copy link
Contributor

Oh - and a couple of last things to check:

  1. What is the size of the serialized stdlib after this PR.
  2. How did this PR affect stdlib deserialization time (we have a benchmark for this in miden-vm).

@plafer
Copy link
Contributor Author

plafer commented Aug 20, 2024

What is the size of the serialized stdlib after this PR.

next: 250825 bytes (or 0.24 MBs)
Now: 1198781 bytes (or 1.14 MBs)

How did this PR affect stdlib deserialization time (we have a benchmark for this in miden-vm).

next: 2.2ms
Now: 14.3ms

A ~5x decrease in performance, which is directly related to the file size which is ~5x larger. So it seems like adding a heuristic for when to stop merging blocks would be very beneficial for deserialization performance.

@plafer plafer requested a review from bobbinth August 20, 2024 18:40
Copy link
Contributor

@bobbinth bobbinth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All looks good! Thank you!

@bobbinth bobbinth merged commit 62a49fd into next Aug 20, 2024
9 checks passed
@bobbinth bobbinth deleted the plafer-1429-merge-adjacent-blocks branch August 20, 2024 19:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants