Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ci] add gitlab pipeline for criterion benchmarks #422

Merged
merged 34 commits into from
Sep 16, 2022

Conversation

sergejparity
Copy link
Contributor

No description provided.

@codecov-commenter
Copy link

codecov-commenter commented Sep 2, 2022

Codecov Report

Merging #422 (e916aec) into master (8fd6464) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #422   +/-   ##
=======================================
  Coverage   79.38%   79.38%           
=======================================
  Files          72       72           
  Lines        6198     6198           
=======================================
  Hits         4920     4920           
  Misses       1278     1278           

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Copy link
Member

@Robbepop Robbepop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking forward to trying it out. what are your plans on proper formatting of the output?

.gitlab-ci.yml Outdated
- cargo --version
- rustup +nightly show
- cargo +nightly --version
- cargo spellcheck --version
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not needed, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. It's just debug info. Will remove later

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please remove this line for merge

@Robbepop
Copy link
Member

Robbepop commented Sep 2, 2022

We might be able to perform the benchmark pass via GitHub Actions using the new hosted GHA runners:
https://github.blog/2022-09-01-github-actions-introducing-the-new-larger-github-hosted-runners-beta/

This would simplify our CI setup compared to having 2 different ones.
What do you think?

@sergejparity
Copy link
Contributor Author

Yes. Those ones look interesting. But they are still in beta, not sure whether they available for us right now.

@Robbepop Robbepop force-pushed the sk-add-gitlab-pipeline branch from 461f183 to c4b82f4 Compare September 5, 2022 08:51
@Robbepop Robbepop force-pushed the sk-add-gitlab-pipeline branch from c4b82f4 to 9c39f52 Compare September 6, 2022 19:55
@paritytech-cicd-pr
Copy link

paritytech-cicd-pr commented Sep 9, 2022

CRITERION BENCHMARKS

BENCHMARK MASTER PR Diff
compile_and_validate_v1 5.3960 ms 5.4034 ms ⚪ -0.0543%
execute_count_until_v1 2.0226 ms 2.0235 ms ⚪ +0.0058%
execute_factorial_iterative_v1 918.89 ns 921.30 ns ⚪ +0.2528%
execute_factorial_recursive_v1 1.2885 µs 1.2668 µs 🟢 -1.6309%
execute_fib_iterative_v1 4.7993 ms 4.8103 ms ⚪ +0.2261%
execute_fib_recursive_v1 11.649 ms 11.088 ms 🟢 -4.8002%
execute_global_bump_v1 2.8162 ms 2.8178 ms ⚪ +0.0311%
execute_host_calls_v1 31.023 µs 30.696 µs ⚪ -0.6575%
execute_memory_fill_v1 4.0522 ms 4.0485 ms ⚪ -0.0776%
execute_memory_sum_v1 3.8000 ms 3.7960 ms ⚪ -0.0977%
execute_memory_vec_add_v1 8.0440 ms 8.0777 ms ⚪ +0.3714%
execute_recursive_is_even_v1 2.3110 ms 2.2377 ms 🟢 -3.1723%
execute_recursive_ok_v1 308.19 µs 295.35 µs 🟢 -4.1052%
execute_recursive_scan_v1 373.12 µs 357.69 µs 🟢 -4.1572%
execute_recursive_trap_v1 25.767 µs 25.363 µs 🟢 -1.7334%
execute_regex_redux_v1 1.4169 ms 1.3883 ms 🟢 -2.0764%
execute_rev_complement_v1 1.3855 ms 1.3879 ms ⚪ +0.2077%
execute_tiny_keccak_v1 1.2512 ms 1.2656 ms ⚪ +1.3573%
execute_trunc_f2i_v1 1.8029 ms 1.8044 ms ⚪ +0.1029%
instantiate_v1 75.835 µs 74.327 µs 🟢 -2.0046%

Link to pipeline

@sergejparity
Copy link
Contributor Author

@Robbepop please take a look at the sample report above. Does it look like what is expected?
It was generated from output of cargo bench --bench benches -- --noplot --baseline master :
As the base I took what can be seen in the logs

Benchmarking execute/memory_vec_add/v1
Benchmarking execute/memory_vec_add/v1: Warming up for 1.0000 s
Benchmarking execute/memory_vec_add/v1: Collecting 10 samples in estimated 2.1970 s (275 iterations)
Benchmarking execute/memory_vec_add/v1: Analyzing
execute/memory_vec_add/v1
                        time:   [7.9111 ms 7.9149 ms 7.9200 ms]
                        change: [-0.1041% +0.0076% +0.1570%] (p = 0.93 > 0.05)
                        No change in performance detected.
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) high mild
  1 (10.00%) high severe

and transformed into this:

Benchmarks results
execute/memory_vec_add/v1
execute/memory_vec_add/v1
time: [7.9111 ms 7.9149 ms 7.9200 ms]
change: [-0.1041% +0.0076% +0.1570%] (p = 0.93 > 0.05)
No change in performance detected.
Found 2 outliers among 10 measurements (20.00%)
1 (10.00%) high mild
1 (10.00%) high severe

Any suggestions what needs to be improved, changed, added, etc..?

@Robbepop
Copy link
Member

Robbepop commented Sep 9, 2022

Hey @sergejparity thank you for the update!
Having those CI based benchmarks is already a big upgrade!
The benchmarked times looks a bit slow, do you know what machine they were running on?
(I did not expect that my old laptop from 2015 would outperform our CI runners 😅 )

I really like the color encodings where red indicates worse performance, green is improved performance and white is no change!
However, imo the default Criterion output for benchmarks is kind of unreadable at large and I'd much more prefer having a tabular.

Something like the following:

Benchmark master PR Diff %
compile_and_validate 5.3814 ms 5.3807 ms ⚪ -0.8455%
instantiate 74.820 µs 74.782 µs ⚪ +0.6305%
execute/tiny_keccak 1.2501 ms 1.2468 ms ⚪ -0.1709%
execute/count_until 1.9322 ms 1.8791 ms 🟢 -1.8681%

etc...

Note that I made up the master column numbers and took over the PR column numbers in the middle as well as the %-diffs.

One thing we might want to do is to make benchmarks in CI more stable. Right now the benchmarks are kinda unstable since I rather wanted to make them execute faster. But for CI we could change that. Maybe we can implement it in a dynamic way that allows for both options, fast and imprecise as well as slow and precise mode.

Generally I have big plans for this CI based benchmarking infrastructure.
For example in the future we want to extend it to also run Wasm coremark benchmarks native and under Wasmtime since that's how we will run wasmi mostly via Substrate. However, that would exceed the scope of the current PR and I will write a detailed issue in the future for this extension.

@sergejparity
Copy link
Contributor Author

@Robbepop thank you for the feedback. Will update the template.

What goes regarding performance, I also just using random regular runners for now. Later will tune up this component too.

@Robbepop
Copy link
Member

Robbepop commented Sep 9, 2022

@Robbepop thank you for the feedback. Will update the template.

What goes regarding performance, I also just using random regular runners for now. Later will tune up this component too.

Hyped to see your updates! :)
Glad to hear that runner performance is going to improve later.

@sergejparity sergejparity changed the title [wip][ci][do_not_merge] add gitlab pipeline [ci] add gitlab pipeline for criterion benchmarks Sep 16, 2022
@sergejparity
Copy link
Contributor Author

@Robbepop now it should be much closer to what you expect.

@Robbepop
Copy link
Member

@sergejparity yes it looks pretty good to me! Great job! 🚀

One small nit is that I'd love if the benchmark names were written in mono fonts. I.e. using ` around the names.

Is this PR done?

@sergejparity
Copy link
Contributor Author

names were written in mono fonts

Here you go :)

.gitlab-ci.yml Outdated
- cargo --version
- rustup +nightly show
- cargo +nightly --version
- cargo spellcheck --version
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please remove this line for merge

Comment on lines +13 to +26
echo "PARSING MASTER REPORT"
sed -e 's/^Found.*//g' \
-e 's/^\s\+[[:digit:]].*$//g' \
-e 's/\//_/g' \
-e 's/^[a-z0-9_]\+/"&": {/g' \
-e 's/time:\s\+\[.\{10\}/"time": "/g' \
-e 's/.\{10\}\]/"},/g' \
-e '1s/^/{\n/g' \
-e '/^$/d' \
-e 's/ */ /g' \
-e 's/^ *\(.*\) *$/\1/' $1 \
| sed -z 's/.$//' \
| sed -e '$s/.$/}/g' \
| tee target/criterion/output_master.json
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a follow up to this PR we should try to make this part more readable.
For example criterion supports JSON output as well: https://bheisler.github.io/criterion.rs/book/cargo_criterion/external_tools.html
And with tools such as jq we could significantly make extracting information more readable.

I will merge this PR and propose a follow-up PR for you. :) Is that okay with you?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. My homegrown parsing "engine" looks ugly. And definitely can be improved.
But json export tool you mention is a part of cargo-criterion which is under development. By the way, have you tried it in action?
Reports from cargo bench also produces jsons with raw data in target/criterion. I've tried to use them as well, but found that even it is possible get execution times and other metrics. Only thing stopped me from using it is that I forced to write my own statistics analysis tool to interpret them :)
So parsing raw command line output to json appeared easier solution.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upd: tried to install and use cargo-criterion and first attempt was not successful:

Benchmarking compile_and_validate/v1thread 'main' panicked at 'could not open benchmark file benches/wasm/wasm_kernel/target/wasm32-unknown-unknown/release/wasm_kernel.wasm: No such file or directory (os er',  wasmi_v1/benches/bench/mod.rs : 16 : 33 
 note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Benchmarking compile_and_validate/v1: Warming up for 1.0000 sthread 'main' panicked at 'Unexpected message FinishedBenchmarkGroup { group: "compile_and_validate/v1" }', /home/sergej/.cargo/registry/src/jackfan.us.kg-1ecc6299db9ec823/cargo-criterion-1.1.0/src/bench_target.rs:306:26
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Io(Error { kind: UnexpectedEof, message: "failed to fill whole buffer" })', /home/sergej/.cargo/registry/src/jackfan.us.kg-1ecc6299db9ec823/criterion-0.4.0/src/benchmark_group.rs:380:18
stack backtrace:

Double checked benches/wasm/wasm_kernel/target/wasm32-unknown-unknown/release/wasm_kernel.wasm is in place. What could be the reason for failure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with raw command line output compared to the JSON output (or genreally machine readable output) is that the former is subject to change whereas the latter is kinda guaranteed to be stable to not break dependents. Therefore as a long term non-brittle solution we really really want to read output that is intended to be machine readable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upd: tried to install and use cargo-criterion and first attempt was not successful:

Benchmarking compile_and_validate/v1thread 'main' panicked at 'could not open benchmark file benches/wasm/wasm_kernel/target/wasm32-unknown-unknown/release/wasm_kernel.wasm: No such file or directory (os er',  wasmi_v1/benches/bench/mod.rs : 16 : 33 
 note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Benchmarking compile_and_validate/v1: Warming up for 1.0000 sthread 'main' panicked at 'Unexpected message FinishedBenchmarkGroup { group: "compile_and_validate/v1" }', /home/sergej/.cargo/registry/src/jackfan.us.kg-1ecc6299db9ec823/cargo-criterion-1.1.0/src/bench_target.rs:306:26
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Io(Error { kind: UnexpectedEof, message: "failed to fill whole buffer" })', /home/sergej/.cargo/registry/src/jackfan.us.kg-1ecc6299db9ec823/criterion-0.4.0/src/benchmark_group.rs:380:18
stack backtrace:

Double checked benches/wasm/wasm_kernel/target/wasm32-unknown-unknown/release/wasm_kernel.wasm is in place. What could be the reason for failure.

You are probably doing this in the root directory instead of the wasmi_v1 subdirectory.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. Now it works, at first glance json's produced by cargo-criterion are almost what is needed. Of course some processing also is needed to produce readable output, like time conversion from ns into more convenient units. But at least we'll have an out-of-the-box machine readable data

@Robbepop
Copy link
Member

Thanks a lot @sergejparity for this great work! I will merge this PR now and will write some issues for future work items for you and the benchmarking CI. :)

🚀

@Robbepop Robbepop merged commit 97fb1d2 into master Sep 16, 2022
@Robbepop Robbepop deleted the sk-add-gitlab-pipeline branch September 16, 2022 11:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants