Profile-Guided Optimization (PGO) benchmarks #117

Open
zamazan4ik opened this issue Feb 10, 2024 · 2 comments
Labels
enhancement (New feature or request), performance, rust (Pull requests that update Rust code)

Comments

@zamazan4ik

Hi!

I applied Profile-Guided Optimization (PGO) to llrt to see whether its performance could be improved further (as I have already done for many other projects; see all current results here). I ran some basic benchmarks and want to share the results here.

Test environment

  • Fedora 39
  • Linux kernel 6.7.3
  • AMD Ryzen 9 5900X
  • 48 GiB RAM
  • SSD Samsung 980 Pro 2 TiB
  • Compiler: rustc 1.76
  • llrt version: the latest main branch at the time of writing, commit c040bfd05a2be8d3300e7a1bbfc9405c42a865fa
  • Turbo Boost disabled (for more stable results across benchmark runs)

Benchmark

As a benchmark, I use the same command that I found in the Makefile: llrt fixtures/hello.js. The same scenario is used for the PGO training phase. All PGO steps are done with the cargo-pgo tool: the instrumented build is produced with cargo pgo build, and the PGO-optimized build with cargo pgo optimize build. taskset -c 0 pins each benchmark run to a single CPU core to reduce the influence of CPU scheduling on the results.
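The order of the PGO steps described above can be sketched as a small dry-run script. This is only an illustration: the two cargo pgo commands are the ones named in this issue, while the function name pgo_plan and the path to the instrumented binary are hypothetical placeholders (the real path depends on the target directory layout).

```python
import shlex


def pgo_plan(instrumented_bin, script="fixtures/hello.js"):
    """Return the PGO steps as argv lists, in the order they would run.

    instrumented_bin is a placeholder: pass the path where the
    instrumented llrt binary ends up on your machine.
    """
    return [
        ["cargo", "pgo", "build"],              # 1. build with instrumentation
        [instrumented_bin, script],             # 2. training run on the benchmark script
        ["cargo", "pgo", "optimize", "build"],  # 3. rebuild using the collected profiles
    ]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for argv in pgo_plan("path/to/instrumented/llrt"):
        print(shlex.join(argv))
```

Printing instead of executing keeps the sketch safe to run anywhere; each list could be passed to subprocess.run once the placeholder path is replaced.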

Results

I got the following results:

hyperfine -u microsecond -N --warmup=2000 --min-runs 10000 "taskset -c 0 ./llrt_optimized ../fixtures/hello.js" "taskset -c 0 ./llrt_release ../fixtures/hello.js"
Benchmark 1: taskset -c 0 ./llrt_optimized ../fixtures/hello.js
  Time (mean ± σ):     2664.8 µs ±  78.8 µs    [User: 590.1 µs, System: 1943.3 µs]
  Range (min … max):   2478.1 µs … 4486.1 µs    10000 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: taskset -c 0 ./llrt_release ../fixtures/hello.js
  Time (mean ± σ):     2796.1 µs ±  63.6 µs    [User: 601.4 µs, System: 2068.9 µs]
  Range (min … max):   2647.5 µs … 4495.0 µs    10000 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  taskset -c 0 ./llrt_optimized ../fixtures/hello.js ran
    1.05 ± 0.04 times faster than taskset -c 0 ./llrt_release ../fixtures/hello.js

Here llrt_release is the usual release build and llrt_optimized is the PGO-optimized build. On mean wall time, the PGO build is about 5% faster (2664.8 µs vs 2796.1 µs).
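As a sanity check, the summary line can be recomputed from the reported means and standard deviations. The sketch below assumes first-order error propagation for a ratio of two independent quantities; the function name speed_ratio is made up for this example, and the formula is an assumption that happens to reproduce the figures above.

```python
import math


def speed_ratio(mean_fast, sd_fast, mean_slow, sd_slow):
    """Ratio of two mean run times with a propagated spread.

    Assumes the two measurements are independent and uses first-order
    error propagation for a quotient.
    """
    ratio = mean_slow / mean_fast
    rel = math.sqrt((sd_fast / mean_fast) ** 2 + (sd_slow / mean_slow) ** 2)
    return ratio, ratio * rel


# Means and standard deviations (µs) copied from the hyperfine output.
ratio, spread = speed_ratio(2664.8, 78.8, 2796.1, 63.6)
print(f"{ratio:.2f} ± {spread:.2f}")  # prints "1.05 ± 0.04"
```

Note that the ± 0.04 comes from the run-to-run standard deviations, so it describes the spread of individual runs rather than the uncertainty of the mean.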

I ran the benchmark multiple times, with different command orders and so on. In all cases, the PGO-optimized version was faster than the usual release version. However, it would be great to run some more precise benchmarks.
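One rough way to judge how stable the difference in means is, using only the summary numbers above, is the standard error of the difference. This is a sketch under the assumption that the runs are independent samples; it ignores the outliers hyperfine warned about and any systematic effects, and mean_difference is a name made up for this example.

```python
import math


def mean_difference(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Difference of two sample means and its standard error.

    Assumes independent runs; unequal variances are allowed.
    """
    diff = mean_b - mean_a
    se = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return diff, se


# PGO build first, release build second; 10000 runs each (values in µs).
diff, se = mean_difference(2664.8, 78.8, 10000, 2796.1, 63.6, 10000)
print(f"difference: {diff:.1f} µs, standard error: {se:.2f} µs")
```

With 10000 runs per binary, the standard error is around 1 µs against a difference of about 131 µs, which is consistent with the PGO build winning in every repetition.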

Further steps

I suggest the following:

  • Perform more PGO benchmarks with more precise performance measurements.
  • If PGO proves worthwhile, add a note about it to the documentation and, possibly, an option in the build scripts so that llrt can be PGO-optimized more easily with the existing build infrastructure.
  • Experiment with Post-Link Optimization (PLO) using tools like LLVM BOLT.

I hope these benchmark results can be interesting to someone.

@richarddavison added the enhancement, rust, and performance labels on Feb 12, 2024
@richarddavison
Contributor

This is very interesting! I will rerun the benchmark with PGO (with profile data from test runs) and see the results! PLO is also super interesting but is a different beast! Right now, we use Zig as a cross compiler. Since LLRT is a fully static build using musl libc, we can probably use the musl sources and clang-15 directly (since it may come with BOLT) and apply PGO, PLO, and LTO 🥇

@EricDunaway

If instrumentation/sampling and testing could be streamlined, it would be interesting to see whether a per-lambda optimization with PGO+BOLT would be beneficial for some use cases, rather than a generic optimization.
