Profile-Guided Optimization (PGO) results #172

Open
zamazan4ik opened this issue Jan 28, 2024 · 5 comments
Labels: documentation, enhancement, performance


zamazan4ik commented Jan 28, 2024

Hi!

Recently I started evaluating Profile-Guided Optimization (PGO) for different kinds of software; all my current results are available in my GitHub repo. Since PGO often improves runtime performance, I decided to run some PGO tests on Lace, and I want to share the benchmark results here.

Test environment

  • Fedora 39
  • Linux kernel 6.6.13
  • AMD Ryzen 9 5900X
  • 48 GiB RAM
  • SSD Samsung 980 Pro 2 TiB
  • Compiler - Rustc 1.75
  • Lace version: the latest master branch at the time of writing, commit 66e5a67688c76437a9ae5ec1bcadc4c1d0c7b604
  • Disabled Turbo boost (for more stable results across benchmark runs)

Benchmark

For benchmarking purposes, I use two things:

  • Built-in benchmarks
  • Manual lace-cli invocations with manual time measurements.

Built-in benchmarks are invoked with cargo bench --all-features --workspace. The PGO instrumentation phase on the benchmarks is done with cargo pgo bench -- --all-features --workspace. The PGO optimization phase is done with cargo pgo optimize bench -- --all-features --workspace.

For lace-cli, the Release build is done with cargo build --release. The PGO instrumented build is done with cargo pgo build, and the PGO optimized build with cargo pgo optimize build. The PGO training phase is done with LLVM_PROFILE_FILE=/home/zamazan4ik/open_source/lace/cli/target/pgo-profiles/lace_%m_%p.profraw ./lace_instrumented run --csv ../resources/datasets/satellites/data.csv --n-iters 100 result.lace (see the "Results" section for details on how different training sets affect the actual performance numbers).

For lace-cli, I use taskset -c 0 to reduce the OS scheduler's impact on the results. The seed is fixed for the same reason.

All PGO optimization steps are done with the cargo-pgo tool.
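
The lace-cli steps above can be strung together roughly as follows. This is a minimal sketch, not the exact script used for these measurements: the run helper and the DRY_RUN, DATASET and PROFILE_DIR variables are hypothetical names, the script is assumed to be started from the lace/cli directory, and ./lace_instrumented is assumed to be a copy of the binary produced by cargo pgo build, as in the commands above. By default it only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch of the lace-cli PGO workflow with cargo-pgo (instrument, train, optimize).
# DRY_RUN=1 (default) only prints the commands; DRY_RUN=0 executes them.
set -e

DRY_RUN="${DRY_RUN:-1}"
DATASET="${DATASET:-satellites}"                        # training set: satellites or animals
PROFILE_DIR="${PROFILE_DIR:-$PWD/target/pgo-profiles}"  # assumption: started from lace/cli

run() {
    # Print the command, then execute it unless we are in dry-run mode.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

pgo_workflow() {
    # 1. Build the instrumented binary.
    run cargo pgo build
    # 2. Training run. In LLVM_PROFILE_FILE, %m expands to the binary signature
    #    and %p to the process ID, so repeated runs do not overwrite each other.
    #    ./lace_instrumented is assumed to be a copy of the instrumented binary.
    run env "LLVM_PROFILE_FILE=$PROFILE_DIR/lace_%m_%p.profraw" \
        ./lace_instrumented run \
        --csv "../resources/datasets/$DATASET/data.csv" --n-iters 100 result.lace
    # 3. Rebuild with the collected profiles.
    run cargo pgo optimize build
}

pgo_workflow
```

Running it with DATASET=animals prints the same three steps with the animals dataset as the training input, which is how a second, differently trained binary could be produced.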

Results

First, here are the results for the built-in benchmarks:

According to these benchmarks, PGO improves performance in many cases. However, as you can see, performance regresses in some of them. This can be expected: the benchmarks cover different scenarios, and some scenarios have "optimization conflicts", where the same optimization decision leads to an improvement in one scenario and a regression in another. That's why using benchmarks for the PGO training phase can be risky. Even so, we see many improvements.

To cover a more real-life scenario, I also performed PGO benchmarks on lace-cli.

Release vs PGO optimized (trained on the satellites dataset) on the satellites dataset:

hyperfine --warmup 10 --min-runs 50 'taskset -c 0 ./lace_release run --seed 42 --csv ../resources/datasets/satellites/data.csv --n-iters 100 result.lace' 'taskset -c 0 ./lace_optimized run --seed 42 --csv ../resources/datasets/satellites/data.csv --n-iters 100 result.lace'
Benchmark 1: taskset -c 0 ./lace_release run --seed 42 --csv ../resources/datasets/satellites/data.csv --n-iters 100 result.lace
  Time (mean ± σ):      1.469 s ±  0.006 s    [User: 1.386 s, System: 0.063 s]
  Range (min … max):    1.464 s …  1.507 s    50 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: taskset -c 0 ./lace_optimized run --seed 42 --csv ../resources/datasets/satellites/data.csv --n-iters 100 result.lace
  Time (mean ± σ):      1.382 s ±  0.001 s    [User: 1.299 s, System: 0.064 s]
  Range (min … max):    1.380 s …  1.388 s    50 runs

Summary
  taskset -c 0 ./lace_optimized run --seed 42 --csv ../resources/datasets/satellites/data.csv --n-iters 100 result.lace ran
    1.06 ± 0.00 times faster than taskset -c 0 ./lace_release run --seed 42 --csv ../resources/datasets/satellites/data.csv --n-iters 100 result.lace

Release vs PGO optimized (trained on the satellites dataset) on the animals dataset:

hyperfine --warmup 30 --min-runs 100 'taskset -c 0 ./lace_release run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace' 'taskset -c 0 ./lace_optimized run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace'
Benchmark 1: taskset -c 0 ./lace_release run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace
  Time (mean ± σ):     682.7 ms ±   3.6 ms    [User: 608.5 ms, System: 65.8 ms]
  Range (min … max):   680.4 ms … 706.4 ms    100 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: taskset -c 0 ./lace_optimized run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace
  Time (mean ± σ):     652.4 ms ±   2.9 ms    [User: 579.8 ms, System: 64.3 ms]
  Range (min … max):   648.2 ms … 672.5 ms    100 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  taskset -c 0 ./lace_optimized run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace ran
    1.05 ± 0.01 times faster than taskset -c 0 ./lace_release run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace

Just for reference, here is the slowdown from PGO instrumentation:

hyperfine --warmup 5 --min-runs 10 'taskset -c 0 ./lace_release run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace' 'taskset -c 0 ./lace_instrumented run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace'
Benchmark 1: taskset -c 0 ./lace_release run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace
  Time (mean ± σ):     681.7 ms ±   0.7 ms    [User: 608.1 ms, System: 65.8 ms]
  Range (min … max):   681.0 ms … 683.1 ms    10 runs

Benchmark 2: taskset -c 0 ./lace_instrumented run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace
  Time (mean ± σ):     841.0 ms ±   4.7 ms    [User: 754.1 ms, System: 77.3 ms]
  Range (min … max):   835.2 ms … 853.1 ms    10 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  taskset -c 0 ./lace_release run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace ran
    1.23 ± 0.01 times faster than taskset -c 0 ./lace_instrumented run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace

I decided to test one more thing: how much does performance differ if we use different PGO training sets? So here we go.

PGO optimized (trained on the satellites dataset) vs PGO optimized (trained on the animals dataset) on the animals dataset:

hyperfine --warmup 30 --min-runs 100 'taskset -c 0 ./lace_optimized_satellites run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace' 'taskset -c 0 ./lace_optimized_animals run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace'
Benchmark 1: taskset -c 0 ./lace_optimized_satellites run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace
  Time (mean ± σ):     653.0 ms ±   1.4 ms    [User: 579.7 ms, System: 65.4 ms]
  Range (min … max):   649.4 ms … 655.9 ms    100 runs

Benchmark 2: taskset -c 0 ./lace_optimized_animals run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace
  Time (mean ± σ):     622.7 ms ±   1.8 ms    [User: 550.3 ms, System: 64.1 ms]
  Range (min … max):   618.6 ms … 626.3 ms    100 runs

Summary
  taskset -c 0 ./lace_optimized_animals run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace ran
    1.05 ± 0.00 times faster than taskset -c 0 ./lace_optimized_satellites run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace

As you can see, training on the same dataset that is later used for the run gives a measurable extra improvement (5% is a good improvement).
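
The ratios in the hyperfine summaries can be re-derived from the mean times reported above. The snippet below is only an illustrative sketch: the speedup helper is a hypothetical name, and it divides the baseline mean by the candidate mean, ignoring the standard deviations that hyperfine also takes into account.

```shell
# speedup BASELINE CANDIDATE: ratio of mean times, rounded to two decimals.
# Both arguments must use the same unit (seconds or milliseconds).
speedup() {
    awk -v base="$1" -v cand="$2" 'BEGIN { printf "%.2f\n", base / cand }'
}

# Mean times copied from the hyperfine output above.
echo "satellites run, Release vs PGO (satellites-trained): $(speedup 1.469 1.382)x"
echo "animals run, Release vs PGO (satellites-trained):    $(speedup 682.7 652.4)x"
echo "animals run, instrumented vs Release (slowdown):     $(speedup 841.0 681.7)x"
echo "animals run, satellites-trained vs animals-trained:  $(speedup 653.0 622.7)x"
```

The four lines print 1.06x, 1.05x, 1.23x and 1.05x, matching the hyperfine summaries.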

Summing up the results above, I can say that PGO helps to achieve better performance with Lace.

For anyone who cares about the binary size, I also did some measurements on lace-cli:

  • Release: 28184240 bytes
  • PGO optimized (animals dataset): 28085792 bytes
  • PGO optimized (satellites dataset): 27785576 bytes
  • PGO instrumented: 116176688 bytes
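
To put these sizes in relative terms, the change against the Release build can be computed directly from the byte counts above. This is a small sketch; the size_delta_pct helper is a hypothetical name used only for illustration.

```shell
# size_delta_pct SIZE BASE: signed size change relative to BASE, in percent.
size_delta_pct() {
    awk -v size="$1" -v base="$2" \
        'BEGIN { printf "%+.2f\n", (size - base) / base * 100 }'
}

RELEASE=28184240   # Release build size in bytes, from the list above

echo "PGO optimized (animals):    $(size_delta_pct 28085792 "$RELEASE")%"
echo "PGO optimized (satellites): $(size_delta_pct 27785576 "$RELEASE")%"
echo "PGO instrumented:           $(size_delta_pct 116176688 "$RELEASE")%"
```

In other words, the optimized binaries are slightly smaller than the Release build, while the instrumented binary is roughly four times larger.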

Possible further steps

I can suggest the following things to consider:

  • Perform more PGO benchmarks on Lace. If they show improvements, add a note to the documentation about the possible performance gains from PGO (I guess somewhere in the README file would be enough).
  • Provide an easier way (e.g. a build option) to build Lace with PGO. This can help end users and maintainers, since they will be able to optimize Lace for their own workloads.
  • Optimize pre-built binaries (if any) with PGO.

Testing Post-Link Optimization techniques (like LLVM BOLT) would be interesting too (Clang and Rustc already use BOLT in addition to PGO), but I recommend starting with the usual LTO and PGO.

Here are some examples of how PGO optimization is integrated into other projects:

I would be happy to answer any questions about PGO! Much more material about PGO (actual performance numbers across many other projects, the state of PGO across the ecosystem, PGO traps, and tricky details) can be found in https://github.com/zamazan4ik/awesome-pgo

@schmidmt
Contributor

Hi @zamazan4ik,

Thanks for bringing PGO to our attention as a way to improve performance.

While we have some experience with PGO, most of our experience is with algorithmic improvements to gain performance. Would you like to add a section to our mdbook outlining some of the methods you mentioned? We'd be happy to help with lace and what we've learned about how people use it.

Thanks again, we appreciate it.

schmidmt added the documentation, enhancement, and performance labels Jan 29, 2024
@zamazan4ik
Author

Would you like to add a section to our mdbook outlining some of the methods you mentioned?

Which mdbook exactly do you mean? I think I could contribute PGO-related information to it.

@Swandog
Contributor

Swandog commented Jan 30, 2024

Which mdbook exactly do you mean? I think I could contribute PGO-related information to it.

Specifically the code under book in this repo: https://github.com/promised-ai/lace/tree/master/book

@zamazan4ik
Author

@schmidmt
Contributor

Thanks, @zamazan4ik; we appreciate the contribution.
