Evaluate LTO, CGU=1, Profile-Guided Optimization (PGO) and LLVM BOLT #834
Thank you for your issue! I definitely agree with adding it as Quilkin is nearly entirely CPU bound from Server
Quilkin
Client
|
@XAMPPRocky I just tried your instructions above and on my Linux machines nothing happens -
And the only option to close it is SIGKILL. Fortio instances are started exactly as you wrote above. Did I just miss something obvious? |
Oh, it seems to be just an overloading issue (maybe too many connections). The benchmark started fine when I lowered the connection count and target QPS. Sorry for the ping :) |
Yeah, you need to adjust the |
I performed some benchmarks and want to share my results.
Test environment
Benchmark setup
For benchmarking purposes, I use the setup from #834 (comment) (suggested by @XAMPPRocky). The only addition from my side is using
The amount of QPS is tweaked to make sure that Quilkin's CPU core is always at 100% (so we can easily measure the throughput improvements on the same hardware). In this benchmark, I use 4 build configurations:
Release build is done with
All benchmarks are done multiple times, on the same hardware/software setup, with the same background "noise" (as much as I can guarantee, of course). Between each run,
Results
For the build configurations:
I got the following results:
According to the tests, it's possible to achieve improvements of several percent with LTO and/or PGO, at least in the benchmark above.
Binary sizes for all binaries with
I would also like to share some numbers on how enabling LTO and PGO affects the build time:
Possible further steps
|
Thank you for working on this @zamazan4ik! It's a shame we can't get both right now. Is there one in particular that you'd recommend we adopt while we wait for it to be fixed? Are you interested in contributing the work to make this happen in our CI? |
I recommend enabling LTO (
If you agree to start with LTO, the changes in general would be as simple as the following change to the
Since LTO (especially the Fat version) greatly increases build time (see my build time benchmarks above), you can enable LTO only when building actual releases, not for regular CI build checks. It's all up to you. To begin with, I recommend you just put these lines in the |
Thanks also for doing this work - this is super interesting, and great to see the performance improvements.
This doesn't seem like a huge jump. Even for + LTO being 4 minutes -- that's not the end of the world. So definitely not a blocker.
Shall we switch out the iperf test for a fortio one? I'm not wedded to either, whatever is easiest to use! |
Yeah, I think fortio is better; iperf has never worked for me, whether hosting a server locally or using a public one. Re: the release flags, I think I would lean towards enabling them only in CI, so that running benchmarks locally and iterating on improvements is still fast. For CI, the extra build time is worth the better performance. |
Agreed. #835 filed.
Yeah, that makes sense. We could add the optimisation when building out the images via the Makefile (links below), which hooks into CI, but for a local
Lines 56 to 59 in aeb2871
|
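One way to keep local builds fast while CI builds get the heavier optimizations is to override the release profile through Cargo's profile environment variables instead of editing Cargo.toml. This is a minimal sketch, assuming a POSIX shell. The function name `build_release_lto` is hypothetical, and whether Quilkin's Makefile would call something like it is an assumption, not something decided in this thread:

```shell
# Hypothetical helper for a CI or release job.
# Cargo reads CARGO_PROFILE_<NAME>_<KEY> variables as profile overrides,
# so the release profile gets fat LTO and a single codegen unit for this
# invocation only; a plain local `cargo build --release` stays unchanged.
build_release_lto() {
  CARGO_PROFILE_RELEASE_LTO=fat \
  CARGO_PROFILE_RELEASE_CODEGEN_UNITS=1 \
    cargo build --release "$@"
}
```

Any extra arguments are passed through to Cargo, for example `build_release_lto --locked`.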
Agree. Just to highlight: some projects enable such "heavy" optimizations only when building actual release binaries. E.g. Vector implements it via a special release script. So if you decide to implement such an approach, there are already examples in the current ecosystem to take a look at.
Definitely! It's a good way to integrate PGO into the project. |
If you would love to show us how it's done 😃 @zamazan4ik - would definitely love your help in this area for sure. Seems like an easy win to me 👍🏻 |
Sure. You can create an additional LTO-specific profile in Cargo.toml, as is done in the G3 project, and then have the Makefile build Quilkin with that specific Cargo profile. |
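The dedicated-profile approach could look roughly like the sketch below. The profile name `release-lto` and the function name `build_with_profile` are hypothetical, and the profile shown in the comments is an assumption about what such a section might contain, not a copy of the G3 project's configuration:

```shell
# Assumes Cargo.toml defines a custom profile along these lines:
#   [profile.release-lto]
#   inherits = "release"
#   lto = "fat"
#   codegen-units = 1
#
# Hypothetical helper a Makefile target could invoke.
# The first argument selects the profile; it defaults to release-lto.
build_with_profile() {
  profile="${1:-release-lto}"
  cargo build --profile "$profile"
}
```

With a custom profile, Cargo places build artifacts in a directory named after the profile (here `target/release-lto/`), so ordinary `cargo build --release` output is left alone.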
Hi!
Recently I checked Profile-Guided Optimization (PGO) improvements on multiple projects. The results are here. E.g. PGO helps with optimizing Envoyproxy. According to multiple tests, PGO can help improve performance in many other cases. That's why I think trying to optimize Quilkin with PGO can be a good idea.
Setting codegen units (CGU) to 1 and enabling LTO can also help optimize Quilkin's performance thanks to potentially more aggressive inlining (and could help reduce the binary size).
I can suggest the following action points:
Testing Post-Link Optimization techniques (like LLVM BOLT) might be interesting too (Clang and Rustc already use BOLT in addition to PGO), but I recommend starting with regular PGO.
For Rust projects, I recommend starting to experiment with PGO using cargo-pgo.
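The cargo-pgo workflow follows an instrument, run, rebuild cycle. This is a rough sketch, assuming cargo-pgo is installed (`cargo install cargo-pgo`) together with the `llvm-tools-preview` rustup component. The `pgo_cycle` function name is hypothetical, and the workload is whatever command drives representative load against the instrumented binary:

```shell
# Hypothetical helper: one full PGO cycle with cargo-pgo.
# Arguments: the command that exercises the instrumented binary.
pgo_cycle() {
  # 1. Build an instrumented binary that records profiles at runtime.
  cargo pgo build || return 1
  # 2. Run the caller-supplied workload so that profiles get collected.
  "$@" || return 1
  # 3. Rebuild, feeding the collected profiles back to the compiler.
  cargo pgo optimize
}
```

It would be called as `pgo_cycle ./run-benchmark.sh`, where `run-benchmark.sh` is a placeholder for a script that starts the instrumented binary, applies load, and shuts it down. If the workload fails, the function stops before the optimize step.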
Here are some examples of how PGO optimization is integrated in other projects:
configure script
I have already tried to perform PGO tests on my machine but hit a bug (more details in #833). I think we can wait for the fix or run the benchmark some other way (e.g. with iperf).