Slower performance caused only by using LTO #48371
Comments
Could this be related to thinlto + multiple codegen units? |
Please try with |
With |
Yep, this is a known issue with multiple codegen units on release builds, which were enabled in the latest release: #47745 Closing as a duplicate. |
I saw that issue but I thought that ThinLTO works differently than regular LTO, so I opened this one. |
Compiling with Compiling with |
Another example with a pretty significant slowdown:
Without lto: 3.065 seconds |
@nagisa I noticed that you added this issue to the list in #47745, but as I showed in my previous comment, I didn't have a problem with ThinLTO, only with regular/fat LTO. If the underlying issue is really the same, can you note that fat LTO can also cause a slowdown with multiple codegen units? And if it's a different issue, can you please reopen this issue (or find the appropriate one)? Thanks! |
Removed it. |
I agree that this bug should be reopened. My performance regression is also only with traditional "fat" LTO. Why are multiple codegen units being used by default anyway? I thought they were only going to be used with thin LTO. EDIT: |
FWIW #47866 was another issue where multiple codegen units + fat lto produced worse code. |
@ollie27 I didn't find that issue since it was already closed when I opened this, but this is probably the same issue. The conclusion in #47866 was that it's just how fat LTO works. Unfortunately, I'm not the one who compiles the code in my case; the best I can do is to convince the maintainers of the benchmark game to compile with I won't close this issue yet since @robsmith11's code is pretty small, so it might be good for further investigation. |
As I mentioned in my previous comment, I think the solution to this is that fat LTO should default to codegen-units=1, not 16 or whatever the new default is. Fat LTO isn't designed for good run-time performance with multiple codegen-units, only thin LTO is. |
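Until the defaults change, the setting suggested above can be applied per project. A minimal sketch of a Cargo.toml release profile, assuming the project is built with Cargo; `lto` and `codegen-units` are standard Cargo profile keys, but whether this removes the slowdown for a given program is something to measure, not assume:

```toml
# Cargo.toml (sketch): pair fat LTO with a single codegen unit
[profile.release]
lto = "fat"        # whole-program ("fat") LTO
codegen-units = 1  # avoid splitting the crate into multiple codegen units
```

Building with `cargo build --release` picks up this profile; comparing the run time against a build without the `lto` line shows whether LTO itself is responsible.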
Triage; we've changed the defaults around this a bunch of times, but I'm not sure what they are today. |
The Computer Language Benchmarks Game was on the Rust subreddit recently, and while checking the numbers for Rust I noticed that the Rust solution for the fasta benchmark is much slower than the C version, even though the two work fairly similarly (the multithreading in the C version is based on the Rust version). It turned out that the Rust benchmarks are compiled with LTO by default, and when I tested the code on my machine without LTO (both stable and nightly), it was almost as fast as the C version. I tried to find an existing issue, but most of them are about slow compilation, not slow runtime.
The interesting thing is that the CPU monitor graph clearly shows that during the last part of the benchmark all CPU cores are at only 70% usage (as if a mutex were held for too long). I also checked the binary size: it went down from 4.4 MB to 3.1 MB with LTO.
EDIT: I also tested it with the Mutex from parking_lot. It's still slow with LTO, but without LTO it's a tiny bit faster than the C version.
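For anyone who wants to check the mutex theory without the full fasta program, here is a rough sketch of a lock-heavy workload that can be timed under different build settings. It is not the benchmark code: `contended_sum` and its parameters are hypothetical names made up for illustration, and it uses only `std::sync::Mutex`.

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Instant;

/// Hypothetical workload: `threads` workers each take a shared lock
/// `iters` times and increment a counter. Returns the final count.
fn contended_sum(threads: usize, iters: u64) -> u64 {
    let total = Arc::new(Mutex::new(0u64));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let total = Arc::clone(&total);
            thread::spawn(move || {
                for _ in 0..iters {
                    // Very short critical section, so lock overhead dominates.
                    *total.lock().unwrap() += 1;
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    // Copy the value out before the guard is dropped.
    let result = *total.lock().unwrap();
    result
}

fn main() {
    let start = Instant::now();
    let sum = contended_sum(4, 1_000_000);
    println!("sum = {}, elapsed = {:?}", sum, start.elapsed());
}
```

Running the same source once with LTO enabled and once without, and comparing the elapsed times, would show whether lock-heavy code alone reproduces the gap or whether something else in fasta is involved.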