
LTO: 10x regression #47866

Closed
luben opened this issue Jan 30, 2018 · 16 comments
Labels
I-slow Issue: Problems and improvements with respect to performance of generated code. P-high High priority T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@luben

luben commented Jan 30, 2018

I noticed a significant regression since nightly 2018-01-27 when enabling LTO. Here is a simplified example that shows a 10x difference.

// Returns a boxed closure that builds a Vec of length `i`,
// filled with clones of `init`.
fn fill<T: Clone + 'static>(init: &'static T) -> Box<Fn(&usize) -> Vec<T>> {
    Box::new(move |&i| {
        let mut vec = Vec::<T>::new();
        vec.resize(i, init.clone());
        vec
    })
}

fn main() {
    let zeroes = fill(&0);
    // Build a 1000-element Vec<i32> a million times.
    for _ in 0..1_000_000 {
        zeroes(&1000);
    }
}

Without LTO enabled:

$ perf stat -B -e cycles,instructions ./target/release/columnar

 Performance counter stats for './target/release/regr':

       335,160,070      cycles                                                      
       539,078,439      instructions              #    1.61  insn per cycle         

       0.112408607 seconds time elapsed

With LTO enabled:


 Performance counter stats for './target/release/regr':

     1,121,648,955      cycles                                                      
     5,233,309,105      instructions              #    4.67  insn per cycle         

       0.361163946 seconds time elapsed

Tested variants:
Bad: -C lto=fat
Good: -C lto=thin
Good: -C lto=fat -C codegen-units=1
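
For a rough wall-clock check without perf, here is a minimal sketch of the same reproduction with an in-program timer added (the Instant-based measurement is an illustrative addition, not part of the original report; dyn is spelled out for newer editions):

use std::time::Instant;

fn fill<T: Clone + 'static>(init: &'static T) -> Box<dyn Fn(&usize) -> Vec<T>> {
    Box::new(move |&i| {
        let mut vec = Vec::<T>::new();
        vec.resize(i, init.clone());
        vec
    })
}

fn main() {
    let zeroes = fill(&0);
    let start = Instant::now();
    for _ in 0..1_000_000 {
        zeroes(&1000);
    }
    // Expect roughly the same ratio as the perf numbers above when switching
    // between -C lto=fat and -C lto=thin.
    println!("elapsed: {:?}", start.elapsed());
}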

@siddontang

siddontang commented Jan 30, 2018

Referring to http://www.brendangregg.com/perf.html#CPUstatistics, Gregg says:

This includes instructions per cycle (IPC), labeled "insns per cycle", or in earlier versions, "IPC". This is a commonly examined metric, either IPC or its invert, CPI. Higher IPC values mean higher instruction throughput, and lower values indicate more stall cycles. I'd generally interpret high IPC values (eg, over 1.0) as good, indicating optimal processing of work.

But I am confused about why LTO executes so many more instructions.

@ollie27
Member

ollie27 commented Jan 30, 2018

What results do you get with -Ccodegen-units=1?

@kennytm kennytm added the I-slow Issue: Problems and improvements with respect to performance of generated code. label Jan 30, 2018
@luben
Author

luben commented Jan 30, 2018

@siddontang, higher IPC is good if we execute the same (or a similar) number of instructions. Here, with LTO, it executes about 10x the instructions but without a 10x increase in IPC, which leads to roughly 3x the run-time.
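
As a quick sanity check of that arithmetic, a sketch using the perf numbers quoted above (cycles approximated as instructions / IPC; the variable names are illustrative):

fn main() {
    // Values taken from the perf output above.
    let (insns_no_lto, ipc_no_lto) = (539_078_439_f64, 1.61);
    let (insns_lto, ipc_lto) = (5_233_309_105_f64, 4.67);

    // cycles ≈ instructions / IPC
    let cycles_no_lto = insns_no_lto / ipc_no_lto; // ~335M
    let cycles_lto = insns_lto / ipc_lto;          // ~1,121M

    // ~9.7x the instructions at ~2.9x the IPC works out to ~3.3x the cycles,
    // which matches the ~3x longer run-time.
    println!("cycle ratio: {:.1}x", cycles_lto / cycles_no_lto);
}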

@ollie27, just tested: it solves it. Another correction: -C lto=thin also solves it; I didn't test it properly before.

@alexcrichton
Member

The change list for this regression is unfortunately pretty large (between rust-nightly-2018-01-26 and rust-nightly-2018-01-27). Like @luben, I can confirm that -C lto=fat produces poor performance while -C lto=thin produces good performance.

The primary change in that change list, though, was switching the default back to "fat" LTO; it seems that at some point in the past our fat LTO regressed. I'm running out of time to help narrow this down, but what I've found so far is that there was a surprising regression of this program between nightly-2017-12-15 and nightly-2017-12-16, in just the optimized version (not even the LTO version). The change list doesn't have anything suspicious, though.

@alexcrichton alexcrichton added regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Jan 30, 2018
@nikomatsakis
Contributor

(Just to be clear, this is a regression in the perf of the compiled code, right?)

@nikomatsakis
Contributor

@alexcrichton do you think you will have time to chase this down? Should we try to find someone to assign?

@nikomatsakis
Contributor

triage: P-high

Need to figure out what's happening here.

@rust-highfive rust-highfive added the P-high High priority label Feb 1, 2018
@alexcrichton
Member

@nikomatsakis unfortunately I'm pretty tight on time, so it's probably best to find another soul to help out with this.

@alexcrichton
Member

Er, ok, I had some time to debug this. @eddyb, it looks like this regression may be caused by #46623, or at least something here is caused by that.

The logs I have are:

before

+ rustc -vV
rustc 1.24.0-nightly (50f6c3ece 2017-12-15)
binary: rustc
commit-hash: 50f6c3ece0ec738da48f8e77e6379a14bd02d1f4
commit-date: 2017-12-15
host: x86_64-unknown-linux-gnu
release: 1.24.0-nightly
LLVM version: 4.0
+ rustc foo.rs -O -C codegen-units=1
+ perf stat -e cycles ./foo

 Performance counter stats for './foo':

       331,404,612      cycles

       0.085892079 seconds time elapsed

after

+ rustc -vV
rustc 1.24.0-nightly (77efd6800 2017-12-15)
binary: rustc
commit-hash: 77efd6800c57ba83923dddbbabf03c7afa6a34a4
commit-date: 2017-12-15
host: x86_64-unknown-linux-gnu
release: 1.24.0-nightly
LLVM version: 4.0
+ rustc foo.rs -O -C codegen-units=1
+ perf stat -e cycles ./foo

 Performance counter stats for './foo':

     3,129,021,965      cycles

       0.806494586 seconds time elapsed

That is, with #46623 the benchmark without LTO regressed by 10x.

@eddyb
Member

eddyb commented Feb 1, 2018

@alexcrichton Maybe it's similar to #47062 (comment)? Is there any change around that nightly?

@alexcrichton
Member

False alarm: @eddyb is right; the regression from the commit I bisected above was fixed by #47007.

@alexcrichton
Member

Hm, ok, some more investigation. The numbers here are all over the place, and there's a whole bunch of changes that were in flight over the past few months. That being said, I don't think that this is a regression!

So first up, "fat" LTO is much slower than "thin" LTO on this benchmark. That doesn't make a lot of sense because it's supposed to be the other way around! I tried to hunt backwards to see what could have changed around "fat" LTO, but I went back pretty far and it was basically always constant. On my machine, going all the way back to nightly-2017-11-01, the benchmark takes ~0.4s when compiled with fat LTO and 16 CGUs.

The next suspicion was that fat LTO was slower than turning LTO off entirely (odd!). It turns out #46382 somehow magically fixed this. More specifically, f9b0897 (the merge before #46382) resulted in a 1.5s runtime w/o LTO and 0.4s with it (both using 16 CGUs). The next merge (#46382) had a runtime of 0.08s w/o LTO (yes, that's without) and 0.4s with it (fat LTO, that is). So somehow #46382 improved non-LTO builds on this benchmark but kept normal LTO builds the same.

Now the final piece here is the difference between thin and fat LTO. The switch to ThinLTO happened in #46382 as well, which means a -C lto argument actually improves the runtime to 0.08s on that commit (I forcibly disabled thin via -Z thinlto=no). It turns out, though, that this difference was always there.


So, tl;dr:

All in all, this is probably a bug in LLVM, where fat LTO isn't optimizing as well as a non-LTO build. I'm going to remove the regression tag, though, as I don't believe this is a regression any more: stable uses "fat" LTO by default, as do all other channels right now.

@alexcrichton alexcrichton removed the regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. label Feb 1, 2018
@dotdash
Contributor

dotdash commented Feb 1, 2018

Reverting 8bde2ac "rustc: Add -C lto=val option" fixes this.

@alexcrichton
Member

@dotdash indeed, but all that's doing is switching the default from fat LTO to ThinLTO; you can test that out by passing -C lto=thin and -C lto=fat to see the difference.

I'm not sure why, but it appears that this benchmark gets worse when compiled with full LTO. It happens to be faster with ThinLTO, which was the default on beta/nightly for a bit but is no longer the case.

@nikomatsakis
Contributor

@alexcrichton great investigation! Seems like we can close the issue, and that @luben ought to use thin LTO =)

I'm not sure why, but it appears that this benchmark gets worse when compiled with full LTO. It happens to be faster with ThinLTO, which was the default on beta/nightly for a bit but is no longer the case.

I imagine this can happen; I mean, it's all heuristics in the end.

@alexcrichton
Member

Ah yes, I believe we can close. @luben, if you've got any more questions though, please just let me know!
