
LTO: 10x regression #47866

Closed
luben opened this issue Jan 30, 2018 · 16 comments
Labels
I-slow Issue: Problems and improvements with respect to performance of generated code. P-high High priority T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@luben

luben commented Jan 30, 2018

I noticed a significant regression since nightly 2018-01-27 when enabling LTO. Here is a simplified example that shows a 10x difference.

// Returns a boxed closure that builds a Vec of length `i`,
// filled with clones of `init`.
fn fill<T: Clone + 'static>(init: &'static T) -> Box<Fn(&usize) -> Vec<T>> {
    Box::new(move |&i| {
        let mut vec = Vec::<T>::new();
        vec.resize(i, init.clone());
        vec
    })
}

fn main() {
    let zeroes = fill(&0);
    // Build a 1000-element Vec<i32> a million times.
    for _ in 0..1_000_000 {
        zeroes(&1000);
    }
}

Without LTO enabled:

$ perf stat -B -e cycles,instructions ./target/release/columnar

 Performance counter stats for './target/release/regr':

       335,160,070      cycles                                                      
       539,078,439      instructions              #    1.61  insn per cycle         

       0.112408607 seconds time elapsed

With LTO enabled:


 Performance counter stats for './target/release/regr':

     1,121,648,955      cycles                                                      
     5,233,309,105      instructions              #    4.67  insn per cycle         

       0.361163946 seconds time elapsed

Tested variants:
Bad: -C lto=fat
Good: -C lto=thin
Good: -C lto=fat -C codegen-units=1
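
For a rough wall-clock check without perf, here is a minimal sketch of the same reproduction with an in-program timer added (the Instant-based measurement is an illustrative addition, not part of the original report; dyn is spelled out for newer editions):

use std::time::Instant;

fn fill<T: Clone + 'static>(init: &'static T) -> Box<dyn Fn(&usize) -> Vec<T>> {
    Box::new(move |&i| {
        let mut vec = Vec::<T>::new();
        vec.resize(i, init.clone());
        vec
    })
}

fn main() {
    let zeroes = fill(&0);
    let start = Instant::now();
    for _ in 0..1_000_000 {
        zeroes(&1000);
    }
    // Expect roughly the same ratio as the perf numbers above when switching
    // between -C lto=fat and -C lto=thin.
    println!("elapsed: {:?}", start.elapsed());
}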

@siddontang

siddontang commented Jan 30, 2018

Referring to http://www.brendangregg.com/perf.html#CPUstatistics, Gregg says:

This includes instructions per cycle (IPC), labeled "insns per cycle", or in earlier versions, "IPC". This is a commonly examined metric, either IPC or its invert, CPI. Higher IPC values mean higher instruction throughput, and lower values indicate more stall cycles. I'd generally interpret high IPC values (eg, over 1.0) as good, indicating optimal processing of work.

But I am confused about why LTO executes so many more instructions.

@ollie27
Member

ollie27 commented Jan 30, 2018

What results do you get with -Ccodegen-units=1?

@kennytm kennytm added the I-slow Issue: Problems and improvements with respect to performance of generated code. label Jan 30, 2018
@luben
Author

luben commented Jan 30, 2018

@siddontang, higher IPC is good if we execute the same (or a similar) number of instructions. Here, with LTO, it executes about 10x the instructions but without a 10x increase in IPC, which leads to roughly 3x the run-time.
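
As a quick sanity check of that arithmetic, a sketch using the perf numbers quoted above (cycles approximated as instructions / IPC; the variable names are illustrative):

fn main() {
    // Values taken from the perf output above.
    let (insns_no_lto, ipc_no_lto) = (539_078_439_f64, 1.61);
    let (insns_lto, ipc_lto) = (5_233_309_105_f64, 4.67);

    // cycles ≈ instructions / IPC
    let cycles_no_lto = insns_no_lto / ipc_no_lto; // ~335M
    let cycles_lto = insns_lto / ipc_lto;          // ~1,121M

    // ~9.7x the instructions at ~2.9x the IPC works out to ~3.3x the cycles,
    // which matches the ~3x longer run-time.
    println!("cycle ratio: {:.1}x", cycles_lto / cycles_no_lto);
}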

@ollie27, just tested: it solves it. Another correction: -C lto=thin also solves it; I didn't test it properly before.

@alexcrichton
Member

The change list for this regression is unfortunately pretty large (between rust-nightly-2018-01-26 and rust-nightly-2018-01-27). Like @luben, I can confirm that -C lto=fat produces poor performance while -C lto=thin produces good performance.

The primary change in that change list, though, was switching the default back to "fat" LTO; it seems that at some point in the past our fat LTO regressed. I'm running out of time to help narrow this down, but what I've found so far is that there was a surprising regression of this program between nightly-2017-12-15 and nightly-2017-12-16, in just the optimized version (not even the LTO version). The change list doesn't have anything suspicious, though.

@alexcrichton alexcrichton added regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Jan 30, 2018
@nikomatsakis
Contributor

(Just to be clear, this is a regression in the perf of the compiled code, right?)

@nikomatsakis
Contributor

@alexcrichton do you think you will have time to chase this down? Should we try to find someone to assign?

@nikomatsakis
Contributor

triage: P-high

Need to figure out what's happening here.

@rust-highfive rust-highfive added the P-high High priority label Feb 1, 2018
@alexcrichton
Member

@nikomatsakis unfortunately I'm pretty tight on time, so it's probably best to find another soul to help out with this.

@alexcrichton
Member

Er, ok, I had some time to debug this. @eddyb, it looks like this regression may be caused by #46623, or at least something here is caused by that.

The logs I have are:

before

+ rustc -vV
rustc 1.24.0-nightly (50f6c3ece 2017-12-15)
binary: rustc
commit-hash: 50f6c3ece0ec738da48f8e77e6379a14bd02d1f4
commit-date: 2017-12-15
host: x86_64-unknown-linux-gnu
release: 1.24.0-nightly
LLVM version: 4.0
+ rustc foo.rs -O -C codegen-units=1
+ perf stat -e cycles ./foo

 Performance counter stats for './foo':

       331,404,612      cycles

       0.085892079 seconds time elapsed

after

+ rustc -vV
rustc 1.24.0-nightly (77efd6800 2017-12-15)
binary: rustc
commit-hash: 77efd6800c57ba83923dddbbabf03c7afa6a34a4
commit-date: 2017-12-15
host: x86_64-unknown-linux-gnu
release: 1.24.0-nightly
LLVM version: 4.0
+ rustc foo.rs -O -C codegen-units=1
+ perf stat -e cycles ./foo

 Performance counter stats for './foo':

     3,129,021,965      cycles

       0.806494586 seconds time elapsed

That is, with #46623 the benchmark without LTO regressed by 10x.

@eddyb
Member

eddyb commented Feb 1, 2018

@alexcrichton Maybe it's similar to #47062 (comment)? Is there any change around that nightly?

@alexcrichton
Member

False alarm: @eddyb is right; the regression from the commit I bisected above was fixed by #47007.

@alexcrichton
Member

Hm, ok, some more investigation. The numbers here are all over the place, and there's a whole bunch of changes that were in flight over the past few months. That being said, I don't think that this is a regression!

So first up, "fat" LTO is much slower than "thin" LTO on this benchmark. That doesn't make a lot of sense because it's supposed to be the other way around! I tried to hunt backwards to see what could have changed around "fat" LTO, but I went back pretty far and it was basically always constant. On my machine, going all the way back to nightly-2017-11-01, the benchmark takes ~0.4s when compiled with fat LTO and 16 CGUs.

The next suspicion was that fat LTO was slower than turning LTO off entirely (odd!). It turns out #46382 somehow magically fixed this. More specifically, f9b0897 (the merge before #46382) resulted in a 1.5s runtime w/o LTO and 0.4s with it (both using 16 CGUs). The next merge (#46382) had a runtime of 0.08s w/o LTO (yes, that's without) and 0.4s with it (fat LTO, that is). So somehow #46382 improved non-LTO builds on this benchmark but kept normal LTO builds the same.

Now the final piece here is the difference between thin and fat LTO. The switch to ThinLTO happened in #46382 as well, which means a -C lto argument actually improves the runtime to 0.08s on that commit (I forcibly disabled thin via -Z thinlto=no). It turns out, though, that this difference was always there.


So, tl;dr:

All in all, this is probably a bug in LLVM, where fat LTO isn't optimizing as well as a non-LTO build. I'm going to remove the regression tag, though, as I don't believe this is a regression any more: stable uses "fat" LTO by default, as do all other channels right now.

@alexcrichton alexcrichton removed the regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. label Feb 1, 2018
@dotdash
Contributor

dotdash commented Feb 1, 2018

Reverting 8bde2ac "rustc: Add -C lto=val option" fixes this.

@alexcrichton
Member

@dotdash indeed, but all that's doing is switching the default from fat LTO to ThinLTO; you can test that out by passing -C lto=thin and -C lto=fat to see the difference.

I'm not sure why, but it appears that this benchmark gets worse when compiled with full LTO. It happens to be faster with ThinLTO, which was the default on beta/nightly for a bit but is no longer the case.

@nikomatsakis
Contributor

@alexcrichton great investigation! Seems like we can close the issue, and that @luben ought to use thin LTO =)

I'm not sure why, but it appears that this benchmark gets worse when compiled with full LTO. It happens to be faster with ThinLTO, which was the default on beta/nightly for a bit but is no longer the case.

I imagine this can happen; I mean, it's all heuristics in the end.

@alexcrichton
Member

Ah yes, I believe we can close. @luben, if you've got any more questions though, please just let me know!
