-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Slower binary, slower compiler, bigger binary since some days #46897
Comments
Could someone try to reproduce and bisect this? |
cc @kennytm I think |
We shouldn't have such big asm differences between successive nightlies, and I don't see anything suspicious in the middle, so it would be good to try and nail down the commit that made the difference. |
I see a regression with this code (note: this is distinct from #46863). #![crate_type="rlib"]
pub fn f(l: &Vec<u64>) -> u64 {
let mut sum = 0;
for i in 0..l.len() {
sum += l[i]
}
sum
} @kennytm the bad code accesses the stack pointer (%rsp) You have a fast network connection and know how to use You should grep for subq $24, %rsp |
I bisected @leonardo-m's original example, just testing the speed of the executable created by a stage 1 @arielb1's latest example also has a clear difference. The code before that commit (just disassembling the 0000000000000000 <_ZN5test21f17h7c517ed04f8b4e1bE>:
0: 48 8b 57 10 mov rdx,QWORD PTR [rdi+0x10]
4: 48 85 d2 test rdx,rdx
7: 74 1b je 24 <_ZN5test21f17h7c517ed04f8b4e1bE+0x24>
9: 48 8b 0f mov rcx,QWORD PTR [rdi]
c: 31 c0 xor eax,eax
e: 31 f6 xor esi,esi
10: 48 39 f2 cmp rdx,rsi
13: 76 12 jbe 27 <_ZN5test21f17h7c517ed04f8b4e1bE+0x27>
15: 48 03 04 f1 add rax,QWORD PTR [rcx+rsi*8]
19: 48 83 c6 01 add rsi,0x1
1d: 48 39 d6 cmp rsi,rdx
20: 72 ee jb 10 <_ZN5test21f17h7c517ed04f8b4e1bE+0x10>
22: eb 02 jmp 26 <_ZN5test21f17h7c517ed04f8b4e1bE+0x26>
24: 31 c0 xor eax,eax
26: c3 ret
27: 50 push rax
28: 48 8d 3d 00 00 00 00 lea rdi,[rip+0x0] # 2f <_ZN5test21f17h7c517ed04f8b4e1bE+0x2f>
2f: e8 00 00 00 00 call 34 <_ZN5test21f17h7c517ed04f8b4e1bE+0x34>
34: 0f 0b ud2 And after: 0000000000000000 <_ZN5test21f17h7c517ed04f8b4e1bE>:
0: 48 83 ec 18 sub rsp,0x18
4: 4c 8b 07 mov r8,QWORD PTR [rdi]
7: 48 8b 57 10 mov rdx,QWORD PTR [rdi+0x10]
b: 31 f6 xor esi,esi
d: 31 c0 xor eax,eax
f: 48 39 d6 cmp rsi,rdx
12: 72 18 jb 2c <_ZN5test21f17h7c517ed04f8b4e1bE+0x2c>
14: eb 1f jmp 35 <_ZN5test21f17h7c517ed04f8b4e1bE+0x35>
16: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
1d: 00 00 00
20: 49 03 04 f0 add rax,QWORD PTR [r8+rsi*8]
24: 48 89 fe mov rsi,rdi
27: 48 39 d6 cmp rsi,rdx
2a: 73 09 jae 35 <_ZN5test21f17h7c517ed04f8b4e1bE+0x35>
2c: 48 89 f7 mov rdi,rsi
2f: 48 83 c7 01 add rdi,0x1
33: 73 0b jae 40 <_ZN5test21f17h7c517ed04f8b4e1bE+0x40>
35: 48 89 f7 mov rdi,rsi
38: 31 c9 xor ecx,ecx
3a: 31 f6 xor esi,esi
3c: eb 10 jmp 4e <_ZN5test21f17h7c517ed04f8b4e1bE+0x4e>
3e: 66 90 xchg ax,ax
40: 48 c7 44 24 08 01 00 mov QWORD PTR [rsp+0x8],0x1
47: 00 00
49: b9 01 00 00 00 mov ecx,0x1
4e: 48 89 74 cc 08 mov QWORD PTR [rsp+rcx*8+0x8],rsi
53: 48 83 7c 24 08 01 cmp QWORD PTR [rsp+0x8],0x1
59: 75 18 jne 73 <_ZN5test21f17h7c517ed04f8b4e1bE+0x73>
5b: 48 8b 74 24 10 mov rsi,QWORD PTR [rsp+0x10]
60: 48 39 f2 cmp rdx,rsi
63: 77 bb ja 20 <_ZN5test21f17h7c517ed04f8b4e1bE+0x20>
65: 48 8d 3d 00 00 00 00 lea rdi,[rip+0x0] # 6c <_ZN5test21f17h7c517ed04f8b4e1bE+0x6c>
6c: e8 00 00 00 00 call 71 <_ZN5test21f17h7c517ed04f8b4e1bE+0x71>
71: 0f 0b ud2
73: 48 83 c4 18 add rsp,0x18
77: c3 ret |
That's unfortunate. cc @eddyb |
cc @bluss |
Thanks @Deewiant We're looking into it |
@Deewiant >just testing the speed of the executable I am not sure if they share the same cause, first a significant compiler performance decrease using --emit=metadata , then the successive Nightly the decrease of the produced binary performance. |
They should not. However, quite a few big commits landed in the Dec 14 nightly. If you could provide an example that exhibits the slowdown with |
Bounce out the layout refactor from beta @eddyb's #45225 was supposed to get into get into 1.24, but due to an ordering mistake, it had snuck into 1.23. That wide-effect translation-changing PR had poked LLVM's weak corners and caused many regressions (3 of them have fixes I include here, but also #46897, #46845, #46449, #46371). I don't think it is a good idea to land it in the beta (1.23) because there are bound to be some regressions we didn't patch. Therefore, I am reverting it in time for stable, along with its related regression fixes. r? @michaelwoerister (I think)
The last nightly (2017-12-21) has fixed all three problems. Despite some attempts I failed finding a small amount of code that shows a significant compilation slowdown from 2017-12-14 to 2017-12-20, perhaps it's a more diffuse problem that emerges from compiling my whole big amount of code. I am sorry.
I don't fully understand your question, but I try to answer creating a table with some data for some of the last Nightlies:
|
EDIT: @SimonSapin told me about these files, so the range is edbd7d2...7eb64b8. EDIT2: at a quick glance, none of those changes should affect binary size or run time. |
They should be:
Mixing up things seems quite easy here. |
@leonardo-m Ah, that's why I asked, so the range is 7eb64b8...250b492. |
The --emit=metadata times are computed as the minimum of four tries. Their variance is very small (on the order of +-0.05) so I think they are reliable. The binary run-time is the minimum of 5 runs, their variance is higher perhaps because they do some I/O, despite using a SSD, about +-0.12. The compilation is slow so there I've done only one time, so those values are the less reliable, but differences above 1.5 seconds should be reliable. |
For the compilation time issue, maybe there was a bug in NLL that made it active without it being enabled. For the run-time issue, we need to actually investigate this. |
@arielb1 As in, investigate the cause of the run-time fluctuations and how it unregressed? |
This is my good old friend the LLVM optimizes ; Function Attrs: inlinehint nounwind uwtable
define internal fastcc void @"_ZN4core4iter5range93_$LT$impl$u20$core..iter..iterator..Iterator$u20$for$u20$core..ops..range..Range$LT$A$GT$$GT$4next17h9ca7473ffcbaaeeeE"(%"core::option::Option<usize>"* noalias nocapture sret dereferenceable(16), { i64, i64 }* nocapture dereferenceable(16) %self) unnamed_addr #3 personality i32 (i32, i32, i64, %"unwind::libunwind::_Unwind_Exception"*, %"unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
start:
%1 = getelementptr inbounds { i64, i64 }, { i64, i64 }* %self, i64 0, i32 0
%2 = getelementptr inbounds { i64, i64 }, { i64, i64 }* %self, i64 0, i32 1
%3 = load i64, i64* %1, align 8, !alias.scope !1, !noalias !4
%4 = load i64, i64* %2, align 8, !alias.scope !4, !noalias !1
%5 = icmp ult i64 %3, %4
br i1 %5, label %bb3, label %bb12
bb3: ; preds = %start
%6 = tail call { i64, i1 } @llvm.uadd.with.overflow.i64(i64 %3, i64 1) #6
%7 = extractvalue { i64, i1 } %6, 1
br i1 %7, label %bb12, label %bb7
bb7: ; preds = %bb3
%8 = extractvalue { i64, i1 } %6, 0
store i64 %8, i64* %1, align 1
%9 = getelementptr inbounds %"core::option::Option<usize>", %"core::option::Option<usize>"* %0, i64 0, i32 0, i64 0
store i64 1, i64* %9, align 8
br label %bb12
bb12: ; preds = %start, %bb7, %bb3
%.sink16 = phi i64 [ 1, %bb7 ], [ 0, %bb3 ], [ 0, %start ]
%.sink14 = phi i64 [ %3, %bb7 ], [ 0, %bb3 ], [ 0, %start ]
; WHAT?
%10 = getelementptr inbounds %"core::option::Option<usize>", %"core::option::Option<usize>"* %0, i64 0, i32 0, i64 %.sink16
store i64 %.sink14, i64* %10, align 8
ret void
} This is basically equivalent to this pseudo-Rust code struct Range {
start: u64,
end: u64
}
fn next(ret: &out Option<u64>, self: &mut Range) {
let Range { start, end } = *self;
let (offset, value) = if start < end {
if let Some(next) = start.checked_add(1) {
self.start = next;
ret.<discriminant> = Some;
(1, start)
} else {
(0, 0)
}
} else {
(0, 0)
};
// (sic) yes this writes a "value" to an "offset".
(ret as *mut [u64; 2])[offset] = value; // WHAT?
} After the "non-constant-address" write is generated, LLVM is sunk - LLVM has no way of getting rid of the non constant address, and no way of optimizing accesses to the target |
Upstream LLVM (that's the yet-to-be-released LLVM 6) has this patch, which avoids the SimplifyCfg sink problem by not running sinking before inlining I have confirmed that - e.g. this code is optimized correctly with the new define void @slowness_base_1([3 x i64] * noalias nocapture dereferenceable(24), i1) {
start:
br i1 %1, label %if_true, label %if_false
if_true:
%x = getelementptr inbounds [3 x i64], [3 x i64]* %0, i32 0, i32 1
store i64 1, i64* %x
br label %ret
if_false:
%y = getelementptr inbounds [3 x i64], [3 x i64]* %0, i32 0, i32 2
store i64 1, i64* %y
br label %ret
ret:
ret void
}
define i64 @slowness_base_2([3 x i64] * noalias nocapture dereferenceable(24), i1) {
start:
%allok = alloca [3 x i64]
%allok_px = getelementptr inbounds [3 x i64], [3 x i64]* %allok, i32 0, i32 2
store i64 2, i64* %allok_px
call void @slowness_base_1([3 x i64] * %allok, i1 %1)
%allok_p = getelementptr inbounds [3 x i64], [3 x i64]* %allok, i32 0, i32 2
%allok_val = load i64, i64* %allok_p
ret i64 %allok_val
} Until then, I think representing small enums using a struct (rather than an array) should work around this problem. |
With the 2017-12-22 Nightly all the problems are back:
|
Coud you open a PR that turns small unions into LLVM structs (rather than arrays)? That should fix the problem. |
The binaries generated by the last Nightly (1.24.0-nightly 4a7c072 2017-12-25) is further few percents slower (perhaps because of thinLTO?), the compilation times can't be compared now. |
@leonardo-m Is there more stable measure besides compile time / run time? They can be affected too easily by the environment especially when the run time is expected to take less than 1 second. The generated assembly size seems to be a good measure. |
rustc: don't use union layouts for tagged union enums. Fixes #46897, fixes #43517 (AFAICT from the testcases). This PR doesn't add any testcases, we should try to at least get perf ones (cc @Mark-Simulacrum). I couldn't find an example in those issues where the choice of LLVM array vs struct (with N identical fields) for padding filler types is still needed, *on top of* this change, to prevent excessive LLVM sinking. r? @arielb1
Please re-test next nightly and reopen if not fixed. |
The run-time of the binary is now OK. The compilation time with --emit=metadata is still quite large. |
@leonardo-m Can you open a separate one for compilation times? Seems to be a separate issue. |
In the 2017-12-14 Nightly I've seen a significant (almost 20%) compilation-time increase when I use
--emit=metadata
on my code. And the successive day (2017-12-15) the compilation got a bit slower still and the run-time of the binary also got about 8% slower. The successive days the situation kept going slightly worse. Extracting parts of my code to show compiler/code performance is not easy, so I've composed this benchmark function (that solves an Euler problem. If this code isn't enough I will write other similar benchmarks):Source code
Compiled with:
rustc -C opt-level=3 -C target-cpu=native -C panic=abort test.rs
host: x86_64-pc-windows-gnu
Runtime:
f8af59d 2017-12-13: 0.30 seconds
edbd7d2 2017-12-20: 0.53 seconds
Asm from the the 2017-12-13 compiler:
https://gist.github.com/anonymous/5c475819a2b7d0e01c036582c76cbd19
Asm from the the 2017-12-20 compiler:
https://gist.github.com/anonymous/bb984787c802da22b76c182d0f260872
The text was updated successfully, but these errors were encountered: