-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
&& operator chains (and ||, possibly) generates unoptimizable LLVM IR #83623
Comments
The Rust backend is not in a position to generate phis or code that uses SSA values directly in general. Note that the optimized IR for the Rust code is also SSA and is also exploiting phis. When comparing optimized IR there is ultimately just one difference between the eq0 case and the Clang IR – Rust code has a |
Would it be worth investigating improvements to LLVM for this, then? That would help Rust here but also help Clang if it ever generates similar things. |
There are a couple of avenues we could explore here, yes. We probably won't emit ideal code from Rust, but we could look into emitting something that LLVM does not trip over. Adding ability in the x86 backend to see past the offending pattern, such as seen here, might be beneficial in a more general way. Its LLVM itself that optimizes to this particular "springboard" block pattern too, so adjusting some of LLVM's optimization passes to produce something more canonical could be an option as well. |
Does this only fail to vectorize on x86(_64)? Should we check other backends? |
Just adding some info: |
@rijenkii The upgrade to LLVM 9 seems like the most likely culprit between those two releases. |
No, ARM fails to remove branches too. godbolt |
Understood. But my second suggested algorithm doesn't use this instructions. |
I wrote a benchmark Codeuse criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};
#[derive(Clone, Copy)]
pub struct Blueprint {
pub fuel_tank_size: u32,
pub payload: u32,
pub wheel_diameter: u32,
pub wheel_width: u32,
pub storage: u32,
}
pub fn eq0(a: &Blueprint, b: &Blueprint) -> bool {
(a.fuel_tank_size == b.fuel_tank_size)
&& (a.payload == b.payload)
&& (a.wheel_diameter == b.wheel_diameter)
&& (a.wheel_width == b.wheel_width)
&& (a.storage == b.storage)
}
pub fn eq2(a: &Blueprint, b: &Blueprint) -> bool {
if a.fuel_tank_size != b.fuel_tank_size {
return false;
}
if a.payload != b.payload {
return false;
}
if a.wheel_diameter != b.wheel_diameter {
return false;
}
if a.wheel_width != b.wheel_width {
return false;
}
if a.storage != b.storage {
return false;
}
true
}
struct Data {
original: Vec<Blueprint>,
diff_random_field: Vec<Blueprint>,
diff_last_field: Vec<Blueprint>,
}
fn generate_data(n: usize) -> Data {
use rand::prelude::{Rng, SeedableRng};
use rand_chacha::ChaCha8Rng;
let mut rng = ChaCha8Rng::seed_from_u64(357u64);
let mut data = vec![0; n * 5];
rng.fill(data.as_mut_slice());
let original: Vec<Blueprint> = data
.windows(5)
.map(|w| Blueprint {
fuel_tank_size: w[0],
payload: w[1],
wheel_diameter: w[2],
wheel_width: w[3],
storage: w[4],
})
.collect();
let diff_random_field: Vec<Blueprint> = original
.iter()
.map(|&x| unsafe {
let mut arr: [u32; 5] = std::mem::transmute(x);
let index = rng.gen_range(0..5);
arr[index] = arr[index].wrapping_add(5);
std::mem::transmute(arr)
})
.collect();
let diff_last_field: Vec<Blueprint> = original
.iter()
.copied()
.map(|mut x| {
x.storage += 5;
x
})
.collect();
Data {
original,
diff_random_field,
diff_last_field,
}
}
pub fn criterion_benchmark(c: &mut Criterion) {
let Data {
original,
diff_random_field,
diff_last_field,
} = generate_data(1000);
let mut group = c.benchmark_group("cmp");
let original_copy = original.clone();
group.bench_with_input(BenchmarkId::new("eq0: Self", 0), &original_copy, |b, d| {
b.iter(|| -> Vec<bool> {
original
.iter()
.zip(d.iter())
.map(|(a, b)| eq0(a, b))
.collect()
})
});
group.bench_with_input(
BenchmarkId::new("eq0: Random field", 0),
&diff_random_field,
|b, d| {
b.iter(|| -> Vec<bool> {
original
.iter()
.zip(d.iter())
.map(|(a, b)| eq0(a, b))
.collect()
})
},
);
group.bench_with_input(
BenchmarkId::new("eq0: Last field", 0),
&diff_last_field,
|b, d| {
b.iter(|| -> Vec<bool> {
original
.iter()
.zip(d.iter())
.map(|(a, b)| eq0(a, b))
.collect()
})
},
);
group.bench_with_input(BenchmarkId::new("eq2: Self", 0), &original_copy, |b, d| {
b.iter(|| -> Vec<bool> {
original
.iter()
.zip(d.iter())
.map(|(a, b)| eq2(a, b))
.collect()
})
});
group.bench_with_input(
BenchmarkId::new("eq2: Random field", 0),
&diff_random_field,
|b, d| {
b.iter(|| -> Vec<bool> {
original
.iter()
.zip(d.iter())
.map(|(a, b)| eq2(a, b))
.collect()
})
},
);
group.bench_with_input(
BenchmarkId::new("eq2: Last field", 0),
&diff_last_field,
|b, d| {
b.iter(|| -> Vec<bool> {
original
.iter()
.zip(d.iter())
.map(|(a, b)| eq2(a, b))
.collect()
})
},
);
group.finish();
}
criterion_group!(benches, criterion_benchmark);
criterion_main!(benches); Output:
As can be seen in criterion ouput, SIMD version works better in case when branch predictor cannot predict which field differ (Random field) while having same speed in other cases. Also, difference is big: 26% and I think that it matters even more for struct with more fields. |
@rijenkii #62993 is probably the cause of why the optimization disappeared. In 1.33, the same version where the SIMD optimizations appear, the following code doesn't optimize out the unnecessary modulo operations. pub fn is_leap_year1(year: i32) -> bool {
let div_4 = year % 4 == 0;
let div_100 = year % 100 == 0;
let div_400 = year % 400 == 0;
div_4 && !(div_100 && !div_400)
} The fix to revert the optimization in 1.33 was added for 1.38. |
@pedantic79 @rijenkii |
If it turns out that #83663 costs more than it saves in terms of runtime performance, would there be value in simply changing |
@mcronce Yes, that's a plan. |
…nd_to_get_better_asm, r=nagisa Simplify logical operations CFG This is basically same commit as e38e954 which was reverted later in 676953f In both cases, this changes weren't benchmarked. e38e954 leads to missed optimization described in [this issue](rust-lang#62993) 676953f leads to missed optimization described in [this issue](rust-lang#83623)
As PR merged, I think we can close this issue. |
Reopening to track the LLVM 13 regression. |
Upgrade to LLVM 13 Work in progress update to LLVM 13. Main changes: * InlineAsm diagnostics reported using SrcMgr diagnostic kind are now handled. Previously these used a separate diag handler. * Codegen tests are updated for additional attributes. * Some data layouts have changed. * Switch `#[used]` attribute from `llvm.used` to `llvm.compiler.used` to avoid SHF_GNU_RETAIN flag introduced in https://reviews.llvm.org/D97448, which appears to trigger a bug in older versions of gold. * Set `LLVM_INCLUDE_TESTS=OFF` to avoid Python 3.6 requirement. Upstream issues: * ~~https://bugs.llvm.org/show_bug.cgi?id=51210 (InlineAsm diagnostic reporting for module asm)~~ Fixed by llvm/llvm-project@1558bb8. * ~~https://bugs.llvm.org/show_bug.cgi?id=51476 (Miscompile on AArch64 due to incorrect comparison elimination)~~ Fixed by llvm/llvm-project@81b1065. * https://bugs.llvm.org/show_bug.cgi?id=51207 (Can't set custom section flags anymore). Problematic change reverted in our fork, https://reviews.llvm.org/D107216 posted for upstream revert. * https://bugs.llvm.org/show_bug.cgi?id=51211 (Regression in codegen for rust-lang#83623). This is an optimization regression that we may likely have to eat for this release. The fix for rust-lang#83623 was based on an incorrect premise, and this needs to be properly addressed in the MergeICmps pass. The [compile-time impact](https://perf.rust-lang.org/compare.html?start=ef9549b6c0efb7525c9b012148689c8d070f9bc0&end=0983094463497eec22d550dad25576a894687002) is mixed, but quite positive as LLVM upgrades go. The LLVM 13 final release is scheduled for Sep 21st. The current nightly is scheduled for stable release on Oct 21st. r? `@ghost`
Don't know if relevant, but there's another way to compare with nice resultpub struct Blueprint {
pub fuel_tank_size: u32,
pub payload: u32,
pub wheel_diameter: u32,
pub wheel_width: u32,
pub storage: u32,
}
impl PartialEq for Blueprint{
fn eq(&self, other: &Self)->bool{
(self.fuel_tank_size,
self.payload,
self.wheel_diameter,
self.wheel_width,
self.storage) ==
(other.fuel_tank_size,
other.payload,
other.wheel_diameter,
other.wheel_width,
other.storage)
}
} Since 1.54 generates: <example::Blueprint as core::cmp::PartialEq>::eq:
vmovdqu xmm0, xmmword ptr [rdi]
vpcmpeqd xmm0, xmm0, xmmword ptr [rsi]
vmovmskps eax, xmm0
cmp al, 15
mov eax, dword ptr [rdi + 16]
sete cl
cmp eax, dword ptr [rsi + 16]
sete al
and al, cl
ret |
@sdgoihew that depends on the fields being cc an issue I opened recently about the derived It seems that the rules on (Rust 1.47 -- well, more accurately the LLVM optimizer in that release -- optimized it away https://rust.godbolt.org/z/zxYKM17KT, but that's poison-incorrect, so it optimizing to that today would be a bug.) |
May be worth noting, Compiler Explorer for the latest stable (1.76) shows the good SIMDified version for the original example (and it seems to be not the first version with it): <example::Blueprint as core::cmp::PartialEq>::eq:
vmovdqu xmm0, xmmword ptr [rdi]
vmovd xmm1, dword ptr [rdi + 16]
vmovd xmm2, dword ptr [rsi + 16]
vpxor xmm1, xmm1, xmm2
vpxor xmm0, xmm0, xmmword ptr [rsi]
vpor xmm0, xmm1, xmm0
vptest xmm0, xmm0
sete al
ret
UPD (thanks, @sdgoihew): even the registers are the same for both code variants in the latest version. |
Both of the above versions compile identically for all aarch64 targets I tried, with and without SVE or NEON, because they prefer not to vectorize at all: https://rust.godbolt.org/z/ez7hhK7T7 <example::Blueprint as core::cmp::PartialEq>::eq:
ldp x8, x9, [x0]
ldp x10, x11, [x1]
ldr w12, [x0, #16]
cmp x8, x10
ldr w13, [x1, #16]
ccmp x9, x11, #0, eq
ccmp x12, x13, #0, eq
cset w0, eq
ret |
Curiously inlining seems to break the optimization. The inlined code still uses the inefficient jump based logic: https://godbolt.org/z/YbT9nKz6c. |
tl;dr: Current IR generated by && chain too hard to optimize for LLVM and always compiles to chain of jumps.
I started of investigation of this from this Reddit thread about lack of using SIMD instructions in PartialEq implementations.
Current Rust PartialEq
I assumed that PartialEq implementation generates code like:
handwritten eq
godbolt link for handwritten Eq
It is quite ineffective because have 5 branches which probably can be replaced by few SIMD instructions.
State on Clang land
So I decided to look how Clang compiles similar code (to know, if there some LLVM issue). So I written such code:
clang code and asm
And asm
Also, godbolt with Clang code.
As you see, Clang successfully optimizes the code to use SIMD instructions and doesn't ever generates branches.
Rust variants of good asm generation
I checked other code variants in Rust.
Rust variants and ASM
And godbolt link with variants
Function
eq0
uses&&
,eq1
uses&
so has different semantics,eq2
has similar semantics but optimized better.eq1
is very simple case (we have one block generated by rustc which easily optimized) and has different semantics so we skip it.We would use
eq0
andeq2
further.Investigation of LLVM IR
clang
andeq2
cases successfully proved that LLVM capable to optimization of && so I started to investigate generated LLVM IR and optimized LLVM IR. I decided to check differences of IR for clang, eq0 and eq2.I would put first generated LLVM IR, its diagram, and optimized LLVM IR for cases. Also, I used different code in files than godbolt so function names aren't match for case names.
Clang IR
I compiled code to LLVM IR using
clang++ is_sorted.cpp -O0 -S -emit-llvm
, removedoptnone
attribut manually, then looked optimizations usingopt -O3 -print-before-all -print-after-all 2>passes.ll
Code and diagrams
real code
So first generated LLVM IR
Picture of control flow
Optimized LLVM
As you see, Clang original code doesn't changed much, it removed only copying from source structs to temporary locals.
Original control flow is very clear and last block utilizes single phi node with a lot of inputs to fill result value.
Rust eq2 case
I generated LLVM code using such command:
rustc +nightly cmp.rs --emit=llvm-ir -C opt-level=3 -C codegen-units=1 --crate-type=rlib -C 'llvm-args=-print-after-all -print-before-all' 2>passes.ll
Rust eq2 case IR and graphs
real code
It generates such IR:
Which can be described with such picture
And after optimizations it would end with:
In general, algorithm can be described as
false
into the byte and jump to end.This indirect usage of byte is optimized in mem2reg phase to pretty SSA form and control flow remains forward only.
Rust eq0 case (which used in reality very often and optimizes bad)
IR code and control flow diagrams
Real code
So, rustc generates such code:
Which has such control flow
And get optimized to this:
Well, it is really hard to tell whats going on in the generated code. Control flow operators placed basically in reversed order (first checked condition put in last position, then it jumps in both cases back than jumps against after condition), and such behaviour doesn't change during optimization passes and in final generated ASM we end with a lot of jumps and miss SIMD usage.
It looks like that LLVM fails to reorganize this blocks in more natural order and probably fails to understand many temporary allocas.
Conclusions of LLVM IR research
Let's look the control flow diagrams last time.
Clang:
Rust eq2 (with manual early returns)
Rust eq0 with usage of
&&
operator.Finally, I have 2 ideas of new algorithms which can be generated by proper codegen for
&&
chains:First approach
We need exploit φ nodes with lots of inputs from Clang approach:
Pseudocode:
Probably, it is the best solution because LLVM tends to handle Clang approach better and this code is already in SSA form which is loved by optimizations.
Second approach
Pseudocode
This version is less friendly to the optimizer because we use pointer here but it would be converted to SSA form in mem2reg phase of optimization.
Implementing of such algorithms would probably require handling of chains of && operators as one prefix operator with many arguments e.g.
&&(arg0, arg1, ..., argN)
.I don't know which part of pipeline is needed to be changed to fix this and which my suggested generated code is easier to produce.
Also, very I expect same problem with
||
operator implementation too.Importance and final thoughts
This bug effectively prevents SIMD optimizations in most
#[derive(PartialEq)]
and some other places too so fixing it could lead to big performance gains. Hope my investigation of this bug would help to fix it.Also, sorry for possible weird wording and grammar mistakes because English isn't native language for me.
And finally, which rustc and clang I used:
The text was updated successfully, but these errors were encountered: