speed up mem::swap #40454
Conversation
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @sfackler (or someone else) soon. If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes. Please see the contribution instructions for more information.
hm, got to be careful with the benchmarks. If llvm sucks (and it might), then what was previously optimized away in for loops, because of the memmove intrinsics that llvm special-cases, might now not be optimized. The relevant code here:

```rust
#[stable(feature = "rust1", since = "1.0.0")]
impl<A: Step> Iterator for ops::Range<A> where
    for<'a> &'a A: Add<&'a A, Output = A>
{
    type Item = A;

    #[inline]
    fn next(&mut self) -> Option<A> {
        if self.start < self.end {
            let mut n = self.start.add_one();
            // this line may not be optimized as well now for simple integer types
            // note this was the cause of unintentional recursion before ;)
            mem::swap(&mut n, &mut self.start);
            Some(n)
        } else {
            None
        }
    }

    #[inline]
    fn size_hint(&self) -> (usize, Option<usize>) {
        match Step::steps_between_by_one(&self.start, &self.end) {
            Some(hint) => (hint, Some(hint)),
            None => (0, None)
        }
    }
}
```
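To make the concern concrete, the kind of code at stake is any ordinary for loop over a range. The function below is a hypothetical illustration, not from the PR, but every iteration goes through the `next()` quoted above and therefore through `mem::swap`:

```rust
// Each pass through this loop calls ops::Range::next(), which swaps the
// freshly incremented counter into self.start via mem::swap, so swap's
// codegen feeds directly into plain integer loops.
fn sum_to(n: u64) -> u64 {
    let mut total = 0;
    for i in 0..n {
        total += i;
    }
    total
}

fn main() {
    assert_eq!(sum_to(10), 45);
}
```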
Testing confirms that llvm is unable to optimize loops as well when this implementation of mem::swap is used. For some reason even a manual implementation of swap -

```rust
pub fn swap<T: Copy>(x: &mut T, y: &mut T) {
    let t = *x;
    *x = *y;
    *y = t;
}
```

- performs slower than the llvm intrinsic version. Definitely something to investigate...
OK so the reason this simple swap function was slower seems to be that ...
I am trying to get something close to this: for an array of 2048 chars clang spits out the following:

```cpp
#include <array>

void swap(std::array<char, 2048>&__restrict__ x, std::array<char, 2048>&__restrict__ y) {
    std::swap(x, y);
}
```

```
clang -std=c++11 swap.cpp -S -O3
```

```asm
# BB#0:
xorl %eax, %eax
.p2align 4, 0x90
.LBB0_1: # =>This Inner Loop Header: Depth=1
movups (%rdi,%rax), %xmm0
movups 16(%rdi,%rax), %xmm1
movups (%rsi,%rax), %xmm2
movups 16(%rsi,%rax), %xmm3
movups %xmm2, (%rdi,%rax)
movups %xmm3, 16(%rdi,%rax)
movups %xmm0, (%rsi,%rax)
movups %xmm1, 16(%rsi,%rax)
movups 32(%rdi,%rax), %xmm0
movups 48(%rdi,%rax), %xmm1
movups 32(%rsi,%rax), %xmm2
movups 48(%rsi,%rax), %xmm3
movups %xmm2, 32(%rdi,%rax)
movups %xmm3, 48(%rdi,%rax)
movups %xmm0, 32(%rsi,%rax)
movups %xmm1, 48(%rsi,%rax)
addq $64, %rax
cmpq $2048, %rax # imm = 0x800
jne .LBB0_1
# BB#2:
retq
```

This seems like a nice efficient swap utilizing 4 registers. In Rust, on the other hand, I get code like the following:

```rust
use std::{mem, ptr};
fn swap<T>(x: &mut T, y: &mut T) {
    unsafe {
        // Give ourselves some scratch space to work with
        let mut t: [u8; 16] = mem::uninitialized();
        let x = x as *mut T as *mut u8;
        let y = y as *mut T as *mut u8;
        let t = &mut t as *mut _ as *mut u8;

        // can't use a for loop as the `range` impl calls `mem::swap` recursively
        let len = mem::size_of::<T>() as isize;
        let mut i = 0;
        while i + 16 <= len {
            // Perform the swap 16 bytes at a time, `&mut` pointers never alias
            ptr::copy_nonoverlapping(x.offset(i), t, 16);
            ptr::copy_nonoverlapping(y.offset(i), x.offset(i), 16);
            ptr::copy_nonoverlapping(t, y.offset(i), 16);
            i += 16;
        }
        if i < len {
            // Swap any remaining bytes
            let rem = (len - i) as usize;
            ptr::copy_nonoverlapping(x.offset(i), t, rem);
            ptr::copy_nonoverlapping(y.offset(i), x.offset(i), rem);
            ptr::copy_nonoverlapping(t, y.offset(i), rem);
        }
    }
}

#[no_mangle]
pub fn swap_array(x: &mut [u8; 2048], y: &mut [u8; 2048]) {
    swap(x, y)
}
```

```
rustc swap.rs --crate-type=lib --emit=asm -C opt-level=3
```

```asm
swap_array:
.cfi_startproc
xorl %eax, %eax
.p2align 4, 0x90
.LBB0_1:
movups (%rdi,%rax), %xmm0
movaps %xmm0, -24(%rsp)
movups (%rsi,%rax), %xmm0
movups %xmm0, (%rdi,%rax)
movaps -24(%rsp), %xmm0
movups %xmm0, (%rsi,%rax)
movups 16(%rdi,%rax), %xmm0
movaps %xmm0, -24(%rsp)
movups 16(%rsi,%rax), %xmm0
movups %xmm0, 16(%rdi,%rax)
movaps -24(%rsp), %xmm0
movups %xmm0, 16(%rsi,%rax)
movups 32(%rdi,%rax), %xmm0
movaps %xmm0, -24(%rsp)
movups 32(%rsi,%rax), %xmm0
movups %xmm0, 32(%rdi,%rax)
movaps -24(%rsp), %xmm0
movups %xmm0, 32(%rsi,%rax)
movups 48(%rdi,%rax), %xmm0
movaps %xmm0, -24(%rsp)
movups 48(%rsi,%rax), %xmm0
movups %xmm0, 48(%rdi,%rax)
movaps -24(%rsp), %xmm0
movups %xmm0, 48(%rsi,%rax)
leaq 64(%rax), %rcx
addq $80, %rax
cmpq $2049, %rax
movq %rcx, %rax
jl .LBB0_1
retq
```

It is for some reason not eliminating the stores to and loads from the stack (the variable `t`).
I can get it to work by forcing the alignment of `t`...
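For reference, one way to force that alignment is a wrapper type with `repr(align)`; this is only an illustration of the mechanism (the SIMD version further down uses a `repr(simd)` struct, which likewise carries a larger alignment than a plain `[u8; 16]`):

```rust
// A 16-byte scratch buffer with guaranteed 16-byte alignment, versus the
// 1-byte alignment of a bare [u8; 16].
#[repr(align(16))]
struct Scratch([u8; 16]);

fn main() {
    assert_eq!(std::mem::align_of::<[u8; 16]>(), 1);
    assert_eq!(std::mem::align_of::<Scratch>(), 16);

    let t = Scratch([0; 16]);
    assert_eq!(t.0[0], 0);
    // The buffer's address is now always a multiple of 16.
    assert_eq!(&t as *const Scratch as usize % 16, 0);
}
```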
src/libcore/mem.rs (outdated)

```rust
        }
        if i < len {
            // Swap any remaining bytes
            let rem = (len - i) as usize;
```
I don't know if it ever matters, but it occurs to me that calculating `rem` as `len % 16` instead of `len - i` might be more obvious to an optimizer, as knowing what `len - i` is requires reasoning about a loop. Similarly, it would be better to test whether `rem == 0` than to test `i < len`, as it is more trivially determined at compile time.

I doubt this matters in most cases, but I could see the optimizer failing when the size is large, since the loop might not be fully unrolled.
I tested it; looks like the generated assembly is the same either way. `len` here is a constant known at compile time so I guess it just gets completely eliminated.
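For what it's worth, a quick stand-alone check (hypothetical code, not part of the PR) confirms the two formulations are interchangeable once the 16-byte loop exits:

```rust
// After the `while i + 16 <= len` loop, `i` is the largest multiple of 16
// that fits in `len`, so `len - i` equals `len % 16`, and the two tail
// guards (`i < len` vs. `rem != 0`) agree.
fn main() {
    for len in 0usize..1000 {
        let mut i = 0;
        while i + 16 <= len {
            i += 16; // swap body elided; only the trip count matters here
        }
        assert_eq!(len - i, len % 16);
        assert_eq!(i < len, len % 16 != 0);
    }
    println!("len - i == len % 16 for every tested length");
}
```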
src/libcore/mem.rs (outdated)

```diff
@@ -448,17 +448,29 @@ pub unsafe fn uninitialized<T>() -> T {
 pub fn swap<T>(x: &mut T, y: &mut T) {
     unsafe {
         // Give ourselves some scratch space to work with
-        let mut t: T = uninitialized();
+        let mut t: [u8; 16] = uninitialized();
```
Minor point: Instead of using a literal `16` here, it might be slightly better to stick a `const SWAP_BLOCK_SIZE: usize = 16;` at the top of the function. This would make it slightly less error-prone to tweak the value later, and it would make it easy to, e.g., pick different sizes for different architectures.
this is a good suggestion, thanks
for reference, the actual block size didn't matter too much but that was just on my machine - 16 may very well not be optimal
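For illustration only, the suggestion amounts to something like the sketch below; the constant name comes from the review comment, and the per-architecture values are hypothetical placeholders rather than tuned numbers:

```rust
use std::{mem, ptr};

fn swap_sketch<T>(x: &mut T, y: &mut T) {
    // Named block size, so it can be tweaked in one place (and, if desired,
    // varied by target). These cfg values are examples, not recommendations.
    #[cfg(target_arch = "x86_64")]
    const SWAP_BLOCK_SIZE: usize = 16;
    #[cfg(not(target_arch = "x86_64"))]
    const SWAP_BLOCK_SIZE: usize = 8;

    unsafe {
        let mut t = [0u8; SWAP_BLOCK_SIZE];
        let t = t.as_mut_ptr();
        let x = x as *mut T as *mut u8;
        let y = y as *mut T as *mut u8;
        let len = mem::size_of::<T>();

        let mut i = 0;
        while i + SWAP_BLOCK_SIZE <= len {
            ptr::copy_nonoverlapping(x.add(i), t, SWAP_BLOCK_SIZE);
            ptr::copy_nonoverlapping(y.add(i), x.add(i), SWAP_BLOCK_SIZE);
            ptr::copy_nonoverlapping(t, y.add(i), SWAP_BLOCK_SIZE);
            i += SWAP_BLOCK_SIZE;
        }
        if i < len {
            let rem = len - i;
            ptr::copy_nonoverlapping(x.add(i), t, rem);
            ptr::copy_nonoverlapping(y.add(i), x.add(i), rem);
            ptr::copy_nonoverlapping(t, y.add(i), rem);
        }
    }
}

fn main() {
    let mut a = [1u64; 8];
    let mut b = [2u64; 8];
    swap_sketch(&mut a, &mut b);
    assert_eq!(a, [2u64; 8]);
    assert_eq!(b, [1u64; 8]);
}
```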
cc @rust-lang/compiler This seems much harder to ever perform MIR optimizations on.
This is extremely performance-sensitive stuff (used in hashmap for example), thanks for working on it.
Please add comments to the source code explaining why this code is so complex.
Did you try a 32-byte buffer? This way the compiler could emit code that uses 2 xmm registers in parallel. Edit: After giving this a deeper look, I think for it to be really optimal we'll probably need multiple versions depending on the size of T.
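To make the "multiple versions depending on the size of T" idea concrete, a dispatch might look roughly like the sketch below. The 32-byte threshold is made up, and the large-type branch simply defers to `ptr::swap_nonoverlapping`, which is where the block-swapping logic from this PR was later exposed:

```rust
use std::{mem, ptr};

// Hypothetical size-based dispatch; size_of::<T>() is a compile-time
// constant, so the branch folds away for any concrete T.
fn swap_by_size<T>(x: &mut T, y: &mut T) {
    let xp = x as *mut T;
    let yp = y as *mut T;
    unsafe {
        if mem::size_of::<T>() <= 32 {
            // Small types: a single temporary and three copies, which LLVM
            // lowers to plain register moves.
            let mut t = mem::MaybeUninit::<T>::uninit();
            ptr::copy_nonoverlapping(xp, t.as_mut_ptr(), 1);
            ptr::copy_nonoverlapping(yp, xp, 1);
            ptr::copy_nonoverlapping(t.as_ptr(), yp, 1);
        } else {
            // Larger types: block-wise swap through a small scratch buffer.
            ptr::swap_nonoverlapping(xp, yp, 1);
        }
    }
}

fn main() {
    let mut a = (1u64, 2u64);
    let mut b = (3u64, 4u64);
    swap_by_size(&mut a, &mut b);
    assert_eq!(a, (3, 4));
    assert_eq!(b, (1, 2));
}
```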
OK so I did some more testing and it seems the SIMD version is indeed faster - this makes sense to me. On a struct of size 2048 I get 630 cycles for the SIMD version and 990 cycles for the 16-byte version. It would be nice to find a way to get llvm to use the SIMD instructions without resorting to unstable features though. On my machine the u64x4 and u64x8 versions had exactly the same performance profile; the C++ version earlier used only 4 registers, which apparently is optimal. I had a u64x16 version but it was slower. Here is the SIMD version for reference:

```rust
fn swap_u64x4<T>(x: &mut T, y: &mut T) {
    unsafe {
        #[allow(non_camel_case_types)]
        #[repr(simd)]
        struct u64x4(u64, u64, u64, u64);

        let mut t: u64x4 = mem::uninitialized();
        let x = x as *mut T as *mut u8;
        let y = y as *mut T as *mut u8;
        let t = &mut t as *mut _ as *mut u8;

        let len = mem::size_of::<T>() as isize;
        let block_sz = mem::size_of::<u64x4>();
        let mut i = 0;
        while i + block_sz as isize <= len {
            ptr::copy_nonoverlapping(x.offset(i), t, block_sz);
            ptr::copy_nonoverlapping(y.offset(i), x.offset(i), block_sz);
            ptr::copy_nonoverlapping(t, y.offset(i), block_sz);
            i += block_sz as isize;
        }
        if i < len {
            let rem = (len - i) as usize;
            ptr::copy_nonoverlapping(x.offset(i), t, rem);
            ptr::copy_nonoverlapping(y.offset(i), x.offset(i), rem);
            ptr::copy_nonoverlapping(t, y.offset(i), rem);
        }
    }
}

#[no_mangle]
pub fn swap_array_u64x4(x: &mut [u8; 2048], y: &mut [u8; 2048]) {
    swap_u64x4(x, y)
}
```

with the generated assembly:

```asm
swap_array_u64x4:
.cfi_startproc
xorl %eax, %eax
.p2align 4, 0x90
.LBB0_1:
movups (%rdi,%rax), %xmm0
movups 16(%rdi,%rax), %xmm1
movups (%rsi,%rax), %xmm2
movups 16(%rsi,%rax), %xmm3
movups %xmm3, 16(%rdi,%rax)
movups %xmm2, (%rdi,%rax)
movups %xmm1, 16(%rsi,%rax)
movups %xmm0, (%rsi,%rax)
movups 32(%rdi,%rax), %xmm0
movups 48(%rdi,%rax), %xmm1
movups 32(%rsi,%rax), %xmm2
movups 48(%rsi,%rax), %xmm3
movups %xmm3, 48(%rdi,%rax)
movups %xmm2, 32(%rdi,%rax)
movups %xmm1, 48(%rsi,%rax)
movups %xmm0, 32(%rsi,%rax)
movups 64(%rdi,%rax), %xmm0
movups 80(%rdi,%rax), %xmm1
movups 64(%rsi,%rax), %xmm2
movups 80(%rsi,%rax), %xmm3
movups %xmm3, 80(%rdi,%rax)
movups %xmm2, 64(%rdi,%rax)
movups %xmm1, 80(%rsi,%rax)
movups %xmm0, 64(%rsi,%rax)
movups 96(%rdi,%rax), %xmm0
movups 112(%rdi,%rax), %xmm1
movups 96(%rsi,%rax), %xmm2
movups 112(%rsi,%rax), %xmm3
movups %xmm3, 112(%rdi,%rax)
movups %xmm2, 96(%rdi,%rax)
movups %xmm1, 112(%rsi,%rax)
movups %xmm0, 96(%rsi,%rax)
movq %rax, %rcx
subq $-128, %rcx
addq $160, %rax
cmpq $2049, %rax
movq %rcx, %rax
jl .LBB0_1
retq
```

The code I am using to time it...

```rust
unsafe fn time(swap: fn(&mut [u8; ARR_LENGTH], &mut [u8; ARR_LENGTH])) -> u64 {
    let mut x: [u8; ARR_LENGTH] = [42; ARR_LENGTH];
    let mut y: [u8; ARR_LENGTH] = [69; ARR_LENGTH];

    // Warm up and serialize before timing
    asm!("cpuid
          cpuid
          cpuid
          cpuid
          cpuid"
         :
         :
         : "rax", "rbx", "rcx", "rdx"
         : "volatile");

    let mut min_diff = !0;
    let mut max_diff = 0;
    let mut i = 0;
    while i < 80 {
        let before_high: u32;
        let before_low: u32;
        let after_high: u32;
        let after_low: u32;

        // Serialize, then read the TSC before the swap
        asm!("cpuid
              rdtsc
              movl %edx, $0
              movl %eax, $1"
             : "=r" (before_high), "=r" (before_low)
             :
             : "rax", "rbx", "rcx", "rdx"
             : "volatile");

        swap(&mut x, &mut y);

        // Read the TSC after the swap, then serialize
        asm!("rdtscp
              movl %edx, $0
              movl %eax, $1
              cpuid"
             : "=r" (after_high), "=r" (after_low)
             :
             : "rax", "rbx", "rcx", "rdx"
             : "volatile");

        // Keep the buffers observed so the swap isn't optimized out
        asm!("" :: "r"(&x), "r"(&y));

        // Combine the high and low halves of the 64-bit counter
        let diff = ((after_high as u64) << 32 | after_low as u64)
                 - ((before_high as u64) << 32 | before_low as u64);
        if diff < min_diff { min_diff = diff; }
        if diff > max_diff { max_diff = diff; }
        i += 1;
    }
    return min_diff;
}

fn noop(_: &mut [u8; ARR_LENGTH], _: &mut [u8; ARR_LENGTH]) {}

fn main() {
    unsafe {
        let min_timing = time(noop);
        let min_execution_time_16 = time(swap_array_16);
        let min_execution_time_32 = time(swap_array_32);
        let min_execution_time_64 = time(swap_array_64);
        let min_execution_time_u64x4 = time(swap_array_u64x4);
        let min_execution_time_u64x8 = time(swap_array_u64x8);
        let min_execution_time_u64x16 = time(swap_array_u64x16);
        println!("arr length {}", ARR_LENGTH);
        println!("min timing {}", min_timing);
        println!("16_bytes {}", min_execution_time_16 - min_timing);
        println!("32_bytes {}", min_execution_time_32 - min_timing);
        println!("64_bytes {}", min_execution_time_64 - min_timing);
        println!("simd_u64x4 {}", min_execution_time_u64x4 - min_timing);
        println!("simd_u64x8 {}", min_execution_time_u64x8 - min_timing);
        println!("simd_u64x16 {}", min_execution_time_u64x16 - min_timing);
    }
}
```

gives
I've spent some time on this and there are some cases where the old code is better (T sizes: 57, 71, 73, 77; for ex.) Thoughts?
@arthurprs I see a similar effect - as far as I can tell the xmm registers are being used, but for certain sizes the old code is better. I'm going to see if there's a way to make it faster in every case, but for now here's the full data (NOTE I timed these using 10 iterations each, count is in cycles):
One thing to note is that sizes like 127 are probably not that common - for instance, if any of the elements of the struct is a u64, the size will be a multiple of 8. Looking at just the multiples of 8, the new code is faster every time.
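That follows from layout rules: a type's size is always a multiple of its alignment, and a u64 field pushes the whole struct's alignment up to 8 on targets like x86_64. A small illustrative check (the struct name is made up):

```rust
use std::mem::{align_of, size_of};

// On targets where u64 is 8-byte aligned (e.g. x86_64), a struct containing
// a u64 gets its size rounded up to a multiple of 8.
struct WithU64 {
    a: u64,
    b: u8, // 9 bytes of payload...
}

fn main() {
    println!("align_of::<u64>()     = {}", align_of::<u64>());
    println!("size_of::<WithU64>()  = {}", size_of::<WithU64>()); // ...padded to 16 on x86_64
    // Guaranteed on every target: size is a multiple of alignment.
    assert_eq!(size_of::<WithU64>() % align_of::<WithU64>(), 0);
}
```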
My concern is that using general-purpose registers inside very hot loops will throw off the optimizer. I'll try to come up with test cases or use the hashmap code.
Ok, I'm probably over-thinking this. Either way it's important to check results on x86 and some ARM variant.
src/libcore/mem.rs (outdated)

```diff
@@ -109,7 +109,7 @@ pub use intrinsics::transmute;
 /// [`Clone`][clone]. You need the value's destructor to run only once,
 /// because a double `free` is undefined behavior.
 ///
-/// An example is the definition of [`mem::swap`][swap] in this module:
+/// An example is the (old) definition of [`mem::swap`][swap] in this module:
```
I would phrase this differently - maybe "An example is a possible implementation of mem::swap" or something like that.
I agree we should not intimate the existence of a former impl.
Is there more investigation that needs to be done or is this good to go?
I believe my MIR optimization concerns can be alleviated by moving this optimization into rustc or LLVM later on, so this PR wouldn't necessarily have any lasting negative impact on that (to counter its wins).
Nah that looks legitimate, seems vectorized ops leaked into the backend, which emscripten couldn't handle. Dunno if that's a bug with us or emscripten, but a legitimate bug regardless!
This is a bug in emscripten the way I see it. It is not likely this will get fixed in emscripten any time soon, so the next best fix is a ...
Thanks for persevering @djzin. Cool patch.
@sfackler should be fixed now...
@bors r=sfackler
📌 Commit 83f1f11 has been approved by sfackler
speed up mem::swap

I would have thought that the mem::swap code didn't need an intermediate variable precisely because the pointers are guaranteed never to alias. And.. it doesn't! It seems that llvm will also auto-vectorize this case for large structs, but alas it doesn't seem to have all the aliasing info it needs and so will add redundant checks (and even not bother with autovectorizing for small types). Looks like a lot of performance could still be gained here, so this might be a good test case for future optimizer improvements.

Here are the current benchmarks for the simd version of mem::swap; the timings are in cycles (code below) measured with 10 iterations. The timings for sizes > 32 which are not a multiple of 8 tend to be ever so slightly faster in the old code, but not always. For large struct sizes (> 1024) the new code shows a marked improvement.

\* = latest commit
† = subtracted from other measurements

| arr_length | noop<sup>†</sup> | rust_stdlib | simd_u64x4\* | simd_u64x8 |
|---|---|---|---|---|
| 8 | 80 | 90 | 90 | 90 |
| 16 | 72 | 177 | 177 | 177 |
| 24 | 32 | 76 | 76 | 76 |
| 32 | 68 | 188 | 112 | 188 |
| 40 | 32 | 80 | 60 | 80 |
| 48 | 32 | 84 | 56 | 84 |
| 56 | 32 | 108 | 72 | 108 |
| 64 | 32 | 108 | 72 | 76 |
| 72 | 80 | 350 | 220 | 230 |
| 80 | 80 | 350 | 220 | 230 |
| 88 | 80 | 420 | 270 | 270 |
| 96 | 80 | 420 | 270 | 270 |
| 104 | 80 | 500 | 320 | 320 |
| 112 | 80 | 490 | 320 | 320 |
| 120 | 72 | 528 | 342 | 342 |
| 128 | 48 | 360 | 234 | 234 |
| 136 | 72 | 987 | 387 | 387 |
| 144 | 80 | 1070 | 420 | 420 |
| 152 | 64 | 856 | 376 | 376 |
| 160 | 68 | 804 | 400 | 400 |
| 168 | 80 | 1060 | 520 | 520 |
| 176 | 80 | 1070 | 520 | 520 |
| 184 | 32 | 464 | 228 | 228 |
| 192 | 32 | 504 | 228 | 228 |
| 200 | 32 | 440 | 248 | 248 |
| 208 | 72 | 987 | 573 | 573 |
| 216 | 80 | 1464 | 220 | 220 |
| 224 | 48 | 852 | 450 | 450 |
| 232 | 72 | 1182 | 666 | 666 |
| 240 | 32 | 428 | 288 | 288 |
| 248 | 32 | 428 | 308 | 308 |
| 256 | 80 | 860 | 770 | 770 |
| 264 | 80 | 1130 | 820 | 820 |
| 272 | 80 | 1340 | 820 | 820 |
| 280 | 80 | 1220 | 870 | 870 |
| 288 | 72 | 1227 | 804 | 804 |
| 296 | 72 | 1356 | 849 | 849 |
☀️ Test successful - status-appveyor, status-travis
Reuse the mem::swap optimizations to speed up slice::rotate This is most helpful for compound types where LLVM didn't vectorize the loop. Highlight: bench slice::rotate_medium_by727_strings gets 38% faster. Exposes the swapping logic from PR #40454 as `pub unsafe fn ptr::swap_nonoverlapping` under library feature `swap_nonoverlapping` #42818. (The new method seemed plausible, and was the simplest way to share the logic. I'm not attached to it, though, so let me know if a different way would be better.)
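For readers unfamiliar with that API, here is a minimal usage sketch of `ptr::swap_nonoverlapping` (the function has since been stabilized in `std::ptr`; the values are made up):

```rust
use std::ptr;

fn main() {
    let mut a = [1u32, 2, 3, 4];
    let mut b = [9u32, 8, 7, 6];

    // Safety: the two regions are distinct locals, so they never overlap,
    // and both are valid for reads and writes of 4 u32s.
    unsafe {
        ptr::swap_nonoverlapping(a.as_mut_ptr(), b.as_mut_ptr(), 4);
    }

    assert_eq!(a, [9, 8, 7, 6]);
    assert_eq!(b, [1, 2, 3, 4]);
}
```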
Disable big-endian simd in swap_nonoverlapping_bytes This is a workaround for rust-lang#42778, which was git-bisected to rust-lang#40454's optimizations to `mem::swap`, later moved to `ptr` in rust-lang#42819. Natively compiled rustc couldn't even compile stage1 libcore on powerpc64 and s390x, but they work fine without this `repr(simd)`. Since powerpc64le works OK, it seems probably related to being big-endian. The underlying problem is not yet known, but this at least makes those architectures functional again in the meantime. cc @arielb1