Use more simd_* intrinsics #790

bjorn3 · 2019-07-31T09:36:41Z

So far I have only done this for x86. I also skipped `_mm_sqrt_ps` and a few others, because LLVM emitted `rsqrtps` combined with a lot of extra instructions instead of `sqrtps`, causing slight rounding errors and non-optimal codegen.

cc #788

gnzlbg · 2019-07-31T09:39:07Z

Also I skipped _mm_sqrt_ps and some more, as llvm emitted rsqrtps combined with a lot of extra instructions instead of sqrtps, causing slight rounding errors and non optimal codegen.

I'll look into that.

bjorn3 · 2019-07-31T11:11:40Z

Got the same for `_mm256_sqrt_ps`. The `pd` versions were working correctly in both cases.

bjorn3 · 2019-07-31T11:32:47Z

#![feature(platform_intrinsics)]

extern crate core;

use core::arch::x86_64::__m128;

extern "platform-intrinsic" {
    fn simd_fsqrt<T>(a: T) -> T;
}

pub unsafe fn sqrt(a: __m128) -> __m128 {
    simd_fsqrt(a)
}

Optimized LLVM:

; playground::sqrt
; Function Attrs: nofree nounwind nonlazybind uwtable
define void @_ZN10playground4sqrt17h5d635885a5180697E(<4 x float>* noalias nocapture sret dereferenceable(16), <4 x float>* noalias nocapture readonly dereferenceable(16) %a) unnamed_addr #0 {
start:
  %1 = load <4 x float>, <4 x float>* %a, align 16
  %2 = tail call fast <4 x float> @llvm.sqrt.v4f32(<4 x float> %1)
  store <4 x float> %2, <4 x float>* %0, align 16
  ret void
}

Optimized asm:

.LCPI0_0:
	.long	3204448256              # float -0.5
	.long	3204448256              # float -0.5
	.long	3204448256              # float -0.5
	.long	3204448256              # float -0.5

.LCPI0_1:
	.long	3225419776              # float -3
	.long	3225419776              # float -3
	.long	3225419776              # float -3
	.long	3225419776              # float -3

playground::sqrt: # @playground::sqrt
# %bb.0:
	movq	%rdi, %rax
	movaps	(%rsi), %xmm0
	rsqrtps	%xmm0, %xmm1
	movaps	%xmm0, %xmm2
	mulps	%xmm1, %xmm2
	movaps	.LCPI0_0(%rip), %xmm3   # xmm3 = [-5.0E-1,-5.0E-1,-5.0E-1,-5.0E-1]
	mulps	%xmm2, %xmm3
	mulps	%xmm1, %xmm2
	addps	.LCPI0_1(%rip), %xmm2
	xorps	%xmm1, %xmm1
	cmpneqps	%xmm0, %xmm1
	mulps	%xmm3, %xmm2
	andps	%xmm2, %xmm1
	movaps	%xmm1, (%rdi)
	retq
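
Reading the assembly above, the sequence looks like an `rsqrtps` estimate refined by one Newton-Raphson step, with zero inputs masked out at the end. The `fast` flag on the `llvm.sqrt.v4f32` call is presumably what allows LLVM to make this substitution. Below is a scalar sketch of one lane, not the actual codegen: `approx_rsqrt` is a hypothetical stand-in for the hardware estimate (an exact reciprocal square root with the low 12 mantissa bits cleared), and `refined_sqrt` is an illustrative name.

```rust
/// Hypothetical stand-in for the `rsqrtps` estimate: an exact reciprocal
/// square root with the low 12 mantissa bits cleared to mimic low precision.
fn approx_rsqrt(x: f32) -> f32 {
    let exact = 1.0 / x.sqrt();
    f32::from_bits(exact.to_bits() & !0xFFF)
}

/// One lane of the refinement seen in the assembly:
/// (-0.5 * x * r) * (x * r * r - 3), forced to 0.0 when x == 0.
fn refined_sqrt(x: f32) -> f32 {
    let r = approx_rsqrt(x);
    let s = x * r; // mulps
    let half = -0.5 * s; // mulps with .LCPI0_0
    let t = s * r + -3.0; // mulps, then addps with .LCPI0_1
    let refined = half * t; // mulps
    // cmpneqps + andps: x == 0 would otherwise yield 0 * inf = NaN.
    if x != 0.0 { refined } else { 0.0 }
}

fn main() {
    for &x in &[0.0f32, 1.0, 2.0, 10.0, 12345.678] {
        let exact = x.sqrt();
        let approx = refined_sqrt(x);
        println!("x = {x}: sqrt = {exact}, refined = {approx}, diff = {}", exact - approx);
    }
}
```

The refined value is close to the correctly rounded square root but is not guaranteed to match it bit for bit, which would be consistent with the slight rounding errors reported.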

bjorn3 · 2019-07-31T11:49:58Z

I have gone through every llvm intrinsic for x86 and x86_64 to see if there is a simd_* replacement.

gnzlbg

I've left some questions.

crates/core_arch/src/x86/fma.rs

crates/core_arch/src/x86/avx.rs

crates/core_arch/src/x86/sse.rs

gnzlbg · 2019-08-05T11:29:06Z

crates/core_arch/src/x86/avx.rs

@@ -255,7 +255,7 @@ pub unsafe fn _mm256_andnot_ps(a: __m256, b: __m256) -> __m256 {
 #[cfg_attr(test, assert_instr(vmaxpd))]
 #[stable(feature = "simd_x86", since = "1.27.0")]
 pub unsafe fn _mm256_max_pd(a: __m256d, b: __m256d) -> __m256d {
-    maxpd256(a, b)
+    simd_fmax(a, b)


Is the behavior of these the same, e.g., for subnormals, or when one argument contains NaNs?

gnzlbg · 2019-08-05T11:31:38Z

crates/core_arch/src/x86/sse.rs

@@ -219,6 +220,7 @@ pub unsafe fn _mm_max_ss(a: __m128, b: __m128) -> __m128 {
 #[cfg_attr(test, assert_instr(maxps))]
 #[stable(feature = "simd_x86", since = "1.27.0")]
 pub unsafe fn _mm_max_ps(a: __m128, b: __m128) -> __m128 {
+    // See the `test_mm_min_ps` test why this can't be implemented using `simd_fmax`.


I think it would be better to add similar tests to the other intrinsics using `simd_fmax` and `simd_fmin` that check subnormals, and that also check the behavior when the first argument is NaN and the second is not, and vice versa.

How do I create a subnormal? As far as I understand they are close to zero, but I don't know how close.

How do I create a subnormal?

Check out the docs for `f{32,64}::is_normal()`. Each floating-point type has a `MIN_POSITIVE` constant, and all nonzero numbers between it and zero (I think in the range (-MIN_POSITIVE, MIN_POSITIVE)) are subnormal. I don't know whether creating one from a literal returns 0.0 or not. If it does, then checking permutations of -0.0, 0.0, and NaN should be enough, e.g., (-0.0, 0.0), (0.0, -0.0), (1.0, NaN), (NaN, 1.0).

0.000000000000000000000000000000000000000000001f32.is_normal() returns false and transmuting it to [u8; 4] gives [1, 0, 0, 0]. Do you want to check permutations with that number too? Or should I just use 0.0?
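
For reference, a minimal sketch of building a subnormal `f32` without a long literal, using only standard-library calls. It constructs the value from its bit pattern, which matches the `[1, 0, 0, 0]` bytes mentioned above.

```rust
use std::num::FpCategory;

fn main() {
    // Smallest positive subnormal f32: only the lowest mantissa bit is set.
    let tiny = f32::from_bits(1);
    assert_eq!(tiny.classify(), FpCategory::Subnormal);
    assert!(!tiny.is_normal());
    assert_eq!(tiny.to_le_bytes(), [1, 0, 0, 0]);

    // A short decimal literal rounds to the same value rather than to 0.0.
    assert_eq!(1e-45_f32.to_bits(), 1);

    // MIN_POSITIVE is the smallest normal value; half of it is subnormal.
    assert!(f32::MIN_POSITIVE.is_normal());
    assert!(!(f32::MIN_POSITIVE / 2.0).is_normal());

    // Zero is its own category, neither normal nor subnormal.
    assert_eq!(0.0_f32.classify(), FpCategory::Zero);
}
```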

Do you want to check permutations with that number too?

Yes, we should check that too :)

gnzlbg · 2019-08-05T11:32:33Z

crates/core_arch/src/x86/sse.rs

+        let b: [u8; 16] = transmute(b);
+        assert_eq!(r1, b);
+        assert_eq!(r2, a);
+        assert_ne!(a, b); // sanity check that -0.0 is actually present


I think we also need to test the behavior here when the first argument is NaN and the second is not, and vice versa (e.g., is the result the NaN? The second argument? Always the non-NaN value?).
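
To make the cases concrete, here is a scalar sketch of how one lane of `maxps` behaves as I understand the instruction: the second operand is returned whenever `a > b` is false, which covers NaN in either position and zeros of either sign. `maxps_lane` is a hypothetical helper for illustration, not code from this PR. It is contrasted with `f32::max`, which returns the non-NaN argument.

```rust
/// Scalar model of one lane of `maxps`: the second operand wins whenever
/// `a > b` is false (NaN in either position, or two zeros of any sign).
fn maxps_lane(a: f32, b: f32) -> f32 {
    if a > b { a } else { b }
}

fn main() {
    let nan = f32::NAN;

    // NaN handling: the model always returns the second operand.
    assert!(maxps_lane(1.0, nan).is_nan());
    assert_eq!(maxps_lane(nan, 1.0), 1.0);

    // `f32::max` returns the non-NaN argument regardless of position.
    assert_eq!(1.0_f32.max(nan), 1.0);
    assert_eq!(nan.max(1.0), 1.0);

    // Signed zeros: again the second operand, so the result depends on order.
    assert_eq!(maxps_lane(-0.0, 0.0).to_bits(), 0.0_f32.to_bits());
    assert_eq!(maxps_lane(0.0, -0.0).to_bits(), (-0.0_f32).to_bits());
}
```

The signed-zero assertions mirror the `r1 == b` and `r2 == a` checks in the test above; the NaN assertions are the additional cases being asked for.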

crates/stdarch-test/src/lib.rs

bjorn3 · 2019-09-01T12:40:18Z

LLVM doesn't use the simd instructions for certain intrinsics on i586.

gnzlbg · 2019-09-06T13:38:45Z

@bjorn3 maybe we could use the generic intrinsics in some cases (e.g., under `#[cfg(target_feature = "sse2")]`) and the specific ones in others?
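
A rough sketch of what that compile-time selection could look like. `add4` and both of its bodies are hypothetical placeholders on plain arrays; in the real crate one body would call the generic `simd_*` intrinsic and the other the target-specific LLVM intrinsic.

```rust
/// Hypothetical lane-wise add standing in for an intrinsic body.
/// Selected when SSE2 is enabled at compile time.
#[cfg(target_feature = "sse2")]
fn add4(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    // Placeholder for the generic `simd_*` path.
    [a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]]
}

/// Fallback selected on targets without SSE2 (e.g. i586).
#[cfg(not(target_feature = "sse2"))]
fn add4(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    // Placeholder for the target-specific intrinsic path.
    let mut out = [0.0; 4];
    for i in 0..4 {
        out[i] = a[i] + b[i];
    }
    out
}

fn main() {
    // Both paths must agree on the result; only the codegen differs.
    assert_eq!(add4([1.0; 4], [2.0; 4]), [3.0; 4]);
    println!("sse2 enabled at compile time: {}", cfg!(target_feature = "sse2"));
}
```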

bjorn3 · 2019-11-26T19:35:26Z

Rebased to trigger CI, as the old logs are no longer available.

bjorn3 · 2019-11-26T19:38:15Z

Windows build failed while installing rust:

Run rustup update nightly --no-self-update && rustup default nightly
At D:\a\_temp\0855049a-8a0a-4cb8-bf51-de53a4f07b31.ps1:2 char:40
+ rustup update nightly --no-self-update && rustup default nightly
+                                        ~~
The token '&&' is not a valid statement separator in this version.
+ CategoryInfo          : ParserError: (:) [], ParseException
+ FullyQualifiedErrorId : InvalidEndOfLine

gnzlbg · 2019-11-27T10:33:14Z

rustup update nightly --no-self-update && rustup default nightly

Can you split this statement into two different lines and try again?

rustup update nightly --no-self-update
rustup default nightly

makotokato · 2019-12-17T04:02:10Z

Windows build failed while installing rust:

This is fixed by ac59837

`rsqrtps %xmm0,%xmm1` used to match `sqrtps` without leading `r`.
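
The `rsqrtps`/`sqrtps` collision noted above comes down to substring versus prefix matching on a disassembly line. A minimal sketch, assuming the check works on one line of text at a time; `matches_anywhere` and `matches_prefix` are hypothetical helpers, not the actual `stdarch-test` code.

```rust
/// Hypothetical looser check: the expected mnemonic may appear anywhere.
fn matches_anywhere(line: &str, expected: &str) -> bool {
    line.contains(expected)
}

/// Hypothetical stricter check: the line must start with the mnemonic.
fn matches_prefix(line: &str, expected: &str) -> bool {
    line.trim_start().starts_with(expected)
}

fn main() {
    let line = "rsqrtps %xmm0,%xmm1";

    // Substring matching accepts the wrong instruction.
    assert!(matches_anywhere(line, "sqrtps"));

    // Prefix matching rejects it and still accepts the real one.
    assert!(!matches_prefix(line, "sqrtps"));
    assert!(matches_prefix("sqrtps %xmm0,%xmm1", "sqrtps"));
}
```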

gnzlbg · 2019-12-17T12:25:41Z

Closing / reopening to re-trigger CI.

On i586 the simd_* intrinsics don't compile to MMX instructions, even with `#[target_feature(enable = "mmx")]`.

bjorn3 · 2019-12-17T13:40:16Z

Reverted the mmx changes, as those are the ones not compiling to the required instruction.

bjorn3 · 2019-12-17T15:13:09Z

CI is finally happy!

gnzlbg · 2019-12-18T16:41:17Z

Reverted the mmx changes, as those are the ones not compiling to the required instruction.

Uh, sorry, my fault, I should have caught this. Yes, MMX intrinsics (or those using the `__m64` type in general) won't work with the generic `simd_*` intrinsics. I wouldn't worry about that: `__m64` creates so many headaches that few people use it, and chances are we will never stabilize it.

gnzlbg · 2019-12-18T16:41:34Z

Thank you @bjorn3 for working on this!

gnzlbg reviewed Jul 31, 2019

crates/core_arch/src/x86/fma.rs

crates/core_arch/src/x86/avx.rs

crates/core_arch/src/x86/sse.rs (outdated)

gnzlbg closed this Aug 2, 2019

gnzlbg reopened this Aug 2, 2019

gnzlbg reviewed Aug 5, 2019

gnzlbg closed this Aug 18, 2019

gnzlbg reopened this Aug 18, 2019

bjorn3 force-pushed the use_more_simd_x_intrinsics branch from 9019582 to 03a312c on November 26, 2019 19:34

bjorn3 added 12 commits December 17, 2019 13:12

Require prefix of instruction line to be the expected instruction

7f95135

`rsqrtps %xmm0,%xmm1` used to match `sqrtps` without leading `r`.

Use simd_fsqrt where possible

d839c95

Use simd_floor and simd_ceil where possible

f174370

Use simd_fma where possible

82d61a8

Add missing simd platform intrinsics

55dceda

Use simd_* in x86/mmx.rs where possible

ebf8f20

Use simd_fmin and simd_fmax for _mm_min_ps and _mm_max_ps

a346542

Use simd_saturating_* in x86/sse2.rs where possible

d274c7f

Use simd_* in x86/sse41.rs where possible

8e3cd3c

Use simd_* in x86/avx.rs where possible

0ba4efd

Use simd_* in x86/avx2.rs where possible

2880aab

Use <i32>::swap_bytes instead of llvm.bswap.i32

3872138
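
For context on the `swap_bytes` change, a small sketch of the standard-library method that replaces the LLVM `bswap` declaration. It reverses the byte order of an integer and is available on all primitive integer types; the constants below are arbitrary illustration values.

```rust
fn main() {
    // Byte order is reversed: 12 34 56 78 becomes 78 56 34 12.
    let x: i32 = 0x1234_5678;
    assert_eq!(x.swap_bytes(), 0x7856_3412);

    // The same method exists on i64.
    let y: i64 = 0x0102_0304_0506_0708;
    assert_eq!(y.swap_bytes(), 0x0807_0605_0403_0201);

    // Swapping twice gives back the original value.
    assert_eq!(x.swap_bytes().swap_bytes(), x);
}
```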

bjorn3 added 4 commits December 17, 2019 13:12

Remove some unused llvm intrinsic declarations

844cf86

Use <i64>::swap_bytes instead of llvm.bswap.i64

37048b5

Revert _mm_{min,max}_ps changes and add explanation why

ce89369

Rustfmt

4c7d4b5

bjorn3 force-pushed the use_more_simd_x_intrinsics branch from 03a312c to 4c7d4b5 on December 17, 2019 12:12

gnzlbg closed this Dec 17, 2019

gnzlbg reopened this Dec 17, 2019

Revert mmx changes

cc5ab19

On i586 the simd_* intrinsics don't compile to MMX instructions, even with `#[target_feature(enable = "mmx")]`.

gnzlbg merged commit b51ba3f into rust-lang:master Dec 18, 2019

bjorn3 deleted the use_more_simd_x_intrinsics branch December 18, 2019 17:40

TDecking mentioned this pull request Jun 16, 2024

Use stdsimd instead of LLVM imports in more instances. #1584

Merged
