Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use more simd_* intrinsics #790

Merged
merged 17 commits into from
Dec 18, 2019
Merged

Conversation

bjorn3
Copy link
Member

@bjorn3 bjorn3 commented Jul 31, 2019

I currently only did this for x86. Also I skipped _mm_sqrt_ps and some more, as llvm emitted rsqrtps combined with a lot of extra instructions instead of sqrtps, causing slight rounding errors and non optimal codegen.

cc #788

@gnzlbg
Copy link
Contributor

gnzlbg commented Jul 31, 2019

Also I skipped _mm_sqrt_ps and some more, as llvm emitted rsqrtps combined with a lot of extra instructions instead of sqrtps, causing slight rounding errors and non optimal codegen.

I'll look into that.

@bjorn3
Copy link
Member Author

bjorn3 commented Jul 31, 2019

Got the same for _mm256_sqrt_ps. The pd versions were working correcly in both cases.

@bjorn3
Copy link
Member Author

bjorn3 commented Jul 31, 2019

#![feature(platform_intrinsics)]

extern crate core;

use core::arch::x86_64::__m128;

extern "platform-intrinsic" {
    fn simd_fsqrt<T>(a: T) -> T;
}

pub unsafe fn sqrt(a: __m128) -> __m128 {
    simd_fsqrt(a)
}

Optimized LLVM:

; playground::sqrt
; Function Attrs: nofree nounwind nonlazybind uwtable
define void @_ZN10playground4sqrt17h5d635885a5180697E(<4 x float>* noalias nocapture sret dereferenceable(16), <4 x float>* noalias nocapture readonly dereferenceable(16) %a) unnamed_addr #0 {
start:
  %1 = load <4 x float>, <4 x float>* %a, align 16
  %2 = tail call fast <4 x float> @llvm.sqrt.v4f32(<4 x float> %1)
  store <4 x float> %2, <4 x float>* %0, align 16
  ret void
}

Optimized asm:

.LCPI0_0:
	.long	3204448256              # float -0.5
	.long	3204448256              # float -0.5
	.long	3204448256              # float -0.5
	.long	3204448256              # float -0.5

.LCPI0_1:
	.long	3225419776              # float -3
	.long	3225419776              # float -3
	.long	3225419776              # float -3
	.long	3225419776              # float -3

playground::sqrt: # @playground::sqrt
# %bb.0:
	movq	%rdi, %rax
	movaps	(%rsi), %xmm0
	rsqrtps	%xmm0, %xmm1
	movaps	%xmm0, %xmm2
	mulps	%xmm1, %xmm2
	movaps	.LCPI0_0(%rip), %xmm3   # xmm3 = [-5.0E-1,-5.0E-1,-5.0E-1,-5.0E-1]
	mulps	%xmm2, %xmm3
	mulps	%xmm1, %xmm2
	addps	.LCPI0_1(%rip), %xmm2
	xorps	%xmm1, %xmm1
	cmpneqps	%xmm0, %xmm1
	mulps	%xmm3, %xmm2
	andps	%xmm2, %xmm1
	movaps	%xmm1, (%rdi)
	retq

@bjorn3
Copy link
Member Author

bjorn3 commented Jul 31, 2019

I have gone through every llvm intrinsic for x86 and x86_64 to see if there is a simd_* replacement.

Copy link
Contributor

@gnzlbg gnzlbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left some questions.

crates/core_arch/src/x86/fma.rs Show resolved Hide resolved
crates/core_arch/src/x86/avx.rs Show resolved Hide resolved
crates/core_arch/src/x86/sse.rs Outdated Show resolved Hide resolved
@gnzlbg gnzlbg closed this Aug 2, 2019
@gnzlbg gnzlbg reopened this Aug 2, 2019
@@ -255,7 +255,7 @@ pub unsafe fn _mm256_andnot_ps(a: __m256, b: __m256) -> __m256 {
#[cfg_attr(test, assert_instr(vmaxpd))]
#[stable(feature = "simd_x86", since = "1.27.0")]
pub unsafe fn _mm256_max_pd(a: __m256d, b: __m256d) -> __m256d {
maxpd256(a, b)
simd_fmax(a, b)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the behavior of these the same, e.g., for subnormals, when one argument contain NaNs, etc. ?

@@ -219,6 +220,7 @@ pub unsafe fn _mm_max_ss(a: __m128, b: __m128) -> __m128 {
#[cfg_attr(test, assert_instr(maxps))]
#[stable(feature = "simd_x86", since = "1.27.0")]
pub unsafe fn _mm_max_ps(a: __m128, b: __m128) -> __m128 {
// See the `test_mm_min_ps` test why this can't be implemented using `simd_fmax`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to add similar tests to the other intrinsics using simd_fmax and simd_fmin, that check subnormals, and also that check the behavior when the first argument is nan, and the second non-nan, and viceversa.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do I create a subnormal? As far as I understand they are close to zero, but I don't know how close.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do I create a subnormal?

Check out the docs for f{32,64}::is_normal(). Each floating-point type has a MIN_POSITIVE number, and all numbers between that one and zero (I think in range: (-MIN_POSITIVE, MIN_POSITIVE)) are subnormal. I don't know if creating them from a literal returns 0.0 or not. But if they do, then checking permutations of -0.0, 0.0, and NaN should be enough, e.g., (-0.0, 0.0), (0.0, -0.0), (1.0, NaN), (NaN, 1.0).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0.000000000000000000000000000000000000000000001f32.is_normal() returns false and transmuting it to [u8; 4] gives [1, 0, 0, 0]. Do you want to check permutations with that number too? Or should I just use 0.0?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to check permutations with that number too?

Yes, we should check that too :)

let b: [u8; 16] = transmute(b);
assert_eq!(r1, b);
assert_eq!(r2, a);
assert_ne!(a, b); // sanity check that -0.0 is actually present
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to also test here the behavior when the first argument is nan and the second is not, and vice versa (e.g. if the result the Nan? the second argument ? always the non-nan ? etc.).

crates/stdarch-test/src/lib.rs Show resolved Hide resolved
@gnzlbg gnzlbg closed this Aug 18, 2019
@gnzlbg gnzlbg reopened this Aug 18, 2019
@bjorn3
Copy link
Member Author

bjorn3 commented Sep 1, 2019

LLVM doesn't use the simd instructions for certain intrinsics on i586.

@gnzlbg
Copy link
Contributor

gnzlbg commented Sep 6, 2019

@bjorn3 maybe we could use the generic intrinsics in some cases (e.g. #[cfg(target_feature = "sse2")] ?), and the specific ones in others ?

@bjorn3 bjorn3 force-pushed the use_more_simd_x_intrinsics branch from 9019582 to 03a312c Compare November 26, 2019 19:34
@bjorn3
Copy link
Member Author

bjorn3 commented Nov 26, 2019

Rebased to trigger CI, as the old logs are no longer available.

@bjorn3
Copy link
Member Author

bjorn3 commented Nov 26, 2019

Windows build failed while installing rust:

Run rustup update nightly --no-self-update && rustup default nightly
At D:\a\_temp\0855049a-8a0a-4cb8-bf51-de53a4f07b31.ps1:2 char:40
+ rustup update nightly --no-self-update && rustup default nightly
+                                        ~~
The token '&&' is not a valid statement separator in this version.
+ CategoryInfo          : ParserError: (:) [], ParseException
+ FullyQualifiedErrorId : InvalidEndOfLine

@gnzlbg
Copy link
Contributor

gnzlbg commented Nov 27, 2019

rustup update nightly --no-self-update && rustup default nightly

Can you split this statement into two different lines and try again?

rustup update nightly --no-self-update
rustup default nightly

@makotokato
Copy link
Contributor

Windows build failed while installing rust:

This is fixed by ac59837

@bjorn3 bjorn3 force-pushed the use_more_simd_x_intrinsics branch from 03a312c to 4c7d4b5 Compare December 17, 2019 12:12
@gnzlbg gnzlbg closed this Dec 17, 2019
@gnzlbg gnzlbg reopened this Dec 17, 2019
@gnzlbg
Copy link
Contributor

gnzlbg commented Dec 17, 2019

Closing / reopening to re-trigger CI.

On i586 the simd_* intrinsics don't compile to MMX instructions, even
with `#[target_feature(enable = "mmx")]`.
@bjorn3
Copy link
Member Author

bjorn3 commented Dec 17, 2019

Reverted the mmx changes, as those are the ones not compiling to the required instruction.

@bjorn3
Copy link
Member Author

bjorn3 commented Dec 17, 2019

CI is finally happy!

@gnzlbg
Copy link
Contributor

gnzlbg commented Dec 18, 2019

Reverted the mmx changes, as those are the ones not compiling to the required instruction.

Uh, sorry, my fault, I should have caught this. Yes, mmx intrinsics (or those using the _m64 type in general) won't work with the generic simd_ intrinsics. I wouldn't worry about that, _m64 creates so many headaches that few people are using it, and also, chances are we will never stabilize it.

@gnzlbg gnzlbg merged commit b51ba3f into rust-lang:master Dec 18, 2019
@gnzlbg
Copy link
Contributor

gnzlbg commented Dec 18, 2019

Thank you @bjorn3 for working on this!

@bjorn3 bjorn3 deleted the use_more_simd_x_intrinsics branch December 18, 2019 17:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants