-
Notifications
You must be signed in to change notification settings - Fork 276
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use more simd_* intrinsics #790
Conversation
I'll look into that. |
Got the same for |
#![feature(platform_intrinsics)]
extern crate core;
use core::arch::x86_64::__m128;
extern "platform-intrinsic" {
fn simd_fsqrt<T>(a: T) -> T;
}
pub unsafe fn sqrt(a: __m128) -> __m128 {
simd_fsqrt(a)
} Optimized LLVM: ; playground::sqrt
; Function Attrs: nofree nounwind nonlazybind uwtable
define void @_ZN10playground4sqrt17h5d635885a5180697E(<4 x float>* noalias nocapture sret dereferenceable(16), <4 x float>* noalias nocapture readonly dereferenceable(16) %a) unnamed_addr #0 {
start:
%1 = load <4 x float>, <4 x float>* %a, align 16
%2 = tail call fast <4 x float> @llvm.sqrt.v4f32(<4 x float> %1)
store <4 x float> %2, <4 x float>* %0, align 16
ret void
} Optimized asm: .LCPI0_0:
.long 3204448256 # float -0.5
.long 3204448256 # float -0.5
.long 3204448256 # float -0.5
.long 3204448256 # float -0.5
.LCPI0_1:
.long 3225419776 # float -3
.long 3225419776 # float -3
.long 3225419776 # float -3
.long 3225419776 # float -3
playground::sqrt: # @playground::sqrt
# %bb.0:
movq %rdi, %rax
movaps (%rsi), %xmm0
rsqrtps %xmm0, %xmm1
movaps %xmm0, %xmm2
mulps %xmm1, %xmm2
movaps .LCPI0_0(%rip), %xmm3 # xmm3 = [-5.0E-1,-5.0E-1,-5.0E-1,-5.0E-1]
mulps %xmm2, %xmm3
mulps %xmm1, %xmm2
addps .LCPI0_1(%rip), %xmm2
xorps %xmm1, %xmm1
cmpneqps %xmm0, %xmm1
mulps %xmm3, %xmm2
andps %xmm2, %xmm1
movaps %xmm1, (%rdi)
retq |
I have gone through every llvm intrinsic for x86 and x86_64 to see if there is a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left some questions.
@@ -255,7 +255,7 @@ pub unsafe fn _mm256_andnot_ps(a: __m256, b: __m256) -> __m256 { | |||
#[cfg_attr(test, assert_instr(vmaxpd))] | |||
#[stable(feature = "simd_x86", since = "1.27.0")] | |||
pub unsafe fn _mm256_max_pd(a: __m256d, b: __m256d) -> __m256d { | |||
maxpd256(a, b) | |||
simd_fmax(a, b) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the behavior of these the same, e.g., for subnormals, when one argument contain NaNs, etc. ?
@@ -219,6 +220,7 @@ pub unsafe fn _mm_max_ss(a: __m128, b: __m128) -> __m128 { | |||
#[cfg_attr(test, assert_instr(maxps))] | |||
#[stable(feature = "simd_x86", since = "1.27.0")] | |||
pub unsafe fn _mm_max_ps(a: __m128, b: __m128) -> __m128 { | |||
// See the `test_mm_min_ps` test why this can't be implemented using `simd_fmax`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be better to add similar tests to the other intrinsics using simd_fmax
and simd_fmin
, that check subnormals, and also that check the behavior when the first argument is nan, and the second non-nan, and viceversa.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do I create a subnormal? As far as I understand they are close to zero, but I don't know how close.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do I create a subnormal?
Check out the docs for f{32,64}::is_normal()
. Each floating-point type has a MIN_POSITIVE
number, and all numbers between that one and zero (I think in range: (-MIN_POSITIVE, MIN_POSITIVE)
) are subnormal. I don't know if creating them from a literal returns 0.0 or not. But if they do, then checking permutations of -0.0, 0.0, and NaN should be enough, e.g., (-0.0, 0.0), (0.0, -0.0), (1.0, NaN), (NaN, 1.0).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
0.000000000000000000000000000000000000000000001f32.is_normal()
returns false and transmuting it to [u8; 4]
gives [1, 0, 0, 0]
. Do you want to check permutations with that number too? Or should I just use 0.0
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to check permutations with that number too?
Yes, we should check that too :)
let b: [u8; 16] = transmute(b); | ||
assert_eq!(r1, b); | ||
assert_eq!(r2, a); | ||
assert_ne!(a, b); // sanity check that -0.0 is actually present |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to also test here the behavior when the first argument is nan and the second is not, and vice versa (e.g. if the result the Nan? the second argument ? always the non-nan ? etc.).
LLVM doesn't use the simd instructions for certain intrinsics on i586. |
@bjorn3 maybe we could use the generic intrinsics in some cases (e.g. |
9019582
to
03a312c
Compare
Rebased to trigger CI, as the old logs are no longer available. |
Windows build failed while installing rust:
|
Can you split this statement into two different lines and try again?
|
This is fixed by ac59837 |
`rsqrtps %xmm0,%xmm1` used to match `sqrtps` without leading `r`.
03a312c
to
4c7d4b5
Compare
Closing / reopening to re-trigger CI. |
On i586 the simd_* intrinsics don't compile to MMX instructions, even with `#[target_feature(enable = "mmx")]`.
Reverted the mmx changes, as those are the ones not compiling to the required instruction. |
CI is finally happy! |
Uh, sorry, my fault, I should have caught this. Yes, |
Thank you @bjorn3 for working on this! |
I currently only did this for x86. Also I skipped
_mm_sqrt_ps
and some more, as llvm emittedrsqrtps
combined with a lot of extra instructions instead ofsqrtps
, causing slight rounding errors and non optimal codegen.cc #788