Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use simd_* intrinsics instead of llvm intrinsics where possible #788

Closed
3 tasks
bjorn3 opened this issue Jul 30, 2019 · 19 comments
Closed
3 tasks

Use simd_* intrinsics instead of llvm intrinsics where possible #788

bjorn3 opened this issue Jul 30, 2019 · 19 comments

Comments

@bjorn3
Copy link
Member

bjorn3 commented Jul 30, 2019

  • _mm_sqrt_ps: llvm.x86.sse.sqrt.ps -> simd_fsqrt
  • _mm_min_ps: llvm.x86.sse.min.ps -> simd_fmin
  • ...

This makes it easier for other codegen backends to support stuff, as all variants of the same operation are nicely bundled together. (eg single and double precision sqrt)

@gnzlbg
Copy link
Contributor

gnzlbg commented Jul 30, 2019

We used to do that, but only for arm and x86 we needed to add ~10.000 intrinsics. This meant modifying rustc every time a new intrinsic should be added, and doing CI of the intrinsics in rustc, which would add a dozen build jobs to run the tests on less common targets, and increase CI time of rustc build jobs by quite a while (I think the largest tests here take ~20 min).

It would be simpler if other backends could just either hijack the llvm intrinsics, or add their own #[link_foo_intrinsic] attributes that can be used here to link against other intrinsics. Particularly for the more complex algorithms exposed that require calling different intrinsics.

@bjorn3
Copy link
Member Author

bjorn3 commented Jul 30, 2019

Particularly for the more complex algorithms exposed that require calling different intrinsics.

I meant the simple ones which already have a simd_* platform-intrinsic.

@gnzlbg
Copy link
Contributor

gnzlbg commented Jul 30, 2019

This PR was the last one removing support for that: rust-lang/rust#57416 in case you are considering adding it again. If that happens, we could incrementally remove the #[link_llvm_intrinsics] from this crate, and start calling those.

@gnzlbg
Copy link
Contributor

gnzlbg commented Jul 30, 2019

I meant the simple ones which already have a simd_* platform-intrinsic.

Your list appeared unbounded, it would then maybe be helpful if you could write a complete list of operations that you want to propose. The packed_simd crate already implements all the basic stuff portably, and for that the currently available intrinsics mostly suffice.

@gnzlbg
Copy link
Contributor

gnzlbg commented Jul 30, 2019

The packed_simd crate already implements all the basic stuff portably, and for that the currently available intrinsics mostly suffice

Expanding on that, it implements the two operations you mention, and they currently produce optimal code generation with LLVM. For doing a min we just use < and simd_select. For doing a sqrt we just use the vector llvm.fsqrt intrinsics. We probably need to add an LLVM pass to lower those to libmvec calls at some point, packed_simd just optionally links a vector math library, and lowers those to FFI calls, but that's not the best place to do that (it should happen somewhere in the code generation backend).

@bjorn3
Copy link
Member Author

bjorn3 commented Jul 30, 2019

Your list appeared unbounded, it would then maybe be helpful if you could write a complete list of operations that you want to propose.

I understand that it is practically impossible to do this for all intrinsics, but at least the following platform-intrinsics are already supported by the llvm codegen backend:

extern "platform-intrinsic" {
pub fn simd_eq<T, U>(x: T, y: T) -> U;
pub fn simd_ne<T, U>(x: T, y: T) -> U;
pub fn simd_lt<T, U>(x: T, y: T) -> U;
pub fn simd_le<T, U>(x: T, y: T) -> U;
pub fn simd_gt<T, U>(x: T, y: T) -> U;
pub fn simd_ge<T, U>(x: T, y: T) -> U;
pub fn simd_shuffle2<T, U>(x: T, y: T, idx: [u32; 2]) -> U;
pub fn simd_shuffle4<T, U>(x: T, y: T, idx: [u32; 4]) -> U;
pub fn simd_shuffle8<T, U>(x: T, y: T, idx: [u32; 8]) -> U;
pub fn simd_shuffle16<T, U>(x: T, y: T, idx: [u32; 16]) -> U;
pub fn simd_shuffle32<T, U>(x: T, y: T, idx: [u32; 32]) -> U;
pub fn simd_shuffle64<T, U>(x: T, y: T, idx: [u32; 64]) -> U;
pub fn simd_shuffle128<T, U>(x: T, y: T, idx: [u32; 128]) -> U;
pub fn simd_insert<T, U>(x: T, idx: u32, val: U) -> T;
pub fn simd_extract<T, U>(x: T, idx: u32) -> U;
pub fn simd_cast<T, U>(x: T) -> U;
pub fn simd_add<T>(x: T, y: T) -> T;
pub fn simd_sub<T>(x: T, y: T) -> T;
pub fn simd_mul<T>(x: T, y: T) -> T;
pub fn simd_div<T>(x: T, y: T) -> T;
pub fn simd_rem<T>(x: T, y: T) -> T;
pub fn simd_shl<T>(x: T, y: T) -> T;
pub fn simd_shr<T>(x: T, y: T) -> T;
pub fn simd_and<T>(x: T, y: T) -> T;
pub fn simd_or<T>(x: T, y: T) -> T;
pub fn simd_xor<T>(x: T, y: T) -> T;
pub fn simd_reduce_add_unordered<T, U>(x: T) -> U;
pub fn simd_reduce_mul_unordered<T, U>(x: T) -> U;
pub fn simd_reduce_add_ordered<T, U>(x: T, acc: U) -> U;
pub fn simd_reduce_mul_ordered<T, U>(x: T, acc: U) -> U;
pub fn simd_reduce_min<T, U>(x: T) -> U;
pub fn simd_reduce_max<T, U>(x: T) -> U;
pub fn simd_reduce_min_nanless<T, U>(x: T) -> U;
pub fn simd_reduce_max_nanless<T, U>(x: T) -> U;
pub fn simd_reduce_and<T, U>(x: T) -> U;
pub fn simd_reduce_or<T, U>(x: T) -> U;
pub fn simd_reduce_xor<T, U>(x: T) -> U;
pub fn simd_reduce_all<T>(x: T) -> bool;
pub fn simd_reduce_any<T>(x: T) -> bool;
pub fn simd_select<M, T>(m: M, a: T, b: T) -> T;
pub fn simd_select_bitmask<M, T>(m: M, a: T, b: T) -> T;
pub fn simd_fmin<T>(a: T, b: T) -> T;
pub fn simd_fmax<T>(a: T, b: T) -> T;
pub fn simd_fsqrt<T>(a: T) -> T;
pub fn simd_fma<T>(a: T, b: T, c: T) -> T;

@bjorn3
Copy link
Member Author

bjorn3 commented Jul 30, 2019

It would be simpler if other backends could just either hijack the llvm intrinsics, or add their own #[link_foo_intrinsic] attributes that can be used here to link against other intrinsics. Particularly for the more complex algorithms exposed that require calling different intrinsics.

Hijacking the llvm intrinsics is what I currently do, but every combination of op, lane_type and lane_count has its own llvm intrinsic, leading to combinatory explosion of the intrinsics to be implemented.

@gnzlbg
Copy link
Contributor

gnzlbg commented Jul 30, 2019

I understand that it is practically impossible to do this for all intrinsics,

This makes me wonder, why are you trying to support core::arch instead of core::simd ? core::simd only exposes the portable operations, and does so on all arches, so it probably makes sense to support that first.

If there is a way to implement some of the core::arch intrinsics in terms of core::simd, we could do that.

@bjorn3
Copy link
Member Author

bjorn3 commented Jul 30, 2019

why are you trying to support core::arch

Because everybody uses core::arch, as it's the only stable way to use SIMD. That includes crates like memchr which are depended upon by many other crates.

@bjorn3
Copy link
Member Author

bjorn3 commented Jul 30, 2019

If there is a way to implement some of the core::arch intrinsics in terms of core::simd, we could do that.

+1

@gnzlbg
Copy link
Contributor

gnzlbg commented Jul 30, 2019

Also, at least for the two intrinsics mentioned:

_mm_sqrt_ps: llvm.x86.sse.sqrt.ps -> simd_fsqrt

The specific intrinsics only work for the particular vector length, type, and one specific #[target_feature] .

A generic intrinsic would need to support both f32 and f64 and all supported vector lengths for those types (4, 8, and 16 for 32-bit), for all combinations of target features available.

For the LLVM backend, exposing a generic simd_fsqrt would require:

I think that work is worth doing, and it would improve the current situation a bit, where we handle all of this in a library (e.g. by enabling some optimizations that are not possible today).

_mm_min_ps: llvm.x86.sse.min.ps -> simd_fmin

The main reason I point you to packed_simd, is because it has already solved most of the problems you are trying to solve: https://github.com/rust-lang-nursery/packed_simd/blob/master/src/api/ops/vector_float_min_max.rs#L13 The intrinsic you are requesting, already exist, and it is precisely called simd_fmin.

@bjorn3
Copy link
Member Author

bjorn3 commented Jul 30, 2019

The intrinsic you are requesting, already exist, and it is precisely called simd_fmin.

I was actually requesting it to be used in _mm_min_ps, not to add it :)

For the LLVM backend, exposing a generic simd_fsqrt would require:

  • generate an llvm.fsqrt.xy from the front-end

This is already done by cg_llvm for simd_fsqrt and friends:

https://github.com/rust-lang/rust/blob/9a239ef4ded03d155c72b68b5a2dd7aff013e141/src/librustc_codegen_llvm/intrinsic.rs#L1330

@bjorn3
Copy link
Member Author

bjorn3 commented Jul 30, 2019

writing and registering a custom LLVM pass that replaces llvm.fsqrt.xy in LLVM-IR before the backend, e.g., doing something like what is manually done here: https://github.com/rust-lang-nursery/packed_simd/blob/accf702d478010adb72bcba980bb59b9cf218d7d/src/codegen/math/float/sqrt.rs#L56

Doesn't llvm do that itself?

@bjorn3
Copy link
Member Author

bjorn3 commented Jul 30, 2019

The main reason I point you to packed_simd, is because it has already solved most of the problems you are trying to solve

I understand, but until core::arch is implemented in terms of packed_simd it doesn't help much, as many crates use core::arch instead.

@gnzlbg
Copy link
Contributor

gnzlbg commented Jul 30, 2019

I was actually requesting it to be used in _mm_min_ps, not to add it :)

Ahhh! Duh, I thought you were proposing adding new generic intrinsics for all the exposed operations.

If you just want to use the generic ones when they are appropriate and they do work, you can do just that, and our test suite should make sure that the exact same instruction is generated. Just keep in mind that sometimes the target-specific intrinsics might mean something different than the generic ones (e.g. for floating-point, things like the precision might be different).


Doesn't llvm do that itself?

No, not at all. The simd_fsqrt<T> intrinsic isn't generic, in that it is not intended to work for all Ts that are SIMD types. It only works for some particular T and particular target-feature combinations, e.g., for f32x4 if the appropriate target-feature is available, for f32x8 when avx2 is available, and that's pretty much it.

core::arch only exposes these cases IIRC (except for some of the mathematical intrinsics that we do not implement yet like _mm_invsqrt_pd), so for these cases using simd_fsqrt should "just work".

@bjorn3
Copy link
Member Author

bjorn3 commented Jul 30, 2019

By the way, it turns out that cg_llvm implements more simd_f* intrinsics, than seem to be defined in either core_arch or packed_simd:

https://github.com/rust-lang/rust/blob/master/src/librustc_codegen_llvm/intrinsic.rs#L1340-L1381

@gnzlbg
Copy link
Contributor

gnzlbg commented Jul 30, 2019

By the way, it turns out that cg_llvm implements more simd_f* intrinsics, than seem to be defined in either core_arch or packed_simd:

Yes, most (or all?) of those are used by packed_simd - the ones that aren't is because they don't make sense without workarounds, and adding workarounds takes time (e.g. there are PRs to packed simd to implement some of those).

@bjorn3
Copy link
Member Author

bjorn3 commented Jul 30, 2019

so for these cases using simd_fsqrt should "just work".

🎉

@gnzlbg
Copy link
Contributor

gnzlbg commented Jul 30, 2019

I'm closing this as the issue has been "resolved". Feel free to submit PRs to change whichever parts of core::arch can use some of the already existing generic intrinsics to use those. I hope we can get CI back up soon, so that we can start merging PRs again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants