x64: Add a smattering of lowerings for `shuffle` specializations (#5930)
* x64: Add lowerings for `punpck{h,l}wd`

  Add some special cases for `shuffle` for more specialized x86 instructions.

* x64: Add `shuffle` lowerings for `pshufd`

  This commit adds special-cased lowerings for the x64 `shuffle` instruction when the `pshufd` instruction alone suffices. This is possible when the shuffle immediate permutes 32-bit values within one of the vector inputs of the `shuffle` instruction, but not both. (A worked example follows this list.)

* x64: Add `shuffle` lowerings for `punpck{h,l}{q,}dq`

  This adds specific permutations for the x86 instructions which interleave the high or low elements of vectors of 32- and 64-bit values. This corresponds to the preexisting specialized lowerings for interleaving 8- and 16-bit values.

* x64: Add `shuffle` lowerings for `shufps`

  This commit adds targeted lowerings for the `shuffle` instruction that match the pattern that `shufps` supports. The `shufps` instruction selects two elements from the first vector and two elements from the second vector, which means that while it's not generally applicable it should still be more useful than the catch-all lowering of `shuffle`.

* x64: Add `shuffle` support for `pshuf{l,h}w`

  This commit adds special lowering cases for these instructions, which permute 16-bit values within either the upper or the lower half of a 128-bit value.

* x64: Specialize `shuffle` with an all-zeros immediate

  Instead of loading the all-zeros immediate from a rip-relative address at the end of the function, generate a zero with a `pxor` instruction and then use `pshufb` to do the broadcast.
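As a worked example of the `pshufd` specialization (a hedged sketch, not code from the PR): a mask that reverses the four 32-bit lanes of the first operand packs into `pshufd`'s 8-bit immediate as follows.

```rust
fn main() {
    // Byte-level shuffle mask that reverses the four 32-bit lanes of
    // operand one (indices 0..16 select bytes of the first operand).
    let mask: [u8; 16] = [12, 13, 14, 15, 8, 9, 10, 11, 4, 5, 6, 7, 0, 1, 2, 3];

    let mut imm = 0u8;
    for (lane, quad) in mask.chunks(4).enumerate() {
        // Each destination lane must select a whole, aligned 32-bit lane
        // of the first operand...
        assert!(quad[0] % 4 == 0 && quad[0] < 16);
        // ...and the four bytes of that lane in order.
        assert!(quad.iter().enumerate().all(|(j, b)| *b == quad[0] + j as u8));
        // Pack the 2-bit lane selector into the immediate.
        imm |= (quad[0] / 4) << (lane * 2);
    }

    // Selectors 3,2,1,0 from low to high lane: `pshufd dst, src, 0x1b`.
    assert_eq!(imm, 0b00_01_10_11);
}
```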
LGTM! I think my only nit is documentation for `shuffle_imm`; the documentation for the other helper functions is great. For others' future reference, the equivalent V8 lowerings can be found in code-generator-x64.cc (search for `punpck` and `pshuf`).
```
(extern extractor pshuflw_rhs_imm pshuflw_rhs_imm)
(decl pshufhw_lhs_imm (u8) Immediate)
(extern extractor pshufhw_lhs_imm pshufhw_lhs_imm)
(decl pshufhw_rhs_imm (u8) Immediate)
```
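For readers unfamiliar with external extractors: each `extern extractor` here is backed by a Rust method that either produces the instruction's immediate or fails the match. Below is a rough standalone sketch of the matching logic behind something like `pshufhw_lhs_imm` (hypothetical; the real helper receives the immediate via the lowering context and may differ in detail):

```rust
/// Hypothetical sketch: does this 16-byte shuffle mask correspond to
/// `pshufhw` applied to the first operand? If so, return the 8-bit
/// immediate encoding the permutation of the high four 16-bit lanes.
fn pshufhw_lhs_imm(mask: &[u8; 16]) -> Option<u8> {
    // `pshufhw` copies the low 64 bits of its source through unchanged,
    // so the low half of the mask must be the identity on operand one.
    if mask[..8] != [0, 1, 2, 3, 4, 5, 6, 7] {
        return None;
    }
    // Each high destination lane selects one of operand one's high 16-bit
    // lanes (byte indices 8..16), two mask bytes per lane, in order.
    let mut imm = 0u8;
    for (lane, pair) in mask[8..].chunks(2).enumerate() {
        if pair[0] < 8 || pair[0] >= 16 || pair[0] % 2 != 0 || pair[1] != pair[0] + 1 {
            return None;
        }
        imm |= ((pair[0] - 8) / 2) << (lane * 2);
    }
    Some(imm)
}
```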
Wish the matching logic could be visible here in ISLE but that doesn't really change how I feel about this PR.
And I get that it gets rather hairy in the helpers...
I agree! I originally wondered how bad it would be to put this all in ISLE, but I concluded that it would end up with even more extractors and not necessarily simplify the overall scheme, so I opted to leave the rules in ISLE but the extraction of immediates in Rust (for better or worse).
```
(rule 10 (lower (shuffle x y (pshufhw_lhs_imm imm)))
         (x64_pshufhw x imm))
(rule 9 (lower (shuffle x y (pshufhw_rhs_imm imm)))
        (x64_pshufhw y imm))
```
I started to wonder how many comparisons need to be made before one can emit a shuffle lowering: I think in the end it is worth it to spend a bit more time finding the right instruction here, but I wonder if there is a quicker way to do so. (Just wondering, no action needed.)
Same! My thinking though was that there's really not all that many `shuffle` instructions in a program, so if `shuffle` is 100x slower to translate than `load` it's probably not the end of the world. That being said, I do think these could be slightly optimized: instead of having an extractor per immediate we could have one extractor that returns an `enum` of all the possible shuffle instructions that could be used, which is then pattern-matched on in ISLE. I'm not sure it would be that much faster, but it would prevent bouncing back and forth between generated code and `isle.rs` if it becomes an issue.
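A rough sketch of that single-extractor shape (hypothetical, not part of this PR): classify the mask once on the Rust side, reusing matching helpers like the `pshufhw_lhs_imm` logic sketched earlier, and let the ISLE rules match on the variants:

```rust
// Hypothetical: the set of specialized instructions a shuffle mask can
// collapse to; ISLE rules would pattern-match on these variants.
enum ShuffleKind {
    Pshuflw { use_rhs: bool, imm: u8 },
    Pshufhw { use_rhs: bool, imm: u8 },
    Pshufd { use_rhs: bool, imm: u8 },
    // ... one variant per specializable instruction (punpck*, shufps, ...)
}

// One extractor instead of one per immediate: each check is the same
// logic the individual extractors perform today, just run in one place.
fn classify_shuffle(mask: &[u8; 16]) -> Option<ShuffleKind> {
    if let Some(imm) = pshufhw_lhs_imm(mask) {
        return Some(ShuffleKind::Pshufhw { use_rhs: false, imm });
    }
    // ... remaining cases elided for brevity
    None
}
```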
I was going to mention that V8 does this "match once, then emit" through separate opcodes like you mention below:

> Oh I meant to take a look at that but never got around to it. Interestingly though it looks like the x64 backend for v8 has individual opcodes for all the `punpck*` variants, so it seems like they probably translate shuffles to different opcodes earlier in the pipeline than what Cranelift is doing in the backend here.

I like the `enum` idea for some point in the future.
Somewhere in v8, though, something is translating from the wasm shuffle to each individual `kX64*` instruction, right? I would naively expect that the translation there is on the same order of magnitude of complexity/time/etc. as what these ISLE lowerings are generating.
Since ISLE will combine all the `u128_from_immediate` calls in the rules at priority 6 below, many of these rules are really cheap to evaluate together. This isn't as bad as it looks. 😆

The `enum` approach would let ISLE do the same collapsing for a bunch more of these rules, so yeah, that should be good. Also, all the rules that would then use the same extractor but match on different returned enum variants should be placed at the same priority if there aren't other rules at priorities in between.
```rust
pub fn shuffle_imm(size: u8, bytes: &[u8]) -> Option<u8> {
```
I think this could use some documentation; it's not exactly clear why only `bytes[0]` matters here (little-endianness-related, it would seem). This is a foundational block for all of the helpers above, and it would be nice to see clearly when we expect the `None`s from this function to be the ones propagated outwards.
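To make the `bytes[0]` question concrete, here is one plausible shape for this helper (a hedged sketch consistent with the signature quoted above, not the PR's actual body): vectors are little-endian, so the first byte of each lane-sized group both identifies the selected lane and anchors the in-order check, and any misalignment produces the `None` that propagates out through the extractors.

```rust
/// Sketch: pack a byte-level shuffle mask into a 2-bits-per-lane immediate
/// for lanes of `size` bytes (e.g. `size == 4` for `pshufd`). Returns `None`
/// if the mask doesn't select whole, aligned, in-order lanes.
pub fn shuffle_imm(size: u8, bytes: &[u8]) -> Option<u8> {
    let size = usize::from(size);
    // Only four 2-bit selectors fit in a u8, so expect exactly four lanes.
    debug_assert_eq!(bytes.len(), size * 4);
    let mut imm = 0u8;
    for (lane, group) in bytes.chunks(size).enumerate() {
        // Little-endian: `group[0]` is the lowest-addressed byte of the
        // lane, so it alone determines which source lane is selected...
        let first = group[0] as usize;
        if first % size != 0 {
            return None; // not aligned to a lane boundary
        }
        let selected = first / size;
        if selected > 3 {
            return None; // doesn't fit in a 2-bit selector
        }
        // ...and the remaining bytes must continue that lane in order.
        if !group.iter().enumerate().all(|(j, b)| *b as usize == first + j) {
            return None;
        }
        imm |= (selected as u8) << (lane * 2);
    }
    Some(imm)
}
```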
```
;; more efficient to execute the `pshufb` here-and-now with an xor'd-to-be-zero
;; register.
(rule 6 (lower (shuffle a _ (u128_from_immediate 0)))
       (x64_pshufb a (xmm_zero $I8X16)))
```
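For reference, the reason the all-zeros mask works: each `pshufb` result byte is `src[mask[i] & 0xf]` unless the mask byte's top bit is set, so a zeroed mask broadcasts byte 0. A small model of that semantics (a sketch, not code from the PR):

```rust
// Model of SSSE3 `pshufb` on one 128-bit register: a zeroed mask makes
// every output byte `src[0]`, i.e. a broadcast of the first byte.
fn pshufb(src: [u8; 16], mask: [u8; 16]) -> [u8; 16] {
    core::array::from_fn(|i| {
        if mask[i] & 0x80 != 0 {
            0 // top bit set: zero this output byte
        } else {
            src[(mask[i] & 0x0f) as usize]
        }
    })
}
```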
I suppose there could be a similar version of this rule if all lanes specify to broadcast the first byte of the second operand. That's, what, if every byte in the mask is 0x10?
True! I've been going back and forth a bit on how exhaustive all these rules should be. For example this rule:
```
(rule 6 (lower (shuffle a b (u128_from_immediate 0x1f1e_0f0e_1d1c_0d0c_1b1a_0b0a_1918_0908)))
       (x64_punpckhwd a b))
```

could also match the `u128` with the two 64-bit halves swapped to generate a `punpckhwd b a` instruction.
I added a bunch of lhs/rhs shuffles here for things like `pshufd` and `shufps`, but I'm also second-guessing myself for doing that since I didn't actually see any such shuffles in the wild and I was just adding them here on a whim.
Some of this I've been feeling would be better implemented as an egraph-style optimization. For example, if all the shuffle lanes are <16 then the `b` operand could be dropped. If they're all >=16 then they could all be updated and the `a` operand could be dropped. For the `punpckhwd` case above it's theoretically possible that the LSB of the shuffle mask could be one of the lowest-numbered bytes. You can see, though, that by this point it feels much more complicated than a simple rule, so I also haven't tried to do this yet (plus from an egraph perspective it's not really "simpler" to replace a `shuffle` with a `shuffle`).

In the end I figured I'd stick to doing the obvious lowerings for each instruction a platform has, along with ensuring that the various simd binaries I've been looking at have all their shuffles special-cased. We can always add more here in the future, and I was mainly hoping that by that point it's more obvious how to solve the problem of "could this shuffle be special-cased if the operands were swapped?"
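A sketch of the operand-dropping rewrite described above (hypothetical; names are illustrative): normalize the mask so later pattern-matching only has to consider lhs-only forms plus genuinely two-operand shuffles.

```rust
// Hypothetical pre-pass: if every mask byte selects from a single operand,
// re-index the mask and record which operand survives.
enum Operands {
    LhsOnly,
    RhsOnly,
    Both,
}

fn normalize_shuffle_mask(mask: &mut [u8; 16]) -> Operands {
    if mask.iter().all(|b| *b < 16) {
        Operands::LhsOnly
    } else if mask.iter().all(|b| *b >= 16) {
        // Rewrite lane indices to be relative to the second operand so the
        // same lhs-style rules can match with `b` in place of `a`.
        for b in mask.iter_mut() {
            *b -= 16;
        }
        Operands::RhsOnly
    } else {
        Operands::Both
    }
}
```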
* x64: Add a smattering of lowerings for `shuffle` specializations (#5930)

* x64: Add lowerings for `punpck{h,l}wd`

  Add some special cases for `shuffle` for more specialized x86 instructions.

* x64: Add `shuffle` lowerings for `pshufd`

  This commit adds special-cased lowerings for the x64 `shuffle` instruction when the `pshufd` instruction alone is necessary. This is possible when the shuffle immediate permutes 32-bit values within one of the vector inputs of the `shuffle` instruction, but not both.

* x64: Add shuffle lowerings for `punpck{h,l}{q,}dq`

  This adds specific permutations for some x86 instructions which specifically interleave high/low bytes for 32 and 64-bit values. This corresponds to the preexisting specific lowerings for interleaving 8 and 16-bit values.

* x64: Add `shuffle` lowerings for `shufps`

  This commit adds targeted lowerings for the `shuffle` instruction that match the pattern that `shufps` supports. The `shufps` instruction selects two elements from the first vector and two elements from the second vector which means while it's not generally applicable it should still be more useful than the catch-all lowering of `shuffle`.

* x64: Add shuffle support for `pshuf{l,h}w`

  This commit adds special lowering cases for these instructions which permute 16-bit values within a 128-bit value either within the upper or lower half of the 128-bit value.

* x64: Specialize `shuffle` with an all-zeros immediate

  Instead of loading the all-zeros immediate from a rip-relative address at the end of the function instead generate a zero with a `pxor` instruction and then use `pshufb` to do the broadcast.

* Review comments

* x64: Add an AVX encoding for the `pshufd` instruction

  This will benefit from lack of need for alignment vs the `pshufd` instruction if working with a memory operand and additionally, as I've just learned, this reduces dependencies between instructions because the `v*` instructions zero the upper bits as opposed to preserving them which could accidentally create false dependencies in the CPU between instructions.

* x64: Add more support for AVX loads/stores

  This commit adds VEX-encoded versions of instructions such as `mov{ss,sd,upd,ups,dqu}` for load and store operations. This also changes some signatures so the `load` helpers specifically take a `SyntheticAmode` argument which ended up doing a small refactoring of the `*_regmove` variant used for `insertlane 0` into f64x2 vectors.

* x64: Enable using AVX instructions for zero regs

  This commit refactors the internal ISLE helpers for creating zero'd xmm registers to leverage the AVX support for all other instructions. This moves away from picking opcodes to instead picking instructions with a bit of reorganization.

* x64: Remove `XmmConstOp` as an instruction

  All existing users can be replaced with usage of the `xmm_uninit_value` helper instruction so there's no longer any need for these otherwise constant operations. This additionally reduces manual usage of opcodes in favor of instruction helpers.

* Review comments

* Update test expectations
I took the meshoptimizer benchmark, compiled it to wasm, and ran it through both Wasmtime and v8. One major improvement for Wasmtime was to use specialized shuffle instructions instead of the generic fallback, which requires two `pshufd`s with rip-relative immediates plus a `por`, vs a single instruction in many cases. I trawled through the v8 disassembly and found a number of shuffle-related instructions and translated all of their lowerings to Wasmtime. On one specific benchmark in that program this PR improves throughput by 30%, and on another by 5%.