Improve LHS tensor.pack on non-f32 types for x86 #15441
Comments
llvm/llvm-project#71454 fixes the vectorization issue.
It disables the special vector sizes for non-f32 cases because that logic only covers the 16x16 transpose case. The improvements in dispatch sizes come from vectorization. We are not able to vectorize named ops if they have dynamic shapes, which is fixed by llvm/llvm-project@03529b9. That change allows backends to vectorize them because the shapes become static (by tiling with size=1). It is not a hard requirement; we track it in #15441

The revision takes the number of threads into account, so we get better performance in multi-threaded runs. It also reduces runtime overhead. This is a step toward #15391 and #15349. It improves the performance of the [tensor.pack](#15349) op from 420 ms to 170 ms on an 8-threaded x86 CPU.
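For illustration, here is a hypothetical before/after sketch of the "tiling with size=1" point; the element type, shapes, and tile sizes are made up and not taken from the actual dispatches:

```mlir
// Before distribution: the outer dims of the pack are dynamic, so the named
// op cannot be vectorized directly.
%pack = tensor.pack %src inner_dims_pos = [0, 1] inner_tiles = [16, 2]
    into %dest : tensor<?x?xi8> -> tensor<?x?x16x2xi8>

// After tiling the outer dims with size 1, each iteration works on statically
// shaped slices, which the backend can vectorize.
%src_slice  = tensor.extract_slice %src[%si, %sj] [16, 2] [1, 1]
    : tensor<?x?xi8> to tensor<16x2xi8>
%dest_slice = tensor.extract_slice %dest[%i, %j, 0, 0] [1, 1, 16, 2] [1, 1, 1, 1]
    : tensor<?x?x16x2xi8> to tensor<1x1x16x2xi8>
%tile = tensor.pack %src_slice inner_dims_pos = [0, 1] inner_tiles = [16, 2]
    into %dest_slice : tensor<16x2xi8> -> tensor<1x1x16x2xi8>
```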
Here are the details of the 16x16 transpose trick; the 4x4, 8x8, and 16x16 tricks share the same idea: https://stackoverflow.com/questions/29519222/how-to-transpose-a-16x16-matrix-using-simd-instructions

In your prototype, I think we can add the bitcast pattern before transpose lowering, i.e., https://github.com/openxla/iree/blob/bd2c92dbb3d2109cd624fa18e75b9bf3caaa4ae5/compiler/src/iree/compiler/Codegen/LLVMCPU/LLVMCPUVectorLowering.cpp#L148 You can preset the lowering_config; the 16x16 shuffle optimization should kick in automatically. That should give us much better performance. If it works, then we can teach tile size selection about it at https://github.com/openxla/iree/blob/bd2c92dbb3d2109cd624fa18e75b9bf3caaa4ae5/compiler/src/iree/compiler/Codegen/LLVMCPU/KernelDispatch.cpp#L1259-L1270

Here is the benchmark data from when I implemented the trick for f32 types: #13318

There may still be some performance improvements left on the table. In theory, we should be able to replace the vunpck*pd instructions with a combination of shufps + blends, which should be faster. The review comments in https://reviews.llvm.org/D148685 were very helpful to me.
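To make the bitcast idea concrete, here is a minimal hand-written sketch (not the actual pattern) of how an i8 transpose that keeps its innermost 4-byte group contiguous could be rewritten so the 32-bit 16x16 shuffle lowering applies:

```mlir
// Original op that the hypothetical pattern would match: the innermost 4xi8
// group stays together under permutation [1, 0, 2].
//   %t = vector.transpose %v, [1, 0, 2]
//       : vector<16x16x4xi8> to vector<16x16x4xi8>

// Reinterpret each 4xi8 group as one i32 value.
%cast = vector.bitcast %v : vector<16x16x4xi8> to vector<16x16x1xi32>
%flat = vector.shape_cast %cast : vector<16x16x1xi32> to vector<16x16xi32>
// This 16x16 transpose on 32-bit elements can use the shuffle-based lowering.
%t32  = vector.transpose %flat, [1, 0] : vector<16x16xi32> to vector<16x16xi32>
// Cast back to the original element type and shape.
%back = vector.shape_cast %t32 : vector<16x16xi32> to vector<16x16x1xi32>
%t    = vector.bitcast %back : vector<16x16x1xi32> to vector<16x16x4xi8>
```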
I have a prototype in https://github.com/hanhanW/iree/tree/improve-pack, but I need to structure it better. The next steps I have in mind to try are:
Steps 2 and 3 are needed; otherwise we will generate a bunch of scalar vector.bitcast ops during step 4.
We have optimized codegen for packing on f32 types, but not for int8. This is a tracking issue for the int8 case. I observed that some pack ops are not vectorized because masking is only supported on a limited set of ops for dynamic shapes. We should relax the condition to use isElementwise(), so the linalg.transpose op can also get vectorized. I have an easy fix locally and will send it out for review. With the change and better distribution logic, we can save up to 43% of total dispatch sizes for int8 models; see https://gist.github.com/iree-github-actions-bot/fa5becb880b9a6afc2d362883a585d5a
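For reference, a linalg.transpose of the kind the relaxed condition would cover looks like this (shapes are made up; the point is that the op is elementwise even though its shapes are dynamic, so masked vectorization should be allowed):

```mlir
// All loops are parallel and the indexing maps are permutations, so
// isElementwise() holds even with dynamic shapes.
%t = linalg.transpose ins(%in : tensor<?x16xi8>)
                      outs(%init : tensor<16x?xi8>) permutation = [1, 0]
```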
The next step is better pack codegen for non-f32 types. We need a pattern that packs with the innermost tile being a single element and leverages the 16x16 transpose lowering. Looking at the transpose permutation map and using the vector.bitcast op should help here.
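As a strawman illustration of the "innermost tile being a single element" shape (the actual pattern and tile sizes are still to be worked out):

```mlir
// Hypothetical i8 pack where the innermost tile is a single element
// (tile sizes made up for illustration).
%0 = tensor.pack %src inner_dims_pos = [0, 1] inner_tiles = [16, 1]
    into %dest : tensor<?x?xi8> -> tensor<?x?x16x1xi8>
```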