
[CPU] Improve tile sizes selection for tensor.pack ops. #15397

Merged 1 commit into iree-org:main on Nov 7, 2023

Conversation

@hanhanW (Contributor) commented Nov 2, 2023

It disables the special vector sizes for non-f32 cases because that logic only applies to 16x16 transpose cases. The improvements in dispatch sizes come from vectorization. We are not able to vectorize named ops that have dynamic shapes, which is fixed by llvm/llvm-project@03529b9. This change allows backends to vectorize them because the shapes become static (by tiling with size=1). That is not a hard requirement; we track it in #15441
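As a rough C++ analogy of the size-1 tiling trick (illustrative only, not code from this PR or from IREE): once a dynamic dimension is tiled with size 1, the inner body runs over a compile-time-constant extent, which is exactly the kind of static shape a vectorizer can handle.

```cpp
// Illustrative analogy: tiling a dynamic-extent dimension with tile size 1.
// The outer loop keeps the dynamic trip count; the inner "tile" has a
// compile-time-constant extent, so it behaves like a static shape.
constexpr long kTile = 1;

void tiledCopy(const float *src, float *dst, long n) {
  for (long i = 0; i < n; i += kTile) {  // dynamic outer trip count
    for (long j = 0; j < kTile; ++j)     // static inner extent of 1
      dst[i + j] = src[i + j];
  }
}
```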

The revision takes the number of threads into account, so we get better performance on multi-threaded CPUs. It also reduces runtime overheads.
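For the thread-aware part, here is a minimal sketch of what such a heuristic can look like. The function name, signature, and size cap below are hypothetical; this is not IREE's actual implementation.

```cpp
// Hypothetical sketch of a thread-aware distribution tile-size heuristic.
#include <algorithm>
#include <cstdint>

// Pick a tile size for one dimension so the resulting number of tiles keeps
// every worker thread busy, while capping the tile at a maximum that bounds
// per-dispatch working-set size.
int64_t pickDistributionTileSize(int64_t dimSize, int64_t numThreads,
                                 int64_t maxTileSize = 64) {
  // Aim for at least one tile per thread.
  int64_t target = std::max<int64_t>(1, dimSize / numThreads);
  // Clamp to the cap, then shrink to a divisor of dimSize when possible so
  // tiles stay uniform and no partial tile is left over.
  int64_t tile = std::min(target, maxTileSize);
  while (tile > 1 && dimSize % tile != 0) --tile;
  return tile;
}
```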

This is a step toward #15391 and #15349

It improves the performance of the tensor.pack op from 420 ms to 170 ms on an 8-thread x86 CPU.

@hanhanW added the benchmarks:x86_64 (Run default x86_64 benchmarks) and benchmarks:android-cpu (Run default Android CPU benchmarks) labels Nov 2, 2023
github-actions bot commented Nov 3, 2023

Abbreviated Benchmark Summary

@ commit 24fc0109d548e8dff4a7cecb75ece805f5920356 (vs. base a4b1a78646f65e8e84a73fcf547bb75a60af5cf1)

Regressed Latencies 🚩

| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
| --- | --- | --- | --- |
| MobileNetV2_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-16[cpu] | 15.620 (vs. 12.891, 21.16%↑) | 15.634 | 0.081 |
| Falcon7bGptqPT(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-16[cpu] | 45014.478 (vs. 37280.459, 20.75%↑) | 45013.931 | 362.379 |
| DeepLabV3_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-16[cpu] | 38.889 (vs. 33.125, 17.40%↑) | 38.817 | 0.347 |

[Top 3 out of 10 results shown]

Regressed Total Dispatch Sizes 🚩

| Benchmark Name | Total Dispatch Size (bytes) |
| --- | --- |
| MobileBertSquad_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu][experimental-flags,data-tiling,ukernel,compile-stats] | 26984 (vs. 25512, 5.77%↑) |

Improved Total Dispatch Sizes 🎉

| Benchmark Name | Total Dispatch Size (bytes) |
| --- | --- |
| Vit_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][experimental-flags,data-tiling,ukernel,compile-stats] | 766072 (vs. 1366184, 43.93%↓) |
| PersonDetect_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][experimental-flags,data-tiling,ukernel,compile-stats] | 65256 (vs. 106056, 38.47%↓) |
| Falcon7bGptqPT(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu][experimental-flags,data-tiling,ukernel,compile-stats] | 103688 (vs. 131144, 20.94%↓) |

[Top 3 out of 6 results shown]

Improved Total Artifact Sizes 🎉

| Benchmark Name | Total Artifact Size (bytes) |
| --- | --- |
| PersonDetect_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][experimental-flags,data-tiling,ukernel,compile-stats] | 331845 (vs. 372613, 10.94%↓) |
| Falcon7bGptqPT(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu][experimental-flags,data-tiling,ukernel,compile-stats] | 344296 (vs. 371752, 7.39%↓) |

For more information:

Source Workflow Run

@hanhanW force-pushed the improve-multi branch 4 times, most recently from b4701bd to 90f1f14 on November 6, 2023
@hanhanW changed the title from "[CPU] Improve distribution tile sizes selection." to "[CPU] Improve tile sizes selection for tensor.pack ops." Nov 7, 2023
@hanhanW hanhanW marked this pull request as ready for review November 7, 2023 00:58
@MaheshRavishankar (Contributor) left a comment


Looks fine to me... I don't have much of a handle on the heuristics here. If you have any data as to why it helps, post it here for the record.

@hanhanW (Contributor, Author) commented Nov 7, 2023

> Looks fine to me... I don't have much of a handle on the heuristics here. If you have any data as to why it helps, post it here for the record.

Done. I included the perf numbers in the description.

@hanhanW hanhanW merged commit e3f2ab3 into iree-org:main Nov 7, 2023
@hanhanW hanhanW deleted the improve-multi branch November 7, 2023 17:45
@dcaballe (Contributor) commented Nov 7, 2023

Do you know if the Falcon7B regression is real? That's one of the main models we are working on right now.

@hanhanW (Contributor, Author) commented Nov 7, 2023

> Do you know if the Falcon7B regression is real? That's one of the main models we are working on right now.

I can take a look, but I suspect it is not real. The regression did not happen before I rebased onto ToT: https://gist.github.com/iree-github-actions-bot/99c8f439051a13ad1a3d0ebe112056b9

@hanhanW (Contributor, Author) commented Nov 7, 2023

I'm pretty sure it is not real, because the regression is in the default-flags configuration, not data-tiling.

ramiro050 pushed a commit to ramiro050/iree that referenced this pull request Dec 19, 2023
Labels: benchmarks:android-cpu (Run default Android CPU benchmarks), benchmarks:x86_64 (Run default x86_64 benchmarks)
3 participants