
[CPU] Improve tile sizes selection for tensor.pack ops. #15397

Merged 1 commit into iree-org:main on Nov 7, 2023

Conversation

@hanhanW (Contributor) commented Nov 2, 2023

It disables the special vector sizes for non-f32 cases because that logic only applies to 16x16 transpose cases. The improvements in dispatch sizes come from vectorization. We are not able to vectorize named ops that have dynamic shapes, which is fixed by llvm/llvm-project@03529b9. This change allows backends to vectorize them because the shapes become static (by tiling with size=1). That is not a hard requirement; we track it in #15441
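As a rough C++ analogy of the size-1 tiling trick (illustrative only, not code from this PR or from IREE): once a dynamic dimension is tiled with size 1, the inner body runs over a compile-time-constant extent, which is exactly the kind of static shape a vectorizer can handle.

```cpp
// Illustrative analogy: tiling a dynamic-extent dimension with tile size 1.
// The outer loop keeps the dynamic trip count; the inner "tile" has a
// compile-time-constant extent, so it behaves like a static shape.
constexpr long kTile = 1;

void tiledCopy(const float *src, float *dst, long n) {
  for (long i = 0; i < n; i += kTile) {  // dynamic outer trip count
    for (long j = 0; j < kTile; ++j)     // static inner extent of 1
      dst[i + j] = src[i + j];
  }
}
```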

The revision takes the number of threads into account, so we get better performance on multi-threaded CPUs. It also reduces runtime overheads.
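For the thread-aware part, here is a minimal sketch of what such a heuristic can look like. The function name, signature, and size cap below are hypothetical; this is not IREE's actual implementation.

```cpp
// Hypothetical sketch of a thread-aware distribution tile-size heuristic.
#include <algorithm>
#include <cstdint>

// Pick a tile size for one dimension so the resulting number of tiles keeps
// every worker thread busy, while capping the tile at a maximum that bounds
// per-dispatch working-set size.
int64_t pickDistributionTileSize(int64_t dimSize, int64_t numThreads,
                                 int64_t maxTileSize = 64) {
  // Aim for at least one tile per thread.
  int64_t target = std::max<int64_t>(1, dimSize / numThreads);
  // Clamp to the cap, then shrink to a divisor of dimSize when possible so
  // tiles stay uniform and no partial tile is left over.
  int64_t tile = std::min(target, maxTileSize);
  while (tile > 1 && dimSize % tile != 0) --tile;
  return tile;
}
```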

This is a step toward #15391 and #15349

It improves the performance of the tensor.pack op from 420 ms to 170 ms on an 8-thread x86 CPU.

@hanhanW added the benchmarks:x86_64 (Run default x86_64 benchmarks) and benchmarks:android-cpu (Run default Android CPU benchmarks) labels Nov 2, 2023
github-actions bot commented Nov 3, 2023

Abbreviated Benchmark Summary

@ commit 24fc0109d548e8dff4a7cecb75ece805f5920356 (vs. base a4b1a78646f65e8e84a73fcf547bb75a60af5cf1)

Regressed Latencies 🚩

| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
| --- | --- | --- | --- |
| MobileNetV2_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-16[cpu] | 15.620 (vs. 12.891, 21.16%↑) | 15.634 | 0.081 |
| Falcon7bGptqPT(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-16[cpu] | 45014.478 (vs. 37280.459, 20.75%↑) | 45013.931 | 362.379 |
| DeepLabV3_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-16[cpu] | 38.889 (vs. 33.125, 17.40%↑) | 38.817 | 0.347 |

[Top 3 out of 10 results shown]

Regressed Total Dispatch Sizes 🚩

| Benchmark Name | Total Dispatch Size (bytes) |
| --- | --- |
| MobileBertSquad_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu][experimental-flags,data-tiling,ukernel,compile-stats] | 26984 (vs. 25512, 5.77%↑) |

Improved Total Dispatch Sizes 🎉

| Benchmark Name | Total Dispatch Size (bytes) |
| --- | --- |
| Vit_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][experimental-flags,data-tiling,ukernel,compile-stats] | 766072 (vs. 1366184, 43.93%↓) |
| PersonDetect_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][experimental-flags,data-tiling,ukernel,compile-stats] | 65256 (vs. 106056, 38.47%↓) |
| Falcon7bGptqPT(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu][experimental-flags,data-tiling,ukernel,compile-stats] | 103688 (vs. 131144, 20.94%↓) |

[Top 3 out of 6 results shown]

Improved Total Artifact Sizes 🎉

| Benchmark Name | Total Artifact Size (bytes) |
| --- | --- |
| PersonDetect_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][experimental-flags,data-tiling,ukernel,compile-stats] | 331845 (vs. 372613, 10.94%↓) |
| Falcon7bGptqPT(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu][experimental-flags,data-tiling,ukernel,compile-stats] | 344296 (vs. 371752, 7.39%↓) |

For more information:

Source Workflow Run

@hanhanW force-pushed the improve-multi branch 4 times, most recently from b4701bd to 90f1f14 on November 6, 2023
@hanhanW changed the title from "[CPU] Improve distribution tile sizes selection." to "[CPU] Improve tile sizes selection for tensor.pack ops." Nov 7, 2023
@hanhanW hanhanW marked this pull request as ready for review November 7, 2023 00:58
@MaheshRavishankar (Contributor) left a comment


Looks fine to me... I don't have much of a handle on the heuristics here. If you have any data as to why it helps, post it here for the record.

@hanhanW (Contributor, Author) commented Nov 7, 2023

> Looks fine to me... I don't have much of a handle on the heuristics here. If you have any data as to why it helps, post it here for the record.

Done. I included the perf numbers in the description.

@hanhanW hanhanW merged commit e3f2ab3 into iree-org:main Nov 7, 2023
@hanhanW hanhanW deleted the improve-multi branch November 7, 2023 17:45
@dcaballe (Contributor) commented Nov 7, 2023

Do you know if the Falcon7B regression is real? That's one of the main models we are working on right now.

@hanhanW (Contributor, Author) commented Nov 7, 2023

> Do you know if the Falcon7B regression is real? That's one of the main models we are working on right now.

I can take a look, but I suspect it is not real. The regression did not happen before I rebased onto ToT: https://gist.github.com/iree-github-actions-bot/99c8f439051a13ad1a3d0ebe112056b9

@hanhanW (Contributor, Author) commented Nov 7, 2023

I'm pretty sure it is not real, because the regression is in the default-flags configuration, not data-tiling.

ramiro050 pushed a commit to ramiro050/iree that referenced this pull request Dec 19, 2023
Labels: benchmarks:android-cpu (Run default Android CPU benchmarks), benchmarks:x86_64 (Run default x86_64 benchmarks)
3 participants