
Turn data-tiling on by default. #15256

Merged: 9 commits into iree-org:main from dt-default, Nov 29, 2023

Conversation

@hanhanW (Contributor) commented Oct 20, 2023

This revision turns data-tiling on by default. In this commit:

  • Disable data-tiling for transform dialect tests.
  • Allow tile-size adjustment to fail when on the vmvx ukernel path: there, all the inner tile sizes are kDynamic, so there is nothing to adjust (see the sketch after this list).
  • Move all related passes under transformOptions.options.dataTiling control.
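
To illustrate the second bullet, here is a minimal MLIR sketch of a pack op on the vmvx ukernel path. The function name, the shapes, and the `%m0`/`%k0` values are hypothetical; the point is that when every inner tile size is dynamic (kDynamic prints as a `?` dimension), a static tile-size adjustment has nothing it can rewrite and must be allowed to bail out:

```mlir
// Hypothetical pack on the vmvx ukernel path: the inner tile sizes %m0 and
// %k0 are only known at runtime, so every dimension derived from them is
// dynamic in the IR.
func.func @pack_with_dynamic_tiles(%src: tensor<128x256xf32>,
                                   %dest: tensor<?x?x?x?xf32>,
                                   %m0: index, %k0: index) -> tensor<?x?x?x?xf32> {
  %pad = arith.constant 0.0 : f32
  // inner_tiles = [%m0, %k0] carries no static sizes, so a tile-size
  // adjustment pass has nothing to do here.
  %packed = tensor.pack %src padding_value(%pad : f32)
      inner_dims_pos = [0, 1] inner_tiles = [%m0, %k0]
      into %dest : tensor<128x256xf32> -> tensor<?x?x?x?xf32>
  return %packed : tensor<?x?x?x?xf32>
}
```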

We can get up to 10x faster on models that had previously been overlooked (e.g., GPT2, Falcon7bGptqPT, BertLarge, etc.); we see 25%, 36%, and 45% slowdowns on some models we had been actively optimizing before. The full benchmark result (with microkernels on) can be found at https://gist.github.com/iree-github-actions-bot/b2ac57f9f1fd6894e93da19d6137c5da

@hanhanW hanhanW added codegen/riscv Code generation targeting RISC-V architecture benchmarks:x86_64 Run default x86_64 benchmarks benchmarks:android-cpu Run default Android CPU benchmarks labels Oct 20, 2023
@hanhanW hanhanW force-pushed the dt-default branch 7 times, most recently from 0231530 to 1283bbb Compare October 25, 2023 22:56
@github-actions bot commented Oct 25, 2023

Abbreviated Benchmark Summary

@ commit 2ed1417fa859d2a4bc3fffc378825b48ad290596 (vs. base 8604e07185fe2ad2ae0d38c7ef83a905a55c631d)

Regressed Latencies 🚩

| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
| --- | --- | --- | --- |
| MobileBertSquad_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 1052.382 (vs. 485.368, 116.82%↑) | 1052.183 | 3.660 |
| PersonDetect_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 1.291 (vs. 0.694, 86.00%↑) | 1.294 | 0.005 |
| MobileNetV2_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 43.113 (vs. 23.264, 85.32%↑) | 43.064 | 0.104 |

[Top 3 out of 20 results shown]

Improved Latencies 🎉

| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
| --- | --- | --- | --- |
| GPT2_117M_TF_1X4XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu][default-flags] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 31.464 (vs. 338.696, 90.71%↓) | 30.959 | 1.409 |
| GPT2_117M_TF_1X4XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu][default-flags] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 28.803 (vs. 247.051, 88.34%↓) | 28.969 | 0.890 |
| GPT2_117M_TF_1X1XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu][default-flags] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 23.764 (vs. 161.571, 85.29%↓) | 23.795 | 0.125 |

[Top 3 out of 24 results shown]

Regressed Compilation Times 🚩

| Benchmark Name | Compilation Time (ms) |
| --- | --- |
| MobileNetV2_fp32(tflite) [vmvx-generic-vmvx-vmvx][experimental-flags,compile-stats] | 15849 (vs. 6937, 128.47%↑) |

Regressed Total Dispatch Sizes 🚩

| Benchmark Name | Total Dispatch Size (bytes) |
| --- | --- |
| MobileNetV2_fp32(tflite) [vmvx-generic-vmvx-vmvx][experimental-flags,compile-stats] | 228725 (vs. 99253, 130.45%↑) |
| MobileNetV3Small_fp32(tflite) [vmvx-generic-vmvx-vmvx][experimental-flags,compile-stats] | 335413 (vs. 166325, 101.66%↑) |
| GPT2_117M_TF_1X1XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags,compile-stats] | 24080 (vs. 20336, 18.41%↑) |

[Top 3 out of 4 results shown]

Improved Total Dispatch Sizes 🎉

| Benchmark Name | Total Dispatch Size (bytes) |
| --- | --- |
| MobileBertSquad_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu][default-flags,compile-stats] | 27888 (vs. 124024, 77.51%↓) |
| EfficientNet_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags,compile-stats] | 170152 (vs. 498824, 65.89%↓) |
| MobileBertSquad_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags,compile-stats] | 2631656 (vs. 6500568, 59.52%↓) |

[Top 3 out of 22 results shown]

Regressed Total Artifact Sizes 🚩

| Benchmark Name | Total Artifact Size (bytes) |
| --- | --- |
| BertForMaskedLMTF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags,compile-stats] | 874377425 (vs. 532227729, 64.29%↑) |
| MiniLML12H384Uncased(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags,compile-stats] | 219488468 (vs. 133974932, 63.83%↑) |
| GPT2_117M_TF_1X4XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags,compile-stats] | 992548692 (vs. 652746132, 52.06%↑) |

[Top 3 out of 6 results shown]

Improved Total Artifact Sizes 🎉

| Benchmark Name | Total Artifact Size (bytes) |
| --- | --- |
| PersonDetect_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags,compile-stats] | 355205 (vs. 416901, 14.80%↓) |
| MobileBertSquad_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags,compile-stats] | 29207621 (vs. 32872133, 11.15%↓) |
| Falcon7bGptqPT(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags,compile-stats] | 343784 (vs. 370088, 7.11%↓) |

[Top 3 out of 4 results shown]

Regressed Stream IR Dispatch Count (# of cmd.dispatch ops) 🚩

| Benchmark Name | Stream IR Dispatch Count (# of cmd.dispatch ops) |
| --- | --- |
| MobileBertSquad_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags,compile-stats] | 1861 (vs. 704, 164.35%↑) |
| MobileBertSquad_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu][default-flags,compile-stats] | 1861 (vs. 704, 164.35%↑) |
| MobileNetV2_fp32(tflite) [vmvx-generic-vmvx-vmvx][experimental-flags,compile-stats] | 188 (vs. 73, 157.53%↑) |

[Top 3 out of 28 results shown]

For more information:

Source Workflow Run

@hanhanW hanhanW force-pushed the dt-default branch 5 times, most recently from 1ea87bc to c55b974 Compare October 31, 2023 18:17
@hanhanW hanhanW removed the codegen/riscv Code generation targeting RISC-V architecture label Nov 1, 2023
@hanhanW hanhanW force-pushed the dt-default branch 2 times, most recently from 01378d7 to ebeef83 Compare November 1, 2023 18:49
@hanhanW hanhanW changed the title from "WIP - Turn data-tiling on by default v2" to "Turn data-tiling on by default." Nov 1, 2023
@hanhanW (Contributor, Author) commented Nov 1, 2023

The biggest regressions on the llvm-cpu side are DeepLabV3, MobileBert, MobileNetV2, and MobileSSD on 8 threads. This is a known issue with distribution.

The single-threaded regressions are 7-20%, which is acceptable to me. See the snapshot below for highlights:

[screenshot: single-threaded regression highlights]

I don't care much about the vmvx regressions; they are likely due to batch_matmul issues. When the microkernels are on, data-tiling is not applied to batch_matmul, so in this context the regression is expected. It's being tracked in #15314.

@dcaballe (Contributor) commented Nov 2, 2023

> The biggest regressions on the llvm-cpu side are DeepLabV3, MobileBert, MobileNetV2, and MobileSSD on 8 threads. This is a known issue with distribution.

Is this because DT uses different distribution sizes than non-DT?

@hanhanW (Contributor, Author) commented Nov 2, 2023

> Is this because DT uses different distribution sizes than non-DT?

DT introduces mmt4d, pack, and unpack ops, none of which exist on the non-DT path. The mmt4d distribution tile sizes, which I implemented one to two years ago, could be outdated, and the pack/unpack distribution tile sizes are not tuned yet. We observe a similar issue in #15349 (comment).
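
To make the comparison concrete, here is a sketch of the ops data-tiling introduces for a plain f32 matmul. The 8x8x4 tile sizes and the shapes are illustrative choices for this example, not the sizes IREE actually selects per target:

```mlir
// Data-tiled 128x256 x 256x512 matmul: DT rewrites it into
// pack -> mmt4d -> unpack. Each of these ops gets its own distribution
// tile sizes, which is where the untuned pack/unpack sizes show up.
func.func @data_tiled_matmul(%lhs: tensor<128x256xf32>,
                             %rhs: tensor<256x512xf32>,
                             %acc: tensor<16x64x8x8xf32>) -> tensor<128x512xf32> {
  %lhs_init = tensor.empty() : tensor<16x64x8x4xf32>
  %rhs_init = tensor.empty() : tensor<64x64x8x4xf32>
  %out_init = tensor.empty() : tensor<128x512xf32>
  // LHS: (M, K) -> (M/8, K/4, 8, 4).
  %plhs = tensor.pack %lhs inner_dims_pos = [0, 1] inner_tiles = [8, 4]
      into %lhs_init : tensor<128x256xf32> -> tensor<16x64x8x4xf32>
  // RHS: (K, N) -> (N/8, K/4, 8, 4), i.e. a transposed outer layout.
  %prhs = tensor.pack %rhs outer_dims_perm = [1, 0] inner_dims_pos = [1, 0]
      inner_tiles = [8, 4]
      into %rhs_init : tensor<256x512xf32> -> tensor<64x64x8x4xf32>
  // Matmul on the packed, tiled operands.
  %mm = linalg.mmt4d
      ins(%plhs, %prhs : tensor<16x64x8x4xf32>, tensor<64x64x8x4xf32>)
      outs(%acc : tensor<16x64x8x8xf32>) -> tensor<16x64x8x8xf32>
  // Back to the row-major (M, N) layout.
  %res = tensor.unpack %mm inner_dims_pos = [0, 1] inner_tiles = [8, 8]
      into %out_init : tensor<16x64x8x8xf32> -> tensor<128x512xf32>
  return %res : tensor<128x512xf32>
}
```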

@hanhanW hanhanW force-pushed the dt-default branch 4 times, most recently from 10cc44e to 647b05a Compare November 7, 2023 18:49
@hanhanW hanhanW force-pushed the dt-default branch 2 times, most recently from fa89343 to 17a9abc Compare November 8, 2023 19:16
@hanhanW hanhanW requested a review from bjacob November 11, 2023 03:31
@hanhanW (Contributor, Author) commented Nov 11, 2023

@bjacob it is +8 -1 now. I can create a PR that explicitly disables data-tiling for those tests, making this PR +1 -0, if you think that's better.

@bjacob (Contributor) commented Nov 11, 2023

Thanks for the splitting you've already done; this is much better. I still think it would be best to go all the way to a one-line change, as you describe.

Also, the current diff still turns on ukernels as well.

Thanks for all the work. I think it would be wise to wait until Monday anyway to flip the switch!

@hanhanW (Contributor, Author) commented Nov 11, 2023

The ukernels flag is mostly there for performance checks; I will disable it before landing the PR. I plan to wait a couple more days to see if others are okay with the flip, and I will make this a one-line change.

@hanhanW (Contributor, Author) commented Nov 11, 2023

The PR is ready: #15557

@dcaballe (Contributor) commented:

Posted new perf results in #15313

I also see some big regressions in CI. Has anybody looked into those?

@bjacob (Contributor) commented Nov 13, 2023

I have lost track of the discussions about the remaining regressions. The latest benchmark results still report some 2x regressions; can you link to the latest thinking about them?

@hanhanW (Contributor, Author) commented Nov 13, 2023

> Posted new perf results in #15313
>
> I also see some big regressions in CI. Has anybody looked into those?

The big regressions happen when ukernels are disabled. There is a known issue with codegen for i8.i8.i32 mmt4d ops: the vector dialect and the LLVM backend are not good at i8.i8.i32. The benchmark with ukernels on can be found at https://gist.github.com/iree-github-actions-bot/b2ac57f9f1fd6894e93da19d6137c5da and those regressions are okay.
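
For reference, a sketch of the problematic op shape; the tile sizes here are illustrative (a VNNI-friendly layout would pick K0 to match the target's i8 dot-product instructions):

```mlir
// i8 x i8 -> i32 mmt4d: i8 inputs, i32 accumulator. Without a
// VNNI-aware lowering, codegen widens the i8 operands to i32 before
// multiplying, which is what makes this variant slow.
func.func @mmt4d_i8i8i32(%lhs: tensor<16x32x8x2xi8>,
                         %rhs: tensor<64x32x8x2xi8>,
                         %acc: tensor<16x64x8x8xi32>) -> tensor<16x64x8x8xi32> {
  %0 = linalg.mmt4d
      ins(%lhs, %rhs : tensor<16x32x8x2xi8>, tensor<64x32x8x2xi8>)
      outs(%acc : tensor<16x64x8x8xi32>) -> tensor<16x64x8x8xi32>
  return %0 : tensor<16x64x8x8xi32>
}
```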

@bjacob (Contributor) left a review:

Thanks for pushing it this far. This is now a 1-line change that is a clear improvement in IREE's benchmarks. OK to merge this pending a check with @dcaballe, but the fact that the known regressions are due to non-ukernel mmt4d codegen means to me that it's time to flip this switch and deal with imperfect mmt4d codegen separately.

On a parallel front, I am working to make it easy to switch on ukernels for mmt4d selectively (as opposed to wholesale for all ops for which we might have more ukernels in the future). The first step was #15571, and I'll advance that further today. That provides another layer of reasoning for why it's OK to move ahead without letting mmt4d codegen block this.

@bjacob (Contributor) commented Nov 14, 2023

Just to make this concrete, #15586 shows how a 1-line change can now enable UK for mmt4d by default (without this being a UK blank cheque with respect to future UK lowerings). That is: if the last thing preventing DT-by-default is currently suboptimal codegen for mmt4d ops, we can enable UK for mmt4d so that it no longer affects the default path.

@MaheshRavishankar (Contributor) left a review:

Could we merge this change together with @bjacob's change that makes mmt4d use ukernels, and land them together? That might help with the regressions (I am treating mmt4d + non-ukernel as the non-default path).

@bjacob (Contributor) commented Nov 14, 2023

Then let's discuss both in the "Mai-Tai sync" meeting 30 minutes from now. If we confirm we want to do this now, let's merge #15586 first, then this one, so they stay separate (useful for bisection; we never know) and so that the ordering ensures there is no intermediate state where the performance of non-ukernel mmt4d codegen is felt by default.

@MaheshRavishankar (Contributor) commented:

> Then let's discuss both in the "Mai-Tai sync" meeting 30 minutes from now. If we confirm we want to do this now, let's merge #15586 first, then this one, so they stay separate (useful for bisection; we never know) and so that the ordering ensures there is no intermediate state where the performance of non-ukernel mmt4d codegen is felt by default.

That sounds good to me!

@bjacob (Contributor) commented Nov 14, 2023

This has been discussed on Discord with @dcaballe here: https://discord.com/channels/689900678990135345/1146173056537079919/1174043054282391654

The summary is that what is blocking this now is actually #15242. It's orthogonal to mmt4d codegen and ukernels, is being actively worked on by @NatashaKnk, and is discussed on Discord below the message linked above.

I think it is worth waiting a bit to give that a chance to be resolved first, so we can hope for DT enablement to be a 100% clean win :-)

@hanhanW (Contributor, Author) commented Nov 15, 2023

> Posted new perf results in #15313
>
> I also see some big regressions in CI. Has anybody looked into those?

The regressions are mostly from the i8.i8.i32 mmt4d kernels; we are not generating VNNI instructions. Issue filed: #15611.

@stellaraccident (Collaborator) commented:

I just touched base with some of the stakeholders, and it sounds like there was consensus to enable this by default. @jpienaar, do you concur?

@hanhanW (Contributor, Author) commented Nov 29, 2023

There is a failure after rebasing on ToT; the vmvx issue will be fixed by #15705.

@hanhanW (Contributor, Author) commented Nov 29, 2023

Landing the PR because the last major compilation issue (#15692) is fixed.

@hanhanW hanhanW merged commit 52eb7e9 into iree-org:main Nov 29, 2023
57 of 59 checks passed
@hanhanW hanhanW deleted the dt-default branch November 29, 2023 20:31
ramiro050 pushed a commit to ramiro050/iree that referenced this pull request Dec 19, 2023
The revision turns data-tiling on by default.

We can get up to 10x faster on models that had previously been overlooked (e.g., GPT2, Falcon7bGptqPT, BertLarge, etc.); we see 25%, 36%, and 45% slowdowns on some models we had been actively optimizing before. The full benchmark result (with microkernels on) can be found at https://gist.github.com/iree-github-actions-bot/b2ac57f9f1fd6894e93da19d6137c5da