
Turn data-tiling on by default. #15256

Merged: 9 commits into iree-org:main from dt-default, Nov 29, 2023

Conversation

@hanhanW (Contributor) commented Oct 20, 2023

This revision turns data-tiling on by default. In this commit:

  • Disable data-tiling for transform dialect tests.
  • Allow tile-size adjustment to fail when on the vmvx ukernel path: there, all the inner tile sizes are kDynamic, so there is nothing to adjust (see the sketch after this list).
  • Move all related passes under transformOptions.options.dataTiling control.
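
To illustrate the second bullet, here is a minimal MLIR sketch of a pack op on the vmvx ukernel path. The function name, the shapes, and the `%m0`/`%k0` values are hypothetical; the point is that when every inner tile size is dynamic (kDynamic prints as a `?` dimension), a static tile-size adjustment has nothing it can rewrite and must be allowed to bail out:

```mlir
// Hypothetical pack on the vmvx ukernel path: the inner tile sizes %m0 and
// %k0 are only known at runtime, so every dimension derived from them is
// dynamic in the IR.
func.func @pack_with_dynamic_tiles(%src: tensor<128x256xf32>,
                                   %dest: tensor<?x?x?x?xf32>,
                                   %m0: index, %k0: index) -> tensor<?x?x?x?xf32> {
  %pad = arith.constant 0.0 : f32
  // inner_tiles = [%m0, %k0] carries no static sizes, so a tile-size
  // adjustment pass has nothing to do here.
  %packed = tensor.pack %src padding_value(%pad : f32)
      inner_dims_pos = [0, 1] inner_tiles = [%m0, %k0]
      into %dest : tensor<128x256xf32> -> tensor<?x?x?x?xf32>
  return %packed : tensor<?x?x?x?xf32>
}
```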

We can get up to 10x faster on models that had previously been overlooked (e.g., GPT2, Falcon7bGptqPT, BertLarge, etc.); we see 25%, 36%, and 45% slowdowns on some models we had been actively optimizing before. The full benchmark result (with microkernels on) can be found at https://gist.github.com/iree-github-actions-bot/b2ac57f9f1fd6894e93da19d6137c5da

@hanhanW hanhanW added codegen/riscv Code generation targeting RISC-V architecture benchmarks:x86_64 Run default x86_64 benchmarks benchmarks:android-cpu Run default Android CPU benchmarks labels Oct 20, 2023
@hanhanW hanhanW force-pushed the dt-default branch 7 times, most recently from 0231530 to 1283bbb Compare October 25, 2023 22:56
@github-actions bot commented Oct 25, 2023

Abbreviated Benchmark Summary

@ commit 2ed1417fa859d2a4bc3fffc378825b48ad290596 (vs. base 8604e07185fe2ad2ae0d38c7ef83a905a55c631d)

Regressed Latencies 🚩

| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
| --- | --- | --- | --- |
| MobileBertSquad_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 1052.382 (vs. 485.368, 116.82%↑) | 1052.183 | 3.660 |
| PersonDetect_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 1.291 (vs. 0.694, 86.00%↑) | 1.294 | 0.005 |
| MobileNetV2_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 43.113 (vs. 23.264, 85.32%↑) | 43.064 | 0.104 |

[Top 3 out of 20 results shown]

Improved Latencies 🎉

| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
| --- | --- | --- | --- |
| GPT2_117M_TF_1X4XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu][default-flags] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 31.464 (vs. 338.696, 90.71%↓) | 30.959 | 1.409 |
| GPT2_117M_TF_1X4XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu][default-flags] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 28.803 (vs. 247.051, 88.34%↓) | 28.969 | 0.890 |
| GPT2_117M_TF_1X1XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu][default-flags] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 23.764 (vs. 161.571, 85.29%↓) | 23.795 | 0.125 |

[Top 3 out of 24 results shown]

Regressed Compilation Times 🚩

| Benchmark Name | Compilation Time (ms) |
| --- | --- |
| MobileNetV2_fp32(tflite) [vmvx-generic-vmvx-vmvx][experimental-flags,compile-stats] | 15849 (vs. 6937, 128.47%↑) |

Regressed Total Dispatch Sizes 🚩

| Benchmark Name | Total Dispatch Size (bytes) |
| --- | --- |
| MobileNetV2_fp32(tflite) [vmvx-generic-vmvx-vmvx][experimental-flags,compile-stats] | 228725 (vs. 99253, 130.45%↑) |
| MobileNetV3Small_fp32(tflite) [vmvx-generic-vmvx-vmvx][experimental-flags,compile-stats] | 335413 (vs. 166325, 101.66%↑) |
| GPT2_117M_TF_1X1XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags,compile-stats] | 24080 (vs. 20336, 18.41%↑) |

[Top 3 out of 4 results shown]

Improved Total Dispatch Sizes 🎉

| Benchmark Name | Total Dispatch Size (bytes) |
| --- | --- |
| MobileBertSquad_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu][default-flags,compile-stats] | 27888 (vs. 124024, 77.51%↓) |
| EfficientNet_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags,compile-stats] | 170152 (vs. 498824, 65.89%↓) |
| MobileBertSquad_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags,compile-stats] | 2631656 (vs. 6500568, 59.52%↓) |

[Top 3 out of 22 results shown]

Regressed Total Artifact Sizes 🚩

| Benchmark Name | Total Artifact Size (bytes) |
| --- | --- |
| BertForMaskedLMTF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags,compile-stats] | 874377425 (vs. 532227729, 64.29%↑) |
| MiniLML12H384Uncased(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags,compile-stats] | 219488468 (vs. 133974932, 63.83%↑) |
| GPT2_117M_TF_1X4XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags,compile-stats] | 992548692 (vs. 652746132, 52.06%↑) |

[Top 3 out of 6 results shown]

Improved Total Artifact Sizes 🎉

| Benchmark Name | Total Artifact Size (bytes) |
| --- | --- |
| PersonDetect_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags,compile-stats] | 355205 (vs. 416901, 14.80%↓) |
| MobileBertSquad_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags,compile-stats] | 29207621 (vs. 32872133, 11.15%↓) |
| Falcon7bGptqPT(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags,compile-stats] | 343784 (vs. 370088, 7.11%↓) |

[Top 3 out of 4 results shown]

Regressed Stream IR Dispatch Count (# of cmd.dispatch ops) 🚩

| Benchmark Name | Stream IR Dispatch Count (# of cmd.dispatch ops) |
| --- | --- |
| MobileBertSquad_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags,compile-stats] | 1861 (vs. 704, 164.35%↑) |
| MobileBertSquad_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu][default-flags,compile-stats] | 1861 (vs. 704, 164.35%↑) |
| MobileNetV2_fp32(tflite) [vmvx-generic-vmvx-vmvx][experimental-flags,compile-stats] | 188 (vs. 73, 157.53%↑) |

[Top 3 out of 28 results shown]

For more information:

Source Workflow Run

@hanhanW hanhanW force-pushed the dt-default branch 5 times, most recently from 1ea87bc to c55b974 Compare October 31, 2023 18:17
@hanhanW hanhanW removed the codegen/riscv Code generation targeting RISC-V architecture label Nov 1, 2023
@hanhanW hanhanW force-pushed the dt-default branch 2 times, most recently from 01378d7 to ebeef83 Compare November 1, 2023 18:49
@hanhanW hanhanW changed the title from "WIP - Turn data-tiling on by default v2" to "Turn data-tiling on by default." Nov 1, 2023
@hanhanW (Contributor, Author) commented Nov 1, 2023

The biggest regressions on the llvm-cpu side are DeepLabV3, MobileBert, MobileNetV2, and MobileSSD on 8 threads. This is a known issue with distribution.

The single-threaded regressions are 7-20%, which is acceptable to me. See the snapshot below for highlights:

[screenshot: single-threaded regression highlights]

I don't care much about the vmvx regressions; they are likely due to batch_matmul issues. When the microkernels are on, data-tiling is not applied to batch_matmul, so in this context the regression is expected. It's being tracked in #15314.

@dcaballe (Contributor) commented Nov 2, 2023

> The biggest regressions on the llvm-cpu side are DeepLabV3, MobileBert, MobileNetV2, and MobileSSD on 8 threads. This is a known issue with distribution.

Is this because DT uses different distribution sizes than non-DT?

@hanhanW (Contributor, Author) commented Nov 2, 2023

> Is this because DT uses different distribution sizes than non-DT?

DT introduces mmt4d, pack, and unpack ops, none of which exist on the non-DT path. The mmt4d distribution tile sizes, which I implemented one to two years ago, could be outdated, and the pack/unpack distribution tile sizes are not tuned yet. We observe a similar issue in #15349 (comment).
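
To make the comparison concrete, here is a sketch of the ops data-tiling introduces for a plain f32 matmul. The 8x8x4 tile sizes and the shapes are illustrative choices for this example, not the sizes IREE actually selects per target:

```mlir
// Data-tiled 128x256 x 256x512 matmul: DT rewrites it into
// pack -> mmt4d -> unpack. Each of these ops gets its own distribution
// tile sizes, which is where the untuned pack/unpack sizes show up.
func.func @data_tiled_matmul(%lhs: tensor<128x256xf32>,
                             %rhs: tensor<256x512xf32>,
                             %acc: tensor<16x64x8x8xf32>) -> tensor<128x512xf32> {
  %lhs_init = tensor.empty() : tensor<16x64x8x4xf32>
  %rhs_init = tensor.empty() : tensor<64x64x8x4xf32>
  %out_init = tensor.empty() : tensor<128x512xf32>
  // LHS: (M, K) -> (M/8, K/4, 8, 4).
  %plhs = tensor.pack %lhs inner_dims_pos = [0, 1] inner_tiles = [8, 4]
      into %lhs_init : tensor<128x256xf32> -> tensor<16x64x8x4xf32>
  // RHS: (K, N) -> (N/8, K/4, 8, 4), i.e. a transposed outer layout.
  %prhs = tensor.pack %rhs outer_dims_perm = [1, 0] inner_dims_pos = [1, 0]
      inner_tiles = [8, 4]
      into %rhs_init : tensor<256x512xf32> -> tensor<64x64x8x4xf32>
  // Matmul on the packed, tiled operands.
  %mm = linalg.mmt4d
      ins(%plhs, %prhs : tensor<16x64x8x4xf32>, tensor<64x64x8x4xf32>)
      outs(%acc : tensor<16x64x8x8xf32>) -> tensor<16x64x8x8xf32>
  // Back to the row-major (M, N) layout.
  %res = tensor.unpack %mm inner_dims_pos = [0, 1] inner_tiles = [8, 8]
      into %out_init : tensor<16x64x8x8xf32> -> tensor<128x512xf32>
  return %res : tensor<128x512xf32>
}
```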

@hanhanW hanhanW force-pushed the dt-default branch 4 times, most recently from 10cc44e to 647b05a Compare November 7, 2023 18:49
@hanhanW hanhanW force-pushed the dt-default branch 2 times, most recently from fa89343 to 17a9abc Compare November 8, 2023 19:16
@hanhanW hanhanW requested a review from bjacob November 11, 2023 03:31
@hanhanW (Contributor, Author) commented Nov 11, 2023

@bjacob it is +8 -1 now. I can create a PR that explicitly disables data-tiling for those tests, making this PR +1 -0, if you think that's better.

@bjacob (Contributor) commented Nov 11, 2023

Thanks for the splitting you've already done; this is much better. I still think it would be best to go all the way to a one-line change, as you describe.

Also, the current diff still turns on ukernels as well.

Thanks for all the work. I think it would be wise to wait until Monday anyway to flip the switch!

@hanhanW (Contributor, Author) commented Nov 11, 2023

The ukernels flag is mostly there for performance checks; I will disable it before landing the PR. I plan to wait a couple more days to see if others are okay with the flip, and I will make this a one-line change.

@hanhanW (Contributor, Author) commented Nov 11, 2023

The PR is ready: #15557

@dcaballe (Contributor) commented:

Posted new perf results in #15313

I also see some big regressions in CI. Has anybody looked into those?

@bjacob (Contributor) commented Nov 13, 2023

I have lost track of the discussions about the remaining regressions. The latest benchmark results still report some 2x regressions; can you link to the latest thinking about them?

@hanhanW (Contributor, Author) commented Nov 13, 2023

> Posted new perf results in #15313
>
> I also see some big regressions in CI. Has anybody looked into those?

The big regressions happen when ukernels are disabled. There is a known issue with codegen for i8.i8.i32 mmt4d ops: the vector dialect and the LLVM backend are not good at i8.i8.i32. The benchmark with ukernels on can be found at https://gist.github.com/iree-github-actions-bot/b2ac57f9f1fd6894e93da19d6137c5da and those regressions are okay.
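
For reference, a sketch of the problematic op shape; the tile sizes here are illustrative (a VNNI-friendly layout would pick K0 to match the target's i8 dot-product instructions):

```mlir
// i8 x i8 -> i32 mmt4d: i8 inputs, i32 accumulator. Without a
// VNNI-aware lowering, codegen widens the i8 operands to i32 before
// multiplying, which is what makes this variant slow.
func.func @mmt4d_i8i8i32(%lhs: tensor<16x32x8x2xi8>,
                         %rhs: tensor<64x32x8x2xi8>,
                         %acc: tensor<16x64x8x8xi32>) -> tensor<16x64x8x8xi32> {
  %0 = linalg.mmt4d
      ins(%lhs, %rhs : tensor<16x32x8x2xi8>, tensor<64x32x8x2xi8>)
      outs(%acc : tensor<16x64x8x8xi32>) -> tensor<16x64x8x8xi32>
  return %0 : tensor<16x64x8x8xi32>
}
```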

@bjacob (Contributor) left a review:

Thanks for pushing it this far. This is now a 1-line change that is a clear improvement in IREE's benchmarks. OK to merge this pending a check with @dcaballe, but the fact that the known regressions are due to non-ukernel mmt4d codegen means to me that it's time to flip this switch and deal with imperfect mmt4d codegen separately.

On a parallel front, I am working to make it easy to switch on ukernels for mmt4d selectively (as opposed to wholesale for all ops for which we might have more ukernels in the future). The first step was #15571, and I'll advance that further today. That provides another layer of reasoning for why it's OK to move ahead without letting mmt4d codegen block this.

@bjacob (Contributor) commented Nov 14, 2023

Just to make this concrete, #15586 shows how a 1-line change can now enable UK for mmt4d by default (without this being a UK blank cheque with respect to future UK lowerings). That is: if the last thing preventing DT-by-default is currently suboptimal codegen for mmt4d ops, we can enable UK for mmt4d so that it no longer affects the default path.

@MaheshRavishankar (Contributor) left a review:

Could we merge this change together with @bjacob's change that makes mmt4d use ukernels, and land them together? That might help with the regressions (I am treating mmt4d + non-ukernel as the non-default path).

@bjacob (Contributor) commented Nov 14, 2023

Then let's discuss both in the "Mai-Tai sync" meeting 30 minutes from now. If we confirm we want to do this now, let's merge #15586 first, then this one, so they stay separate (useful for bisection; we never know) and so that the ordering ensures there is no intermediate state where the performance of non-ukernel mmt4d codegen is felt by default.

@MaheshRavishankar (Contributor) commented:

> Then let's discuss both in the "Mai-Tai sync" meeting 30 minutes from now. If we confirm we want to do this now, let's merge #15586 first, then this one, so they stay separate (useful for bisection; we never know) and so that the ordering ensures there is no intermediate state where the performance of non-ukernel mmt4d codegen is felt by default.

That sounds good to me!

@bjacob (Contributor) commented Nov 14, 2023

This has been discussed on Discord with @dcaballe here: https://discord.com/channels/689900678990135345/1146173056537079919/1174043054282391654

The summary is that what is blocking this now is actually #15242. It's orthogonal to mmt4d codegen and ukernels, is being actively worked on by @NatashaKnk, and is discussed on Discord below the message linked above.

I think it is worth waiting a bit to give that a chance to be resolved first, so we can hope for DT enablement to be a 100% clean win :-)

@hanhanW (Contributor, Author) commented Nov 15, 2023

> Posted new perf results in #15313
>
> I also see some big regressions in CI. Has anybody looked into those?

The regressions are mostly from the i8.i8.i32 mmt4d kernels; we are not generating VNNI instructions. Issue filed: #15611.

@stellaraccident (Collaborator) commented:

I just touched base with some of the stakeholders, and it sounds like there was consensus to enable this by default. @jpienaar, do you concur?

@hanhanW (Contributor, Author) commented Nov 29, 2023

There is a failure after rebasing on ToT; the vmvx issue will be fixed by #15705.

@hanhanW (Contributor, Author) commented Nov 29, 2023

Landing the PR because the last major compilation issue (#15692) is fixed.

@hanhanW hanhanW merged commit 52eb7e9 into iree-org:main Nov 29, 2023
57 of 59 checks passed
@hanhanW hanhanW deleted the dt-default branch November 29, 2023 20:31
ramiro050 pushed a commit to ramiro050/iree that referenced this pull request Dec 19, 2023
The revision turns data-tiling on by default.

We can get up to 10x faster on models that had previously been overlooked (e.g., GPT2, Falcon7bGptqPT, BertLarge, etc.); we see 25%, 36%, and 45% slowdowns on some models we had been actively optimizing before. The full benchmark result (with microkernels on) can be found at https://gist.github.com/iree-github-actions-bot/b2ac57f9f1fd6894e93da19d6137c5da