Turn data-tiling on by default. #15256
Conversation
The biggest regressions on the llvm-cpu side are DeepLabV3, MobileBert, MobileNetV2, and MobileSSD on 8 threads. This is a known issue about distribution. The single-threaded regressions are 7-20%, which is acceptable to me. See the snapshot below for highlights. I don't really care about the vmvx regressions; they are likely because of batch_matmul issues. If the microkernel is on, data-tiling is not applied to batch_matmul, so in that context the regression is expected. It's being tracked in #15314 |
Is this because DT is using different distribution sizes to non-DT? |
DT introduces mmt4d ops, pack ops, and unpack ops, while non-DT does not have those ops. The mmt4d distribution tile sizes, which I implemented 1 to 2 years ago, could be outdated. The pack/unpack distribution tile sizes are not tuned yet. We observe a similar issue in #15349 (comment) |
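To make this concrete, here is a rough sketch (not taken from this PR; the shapes and inner tile sizes are made up) of how data-tiling rewrites a plain matmul into pack, mmt4d, and unpack ops:

```mlir
// Hypothetical f32 example: M=128, K=256, N=512 with inner tiles M0=8, N0=8, K0=1.
// LHS (128x256) -> 16x256x8x1.
%lhs_packed = tensor.pack %lhs inner_dims_pos = [0, 1] inner_tiles = [8, 1]
    into %lhs_init : tensor<128x256xf32> -> tensor<16x256x8x1xf32>
// RHS (256x512) -> 64x256x8x1 (transposed outer layout expected by mmt4d).
%rhs_packed = tensor.pack %rhs outer_dims_perm = [1, 0] inner_dims_pos = [1, 0] inner_tiles = [8, 1]
    into %rhs_init : tensor<256x512xf32> -> tensor<64x256x8x1xf32>
// Accumulator (128x512) -> 16x64x8x8.
%acc_packed = tensor.pack %acc inner_dims_pos = [0, 1] inner_tiles = [8, 8]
    into %acc_init : tensor<128x512xf32> -> tensor<16x64x8x8xf32>
// Matmul on the packed (tiled) layouts.
%res = linalg.mmt4d ins(%lhs_packed, %rhs_packed : tensor<16x256x8x1xf32>, tensor<64x256x8x1xf32>)
    outs(%acc_packed : tensor<16x64x8x8xf32>) -> tensor<16x64x8x8xf32>
// Back to the row-major 128x512 result.
%result = tensor.unpack %res inner_dims_pos = [0, 1] inner_tiles = [8, 8]
    into %out_init : tensor<16x64x8x8xf32> -> tensor<128x512xf32>
```

Each of the pack/unpack ops and the mmt4d op gets its own distribution tile sizes, which is where the mismatch with the non-DT path can come from.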
@bjacob it is |
Thanks for splitting this up already, much better. Now I still think it would be best to go all the way to a 1-line change, like you describe. Also, the current diff still turns on ukernels as well. Thanks for all the work. I think it would be wise to wait until Monday anyway to flip the switch! |
The ukernels flag is mostly for performance checks. I will disable it before landing the PR. I plan to wait a couple more days and see if others are okay with the flip. I will make this a one-line change. |
the PR is ready: #15557 |
Posted new perf results in #15313. I also see some big regressions in CI. Has anybody looked into those? |
I have lost track of the discussions about remaining regressions. The latest benchmark results still report some 2x regressions. Can you link to the latest thinking about them? |
The big regressions happen when ukernels are disabled. There is a known issue with codegen for i8.i8.i32 mmt4d ops: the vector dialect and the LLVM backend are not good at i8.i8.i32. The benchmark with ukernels on can be found at https://gist.github.com/iree-github-actions-bot/b2ac57f9f1fd6894e93da19d6137c5da and the regressions there are okay |
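For context, "i8.i8.i32 mmt4d" means an mmt4d whose operands are i8 and whose accumulator is i32. A hypothetical instance (tile sizes made up, not from this PR) looks like:

```mlir
// i8 x i8 -> i32 tiled matmul; the mixed element types are what the
// non-ukernel codegen path currently handles poorly.
%res = linalg.mmt4d ins(%lhs, %rhs : tensor<16x128x8x2xi8>, tensor<64x128x8x2xi8>)
    outs(%acc : tensor<16x64x8x8xi32>) -> tensor<16x64x8x8xi32>
```

The mmt4d ukernels cover this case with hand-written kernels, which is why the regressions disappear when ukernels are on.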
Thanks for pushing it this far. This is now a 1-line change that's a clear improvement in IREE's benchmarks. OK to merge this pending checking with @dcaballe , but the fact that known regressions are due to non-ukernel mmt4d codegen means to me that it's time to flip this switch and deal with imperfect mmt4d codegen separately.
On a parallel front, I am working to make it easy to switch on ukernel for mmt4d selectively (as opposed to wholesale for all ops for which we might have more ukernels in the future). The first step was #15571 and I'll advance that further today. So that provides another layer of reasoning why it's OK to move ahead without letting mmt4d codegen block this.
Just to make this concrete, #15586 shows how a 1-line change can now enable UK for mmt4d by default (without this being a UK blank cheque wrt future UK lowerings). This is saying, if the last thing preventing DT-by-default is currently suboptimal codegen for mmt4d ops, we can enable UK for mmt4d so that doesn't affect the default path anymore. |
Could we merge this change together with @bjacob's change to make mmt4d use ukernels, and land them together? That might help with the regressions (I am treating mmt4d + non-ukernel as the non-default path).
Then let's discuss both in the "Mai-Tai sync" meeting in 30 min from now. And if we confirm we want to do this now, let's merge #15586 first, then this, so they're still separate (useful for bisection, we never know) and so that order ensures no intermediate state where the performance of non-ukernel mmt4d codegen is felt by default. |
That sounds good to me! |
This has been discussed on Discord with @dcaballe here: https://discord.com/channels/689900678990135345/1146173056537079919/1174043054282391654 The summary is that what is blocking this now is actually #15242, which is an orthogonal issue. I think it is valuable to wait a bit to give it a chance to be resolved first, so we can hope for DT enablement to be a 100% clean win :-) |
I just touched base with some of the stakeholders and it sounds like there was consensus to enable this by default. @jpienaar concur? |
There is a failure after rebasing on ToT. The vmvx issue will be fixed by #15705 |
Landing the PR because the last major compilation issue (#15692) is fixed. |
The revision turns data-tiling on by default. In the commit:
- `kDynamic` is used, so we have nothing to do for adjustment.
- Data-tiling is now under `transformOptions.options.dataTiling` control.

We can get up to 10x speedups for models that were not previously a focus (e.g., GPT2, Falcon7bGptqPT, BertLarge, etc.); we see 25%, 36%, and 45% slowdowns for some models we've been working on previously. The full benchmark result (with microkernels on) can be found at https://gist.github.com/iree-github-actions-bot/b2ac57f9f1fd6894e93da19d6137c5da