[CPU] Improve tile sizes selection for tensor.pack ops. #15397
Conversation
Looks fine to me... I don't have much of a handle on the heuristics here. If you have any data on why it helps, post it here for the record.
Done. I included the perf numbers in the description.
Do you know if the Falcon7B regression is real? That's one of the main models we are working on right now.
I can take a look, but I suspect it is not real. The regression did not happen before I rebased onto ToT. https://gist.github.com/iree-github-actions-bot/99c8f439051a13ad1a3d0ebe112056b9
I'm pretty sure it is not real because it is on default flags, not data-tiling.
It disables special vector sizes for non-f32 cases because that logic only applies to 16x16 transpose cases. The improvements in dispatch sizes come from vectorization. We are not able to vectorize named ops when they have dynamic shapes, which is fixed by llvm/llvm-project@03529b9. The change allows backends to vectorize them because the shapes become static (by tiling with size=1). It is not a hard requirement; we track it in #15441
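To illustrate the size=1 idea, here is a minimal sketch; it is not the actual IREE heuristic, and `pickTileSizes`, `kDynamic`, and `preferredTileSize` are hypothetical names. Forcing a tile size of 1 on a dynamic dimension makes the tiled slice statically shaped, so the named op becomes vectorizable:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

constexpr int64_t kDynamic = -1;  // sentinel for a dynamic dimension size

// Hypothetical helper (not the actual IREE API): choose tile sizes for a
// named op given its (possibly dynamic) loop ranges.
std::vector<int64_t> pickTileSizes(const std::vector<int64_t> &loopRanges,
                                   int64_t preferredTileSize) {
  std::vector<int64_t> tileSizes;
  tileSizes.reserve(loopRanges.size());
  for (int64_t range : loopRanges) {
    if (range == kDynamic) {
      // Tiling a dynamic dimension with size 1 yields a static (unit)
      // slice, which downstream vectorization can handle.
      tileSizes.push_back(1);
    } else {
      tileSizes.push_back(std::min(range, preferredTileSize));
    }
  }
  return tileSizes;
}
```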
The revision takes the number of threads into account, so we get better performance in multi-threaded runs. It also reduces runtime overhead.
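A rough sketch of what thread-aware tile sizing can look like; this is a hypothetical heuristic for illustration, not the code in this revision. Sizing distribution tiles so the outer parallel loop yields roughly one tile per thread cuts task-scheduling overhead at runtime:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical heuristic: pick a distribution tile size so that the trip
// count splits into about numThreads tiles, without going below a
// vector-friendly minimum.
int64_t pickDistributionTileSize(int64_t tripCount, int64_t numThreads,
                                 int64_t minTileSize) {
  int64_t perThread = (tripCount + numThreads - 1) / numThreads;  // ceil div
  return std::max(perThread, minTileSize);
}
```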
This is a step toward #15391 and #15349
It improves the performance of the tensor.pack op from 420 ms to 170 ms on an 8-thread x86 CPU.