
[GPU] Match TileAndFuse Matmul heuristics to VectorDistribute #19666

Open · wants to merge 5 commits into main from match_heuristics
Conversation

@nirvedhmeshram (Contributor)

This provides comparable performance on CI tests.

nirvedhmeshram changed the title from "[GPU] Match TileAndFuse Matmul heuristics to Vector Distribute" to "[GPU] Match TileAndFuse Matmul heuristics to VectorDistribute" on Jan 10, 2025
@nirvedhmeshram (Contributor, Author) commented Jan 10, 2025

It appears punet (which I assume has convs and hence takes the IGEMM path) prefers the old configurations. I'm going to experiment to see whether having different heuristics for those two gives better perf on the CI tests, and will update this PR accordingly.

@qedawkins (Contributor)

We could make the seeds a parameter to getMmaScheduleFromProblemAndTarget and pass different ones for conv and matmul

@nirvedhmeshram (Contributor, Author)

> We could make the seeds a parameter to getMmaScheduleFromProblemAndTarget and pass different ones for conv and matmul

Yeah, that's what I was thinking too. There are also the transposedLhs and transposedRhs bools, which have an effect on the GEMM side; I need to see how sensitive the conv dispatches are to those.
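
For illustration, a minimal sketch of that direction (the struct name, its fields, pickSeeds, and all numbers here are hypothetical, not the actual IREE code):

```cpp
// Sketch only: illustrative names and values, not the verbatim IREE API.
#include <cstdint>

// Seeds that bias the MMA schedule deduction. Making these a parameter of
// getMmaScheduleFromProblemAndTarget lets conv (IGEMM) and matmul callers
// pass different tuning points instead of sharing one hardcoded set.
struct MmaHeuristicSeeds {
  int64_t bestSubgroupCountPerWorkgroup;
  int64_t bestMNTileCountPerSubgroup;
  int64_t bestKTileCountPerSubgroup;
  int64_t bestKElementCountPerSubgroup;
};

// Hypothetical call-site shape; transposedLhs/transposedRhs would also be
// threaded through, since they affect the GEMM side.
MmaHeuristicSeeds pickSeeds(bool isIGEMMConv) {
  if (isIGEMMConv)
    return {4, 8, 2, 32};  // made-up numbers for the conv path
  return {4, 16, 4, 64};   // made-up numbers for the matmul path
}
```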

@nirvedhmeshram (Contributor, Author)

Here is the clip performance with the old (existing) heuristics:

[image: clip benchmark results with the old heuristics]

Here it is with the new heuristics (copied from VectorDistribute, i.e. from this PR):

[image: clip benchmark results with the new heuristics]

I will share the configs for the problem dispatches in the next message.

@nirvedhmeshram (Contributor, Author) commented Jan 10, 2025

dispatch_120 is 116 us with the old heuristic and 78 us with the new heuristic.

The old (bad) config is

```mlir
{lowering_config = #iree_gpu.lowering_config<{
  mma_kind = #iree_gpu.mma_layout<MFMA_F32_16x16x16_F16>,
  promote_operands = [0, 1], reduction = [0, 0, 4],
  subgroup = [1, 4, 0], workgroup = [16, 256, 0]}>}
```

The new config is

```mlir
{lowering_config = #iree_gpu.lowering_config<{
  mma_kind = #iree_gpu.mma_layout<MFMA_F32_16x16x16_F16>,
  promote_operands = [0, 1], reduction = [0, 0, 8],
  subgroup = [1, 2, 0], workgroup = [16, 128, 0]}>}
```

That is, the new config doubles the K reduction tile (4 → 8) while halving the N tiling (subgroup 4 → 2, workgroup 256 → 128).

@kuhar (Member) left a comment

I don't want to block it if it improves performance on real-world programs, but I'd really hope we can explain our choice and derive it from the hw constants.

@nirvedhmeshram (Contributor, Author) commented Jan 10, 2025

> I don't want to block it if it improves performance on real-world programs, but I'd really hope we can explain our choice and derive it from the hw constants.

That is fair. I was able to derive it from the hw constants; it just needs a larger bestKElementCountPerSubgroup, which can still be a function of the kCacheLineSizeBits. However, I uncovered a bug in the process: #19675. Currently pushing with a workaround.
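
For illustration, a sketch of that kind of derivation; the kCacheLineSizeBits value, the helper name, and the factor of two cache lines are assumptions for the example, not taken from this PR:

```cpp
// Sketch only: assumed constants and illustrative names.
#include <cstdint>

constexpr int64_t kCacheLineSizeBits = 128 * 8;  // assumed 128-byte lines

// A larger bestKElementCountPerSubgroup that is still a function of the
// cache line size: seed K at N whole cache lines of the input element type.
constexpr int64_t bestKElementCountPerSubgroup(int64_t inBitWidth,
                                               int64_t cacheLinesPerSubgroup) {
  return cacheLinesPerSubgroup * kCacheLineSizeBits / inBitWidth;
}

static_assert(bestKElementCountPerSubgroup(/*f16=*/16, /*lines=*/2) == 128,
              "two cache lines of f16 give a K seed of 128 elements");
```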

nirvedhmeshram force-pushed the match_heuristics branch 2 times, most recently from a246ab1 to 342f49e on January 10, 2025 at 23:28
@nirvedhmeshram (Contributor, Author)

I believe the regression seen after the heuristic change could be related to what we found in #19671. I am going to wait for that to land, then rebase and see how the CI does.
