
[GPU] Match TileAndFuse Matmul heuristics to VectorDistribute #19666

Open · wants to merge 5 commits into main from match_heuristics
Conversation

@nirvedhmeshram (Contributor)

This provides comparable performance on CI tests.

nirvedhmeshram changed the title from "[GPU] Match TileAndFuse Matmul heuristics to Vector Distribute" to "[GPU] Match TileAndFuse Matmul heuristics to VectorDistribute" on Jan 10, 2025
@nirvedhmeshram (Contributor, Author) commented Jan 10, 2025

It appears punet (which I assume has convs and hence takes the IGEMM path) prefers the old configurations. I'm going to experiment to see whether having different heuristics for those two gives better perf on the CI tests, and will update this PR accordingly.

@qedawkins (Contributor)

We could make the seeds a parameter to getMmaScheduleFromProblemAndTarget and pass different ones for conv and matmul

@nirvedhmeshram (Contributor, Author)

> We could make the seeds a parameter to getMmaScheduleFromProblemAndTarget and pass different ones for conv and matmul

Yeah, that's what I was thinking too. There are also the transposedLhs and transposedRhs bools, which have an effect on the GEMM side; I need to see how sensitive the conv dispatches are to those.
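
For illustration, a minimal sketch of that direction (the struct name, its fields, pickSeeds, and all numbers here are hypothetical, not the actual IREE code):

```cpp
// Sketch only: illustrative names and values, not the verbatim IREE API.
#include <cstdint>

// Seeds that bias the MMA schedule deduction. Making these a parameter of
// getMmaScheduleFromProblemAndTarget lets conv (IGEMM) and matmul callers
// pass different tuning points instead of sharing one hardcoded set.
struct MmaHeuristicSeeds {
  int64_t bestSubgroupCountPerWorkgroup;
  int64_t bestMNTileCountPerSubgroup;
  int64_t bestKTileCountPerSubgroup;
  int64_t bestKElementCountPerSubgroup;
};

// Hypothetical call-site shape; transposedLhs/transposedRhs would also be
// threaded through, since they affect the GEMM side.
MmaHeuristicSeeds pickSeeds(bool isIGEMMConv) {
  if (isIGEMMConv)
    return {4, 8, 2, 32};  // made-up numbers for the conv path
  return {4, 16, 4, 64};   // made-up numbers for the matmul path
}
```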

@nirvedhmeshram (Contributor, Author)

Here is the clip performance with the old (existing) heuristics:

[image: clip benchmark results with the old heuristics]

Here it is with the new heuristics (copied from VectorDistribute, i.e. from this PR):

[image: clip benchmark results with the new heuristics]

I will share the configs for the problem dispatches in the next message.

@nirvedhmeshram (Contributor, Author) commented Jan 10, 2025

dispatch_120 is 116 us with the old heuristic and 78 us with the new heuristic.

The old (bad) config is

```mlir
{lowering_config = #iree_gpu.lowering_config<{
  mma_kind = #iree_gpu.mma_layout<MFMA_F32_16x16x16_F16>,
  promote_operands = [0, 1], reduction = [0, 0, 4],
  subgroup = [1, 4, 0], workgroup = [16, 256, 0]}>}
```

The new config is

```mlir
{lowering_config = #iree_gpu.lowering_config<{
  mma_kind = #iree_gpu.mma_layout<MFMA_F32_16x16x16_F16>,
  promote_operands = [0, 1], reduction = [0, 0, 8],
  subgroup = [1, 2, 0], workgroup = [16, 128, 0]}>}
```

That is, the new config doubles the K reduction tile (4 → 8) while halving the N tiling (subgroup 4 → 2, workgroup 256 → 128).

@kuhar (Member) left a comment

I don't want to block it if it improves performance on real-world programs, but I'd really hope we can explain our choice and derive it from the hw constants.

@nirvedhmeshram (Contributor, Author) commented Jan 10, 2025

> I don't want to block it if it improves performance on real-world programs, but I'd really hope we can explain our choice and derive it from the hw constants.

That is fair. I was able to derive it from the hw constants; it just needs a larger bestKElementCountPerSubgroup, which can still be a function of the kCacheLineSizeBits. However, I uncovered a bug in the process: #19675. Currently pushing with a workaround.
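
For illustration, a sketch of that kind of derivation; the kCacheLineSizeBits value, the helper name, and the factor of two cache lines are assumptions for the example, not taken from this PR:

```cpp
// Sketch only: assumed constants and illustrative names.
#include <cstdint>

constexpr int64_t kCacheLineSizeBits = 128 * 8;  // assumed 128-byte lines

// A larger bestKElementCountPerSubgroup that is still a function of the
// cache line size: seed K at N whole cache lines of the input element type.
constexpr int64_t bestKElementCountPerSubgroup(int64_t inBitWidth,
                                               int64_t cacheLinesPerSubgroup) {
  return cacheLinesPerSubgroup * kCacheLineSizeBits / inBitWidth;
}

static_assert(bestKElementCountPerSubgroup(/*f16=*/16, /*lines=*/2) == 128,
              "two cache lines of f16 give a K seed of 128 elements");
```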

nirvedhmeshram force-pushed the match_heuristics branch 2 times, most recently from a246ab1 to 342f49e on January 10, 2025 at 23:28
@nirvedhmeshram (Contributor, Author)

I believe the regression seen after the heuristic change could be related to what we found in #19671. I am going to wait for that to land, then rebase and see how the CI does.
