[AMD] Restructure ReorderInstructions pass #4998

antiagainst · 2024-10-26T23:28:52Z

This commit restructures the ReorderInstructions pass to be
more modular and cleaner by factoring out utility functions,
and guards each rule so that it applies only to its intended
use case rather than always being globally on. Along the way,
it drops an unnecessary local_load sink, given that the op is
later hoisted again.

Fixes https://github.com/ROCm/triton-internal/issues/280

antiagainst · 2024-10-27T00:25:51Z

@zhanglx13: the first commit is basically NFC with some IR mutation robustness improvements. the second and third commits are the meaningful changes. i'd suggest viewing the diff with the "Hide whitespace" toggle on.

sjw36

LGTM!
The comment below doesn't change functionality, and will be deprecated by sched-v3 anyway.

sjw36 · 2024-10-28T17:45:21Z

third_party/amd/lib/TritonAMDGPUTransforms/ReorderInstructions.cpp

+static bool isPureMatmulProblem(ModuleOp moduleOp) {
+  for (auto forOp : moduleOp.getOps<scf::ForOp>()) {
+    int counter = 0;
+    forOp.walk([&counter](triton::DotOp dotOp) { ++counter; });


This would also trigger for the parent of nested loops.
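To illustrate the concern, here is a minimal sketch using a toy operation tree rather than the real MLIR API. It assumes that a recursive walk visits ops nested at any depth, so a dot inside an inner loop is also counted for the enclosing loop. The `Op` struct and both helper functions are hypothetical names introduced only for this example.

```cpp
#include <cassert>
#include <vector>

// Toy stand-in for an operation tree. This is NOT the MLIR API.
enum class Kind { For, Dot, Other };

struct Op {
  Kind kind;
  std::vector<Op> body; // nested ops; only meaningful for loops
};

// Mimics a recursive walk: counts every Dot nested anywhere under `op`,
// including Dots that live inside inner loops.
int countDotsRecursive(const Op &op) {
  int count = 0;
  for (const Op &child : op.body) {
    if (child.kind == Kind::Dot)
      ++count;
    count += countDotsRecursive(child);
  }
  return count;
}

// Counts only Dots that are immediate children of the given loop.
int countDotsDirect(const Op &loop) {
  assert(loop.kind == Kind::For && "expected a loop op");
  int count = 0;
  for (const Op &child : loop.body)
    if (child.kind == Kind::Dot)
      ++count;
  return count;
}
```

With an outer loop whose only dot sits inside an inner loop, the recursive count reports one dot for the outer loop while the direct count reports none, which is the behavior the comment points out.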

(cherry picked from commit 86a2ac7)

Cherry pick list:
- #4925
- #5053
- #5019
- #5002
- #4935
- required additional cherry picks #4991 and #4951
- #4998
- #4925
- #5281
- #5308
- All previous LLVM hash PRs before #5308

---------

Co-authored-by: Ilya V <[email protected]>
Co-authored-by: Lei Zhang <[email protected]>
Co-authored-by: Lixun Zhang <[email protected]>
Co-authored-by: Keren Zhou <[email protected]>
Co-authored-by: Alexander Efimov <[email protected]>
Co-authored-by: Kyle Wang <[email protected]>
Co-authored-by: Jungwook Park <[email protected]>
Co-authored-by: peterbell10 <[email protected]>
Co-authored-by: Hongtao Yu <[email protected]>

Reverts #5191 due to some mlir errors in pytorch unit tests

Smaller set of cherry picks:
- #5308 (and previous LLVM upgrades)
- #5281
- #4925
- #5053
- #5019
- #4998

---------

Co-authored-by: Jungwook Park <[email protected]>
Co-authored-by: peterbell10 <[email protected]>
Co-authored-by: Hongtao Yu <[email protected]>
Co-authored-by: Lei Zhang <[email protected]>
Co-authored-by: Ilya V <[email protected]>
Co-authored-by: Kyle Wang <[email protected]>

* [AMD] Emit vectorized 16-bit float LLVM atomic ops (triton-lang#4925)

  In the case of 16-bit float operands for tt::AtomicRMWOp, construct only one LLVM::AtomicRMWOp but use a vector of elements. This approach allows generating packed intrinsics and processing 2 elements at once. Added a lit test for the f16 vectorized case.

  (cherry picked from commit 78c8054)

* [AMD] Restructure ReorderInstructions pass (triton-lang#4998)

  (cherry picked from commit 86a2ac7)

* [AMD] Support warp-level reduction with DPP (triton-lang#5019)

  This commit adds support for warp-level reduction with DPP instructions, which can improve performance. See https://gpuopen.com/learn/amd-gcn-assembly-cross-lane-operations/

  (cherry picked from commit 21119e3)

* [AMD] Add missing dependency to TritonAMDGPUIR (triton-lang#5053)

  TritonAMDGPUTransforms now depends on it.

  (cherry picked from commit 0b443ce)

* [AMD] Support warp-level reduction with DPP (triton-lang#5019)

  This commit adds support for warp-level reduction with DPP instructions, which can improve performance. See https://gpuopen.com/learn/amd-gcn-assembly-cross-lane-operations/

  (cherry picked from commit 21119e3)

* [AMD] Use DPP to accelerate 16-bit floats (triton-lang#5072)

  In the case of unpaired f16 elements, utilize DPP instructions to accelerate atomics. Here is the algorithm for lowering `tt::atomicRmwOp(%ptr, %val, %mask)`:

  0. Group threads by pairs. The master thread is (tid % 2 == 0);
  1. All the threads send `%val` to the `(tid - 1)` thread via `dppUpdateOp shl`, so all the masters receive the value from the secondary threads;
  2. Take into account parity in the `%mask` value, and build CF structures according to it;
  3. Generate `llvm::atomicRmwOp` in the threads enabled by the `%mask` value;
  4. All the threads send the result of the generated operation to the `(tid + 1)` thread via `dppUpdateOp shl`, so all secondary threads also receive their result.

  The DPP approach has a ~5% perf improvement, so use it when the target arch supports DPP.

  Signed-off-by: Ilya Veselov <[email protected]>

  (cherry picked from commit bab3470)

* [AMD] Reland sinking the 2nd tt.load after local_load's (triton-lang#4935)

  This PR adds more restrictions on when we should apply the sched-load optimizations and un-reverts triton-lang#4823.

  We will only apply the optimization when all of the following are satisfied:

  1. pureMatmulProblem, i.e. 1 `tt.dot` in the main loop
  2. two `tt.load`s in the main loop
  3. 2nd `tt.load` is ahead of the `tt.dot`
  4. 1st user of 2nd `tt.load` is after the `tt.dot`
  5. tile size is large enough, i.e. nonKDim >= 128 and kDim >= 64

  (cherry picked from commit 4f6f768)

---------

Co-authored-by: Ilya V <[email protected]>
Co-authored-by: Lei Zhang <[email protected]>
Co-authored-by: Kyle Wang <[email protected]>
Co-authored-by: Lixun Zhang <[email protected]>
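The five conditions listed for triton-lang#4935 can be read as a single conjunction. The following is only a sketch under stated assumptions: `LoopSummary` and `shouldSinkSecondLoad` are hypothetical names, the positions are plain indices into the loop body, and `nonKDim` is treated as one integer. None of this reflects the actual data structures used by the pass.

```cpp
// Hypothetical summary of a main loop. These fields are illustrative
// and do not correspond to real Triton data structures.
struct LoopSummary {
  int numDots;         // number of tt.dot ops in the main loop
  int numLoads;        // number of tt.load ops in the main loop
  int secondLoadIndex; // position of the 2nd tt.load in the loop body
  int dotIndex;        // position of the tt.dot in the loop body
  int firstUserIndex;  // position of the first user of the 2nd tt.load
  int nonKDim;         // tile size along the non-K dimension (assumed scalar)
  int kDim;            // tile size along the K dimension
};

// Returns true only when all five listed conditions hold.
bool shouldSinkSecondLoad(const LoopSummary &s) {
  const bool pureMatmul = s.numDots == 1;                     // condition 1
  const bool twoLoads = s.numLoads == 2;                      // condition 2
  const bool loadBeforeDot = s.secondLoadIndex < s.dotIndex;  // condition 3
  const bool userAfterDot = s.firstUserIndex > s.dotIndex;    // condition 4
  const bool largeTile = s.nonKDim >= 128 && s.kDim >= 64;    // condition 5
  return pureMatmul && twoLoads && loadBeforeDot && userAfterDot && largeTile;
}
```

Keeping each condition as a named boolean makes it easy to see which restriction rejects a given loop.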

Cherry pick list:
- triton-lang#4925
- triton-lang#5053
- triton-lang#5019
- triton-lang#5002
- triton-lang#4935
- required additional cherry picks triton-lang#4991 and triton-lang#4951
- triton-lang#4998
- triton-lang#4925
- triton-lang#5281
- triton-lang#5308
- All previous LLVM hash PRs before triton-lang#5308

---------

Co-authored-by: Ilya V <[email protected]>
Co-authored-by: Lei Zhang <[email protected]>
Co-authored-by: Lixun Zhang <[email protected]>
Co-authored-by: Keren Zhou <[email protected]>
Co-authored-by: Alexander Efimov <[email protected]>
Co-authored-by: Kyle Wang <[email protected]>
Co-authored-by: Jungwook Park <[email protected]>
Co-authored-by: peterbell10 <[email protected]>
Co-authored-by: Hongtao Yu <[email protected]>

(cherry picked from commit 2d8093c)

Reverts triton-lang#5191 due to some mlir errors in pytorch unit tests

Smaller set of cherry picks:
- triton-lang#5308 (and previous LLVM upgrades)
- triton-lang#5281
- triton-lang#4925
- triton-lang#5053
- triton-lang#5019
- triton-lang#4998

---------

Co-authored-by: Jungwook Park <[email protected]>
Co-authored-by: peterbell10 <[email protected]>
Co-authored-by: Hongtao Yu <[email protected]>
Co-authored-by: Lei Zhang <[email protected]>
Co-authored-by: Ilya V <[email protected]>
Co-authored-by: Kyle Wang <[email protected]>

(cherry picked from commit 7e401df)


antiagainst added 3 commits October 26, 2024 23:56

[AMD] NFC: Restructure ReorderInstructions pass

92ad7b5

This commit restructures the ReorderInstructions pass to be more modular and cleaner with utility functions and guard rule applications against their intended usage rather than always being globally on.

Drop local_load sink logic given it's reverted in a next step

3255955

Add guard for logic that is applicable to matmul

c6a6a7f

antiagainst force-pushed the amd-reorder-pass branch from 6705805 to c6a6a7f Compare October 27, 2024 00:07

antiagainst marked this pull request as ready for review October 27, 2024 00:22

antiagainst requested a review from zhanglx13 as a code owner October 27, 2024 00:22

zhanglx13 approved these changes Oct 28, 2024

sjw36 reviewed Oct 28, 2024

sjw36 approved these changes Oct 28, 2024

antiagainst merged commit 86a2ac7 into triton-lang:main Oct 28, 2024
7 checks passed

antiagainst deleted the amd-reorder-pass branch October 28, 2024 23:31

AlexAUT pushed a commit to AlexAUT/triton that referenced this pull request Oct 29, 2024

[AMD] Restructure ReorderInstructions pass (triton-lang#4998)

dffb97d

Luosuu pushed a commit to Luosuu/triton that referenced this pull request Nov 13, 2024

[AMD] Restructure ReorderInstructions pass (triton-lang#4998)

e9655c2

guacamoleo pushed a commit to guacamoleo/triton that referenced this pull request Nov 14, 2024

[AMD] Restructure ReorderInstructions pass (triton-lang#4998)

bd74bd7

jataylo pushed a commit to jataylo/triton that referenced this pull request Nov 18, 2024

[AMD] Restructure ReorderInstructions pass (triton-lang#4998)

bbd72b7

(cherry picked from commit 86a2ac7)

jataylo mentioned this pull request Nov 19, 2024

[AMD] release/3.2.x AMD perf cherry picks #5191

Merged

jataylo pushed a commit to jataylo/triton that referenced this pull request Dec 5, 2024

[AMD] Restructure ReorderInstructions pass (triton-lang#4998)

218d8b0

(cherry picked from commit 86a2ac7)

jataylo mentioned this pull request Dec 5, 2024

[AMD] rc/3.2.x cherry picks #5347

Merged

jataylo pushed a commit to jataylo/triton that referenced this pull request Dec 11, 2024

[AMD] Restructure ReorderInstructions pass (triton-lang#4998)

0c35781

(cherry picked from commit 86a2ac7)

jataylo mentioned this pull request Dec 12, 2024

[Release/3.2.x] AMD Cherry Picks #5413

Closed

jataylo pushed a commit to jataylo/triton that referenced this pull request Dec 13, 2024

[AMD] Restructure ReorderInstructions pass (triton-lang#4998)

4c7d56e

(cherry picked from commit 86a2ac7)

jataylo mentioned this pull request Dec 13, 2024

[CP] AMD Performance cherry picks ROCm/triton#682

Merged
