[AMD] Reland sinking the 2nd tt.load after local_load's #4935

Merged: 6 commits merged into triton-lang:main from robust_sched_load on Oct 31, 2024

Conversation

@zhanglx13 (Collaborator) commented on Oct 16, 2024

This PR adds more restrictions on when we apply the sched-load optimization, and un-reverts #4823.
We will only apply the optimization when all of the following are satisfied (see the sketch after the list):

  1. pureMatmulProblem, i.e. one `tt.dot` in the main loop
  2. two `tt.load`s in the main loop
  3. the 2nd `tt.load` is ahead of the `tt.dot`
  4. the 1st user of the 2nd `tt.load` is after the `tt.dot`
  5. the tile size is large enough, i.e. nonKDim >= 128 and kDim >= 64
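
As a rough illustration only, the gating might look like the sketch below. This is a hedged sketch, not the actual pass code: the helper name `shouldSinkSecondLoad`, the accessors used, and the way the tile shape is read off the `tt.dot` operand are all assumptions, and it assumes the relevant ops live in the loop body's single block.

```cpp
#include "mlir/Dialect/SCF/IR/SCF.h"
#include "triton/Dialect/Triton/IR/Dialect.h"

using namespace mlir;

// Hypothetical sketch of the five gating conditions above.
static bool shouldSinkSecondLoad(scf::ForOp forOp) {
  SmallVector<triton::LoadOp> loads;
  SmallVector<triton::DotOp> dots;
  forOp.walk([&](Operation *op) {
    if (auto load = dyn_cast<triton::LoadOp>(op))
      loads.push_back(load);
    else if (auto dot = dyn_cast<triton::DotOp>(op))
      dots.push_back(dot);
  });
  // (1) pure matmul: exactly one tt.dot; (2) exactly two tt.loads.
  if (dots.size() != 1 || loads.size() != 2)
    return false;
  Operation *dot = dots.front();
  Operation *secondLoad = loads[1];
  // (3) the 2nd tt.load must sit ahead of the tt.dot.
  if (!secondLoad->isBeforeInBlock(dot))
    return false;
  // (4) every user of the 2nd tt.load (in particular the 1st one)
  //     must come after the tt.dot.
  for (Operation *user : secondLoad->getUsers())
    if (user == dot || user->isBeforeInBlock(dot))
      return false;
  // (5) tile size large enough: nonKDim >= 128 and kDim >= 64
  //     (assumption: reading M and K off the A operand's shape).
  auto aTy = cast<RankedTensorType>(dots.front().getA().getType());
  return aTy.getShape()[0] >= 128 && aTy.getShape()[1] >= 64;
}
```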

@zhanglx13 (Collaborator, Author) commented on Oct 16, 2024

I'm adding more lit tests
Done

@zhanglx13 force-pushed the robust_sched_load branch 3 times, most recently from c341a5f to 748b853 on October 29, 2024 02:13
@zhanglx13 changed the title from "Robust sched load" to "[AMD] Robust sched-2nd-load" on Oct 29, 2024
@zhanglx13 force-pushed the robust_sched_load branch 2 times, most recently from 2d6cd3d to d0485f1 on October 29, 2024 03:47
@zhanglx13 requested a review from antiagainst on October 29, 2024 22:46
@antiagainst marked this pull request as ready for review on October 29, 2024 23:00
@antiagainst requested a review from ptillet as a code owner on October 29, 2024 23:00
@antiagainst changed the title from "[AMD] Robust sched-2nd-load" to "[AMD] Reland sinking the 2nd tt.load after local_load's" on Oct 29, 2024
This helps the backend interleave global load and mfma instructions and can reduce global load issue latency.
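
Mechanically, the sinking amounts to moving the 2nd `tt.load` past the `local_load` ops that feed the dot. A minimal sketch under the same assumptions as above (illustrative names, single-block loop body, not the exact pass code):

```cpp
#include "mlir/Dialect/SCF/IR/SCF.h"
#include "triton/Dialect/TritonGPU/IR/Dialect.h"

using namespace mlir;

// Move the 2nd tt.load to just after the last ttg.local_load in the
// loop body, so the global load issues while the mfma-feeding local
// loads are in flight.
static void sinkSecondLoad(scf::ForOp forOp, Operation *secondLoad) {
  Operation *lastLocalLoad = nullptr;
  for (Operation &op : forOp.getBody()->without_terminator())
    if (isa<triton::gpu::LocalLoadOp>(op))
      lastLocalLoad = &op;
  if (lastLocalLoad && secondLoad->isBeforeInBlock(lastLocalLoad))
    secondLoad->moveAfter(lastLocalLoad);
}
```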
There are two issues with this function:
1. It returns true when there is no for loop.
2. It cannot detect a for loop when there is one.

The 2nd issue is because "getOps<OpTy>() is useful to iterate on some Operations immediately listed inside a single block (or a single region)"; therefore, moduleOp.getOps<scf::ForOp>() will always return nothing, since the for loops are nested inside function ops. Instead, we use a walker here to find scf.for ops in a nested fashion.
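
The contrast is easy to see in a fragment like the following (assuming an `mlir::ModuleOp` named `moduleOp` is in scope):

```cpp
// getOps<scf::ForOp>() only iterates ops immediately inside the
// module's block; tt.func / func.func ops live there, scf.for ops do
// not, so this range is always empty:
for (scf::ForOp forOp : moduleOp.getOps<scf::ForOp>())
  (void)forOp; // never reached when loops are nested inside functions

// walk() recurses into all nested regions, so it does find the loops:
bool hasForLoop = false;
moduleOp.walk([&](scf::ForOp) { hasForLoop = true; });
```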
This test expects scheduleGlobalLoadLocalStore to move loads earlier. However, we only apply scheduleGlobalLoadLocalStore to pureMatmulProblem cases. This test is also too long; if we decide to apply the optimization to attention-like kernels in the future, we will add a smaller test.
@antiagainst merged commit 4f6f768 into triton-lang:main on Oct 31, 2024
7 checks passed
@antiagainst deleted the robust_sched_load branch on October 31, 2024 02:05
Luosuu pushed a commit to Luosuu/triton that referenced this pull request Nov 13, 2024
guacamoleo pushed a commit to guacamoleo/triton that referenced this pull request Nov 14, 2024
jataylo pushed a commit to jataylo/triton that referenced this pull request Nov 18, 2024
jataylo pushed a commit to jataylo/triton that referenced this pull request Nov 18, 2024
antiagainst added a commit that referenced this pull request Dec 4, 2024
Cherry pick list:
- #4925
- #5053 
- #5019 
- #5002 
- #4935 - required additional cherry picks #4991 and #4951
- #4998 
- #4925 
- #5281 
- #5308 
- All previous LLVM hash PRs before #5308

---------

Co-authored-by: Ilya V <[email protected]>
Co-authored-by: Lei Zhang <[email protected]>
Co-authored-by: Lixun Zhang <[email protected]>
Co-authored-by: Keren Zhou <[email protected]>
Co-authored-by: Alexander Efimov <[email protected]>
Co-authored-by: Kyle Wang <[email protected]>
Co-authored-by: Jungwook Park <[email protected]>
Co-authored-by: peterbell10 <[email protected]>
Co-authored-by: Hongtao Yu <[email protected]>
jataylo pushed a commit to jataylo/triton that referenced this pull request Dec 12, 2024
jataylo pushed a commit to jataylo/triton that referenced this pull request Dec 13, 2024
jataylo added a commit to ROCm/triton that referenced this pull request Dec 13, 2024
* [AMD] Emit vectorized 16-bit float LLVM atomic ops (triton-lang#4925)

In the case of 16-bit float operands for tt::AtomicRMWOp, construct
only one LLVM::AtomicRMWOp but use a vector of elements. This
approach allows generating packed intrinsics and processing 2
elements at once.
Added a lit test for the f16 vectorized case.

(cherry picked from commit 78c8054)

* [AMD] Restructure ReorderInstructions pass (triton-lang#4998)

(cherry picked from commit 86a2ac7)

* [AMD] Support warp-level reduction with DPP (triton-lang#5019)

This commit adds support for warp-level reduction
with DPP instructions, which can improve performance.

See https://gpuopen.com/learn/amd-gcn-assembly-cross-lane-operations/

(cherry picked from commit 21119e3)

* [AMD] Add missing dependency to TritonAMDGPUIR (triton-lang#5053)

TritonAMDGPUTransforms now depends on it.

(cherry picked from commit 0b443ce)

* [AMD] Use DPP to accelerate 16-bit floats (triton-lang#5072)

In the case of unpaired f16 elements, utilize DPP instructions to
accelerate atomics. The algorithm for lowering
`tt::atomicRmwOp(%ptr, %val, %mask)` is:

0. Group threads by pairs; the master thread is (tid % 2 == 0).
1. All threads send `%val` to the `(tid - 1)` thread via `dppUpdateOp
shl`, so all masters receive the value from their secondary threads.
2. Take the parity in the `%mask` value into account and build
control-flow structures accordingly.
3. Generate `llvm::atomicRmwOp` in the threads enabled by the `%mask`
value.
4. All threads send the result of the generated operation to the
`(tid + 1)` thread via `dppUpdateOp shl`, so all secondary threads
also receive their result.

The DPP approach gives a ~5% perf improvement, so use it when the
target arch supports DPP.

Signed-off-by: Ilya Veselov <[email protected]>
(cherry picked from commit bab3470)

* [AMD] Reland sinking the 2nd tt.load after local_load's (triton-lang#4935)

(cherry picked from commit 4f6f768)

---------

Co-authored-by: Ilya V <[email protected]>
Co-authored-by: Lei Zhang <[email protected]>
Co-authored-by: Kyle Wang <[email protected]>
Co-authored-by: Lixun Zhang <[email protected]>
jataylo added a commit to jataylo/triton that referenced this pull request Dec 18, 2024
(cherry picked from commit 2d8093c)
bertmaher pushed a commit that referenced this pull request Dec 19, 2024