[AMD] Support Cross-Lane Reduction With DPP #5019
Conversation
Force-pushed from f3ac479 to 6fc2973.
Force-pushed from 79d5e91 to 8f0b60e.
Thanks @knwng! A bunch of comments inlined. In addition, can we also add some LLVM lit tests?
Force-pushed from 990fccd to 9bdc244.
Force-pushed from 9bdc244 to 4e7c5b1.
Looks better! A few more comments. BTW, when you address comments, can you push new commits rather than squashing into the existing one? It's easier to review that way; otherwise I need to reread the whole change again.
Definitely. I had been trying to keep it as a single commit and didn't realize that. I wasn't sure whether this project is set up to squash commits when merging PRs.
Force-pushed from 4e7c5b1 to 7ff9683.
Nice. Just one final comment.
Force-pushed from 9a4f6ae to 54d7eda.
This commit adds support for warp-level reduction with the DPP modifier:
- For numLaneToReduce == 64: reduce entirely with DPP-modified instructions.
- For numLaneToReduce < 64: pick the instruction per stride (see the sketch after this list for one exchange step):
  - stride > 16: ds_bpermute
  - stride == 16: ds_swizzle
  - stride == 8/4/2/1: DPP-modified instructions
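To illustrate the stride dispatch above, here is a hedged HIP-level sketch of a single butterfly exchange step. The helpers `lane_id` and `xor_exchange` are hypothetical names, not the C++ lowering this PR adds, and only the ds_bpermute and ds_swizzle cases are spelled out; strides of 8 and below map onto DPP controls, as in the wave-level sketch further below.

```cpp
#include <hip/hip_runtime.h>

// Current lane index within the wave (mbcnt over all lower lanes' bits).
__device__ inline int lane_id() {
  return __builtin_amdgcn_mbcnt_hi(~0u, __builtin_amdgcn_mbcnt_lo(~0u, 0u));
}

// One butterfly step: fetch the partial value held by lane (lane_id ^ stride).
__device__ float xor_exchange(float x, int stride) {
  int xi = __float_as_int(x);
  int r;
  if (stride > 16) {
    // ds_bpermute addresses lanes in bytes, hence the << 2.
    r = __builtin_amdgcn_ds_bpermute((lane_id() ^ stride) << 2, xi);
  } else {
    // stride == 16: ds_swizzle in bit mode, offset[14:10]=xor, [9:5]=or,
    // [4:0]=and; 0x401f XORs the lane id with 16 within each 32-lane group.
    r = __builtin_amdgcn_ds_swizzle(xi, 0x401f);
  }
  return __int_as_float(r);
}
```

Each reduction step would then be `v = combine(v, xor_exchange(v, stride))` with the stride halving from numLaneToReduce / 2 down to 1, switching to DPP row shifts and quad permutes once the stride fits inside a 16-lane row.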
Force-pushed from 54d7eda to e5a9bd2.
This commit adds support for warp-level reduction with DPP instructions, which can improve performance. See https://gpuopen.com/learn/amd-gcn-assembly-cross-lane-operations/ (cherry picked from commit 21119e3)
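For reference, here is a minimal HIP sketch of the full-wave (numLaneToReduce == 64) sum built only from DPP-modified adds, following the sequence described in the linked GPUOpen article. It assumes a wave64 GFX9-style target (the row_bcast controls were removed on gfx10+); `dpp_mov_f32` and `wave_reduce_sum` are illustrative names, not this PR's code.

```cpp
#include <hip/hip_runtime.h>

// DPP control encodings, as in LLVM's SIDefines.h.
#define DPP_ROW_SHR(n) (0x110 | (n)) // lane i reads lane i-n within its 16-lane row
#define DPP_ROW_BCAST15 0x142        // lane 15 of each row broadcasts into the next row
#define DPP_ROW_BCAST31 0x143        // lane 31 broadcasts into lanes 32..63

// DPP move with full row/bank masks; bound_ctrl=true makes out-of-range
// lanes read 0, which is harmless for a sum.
template <int Ctrl>
__device__ inline float dpp_mov_f32(float x) {
  return __int_as_float(__builtin_amdgcn_update_dpp(
      0, __float_as_int(x), Ctrl, 0xf, 0xf, /*bound_ctrl=*/true));
}

// Wave64 sum: the four row shifts leave each row's sum in its lane 15;
// the two broadcasts combine rows, leaving the full total in lane 63.
__device__ float wave_reduce_sum(float v) {
  v += dpp_mov_f32<DPP_ROW_SHR(1)>(v);
  v += dpp_mov_f32<DPP_ROW_SHR(2)>(v);
  v += dpp_mov_f32<DPP_ROW_SHR(4)>(v);
  v += dpp_mov_f32<DPP_ROW_SHR(8)>(v);
  v += dpp_mov_f32<DPP_ROW_BCAST15>(v);
  v += dpp_mov_f32<DPP_ROW_BCAST31>(v);
  // Broadcast lane 63's total to every lane.
  return __int_as_float(__builtin_amdgcn_readlane(__float_as_int(v), 63));
}
```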
Cherry pick list:
- #4925
- #5053
- #5019
- #5002
- #4935 (required additional cherry picks #4991 and #4951)
- #4998
- #4925
- #5281
- #5308
- All previous LLVM hash PRs before #5308

Co-authored-by: Ilya V <[email protected]>
Co-authored-by: Lei Zhang <[email protected]>
Co-authored-by: Lixun Zhang <[email protected]>
Co-authored-by: Keren Zhou <[email protected]>
Co-authored-by: Alexander Efimov <[email protected]>
Co-authored-by: Kyle Wang <[email protected]>
Co-authored-by: Jungwook Park <[email protected]>
Co-authored-by: peterbell10 <[email protected]>
Co-authored-by: Hongtao Yu <[email protected]>
Reverts #5191 due to MLIR errors in PyTorch unit tests.

Smaller set of cherry picks:
- #5308 (and previous LLVM upgrades)
- #5281
- #4925
- #5053
- #5019
- #4998

Co-authored-by: Jungwook Park <[email protected]>
Co-authored-by: peterbell10 <[email protected]>
Co-authored-by: Hongtao Yu <[email protected]>
Co-authored-by: Lei Zhang <[email protected]>
Co-authored-by: Ilya V <[email protected]>
Co-authored-by: Kyle Wang <[email protected]>
* [AMD] Emit vectorized 16-bit float LLVM atomic ops (triton-lang#4925)
  For 16-bit float operands to tt::AtomicRMWOp, construct only one LLVM::AtomicRMWOp but use a vector of elements. This approach allows generating packed intrinsics and processing two elements at once. Added a lit test for the f16 vectorized case. (cherry picked from commit 78c8054)
* [AMD] Restructure ReorderInstructions pass (triton-lang#4998) (cherry picked from commit 86a2ac7)
* [AMD] Support warp-level reduction with DPP (triton-lang#5019)
  This commit adds support for warp-level reduction with DPP instructions, which can improve performance. See https://gpuopen.com/learn/amd-gcn-assembly-cross-lane-operations/ (cherry picked from commit 21119e3)
* [AMD] Add missing dependency to TritonAMDGPUIR (triton-lang#5053)
  TritonAMDGPUTransforms now depends on it. (cherry picked from commit 0b443ce)
* [AMD] Use DPP to accelerate 16-bit floats (triton-lang#5072)
  For unpaired f16 elements, utilize DPP instructions to accelerate atomics. The algorithm for lowering tt::atomicRmwOp(%ptr, %val, %mask) is (a hedged sketch follows below):
  0. Group threads into pairs. The master thread is the one with tid % 2 == 0.
  1. All threads send %val to thread (tid - 1) via dppUpdateOp shl, so every master receives the value from its secondary thread.
  2. Take parity into account in the %mask value and build the control-flow structures accordingly.
  3. Generate llvm::atomicRmwOp in the threads enabled by %mask.
  4. All threads send the result of the generated operation to thread (tid + 1), so every secondary thread also receives its result.
  The DPP approach gives a ~5% performance improvement, so it is used when the target architecture supports DPP.
  Signed-off-by: Ilya Veselov <[email protected]> (cherry picked from commit bab3470)
* [AMD] Reland sinking the 2nd tt.load after local_load's (triton-lang#4935)
  This PR adds more restrictions on when to apply the sched-load optimization and un-reverts triton-lang#4823. The optimization is applied only when all of the following are satisfied:
  1. pureMatmulProblem, i.e. one tt.dot in the main loop
  2. two tt.loads in the main loop
  3. the 2nd tt.load is ahead of the tt.dot
  4. the 1st user of the 2nd tt.load is after the tt.dot
  5. the tile size is large enough, i.e. nonKDim >= 128 and kDim >= 64
  (cherry picked from commit 4f6f768)

Co-authored-by: Ilya V <[email protected]>
Co-authored-by: Lei Zhang <[email protected]>
Co-authored-by: Kyle Wang <[email protected]>
Co-authored-by: Lixun Zhang <[email protected]>
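To make the pairing scheme from triton-lang#5072 above concrete, here is a minimal HIP sketch under several stated assumptions: the helper names (`dpp_mov_h2`, `pair_atomic_add_f16`, `pair_ptr`) are hypothetical, both lanes of each pair are assumed active with their two f16 elements adjacent behind one aligned `__half2` pointer, and the packed `atomicAdd(__half2*, __half2)` overload is assumed available on the target (e.g. gfx90a+). The return path is written as row_shr:1, the DPP control that delivers lane i's value to lane i+1 under LLVM's naming.

```cpp
#include <hip/hip_runtime.h>
#include <hip/hip_fp16.h>

// DPP controls (encodings from LLVM's SIDefines.h). Pairs (2i, 2i+1) never
// cross a 16-lane row, so row shifts by 1 suffice to exchange within a pair.
#define DPP_ROW_SHL1 0x101 // lane i reads lane i+1: "send to tid - 1"
#define DPP_ROW_SHR1 0x111 // lane i reads lane i-1: "send to tid + 1"

template <int Ctrl>
__device__ inline __half2 dpp_mov_h2(__half2 x) {
  return __builtin_bit_cast(
      __half2, __builtin_amdgcn_update_dpp(0, __builtin_bit_cast(int, x),
                                           Ctrl, 0xf, 0xf, /*bound_ctrl=*/true));
}

// Both lanes of a pair add one f16 each to adjacent elements behind
// pair_ptr; only the even "master" lane issues the packed atomic,
// halving atomic traffic. Returns each lane's old value.
__device__ __half pair_atomic_add_f16(__half2* pair_ptr, __half val) {
  const bool master = (threadIdx.x & 1) == 0;
  // 1. Odd lanes pass val one lane down, so each master sees both halves.
  __half partner =
      __low2half(dpp_mov_h2<DPP_ROW_SHL1>(__halves2half2(val, val)));
  __half2 old = __halves2half2(__half(0.0f), __half(0.0f));
  if (master) {
    // 2-3. One packed atomic covers the whole pair.
    old = atomicAdd(pair_ptr, __halves2half2(val, partner));
  }
  // 4. Pass the old value back up one lane for the odd thread.
  __half2 from_master = dpp_mov_h2<DPP_ROW_SHR1>(old);
  return master ? __low2half(old) : __high2half(from_master);
}
```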
- I am not making a trivial change, such as fixing a typo in a comment.
- I have written a PR description following these rules.
- I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- Select one of the following:
  - I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - This PR does not need a test because `FILL THIS IN`.
- Select one of the following:
  - I have not added any `lit` tests.
  - The `lit` tests I have added follow these best practices, including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)