[AMD] Support Cross-Lane Reduction With DPP #5019

Merged
merged 4 commits into from
Nov 14, 2024

Conversation

knwng
Contributor

@knwng knwng commented Oct 30, 2024

  • I am not making a trivial change, such as fixing a typo in a comment.

  • I have written a PR description following these
    rules.

  • I have run pre-commit run --from-ref origin/main --to-ref HEAD.

  • Select one of the following.

    • I have added tests.
      • /test for lit tests
      • /unittest for C++ tests
      • /python/test for end-to-end tests
    • This PR does not need a test because FILL THIS IN.
  • Select one of the following.

    • I have not added any lit tests.
    • The lit tests I have added follow these best practices,
      including the "tests should be minimal" section. (Usually running Python code
      and using the instructions it generates is not minimal.)

@knwng knwng force-pushed the impl_reduce_w_dpp_inst branch 4 times, most recently from f3ac479 to 6fc2973 Compare November 2, 2024 00:13
@knwng knwng force-pushed the impl_reduce_w_dpp_inst branch 2 times, most recently from 79d5e91 to 8f0b60e Compare November 7, 2024 18:06
Collaborator

@antiagainst antiagainst left a comment

Thanks @knwng! A bunch of comments inlined. In addition, can we also add some to-LLVM lit tests?

Inline review threads (all marked outdated):
- third_party/amd/lib/TritonAMDGPUToLLVM/TargetInfo.cpp (9 threads)
- third_party/amd/lib/TritonAMDGPUToLLVM/Utility.cpp (1 thread)
@knwng knwng force-pushed the impl_reduce_w_dpp_inst branch 2 times, most recently from 990fccd to 9bdc244 Compare November 9, 2024 03:15
@knwng knwng requested a review from antiagainst November 9, 2024 03:17
@knwng knwng force-pushed the impl_reduce_w_dpp_inst branch from 9bdc244 to 4e7c5b1 Compare November 9, 2024 05:03
Collaborator

@antiagainst antiagainst left a comment

Looks better! A few more comments. BTW, when you address comments, can you push new commits, not squash into the existing one? It's easier to review that way--otherwise need to reread the whole change again.

Inline review threads (outdated unless noted):
- include/triton/Conversion/TritonGPUToLLVM/Utility.h
- third_party/amd/lib/TritonAMDGPUToLLVM/Utility.h
- test/Conversion/amd/tritongpu_to_llvm.mlir (not outdated)
- third_party/amd/lib/TritonAMDGPUToLLVM/TargetInfo.cpp (4 threads)
- third_party/amd/include/TritonAMDGPUToLLVM/TargetUtils.h (2 threads)
- third_party/amd/lib/TritonAMDGPUToLLVM/Utility.cpp
@knwng
Contributor Author

knwng commented Nov 9, 2024

BTW, when you address comments, can you push new commits, not squash into the existing one? It's easier to review that way--otherwise need to reread the whole change again.

Definitely. I had been trying to keep it as a single commit and didn't realize that. I wasn't sure whether this project is set to squash commits when a PR is merged.

@knwng knwng force-pushed the impl_reduce_w_dpp_inst branch from 4e7c5b1 to 7ff9683 Compare November 11, 2024 22:34
@knwng knwng requested a review from antiagainst November 11, 2024 22:34
Collaborator

@antiagainst antiagainst left a comment

Nice. Just one final comment.

Inline review thread: include/triton/Dialect/Triton/IR/TritonOps.td
@antiagainst antiagainst marked this pull request as ready for review November 12, 2024 02:19
@knwng knwng force-pushed the impl_reduce_w_dpp_inst branch from 9a4f6ae to 54d7eda Compare November 12, 2024 03:59
This commit adds support for warp-level reduction with DPP modifiers, including:
- For numLaneToReduce == 64, the reduction uses only DPP-modified instructions.
- For numLaneToReduce < 64, the reduction picks an instruction according to the stride:
  - stride > 16: ds-bpermute
  - stride == 16: ds-swizzle
  - stride == 8/4/2/1: DPP-modified instructions
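The selection rule in this commit message can be modelled in a few lines of Python. This is only a sketch for reasoning about the rule, not the C++ lowering in the PR: the function names `pick_cross_lane_mechanism` and `butterfly_reduce` are hypothetical, and the butterfly (XOR-partner) schedule is an assumption made for the model; the real full-wave DPP sequence may use a different schedule.

```python
def pick_cross_lane_mechanism(num_lanes_to_reduce, stride):
    """Return the mechanism the commit message names for one reduction step.

    Hypothetical helper: it mirrors the bullet list above, nothing more.
    """
    if num_lanes_to_reduce == 64:
        return "dpp"
    if stride > 16:
        return "ds_bpermute"
    if stride == 16:
        return "ds_swizzle"
    if stride in (8, 4, 2, 1):
        return "dpp"
    raise ValueError(f"unsupported stride: {stride}")


def butterfly_reduce(values, num_lanes_to_reduce, combine):
    """Model a butterfly reduction over groups of num_lanes_to_reduce lanes.

    At each step lane i combines its value with lane (i ^ stride), and the
    stride halves until it reaches 1. Assumed schedule, for illustration only.
    """
    vals = list(values)
    stride = num_lanes_to_reduce // 2
    while stride > 0:
        vals = [combine(vals[i], vals[i ^ stride]) for i in range(len(vals))]
        stride //= 2
    return vals
```

In this model every lane in a group ends up holding the group's reduced value, so `butterfly_reduce(list(range(64)), 64, lambda a, b: a + b)` leaves the full sum in all 64 lanes.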
@knwng knwng force-pushed the impl_reduce_w_dpp_inst branch from 54d7eda to e5a9bd2 Compare November 12, 2024 21:49
@antiagainst antiagainst merged commit 21119e3 into triton-lang:main Nov 14, 2024
7 checks passed
jataylo pushed a commit to jataylo/triton that referenced this pull request Nov 18, 2024
This commit adds support for warp-level reduction
with DPP instructions, which can improve performance.

See https://gpuopen.com/learn/amd-gcn-assembly-cross-lane-operations/

(cherry picked from commit 21119e3)
antiagainst added a commit that referenced this pull request Dec 4, 2024
Cherry pick list:
- #4925
- #5053 
- #5019 
- #5002 
- #4935 - required additional cherry picks #4991 and #4951
- #4998 
- #4925 
- #5281 
- #5308 
- All previous LLVM hash PRs before #5308

---------

Co-authored-by: Ilya V <[email protected]>
Co-authored-by: Lei Zhang <[email protected]>
Co-authored-by: Lixun Zhang <[email protected]>
Co-authored-by: Keren Zhou <[email protected]>
Co-authored-by: Alexander Efimov <[email protected]>
Co-authored-by: Kyle Wang <[email protected]>
Co-authored-by: Jungwook Park <[email protected]>
Co-authored-by: peterbell10 <[email protected]>
Co-authored-by: Hongtao Yu <[email protected]>
jataylo pushed a commit to jataylo/triton that referenced this pull request Dec 5, 2024
antiagainst added a commit that referenced this pull request Dec 5, 2024
Reverts #5191 due to some mlir errors in pytorch unit tests

Smaller set of cherry picks:
- #5308 (and previous LLVM upgrades)
- #5281 
- #4925 
- #5053 
- #5019 
- #4998

---------

Co-authored-by: Jungwook Park <[email protected]>
Co-authored-by: peterbell10 <[email protected]>
Co-authored-by: Hongtao Yu <[email protected]>
Co-authored-by: Lei Zhang <[email protected]>
Co-authored-by: Ilya V <[email protected]>
Co-authored-by: Kyle Wang <[email protected]>
jataylo pushed a commit to jataylo/triton that referenced this pull request Dec 11, 2024
jataylo pushed a commit to jataylo/triton that referenced this pull request Dec 12, 2024
jataylo pushed a commit to jataylo/triton that referenced this pull request Dec 13, 2024
jataylo added a commit to ROCm/triton that referenced this pull request Dec 13, 2024
* [AMD] Emit vectorized 16-bit float LLVM atomic ops (triton-lang#4925)

For 16-bit float operands of tt::AtomicRMWOp, construct
only one LLVM::AtomicRMWOp that operates on a vector of elements.
This approach allows packed intrinsics to be generated, processing 2
elements at once.
Added a lit test for the vectorized f16 case.

(cherry picked from commit 78c8054)

* [AMD] Restructure ReorderInstructions pass (triton-lang#4998)

(cherry picked from commit 86a2ac7)

* [AMD] Support warp-level reduction with DPP (triton-lang#5019)

This commit adds support for warp-level reduction
with DPP instructions, which can improve performance.

See https://gpuopen.com/learn/amd-gcn-assembly-cross-lane-operations/

(cherry picked from commit 21119e3)

* [AMD] Add missing dependency to TritonAMDGPUIR (triton-lang#5053)

TritonAMDGPUTransforms now depends on it.

(cherry picked from commit 0b443ce)

* [AMD] Use DPP to accelerate 16-bit floats (triton-lang#5072)

In the case of unpaired f16 elements, utilize DPP instructions to
accelerate atomics. Here is the algorithm for lowering
`tt::atomicRmwOp(%ptr, %val, %mask)`:

0. Group threads into pairs. The master thread is the one with `tid % 2 == 0`;
1. All threads send `%val` to thread `(tid - 1)` via `dppUpdateOp
shl`, so every master receives the value from its secondary thread;
2. Take the parity in the `%mask` value into account and build control-flow (CF)
structures accordingly;
3. Generate `llvm::atomicRmwOp` in the threads enabled by the `%mask` value;
4. All threads send the result of the generated operation to thread `(tid + 1)`
via `dppUpdateOp shl`, so every secondary thread also receives its
result.

The DPP approach gives a ~5% perf improvement, so use it when the
target arch supports DPP.

Signed-off-by: Ilya Veselov <[email protected]>
(cherry picked from commit bab3470)
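
The pairing scheme in steps 0 to 4 above can be pictured with a sequential Python model. This is a rough sketch under stated assumptions, not the actual lowering: `paired_atomic_add` is a hypothetical name, lane `i` is assumed to target `memory[i]`, the operation is assumed to be an add that returns the old value, and the DPP transfers are replaced by plain list accesses.

```python
def paired_atomic_add(memory, vals, mask):
    """Model lanes grouped in pairs, with the even lane acting as master.

    memory: one slot per lane (assumption: lane i targets memory[i]).
    vals:   per-lane addends.
    mask:   per-lane booleans; masked-off lanes must not touch memory.
    Returns the per-lane old values (None for masked-off lanes).
    """
    n = len(vals)
    assert n % 2 == 0, "lanes are processed in pairs"
    old = [None] * n
    for master in range(0, n, 2):
        partner = master + 1
        # Step 1: the master obtains the partner's value (stands in for the DPP move).
        pair_vals = (vals[master], vals[partner])
        # Steps 2-3: only the halves enabled by the mask are updated.
        results = [None, None]
        for k, lane in enumerate((master, partner)):
            if mask[lane]:
                results[k] = memory[lane]
                memory[lane] += pair_vals[k]
        # Step 4: the master keeps its own result and hands the partner's back.
        old[master], old[partner] = results
    return old
```

The model only captures who holds which value at each step; it says nothing about the packed instruction that is actually emitted or about the reported performance gain.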

* [AMD] Reland sinking the 2nd tt.load after local_load's (triton-lang#4935)

This PR adds more restrictions on when the sched-load
optimization is applied and un-reverts
triton-lang#4823.

The optimization is applied only when all of the following are
satisfied:
1. pureMatmulProblem, i.e. 1 `tt.dot` in the main loop
2. two `tt.load`s in the main loop
3. 2nd `tt.load` is ahead of the `tt.dot`
4. 1st user of 2nd `tt.load` is after the `tt.dot`
5. tile size is large enough, i.e. nonKDim >= 128 and kDim >= 64

(cherry picked from commit 4f6f768)
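
The five conditions listed above amount to a single conjunction. A minimal sketch follows, assuming the loop facts have already been collected; `LoopFacts` and `should_sink_second_load` are hypothetical names that do not come from the Triton code base.

```python
from dataclasses import dataclass


@dataclass
class LoopFacts:
    """Hypothetical summary of the main loop; field names are illustrative."""
    num_dots: int                      # number of tt.dot ops in the main loop
    num_loads: int                     # number of tt.load ops in the main loop
    second_load_before_dot: bool       # 2nd tt.load is ahead of the tt.dot
    first_user_after_dot: bool         # 1st user of the 2nd tt.load is after the tt.dot
    non_k_dim: int                     # tile size along the non-K dimension
    k_dim: int                         # tile size along the K dimension


def should_sink_second_load(facts):
    """Return True only when every listed condition holds."""
    return (
        facts.num_dots == 1                # 1. pure matmul problem
        and facts.num_loads == 2           # 2. two loads in the main loop
        and facts.second_load_before_dot   # 3.
        and facts.first_user_after_dot     # 4.
        and facts.non_k_dim >= 128         # 5. tile is large enough
        and facts.k_dim >= 64
    )
```

Writing the check as one predicate makes it easy to see that failing any single condition, such as a tile with `kDim` below 64, disables the optimization.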

---------

Co-authored-by: Ilya V <[email protected]>
Co-authored-by: Lei Zhang <[email protected]>
Co-authored-by: Kyle Wang <[email protected]>
Co-authored-by: Lixun Zhang <[email protected]>
jataylo added a commit to jataylo/triton that referenced this pull request Dec 18, 2024
(cherry picked from commit 2d8093c)
jataylo added a commit to jataylo/triton that referenced this pull request Dec 18, 2024
(cherry picked from commit 7e401df)
bertmaher pushed a commit that referenced this pull request Dec 19, 2024
bertmaher pushed a commit that referenced this pull request Dec 19, 2024