
[AMD] release/3.2.x AMD perf cherry picks #5191

Merged
15 commits merged into triton-lang:rc/3.2.x on Dec 4, 2024

Conversation

joviliast and others added 9 commits November 18, 2024 16:56
In the case of 16-bit float operands for `tt::AtomicRMWOp`, construct
only one `LLVM::AtomicRMWOp` but use a vector of elements. This approach
allows generating packed intrinsics and processing two elements at once.
Added a lit test for the f16 vectorized case.

(cherry picked from commit 78c8054)
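For illustration, here is a minimal Triton kernel sketch of the kind of fp16 atomic update this lowering targets (hypothetical kernel and names, assuming a standard Triton install); adjacent f16 lanes can then be packed into a single `LLVM::AtomicRMWOp`:

```python
import triton
import triton.language as tl

@triton.jit
def atomic_add_f16_kernel(out_ptr, val_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    vals = tl.load(val_ptr + offsets, mask=mask)
    # fp16 atomic adds on contiguous addresses: with the lowering above,
    # pairs of adjacent lanes can map to one packed atomic intrinsic.
    tl.atomic_add(out_ptr + offsets, vals, mask=mask)
```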
…4935)

This PR adds more restrictions on when we should apply the sched-load
optimization, and un-reverts triton-lang#4823.

We only apply the optimization when all of the following are satisfied
(see the kernel sketch below):
1. pureMatmulProblem, i.e. a single `tt.dot` in the main loop
2. two `tt.load`s in the main loop
3. the 2nd `tt.load` is ahead of the `tt.dot`
4. the 1st user of the 2nd `tt.load` is after the `tt.dot`
5. the tile size is large enough, i.e. nonKDim >= 128 and kDim >= 64

(cherry picked from commit 4f6f768)
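As a rough illustration, here is a hypothetical pure-matmul Triton kernel of the shape this heuristic targets: one `tl.dot` and two `tl.load`s in the main loop, with tile sizes satisfying nonKDim >= 128 and kDim >= 64. Whether the IR-level ordering conditions (3) and (4) hold depends on how the loop is scheduled, so this is only a sketch (boundary masking omitted):

```python
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # e.g. BLOCK_M = BLOCK_N = 128 and BLOCK_K = 64 to satisfy the tile-size check
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs)   # 1st tt.load in the main loop
        b = tl.load(b_ptrs)   # 2nd tt.load in the main loop
        acc += tl.dot(a, b)   # the single tt.dot
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc)
```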
…n-lang#4991)

Specifically, it fixes problems when `srcLayout` and `dstLayout` have
different numbers of registers but the same number of non-free registers.
We solve the problem by padding free registers onto either `srcLayout` or
`dstLayout`; this could be improved further by fixing the `invertAndCompose`
function.

(cherry picked from commit 15c5e55)
…triton-lang#4951)

This PR removes the legacy `isMmaToDotShortcut` and its associated shortcut conversion.

(cherry picked from commit 1d5fdfe)
This commit removes the special cases for MFMA -> Dot Operand
LDS shortcuts; they are now handled by the common linear layout
infrastructure.

No new tests are added; mfma-shortcut.mlir already covers this.

(cherry picked from commit 69f656c)
This commit adds support for warp-level reduction
with DPP instructions, which can improve performance.

See https://gpuopen.com/learn/amd-gcn-assembly-cross-lane-operations/

(cherry picked from commit 21119e3)
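A minimal sketch (hypothetical kernel, assuming a standard Triton install) of a reduction whose warp-level stage can be lowered to DPP instructions on supporting AMD targets:

```python
import triton
import triton.language as tl

@triton.jit
def block_sum_kernel(x_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    x = tl.load(x_ptr + offsets, mask=offsets < n_elements, other=0.0)
    # tl.sum lowers to intra-warp reductions; on AMD targets with DPP support,
    # the cross-lane exchanges can use DPP instructions instead of shared memory.
    total = tl.sum(x, axis=0)
    tl.store(out_ptr + pid, total)
```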
TritonAMDGPUTransforms now depends on it.

(cherry picked from commit 0b443ce)
In the case of unpaired f16 elements, utilize DPP instructions to
accelerate atomics. Here is the algorithm for lowering
`tt::atomicRmwOp(%ptr, %val, %mask)`:

0. Group threads into pairs; the master thread is the one with (tid % 2 == 0);
1. All threads send `%val` to the `(tid - 1)` thread via `dppUpdateOp
shl`, so all masters receive the value from their secondary threads;
2. Taking the parity of the `%mask` value into account, build control-flow
structures accordingly;
3. Generate `llvm::atomicRmwOp` in the threads enabled by the `%mask` value;
4. All threads send the result of the generated operation to the `(tid + 1)`
thread via `dppUpdateOp shl`, so all secondary threads also receive their
results.

The DPP approach gives a ~5% performance improvement, so it is used when
the target architecture supports DPP.

Signed-off-by: Ilya Veselov <[email protected]>
(cherry picked from commit bab3470)
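A tiny, hypothetical CPU-side model of the cross-lane data movement in steps 1 and 4 (the `dppUpdateOp` shifts); helper names and the lane count are illustrative, not the real codegen API:

```python
def shift_to_lower_lane(vals):
    # Models step 1: each lane's value becomes visible to lane (tid - 1),
    # so every master lane (tid % 2 == 0) also sees its secondary's value.
    return [vals[i + 1] if i + 1 < len(vals) else vals[i] for i in range(len(vals))]

def shift_to_higher_lane(vals):
    # Models step 4: results held by master lanes are passed back to lane (tid + 1).
    return [vals[i - 1] if i > 0 else vals[i] for i in range(len(vals))]

lane_vals = [1.0, 2.0, 3.0, 4.0]            # one f16 value per lane
received = shift_to_lower_lane(lane_vals)   # masters (lanes 0, 2) now see 2.0, 4.0
pairs = [(lane_vals[i], received[i]) for i in range(0, len(lane_vals), 2)]
print(pairs)                                # [(1.0, 2.0), (3.0, 4.0)]
```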
jataylo marked this pull request as draft on November 19, 2024 13:56
Enable new arch target since backend support has been added.

(cherry picked from commit ed39410)
jataylo marked this pull request as ready for review on December 4, 2024 11:11
peterbell10 and others added 5 commits December 4, 2024 11:54
triton-lang#5064)

Bumping llvm to include a loop unroller fix:
llvm/llvm-project#114573. This is needed for
subsequent loop unroller upstreaming work.

(cherry picked from commit 3c296ab)
This pulls in llvm/llvm-project@bd9145c8c213
to enable ASan on AMD backend.

(cherry picked from commit 0bd30a2)
This pulls in the AMDGPU backend support for the
gfx950 target.

We need to fix the rewrites in `Combine.td` given that
llvm/llvm-project#112700 adds
a new attribute for denorm mode for `arith.addf`.

---------

Co-authored-by: Lei Zhang <[email protected]>
(cherry picked from commit 1d5e9a2)
jataylo force-pushed the 32_amd_cherrypicks_pr branch from 324fb09 to 7c6da39 on December 4, 2024 11:55
jataylo (Contributor, Author) commented on Dec 4, 2024:

cc: @bertmaher

jataylo (Contributor, Author) commented on Dec 4, 2024:

=================================== 8995 passed, 2285 skipped, 153 warnings in 1246.46s (0:20:46) ===================================

@antiagainst, mind taking a quick look to sanity check? Hopefully @bertmaher can help us merge into rc/3.2.x.

antiagainst merged commit 2d8093c into triton-lang:rc/3.2.x on Dec 4, 2024
7 checks passed
jataylo added a commit to jataylo/triton that referenced this pull request Dec 5, 2024
antiagainst added a commit that referenced this pull request Dec 5, 2024
Reverts #5191 due to some MLIR errors in PyTorch unit tests.

Smaller set of cherry picks:
- #5308 (and previous LLVM upgrades)
- #5281 
- #4925 
- #5053 
- #5019 
- #4998

---------

Co-authored-by: Jungwook Park <[email protected]>
Co-authored-by: peterbell10 <[email protected]>
Co-authored-by: Hongtao Yu <[email protected]>
Co-authored-by: Lei Zhang <[email protected]>
Co-authored-by: Ilya V <[email protected]>
Co-authored-by: Kyle Wang <[email protected]>
jataylo added a commit to jataylo/triton that referenced this pull request Dec 11, 2024
jataylo added a commit to jataylo/triton that referenced this pull request Dec 12, 2024
antiagainst added a commit that referenced this pull request Dec 13, 2024
This PR brings in required LLVM bumps and additional targets for gfx950
support.
- #5040
- #5064
- #5180
- #5242
- #5392

Note this PR reverts the last two PRs to focus only on the LLVM upgrade:
- #5347 
- #5191

---------

Co-authored-by: peterbell10 <[email protected]>
Co-authored-by: Hongtao Yu <[email protected]>
Co-authored-by: Lei Zhang <[email protected]>
Co-authored-by: Jungwook Park <[email protected]>
jataylo added a commit to jataylo/triton that referenced this pull request Dec 18, 2024
This PR brings in required LLVM bumps and additional targets for gfx950
support.
- triton-lang#5040
- triton-lang#5064
- triton-lang#5180
- triton-lang#5242
- triton-lang#5392

Note this PR reverts the last two PRs to focus only on the LLVM upgrade:
- triton-lang#5347
- triton-lang#5191

---------

Co-authored-by: peterbell10 <[email protected]>
Co-authored-by: Hongtao Yu <[email protected]>
Co-authored-by: Lei Zhang <[email protected]>
Co-authored-by: Jungwook Park <[email protected]>
(cherry picked from commit f11c5ba)
jataylo added a commit to jataylo/triton that referenced this pull request Dec 18, 2024
Cherry pick list:
- triton-lang#4925
- triton-lang#5053
- triton-lang#5019
- triton-lang#5002
- triton-lang#4935 - required additional cherry picks triton-lang#4991 and triton-lang#4951
- triton-lang#4998
- triton-lang#4925
- triton-lang#5281
- triton-lang#5308
- All previous LLVM hash PRs before triton-lang#5308

---------

Co-authored-by: Ilya V <[email protected]>
Co-authored-by: Lei Zhang <[email protected]>
Co-authored-by: Lixun Zhang <[email protected]>
Co-authored-by: Keren Zhou <[email protected]>
Co-authored-by: Alexander Efimov <[email protected]>
Co-authored-by: Kyle Wang <[email protected]>
Co-authored-by: Jungwook Park <[email protected]>
Co-authored-by: peterbell10 <[email protected]>
Co-authored-by: Hongtao Yu <[email protected]>
(cherry picked from commit 2d8093c)
jataylo added a commit to jataylo/triton that referenced this pull request Dec 18, 2024
Reverts triton-lang#5191 due to some MLIR errors in PyTorch unit tests.

Smaller set of cherry picks:
- triton-lang#5308 (and previous LLVM upgrades)
- triton-lang#5281
- triton-lang#4925
- triton-lang#5053
- triton-lang#5019
- triton-lang#4998

---------

Co-authored-by: Jungwook Park <[email protected]>
Co-authored-by: peterbell10 <[email protected]>
Co-authored-by: Hongtao Yu <[email protected]>
Co-authored-by: Lei Zhang <[email protected]>
Co-authored-by: Ilya V <[email protected]>
Co-authored-by: Kyle Wang <[email protected]>
(cherry picked from commit 7e401df)
jataylo added a commit to jataylo/triton that referenced this pull request Dec 18, 2024
This PR brings in required LLVM bumps and additional targets for gfx950
support.
- triton-lang#5040
- triton-lang#5064
- triton-lang#5180
- triton-lang#5242
- triton-lang#5392

Note this PR reverts the last two PRs to focus only on the LLVM upgrade:
- triton-lang#5347
- triton-lang#5191

---------

Co-authored-by: peterbell10 <[email protected]>
Co-authored-by: Hongtao Yu <[email protected]>
Co-authored-by: Lei Zhang <[email protected]>
Co-authored-by: Jungwook Park <[email protected]>
(cherry picked from commit f11c5ba)
atalman pushed a commit that referenced this pull request Dec 19, 2024
This PR brings in required LLVM bumps and additional targets for gfx950
support.
- #5040
- #5064
- #5180
- #5242
- #5392

Reverts:
- #5347 
- #5191
bertmaher pushed a commit that referenced this pull request Dec 19, 2024
Cherry pick list:
- #4925
- #5053
- #5019
- #5002
- #4935 - required additional cherry picks #4991 and #4951
- #4998
- #4925
- #5281
- #5308
- All previous LLVM hash PRs before #5308

---------

Co-authored-by: Ilya V <[email protected]>
Co-authored-by: Lei Zhang <[email protected]>
Co-authored-by: Lixun Zhang <[email protected]>
Co-authored-by: Keren Zhou <[email protected]>
Co-authored-by: Alexander Efimov <[email protected]>
Co-authored-by: Kyle Wang <[email protected]>
Co-authored-by: Jungwook Park <[email protected]>
Co-authored-by: peterbell10 <[email protected]>
Co-authored-by: Hongtao Yu <[email protected]>
bertmaher pushed a commit that referenced this pull request Dec 19, 2024
Reverts #5191 due to some MLIR errors in PyTorch unit tests.

Smaller set of cherry picks:
- #5308 (and previous LLVM upgrades)
- #5281 
- #4925 
- #5053 
- #5019 
- #4998

---------

Co-authored-by: Jungwook Park <[email protected]>
Co-authored-by: peterbell10 <[email protected]>
Co-authored-by: Hongtao Yu <[email protected]>
Co-authored-by: Lei Zhang <[email protected]>
Co-authored-by: Ilya V <[email protected]>
Co-authored-by: Kyle Wang <[email protected]>
bertmaher pushed a commit that referenced this pull request Dec 19, 2024
This PR brings in required LLVM bumps and additional targets for gfx950
support.
- #5040
- #5064
- #5180
- #5242
- #5392

Note this PR reverts the last two PRs to focus only on the LLVM upgrade:
- #5347 
- #5191

---------

Co-authored-by: peterbell10 <[email protected]>
Co-authored-by: Hongtao Yu <[email protected]>
Co-authored-by: Lei Zhang <[email protected]>
Co-authored-by: Jungwook Park <[email protected]>