[AMD] Add support for buffer atomic RMW (#5549)
# Overview

This PR enables the `raw.ptr.buffer.atomic.*` RMW ops in the AMD backend. They feature similar calling conventions and semantics to the other buffer ops in the AMD backend. The new ops are gated behind the `AMDGCN_ENABLE_BUFFER_ATOMICS` environment variable, which must be used in conjunction with `AMDGCN_USE_BUFFER_OPS`. They are also gated behind the GPU being CDNA3 (MI300-series GPUs) for now, as the optimizations I added make assumptions regarding GFX942.

I originally started exploratory work on the PR to better understand the comment in `LoadStoreOpToLLVM.cpp` referring to buffer atomics as "more efficient". In short, I found that on their own they aren't necessarily more efficient, but using them in conjunction with more careful control over how cache coherence ops/memory fences are emitted can improve performance by a significant fraction.

# How

I've added a new buffer atomic RMW op in the AMDGPUOps dialect, which has its own lowering in the backend. There are a number of checks in place to ensure that the lowering is done correctly between the ConvertToBufferOps pass and the LoadStoreOpToLLVM lowering. The actual lowering is where most of the performance gains come from.
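The gating described in the Overview can be sketched as a small predicate. This is only an illustration, not the backend's actual check: `buffer_atomics_enabled` and `_flag` are hypothetical names, the set of accepted truthy strings is an assumption about how Triton parses boolean environment flags, and the exact-match test against `"gfx942"` is an assumption about how the architecture restriction is expressed.

```python
import os

# Assumption: boolean env flags are considered "on" for these values.
_TRUTHY = {"1", "true", "on"}


def _flag(env, name):
    """Return True if the (hypothetical) boolean flag `name` is set in `env`."""
    return env.get(name, "").strip().lower() in _TRUTHY


def buffer_atomics_enabled(env, arch):
    """Hypothetical helper mirroring the gating described in the PR text.

    Buffer atomics require both environment variables and a GFX942 target.
    """
    return (
        _flag(env, "AMDGCN_USE_BUFFER_OPS")
        and _flag(env, "AMDGCN_ENABLE_BUFFER_ATOMICS")
        and arch == "gfx942"
    )


if __name__ == "__main__":
    # Typical usage: set both variables before compiling any kernels.
    os.environ["AMDGCN_USE_BUFFER_OPS"] = "1"
    os.environ["AMDGCN_ENABLE_BUFFER_ATOMICS"] = "1"
    print(buffer_atomics_enabled(os.environ, "gfx942"))
```

Setting only `AMDGCN_ENABLE_BUFFER_ATOMICS` is not enough under this reading, since the PR states that it must be used together with `AMDGCN_USE_BUFFER_OPS`.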
At a high level, when non-buffer atomic RMW ops are emitted, the memory fences lower to something along the lines of:

```python
buffer_wbl2 sc1
s_waitcnt lgkmcnt(0)
atomicRMWop()
s_waitcnt vmcnt(0)
buffer_inv sc1
buffer_wbl2 sc1
s_waitcnt lgkmcnt(0)
atomicRMWop()
s_waitcnt vmcnt(0)
buffer_inv sc1
```

If my understanding of the [GFX942 memory model](https://llvm.org/docs/AMDGPUUsage.html#memory-model-gfx942) is correct, then given several assumptions regarding CDNA3, this can actually be lowered to something that resembles:

```python
buffer_wbl2 sc1
s_waitcnt lgkmcnt(0)
atomicRMWop()
s_waitcnt vmcnt(0) # AMDGCN specific cross-CU synchronization primitive
atomicRMWop()
s_waitcnt vmcnt(0)
buffer_inv sc1
```

There are comments in the code which explain why (I think) this is okay. It appears that AMD's CK library (the AMD version of CUTLASS) uses similar synchronization mechanisms, although I am probably missing some of the context here (https://github.com/ROCm/composable_kernel/blob/9e95d54cd2160dffc07c1197951a9ab1ca6c35f2/include/ck_tile/core/arch/amd_buffer_addressing.hpp#L619).

# Results and Testing

In addition to the added lit test, I ran the existing in-tree atomic RMW tests with buffer ops + buffer atomics enabled, and they appear to pass. Following this, I evaluated FP16 Split-K [gemm](https://github.com/pytorch-labs/tritonbench/blob/a2f668e38ec55978bfcf2a6a8d15294a5b9d3d36/tritonbench/operators/gemm/kernels/matmul.py#L190) with [llama shapes](https://github.com/pytorch-labs/tritonbench/blob/a2f668e38ec55978bfcf2a6a8d15294a5b9d3d36/tritonbench/utils/triton_op.py#L149) in tritonbench using an MI300X. Some minor modifications to the kernel were made to emit buffer ops (e.g., `tl.assume` calls). For testing purposes, I disabled the non-split-K configurations. I also checked the numerical accuracy with rtol=atol=1e-4 for all shapes here.
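For readers unfamiliar with why Split-K exercises atomic RMW, here is a minimal pure-Python sketch. It is not the tritonbench kernel: `split_k_matmul`, `matmul`, and `allclose` are hypothetical helpers written for illustration. Each iteration of the outer loop stands in for one program along the split-K axis, and the `+=` marks the accumulation that would be an atomic add on the GPU. The `allclose` helper assumes the usual `|x - y| <= atol + rtol * |y|` tolerance rule with rtol=atol=1e-4.

```python
def matmul(a, b):
    """Reference matmul on nested lists: (m x k) @ (k x n)."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def split_k_matmul(a, b, split_k):
    """Split the K dimension into `split_k` chunks and accumulate partials."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    chunk = (k + split_k - 1) // split_k  # ceil(k / split_k)
    for s in range(split_k):  # one "program" per K-chunk
        lo, hi = s * chunk, min((s + 1) * chunk, k)
        for i in range(m):
            for j in range(n):
                partial = sum(a[i][p] * b[p][j] for p in range(lo, hi))
                # Several programs target the same c[i][j]; on the GPU this
                # read-modify-write has to be an atomic add.
                c[i][j] += partial
    return c


def allclose(x, y, rtol=1e-4, atol=1e-4):
    """Element-wise check: |x - y| <= atol + rtol * |y|."""
    return all(abs(u - v) <= atol + rtol * abs(v)
               for row_x, row_y in zip(x, y)
               for u, v in zip(row_x, row_y))
```

Because every K-chunk writes to the same output tile, the cost of the fences around each atomic grows with the amount of contention, which is consistent with the lowering change mattering most for these kernels.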
<img width="768" alt="image" src="https://github.com/user-attachments/assets/83b40b22-675a-410f-a44d-a138d2387935" />

Each bucket in the figure above corresponds to the average TFLOPS of all shapes that share the same `M` dimension. At smaller batch sizes the performance is roughly equivalent. At BS=32, buffer atomics have ~50% greater TFLOPS. At BS=256, buffer atomics have ~3.75x the TFLOPS.

Note: the purpose of this test is to evaluate the performance of buffer atomics; split-K is not always optimal for these shapes and workloads.
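The per-bucket averaging described above can be sketched as follows. This is an illustration rather than the script used for the figure: `tflops` and `average_tflops_by_m` are hypothetical helpers, and the `2 * M * N * K` FLOP count is the conventional GEMM estimate, assumed here rather than taken from tritonbench.

```python
from collections import defaultdict


def tflops(m, n, k, seconds):
    """Conventional GEMM throughput estimate: 2*M*N*K FLOPs over the runtime."""
    return 2.0 * m * n * k / seconds / 1e12


def average_tflops_by_m(runs):
    """Group (m, n, k, seconds) tuples by M and average TFLOPS per bucket."""
    buckets = defaultdict(list)
    for m, n, k, seconds in runs:
        buckets[m].append(tflops(m, n, k, seconds))
    # Sort by M so the buckets come out in increasing batch size.
    return {m: sum(vals) / len(vals) for m, vals in sorted(buckets.items())}
```

Calling `average_tflops_by_m` once per configuration (with and without buffer atomics) on measured timings gives two dictionaries keyed by `M`, whose ratio per key corresponds to the relative speedups quoted above.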
1 parent 94f80f4, commit 6556ec6
Showing 16 changed files with 735 additions and 116 deletions.