[MLIR][NVGPU] Adding nvgpu.warpgroup.mma Op for Hopper GPUs #65440

Conversation
Typo in the title: nvgpu.wargroup.mma Op for Hopper GPUs → nvgpu.warpgroup.mma Op for Hopper GPUs

Good catch. I somehow wrote wargroup and put it everywhere :)
Looks mostly good to me.
Highlighted a few places where comments would make things easier to understand.
```mlir
%res = nvgpu.warpgroup.mma %wgmmaDescA, %wgmmaDescB, %acc:
  !nvgpu.wgmma.descriptor<tensor = memref<128x64xf16, 3>>,
  !nvgpu.wgmma.descriptor<tensor = memref<64x128xf16, 3>>,
```
Maybe call out that we only support sizes that are a multiple of 8 and are in [8, 256]. Unless we plan to lift that up?
Not sure I understood this comment.
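For illustration, a hedged sketch of the constraint under discussion, assuming it refers to the N extent (the second dimension of the B tile and the accumulator):

```mlir
// Assuming N must be a multiple of 8 and lie in [8, 256]: N = 128 below
// satisfies that, so a B-operand descriptor like this would be accepted.
!nvgpu.wgmma.descriptor<tensor = memref<64x128xf16, 3>>
// A tile with N = 100 (not a multiple of 8) would be rejected.
```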
@llvm/pr-subscribers-mlir-llvm @llvm/pr-subscribers-mlir-nvgpu

Changes

This work introduces a new operation called `warpgroup.mma` to the NVGPU dialect of MLIR. The purpose of this operation is to facilitate warpgroup-level matrix multiply and accumulate (WGMMA) operations on Hopper GPUs with the sm_90a architecture.
Patch is 36.34 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/65440.diff

7 Files Affected:
```
+def NVGPU_WarpgroupAccumulator : NVGPU_Type<"WarpgroupAccumulator", "warpgroup.accumulator", []> {
//===----------------------------------------------------------------------===//
+def NVGPU_WarpgroupMmaOp : NVGPU_Op<"warpgroup.mma"> {
#endif // NVGPU
#include "mlir/Dialect/NVGPU/IR/NVGPUEnums.h.inc"
+constexpr int kWarpSize = 32;

diff --git a/mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp b/mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp
#define DEBUG_TYPE "nvgpu-to-nvvm"
using namespace mlir;
+/// Number of bits that needs to excluded when building matrix descriptor for
+struct NVGPUWarpgroupMmaOpLowering
} // namespace
void mlir::populateNVGPUToNVVMConversionPatterns(LLVMTypeConverter &converter,

diff --git a/mlir/lib/Dialect/NVGPU/IR/NVGPUDialect.cpp b/mlir/lib/Dialect/NVGPU/IR/NVGPUDialect.cpp
@@ -151,7 +152,6 @@ static LogicalResult verifyMmaSyncOp(Operation *op,
@@ -402,6 +402,133 @@ LogicalResult GenerateGmmaDescriptorOp::verify() {
+//===----------------------------------------------------------------------===//
+LogicalResult isAllowedSizeN(int sizeN, Type typeA) {
```
This work introduces a new operation called `wargroup.mma` to the NVGPU dialect of MLIR. The purpose of this operation is to facilitate warpgroup-level matrix multiply and accumulate (WGMMA) operations on Hopper GPUs with the sm_90a architecture.

Previously, the `nvvm.wgmma.mma_async` operation was introduced to support warpgroup-level matrix operations in the NVVM dialect; multiple instances of `nvvm.wgmma.mma_async` are needed to achieve the desired shape. The new `nvgpu.wargroup.mma` operation abstracts this complexity and provides a higher-level interface for performing warpgroup-level matrix operations.

The `nvgpu.wargroup.mma` op does the following:
1) Corresponds to multiple `wgmma` instructions.
2) Iterates input matrix descriptors to achieve the desired computation shape.
3) Groups and runs `wgmma` instructions asynchronously, and eventually waits for them. This is done by `wgmma.fence.aligned`, `wgmma.commit.group.sync.aligned`, and `wgmma.wait.group.sync.aligned`.
4) Produces fragmented result matrices.

Here's an example usage of the `nvgpu.wargroup.mma` operation:

```mlir
%wgmmaResult, %wgmmaResult2 = nvgpu.wargroup.mma %descA, %descB, %acc, group = 1 {transposeB} :
  !nvgpu.wgmma.descriptor<tensor = memref<128x64xf16, 3>>,
  !nvgpu.wgmma.descriptor<tensor = memref<64x128xf16, 3>>,
  vector<128x128xf32>
  -> !nvgpu.warpgroup.result<tensor = !llvm.struct<...>>,
     !nvgpu.warpgroup.result<tensor = !llvm.struct<...>>
```

Differential Revision: https://reviews.llvm.org/D158434
This work introduces a new operation called `warpgroup.mma.store` to the NVGPU dialect of MLIR. The purpose of this operation is to facilitate storing the fragmented results of WGMMA to a given memref. An example of fragmentation is given here: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#wgmma-64n16-d

The `warpgroup.mma.store` op does the following:
1) Takes one or more fragmented result matrices.
2) Calculates the indices per thread in the warp group and stores the data into the given memref.

Here's an example usage of the `nvgpu.warpgroup.mma.store` operation:

```mlir
// Performs matmul; results are fragmented and in registers
%res1, %res2 = nvgpu.warpgroup.mma ...

// Stores the fragmented result to the given memory
nvgpu.warpgroup.mma.store [%res1, %res2], %matrixD :
  !nvgpu.warpgroup.result<tensor = !llvm.struct<...>>,
  !nvgpu.warpgroup.result<tensor = !llvm.struct<...>>
  to memref<128x128xf32, 3>
```

Depends on llvm#65440
This work introduces a new operation called `warpgroup.mma` to the NVGPU dialect of MLIR. The purpose of this operation is to facilitate warpgroup-level matrix multiply and accumulate (WGMMA) operations on Hopper GPUs with the sm_90a architecture.

Previously, the `nvvm.wgmma.mma_async` operation was introduced to support warpgroup-level matrix operations in the NVVM dialect; multiple instances of `nvvm.wgmma.mma_async` are needed to achieve the desired shape. The new `nvgpu.warpgroup.mma` operation abstracts this complexity and provides a higher-level interface for performing warpgroup-level matrix operations.

The `nvgpu.warpgroup.mma` op does the following:
1) Corresponds to multiple `wgmma` instructions.
2) Iterates input matrix descriptors to achieve the desired computation shape.
3) Groups and runs `wgmma` instructions asynchronously, and eventually waits for them. This is done by `wgmma.fence.aligned`, `wgmma.commit.group.sync.aligned`, and `wgmma.wait.group.sync.aligned`.
4) Produces fragmented result matrices.

The resulting PTX keeps the fragments in registers: (`{%f1, %f2, 62 more registers}`) -> `%acc1` and (`{%f500, %f501, 62 more registers}`) -> `%acc2`. An example usage of the `nvgpu.warpgroup.mma` operation is sketched below.
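A minimal usage sketch, adapted from the commit message above (the descriptor shapes, the `group` attribute, and the `transposeB` unit attribute follow that example; the result struct fields are elided as in the original):

```mlir
%wgmmaResult, %wgmmaResult2 = nvgpu.warpgroup.mma %descA, %descB, %acc, group = 1 {transposeB} :
  !nvgpu.wgmma.descriptor<tensor = memref<128x64xf16, 3>>,
  !nvgpu.wgmma.descriptor<tensor = memref<64x128xf16, 3>>,
  vector<128x128xf32>
  -> !nvgpu.warpgroup.result<tensor = !llvm.struct<...>>,
     !nvgpu.warpgroup.result<tensor = !llvm.struct<...>>
```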