[PTX] Support mma.sp to use Sparse Tensor Cores and refactor mma codegen #10339
Conversation
@vinx13 One thing I'm not sure about is whether we need an offset for the metadata (which stores the index information and is always 32-bit).
If the metadata is stored in a larger buffer, and only some elements of it are passed to mma.sp each time, an offset is needed. #9727 introduced some breaking changes in the semantics of …
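To make the point concrete, here is a minimal CUDA sketch of that situation; the names `metadata_buf`, `WORDS_PER_TILE`, and `tile_id` are illustrative assumptions, not the PR's actual API:

```cuda
#include <cstdint>

// Assumed layout for illustration: each sparse tile's metadata occupies a
// fixed number of 32-bit words, one word per participating thread.
constexpr int WORDS_PER_TILE = 32;

// Pick the 32-bit metadata word for this tile and lane. Without the
// tile_id * WORDS_PER_TILE offset, every mma.sp call would read the
// metadata of the first tile in the buffer.
__device__ uint32_t load_metadata(const uint32_t* metadata_buf,
                                  int tile_id, int lane_id) {
    return metadata_buf[tile_id * WORDS_PER_TILE + lane_id];
}
```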
@vinx13, the refactor of the MMA codegen is also finished. I deleted some unit tests such as … P.S. I can add these tests back if necessary.
Okay, it turns out they mean that elements within each multiplicand must have the same data type, but the two multiplicands can have different data types from each other. CUTLASS also supports u8s8s32 MMAs.
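For reference, a hedged sketch of what such a mixed-type integer MMA looks like in CUDA inline PTX, assuming sm_80 and the m16n8k32 shape; the wrapper name `mma_m16n8k32_u8s8s32` and the fragment-passing style are illustrative:

```cuda
#include <cstdint>

// u8s8s32: unsigned 8-bit A, signed 8-bit B, signed 32-bit accumulators.
// The two multiplicands carry different element types, while elements
// within each multiplicand share one type.
__device__ void mma_m16n8k32_u8s8s32(uint32_t const (&a)[4],  // A fragment, packed u8x4
                                     uint32_t const (&b)[2],  // B fragment, packed s8x4
                                     int32_t (&c)[4]) {       // C/D accumulator fragment
    asm volatile(
        "mma.sync.aligned.m16n8k32.row.col.s32.u8.s8.s32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%0,%1,%2,%3};\n"
        : "+r"(c[0]), "+r"(c[1]), "+r"(c[2]), "+r"(c[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]));
}
```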
…gen (apache#10339)

* init
* upd
* upd
* lint
* lint again
* upd
* add m16n8k32 testcase
* format
* use make_tuple instead of initializer list
* add metadata offset
* upd
* docstring and sanity
* add u8s8s32 back
* improvement
* compatible apache#9727
…y to warp memory (#10855)

We already have PTX mma and mma.sp builtin support from #9909 and #10339. However, we have not yet supported the corresponding data-movement builtins for these mma instructions, so data movement would not be as fast as with wmma. This PR brings the `ldmatrix` builtin, which is a native PTX warp-level instruction (https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-ldmatrix), and we can use it to load several (1, 2, or 4) 8x8 matrices from shared memory to warp memory.
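For context, a hedged CUDA sketch of the raw instruction itself (not the TIR builtin); the wrapper name `ldmatrix_x4` and the use of `__cvta_generic_to_shared` are illustrative choices:

```cuda
#include <cstdint>

// Load four 8x8 16-bit matrices from shared memory into four 32-bit
// registers per thread. smem_ptr must point into shared memory; each
// thread supplies a row address according to the ldmatrix addressing rules.
__device__ void ldmatrix_x4(uint32_t (&r)[4], const void* smem_ptr) {
    // Convert the generic pointer into a 32-bit shared-memory address.
    uint32_t addr = static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
    asm volatile(
        "ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%0,%1,%2,%3}, [%4];\n"
        : "=r"(r[0]), "=r"(r[1]), "=r"(r[2]), "=r"(r[3])
        : "r"(addr));
}
```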
Sparse Tensor Cores (STC) were first introduced in the Ampere architecture of NVIDIA GPUs (see the whitepaper). Up to now, developers could only activate STC by calling wrapped APIs such as cuSparseLt or by writing PTX assembly code. Following #9909, this PR supports the `ptx_mma_sp` intrinsic so as to expose the interface of STC at the TIR level.

This PR also refactors `ptx_mma.cc` to use template-based codegen.

cc @vinx13 @junrushao1994 @Hzfengsy
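To illustrate the kind of instruction the new intrinsic ultimately lowers to, here is a hedged CUDA sketch of a half-precision sparse MMA, assuming the m16n8k16 shape with f32 accumulation on sm_80; the wrapper name `mma_sp_m16n8k16_f16f32` and the fixed sparsity selector `0x0` are illustrative, and the actual CUDA is emitted by the template-based codegen in `ptx_mma.cc`:

```cuda
#include <cstdint>

// Sparse MMA: A is 2:4 structured-sparse, so only half of its values are
// stored (2 registers instead of 4 for m16n8k16 f16), and `e` carries the
// 32-bit metadata word that encodes which positions are non-zero.
__device__ void mma_sp_m16n8k16_f16f32(float (&d)[4],
                                       uint32_t const (&a)[2],  // compressed A fragment (f16x2)
                                       uint32_t const (&b)[2],  // B fragment (f16x2)
                                       float const (&c)[4],     // accumulator input
                                       uint32_t e) {            // sparsity metadata
    asm volatile(
        "mma.sp.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
        "{%0,%1,%2,%3}, {%4,%5}, {%6,%7}, {%8,%9,%10,%11}, %12, 0x0;\n"
        : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(b[0]), "r"(b[1]),
          "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]),
          "r"(e));
}
```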