
[PTX] Support mma.sp to use Sparse Tensor Cores and refactor mma codegen #10339

Merged
merged 15 commits into apache:main on Mar 8, 2022

Conversation

@yzh119 (Member) commented Feb 22, 2022

Sparse Tensor Cores (STC) were first introduced in the Ampere architecture of NVIDIA GPUs (see the whitepaper). Until now, developers could only activate STC by calling wrapper APIs such as cuSPARSELt or by writing PTX assembly code.

Following #9909, this PR adds the ptx_mma_sp intrinsic so as to expose the STC interface at the TIR level.

This PR also refactors ptx_mma.cc to use template-based codegen.

cc @vinx13 @junrushao1994 @Hzfengsy
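To make the new interface concrete, below is a minimal sketch of what a warp-level TIR kernel invoking the intrinsic might look like. The argument order of T.ptx_mma_sp shown here is an assumption modeled on the dense ptx_mma builtin from #9909, not a verbatim copy of the merged signature, and the per-lane fragment loads are elided.

```python
from tvm.script import tir as T

@T.prim_func
def mma_sp_m16n8k16_f16(a: T.handle, b: T.handle, c: T.handle, m: T.handle) -> None:
    # Illustrative buffers: A is a 16x16 tile compressed to 16x8 by 2:4 sparsity.
    A = T.match_buffer(a, [16, 8], dtype="float16")
    B = T.match_buffer(b, [16, 8], dtype="float16")
    C = T.match_buffer(c, [16, 8], dtype="float16")
    meta = T.match_buffer(m, [32], dtype="uint32")  # one 32-bit word per lane
    tx = T.env_thread("threadIdx.x")
    T.launch_thread(tx, 32)
    multi_a = T.allocate([4], "float16", scope="local")
    multi_b = T.allocate([4], "float16", scope="local")
    accum = T.allocate([4], "float16", scope="local")
    meta_local = T.allocate([1], "uint32", scope="local")
    # ... per-lane fragment loads from A, B, C and meta would go here ...
    T.evaluate(
        T.ptx_mma_sp(
            "m16n8k16", "row", "col",  # tile shape, A layout, B layout
            "fp16", "fp16", "fp16",    # A, B, C data types
            multi_a.data, 0,           # multiplicand A fragment, element offset
            multi_b.data, 0,           # multiplicand B fragment, element offset
            accum.data, 0,             # accumulator fragment, element offset
            meta_local.data, 0,        # metadata word, element offset
            0,                         # sparse selector
            False,                     # saturate
            dtype="float16",
        )
    )
```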

@yzh119 (Member, Author) commented Feb 23, 2022

@vinx13 one thing I'm not sure about is whether we need an offset for the metadata (which stores the index information and is always 32-bit).
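For context on what the metadata holds: 2:4 structured sparsity keeps two elements out of every group of four, and the metadata records which two positions were kept, two bits per kept element. A self-contained Python sketch of that packing (illustrative only, not TVM code):

```python
import numpy as np

def compress_2to4(row: np.ndarray):
    """Compress a dense row under 2:4 sparsity into kept values plus a
    32-bit metadata word holding the 2-bit index of each kept element."""
    values, meta, nbits = [], 0, 0
    for g in range(0, len(row), 4):
        group = row[g : g + 4]
        kept = sorted(np.argsort(np.abs(group))[-2:])  # keep the 2 largest of 4
        for idx in kept:
            values.append(group[idx])
            meta |= int(idx) << nbits  # 2 bits per kept element
            nbits += 2
    return np.asarray(values), np.uint32(meta)

dense = np.arange(16, dtype=np.float16)   # 16 elements -> 8 kept values,
vals, meta_word = compress_2to4(dense)    # 16 metadata bits (fits in 32)
```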

@vinx13 (Member) commented Feb 23, 2022

If the metadata is stored in a larger buffer and only some of its elements are passed to mma.sp each time, an offset is needed.

#9727 introduced some breaking changes to the semantics of T.allocate; we might want to wait until that PR is merged to prevent conflicts.
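To make the offset concrete, here is a small self-contained sketch (names hypothetical) of how an element offset would select one tile's slice out of a flat metadata buffer:

```python
import numpy as np

# Hypothetical layout: one 32-bit metadata word per lane per m16n8k16 tile.
LANES_PER_WARP = 32
WORDS_PER_TILE = LANES_PER_WARP

num_tiles = 4
metadata = np.zeros(num_tiles * WORDS_PER_TILE, dtype=np.uint32)

def meta_offset(tile_id: int, lane_id: int) -> int:
    # The element offset that would accompany the metadata pointer,
    # analogous to the offset argument discussed above.
    return tile_id * WORDS_PER_TILE + lane_id

word = metadata[meta_offset(2, 5)]  # metadata word for lane 5, tile 2
```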

@yzh119 changed the title from "[WIP][PTX] Support mma.sp to use Sparse Tensor Cores" to "[PTX] Support mma.sp to use Sparse Tensor Cores and refactor mma codegen" on Feb 25, 2022
@yzh119 (Member, Author) commented Feb 25, 2022

@vinx13, the refactor of the MMA codegen is also finished.

I deleted some unit tests such as s8u8s32, u8s8s32, s4u4s32, and u4s4s32 because they do not conform to the standard described here, which requires the multiplicands to have the same data type. I wonder whether there are cases where we want these s8u8s32 variants in quantization?

P.S. I can add these tests back if necessary.

@yzh119 (Member, Author) commented Feb 25, 2022

Okay, it turns out the requirement is that elements within each multiplicand must have the same data type, but the two multiplicands themselves can have different data types.

CUTLASS also supports u8s8s32 MMAs.
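The PTX naming makes the mixed-type case explicit: the instruction spells the D, A, B, and C types separately, so u8/s8 multiplicands are legal. A tiny illustration of assembling such a name (a string-building sketch, not the actual template codegen in ptx_mma.cc):

```python
def mma_name(shape: str, a_dtype: str, b_dtype: str, c_dtype: str,
             saturate: bool = False) -> str:
    # PTX orders the type suffixes as D.A.B.C; D and C share a type here.
    sat = ".satfinite" if saturate else ""
    return (f"mma.sync.aligned.{shape}.row.col{sat}"
            f".{c_dtype}.{a_dtype}.{b_dtype}.{c_dtype}")

print(mma_name("m16n8k32", "u8", "s8", "s32", saturate=True))
# -> mma.sync.aligned.m16n8k32.row.col.satfinite.s32.u8.s8.s32
```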

@yzh119 (Member, Author) commented Mar 7, 2022

@vinx13 I have refactored according to the changes in #9727.

@vinx13 merged commit 7688db7 into apache:main on Mar 8, 2022
ziqiangxu8457 pushed a commit to ziqiangxu8457/tvm that referenced this pull request Mar 9, 2022
[PTX] Support mma.sp to use Sparse Tensor Cores and refactor mma codegen (apache#10339)

* init
* upd
* upd
* lint
* lint again
* upd
* add m16n8k32 testcase
* format
* use make_tuple instead of initializer list
* add metadata offset
* upd
* docstring and sanity
* add u8s8s32 back
* improvement
* compatible apache#9727
junrushao pushed a commit that referenced this pull request Apr 3, 2022
…y to warp memory (#10855)

We already have PTX mma and mma.sp builtin support from #9909 and #10339. However, we have not yet supported the corresponding data-movement builtins for these mma instructions, so data movement would not be as fast as with wmma.

This PR brings the `ldmatrix` builtin, which is a native PTX warp-level instruction (https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-ldmatrix); we can use it to load several (1, 2, or 4) 8x8 matrices from shared memory to warp memory.
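As an illustration of the builtin, here is a sketch of a warp-level TIR kernel loading four 8x8 f16 tiles. The argument order of T.ptx_ldmatrix is an assumption, and the per-lane addressing is simplified:

```python
from tvm.script import tir as T

@T.prim_func
def ldmatrix_x4(a: T.handle) -> None:
    A_shared = T.match_buffer(a, [16, 16], dtype="float16", scope="shared")
    tx = T.env_thread("threadIdx.x")
    T.launch_thread(tx, 32)
    A_local = T.allocate([8], "float16", scope="local")
    # ldmatrix.x4: each of the 32 lanes supplies one 8-element row address
    # and receives fragments of four 8x8 f16 matrices in its registers.
    T.evaluate(
        T.ptx_ldmatrix(
            False,             # not the .trans variant
            4,                 # number of 8x8 matrices (x4)
            ".b16",            # element type as spelled in PTX
            A_local.data, 0,   # destination registers, element offset
            A_shared.data,     # shared-memory source
            16 * (tx % 16) + 8 * (tx // 16),  # per-lane element offset
        )
    )
```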