
[PTX] Support mma.sp to use Sparse Tensor Cores and refactor mma codegen #10339

Merged
merged 15 commits into apache:main on Mar 8, 2022

Conversation

@yzh119 (Member) commented Feb 22, 2022

Sparse Tensor Cores (STC) were first introduced in the Ampere architecture of NVIDIA GPUs (see the whitepaper). Until now, developers could only activate STC by calling wrapper APIs such as cuSPARSELt or by writing PTX assembly code.

Following #9909, this PR adds the ptx_mma_sp intrinsic so as to expose the STC interface at the TIR level.

This PR also refactors ptx_mma.cc to use template-based codegen.

cc @vinx13 @junrushao1994 @Hzfengsy
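To make the new interface concrete, below is a minimal sketch of what a warp-level TIR kernel invoking the intrinsic might look like. The argument order of T.ptx_mma_sp shown here is an assumption modeled on the dense ptx_mma builtin from #9909, not a verbatim copy of the merged signature, and the per-lane fragment loads are elided.

```python
from tvm.script import tir as T

@T.prim_func
def mma_sp_m16n8k16_f16(a: T.handle, b: T.handle, c: T.handle, m: T.handle) -> None:
    # Illustrative buffers: A is a 16x16 tile compressed to 16x8 by 2:4 sparsity.
    A = T.match_buffer(a, [16, 8], dtype="float16")
    B = T.match_buffer(b, [16, 8], dtype="float16")
    C = T.match_buffer(c, [16, 8], dtype="float16")
    meta = T.match_buffer(m, [32], dtype="uint32")  # one 32-bit word per lane
    tx = T.env_thread("threadIdx.x")
    T.launch_thread(tx, 32)
    multi_a = T.allocate([4], "float16", scope="local")
    multi_b = T.allocate([4], "float16", scope="local")
    accum = T.allocate([4], "float16", scope="local")
    meta_local = T.allocate([1], "uint32", scope="local")
    # ... per-lane fragment loads from A, B, C and meta would go here ...
    T.evaluate(
        T.ptx_mma_sp(
            "m16n8k16", "row", "col",  # tile shape, A layout, B layout
            "fp16", "fp16", "fp16",    # A, B, C data types
            multi_a.data, 0,           # multiplicand A fragment, element offset
            multi_b.data, 0,           # multiplicand B fragment, element offset
            accum.data, 0,             # accumulator fragment, element offset
            meta_local.data, 0,        # metadata word, element offset
            0,                         # sparse selector
            False,                     # saturate
            dtype="float16",
        )
    )
```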

@yzh119 (Member, Author) commented Feb 23, 2022

@vinx13 one thing I'm not sure about is whether we need an offset for the metadata (which stores the index information and is always 32-bit).
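For context on what the metadata holds: 2:4 structured sparsity keeps two elements out of every group of four, and the metadata records which two positions were kept, two bits per kept element. A self-contained Python sketch of that packing (illustrative only, not TVM code):

```python
import numpy as np

def compress_2to4(row: np.ndarray):
    """Compress a dense row under 2:4 sparsity into kept values plus a
    32-bit metadata word holding the 2-bit index of each kept element."""
    values, meta, nbits = [], 0, 0
    for g in range(0, len(row), 4):
        group = row[g : g + 4]
        kept = sorted(np.argsort(np.abs(group))[-2:])  # keep the 2 largest of 4
        for idx in kept:
            values.append(group[idx])
            meta |= int(idx) << nbits  # 2 bits per kept element
            nbits += 2
    return np.asarray(values), np.uint32(meta)

dense = np.arange(16, dtype=np.float16)   # 16 elements -> 8 kept values,
vals, meta_word = compress_2to4(dense)    # 16 metadata bits (fits in 32)
```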

@vinx13 (Member) commented Feb 23, 2022

If the metadata is stored in a larger buffer and only some of its elements are passed to mma.sp each time, an offset is needed.

#9727 introduced some breaking changes to the semantics of T.allocate; we might want to wait until that PR is merged to prevent conflicts.
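To make the offset concrete, here is a small self-contained sketch (names hypothetical) of how an element offset would select one tile's slice out of a flat metadata buffer:

```python
import numpy as np

# Hypothetical layout: one 32-bit metadata word per lane per m16n8k16 tile.
LANES_PER_WARP = 32
WORDS_PER_TILE = LANES_PER_WARP

num_tiles = 4
metadata = np.zeros(num_tiles * WORDS_PER_TILE, dtype=np.uint32)

def meta_offset(tile_id: int, lane_id: int) -> int:
    # The element offset that would accompany the metadata pointer,
    # analogous to the offset argument discussed above.
    return tile_id * WORDS_PER_TILE + lane_id

word = metadata[meta_offset(2, 5)]  # metadata word for lane 5, tile 2
```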

@yzh119 changed the title from "[WIP][PTX] Support mma.sp to use Sparse Tensor Cores" to "[PTX] Support mma.sp to use Sparse Tensor Cores and refactor mma codegen" on Feb 25, 2022
@yzh119 (Member, Author) commented Feb 25, 2022

@vinx13, the refactor of the MMA codegen is also finished.

I deleted some unit tests such as s8u8s32, u8s8s32, s4u4s32, and u4s4s32 because they do not conform to the standard described here, which requires the multiplicands to have the same data type. I wonder whether there are cases where we want these s8u8s32 variants in quantization?

P.S. I can add these tests back if necessary.

@yzh119 (Member, Author) commented Feb 25, 2022

Okay, it turns out the requirement is that elements within each multiplicand must have the same data type, but the two multiplicands themselves can have different data types.

CUTLASS also supports u8s8s32 MMAs.
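The PTX naming makes the mixed-type case explicit: the instruction spells the D, A, B, and C types separately, so u8/s8 multiplicands are legal. A tiny illustration of assembling such a name (a string-building sketch, not the actual template codegen in ptx_mma.cc):

```python
def mma_name(shape: str, a_dtype: str, b_dtype: str, c_dtype: str,
             saturate: bool = False) -> str:
    # PTX orders the type suffixes as D.A.B.C; D and C share a type here.
    sat = ".satfinite" if saturate else ""
    return (f"mma.sync.aligned.{shape}.row.col{sat}"
            f".{c_dtype}.{a_dtype}.{b_dtype}.{c_dtype}")

print(mma_name("m16n8k32", "u8", "s8", "s32", saturate=True))
# -> mma.sync.aligned.m16n8k32.row.col.satfinite.s32.u8.s8.s32
```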

@yzh119 (Member, Author) commented Mar 7, 2022

@vinx13 I have refactored according to the changes in #9727.

@vinx13 merged commit 7688db7 into apache:main on Mar 8, 2022
ziqiangxu8457 pushed a commit to ziqiangxu8457/tvm that referenced this pull request Mar 9, 2022
[PTX] Support mma.sp to use Sparse Tensor Cores and refactor mma codegen (apache#10339)

* init
* upd
* upd
* lint
* lint again
* upd
* add m16n8k32 testcase
* format
* use make_tuple instead of initializer list
* add metadata offset
* upd
* docstring and sanity
* add u8s8s32 back
* improvement
* compatible apache#9727
junrushao pushed a commit that referenced this pull request Apr 3, 2022
…y to warp memory (#10855)

We already have PTX mma and mma.sp builtin support from #9909 and #10339. However, we have not yet supported the corresponding data-movement builtins for these mma instructions, so data movement would not be as fast as with wmma.

This PR brings the `ldmatrix` builtin, which is a native PTX warp-level instruction (https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-ldmatrix); we can use it to load several (1, 2, or 4) 8x8 matrices from shared memory to warp memory.
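As an illustration of the builtin, here is a sketch of a warp-level TIR kernel loading four 8x8 f16 tiles. The argument order of T.ptx_ldmatrix is an assumption, and the per-lane addressing is simplified:

```python
from tvm.script import tir as T

@T.prim_func
def ldmatrix_x4(a: T.handle) -> None:
    A_shared = T.match_buffer(a, [16, 16], dtype="float16", scope="shared")
    tx = T.env_thread("threadIdx.x")
    T.launch_thread(tx, 32)
    A_local = T.allocate([8], "float16", scope="local")
    # ldmatrix.x4: each of the 32 lanes supplies one 8-element row address
    # and receives fragments of four 8x8 f16 matrices in its registers.
    T.evaluate(
        T.ptx_ldmatrix(
            False,             # not the .trans variant
            4,                 # number of 8x8 matrices (x4)
            ".b16",            # element type as spelled in PTX
            A_local.data, 0,   # destination registers, element offset
            A_shared.data,     # shared-memory source
            16 * (tx % 16) + 8 * (tx // 16),  # per-lane element offset
        )
    )
```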