# CUTLASS 2.11

## 2.11.0 (2022-11-19)
- Stream-K, which is a new general way to do split-K. It can not only improve performance but can also significantly reduce the number of tile sizes that need to be profiled to find the best one.
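  The scheduling idea can be illustrated on the host: classic split-K gives every output tile a fixed number of K slices, whereas stream-K divides the flattened (output tile x K iteration) work evenly across however many workers are available, with partially covered tiles finished by a fix-up/reduction step. The sketch below is purely illustrative and is not the CUTLASS API; `tiles`, `k_iters`, and `num_workers` are made-up parameters.

  ```c++
  #include <cstdio>

  // Illustrative only: evenly partition the flattened (tile, k-iteration) work
  // across workers, the way stream-K balances MAC-loop iterations across CTAs.
  int main() {
    const int tiles       = 10;  // number of output tiles (hypothetical)
    const int k_iters     = 8;   // MAC-loop iterations per tile (hypothetical)
    const int num_workers = 4;   // e.g. number of CTAs / SMs (hypothetical)

    const int total = tiles * k_iters;
    for (int w = 0; w < num_workers; ++w) {
      // Contiguous range of flattened iterations owned by this worker.
      int begin = total * w / num_workers;
      int end   = total * (w + 1) / num_workers;
      printf("worker %d: tile %d iter %d .. tile %d iter %d\n",
             w, begin / k_iters, begin % k_iters,
             (end - 1) / k_iters, (end - 1) % k_iters);
      // Tiles whose K iterations are split across workers are completed by a
      // separate reduction/fix-up step, just as in split-K.
    }
    return 0;
  }
  ```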
- Fused multi-head attention kernel. It has two variants: one uses batched GEMM for fixed sequence lengths, and the other uses grouped GEMM for variable sequence lengths. Both variants need only one kernel.
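  The distinction between the two variants comes down to how the per-sequence GEMM problems are described: with a fixed sequence length every batch shares one problem shape (batched GEMM), while variable sequence lengths need one shape per problem (grouped GEMM). The sketch below only illustrates that difference; the `GemmShape` struct and the sizes are hypothetical, not the kernel's interface.

  ```c++
  #include <cstdio>
  #include <vector>

  // Hypothetical problem-shape descriptor used only for illustration.
  struct GemmShape { int m, n, k; };

  int main() {
    const int head_dim = 64;

    // Batched-GEMM style: fixed sequence length, one shape shared by all batches.
    GemmShape fixed{512, 512, head_dim};    // e.g. Q*K^T for seq_len = 512
    int batch_count = 16;
    printf("batched: %d problems of (%d, %d, %d)\n",
           batch_count, fixed.m, fixed.n, fixed.k);

    // Grouped-GEMM style: variable sequence lengths, one shape per problem.
    std::vector<int> seq_lens = {128, 384, 512, 77};
    for (int s : seq_lens) {
      GemmShape p{s, s, head_dim};          // Q*K^T for one variable-length sequence
      printf("grouped: (%d, %d, %d)\n", p.m, p.n, p.k);
    }
    return 0;
  }
  ```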
- Dual GEMM, which can fuse A x B and A x C into one kernel. The two GEMMs have no producer-consumer dependency.
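  Functionally, the fused kernel produces both products in a single pass: A x B and A x C share the A operand, so A only has to be read once. The reference loop below is a sketch of what is computed, with made-up row-major layouts; it is not the kernel itself.

  ```c++
  #include <vector>

  // Reference for the dual GEMM: D0 = A*B and D1 = A*C share the A operand.
  // All matrices are row-major; D0 and D1 must be sized M*N by the caller.
  void dual_gemm_ref(int M, int N, int K,
                     const std::vector<float>& A,   // M x K
                     const std::vector<float>& B,   // K x N
                     const std::vector<float>& C,   // K x N
                     std::vector<float>& D0,        // M x N
                     std::vector<float>& D1) {      // M x N
    for (int m = 0; m < M; ++m) {
      for (int n = 0; n < N; ++n) {
        float acc0 = 0.f, acc1 = 0.f;
        for (int k = 0; k < K; ++k) {
          float a = A[m * K + k];      // A element is loaded once for both products
          acc0 += a * B[k * N + n];
          acc1 += a * C[k * N + n];
        }
        D0[m * N + n] = acc0;
        D1[m * N + n] = acc1;
      }
    }
  }
  ```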
- Hopper improves double-precision matrix multiplication by 2x compared to Ampere at iso-clocks. It is supported since CUDA 11.8.
- BLAS3 functions with Hopper's new double-precision matrix multiplication instructions.
- ELL Block Sparse GEMM, which uses an ELL matrix to describe the sparsity of the A matrix. The B and output matrices are still dense. The block size can be arbitrary.
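  A blocked ELL description stores a fixed number of dense blocks per block row of A, plus an index array giving each block's block-column; the reference loop below sketches a GEMM over such a layout. The layout details and names here are illustrative assumptions, not the CUTLASS API.

  ```c++
  #include <vector>

  // Illustrative SpMM reference with A in a blocked-ELL layout: every block row
  // stores `blocks_per_row` dense bs x bs blocks; ell_cols holds each block's
  // block-column index, with a negative index marking an empty (padding) slot.
  // D must be zero-initialized by the caller; B and D are dense and row-major.
  void ell_block_sparse_gemm_ref(
      int block_rows, int blocks_per_row, int bs, int N,
      const std::vector<float>& a_values,  // block_rows * blocks_per_row * bs * bs
      const std::vector<int>&   ell_cols,  // block_rows * blocks_per_row
      const std::vector<float>& B,         // K x N, K = (#block columns) * bs
      std::vector<float>&       D) {       // (block_rows * bs) x N
    for (int br = 0; br < block_rows; ++br) {
      for (int slot = 0; slot < blocks_per_row; ++slot) {
        int bc = ell_cols[br * blocks_per_row + slot];
        if (bc < 0) continue;  // empty slot
        const float* blk = &a_values[(br * blocks_per_row + slot) * bs * bs];
        for (int i = 0; i < bs; ++i)
          for (int n = 0; n < N; ++n) {
            float acc = 0.f;
            for (int j = 0; j < bs; ++j)
              acc += blk[i * bs + j] * B[(bc * bs + j) * N + n];
            D[(br * bs + i) * N + n] += acc;
          }
      }
    }
  }
  ```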
- Optimized Group Conv for SingleGroup mode, which requires that the number of output channels per group is a multiple of the threadblock tile N.
- Optimized DepthWise Conv. Two new modes are added:
  - kOptimized - uses direct convolution to compute instead of implicit GEMM.
    - The restrictions are: 1) the input and output channel counts and the group count must be multiples of (128 / sizeof(input element)); see the sketch below for a concrete reading of this alignment requirement. 2) The input filter size must match the template parameter configuration.
  - kFixedStrideDilation - puts stride and dilation into template parameters to further improve performance. In this mode, the kernel keeps some inputs persistent in registers to squeeze out more performance, so large filter/stride/dilation values are not recommended.
    - The restrictions are: 1) the input and output channel counts and the group count must be multiples of (128 / sizeof(input element)). 2) The input filter size, stride, and dilation must match the template parameter configuration.
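  As a concrete reading of the shared channel restriction, 128 / sizeof(element) is 64 for half precision and 32 for single precision. The helper below is hypothetical and only illustrates the arithmetic; it is not part of the library.

  ```c++
  #include <cstdio>
  #include <cuda_fp16.h>

  // Hypothetical check for the channel-alignment restriction above:
  // channels and groups must be multiples of 128 / sizeof(input element).
  template <typename Element>
  bool channels_aligned(int input_c, int output_c, int groups) {
    const int align = 128 / int(sizeof(Element));  // 64 for __half, 32 for float
    return input_c % align == 0 && output_c % align == 0 && groups % align == 0;
  }

  int main() {
    printf("half  C=K=G=64: %s\n", channels_aligned<__half>(64, 64, 64) ? "ok" : "not ok");
    printf("float C=K=G=64: %s\n", channels_aligned<float>(64, 64, 64) ? "ok" : "not ok");
    printf("float C=K=G=48: %s\n", channels_aligned<float>(48, 48, 48) ? "ok" : "not ok");
    return 0;
  }
  ```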
- Scripts to fuse multiple back-to-back GEMMs. The implementation was discussed in a GTC'22 Spring talk.
- Updates and bugfixes from the community (thanks!). Big shout out to Meta's xFormers.
- Deprecation announcement: CUTLASS plans to deprecate the following:
  - Maxwell and Pascal GPU architectures
  - Ubuntu 16.04
  - CUDA 10.2