
CUTLASS 2.11

@kerrmudgeon kerrmudgeon released this 20 Jan 21:35
66d9cdd

2.11.0 (2022-11-19)

  • Stream-K, a new general way to do split-K. It can improve performance and also significantly reduces the number of tile sizes that need to be profiled to find the best one.

  • Fused multi-head attention kernel. It has two variants: one uses batched GEMM for fixed sequence lengths, and the other uses grouped GEMM for variable sequence lengths. Both variants need only one kernel.

  • Dual GEMM, which can fuse A x B and A x C into one kernel. The two GEMMs have no producer-consumer dependency.

  • Hopper improves double-precision matrix multiplication by 2x compared to Ampere at iso-clocks. It has been supported since CUDA 11.8.

  • BLAS3 functions with Hopper's new double-precision matrix multiplication instructions.

  • ELL Block Sparse GEMM, which uses an ELL matrix to describe the sparsity of the A matrix. The B and output matrices are still dense. The block size can be arbitrary.

  • Optimized Group Conv for SingleGroup mode, which requires that the number of output channels per group be a multiple of Threadblock tile N.

  • Optimized DepthWise Conv. Two new modes are added:

    • kOptimized - uses direct convolution instead of implicit GEMM.
      • The restrictions are: 1) The input channel count, output channel count, and group number should each be a multiple of (128 / sizeof(input element)). 2) The input filter size should be the same as the template parameter configuration.
    • kFixedStrideDilation - puts stride and dilation into template parameters to further improve performance. In this mode, the kernel keeps some inputs persistent in registers to squeeze out more performance, so large filter, stride, or dilation values are not recommended.
      • The restrictions are: 1) The input channel count, output channel count, and group number should each be a multiple of (128 / sizeof(input element)). 2) The input filter size, stride, and dilation should be the same as the template parameter configuration.
  • Scripts to fuse multiple back-to-back GEMMs. The implementation was discussed in a GTC'22 Spring talk.

  • FP8 data type definition and conversion routines.

  • Updates and bugfixes from the community (thanks!). Big shout out to Meta's xFormers.

  • Deprecation announcement: CUTLASS plans to deprecate the following:

    • Maxwell and Pascal GPU architectures
    • Ubuntu 16.04
    • CUDA 10.2
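
To illustrate the Stream-K item above, here is a minimal Python sketch of the basic idea: instead of assigning whole output tiles to workers, the flattened space of per-tile K iterations is divided into near-equal contiguous ranges. This is not CUTLASS code. The function name `stream_k_partition` and its parameters are hypothetical, chosen only for illustration, and the real kernels are considerably more involved.

```python
from math import ceil


def stream_k_partition(m, n, k, tile_m, tile_n, tile_k, num_workers):
    """Split all K-loop iterations of a tiled GEMM evenly across workers.

    Hypothetical helper for illustration only (not a CUTLASS API).
    Returns, for each worker, a list of (tile_index, k_begin, k_end)
    segments, where k_begin/k_end are iteration offsets within that tile.
    """
    tiles = ceil(m / tile_m) * ceil(n / tile_n)
    iters_per_tile = ceil(k / tile_k)
    total_iters = tiles * iters_per_tile

    schedule = []
    for w in range(num_workers):
        # Each worker gets a contiguous, near-equal share of all iterations.
        begin = w * total_iters // num_workers
        end = (w + 1) * total_iters // num_workers

        segments = []
        it = begin
        while it < end:
            tile = it // iters_per_tile
            tile_first = tile * iters_per_tile
            # Stop at the tile boundary or at the end of this worker's range.
            seg_end = min(end, tile_first + iters_per_tile)
            segments.append((tile, it - tile_first, seg_end - tile_first))
            it = seg_end
        schedule.append(segments)
    return schedule


if __name__ == "__main__":
    # 384x384x128 problem, 128x128x32 tiles: 9 tiles x 4 iterations each.
    for worker, segments in enumerate(
        stream_k_partition(384, 384, 128, 128, 128, 32, num_workers=4)
    ):
        print(worker, segments)
```

In this example 9 tiles are shared by 4 workers. A tile-per-worker schedule would need three rounds with the last round mostly idle, whereas here every worker receives exactly 9 of the 36 iterations. Tiles that are split between workers (such as tile 2) hold partial sums that must be combined afterwards, which is the extra cost this scheme pays for the balanced load.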