
Data-tiling and Ukernels for i16xi4->i32 group-quantized matmuls. #15158

Closed
bjacob opened this issue Oct 11, 2023 · 3 comments


bjacob commented Oct 11, 2023

Continuing from existing issues about this kind of workload (e.g. #14951). The workload to be optimized is as given in @Max191's gist here: https://gist.github.com/Max191/8ae398f4697edd31bd07a613b46dcbbd

It can be described as:

  • A first linalg.generic effectively amounts to some transposes plus a linalg.batch_matmul, with element types i16xi4->i32.
  • A second linalg.generic takes the output of the first, dequantizes it arithmetically to f32, and add-reduces over what was the batch dimension in the first generic.
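For concreteness, here is a minimal NumPy sketch of that two-stage workload. The shapes, scale/zero-point scheme, and variable names are illustrative assumptions, not taken from the gist; the exact dequantization arithmetic there may differ.

```python
import numpy as np

# Hypothetical shapes: G quantization groups (the batch dim), K elements per
# group, N output columns. These sizes are made up for illustration.
G, K, N = 4, 8, 3
rng = np.random.default_rng(0)

lhs = rng.integers(-100, 100, size=(G, K), dtype=np.int16)   # i16 activations
rhs = rng.integers(0, 16, size=(G, K, N), dtype=np.int8)     # i4 weights, stored widened
scales = rng.random((G, N)).astype(np.float32)               # per-group dequant scales
zero_points = rng.integers(0, 16, size=(G, N)).astype(np.int32)

# First generic: integer batch vecmat, i16 x i4 -> i32, one contraction per group.
acc = np.einsum('gk,gkn->gn', lhs.astype(np.int32), rhs.astype(np.int32))

# Second generic: dequantize to f32 (with a standard zero-point correction,
# assumed here) and add-reduce over the group (batch) dimension.
lhs_sums = lhs.astype(np.int32).sum(axis=1)
dequant = scales * (acc - zero_points * lhs_sums[:, None]).astype(np.float32)
result = dequant.sum(axis=0)                                 # shape (N,)
```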

@Max191 has pushed the performance of this so far by improving default non-data-tiled non-ukernel codegen as much as possible, adding VectorContractCustomKernels as needed.

This issue is about doing a V2 of that work, switching to data-tiling and ukernels.

The work could break down into these pieces:

```mermaid
graph TD
a1["Update EncodingAttr to have a type tuple
instead of encoding that in enums (#15182)"]

a3_0["@NatashaKnk's ExpandVectors pattern (#15273) expanding matvec ops to matmul"]

a3_1["Add missing `batch_vecmat` linalg named op"]

a3_2["Update ExpandVectors to support `batch_vecmat`"]

a3_3["Lift that first `linalg.generic` to `arith.ext*`,
`linalg.batch_matmul` and transposes (#15339)"]

a4_0["SetEncoding for `arith.ext + batch_matmul`"]

a4_0_1["MaterializeEncoding for `arith.ext + batch_matmul`"]

a4_1["Verify that `batch_mmt4d` lowers to a loop of `mmt4d`"]

a4_2["LowerToUkernel for `arith.ext + mmt4d`"]

a5_0["ukernels: stop abusing signless as signed"]

a5_1["ukernels: sub-8bit support, and
generic mmt4d support for s16u4 and s16s16"]

a5_2["ukernels: optimized s16u4 and s16s16
tile functions for x86 and arm"]

a6_0["MaterializeEncoding refactor to
enable custom vecmat tiles
instead of just truncating
a generic matmul tile"]

a6["ukernels: optimized `s16i4`
tile function for vecmat with
custom vecmat-tuned tile"]

a7["Add data-tiling tile-selection logic for `i16xi4`"]

a8["Ensure that fusions work as intended"]

a8_1["Ensure that const-eval works as intended"]

a9["Benchmark and study e2e"]

a1-->a7-->a9
a3_0-->a3_2-->a3_3-->a4_0-->a4_0_1-->a4_1-->a4_2-->a9
a4_0_1-->a8-->a9
a4_0_1-->a8_1-->a9
a3_1-->a3_2
a5_0-->a5_1-->a5_2-->a6-->a7
a6_0-->a6

subgraph Legend
  notstartedLegend["Not Started"]
  inprogressLegend["In Progress"]
  doneLegend["Done"]
end

classDef notstarted fill:#ddd,text:#000
classDef inprogress fill:#ff6,text:#000
classDef done fill:#9f9,text:#000

class notstartedLegend notstarted
class inprogressLegend,a5_2,a8,a8_1,a9 inprogress
class doneLegend,a1,a3_0,a3_1,a3_2,a3_3,a5_0,a5_1,a4_0,a6_0,a4_2,a4_0_1,a4_1,a6,a7 done

style Legend fill:#fff, stroke:#000
```
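The data-tiling steps above revolve around packing matrices into the tiled layouts that mmt4d-style ukernels consume. A minimal NumPy sketch of that packing, with illustrative tile sizes (not the ones IREE's tile-selection logic would actually pick):

```python
import numpy as np

# Pack a row-major M x K matrix into a "4D" tiled layout:
# (M/M0) x (K/K0) outer tiles, each an M0 x K0 inner tile.
# Tile sizes here are illustrative only.
M, K, M0, K0 = 8, 16, 4, 8
a = np.arange(M * K, dtype=np.int16).reshape(M, K)

packed = a.reshape(M // M0, M0, K // K0, K0).transpose(0, 2, 1, 3)

# Each inner M0 x K0 tile is now contiguous in memory, so a ukernel's
# inner loop can stream it without strided access.
```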
@bjacob bjacob self-assigned this Oct 11, 2023
bjacob added a commit that referenced this issue Oct 16, 2023
This has been discussed for a while: ever since #14336 made encodings a
data structure, it was an odd remnant that we were still encoding the
element types tuple in the user enum. This was cumbersome, and
resurfaced in every design discussion as it looked like something that
wasn't scaling with new data types. Concretely, this is good to fix
ahead of adding `i16xi16` and `i16xi4` data-tiling support (#15158).

hanhanW commented Oct 19, 2023

wow, I just wanna say that mermaid (i.e., the diagram) looks really cool! This is new to me.


bjacob commented Oct 19, 2023

bjacob added a commit to llvm/llvm-project that referenced this issue Oct 25, 2023
Linalg currently has these named ops:
* `matmul`
* `matvec`
* `vecmat`
* `batch_matmul`
* `batch_matvec`

But it does not have:
* `batch_vecmat`

This PR adds it for consistency, and I have a short-term need for it
( iree-org/iree#15158 ); not having it
would cause some contortion on my end.
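The semantics of the new `batch_vecmat` op can be sketched in NumPy (shapes here are illustrative): the LHS carries one vector per batch, the RHS one matrix per batch.

```python
import numpy as np

# batch_vecmat semantics sketch:
#   lhs: B x K   (one vector per batch)
#   rhs: B x K x N
#   out: B x N
B, K, N = 2, 5, 3
lhs = np.arange(B * K, dtype=np.int32).reshape(B, K)
rhs = np.ones((B, K, N), dtype=np.int32)

out = np.einsum('bk,bkn->bn', lhs, rhs)
```

This mirrors `batch_matvec` with the vector operand on the other side, which is why its absence was an odd gap in the named-op set.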
bjacob added a commit that referenced this issue Oct 30, 2023
In MLIR we are chronically abusing signless types as signed. Ongoing
discussion at #15241, etc. At least in ukernels we are insulated enough
from the compiler code that we don't have to replicate the same
confusion. So, just because `linalg` named matmul ops want `int8 x int8
-> int32` matmuls to be represented as `i8 x i8 -> i32` even though
these are signless types and this is silently interpreting these as
signed, we didn't have to let that lower to a `_i8i8i32` ukernel, we
could instead more properly represent that as a `_s8s8s32` ukernel.

Fixing this now paves the way for unsigned LHS/RHS support in ukernels,
a step in #15158, without adding more tech debt.

The bulk of this PR is just a regex substitution:

```
find runtime/src/ compiler/src/ -type f | xargs sed -i 's/IREE_UK_FLAG_MMT4D_TYPE_I8I8I32/IREE_UK_FLAG_MMT4D_TYPE_S8S8S32/g;s/iree_uk_mmt4d_type_i8i8i32/iree_uk_mmt4d_type_s8s8s32/g;s/iree_uk_mmt4d_tile_i8i8i32/iree_uk_mmt4d_tile_s8s8s32/g;s/iree_mmt4d_reference_innerloop_i8i8i32/iree_mmt4d_reference_innerloop_s8s8s32/g'
```
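A small NumPy illustration of why the signless-vs-signed distinction matters: the same 8-bit pattern yields different products depending on whether a ukernel interprets the operand as `s8` or `u8`. (The values here are arbitrary.)

```python
import numpy as np

# One byte, bit pattern 1111_0100, viewed both ways.
raw = np.array([0xF4], dtype=np.uint8)
as_s8 = raw.view(np.int8)[0]   # signed interpretation: -12
as_u8 = raw[0]                 # unsigned interpretation: 244

# A hypothetical single multiply-accumulate step in each flavor of ukernel:
prod_signed = int(as_s8) * 3   # what an _s8... ukernel would compute
prod_unsigned = int(as_u8) * 3 # what a _u8... ukernel would compute
```

An `i8` type by itself does not say which of these is meant, which is why the ukernel naming now spells out `s8`/`u8` explicitly.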
bjacob added a commit that referenced this issue Nov 1, 2023
…> s32` (#15343)

`s16 * u4 -> s32` is the main goal in #15158, and `s16 * s16 -> s32` is a
less-optimized but generically useful path for non-8-bit quantized matmuls
that don't have super specialized code.
@Max191 Max191 self-assigned this Nov 3, 2023
bjacob added a commit that referenced this issue Nov 8, 2023
…rializeEncoding for narrow shapes. (#15431)

This changes how we approach narrow matmul tile size selection (in
particular, vecmat/matvec), from "we don't really care that much, so
let's derive narrow tiles from general ones by just truncation", to "we
actually care, at least in specific cases, about freely controlling
narrow matmul tiles independently of the general wide matmul case."

There are 2 immediate needs for this: @dcaballe was doing something
comparable in #15421 to generally unlock better AVX-512 codegen for
`f32` `vecmat`, and I have a specific need for this in #15158 for some
`s16 x u4` quantized `vecmat` case.

The solution proposed here is more general than the one in #15241 in
that it is not only about `vecmat` and `matvec`, it supports any
narrow-M / narrow-N case. Like #15241, it does so by extending
`EncodingAttr` in some way. Unlike #15241, it does so by adding optional
narrow-M / narrow-N integer attributes, instead of extending the `user`
enum.

Along the way, this rationalizes MaterializeEncoding code around
tile-size selection. Narrow tile sizes are now explicitly enumerated,
and the enumeration of tile sizes is now clearly decoupled from the
choosing among the enumerated tile sizes.

Another change made along the way: this changes the tile shape
convention around here from MxKxN to MxNxK, bringing this in line with
the convention in use in ukernels. The motivation for this convention is
that the MxN part here is particularly important as the output tile
shape, so it helps that the MxNxK convention has that as a contiguous
subset.

To avoid useless redundancy as the narrow-N case is almost 100%
symmetrical to the narrow-M case, the enumeration only goes over
narrow-M cases, and the handling of narrow-N is deferred to the choose
function, transposing the problem to derive narrow-N tiles from narrow-M
tiles. For `vecmat`/`matvec`, the symmetry is perfect: the
accumulator tile is 1D in these cases, so there is no difference at all.
For other non-vector narrow cases, there could conceivably be a
difference someday motivating decoupling narrow-N from narrow-M, but
this is sufficiently far-fetched that it's best to leave it to be dealt
with when a concrete use case arises, and enjoy the benefit of smaller
code until then.
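The enumerate-then-choose structure described in that commit message can be sketched as follows. The tile sizes and the power-of-two enumeration are made-up placeholders (IREE's real logic is target-dependent); what the sketch shows is the MxNxK convention and the transposition trick that derives narrow-N tiles from the narrow-M enumeration.

```python
# Tiles are (M0, N0, K0), per the MxNxK convention described above.
WIDE = (16, 16, 1)  # hypothetical general wide-matmul tile
# Explicitly enumerated narrow-M tiles (hypothetical sizes).
NARROW_M = {1: (1, 16, 1), 2: (2, 16, 1), 4: (4, 16, 1), 8: (8, 16, 1)}

def choose_tile(m, n):
    """Pick a tile for an m x n output; K is left to the wide tile."""
    if m < WIDE[0]:
        # Narrow-M: choose the smallest enumerated tile covering m.
        for m0, tile in sorted(NARROW_M.items()):
            if m <= m0:
                return tile
    if n < WIDE[1]:
        # Narrow-N: transpose to the narrow-M problem, then swap M0/N0 back.
        m0, n0, k0 = choose_tile(n, m)
        return (n0, m0, k0)
    return WIDE
```

For example, a `vecmat` (m = 1) gets the narrowest M tile, while a `matvec` (n = 1) gets its transpose, without ever enumerating narrow-N tiles separately.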
bjacob added a commit that referenced this issue Nov 10, 2023
As noted in #15525, it performs about a third faster, and our primary
use case in #15158 has no padding problem with that wider size.

bjacob commented Nov 13, 2023

What was laid out here is complete. What's left is e2e integration and benchmarking and resolving any performance issues that arise then. That's tracked at #15566.

@bjacob bjacob closed this as completed Nov 13, 2023