
Data-tiling and Ukernels for i16xi4->i32 group-quantized matmuls. #15158

Closed
bjacob opened this issue Oct 11, 2023 · 3 comments


bjacob commented Oct 11, 2023

Continuing from existing issues about this kind of workload (e.g. #14951). The workload to be optimized is as given in @Max191's gist here: https://gist.github.com/Max191/8ae398f4697edd31bd07a613b46dcbbd

It can be described as:

  • A first linalg.generic effectively amounts to some transposes plus a linalg.batch_matmul, with element types i16xi4->i32.
  • A second linalg.generic takes the output of the first, dequantizes it arithmetically to f32, and add-reduces over what was the batch dimension in the first generic.
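For concreteness, here is a minimal NumPy sketch of that two-stage workload. The shapes, scale/zero-point scheme, and variable names are illustrative assumptions, not taken from the gist; the exact dequantization arithmetic there may differ.

```python
import numpy as np

# Hypothetical shapes: G quantization groups (the batch dim), K elements per
# group, N output columns. These sizes are made up for illustration.
G, K, N = 4, 8, 3
rng = np.random.default_rng(0)

lhs = rng.integers(-100, 100, size=(G, K), dtype=np.int16)   # i16 activations
rhs = rng.integers(0, 16, size=(G, K, N), dtype=np.int8)     # i4 weights, stored widened
scales = rng.random((G, N)).astype(np.float32)               # per-group dequant scales
zero_points = rng.integers(0, 16, size=(G, N)).astype(np.int32)

# First generic: integer batch vecmat, i16 x i4 -> i32, one contraction per group.
acc = np.einsum('gk,gkn->gn', lhs.astype(np.int32), rhs.astype(np.int32))

# Second generic: dequantize to f32 (with a standard zero-point correction,
# assumed here) and add-reduce over the group (batch) dimension.
lhs_sums = lhs.astype(np.int32).sum(axis=1)
dequant = scales * (acc - zero_points * lhs_sums[:, None]).astype(np.float32)
result = dequant.sum(axis=0)                                 # shape (N,)
```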

@Max191 has pushed the performance of this so far by improving default non-data-tiled non-ukernel codegen as much as possible, adding VectorContractCustomKernels as needed.

This issue is about doing a V2 of that work, switching to data-tiling and ukernels.

The work could break down into these pieces:

```mermaid
graph TD
a1["Update EncodingAttr to have a type tuple
instead of encoding that in enums (#15182)"]

a3_0["@NatashaKnk's ExpandVectors pattern (#15273) expanding matvec ops to matmul"]

a3_1["Add missing `batch_vecmat` linalg named op"]

a3_2["Update ExpandVectors to support `batch_vecmat`"]

a3_3["Lift that first `linalg.generic` to `arith.ext*`,
`linalg.batch_matmul` and transposes (#15339)"]

a4_0["SetEncoding for `arith.ext + batch_matmul`"]

a4_0_1["MaterializeEncoding for `arith.ext + batch_matmul`"]

a4_1["Verify that `batch_mmt4d` lowers to a loop of `mmt4d`"]

a4_2["LowerToUkernel for `arith.ext + mmt4d`"]

a5_0["ukernels: stop abusing signless as signed"]

a5_1["ukernels: sub-8bit support, and
generic mmt4d support for s16u4 and s16s16"]

a5_2["ukernels: optimized s16u4 and s16s16
tile functions for x86 and arm"]

a6_0["MaterializeEncoding refactor to
enable custom vecmat tiles
instead of just truncating
a generic matmul tile"]

a6["ukernels: optimized `s16i4`
tile function for vecmat with
custom vecmat-tuned tile"]

a7["Add data-tiling tile-selection logic for `i16xi4`"]

a8["Ensure that fusions work as intended"]

a8_1["Ensure that const-eval works as intended"]

a9["Benchmark and study e2e"]

a1-->a7-->a9
a3_0-->a3_2-->a3_3-->a4_0-->a4_0_1-->a4_1-->a4_2-->a9
a4_0_1-->a8-->a9
a4_0_1-->a8_1-->a9
a3_1-->a3_2
a5_0-->a5_1-->a5_2-->a6-->a7
a6_0-->a6

subgraph Legend
  notstartedLegend["Not Started"]
  inprogressLegend["In Progress"]
  doneLegend["Done"]
end

classDef notstarted fill:#ddd,text:#000
classDef inprogress fill:#ff6,text:#000
classDef done fill:#9f9,text:#000

class notstartedLegend notstarted
class inprogressLegend,a5_2,a8,a8_1,a9 inprogress
class doneLegend,a1,a3_0,a3_1,a3_2,a3_3,a5_0,a5_1,a4_0,a6_0,a4_2,a4_0_1,a4_1,a6,a7 done

style Legend fill:#fff, stroke:#000
```
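The data-tiling steps above revolve around packing matrices into the tiled layouts that mmt4d-style ukernels consume. A minimal NumPy sketch of that packing, with illustrative tile sizes (not the ones IREE's tile-selection logic would actually pick):

```python
import numpy as np

# Pack a row-major M x K matrix into a "4D" tiled layout:
# (M/M0) x (K/K0) outer tiles, each an M0 x K0 inner tile.
# Tile sizes here are illustrative only.
M, K, M0, K0 = 8, 16, 4, 8
a = np.arange(M * K, dtype=np.int16).reshape(M, K)

packed = a.reshape(M // M0, M0, K // K0, K0).transpose(0, 2, 1, 3)

# Each inner M0 x K0 tile is now contiguous in memory, so a ukernel's
# inner loop can stream it without strided access.
```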
@bjacob bjacob self-assigned this Oct 11, 2023
bjacob added a commit that referenced this issue Oct 16, 2023
This has been discussed for a while: ever since #14336 made encodings a
data structure, it was an odd remnant that we were still encoding the
element types tuple in the user enum. This was cumbersome, and
resurfaced in every design discussion as it looked like something that
wasn't scaling with new data types. Concretely, this is good to fix
ahead of adding `i16xi16` and `i16xi4` data-tiling support (#15158).

hanhanW commented Oct 19, 2023

wow, I just wanna say that mermaid (i.e., the diagram) looks really cool! This is new to me.


bjacob commented Oct 19, 2023

bjacob added a commit to llvm/llvm-project that referenced this issue Oct 25, 2023
Linalg currently has these named ops:
* `matmul`
* `matvec`
* `vecmat`
* `batch_matmul`
* `batch_matvec`

But it does not have:
* `batch_vecmat`

This PR adds it for consistency, and I have a short-term need for it
( iree-org/iree#15158 ); not having it
would cause some contortion on my end.
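The semantics of the new `batch_vecmat` op can be sketched in NumPy (shapes here are illustrative): the LHS carries one vector per batch, the RHS one matrix per batch.

```python
import numpy as np

# batch_vecmat semantics sketch:
#   lhs: B x K   (one vector per batch)
#   rhs: B x K x N
#   out: B x N
B, K, N = 2, 5, 3
lhs = np.arange(B * K, dtype=np.int32).reshape(B, K)
rhs = np.ones((B, K, N), dtype=np.int32)

out = np.einsum('bk,bkn->bn', lhs, rhs)
```

This mirrors `batch_matvec` with the vector operand on the other side, which is why its absence was an odd gap in the named-op set.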
bjacob added a commit that referenced this issue Oct 30, 2023
In MLIR we are chronically abusing signless types as signed. Ongoing
discussion at #15241, etc. At least in ukernels we are insulated enough
from the compiler code that we don't have to replicate the same
confusion. So, just because `linalg` named matmul ops want `int8 x int8
-> int32` matmuls to be represented as `i8 x i8 -> i32` even though
these are signless types and this is silently interpreting these as
signed, we didn't have to let that lower to a `_i8i8i32` ukernel, we
could instead more properly represent that as a `_s8s8s32` ukernel.

Fixing this now paves the way for unsigned LHS/RHS support in ukernels,
a step in #15158, without adding more tech debt.

The bulk of this PR is just a regex substitution:

```
find runtime/src/ compiler/src/ -type f | xargs sed -i 's/IREE_UK_FLAG_MMT4D_TYPE_I8I8I32/IREE_UK_FLAG_MMT4D_TYPE_S8S8S32/g;s/iree_uk_mmt4d_type_i8i8i32/iree_uk_mmt4d_type_s8s8s32/g;s/iree_uk_mmt4d_tile_i8i8i32/iree_uk_mmt4d_tile_s8s8s32/g;s/iree_mmt4d_reference_innerloop_i8i8i32/iree_mmt4d_reference_innerloop_s8s8s32/g'
```
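A small NumPy illustration of why the signless-vs-signed distinction matters: the same 8-bit pattern yields different products depending on whether a ukernel interprets the operand as `s8` or `u8`. (The values here are arbitrary.)

```python
import numpy as np

# One byte, bit pattern 1111_0100, viewed both ways.
raw = np.array([0xF4], dtype=np.uint8)
as_s8 = raw.view(np.int8)[0]   # signed interpretation: -12
as_u8 = raw[0]                 # unsigned interpretation: 244

# A hypothetical single multiply-accumulate step in each flavor of ukernel:
prod_signed = int(as_s8) * 3   # what an _s8... ukernel would compute
prod_unsigned = int(as_u8) * 3 # what a _u8... ukernel would compute
```

An `i8` type by itself does not say which of these is meant, which is why the ukernel naming now spells out `s8`/`u8` explicitly.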
bjacob added a commit that referenced this issue Nov 1, 2023
…> s32` (#15343)

`s16 * u4 -> s32` is the main goal in #15158, and `s16 * s16 -> s32` is a
less-optimized but generically useful path for non-8-bit quantized matmuls
that don't have super specialized code.
@Max191 Max191 self-assigned this Nov 3, 2023
bjacob added a commit that referenced this issue Nov 8, 2023
…rializeEncoding for narrow shapes. (#15431)

This changes how we approach narrow matmul tile size selection (in
particular, vecmat/matvec), from "we don't really care that much, so
let's derive narrow tiles from general ones by just truncation", to "we
actually care, at least in specific cases, about freely controlling
narrow matmul tiles independently of the general wide matmul case."

There are 2 immediate needs for this: @dcaballe was doing something
comparable in #15421 to generally unlock better AVX-512 codegen for
`f32` `vecmat`, and I have a specific need for this in #15158 for some
`s16 x u4` quantized `vecmat` case.

The solution proposed here is more general than the one in #15241 in
that it is not only about `vecmat` and `matvec`, it supports any
narrow-M / narrow-N case. Like #15241, it does so by extending
`EncodingAttr` in some way. Unlike #15241, it does so by adding optional
narrow-M / narrow-N integer attributes, instead of extending the `user`
enum.

Along the way, this rationalizes MaterializeEncoding code around
tile-size selection. Narrow tile sizes are now explicitly enumerated,
and the enumeration of tile sizes is now clearly decoupled from the
choosing among the enumerated tile sizes.

Another change made along the way: this changes the tile shape
convention around here from MxKxN to MxNxK, bringing this in line with
the convention in use in ukernels. The motivation for this convention is
that the MxN part here is particularly important as the output tile
shape, so it helps that the MxNxK convention has that as a contiguous
subset.

To avoid useless redundancy as the narrow-N case is almost 100%
symmetrical to the narrow-M case, the enumeration only goes over
narrow-M cases, and the handling of narrow-N is deferred to the choose
function, transposing the problem to derive narrow-N tiles from narrow-M
tiles. For `vecmat`/`matvec`, the symmetry is perfect: the
accumulator tile is 1D in these cases, so there is no difference at all.
For other non-vector narrow cases, there could conceivably be a
difference someday motivating decoupling narrow-N from narrow-M, but
this is sufficiently far-fetched that it's best to leave it to be dealt
with when a concrete use case arises, and enjoy the benefit of smaller
code until then.
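The enumerate-then-choose structure described in that commit message can be sketched as follows. The tile sizes and the power-of-two enumeration are made-up placeholders (IREE's real logic is target-dependent); what the sketch shows is the MxNxK convention and the transposition trick that derives narrow-N tiles from the narrow-M enumeration.

```python
# Tiles are (M0, N0, K0), per the MxNxK convention described above.
WIDE = (16, 16, 1)  # hypothetical general wide-matmul tile
# Explicitly enumerated narrow-M tiles (hypothetical sizes).
NARROW_M = {1: (1, 16, 1), 2: (2, 16, 1), 4: (4, 16, 1), 8: (8, 16, 1)}

def choose_tile(m, n):
    """Pick a tile for an m x n output; K is left to the wide tile."""
    if m < WIDE[0]:
        # Narrow-M: choose the smallest enumerated tile covering m.
        for m0, tile in sorted(NARROW_M.items()):
            if m <= m0:
                return tile
    if n < WIDE[1]:
        # Narrow-N: transpose to the narrow-M problem, then swap M0/N0 back.
        m0, n0, k0 = choose_tile(n, m)
        return (n0, m0, k0)
    return WIDE
```

For example, a `vecmat` (m = 1) gets the narrowest M tile, while a `matvec` (n = 1) gets its transpose, without ever enumerating narrow-N tiles separately.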
bjacob added a commit that referenced this issue Nov 10, 2023
As noted in #15525, it performs about a third faster, and our primary
use case in #15158 has no padding problem with that wider size.

bjacob commented Nov 13, 2023

What was laid out here is complete. What's left is e2e integration and benchmarking and resolving any performance issues that arise then. That's tracked at #15566.

@bjacob bjacob closed this as completed Nov 13, 2023