-
Notifications
You must be signed in to change notification settings - Fork 645
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Data-tiling and Ukernels for i16xi4->i32
group-quantized matmuls.
#15158
Comments
bjacob
added a commit
that referenced
this issue
Oct 16, 2023
This has been discussed for a while: ever since #14336 made encodings a data structure, it was an odd remnant that we were still encoding the element types tuple in the user enum. This was cumbersome, and resurfaced in every design discussion as it looked like something that wasn't scaling with new data types. Concretely, this is good to fix ahead of adding `i16xi16` and `i16xi4` data-tiling support (#15158).
bjacob
added a commit
to bjacob/iree
that referenced
this issue
Oct 16, 2023
…-org#15182) This has been discussed for a while: ever since iree-org#14336 made encodings a data structure, it was an odd remnant that we were still encoding the element types tuple in the user enum. This was cumbersome, and resurfaced in every design discussion as it looked like something that wasn't scaling with new data types. Concretely, this is good to fix ahead of adding `i16xi16` and `i16xi4` data-tiling support (iree-org#15158).
wow, I just wanna say that mermaid (i.e., the diagram) looks really cool! This is new to me. |
It's pretty new to GitHub ;) https://github.blog/2022-02-14-include-diagrams-markdown-files-mermaid/ |
This was referenced Oct 25, 2023
Merged
bjacob
added a commit
to llvm/llvm-project
that referenced
this issue
Oct 25, 2023
Linalg currently has these named ops: * `matmul` * `matvec` * `vecmat` * `batch_matmul` * `batch_matvec` But it does not have: * `batch_vecmat` This PRs adds that for consistency, and I have a short-term need for it ( iree-org/iree#15158 ), so not having this would cause some contortion on my end.
bjacob
added a commit
to iree-org/llvm-project
that referenced
this issue
Oct 25, 2023
Linalg currently has these named ops: * `matmul` * `matvec` * `vecmat` * `batch_matmul` * `batch_matvec` But it does not have: * `batch_vecmat` This PRs adds that for consistency, and I have a short-term need for it ( iree-org/iree#15158 ), so not having this would cause some contortion on my end.
bjacob
added a commit
to iree-org/llvm-project
that referenced
this issue
Oct 26, 2023
Linalg currently has these named ops: * `matmul` * `matvec` * `vecmat` * `batch_matmul` * `batch_matvec` But it does not have: * `batch_vecmat` This PRs adds that for consistency, and I have a short-term need for it ( iree-org/iree#15158 ), so not having this would cause some contortion on my end.
bjacob
added a commit
that referenced
this issue
Oct 30, 2023
In MLIR we are chronically abusing signless types as signed. Ongoing discussion at #15241, etc. At least in ukernels we are insulated enough from the compiler code that we don't have to replicate the same confusion. So, just because `linalg` named matmul ops want `int8 x int8 -> int32` matmuls to be represented as `i8 x i8 -> i32` even though these are signless types and this is silently interpreting these as signed, we didn't have to let that lower to a `_i8i8i32` ukernel, we could instead more properly represent that as a `_s8s8s32` ukernel. Fixing this now paves the way for unsigned LHS/RHS support in ukernels, a step in #15158, without adding more tech debt. The bulk of is PR is just a regex substitution: ``` find runtime/src/ compiler/src/ -type f | xargs sed -i 's/IREE_UK_FLAG_MMT4D_TYPE_I8I8I32/IREE_UK_FLAG_MMT4D_TYPE_S8S8S32/g;s/iree_uk_mmt4d_type_i8i8i32/iree_uk_mmt4d_type_s8s8s32/g;s/iree_uk_mmt4d_tile_i8i8i32/iree_uk_mmt4d_tile_s8s8s32/g;s/iree_mmt4d_reference_innerloop_i8i8i32/iree_mmt4d_reference_innerloop_s8s8s32/g' ```
bjacob
added a commit
that referenced
this issue
Nov 8, 2023
…rializeEncoding for narrow shapes. (#15431) This changes how we approach narrow matmul tile size selection (in particular, vecmat/matvec), from "we don't really care that much, so let's derive narrow tiles from general ones by just truncation", to "we actually care, at least in specific cases, about freely controlling narrow matmul tiles independently of the general wide matmul case." There are 2 immediate needs for this: @dcaballe was doing something comparable in #15421 to generally unlock better AVX-512 codegen for `f32` `vecmat`, and I have a specific need for this in #15158 for some `s16 x u4` quantized `vecmat` case. The solution proposed here is more general than the one in #15241 in that it is not only about `vecmat` and `matvec`, it supports any narrow-M / narrow-N case. Like #15241, it does so by extending `EncodingAttr` in some way. Unlike #15241, it does so by adding optional narrow-M / narrow-N integer attributes, instead of extending the `user` enum. Along the way, this rationalizes MaterializeEncoding code around tile-size selection. Narrow tile sizes are now explicitly enumerated, and the enumeration of tile sizes is now clearly decoupled from the choosing among the enumerated tile sizes. Another change made along the way: this changes the tile shape convention around here from MxKxN to MxNxK, bringing this in line with the convention in use in ukernels. The motivation for this convention is that the MxN part here is particularly important as the output tile shape, so it helps that the MxNxK convention has that as a contiguous subset. To avoid useless redundancy as the narrow-N case is almost 100% symmetrical to the narrow-M case, the enumeration only goes over narrow-M cases, and the handling of narrow-N is deferred to the choose function, transposing the problem to derive narrow-N tiles from narrow-M tiles. For `vecmat`/`matvec`, the symmetry is perfect, as the accumulator tile is 1D in these cases, there is no difference at all. For other non-vector narrow cases, there could conceivably be a difference someday motivating decoupling narrow-N from narrow-M, but this is sufficiently far-fetched that it's best to left that to be dealt with then a concrete use case arises, and enjoy the benefit of smaller code until then.
What was laid out here is complete. What's left is e2e integration and benchmarking and resolving any performance issues that arise then. That's tracked at #15566. |
ramiro050
pushed a commit
to ramiro050/iree
that referenced
this issue
Dec 19, 2023
…-org#15182) This has been discussed for a while: ever since iree-org#14336 made encodings a data structure, it was an odd remnant that we were still encoding the element types tuple in the user enum. This was cumbersome, and resurfaced in every design discussion as it looked like something that wasn't scaling with new data types. Concretely, this is good to fix ahead of adding `i16xi16` and `i16xi4` data-tiling support (iree-org#15158).
ramiro050
pushed a commit
to ramiro050/iree
that referenced
this issue
Dec 19, 2023
In MLIR we are chronically abusing signless types as signed. Ongoing discussion at iree-org#15241, etc. At least in ukernels we are insulated enough from the compiler code that we don't have to replicate the same confusion. So, just because `linalg` named matmul ops want `int8 x int8 -> int32` matmuls to be represented as `i8 x i8 -> i32` even though these are signless types and this is silently interpreting these as signed, we didn't have to let that lower to a `_i8i8i32` ukernel, we could instead more properly represent that as a `_s8s8s32` ukernel. Fixing this now paves the way for unsigned LHS/RHS support in ukernels, a step in iree-org#15158, without adding more tech debt. The bulk of is PR is just a regex substitution: ``` find runtime/src/ compiler/src/ -type f | xargs sed -i 's/IREE_UK_FLAG_MMT4D_TYPE_I8I8I32/IREE_UK_FLAG_MMT4D_TYPE_S8S8S32/g;s/iree_uk_mmt4d_type_i8i8i32/iree_uk_mmt4d_type_s8s8s32/g;s/iree_uk_mmt4d_tile_i8i8i32/iree_uk_mmt4d_tile_s8s8s32/g;s/iree_mmt4d_reference_innerloop_i8i8i32/iree_mmt4d_reference_innerloop_s8s8s32/g' ```
ramiro050
pushed a commit
to ramiro050/iree
that referenced
this issue
Dec 19, 2023
…> s32` (iree-org#15343) `s16 * u4 -> s32` is the main goal in iree-org#15158, and `s16 * s16 -> s32` is seen as a less-optimized but generically useful path to run generic "some non-8-bit quantized matmuls that we don't have super specialized code for" on.
ramiro050
pushed a commit
to ramiro050/iree
that referenced
this issue
Dec 19, 2023
…rializeEncoding for narrow shapes. (iree-org#15431) This changes how we approach narrow matmul tile size selection (in particular, vecmat/matvec), from "we don't really care that much, so let's derive narrow tiles from general ones by just truncation", to "we actually care, at least in specific cases, about freely controlling narrow matmul tiles independently of the general wide matmul case." There are 2 immediate needs for this: @dcaballe was doing something comparable in iree-org#15421 to generally unlock better AVX-512 codegen for `f32` `vecmat`, and I have a specific need for this in iree-org#15158 for some `s16 x u4` quantized `vecmat` case. The solution proposed here is more general than the one in iree-org#15241 in that it is not only about `vecmat` and `matvec`, it supports any narrow-M / narrow-N case. Like iree-org#15241, it does so by extending `EncodingAttr` in some way. Unlike iree-org#15241, it does so by adding optional narrow-M / narrow-N integer attributes, instead of extending the `user` enum. Along the way, this rationalizes MaterializeEncoding code around tile-size selection. Narrow tile sizes are now explicitly enumerated, and the enumeration of tile sizes is now clearly decoupled from the choosing among the enumerated tile sizes. Another change made along the way: this changes the tile shape convention around here from MxKxN to MxNxK, bringing this in line with the convention in use in ukernels. The motivation for this convention is that the MxN part here is particularly important as the output tile shape, so it helps that the MxNxK convention has that as a contiguous subset. To avoid useless redundancy as the narrow-N case is almost 100% symmetrical to the narrow-M case, the enumeration only goes over narrow-M cases, and the handling of narrow-N is deferred to the choose function, transposing the problem to derive narrow-N tiles from narrow-M tiles. For `vecmat`/`matvec`, the symmetry is perfect, as the accumulator tile is 1D in these cases, there is no difference at all. For other non-vector narrow cases, there could conceivably be a difference someday motivating decoupling narrow-N from narrow-M, but this is sufficiently far-fetched that it's best to left that to be dealt with then a concrete use case arises, and enjoy the benefit of smaller code until then.
ramiro050
pushed a commit
to ramiro050/iree
that referenced
this issue
Dec 19, 2023
As noted in iree-org#15525 , it performs about a third faster, and our primary use case in iree-org#15158 has no padding problem with that wider size.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Continuing from existing issues about this kind of workload (e.g. #14951). The workload to be optimized is as given in @Max191's gist here: https://gist.github.com/Max191/8ae398f4697edd31bd07a613b46dcbbd
It can be described as:
linalg.generic
effectively amounts to some transposes and alinalg.batch_matmul
, with element types:i16xi4->i32
.linalg.generic
takes the output of the firstgeneric
and performs some arithmetic dequantizing tof32
and does add-reduction on what was the batch dimension in the firstgeneric
.@Max191 has pushed the performance of this so far by improving default non-data-tiled non-ukernel codegen as much as possible, adding VectorContractCustomKernels as needed.
This Issue is about doing a V2 of this switching to data-tiling and ukernels.
The work could break down into these pieces:
The text was updated successfully, but these errors were encountered: