[CPU][DT] Select proper vec/unroll sizes for vecmat/matvec codegen #15421
Conversation
Force-pushed 6cbf780 to e63833c (compare)
return MatmulTileParams{16, 1, 16};
MatmulTileParams tileParams{16, 1, 16};
if (!hasMicrokernels(target)) {
  if (isVecMatEncodingUser(user)) {
@bjacob, @hanhanW, I spent quite some time doing some acrobatics to get the information needed here to determine whether the matmul is a vecmat/matvec, and ended up with something messy that didn't work for some cases. Then I decided to try extending the encoding to support MATVEC/VECMAT, and the implementation seems simpler/cleaner. Do you think this is a fair approach?
More interestingly, if we go with this, we could connect `linalg.vecmat`/`linalg.matvec` to the new VECMAT/MATVEC encodings and disable the expand vector pass, which should fix the simple case of reshape ops described in #15242 (comment) (also @NatashaKnk, @pzread).
Thoughts? :)
@dcaballe, I am working on a PR that makes tile size selection for vecmat / matvec / other narrow cases much more explicit, rather than implicit as it currently is (https://github.com/openxla/iree/blob/717c7a0dcda433b965ac14c2c3a287ae07796c8c/compiler/src/iree/compiler/Codegen/Common/MaterializeEncodingIntoPackUnPack.cpp#L272-L317). I think it will be necessary to properly solve this problem space.
To elaborate a bit more on what I'm proposing here as the solution... We should solve this for all narrow matmuls, where narrow means "either M or N is small enough that padding is a major performance concern", particularly the case where M or N is less than the tile size. The present PR, tying this to VECMAT/MATVEC encodings, would solve this only for the cases M==1 and N==1 respectively. What about M==2, N==2, M==4, etc.? To get something that scales better, I'm proposing sticking to only the MATMUL encoding user enum value, and instead adding separate optional fields in EncodingAttr for static small M/N dimension sizes. (These can't be inferred from the shape of the tensor being encoded because they can refer to a dimension that does not exist in that tensor; for instance, the RHS tensor only has dimensions K and N and does not involve dimension M.)
And as alluded to in the previous comment, I'm also dropping the whole `adjustTileSizesToNarrowStaticShape` thing; instead, we will provide explicit, complete lists of narrow tile shapes.
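For illustration, here is a minimal C++ sketch of the "enumerate explicitly, then choose" idea, with the narrow-M size carried as an optional value. `TileMxNxK` matches the struct that appears in the diff below; the helper functions and the selection logic are hypothetical, not the actual IREE implementation, and the tile values are the AVX-512 f32 list discussed later in this thread.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

struct TileMxNxK {
  int64_t M, N, K;
};

// Step 1: enumerate candidate tiles explicitly, from the wide case down to
// narrow-M cases, instead of deriving narrow tiles by truncating the wide one.
static std::vector<TileMxNxK> enumerateMatmulTiles() {
  return {
      TileMxNxK{16, 16, 1}, TileMxNxK{8, 32, 1}, TileMxNxK{4, 64, 1},
      TileMxNxK{2, 64, 1},  TileMxNxK{1, 128, 1},
  };
}

// Step 2: choose among the enumerated tiles using the optional static
// narrow-M size carried by the encoding; fall back to the widest tile.
static TileMxNxK chooseMatmulTile(std::optional<int64_t> narrowM) {
  std::vector<TileMxNxK> tiles = enumerateMatmulTiles();
  TileMxNxK best = tiles.front();
  if (!narrowM)
    return best;
  for (const TileMxNxK &tile : tiles) {
    // Pick the smallest M0 that still covers the narrow M size, to minimize padding.
    if (tile.M >= *narrowM && tile.M <= best.M)
      best = tile;
  }
  return best;
}
```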
#15431 has my WIP stuff.
That also sounds ok to me. Let me know when it's ready and I'll rebase on top of that. Thanks!
@NatashaKnk @pzread (echoing here after discussion with @dcaballe ) I believe that #15431 is a viable alternative to this. Diego told me about the possibility that we may evolve SetEncoding to match matvec/vecmat named ops, not just matmul. Doing so is neither easier nor harder after #15431.
5x slower with DT, 3x slower with DT + UK. Let's use the issues in #15313 for further discussion about this. I'll rebase this PR to use Benoit's changes.
…rializeEncoding for narrow shapes. (#15431)

This changes how we approach narrow matmul tile size selection (in particular, vecmat/matvec), from "we don't really care that much, so let's derive narrow tiles from general ones by just truncation", to "we actually care, at least in specific cases, about freely controlling narrow matmul tiles independently of the general wide matmul case."

There are 2 immediate needs for this: @dcaballe was doing something comparable in #15421 to generally unlock better AVX-512 codegen for `f32` `vecmat`, and I have a specific need for this in #15158 for some `s16 x u4` quantized `vecmat` case.

The solution proposed here is more general than the one in #15241 in that it is not only about `vecmat` and `matvec`; it supports any narrow-M / narrow-N case. Like #15241, it does so by extending `EncodingAttr` in some way. Unlike #15241, it does so by adding optional narrow-M / narrow-N integer attributes, instead of extending the `user` enum.

Along the way, this rationalizes MaterializeEncoding code around tile-size selection. Narrow tile sizes are now explicitly enumerated, and the enumeration of tile sizes is now clearly decoupled from the choosing among the enumerated tile sizes.

Another change made along the way: this changes the tile shape convention around here from MxKxN to MxNxK, bringing this in line with the convention in use in ukernels. The motivation for this convention is that the MxN part here is particularly important as the output tile shape, so it helps that the MxNxK convention has that as a contiguous subset.

To avoid useless redundancy, as the narrow-N case is almost 100% symmetrical to the narrow-M case, the enumeration only goes over narrow-M cases, and the handling of narrow-N is deferred to the choose function, transposing the problem to derive narrow-N tiles from narrow-M tiles. For `vecmat`/`matvec`, the symmetry is perfect: the accumulator tile is 1D in these cases, so there is no difference at all. For other non-vector narrow cases, there could conceivably be a difference someday motivating decoupling narrow-N from narrow-M, but this is sufficiently far-fetched that it's best to leave that to be dealt with when a concrete use case arises, and enjoy the benefit of smaller code until then.
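To illustrate the narrow-N-by-transposition scheme described in the commit message above, here is a minimal sketch (the helper name is hypothetical, not the actual IREE code): a narrow-N tile is derived from an enumerated narrow-M tile by swapping the M and N tile sizes; for vecmat/matvec the accumulator tile is 1D, so this is exact.

```cpp
#include <cstdint>
#include <utility>

struct TileMxNxK {
  int64_t M, N, K;
};

// Derive a narrow-N tile from an enumerated narrow-M tile by transposing the
// problem: swap the M and N tile sizes and keep K unchanged.
static TileMxNxK transposeTile(TileMxNxK tile) {
  std::swap(tile.M, tile.N);
  return tile;
}

// Example: the narrow-M (vecmat, M == 1) tile {1, 128, 1} becomes the
// narrow-N (matvec, N == 1) tile {128, 1, 1}.
```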
Interesting... It looks like we don't have vecmat ops in CI: my latest changes lead to terrible performance due to missing canonicalization patterns on the mmt4d side, yet nothing flagged it.
I thought we had those ops in the GPT2 models? Otherwise we wouldn't have seen the 10x improvements from @NatashaKnk's matvec/vecmat commit #15273 (comment)
Oh, it's because we are not testing DT without UK, then :)
The generated code looks great with this PR and #15621 :)
The ExpandVector changes should go away after rebasing.
// Code generation tile sizes.
return {
    TileMxNxK{16, 16, 1}, // Aim to use VFMADD* (zmm).
    TileMxNxK{8, 32, 1},  // Truncation of the above.
    TileMxNxK{4, 64, 1},  // Truncation of the above.
    TileMxNxK{2, 64, 1},  // Truncation of the above.
    TileMxNxK{1, 128, 1}, // Truncation of the above.
};
I remember that you mentioned why these have different tile sizes compared to the ukernels, but I don't remember the reason now. Could you add a comment on this?
Done!
TileMxNxK{16, 16, 1}, // Aim to use VFMADD* (zmm).
TileMxNxK{8, 32, 1},  // Truncation of the above.
TileMxNxK{4, 64, 1},  // Truncation of the above.
TileMxNxK{2, 64, 1},  // Truncation of the above.
TileMxNxK{1, 128, 1}, // Truncation of the above.
Please only comment `// Truncation of the above` when it really is a truncation of the above. Here, going from `{16, 16, 1}` to `{8, 32, 1}` is not, since it gets wider in N0: you have a 2x narrower M0 but a 2x wider N0, so still 16 zmm accumulator registers.
(Which is what the comment just above is saying).
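For reference, the zmm accumulator-register counts behind that remark work out as follows, assuming f32 elements (16 lanes per 512-bit zmm register). This is just arithmetic restating the discussion, not code from the PR:

```cpp
#include <cstdint>

// One 512-bit zmm register holds 512 / 32 = 16 f32 lanes.
constexpr int64_t kF32LanesPerZmm = 16;

// Number of zmm registers needed to hold an M0 x N0 f32 accumulator tile.
constexpr int64_t zmmAccumulators(int64_t M0, int64_t N0) {
  return (M0 * N0) / kF32LanesPerZmm;
}

static_assert(zmmAccumulators(16, 16) == 16, "wide tile: 16 accumulators");
static_assert(zmmAccumulators(8, 32) == 16, "same number of accumulators");
static_assert(zmmAccumulators(4, 64) == 16, "same number of accumulators");
static_assert(zmmAccumulators(2, 64) == 8, "half the number of accumulators");
static_assert(zmmAccumulators(1, 128) == 8, "half the number of accumulators");
```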
Terrible performance is expected
Force-pushed 9e3d7c4 to 46e4dd7 (compare)
Ready to land... Can anybody approve, please? :)
(and merge it if I'm not around. Thanks!)
@dcaballe I took a look this morning, but the current diff after the force-push doesn't seem different from the earlier one from before the review comments?
Something got lost in the process... 1 sec...
Forgot to commit the changes, sorry :)
The regressed total dispatch sizes are probably the result of LLVM unrolling, which we intend to disable soon anyway.
// Code generation tile sizes.
return {
    TileMxNxK{16, 16, 1}, // Aim to use VFMADD* (zmm).
    TileMxNxK{8, 32, 1},  // Use same number of accumulators.
    TileMxNxK{4, 64, 1},  // Use same number of accumulators.
    TileMxNxK{2, 64, 1},  // Use half the number of accumulators.
    TileMxNxK{1, 128, 1}, // Use half the number of accumulators.
};
It looks like the comment you added is gone after the rebase?
I'm talking about the review comment from before your force-push: #15421 (comment)
Not sure, but I addressed the feedback provided.
I think it was not just `// Code generation tile sizes.` before the force-push. You added a comment about why they have different tile sizes compared to the ukernels, but it does not show up after the force-push.
This PR improves the tile sizes for vecmat/matvec flavors of mmt4d.