
[CPU][DT] Select proper vec/unroll sizes for vecmat/matvec codegen #15421

Merged

2 commits merged into iree-org:main on Dec 1, 2023

Conversation

dcaballe
Contributor

@dcaballe dcaballe commented Nov 4, 2023

This PR improves the tile sizes for vecmat/matvec flavors of mmt4d.

@dcaballe dcaballe added the `benchmarks:x86_64` (Run default x86_64 benchmarks) and `benchmarks:android-cpu` (Run default Android CPU benchmarks) labels Nov 4, 2023

github-actions bot commented Nov 4, 2023

Abbreviated Benchmark Summary

@ commit f7fae4e4a62618197ad1cd35d9236de8ad9916cd (vs. base 7fdc31a3c41365f0baf4ce5da37a9c1e03ad5e3f)

Improved Latencies 🎉

| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
| --- | --- | --- | --- |
| EfficientNet_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 50.541 (vs. 54.294, 6.91%↓) | 50.459 | 0.210 |
| MobileBertSquad_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu][default-flags] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 372.564 (vs. 396.552, 6.05%↓) | 373.134 | 4.222 |
| MobileBertSquad_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu][default-flags] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 365.676 (vs. 388.691, 5.92%↓) | 366.317 | 4.846 |

[Top 3 out of 6 results shown]

Regressed Compilation Times 🚩

| Benchmark Name | Compilation Time (ms) |
| --- | --- |
| PoseNet_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags,compile-stats] | 41641 (vs. 14799, 181.38%↑) |

Regressed Total Dispatch Sizes 🚩

| Benchmark Name | Total Dispatch Size (bytes) |
| --- | --- |
| PoseNet_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags,compile-stats] | 96520 (vs. 45256, 113.28%↑) |
| BertLargeTF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags,compile-stats] | 78904 (vs. 45320, 74.10%↑) |
| MobileBertSquad_fp16(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags,compile-stats] | 72152 (vs. 45688, 57.92%↑) |

[Top 3 out of 8 results shown]

Improved Total Dispatch Sizes 🎉

| Benchmark Name | Total Dispatch Size (bytes) |
| --- | --- |
| PersonDetect_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags,compile-stats] | 75672 (vs. 84968, 10.94%↓) |
| EfficientNet_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags,compile-stats] | 152136 (vs. 170168, 10.60%↓) |

Regressed Total Artifact Sizes 🚩

| Benchmark Name | Total Artifact Size (bytes) |
| --- | --- |
| MobileNetV3Small_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags,compile-stats] | 11491909 (vs. 10455429, 9.91%↑) |
| EfficientNetV2STF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags,compile-stats] | 179696593 (vs. 169847697, 5.80%↑) |

Regressed Stream IR Dispatch Count (# of cmd.dispatch ops) 🚩

| Benchmark Name | Stream IR Dispatch Count (# of cmd.dispatch ops) |
| --- | --- |
| PoseNet_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags,compile-stats] | 99 (vs. 98, 1.02%↑) |

For more information:

Source Workflow Run

```cpp
return MatmulTileParams{16, 1, 16};
MatmulTileParams tileParams{16, 1, 16};
if (!hasMicrokernels(target)) {
if (isVecMatEncodingUser(user)) {
```
Contributor Author

@bjacob, @hanhanW, I spent quite some time doing acrobatics to get the information needed here to determine whether the matmul is a vecmat/matvec, and ended up with something messy that didn't work in some cases. Then I decided to try extending the encoding to support MATVEC/VECMAT, and the implementation seems simpler/cleaner. Do you think this is a fair approach?

More interestingly, if we go with this, we could connect linalg.vecmat/linalg.matvec to the new VECMAT/MATVEC encoding and disable the expand vector pass, which should fix the simple case of reshape ops described in #15242 (comment) (also @NatashaKnk, @pzread)

Thoughts? :)

Contributor

@dcaballe, I am working on a PR that makes tile size selection for vecmat/matvec/other narrow cases much more explicit, rather than implicit as it currently is (https://github.com/openxla/iree/blob/717c7a0dcda433b965ac14c2c3a287ae07796c8c/compiler/src/iree/compiler/Codegen/Common/MaterializeEncodingIntoPackUnPack.cpp#L272-L317). I think it will be necessary to properly solve this problem space.

Contributor

@bjacob bjacob Nov 6, 2023

To elaborate a bit more on what I'm proposing here as the solution... We should solve this for all narrow matmuls, where narrow means "either M or N is small enough that padding is a major performance concern", particularly the case where M or N is less than the tile size. The present PR, tying this to VECMAT/MATVEC encodings, would solve this only for the cases M==1 and N==1, respectively. What about M==2, N==2, M==4, etc.? To get something that scales better, I'm proposing sticking to only the MATMUL encoding user enum value, and instead adding separate optional fields in EncodingAttr for static small M/N dimension sizes. (These can't be inferred from the shape of the tensor being encoded, because they can refer to a dimension that does not exist in that tensor; for instance, the RHS tensor only has dimensions K and N and does not involve dimension M.)

And as alluded to in the previous comment, I'm also dropping the whole adjustTileSizesToNarrowStaticShape thing; instead, we will list explicit, complete lists of narrow tile shapes.

Contributor

#15431 has my WIP stuff.

Contributor Author

That also sounds ok to me. Let me know when it's ready and I'll rebase on top of that. Thanks!

Contributor

@NatashaKnk @pzread (echoing here after discussion with @dcaballe ) I believe that #15431 is a viable alternative to this. Diego told me about the possibility that we may evolve SetEncoding to match matvec/vecmat named ops, not just matmul. Doing so is neither easier nor harder after #15431.

Contributor Author

5x slower with DT, 3x slower with DT + UK. Let's use the issues in #15313 for further discussion about this. I'll rebase this PR to use Benoit's changes.

bjacob added a commit that referenced this pull request Nov 8, 2023
…rializeEncoding for narrow shapes. (#15431)

This changes how we approach narrow matmul tile size selection (in
particular, vecmat/matvec), from "we don't really care that much, so
let's derive narrow tiles from general ones by just truncation", to "we
actually care, at least in specific cases, about freely controlling
narrow matmul tiles independently of the general wide matmul case."

There are 2 immediate needs for this: @dcaballe was doing something
comparable in #15421 to generally unlock better AVX-512 codegen for
`f32` `vecmat`, and I have a specific need for this in #15158 for some
`s16 x u4` quantized `vecmat` case.

The solution proposed here is more general than the one in #15241 in
that it is not only about `vecmat` and `matvec`, it supports any
narrow-M / narrow-N case. Like #15241, it does so by extending
`EncodingAttr` in some way. Unlike #15241, it does so by adding optional
narrow-M / narrow-N integer attributes, instead of extending the `user`
enum.

Along the way, this rationalizes MaterializeEncoding code around
tile-size selection. Narrow tile sizes are now explicitly enumerated,
and the enumeration of tile sizes is now clearly decoupled from the
choosing among the enumerated tile sizes.

Another change made along the way: this changes the tile shape
convention around here from MxKxN to MxNxK, bringing this in line with
the convention in use in ukernels. The motivation for this convention is
that the MxN part here is particularly important as the output tile
shape, so it helps that the MxNxK convention has that as a contiguous
subset.

To avoid useless redundancy as the narrow-N case is almost 100%
symmetrical to the narrow-M case, the enumeration only goes over
narrow-M cases, and the handling of narrow-N is deferred to the choose
function, transposing the problem to derive narrow-N tiles from narrow-M
tiles. For `vecmat`/`matvec`, the symmetry is perfect, as the
accumulator tile is 1D in these cases, there is no difference at all.
For other non-vector narrow cases, there could conceivably be a difference someday motivating decoupling narrow-N from narrow-M, but this is sufficiently far-fetched that it's best to leave that to be dealt with when a concrete use case arises, and enjoy the benefit of smaller code until then.
@dcaballe
Contributor Author

Interesting... It looks like we don't have vecmat ops in CI, because my latest changes lead to terrible performance due to missing canonicalization patterns on the mmt4d side

@hanhanW
Contributor

hanhanW commented Nov 15, 2023

> Interesting... It looks like we don't have vecmat ops in CI, because my latest changes lead to terrible performance due to missing canonicalization patterns on the mmt4d side

I thought we have those ops in GPT2 models? Otherwise we wouldn't have seen the 10x improvements from @NatashaKnk's matvec/vecmat commit #15273 (comment)

@dcaballe
Contributor Author

Oh, it's because we are not testing DT without UK then :)

@dcaballe dcaballe marked this pull request as ready for review November 16, 2023 20:10
@dcaballe dcaballe requested a review from hanhanW as a code owner November 16, 2023 20:10
@dcaballe
Contributor Author

dcaballe commented Nov 16, 2023

The generated code looks great with this PR and #15621 :)

@dcaballe dcaballe requested a review from bjacob November 16, 2023 20:11
@dcaballe
Contributor Author

The ExpandVector changes should go away after rebasing.

Comment on lines 175 to 180

```cpp
// Code generation tile sizes.
return {
    TileMxNxK{16, 16, 1}, // Aim to use VFMADD* (zmm).
    TileMxNxK{8, 32, 1},  // Truncation of the above.
    TileMxNxK{4, 64, 1},  // Truncation of the above.
    TileMxNxK{2, 64, 1},  // Truncation of the above.
    TileMxNxK{1, 128, 1}, // Truncation of the above.
};
```
Contributor

I remember that you mentioned why these have different tile sizes compared to ukernels, but I forget now. Could you add a comment on this?

Contributor Author

Done!

Comment on lines 178 to 179

```cpp
TileMxNxK{16, 16, 1}, // Aim to use VFMADD* (zmm).
TileMxNxK{8, 32, 1},  // Truncation of the above.
TileMxNxK{4, 64, 1},  // Truncation of the above.
TileMxNxK{2, 64, 1},  // Truncation of the above.
TileMxNxK{1, 128, 1}, // Truncation of the above.
```
Contributor

@bjacob bjacob Nov 17, 2023

Please only comment `// Truncation of the above` when it really is a truncation of the above. Here, going from `{16, 16, 1}` to `{8, 32, 1}` is not, since it gets wider in N0: you get 2x narrower M0 but 2x wider N0, so still 16 zmm accumulator registers.

Contributor

(Which is what the comment just above is saying).

Terrible performance is expected
@dcaballe
Contributor Author

dcaballe commented Dec 1, 2023

Ready to land... Can anybody approve, please? :)

@dcaballe
Contributor Author

dcaballe commented Dec 1, 2023

(and merge it if I'm not around. Thanks!)

@bjacob
Contributor

bjacob commented Dec 1, 2023

@dcaballe I took a look this morning, but the current diff after the force-push doesn't seem different from the earlier one before the review comments?

@dcaballe
Contributor Author

dcaballe commented Dec 1, 2023

Something got lost in the process... 1 sec...

@dcaballe
Contributor Author

dcaballe commented Dec 1, 2023

Forgot to commit the changes, sorry :)

@bjacob
Contributor

bjacob commented Dec 1, 2023

The regressed total dispatch sizes are probably the result of LLVM unrolling, which we intend to disable soon anyway.

@dcaballe dcaballe merged commit 0e28a6d into iree-org:main Dec 1, 2023
59 checks passed
@dcaballe dcaballe deleted the mmt4d-vecmat branch December 1, 2023 15:00
Comment on lines +173 to +180

```cpp
// Code generation tile sizes.
return {
    TileMxNxK{16, 16, 1}, // Aim to use VFMADD* (zmm).
    TileMxNxK{8, 32, 1},  // Use same number of accumulators.
    TileMxNxK{4, 64, 1},  // Use same number of accumulators.
    TileMxNxK{2, 64, 1},  // Use half the number of accumulators.
    TileMxNxK{1, 128, 1}, // Use half the number of accumulators.
};
```
Contributor

It looks like the comment you added is gone after the rebase?

Contributor

I'm talking about the review comment before your force-push: #15421 (comment)

Contributor Author

Not sure, but I addressed the feedback provided.

Contributor

I think it was not just `// Code generation tile sizes.` before the force-push. You added a comment about why these have different tile sizes compared to ukernels, but it does not show up after the force-push.

ramiro050 pushed a commit to ramiro050/iree that referenced this pull request Dec 19, 2023
…rializeEncoding for narrow shapes. (iree-org#15431)

ramiro050 pushed a commit to ramiro050/iree that referenced this pull request Dec 19, 2023
…ree-org#15421)

This PR improves the tile sizes for vecmat/matvec flavors of mmt4d.
Labels
`benchmarks:android-cpu` (Run default Android CPU benchmarks), `benchmarks:x86_64` (Run default x86_64 benchmarks)
3 participants