
[CPU][DT] Select proper vec/unroll sizes for vecmat/matvec codegen #15421

Merged

2 commits merged into iree-org:main on Dec 1, 2023

Conversation

dcaballe
Contributor

@dcaballe dcaballe commented Nov 4, 2023

This PR improves the tile sizes for vecmat/matvec flavors of mmt4d.

@dcaballe dcaballe added the `benchmarks:x86_64` (Run default x86_64 benchmarks) and `benchmarks:android-cpu` (Run default Android CPU benchmarks) labels Nov 4, 2023

github-actions bot commented Nov 4, 2023

Abbreviated Benchmark Summary

@ commit f7fae4e4a62618197ad1cd35d9236de8ad9916cd (vs. base 7fdc31a3c41365f0baf4ce5da37a9c1e03ad5e3f)

Improved Latencies 🎉

| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
| --- | --- | --- | --- |
| EfficientNet_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 50.541 (vs. 54.294, 6.91%↓) | 50.459 | 0.210 |
| MobileBertSquad_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu][default-flags] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 372.564 (vs. 396.552, 6.05%↓) | 373.134 | 4.222 |
| MobileBertSquad_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu][default-flags] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 365.676 (vs. 388.691, 5.92%↓) | 366.317 | 4.846 |

[Top 3 out of 6 results shown]

Regressed Compilation Times 🚩

| Benchmark Name | Compilation Time (ms) |
| --- | --- |
| PoseNet_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags,compile-stats] | 41641 (vs. 14799, 181.38%↑) |

Regressed Total Dispatch Sizes 🚩

| Benchmark Name | Total Dispatch Size (bytes) |
| --- | --- |
| PoseNet_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags,compile-stats] | 96520 (vs. 45256, 113.28%↑) |
| BertLargeTF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags,compile-stats] | 78904 (vs. 45320, 74.10%↑) |
| MobileBertSquad_fp16(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags,compile-stats] | 72152 (vs. 45688, 57.92%↑) |

[Top 3 out of 8 results shown]

Improved Total Dispatch Sizes 🎉

| Benchmark Name | Total Dispatch Size (bytes) |
| --- | --- |
| PersonDetect_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags,compile-stats] | 75672 (vs. 84968, 10.94%↓) |
| EfficientNet_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags,compile-stats] | 152136 (vs. 170168, 10.60%↓) |

Regressed Total Artifact Sizes 🚩

| Benchmark Name | Total Artifact Size (bytes) |
| --- | --- |
| MobileNetV3Small_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags,compile-stats] | 11491909 (vs. 10455429, 9.91%↑) |
| EfficientNetV2STF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags,compile-stats] | 179696593 (vs. 169847697, 5.80%↑) |

Regressed Stream IR Dispatch Count (# of cmd.dispatch ops) 🚩

| Benchmark Name | Stream IR Dispatch Count (# of cmd.dispatch ops) |
| --- | --- |
| PoseNet_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags,compile-stats] | 99 (vs. 98, 1.02%↑) |

For more information:

Source Workflow Run

```cpp
return MatmulTileParams{16, 1, 16};
MatmulTileParams tileParams{16, 1, 16};
if (!hasMicrokernels(target)) {
if (isVecMatEncodingUser(user)) {
```
Contributor Author

@bjacob, @hanhanW, I spent quite some time doing acrobatics to get the information needed here to determine whether the matmul is a vecmat/matvec, and ended up with something messy that didn't work in some cases. Then I decided to try extending the encoding to support MATVEC/VECMAT, and the implementation seems simpler/cleaner. Do you think this is a fair approach?

More interestingly, if we go with this, we could connect linalg.vecmat/linalg.matvec to the new VECMAT/MATVEC encoding and disable the expand vector pass, which should fix the simple case of reshape ops described in #15242 (comment) (also @NatashaKnk, @pzread)

Thoughts? :)

Contributor

@dcaballe, I am working on a PR that makes tile size selection for vecmat/matvec/other narrow cases much more explicit, rather than implicit as it currently is (https://github.com/openxla/iree/blob/717c7a0dcda433b965ac14c2c3a287ae07796c8c/compiler/src/iree/compiler/Codegen/Common/MaterializeEncodingIntoPackUnPack.cpp#L272-L317). I think it will be necessary to properly solve this problem space.

Contributor

@bjacob bjacob Nov 6, 2023

To elaborate a bit more on what I'm proposing here as the solution... We should solve this for all narrow matmuls, where narrow means "either M or N is small enough that padding is a major performance concern", particularly the case where M or N is less than the tile size. The present PR, tying this to VECMAT/MATVEC encodings, would solve this only for the cases M==1 and N==1, respectively. What about M==2, N==2, M==4, etc.? To get something that scales better, I'm proposing sticking to only the MATMUL encoding user enum value, and instead adding separate optional fields in EncodingAttr for static small M/N dimension sizes. (These can't be inferred from the shape of the tensor being encoded, because they can refer to a dimension that does not exist in that tensor; for instance, the RHS tensor only has dimensions K and N and does not involve dimension M.)

And as alluded to in the previous comment, I'm also dropping the whole adjustTileSizesToNarrowStaticShape thing; instead, we will list explicit, complete lists of narrow tile shapes.

Contributor

#15431 has my WIP stuff.

Contributor Author

That also sounds ok to me. Let me know when it's ready and I'll rebase on top of that. Thanks!

Contributor

@NatashaKnk @pzread (echoing here after discussion with @dcaballe ) I believe that #15431 is a viable alternative to this. Diego told me about the possibility that we may evolve SetEncoding to match matvec/vecmat named ops, not just matmul. Doing so is neither easier nor harder after #15431.

Contributor Author

5x slower with DT, 3x slower with DT + UK. Let's use the issues in #15313 for further discussion about this. I'll rebase this PR to use Benoit's changes.

bjacob added a commit that referenced this pull request Nov 8, 2023
…rializeEncoding for narrow shapes. (#15431)

This changes how we approach narrow matmul tile size selection (in
particular, vecmat/matvec), from "we don't really care that much, so
let's derive narrow tiles from general ones by just truncation", to "we
actually care, at least in specific cases, about freely controlling
narrow matmul tiles independently of the general wide matmul case."

There are 2 immediate needs for this: @dcaballe was doing something
comparable in #15421 to generally unlock better AVX-512 codegen for
`f32` `vecmat`, and I have a specific need for this in #15158 for some
`s16 x u4` quantized `vecmat` case.

The solution proposed here is more general than the one in #15241 in
that it is not only about `vecmat` and `matvec`, it supports any
narrow-M / narrow-N case. Like #15241, it does so by extending
`EncodingAttr` in some way. Unlike #15241, it does so by adding optional
narrow-M / narrow-N integer attributes, instead of extending the `user`
enum.

Along the way, this rationalizes MaterializeEncoding code around
tile-size selection. Narrow tile sizes are now explicitly enumerated,
and the enumeration of tile sizes is now clearly decoupled from the
choosing among the enumerated tile sizes.

Another change made along the way: this changes the tile shape
convention around here from MxKxN to MxNxK, bringing this in line with
the convention in use in ukernels. The motivation for this convention is
that the MxN part here is particularly important as the output tile
shape, so it helps that the MxNxK convention has that as a contiguous
subset.

To avoid useless redundancy as the narrow-N case is almost 100%
symmetrical to the narrow-M case, the enumeration only goes over
narrow-M cases, and the handling of narrow-N is deferred to the choose
function, transposing the problem to derive narrow-N tiles from narrow-M
tiles. For `vecmat`/`matvec`, the symmetry is perfect, as the
accumulator tile is 1D in these cases, there is no difference at all.
For other non-vector narrow cases, there could conceivably be a difference someday motivating decoupling narrow-N from narrow-M, but this is sufficiently far-fetched that it's best to leave that to be dealt with when a concrete use case arises, and enjoy the benefit of smaller code until then.
@dcaballe
Contributor Author

Interesting... It looks like we don't have vecmat ops in CI, because my latest changes lead to terrible performance due to missing canonicalization patterns on the mmt4d side

@hanhanW
Contributor

hanhanW commented Nov 15, 2023

> Interesting... It looks like we don't have vecmat ops in CI, because my latest changes lead to terrible performance due to missing canonicalization patterns on the mmt4d side

I thought we have those ops in GPT2 models? Otherwise we wouldn't have seen the 10x improvements from @NatashaKnk's matvec/vecmat commit #15273 (comment)

@dcaballe
Contributor Author

Oh, it's because we are not testing DT without UK then :)

@dcaballe dcaballe marked this pull request as ready for review November 16, 2023 20:10
@dcaballe dcaballe requested a review from hanhanW as a code owner November 16, 2023 20:10
@dcaballe
Contributor Author

dcaballe commented Nov 16, 2023

The generated code looks great with this PR and #15621 :)

@dcaballe dcaballe requested a review from bjacob November 16, 2023 20:11
@dcaballe
Contributor Author

The ExpandVector changes should go away after rebasing.

Comment on lines 175 to 180

```cpp
// Code generation tile sizes.
return {
    TileMxNxK{16, 16, 1}, // Aim to use VFMADD* (zmm).
    TileMxNxK{8, 32, 1},  // Truncation of the above.
    TileMxNxK{4, 64, 1},  // Truncation of the above.
    TileMxNxK{2, 64, 1},  // Truncation of the above.
    TileMxNxK{1, 128, 1}, // Truncation of the above.
};
```
Contributor

I remember that you mentioned why these have different tile sizes compared to ukernels, but I forget now. Could you add a comment on this?

Contributor Author

Done!

Comment on lines 178 to 179

```cpp
TileMxNxK{16, 16, 1}, // Aim to use VFMADD* (zmm).
TileMxNxK{8, 32, 1},  // Truncation of the above.
TileMxNxK{4, 64, 1},  // Truncation of the above.
TileMxNxK{2, 64, 1},  // Truncation of the above.
TileMxNxK{1, 128, 1}, // Truncation of the above.
```
Contributor

@bjacob bjacob Nov 17, 2023

Please only comment `// Truncation of the above` when it really is a truncation of the above. Here, going from `{16, 16, 1}` to `{8, 32, 1}` is not, since it gets wider in N0: you get 2x narrower M0 but 2x wider N0, so still 16 zmm accumulator registers.

Contributor

(Which is what the comment just above is saying).

Terrible performance is expected
@dcaballe
Contributor Author

dcaballe commented Dec 1, 2023

Ready to land... Can anybody approve, please? :)

@dcaballe
Contributor Author

dcaballe commented Dec 1, 2023

(and merge it if I'm not around. Thanks!)

@bjacob
Contributor

bjacob commented Dec 1, 2023

@dcaballe I took a look this morning, but the current diff after the force-push doesn't seem different from the earlier one before the review comments?

@dcaballe
Contributor Author

dcaballe commented Dec 1, 2023

Something got lost in the process... 1 sec...

@dcaballe
Contributor Author

dcaballe commented Dec 1, 2023

Forgot to commit the changes, sorry :)

@bjacob
Contributor

bjacob commented Dec 1, 2023

The regressed total dispatch sizes are probably the result of LLVM unrolling, which we intend to disable soon anyway.

@dcaballe dcaballe merged commit 0e28a6d into iree-org:main Dec 1, 2023
59 checks passed
@dcaballe dcaballe deleted the mmt4d-vecmat branch December 1, 2023 15:00
Comment on lines +173 to +180

```cpp
// Code generation tile sizes.
return {
    TileMxNxK{16, 16, 1}, // Aim to use VFMADD* (zmm).
    TileMxNxK{8, 32, 1},  // Use same number of accumulators.
    TileMxNxK{4, 64, 1},  // Use same number of accumulators.
    TileMxNxK{2, 64, 1},  // Use half the number of accumulators.
    TileMxNxK{1, 128, 1}, // Use half the number of accumulators.
};
```
Contributor

It looks like the comment you added is gone after the rebase?

Contributor

I'm talking about the review comment before your force-push: #15421 (comment)

Contributor Author

Not sure, but I addressed the feedback provided.

Contributor

I think it was not just `// Code generation tile sizes.` before the force-push. You added a comment about why these have different tile sizes compared to ukernels, but it does not show up after the force-push.

ramiro050 pushed a commit to ramiro050/iree that referenced this pull request Dec 19, 2023
…rializeEncoding for narrow shapes. (iree-org#15431)

ramiro050 pushed a commit to ramiro050/iree that referenced this pull request Dec 19, 2023
…ree-org#15421)

This PR improves the tile sizes for vecmat/matvec flavors of mmt4d.
Labels
`benchmarks:android-cpu` (Run default Android CPU benchmarks), `benchmarks:x86_64` (Run default x86_64 benchmarks)
3 participants