[SVE][TOPI] Add conv2d NHWC hybrid SVE schedule for arm_cpu
#16899
Conversation
Nice work @Anndrey24, I haven't finished reviewing yet, but I had a few sporadic comments about the scheduling.
This commit adds an `arm_cpu` conv2d NHWC schedule which generates SVE instructions by extending the hybrid GeMM approach implemented in apache#16106 to use scalable expressions as splitting factors.

Various vscale-related fixes needed to implement the schedule are also included, such as:
- adding vscale bounds in the `ConstIntBoundAnalyzer` and `IntervalSetEvaluator`
- simplifying `MinNode` and `MaxNode` that have scalable expression operands in `RewriteSimplifier`, which would appear when defining the shape of a buffer padded to be a multiple of vscale and in its respective buffer access indices (e.g. `C_1 = T.Buffer((1024 * (T.vscale() * 16 + 256 - 16 % T.vscale() * 16),), data=C)` instead of `C_1 = T.Buffer((1024 * (T.max(255, T.vscale() * 16 + 255 - 16 % T.vscale() * 16) + 1),), data=C)`)

The correctness of the new schedule is checked using a TOPI test, while the presence of generated SVE instructions is verified by a codegen test. The new `rewrite_simplify` rules are also covered by additional test cases.
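For a sense of what "scalable expressions as splitting factors" means in practice, here is a minimal TE sketch (not the conv2d schedule from this patch; it assumes the `tvm.tir.vscale()` Python binding and an SVE-capable LLVM target for the final build):

```python
# A minimal sketch of a scalable splitting factor (not the actual conv2d
# schedule): the inner axis length is 4 * vscale, so vectorising it targets the
# runtime SVE vector length rather than a fixed number of lanes.
import tvm
from tvm import te

A = te.placeholder((1024,), name="A", dtype="float32")
B = te.compute((1024,), lambda i: A[i] + tvm.tir.const(1.0, "float32"), name="B")

sch = te.create_schedule(B.op)
vscale = tvm.tir.vscale()  # runtime multiple of the 128-bit SVE granule
xo, xi = sch[B].split(B.op.axis[0], factor=4 * vscale)  # scalable splitting factor
sch[B].vectorize(xi)

# Building needs an SVE-capable target, e.g. (target string is an assumption):
#   tvm.build(sch, [A, B], target="llvm -mtriple=aarch64-linux-gnu -mattr=+sve")
```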
LGTM!
@@ -369,6 +370,8 @@ class ConstIntBoundAnalyzer::Impl
       return VisitLeftShift(op);
     } else if (op->op.same_as(tir::builtin::bitwise_and())) {
       return VisitBitwiseAnd(op);
+    } else if (op->op.same_as(tir::builtin::vscale()) && TargetHasSVE()) {
+      return MakeBound(1, 16);
nit: we could make the upper bound the length of `kAArch64VScaleValues` in case more values are added in the future. Happy for this to be added in a later patch, though.
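As a rough check of the bound above (a sketch rather than a test from this patch; it assumes the analyzer picks up the SVE feature from the enclosing target scope and that `tvm.tir.vscale()` is exposed in Python):

```python
# Sketch: with an SVE-capable target in scope, vscale is bounded by [1, 16],
# so vscale * 16 should be bounded by [16, 256] instead of being unknown.
import tvm

analyzer = tvm.arith.Analyzer()
with tvm.target.Target("llvm -mtriple=aarch64-linux-gnu -mattr=+sve"):
    bound = analyzer.const_int_bound(tvm.tir.vscale() * 16)
    print(bound.min_value, bound.max_value)  # expected 16 256 if SVE is detected
```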
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Anndrey24, looks great! 🚀
Thanks @Anndrey24, @ekalda!
Addresses a nitpick comment mentioned here: apache#16899 (comment)

Change-Id: I5b3dbe2b08dbf3b498b55fb89d9bfc112049baa4
…pu` targets (#16951)

This patch partly reverts the unification of scalable and non-scalable scheduling of conv2d NHWC for `arm_cpu` targets introduced in #16899.

The non-scalable schedule for float32 splits the N axis (corresponding to the number of output channels) by 16 in both the unified and the non-unified schedule versions, and then additionally splits the inner partitions by 4 in only the non-unified version to which this patch reverts (first added in #16106). The two versions' behaviour would be equivalent if none of the padding on the N axis were removed during lowering; however, we allow that to happen because it proved to increase performance for very small convolutions. As it stands, there seems to be a regression in cases where the datatype is float32 and the number of output channels is greater than 16, a multiple of 4, and not a multiple of 16: even with the padding removed, the non-unified schedule can still vectorise over 4 elements, while the unified version can no longer vectorise over 16 elements.

Since all of the conv2d NHWC hybrid TOPI test cases used numbers of output channels either less than 16 or divisible by 16, this patch also adds a new case which falls in the aforementioned regression area.
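To make the regression window concrete, here is a toy calculation (the value 20 below is an assumed example, not necessarily the value used by the new test case):

```python
# Toy illustration: out_channels greater than 16, a multiple of 4, but not a
# multiple of 16, with the padding on the N axis removed during lowering.
out_channels = 20
tail = out_channels % 16  # 4 output channels left over after the split by 16

# Non-unified schedule: the extra split of the inner partition by 4 means the
# 4-element tail still fills whole 4-lane vectors.
nonunified_tail_vectorised = tail % 4 == 0   # True

# Unified schedule: only the split by 16 exists, so the 4-element tail is
# narrower than the 16-lane vector width and falls back to scalar code.
unified_tail_vectorised = tail % 16 == 0     # False

print(tail, nonunified_tail_vectorised, unified_tail_vectorised)
```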
cc @ekalda @lhutton1 @Lunderberg