
[TOPI] Bugfix arm_cpu schedule_conv2d_spatial_pack_nhwc schedule #14003

Merged: 5 commits into apache:main on Feb 22, 2023

Conversation

@ekalda (Contributor) commented Feb 15, 2023

No changes to the compute; various bugfixes and improvements to the corresponding NHWC schedule:

  • There is currently a block that is not run as part of the tuning trials, but does get run during compilation with tuning logs. Since a lot of unrolling and vectorization happens there, for some conv2d operators the extra vectorizing and unrolling results in about an 18x size increase in asm and can take around 10 minutes per operator to compile. That essentially makes whole networks uncompilable, so remove that block.
  • There is no fallback config and no NHWC logs in TopHub, so add a fallback config (see the sketch after this list). This significantly reduces the no-tuning compile time, e.g. by about 10x for mobilenet.
  • The order of axes we passed to reorder_config was different from the order used to define the reorder. Judging by the compute definition and by tuning results of whole networks, this appears to have been a bug.
  • Constrain potential unrolling to the OWI and OCI axes, as unrolling across OHI results in uncompilably huge code size. This change reduces the proportion of unsuccessful tuning trials from about 50% to about 20%.
  • Other minor tweaks.
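
A minimal sketch of the fallback pattern from the second bullet, assuming a small helper function; the name `_fallback_schedule`, its parameters `OW`/`OC`, and the exact tile candidates are illustrative, not necessarily what was merged:

```python
from tvm.autotvm.task.space import SplitEntity

def _fallback_schedule(cfg, OW, OC):
    """Fill in sensible splits when no tuning log matches this workload."""
    if not cfg.is_fallback:
        return
    cfg["tile_oh"] = SplitEntity([-1, 1])
    # Prefer an inner tile of 8 (then 4) so the innermost loop maps cleanly
    # onto vector registers; fall back to no tiling if neither divides evenly.
    for name, extent in (("tile_ow", OW), ("tile_co", OC)):
        inner = next((c for c in (8, 4) if extent % c == 0), 1)
        cfg[name] = SplitEntity([-1, inner])
```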

@tvm-bot (Collaborator) commented Feb 15, 2023

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

  • No users to tag found in teams: topi. See #10317 for details.

Generated by tvm-bot

@ekalda (Contributor, Author) commented Feb 15, 2023

Looks like tile_co as a name is deeply ingrained into the codebase...
@@ -95,7 +95,7 @@ def test_model_platform_templating(project_dir, project):
     # TVM causes the amount of memory needed to decrease.
     workspace_size = int(workspace_size_defs[0])
     assert workspace_size < 30000
-    assert workspace_size > 10000
+    assert workspace_size > 9000
@ekalda (Contributor, Author):

cc @mehrdadh, this gave an error in the upstream build:
[2023-02-16T17:18:08.641Z] E assert 9776 > 10000
Is it ok to reduce the minimum workspace size there?

Member:

Yeah, that's fine. If it was the upper bound I would've been concerned.

@Mousius (Member) left a comment

This seems pretty outstanding to me, tests appear to be passing, which indicates it at least functions - I don't think we have any performance regression checks so I'll just trust @ekalda.

Given the Arduino tests are passing as well, my guess is @ekalda inadvertently lowered memory usage, which is out of scope of this PR - but I'll let that slide.

Leaving this open for @mehrdadh or @guberti to comment on the Arduino memory, but otherwise I'll aim to merge this tomorrow.

@ashutosh-arm (Contributor) left a comment

Thanks @ekalda. This is a great improvement over the existing schedule. I am sure this commit will act as a guiding change for other ops in the future. A few questions below for my own understanding.

# Inlining kernel vec brings a performance improvement, but the tuner seems to not
# like it, so inline only when we are using the fallback config
if cfg.is_fallback:
    s[kernel_vec].compute_inline()
Contributor:

Out of curiosity, what does it mean to inline the schedule of a constant?

@ekalda (Contributor, Author):

It is essentially about whether we create an additional intermediate buffer to store the reorganised weights data. This is what it looks like when we don't inline:

                for ax2_outer in range(16):
                    for i2, i3 in T.grid(8, 72):
                        cse_var_1: T.int32 = i2 * 72
                        PadInput_1 = T.Buffer((576,), data=PadInput)
                        p0_1 = T.Buffer((1179648,), data=p0)
                        PadInput_1[cse_var_1 + i3] = p0_1[ax0_ax1_outer_fused * 9216 + ax2_outer * 576 + cse_var_1 + i3]
                    kernel_vec_1 = T.Buffer((1728,), data=kernel_vec)
                    for oco, ic, oci in T.grid(3, 72, 8):
                        p1_1 = T.Buffer((1728,), data=p1)
                        kernel_vec_1[oco * 576 + ic * 8 + oci] = p1_1[ic * 24 + oco * 8 + oci]
                    conv_1 = T.Buffer((24,), "float32x8", data=conv)
                    for oco in range(3):
                        for owi_init in range(8):
                            conv_1[oco * 8 + owi_init] = T.Broadcast(T.float32(0), 8)
                        for ic, owi in T.grid(72, 8):
                            cse_var_2: T.int32 = oco * 8 + owi
                            PadInput_1 = T.Buffer((576,), data=PadInput)
                            conv_1[cse_var_2] = T.call_llvm_pure_intrin("float32x8", T.uint32(141), T.uint32(3), T.Broadcast(PadInput_1[owi * 72 + ic], 8), kernel_vec_1[oco * 576 + ic * 8:oco * 576 + ic * 8 + 8], conv_1[cse_var_2])

and this is what happens when we do:

            for ax2_outer in range(16):
                for i2, i3 in T.grid(8, 72):
                    cse_var_1: T.int32 = i2 * 72
                    PadInput_1 = T.Buffer((576,), data=PadInput)
                    p0_1 = T.Buffer((1179648,), data=p0)
                    PadInput_1[cse_var_1 + i3] = p0_1[ax0_ax1_outer_fused * 9216 + ax2_outer * 576 + cse_var_1 + i3]
                conv_1 = T.Buffer((24,), "float32x8", data=conv)
                for oco in range(3):
                    for owi_init in range(8):
                        conv_1[oco * 8 + owi_init] = T.Broadcast(T.float32(0), 8)
                    for ic, owi in T.grid(72, 8):
                        cse_var_3: T.int32 = oco * 8
                        cse_var_2: T.int32 = cse_var_3 + owi
                        PadInput_1 = T.Buffer((576,), data=PadInput)
                        p1_1 = T.Buffer((1728,), data=p1)
                        conv_1[cse_var_2] = T.call_llvm_pure_intrin("float32x8", T.uint32(141), T.uint32(3), T.Broadcast(PadInput_1[owi * 72 + ic], 8), p1_1[ic * 24 + cse_var_3:ic * 24 + cse_var_3 + 8], conv_1[cse_var_2])

The first one has a loop that fills the kernel_vec_1 buffer, which is then read when computing the conv_1 values, while in the second case we don't create kernel_vec_1 at all and read the weights data (the p1_1 input variable) directly.
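
For anyone following along, here is a minimal, self-contained sketch of the same mechanism on a toy elementwise pipeline (the names `data`, `packed` and `out` are illustrative; `packed` plays the role of `kernel_vec`):

```python
import tvm
from tvm import te

n = 64
data = te.placeholder((n,), name="data")
# A repacking stage, analogous to kernel_vec in the conv2d schedule.
packed = te.compute((n,), lambda i: data[i] * 2.0, name="packed")
out = te.compute((n,), lambda i: packed[i] + 1.0, name="out")

s = te.create_schedule(out.op)
# Without this call the lowered TIR allocates an intermediate buffer for
# `packed` and fills it in its own loop; with it, the multiplication is
# folded straight into the loop that computes `out`.
s[packed].compute_inline()
print(tvm.lower(s, [data, out], simple_mode=True))
```

Printing the lowered module with and without the `compute_inline` call shows the same buffer-versus-direct-access difference as the two TIR dumps above.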

if cfg["compat"].val < 2:
    compat_axis = [owo, oco][cfg["compat"].val]  # pylint: disable=R1706
    s[conv].compute_at(s[output], compat_axis)
cfg["ann_spatial"].apply(s, output, [owi, oci], axis_lens=[OWI, OCI], max_unroll=16, cfg=cfg)
@ashutosh-arm (Contributor) commented Feb 20, 2023

This is a great finding ⭐ Can we generalize this to other schedules where the split doubles (maybe some of) the axes and unrolling higher up degrades performance? Theoretically it sounds reasonable. To be clear, not asking for any modifications 😸, but this is something to pay attention to while writing CPU schedules.

@ekalda (Contributor, Author):

Yes, I haven't looked at other schedules in detail, but I assume there are opportunities to limit the search space in favour of having fewer failed attempts during tuning. From eyeballing the tuning results, it seemed like unrolling/vectorizing across the outer axes never succeeded, unless the tile values corresponding to the inner axes were 1 and got optimised out, essentially corresponding to not tiling. Those kinds of configs were not particularly performant though.
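
To make the pattern concrete, here is a hedged sketch of a toy AutoTVM template (the operator and every name in it are made up for illustration); only the idea of passing just the inner axes to `define_annotate`/`apply` mirrors the schedule change being discussed:

```python
import tvm
from tvm import te, autotvm

def toy_schedule(n=64):
    A = te.placeholder((n, n), name="A")
    B = te.compute((n, n), lambda i, j: A[i, j] * 2.0, name="B")
    s = te.create_schedule(B.op)

    cfg = autotvm.get_config()
    i, j = s[B].op.axis
    cfg.define_split("tile_i", i, num_outputs=2)
    cfg.define_split("tile_j", j, num_outputs=2)
    io, ii = cfg["tile_i"].apply(s, B, i)
    jo, ji = cfg["tile_j"].apply(s, B, j)
    s[B].reorder(io, jo, ii, ji)

    # Only the inner axes are exposed to the annotation knob, so the tuner may
    # unroll/vectorize ii and ji but never the long outer axes io and jo.
    cfg.define_annotate("ann_inner", [ii, ji], policy="try_unroll_vec")
    cfg["ann_inner"].apply(
        s, B, [ii, ji],
        axis_lens=[cfg["tile_i"].size[-1], cfg["tile_j"].size[-1]],
        max_unroll=16,
        cfg=cfg,
    )
    return s, [A, B]
```

In the conv2d schedule the same effect comes from limiting `ann_spatial` to the OWI/OCI axes, as in the diff above.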

Contributor:

Thanks for confirming.


cfg["tile_oh"] = SplitEntity([-1, 1])
cfg["tile_ow"] = SplitEntity([-1, _tile_size(OW, [8, 4])])
cfg["tile_co"] = SplitEntity([-1, _tile_size(OC, [8, 4])])
Contributor:

Question: why only 4 and 8? Do 8x2 tile sizes not help some dims 🤔?

Contributor:

nit: can we add a comment about using those numbers?

@ekalda (Contributor, Author):

Added a comment :)

There is no perfect config that works for all conv2d shapes and all CPUs (that's why we tune), but chunks of 4 or 8 elements conveniently fit into vectors (the LLVM vectors of 8 elements get broken down into two vector instructions). That applies to float32 data though; I didn't analyze the int8 performance of these schedules. However, from what I can see, this is the only conv2d NHWC floating point schedule and the default one for this layout, whilst for int8 we use other schedules. I don't claim it's the best possible config, but it worked well on a range of common models I looked at, and it is easy to change and play around with.
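
To make the vector-width point concrete, here is a tiny illustrative sketch (a toy elementwise op, not the conv2d compute): vectorizing an 8-wide float32 inner axis lowers to `float32x8` operations, which, as noted above, the backend then splits into two native vector instructions:

```python
import tvm
from tvm import te

A = te.placeholder((64, 8), name="A")  # float32 by default
B = te.compute((64, 8), lambda i, j: A[i, j] * 2.0, name="B")

s = te.create_schedule(B.op)
i, j = s[B].op.axis
s[B].vectorize(j)  # j has extent 8, so the body becomes float32x8 ops
print(tvm.lower(s, [A, B], simple_mode=True))
```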

@ekalda (Contributor, Author) commented Feb 21, 2023

> This seems pretty outstanding to me, tests appear to be passing, which indicates it at least functions - I don't think we have any performance regression checks so I'll just trust @ekalda.
>
> Given the Arduino tests are passing as well, my guess is @ekalda inadvertently lowered memory usage, which is out of scope of this PR - but I'll let that slide.
>
> Leaving this open for @mehrdadh or @guberti to comment on the Arduino memory, but otherwise I'll aim to merge this tomorrow.

Yes, the lower memory footprint of the operator is a direct consequence of removing the explicit unrolling and vectorization across long axes, but I think it is a good thing; I can't think of a case where, in the absence of a performance improvement, we would want a larger code size.
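
The code-size effect is easy to reproduce on a toy schedule (illustrative example, not the conv2d operator): marking a long axis for unrolling multiplies the number of statements in the lowered TIR and, downstream, the amount of generated asm.

```python
import tvm
from tvm import te

A = te.placeholder((256,), name="A")
B = te.compute((256,), lambda i: A[i] + 1.0, name="B")

s = te.create_schedule(B.op)
io, ii = s[B].split(s[B].op.axis[0], factor=64)
# Unrolling the 64-wide inner axis replaces one loop with 64 separate
# statements in the lowered TIR; compare the printout with and without it.
s[B].unroll(ii)
print(tvm.lower(s, [A, B], simple_mode=True))
```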

@Mousius (Member) commented Feb 21, 2023

I think @mehrdadh / @guberti might've been on holiday yesterday, so leave this open a bit longer 😸

@guberti (Member) left a comment

Great find, LGTM!

@mehrdadh (Member):

LGTM, thanks!

@Mousius Mousius merged commit fbd641c into apache:main Feb 22, 2023
@Mousius (Member) commented Feb 22, 2023

Fantastic! This is now merged @ekalda, thanks to @ashutosh-arm, @mehrdadh and @guberti - team effort 😸!

yongwww pushed a commit to yongwww/tvm that referenced this pull request Feb 27, 2023
[TOPI] Bugfix arm_cpu schedule_conv2d_spatial_pack_nhwc schedule (apache#14003)
@ekalda ekalda deleted the nhwc-schedule-fix branch March 7, 2023 15:42