[TOPI] Bugfix arm_cpu schedule_conv2d_spatial_pack_nhwc schedule #14003
Changes from 4 commits
ab29a0d
3ca715e
4308191
140e0ba
5a0ec43
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -20,6 +20,7 @@ | |
import tvm | ||
from tvm import te | ||
from tvm import autotvm | ||
from tvm.autotvm.task.space import SplitEntity, OtherOptionEntity, AnnotateEntity, ReorderEntity | ||
from .. import nn | ||
from ..utils import get_const_tuple | ||
from ..nn.utils import get_const_int, get_pad_tuple | ||
|
@@ -302,9 +303,29 @@ def conv2d_spatial_pack_nhwc(cfg, data, kernel, strides, padding, dilation, out_ | |
) | ||
|
||
cfg.define_annotate("ann_reduce", [kh, kw], policy="try_unroll") | ||
cfg.define_annotate("ann_spatial", [ohi, owi, oci], policy="try_unroll_vec") | ||
cfg.define_annotate("ann_spatial", [owi, oci], policy="try_unroll_vec") | ||
# ==================================================================== | ||
|
||
# If there are no tuning records, use this config | ||
if cfg.is_fallback: | ||
|
||
def _tile_size(axis, candidates): | ||
for candidate in candidates: | ||
tiles_divisible_by_candidate = axis % candidate == 0 | ||
if tiles_divisible_by_candidate: | ||
return candidate | ||
return 1 | ||
|
||
# Tile size 8 results in efficient vectorization for these schedules. | ||
# If the axis is not divisible by 8, try 4 | ||
cfg["tile_oh"] = SplitEntity([-1, 1]) | ||
cfg["tile_ow"] = SplitEntity([-1, _tile_size(OW, [8, 4])]) | ||
cfg["tile_co"] = SplitEntity([-1, _tile_size(OC, [8, 4])]) | ||
cfg["ann_spatial"] = AnnotateEntity(["none", "vec"]) | ||
cfg["ann_reduce"] = AnnotateEntity(["none", "none"]) | ||
cfg["reorder_conv"] = ReorderEntity([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) | ||
cfg["compat"] = OtherOptionEntity(0) | ||
|
||
OCI = cfg["tile_co"].size[-1] | ||
OHI = cfg["tile_oh"].size[-1] | ||
OWI = cfg["tile_ow"].size[-1] | ||
|
@@ -390,7 +411,7 @@ def schedule_conv2d_spatial_pack_nhwc(cfg, s, op, output): | |
data_vec = conv.op.input_tensors[0] | ||
kernel_vec = conv.op.input_tensors[1] | ||
data_pad = data_vec.op.input_tensors[0] | ||
OHI = cfg["tile_oh"].size[-1] | ||
|
||
OWI = cfg["tile_ow"].size[-1] | ||
OCI = cfg["tile_co"].size[-1] | ||
|
||
|
@@ -402,20 +423,18 @@ def schedule_conv2d_spatial_pack_nhwc(cfg, s, op, output): | |
oho, ohi = cfg["tile_oh"].apply(s, output, oh) | ||
owo, owi = cfg["tile_ow"].apply(s, output, ow) | ||
s[output].reorder(n, oho, owo, oco, ohi, owi, oci) | ||
cfg["ann_spatial"].apply( | ||
s, output, [ohi, owi, oci], axis_lens=[OHI, OWI, OCI], max_unroll=16, cfg=cfg | ||
) | ||
cfg.define_knob("compat", [0, 1, 2]) | ||
if cfg["compat"].val < 2: | ||
compat_axis = [owo, oco][cfg["compat"].val] # pylint: disable=R1706 | ||
s[conv].compute_at(s[output], compat_axis) | ||
cfg["ann_spatial"].apply(s, output, [owi, oci], axis_lens=[OWI, OCI], max_unroll=16, cfg=cfg) | ||
This is a great finding ⭐ Can we generalize this to other schedules where the split doubles (maybe some of) the axes and unrolling higher up degrades performance? Theoretically it sounds reasonable. To be clear, I'm not asking for any modifications 😸, but this is something to pay attention to when writing CPU schedules.

Yes. I haven't looked at other schedules in detail, but I assume there are opportunities to limit the search space so that fewer attempts fail during tuning. From eyeballing the tuning results, unrolling/vectorizing across the outer axes never seemed to succeed unless the tiles corresponding to the inner axes had size 1 and were optimised out, which essentially corresponds to not tiling. These kinds of configs were not particularly performant though.

Thanks for confirming.
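To make the search-space argument concrete, here is a minimal sketch in plain Python (it does not import TVM). It assumes that a `try_unroll_vec`-style policy lets each axis be `none`, `unroll` or `vec`, with at most one vectorized axis. That rule and the helper name `count_annotations` are assumptions for illustration, not AutoTVM's actual enumeration code.

```python
from itertools import product


def count_annotations(num_axes, choices=("none", "unroll", "vec")):
    """Count annotation combinations with at most one vectorized axis.

    The "at most one vec" rule is an assumption made for this sketch.
    """
    return sum(
        1
        for combo in product(choices, repeat=num_axes)
        if combo.count("vec") <= 1
    )


before = count_annotations(3)  # annotating [ohi, owi, oci]
after = count_annotations(2)   # annotating [owi, oci]
print(before, after)
```

Under that assumption, dropping `ohi` from the annotated axes shrinks this one knob from 20 candidates to 8, before it is multiplied by the other knobs in the space.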
||
|
||
cfg.define_knob("compat", [0, 1]) | ||
compat_axis = [owo, oco][cfg["compat"].val] # pylint: disable=R1706 | ||
s[conv].compute_at(s[output], compat_axis) | ||
paxis = s[output].fuse(n, oho) | ||
s[output].parallel(paxis) | ||
|
||
# schedule conv | ||
n, oho, owo, oco, ohi, owi, oci = s[conv].op.axis | ||
ic, kh, kw = s[conv].op.reduce_axis | ||
cfg["reorder_conv"].apply(s, conv, [n, oho, owo, oco, kh, kw, ohi, owi, ic, oci]) | ||
cfg["reorder_conv"].apply(s, conv, [n, oho, owo, oco, kh, kw, ic, ohi, owi, oci]) | ||
cfg["ann_reduce"].apply( | ||
s, | ||
conv, | ||
|
@@ -424,33 +443,22 @@ def schedule_conv2d_spatial_pack_nhwc(cfg, s, op, output): | |
max_unroll=16, | ||
cfg=cfg, | ||
) | ||
cfg["ann_spatial"].apply( | ||
s, conv, [ohi, owi, oci], axis_lens=[OHI, OWI, OCI], max_unroll=16, cfg=cfg | ||
) | ||
if cfg["compat"].val < 2: | ||
compat_axis = [owo, oco][cfg["compat"].val] # pylint: disable=R1706 | ||
s[kernel_vec].compute_at(s[conv], compat_axis) | ||
s[data_vec].compute_at(s[conv], compat_axis) | ||
|
||
if not autotvm.GLOBAL_SCOPE.in_tuning: | ||
# schedule kernel pack | ||
oco, kh, kw, ic, oci = kernel_vec.op.axis | ||
s[kernel_vec].vectorize(oci) | ||
s[kernel_vec].unroll(ic) | ||
if cfg["compat"].val == 2: | ||
s[kernel_vec].parallel(oco) | ||
|
||
# schedule data pack | ||
cfg["ann_spatial"].apply(s, conv, [owi, oci], axis_lens=[OWI, OCI], max_unroll=16, cfg=cfg) | ||
|
||
# schedule data_vec, data_pad and kernel_vec | ||
compat_axis = [owo, oco][cfg["compat"].val] # pylint: disable=R1706 | ||
s[kernel_vec].compute_at(s[conv], compat_axis) | ||
s[data_vec].compute_at(s[conv], compat_axis) | ||
|
||
# Inlining kernel vec brings a performance improvement, but the tuner seems to not | ||
# like it, so inline only when we are using the fallback config | ||
if cfg.is_fallback: | ||
s[kernel_vec].compute_inline() | ||
Out of curiosity, what does it mean to inline the schedule of a constant?

It is essentially about whether we create an additional intermediate buffer to store the reorganised weights data. This is what it looks like when we don't inline:
and this is what happens when we do:
The first one has a loop to fill up the |
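As a rough plain-Python analogy of the two variants (not the lowered TVM IR the reply refers to), the sketch below contrasts filling an intermediate packed-weights buffer with substituting the packed index expression directly into the convolution loop. The toy sizes, the HWIO-like weight layout, and the function names are assumptions chosen for illustration.

```python
# Toy sizes; the layouts below are assumptions for illustration only.
KH, KW, IC, OC, OCI = 1, 1, 2, 4, 2
OCO = OC // OCI

# kernel[kh][kw][ic][oc]: weights in an HWIO-like layout.
kernel = [[[[float(ic * OC + oc) for oc in range(OC)]
            for ic in range(IC)] for _ in range(KW)] for _ in range(KH)]
x = [1.0, 2.0]  # one input pixel with IC channels


def conv_with_packed_buffer():
    # Not inlined: a separate loop nest first fills an intermediate buffer
    # kernel_vec[oco][kh][kw][ic][oci] ...
    kernel_vec = [[[[[kernel[kh][kw][ic][oco * OCI + oci]
                      for oci in range(OCI)]
                     for ic in range(IC)]
                    for kw in range(KW)]
                   for kh in range(KH)]
                  for oco in range(OCO)]
    # ... and the convolution then reads from that buffer.
    out = [0.0] * OC
    for oco in range(OCO):
        for ic in range(IC):
            for oci in range(OCI):
                out[oco * OCI + oci] += x[ic] * kernel_vec[oco][0][0][ic][oci]
    return out


def conv_inlined():
    # Inlined: no intermediate buffer; the packed index expression is
    # substituted directly into the convolution loop.
    out = [0.0] * OC
    for oco in range(OCO):
        for ic in range(IC):
            for oci in range(OCI):
                out[oco * OCI + oci] += x[ic] * kernel[0][0][ic][oco * OCI + oci]
    return out
```

Both functions compute the same output; the difference is only whether the reorganised weights are materialised in an extra buffer first.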
||
|
||
if data_vec.op.name == "data_vec_undilated": | ||
n, oho, owo, kh, kw, ic, ohi, owi = s[data_vec].op.axis | ||
s[data_vec].vectorize(owi) | ||
s[data_vec].unroll(ohi) | ||
else: | ||
n, oho, owo, ohi, owi, ic = s[data_vec].op.axis | ||
s[data_vec].vectorize(ic) | ||
s[data_vec].unroll(owi) | ||
if cfg["compat"].val == 2: | ||
paxis = s[data_vec].fuse(n, oho) | ||
s[data_vec].parallel(paxis) | ||
s[data_pad].compute_at(s[data_vec], n) | ||
|
||
return s |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -95,7 +95,7 @@ def test_model_platform_templating(project_dir, project): | |
# TVM causes the amount of memory needed to decrease. | ||
workspace_size = int(workspace_size_defs[0]) | ||
assert workspace_size < 30000 | ||
assert workspace_size > 10000 | ||
assert workspace_size > 9000 | ||
cc @mehrdadh this gave an error in the upstream build

Yeah, that's fine. If it were the upper bound, I would've been concerned.
||
|
||
|
||
def test_import_rerouting(project_dir, project): | ||
|
Question: why 4 and 8 only? Do 8x2 tile sizes not help some dims 🤔?

nit: can we add a comment about using those numbers?

Added a comment :)

There is no perfect config that works for all conv2d shapes and all CPUs (that's why we tune), but chunks of 4 or 8 elements conveniently fit into vectors (LLVM vectors of 8 elements get broken down into two vector instructions). That applies to float32 data though; I didn't analyze the int8 performance of these schedules. However, from what I can see, this is the only conv2d NHWC floating-point schedule and the default one for this layout, whilst for int8 we use other schedules. I don't claim it's the best possible config, but it worked well on a range of common models I looked at, and it is easy to change and play around with.
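The fallback tile choice can be tried out in isolation. The sketch below mirrors the `_tile_size` helper from the diff as a standalone function and adds the register arithmetic behind the "two vector instructions" remark. The 128-bit register width (typical of Arm NEON) and the example axis lengths are assumptions for illustration.

```python
def tile_size(axis_len, candidates=(8, 4)):
    """Return the first candidate that divides axis_len evenly, else 1."""
    for candidate in candidates:
        if axis_len % candidate == 0:
            return candidate
    return 1


# Assumption: 128-bit vector registers, so 4 float32 lanes per register.
LANES_F32 = 128 // 32

for axis_len in (56, 28, 7):
    tile = tile_size(axis_len)
    registers = -(-tile // LANES_F32)  # ceiling division
    print(f"axis={axis_len} tile={tile} float32 registers per tile={registers}")
```

An axis of length 56 gets a tile of 8 (two registers' worth of float32 values), 28 falls back to 4 (one register), and 7 falls back to 1, which amounts to not tiling that axis.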