[TOPI][ARM] Improve injective schedule #2801

Merged (1 commit) on Mar 18, 2019

Conversation

hlu1
Contributor

@hlu1 hlu1 commented Mar 13, 2019

The generic injective schedule is not vectorized and is therefore slow on ARM CPUs. With vectorization, it can run 2-3x faster. For example, for an upsample_relu layer with 48 x 48 x 48 (C, H, W), the vectorized code runs at 0.003 ms/iter on a Raspberry Pi, compared to 0.008 ms/iter for the generic schedule.

@hlu1
Contributor Author

hlu1 commented Mar 13, 2019

@ajtulloch please review :)

@FrozenGene
Member

FrozenGene commented Mar 13, 2019

Yeah, this is very useful, e.g. for the layout_transform op.

@hlu1 What happens with (io, ii) = s[x].split(list(s[x].op.axis)[-1], 8) when the extent of s[x].op.axis[-1] is less than 8? And what about weaker ARM CPUs like the A9 (maybe 4 is better)? I have done it like this:

    if len(s[x].op.axis) >= 5:  # very useful when we have NCHWxC
        fused = s[x].fuse(s[x].op.axis[0], s[x].op.axis[1], s[x].op.axis[2])
        s[x].parallel(fused)
    elif len(s[x].op.axis) >= 3:
        fused = s[x].fuse(s[x].op.axis[0], s[x].op.axis[1])
        s[x].parallel(fused)
    else:
        s[x].parallel(s[x].op.axis[0])
    s[x].vectorize(list(s[x].op.axis)[-1])
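
To make the question about small innermost axes concrete, here is a minimal sketch in plain Python (not TVM code) of what splitting a loop by a fixed factor does to the loop extents. `split_extents` is a hypothetical helper introduced only for illustration. It assumes a split yields an outer loop of ceil(extent / factor) iterations and an inner loop of `factor` iterations, with out-of-range iterations guarded off; the code TVM actually generates may differ in detail.

```python
import math


def split_extents(extent, factor):
    """Model splitting a loop of `extent` iterations by `factor`.

    Returns (outer, inner, guarded_off), where guarded_off counts the
    inner-loop iterations that fall past the end of the original axis.
    This is an illustrative model, not a TVM API.
    """
    outer = math.ceil(extent / factor)
    inner = factor
    guarded_off = outer * inner - extent
    return outer, inner, guarded_off


for extent in (4, 8, 48, 50):
    for factor in (4, 8):
        outer, inner, guarded_off = split_extents(extent, factor)
        print(f"extent={extent} factor={factor}: "
              f"outer={outer} inner={inner} guarded_off={guarded_off}")
```

Under this model, an innermost axis of extent 4 split by 8 leaves half of the inner iterations guarded off, whereas a factor of 4 divides it exactly.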

@hlu1
Contributor Author

hlu1 commented Mar 13, 2019

We also need to consider the case of int8/uint8. For example, when you add two int8 numbers to produce one int16 number, the SIMD width is 128/16 = 8 lanes. I think 8 is a good compromise in general.
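
The lane arithmetic above can be sketched as follows. This is a small illustration only: `simd_lanes` is a hypothetical helper, and the 128-bit register width is taken from the 128/16 figure in the comment.

```python
REGISTER_BITS = 128  # assumption: 128-bit SIMD registers, as in "128/16"


def simd_lanes(element_bits, register_bits=REGISTER_BITS):
    """Number of elements of the given width that fit in one register."""
    return register_bits // element_bits


for name, bits in (("int8", 8), ("int16", 16), ("float32", 32)):
    print(f"{name}: {simd_lanes(bits)} lanes per register")
```

With a split factor of 8, an int16 result fills exactly one register under this assumption, while float32 data would span two registers and int8 data would fill half of one.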

@FrozenGene
Member

OK. Could you add the >= 5 case like we do for x86? https://github.com/dmlc/tvm/blob/master/topi/python/topi/x86/injective.py#L26. This could help with NCHWxC layout transforms.

@hlu1
Contributor Author

hlu1 commented Mar 13, 2019

That should have been covered by:

if len(s[x].op.axis) >= 3:
    fused = s[x].fuse(s[x].op.axis[0], s[x].op.axis[1], s[x].op.axis[2])
    s[x].parallel(fused)

I used len(s[x].op.axis) >= 3 because it's needed for special cases like [1, 1, 224, 224].
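
One plausible reading of the [1, 1, 224, 224] case is that fusing only the first two axes would produce a loop of extent 1, leaving nothing to run in parallel. The sketch below checks that arithmetic in plain Python; `fused_extent` is a hypothetical helper, not part of TVM.

```python
import math


def fused_extent(shape, num_fused):
    """Extent of the loop obtained by fusing the first `num_fused` axes."""
    return math.prod(shape[:num_fused])


shape = [1, 1, 224, 224]
print(fused_extent(shape, 2))  # 1: no parallelism from the fused loop
print(fused_extent(shape, 3))  # 224: the fused loop can be parallelized
```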

@FrozenGene
Member

FrozenGene commented Mar 13, 2019

Oops, I hadn't noticed that.

Member

@FrozenGene FrozenGene left a comment

LGTM.

@hlu1 hlu1 force-pushed the injective branch 4 times, most recently from d191369 to a6e3656 Compare March 15, 2019 00:57
@merrymercy merrymercy merged commit 5a8ab8f into apache:master Mar 18, 2019
wweic pushed a commit to wweic/tvm that referenced this pull request Mar 20, 2019
wweic pushed a commit to neo-ai/tvm that referenced this pull request Mar 20, 2019
@hlu1 hlu1 deleted the injective branch April 17, 2019 06:51
@FrozenGene
Member

@hlu1 Could you take a look at this discussion on the discuss forum? https://discuss.tvm.ai/t/relay-build-target-rasp3b-something-wrong/2195 The issue appears to be related to this changeset.

@hlu1
Contributor Author

hlu1 commented Apr 21, 2019

@FrozenGene, thanks for letting me know. Fixed in #3061

3 participants