[QNN][Relay][Topi] Add qnn.dense with weight layout #13854
Conversation
Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment. Generated by tvm-bot
This commit improves performance of the qnn.mul operation without QNN canonicalization.
Force-pushed from 116a60d to f55a0c9
This is cool, thanks a lot. I went through this a little already earlier, but I'll go through the PR more closely.
Looks good to me apart from the one question about the schedule.
Also, one small thought, and I might be wrong here, but I see that the nn version of this op is called nn.contrib_dense_pack, so should we name this qnn.contrib_dense_pack?
    sch: Schedule
        The computation schedule for the op.
    """
    return default_schedule(outs)
Is there a plan to add a vrmpy tensorized schedule separately, or is that going to be dependent on MetaSchedule?
I have such plans, but with low priority. For now, the most interesting part is the MetaScheduler work.
Yes, I can rename this operator to be aligned with its analog.
# Shift kernel if necessary.
if kernel_dtype == "int8":
    # Compute (QA + 128) and (zp_a + 128)
    kernel, kernel_zero_point = _shift(kernel, kernel_zero_point, "uint8")
Why do we want to convert the kernel to uint8? This results in a non-zero zero point and introduces additional computation that could have been avoided if the kernel stayed in int8.
The main goal of the int8 --> uint8 conversion is to enable the use of the faster vrmpy u8u8i32 intrinsic instead of vrmpy u8i8i32 for qnn.dense/qnn.conv2d.
Yes, you're absolutely right, we have some overhead from such a conversion. That's why I have commented out this code and skip the i8 -> u8 conversion for weights; right now I do not see any performance improvement because of the overhead.
I will enable this code if I manage to get better performance.
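For context, a minimal NumPy sketch (illustrative only, not the actual _shift helper from this PR) of why adding 128 to both the quantized values and the zero point turns int8 data into uint8 without changing the represented real values:

```python
import numpy as np

# Illustrative int8 quantized kernel values and zero point.
kernel_i8 = np.array([-128, -3, 0, 42, 127], dtype=np.int8)
zp_i8 = -5

# Shift both by +128 so the data fits the uint8 range.
kernel_u8 = (kernel_i8.astype(np.int32) + 128).astype(np.uint8)
zp_u8 = zp_i8 + 128

# The dequantized values are unchanged: (q + 128) - (zp + 128) == q - zp.
assert np.array_equal(
    kernel_i8.astype(np.int32) - zp_i8,
    kernel_u8.astype(np.int32) - zp_u8,
)
```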
Force-pushed from f55a0c9 to 7e45dda
@@ -296,7 +297,98 @@ def _get_mod(data_dtype, kernel_dtype):
    assert "cast" in legalized_mod.astext() and "qnn" in legalized_mod.astext()


def test_qnn_legalize_qnn_conv2d_non_scalar_qnn_params():
    """
    Test QNN legalization for qnn.dense op for Hexagon target when kernel zero point and kernel
Typo?: qnn.dense -> qnn.conv2d
Done, fixed
def test_qnn_legalize_qnn_conv2d_non_scalar_qnn_params():
    """
    Test QNN legalization for qnn.dense op for Hexagon target when kernel zero point and kernel
    scale are not scalars.
Could you please elaborate on what you mean by the kernel zero point and scale not being scalars?
So, by "not scalar" I mean constant vector of scalars. For example:
relay.const([val1, val2 ... valN])
This is more interesting case for testing because on legalization we do padding of weights. And dimension of weights, vector of scale/zero point and 'channels' attribute should be aligned.
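To make that concrete, here is a small illustrative sketch (not code from this PR; shapes and values are invented) of qnn.dense with per-channel kernel quantization parameters, where kernel_zero_point and kernel_scale are vector constants whose length matches the number of output channels (the units attribute):

```python
from tvm import relay

units, in_features = 4, 16
data = relay.var("data", shape=(2, in_features), dtype="uint8")
kernel = relay.var("kernel", shape=(units, in_features), dtype="int8")

# Per-channel kernel params: one zero point and one scale per output channel.
# Their length must stay aligned with the kernel shape and the units attribute,
# which is what the weight padding done during legalization has to preserve.
kernel_zp = relay.const([0, 1, 2, 3], "int32")
kernel_scale = relay.const([0.1, 0.2, 0.3, 0.4], "float32")

dense = relay.qnn.op.dense(
    data,
    kernel,
    input_zero_point=relay.const(1, "int32"),
    kernel_zero_point=kernel_zp,
    input_scale=relay.const(0.5, "float32"),
    kernel_scale=kernel_scale,
    units=units,
)
```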
This commit adds a new Relay operation, "qnn.contrib_dense_pack", that supports different weight layouts (nn.dense and qnn.dense do not support this attribute). The new operation is a full analog of the "nn.contrib_dense_pack" operation, but in QNN space.
Force-pushed from 7e45dda to 15e85e5
* [Hexagon][QNN] Improve performance of qnn.mul: this commit improves performance of the qnn.mul operation without QNN canonicalization.
* [QNN][Relay][Topi] Add qnn.dense with weight layout: this commit adds a new Relay operation "qnn.contrib_dense_pack" that supports different weight layouts (nn.dense and qnn.dense do not support this attribute). The new operation is a full analog of the "nn.contrib_dense_pack" operation, but in QNN space.
This commit adds a new Relay operation, qnn.contrib_dense_pack, that supports different weight layouts (nn.dense and qnn.dense do not support this attribute). The new operation is a full analog of the nn.contrib_dense_pack operation, but in QNN space.

With this PR, the current QNN Dense can achieve a ~10x performance gain on the Hexagon target without QNN canonicalization (through the use of the vrmpy intrinsic).

Also, this PR includes a slight performance improvement for qnn.mul (without QNN canonicalization).
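For readers unfamiliar with packed weight layouts, here is a rough NumPy illustration of the idea (the block sizes below are invented for the example and are not necessarily the ones this PR's vrmpy schedule uses): the flat (N, C) weight matrix is reshaped into fixed-size tiles so the inner loops of the dense kernel read contiguous vectors.

```python
import numpy as np

n_out, n_in = 64, 16      # output channels N, input channels C
block_n, block_c = 32, 4  # illustrative tile sizes only

weight = np.random.randint(-128, 128, size=(n_out, n_in), dtype=np.int8)

# Pack (N, C) -> (N // block_n, C // block_c, block_n, block_c):
# the last two axes are the contiguous tiles consumed by the inner loop.
packed = (
    weight.reshape(n_out // block_n, block_n, n_in // block_c, block_c)
    .transpose(0, 2, 1, 3)
    .copy()
)
assert packed.shape == (2, 4, 32, 4)
```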