Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add one-sided widening intrinsics. #6967

Merged
merged 10 commits into from
Sep 13, 2022
Merged

Conversation

rootjalex
Copy link
Member

This PR adds three one-sided widening intrinsics:

widen_right_add(a, b) = a + widen(b)
widen_right_sub(a, b) = a - widen(b)
widen_right_mul(a, b) = a * widen(b)

These intrinsics are not intended to be used in the front end (i.e. widen_right_add(x_u16, y_u8) is exactly equivalent to the terser x_u16 + y_u8 due to implicit casting), but are useful for a number of peephole optimizations on ARM and HVX (to come in later PRs).

The addition of widen_right_mul actually simplifies some of the HexagonOptimize code considerably, and we can remove some of the pattern flags there. This PR also made me realize the semantic equivalence of two of the peephole optimizations that were in HexagonOptimize's OptimizePatterns Mul visitor:

{"halide.hexagon.mul.vw.vuh", wild_i32x * wild_i32x, Pattern::ReinterleaveOp0 | Pattern::NarrowUnsignedOp1},
{"halide.hexagon.mul.vuw.vuh", wild_u32x * wild_u32x, Pattern::ReinterleaveOp0 | Pattern::NarrowUnsignedOp1},

these both perform multiplication with zxt applied to the second argument, and the first generates a shorter code sequence, so the latter has been removed. Would love for someone from Qualcomm to confirm that this is indeed correct, @pranavb-ca or @aankit-ca ?

Given the significant issues with merging #6900 into Google, it might be ideal for Qualcomm/Adobe/Google/others to confirm that this PR does not break anything before merging it in. Other than the HVX optimization, there should be no other differences in codegen, but I have other PRs in the works that use these patterns for improved pattern matching.

@pranavb-ca
Copy link
Contributor

Thank you for this PR @rootjalex. As explained on gitter I am still out of the office (Back on Monday 8/29). @aankit-ca @sksarda - Can one of you please test this PR and make sure that the removal of the second pattern in particular doesn't regress. I think for us conv3x3 might be a good test, no? Or our impl. of gaussian5x5. Anything that is accumulating widening multiplies. Definitly check simd_op_check_hvx

@rootjalex
Copy link
Member Author

Thanks, and sorry for bothering you on OOTO time!

Definitly check simd_op_check_hvx

Fwiw, I did change one check in that test, but I believe it was an improvement

@steven-johnson
Copy link
Contributor

There appear to be failures in simd_op_check_hvx

@rootjalex
Copy link
Member Author

There appear to be failures in simd_op_check_hvx

I am investigating - it's quite weird, the STMT is exactly the same for the failing tests (so the lifting is not causing it), but I can't tell how the lowering would impact a shift right narrow expression

@rootjalex
Copy link
Member Author

Ah, it was because of a missing '!', whoops.

@aankit-ca
Copy link
Contributor

@rootjalex @pranavb-ca @sksarda Aren't there chances of overflows with using halide.hexagon.mul.vw.vuh for widen_right_mul(wild_u32x, wild_u16x)? For eg: INT32_MAX * 2. Using vmpyiewuh might not yield correct results.

@rootjalex
Copy link
Member Author

@aankit-ca Does HVX implement wrap-around for multiplication overflow? If so, widen_right_mul((uint32)reinterpret(INT32_MAX), (uint16)2), widen_right_mul((int32)NT32_MAX, (int16)2), and (int32)INT32_MAX * int32((int16)2) should all be exactly equivalent, and should all overflow in the same way.

@rootjalex
Copy link
Member Author

That being said, there are some failing tests, so perhaps I am wrong. I will investigate those tomorrow morning.

@aankit-ca
Copy link
Contributor

@aankit-ca Does HVX implement wrap-around for multiplication overflow? If so, widen_right_mul((uint32)reinterpret(INT32_MAX), (uint16)2), widen_right_mul((int32)NT32_MAX, (int16)2), and (int32)INT32_MAX * int32((int16)2) should all be exactly equivalent, and should all overflow in the same way.

I don't think so. The documentation for vmpyiewuh does not specify the semantics in case of overflows.

@rootjalex
Copy link
Member Author

@aankit-ca Does Hexagon use different multiplication implementations for signed versus unsigned multiplication? I expected it, like all other processors that I know of, to use a single multiply implementation per bitwidth

@aankit-ca
Copy link
Contributor

@rootjalex Thanks for the explanation. Your change seems right. The same instruction should work for both signed and unsigned numbers.

@rootjalex
Copy link
Member Author

@aankit-ca great, thanks for confirming!

@abadams Both of the performance failures look like flakes (those two tests have been pretty flaky for me on previous PRs). Think this is good to go?

Also @steven-johnson would you be able to check if this PR causes any google failures?

@steven-johnson
Copy link
Contributor

Also @steven-johnson would you be able to check if this PR causes any google failures?

Will do.

@rootjalex
Copy link
Member Author

Also @steven-johnson would you be able to check if this PR causes any google failures?

Will do.

Thank you!

@steven-johnson
Copy link
Contributor

It appears this may be injecting some breakage in the C++ backend -- we are missing some overloads for operator* for some vector operators -- not 100% if it's this PR or not, though, investigating

if (b.type().code() != narrow_a.type().code()) {
// Need to do a safe reinterpret.
Type t = b.type().with_code(code);
result = widen_right_mul(reinterpret(t, b), narrow_a);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(This is a follow-up to another comment from Steven regarding failures in C++ backend).

I am a bit confused by reinterpret here, do I understand it correctly that it might turn wild_i32x * u32(wild_u16x) into widen_right_mul(wild_u32x, wild_u16x)?

The problem we are seeing is in the Xtensa backend (not exactly a C++ backend, but it's derived from it; also, lives in a separate branch, so it's a bit harder to keep track of), we currently don't handle this intrinsic, so at the code generation stage we will call lower_intrinsic once we encounter widen_right_mul. As a result, the following will happen:

  1. we start with the expession like wild_i32x * widen(wild_u16x)
  2. it gets transformed into widen_right_mul(wild_u32x, wild_u16x)
  3. it gets lowered back to wild_u32x * widen(wild_u32x) [notice that left operand became unsingned]

If that's correct then the input of the 1) is not equiualent to the output of the 3), which seems a bit problematic? Is this transformation correct from numerical point of view (I guess depends on the actual implementation)? The specific problem we see in the Xtensa backend, is that Xtensa doesn't seem to have an intrinsic for multiplication of two wild_u32x vectors, but does have intrinsics for wild_i32x * wild_i32x and wild_i32x * wild_u32x (I know it's a bit weird and I can try to find out more details about it, but it may take some time).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Numerically, the expressions are still equivalent, integer multiplication is the same for signed or unsigned arguments.

I see your issue though, and I see two possible solutions:

  1. in lowering intrinsics, specifically look for the pattern reinterpret(widen_right_op(reinterpret(a), b)), and lower it without the reinterprets (this seems messy)

  2. Xtensa should specifically pattern match widen_right_mul(reinterpret(i32), u16) and use the wild_i32 * wild_u32 op that you mentioned

I believe that the letter is better, what do you think? Feasibly both could be implemented.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we could also change the design of the intrinsics to be
a op cast(a.type(), b)
I think there was some reason that we chose not to do that initially. @abadams do you remember why?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, 1) would be better in my opinion, but seems to be difficult to implement due to the outer reinterpret(), so probably not worth the effort or complexity.

I certainly can do 2), this should be pretty straigtforward. I was concerned that expressions are not equiualent after find_intrinsics -> lower_intrinsic, but it sounds it should be good numeric-wise.

I think, if this is the only issue we see in Google testing then it should be fine to merge in and I can address the issue before updating Halide in Google.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer we address the Xtensa issue first, so that we can complete a test of this change in Google before landing. (Currently there are a lot of false-positive failures in the test due to this.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's cool with me, I'll look into it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vksnk I realize now that I should have questioned deeper on the Xtensa multiplication intrinsics - because multiplication is not parameterized by sign, shouldn’t the intrinsic that you mentioned being used for i32 x i32 multiplication be used for any 32-bit integer multiplication?

That being said, I’m a little unsure what the i32 x u32 multiplication is used for. Is that a widening multiply by chance?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And to address your concern about find_intrinsics -> lower_intrinsics not matching perfectly - unfortunately, that is already the case for most (possibly all?) of the intrinsics, though I believe these are the only intrinsics that will add reinterprets

@steven-johnson
Copy link
Contributor

Where do we stand on this PR -- are we awaiting the Xtensa fixes, or what?

@rootjalex
Copy link
Member Author

Where do we stand on this PR -- are we awaiting the Xtensa fixes, or what?

Yep, just waiting on Xtensa fixes. Otherwise I think this PR is good to go (or at least, good to be reviewed again).

@vksnk
Copy link
Member

vksnk commented Sep 9, 2022

Where do we stand on this PR -- are we awaiting the Xtensa fixes, or what?

Yep, just waiting on Xtensa fixes. Otherwise I think this PR is good to go (or at least, good to be reviewed again).

Yes, sorry for the delay, it's blocked on me. I needed to finish something else first, but working on the fix for this now.

@steven-johnson
Copy link
Contributor

Yes, sorry for the delay, it's blocked on me.

No worries, just checking :-)

@vksnk
Copy link
Member

vksnk commented Sep 12, 2022

Where do we stand on this PR -- are we awaiting the Xtensa fixes, or what?

Yep, just waiting on Xtensa fixes. Otherwise I think this PR is good to go (or at least, good to be reviewed again).

Yes, sorry for the delay, it's blocked on me. I needed to finish something else first, but working on the fix for this now.

Should be good now, I've a fix for Xtensa issue and will update the branch once this is submitted.

Copy link
Contributor

@steven-johnson steven-johnson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM pending the necessary changes from @vksnk landing first.

@vksnk
Copy link
Member

vksnk commented Sep 13, 2022

LGTM pending the necessary changes from @vksnk landing first.

I can't land my changes because this PR (where new intrinsics are introduced) needs to be merged in first.

@steven-johnson
Copy link
Contributor

steven-johnson commented Sep 13, 2022

I can't land my changes because this PR (where new intrinsics are introduced) needs to be merged in first.

Ahh, right. I guess I should make an experimental branch with both this and your change and do a test integrate first (since the last test turned up breakage). Please point me at your relevant change(s) via Chat and I'll get it cranking.

UPDATE: @vksnk says he's already done this, so, LGTM :-)

@steven-johnson steven-johnson self-requested a review September 13, 2022 17:49
@rootjalex rootjalex merged commit 27b8a7d into main Sep 13, 2022
@rootjalex rootjalex deleted the rootjalex/extend-intrinsics branch September 13, 2022 18:14
ardier pushed a commit to ardier/Halide-mutation that referenced this pull request Mar 3, 2024
* implement widen_right_ ops

* update HVX patterns with one-sided widening intrinsics

* remove unused HVX pattern flags

* strengthen logic for finding rounding shifts

Co-authored-by: Steven Johnson <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants