-
Notifications
You must be signed in to change notification settings - Fork 2.7k
Enable and optimize AVX helper-intrinsics #17030
Conversation
This PR and #16955 will finish AVX intrinsics (except |
@fiigii @eerhardt @CarolEidt @tannergooding Which AVX2 intrinsics are still missing? If it is minority of them perhaps we should implement them before ZBB. This should not prevent us from stabilizing implementation of existing ones. |
src/jit/hwintrinsiccodegenxarch.cpp
Outdated
} | ||
else | ||
{ | ||
emit->emitIns_R_R(INS_mov_i2xmm, emitTypeSize(TYP_SIMD16), tmpXMM, op1Reg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this just always be emitTypeSize(baseType)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately not... The current emitter for movd
/movq
is really messy, we need to refactor it later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you log a bug tracking this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here can be simplified, thanks for @mikedn 's help.
I disagree. Any changes have the potential to destabilize, and often do. We really need to stop making changes so that we can stabilize. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One optional suggestion, but otherwise LGTM
src/jit/hwintrinsicxarch.cpp
Outdated
assert(varTypeIsArithmetic(baseType)); | ||
|
||
ival = ival & (32 / genTypeSize(baseType) - 1); // clear the unused bits | ||
int halfIndex = 16 / genTypeSize(baseType); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two lines could be abstracted out for better readability (it's repeated below). It could also be made more efficient since we know that the type size is always a power of 2, though I don't think that's critical.
Both these suggestions can be left for future unless you are already going to do another commit on this PR, in which case you might consider extracting to a separate method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could also be made more efficient since we know that the type size is always a power of 2
I would expect that the native compiler will properly recognize and transform <pow-2-const> / value
and will transform it into the appropriate shift operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two lines could be abstracted out for better readability
Thanks, will change in this PR, and I will update the test cases to match #16957
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure the compiler will do that, given that genTypeSize
gets its value from a table, though perhaps it will be more clever than I think. In any case, I don't consider optimizing this a high-value investment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure the compiler will do that, given that genTypeSize gets its value from a table
You're probably right, I was thinking of the reverse operation value / <pow-2-const>
😄
I don't consider optimizing this a high-value investment.
Agreed
src/jit/hwintrinsiccodegenxarch.cpp
Outdated
|
||
if (compiler->compSupports(InstructionSet_AVX2)) | ||
{ | ||
emit->emitIns_R_R(ins, emitTypeSize(TYP_SIMD32), targetReg, op1Reg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will emit broadcast
, correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will add a comment.
break; | ||
case TYP_LONG: | ||
case TYP_ULONG: | ||
emit->emitIns_SIMD_R_R_I(INS_pshufd, emitTypeSize(TYP_SIMD16), op1Reg, op1Reg, 68); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why pshufd
instead of unpcklqdq
?
} | ||
case TYP_SHORT: | ||
case TYP_USHORT: | ||
emit->emitIns_SIMD_R_R_I(INS_pshuflw, emitTypeSize(TYP_SIMD16), op1Reg, op1Reg, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why pshuflw
instead of unpcklwd
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just followed Clang codegen. They have the same performance on most architectures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, but the latter produces smaller code, iirc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can investigate it later.
src/jit/instrsxarch.h
Outdated
INST3( vextractf128, "extractf128" , 0, IUM_WR, 0, 0, SSE3A(0x19), BAD_CODE, BAD_CODE) // Extract 128-bit packed floating point values | ||
INST3( vextracti128, "extracti128" , 0, IUM_WR, 0, 0, SSE3A(0x39), BAD_CODE, BAD_CODE) // Extract 128-bit packed integer values | ||
INST3( vextractf128, "extractf128" , 0, IUM_WR, 0, 0, SSE3A(0x19), BAD_CODE, SSE3A(0x19)) // Extract 128-bit packed floating point values | ||
INST3( vextracti128, "extracti128" , 0, IUM_WR, 0, 0, SSE3A(0x39), BAD_CODE, SSE3A(0x39)) // Extract 128-bit packed integer values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed a mistake from #16957, looks like an oversight.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, this was explicit, there is no rm encoding for vextractf128 or vextracti128
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests failed on that changes. emitIns_R_AR_I
seems to require RM
field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is because you should be calling emitIns_AR_R_I
, which is different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do not have a emitIns_AR_R_I
now...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You did not fix Avx/Avx2.ExtractVector128, the changes (removing the RM field of extractf128 and extracti128) from #16957 makes an assertion failure (emitIns_R_AR_I requires RM encoding).
Right, I missed that ExtractVector128
was also swapping operands and needed to be changed as well. However, I did fix other code that was broken because it was incorrectly trying to be R_RM, when it is actually RM_R (NI_SSE41_Extract
, for example).
The overall point is that:
- The code, without any changes, is currently wrong (because I missed something).
- The new changes are also ultimately incorrect (because it is not handling something that is known to cause failures in similar code)
As such, it is my opinion that the code should be fixed now rather than as a later PR.
I will, however, ultimately defer to @CarolEidt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new changes are also ultimately incorrect (because it is not handling something that is known to cause failures in similar code)
Question. How can I test the issue? Did you get ExtractVector128
failures from GCStress
tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before #16957:
- Code was bad, caused GCStress failures
- When run locally with
EnableIncompleteISAClass
- When run locally with
After #16957:
- Code is still bad, but now causes an assert for
EnableIncompleteISAClass
- This was missed because only GCStress and RegStress jobs were run in the final set of changes
After #17030 (this PR):
- I would like to see this passing for regular tests, GCStress tests, and RegStress tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question. How can I test the issue?
You have to run locally with both EnableIncompleteISAClass
and the appropriate Jit Stress flags (such as COMPlus_GCStress=0xC
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
COMPlus_GCStress=0xC
Yes, I tried all the GCStress
values and ExtractVector128
tests all passed...
As such, it is my opinion that the code should be fixed now rather than as a later PR.
Let me try to add a emitIns_AR_R_I
.
assert(varTypeIsArithmetic(baseType)); | ||
// clear the unused bits to normalize the index into the range of [0, length of Vector256<baseType>) | ||
*indexPtr = (*indexPtr) & (32 / genTypeSize(baseType) - 1); | ||
return (16 / genTypeSize(baseType)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Abstract these two lines in a static function.
Added the function |
src/jit/emitfmtsxarch.h
Outdated
@@ -194,6 +196,8 @@ IF_DEF(ARD_CNS, IS_AM_RD, AMD_CNS) // read [adr], const | |||
IF_DEF(AWR_CNS, IS_AM_WR, AMD_CNS) // write [adr], const | |||
IF_DEF(ARW_CNS, IS_AM_RW, AMD_CNS) // r/w [adr], const | |||
|
|||
IF_DEF(AWR_RRD_CNS, IS_AM_WR, AMD_CNS) // write [adr], reg, const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This hsould be IS_AM_WR|IS_R1_RD
? and the comment should be write [adr], read reg, const
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This hsould be IS_AM_WR|IS_R1_RD?
I am not sure, this field seems not used now.
and the comment should be write [adr], read reg, const?
Will change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think a flag being unused is a good reason to drop it. We should probably add it, in order to match the others, and it c an be removed, ignored, or consumed (whoever is appropriate) by both current or future code/changes
src/jit/emitxarch.cpp
Outdated
|
||
assert(emitGetInsAmdAny(id) == disp); // make sure "disp" is stored properly | ||
|
||
sz = 6; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment as to why 6? I believe we have these elsewhere...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would also be good to understand why the size calculation functions can't be used. I think those should be preferred, if possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only vextracti/f128
needs this function, so the code size is known and we do not need to calculate it. Meanwhile, the current code-size estimation is not accurate for vextracti/f128
.
Will add a comment.
src/jit/emitxarch.cpp
Outdated
@@ -7785,6 +7813,32 @@ void emitter::emitDispIns( | |||
break; | |||
} | |||
|
|||
case IF_AWR_RRD_CNS: | |||
case IF_MWR_RRD_CNS: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should be handled separately.
AWR_RRD_CNS
should be using emitDispAddrMode
MWR_RRD_CNS
should be using emitDispClsVar
src/jit/emitxarch.cpp
Outdated
case IF_MWR_RRD_CNS: | ||
{ | ||
assert(ins == INS_vextracti128 || ins == INS_vextractf128); | ||
sstr = codeGen->genSizeStr(EA_ATTR(16)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we doing sstr
here, I think everyone else is getting it above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
vextracti/f128
is 256-bit instruction that extracts a 128-bit value, so we have to specially treat the sstr
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comments are useful for special comments are useful for special cases.
src/jit/emitxarch.cpp
Outdated
@@ -12213,6 +12267,16 @@ size_t emitter::emitOutputInstr(insGroup* ig, instrDesc* id, BYTE** dp) | |||
sz = emitSizeOfInsDsc(id); | |||
break; | |||
|
|||
case IF_AWR_RRD_CNS: | |||
case IF_MWR_RRD_CNS: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should be handled separately as well. AWR
goes through emitOutputAM
, but MWR
needs to go through emitOutputCV
. The functions also deal with regcode slightly differently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR just add emitIns_AR_R_I
, shall I remove IF_MWR_RRD_CNS
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need/want both to ensure all the same code exists as for the other cases.
emitMapFmtAtoM
specifically provides a mapping from one to the other.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@CarolEidt might be able to provide better input as to whether or not IF_MWR_RRD_CNS
should be added SxS with IF_MWR_RRD_CNS
.
3f1705c
to
0aecc63
Compare
Addressed all the feedback, please take a look @CarolEidt @tannergooding |
src/jit/emitfmtsxarch.h
Outdated
@@ -139,6 +139,8 @@ IF_DEF(MRD_CNS, IS_GM_RD, DSP_CNS) // read [mem], const | |||
IF_DEF(MWR_CNS, IS_GM_WR, DSP_CNS) // write [mem], const | |||
IF_DEF(MRW_CNS, IS_GM_RW, DSP_CNS) // r/w [mem], const | |||
|
|||
IF_DEF(MWR_RRD_CNS, IS_GM_WR|IS_R1_RD, DSP_CNS) // write [mem] , read reg, const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: space between the ]
and the comma
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will fix, thanks.
src/jit/emitxarch.cpp
Outdated
|
||
assert(emitGetInsAmdAny(id) == disp); // make sure "disp" is stored properly | ||
|
||
// the code size of "vextracti/f128 [add], ymm, imm8" is 6 byte |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: adr
or mem
would probably be better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will fix, thanks.
@@ -12530,6 +12620,15 @@ size_t emitter::emitOutputInstr(insGroup* ig, instrDesc* id, BYTE** dp) | |||
sz = emitSizeOfInsDsc(id); | |||
break; | |||
|
|||
case IF_MWR_RRD_CNS: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the register encoding needs to happen in the case for MWR
. emitOutputAM
is one of the only ones that handles the register encoding in itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I do not quite understand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you look at the other examples MWR cases handle the encodeReg3456 before the emit Output call. emitOutputAM is the only one that does that in the call itself
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked the above section of IF_RWR_MRD_CNS
that does not handle the encodeReg3456 in if (Is4ByteSSE4OrAVXInstruction(ins))
path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see.
A comment (or explicit assert) that Is4ByteSSE4OrAVXInstruction(ins)
is always true would be beneficial.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, will do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Added a comment blow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This mostly LGTM, but there's one naming & comment change that I think is worth making.
src/jit/hwintrinsicxarch.cpp
Outdated
@@ -1071,6 +1090,145 @@ GenTree* Compiler::impAvxOrAvx2Intrinsic(NamedIntrinsic intrinsic, | |||
|
|||
switch (intrinsic) | |||
{ | |||
case NI_AVX_Extract: | |||
{ | |||
// Avx.Extract executes software implementation when the imm8 argument is not complie-time constant |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: should be "compile-time"
src/jit/hwintrinsicxarch.cpp
Outdated
@@ -1058,6 +1058,25 @@ GenTree* Compiler::impSSE42Intrinsic(NamedIntrinsic intrinsic, | |||
return retNode; | |||
} | |||
|
|||
//------------------------------------------------------------------------ | |||
// gethalfAndNormalizedIndex: compute the middle index of a Vector256<baseType> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This name and the name of the method don't match. I actually think that normalizeAndGetHalfIndex
would be clearer; "mid" doesn't really mean anything to me.
Also, the comment needs to describe the modified indexPtr
as a return value, or at least describe that it is an "out" parameter that returns the normalized index, while the actual return value is the half index.
test Windows_NT x64 Checked jitincompletehwintrinsic test Windows_NT x86 Checked jitincompletehwintrinsic test Ubuntu x64 Checked jitincompletehwintrinsic |
@CarolEidt Thanks for the review. I have fixed the function name and comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - thanks for the comments!
This PR:
Avx.Insert/Extract
, whichstackalloc
memory as the fallback when imm8 arguments are not compile-time constants.Avx.SetVector256
in the importer that is much simpler than managed-code/Codgen.AvxSetAllVector256
in CodeGen.@CarolEidt @tannergooding @AndyAyersMS PTAL