Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Enable and optimize AVX helper-intrinsics #17030

Merged
merged 6 commits into from
Mar 22, 2018
Merged

Conversation

fiigii
Copy link

@fiigii fiigii commented Mar 19, 2018

This PR:

  1. optimizes Avx.Insert/Extract, which
    1. move the implementation from managed-code to the JIT importer to get perfect codegen with compile-time constant imm8 arguments
    2. change the managed-code to read/write stackalloc memory as the fallback when imm8 arguments are not compile-time constants.
  2. implements Avx.SetVector256 in the importer that is much simpler than managed-code/Codgen.
  3. Implements AvxSetAllVector256 in CodeGen.

@CarolEidt @tannergooding @AndyAyersMS PTAL

@fiigii
Copy link
Author

fiigii commented Mar 19, 2018

This PR and #16955 will finish AVX intrinsics (except ZeroAll/ZeroUpper) for 2.1. After these two PRs get merged, we will stop enabling new intrinsics and start to stabilize hardware intrinsics for 2.1.

cc @eerhardt @4creators

@4creators
Copy link

@fiigii @eerhardt @CarolEidt @tannergooding

Which AVX2 intrinsics are still missing? If it is minority of them perhaps we should implement them before ZBB. This should not prevent us from stabilizing implementation of existing ones.

}
else
{
emit->emitIns_R_R(INS_mov_i2xmm, emitTypeSize(TYP_SIMD16), tmpXMM, op1Reg);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this just always be emitTypeSize(baseType)?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately not... The current emitter for movd/movq is really messy, we need to refactor it later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you log a bug tracking this?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here can be simplified, thanks for @mikedn 's help.

@CarolEidt
Copy link

This should not prevent us from stabilizing implementation of existing ones.

I disagree. Any changes have the potential to destabilize, and often do. We really need to stop making changes so that we can stabilize.

Copy link

@CarolEidt CarolEidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One optional suggestion, but otherwise LGTM

assert(varTypeIsArithmetic(baseType));

ival = ival & (32 / genTypeSize(baseType) - 1); // clear the unused bits
int halfIndex = 16 / genTypeSize(baseType);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two lines could be abstracted out for better readability (it's repeated below). It could also be made more efficient since we know that the type size is always a power of 2, though I don't think that's critical.
Both these suggestions can be left for future unless you are already going to do another commit on this PR, in which case you might consider extracting to a separate method.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could also be made more efficient since we know that the type size is always a power of 2

I would expect that the native compiler will properly recognize and transform <pow-2-const> / value and will transform it into the appropriate shift operation.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two lines could be abstracted out for better readability

Thanks, will change in this PR, and I will update the test cases to match #16957

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure the compiler will do that, given that genTypeSize gets its value from a table, though perhaps it will be more clever than I think. In any case, I don't consider optimizing this a high-value investment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure the compiler will do that, given that genTypeSize gets its value from a table

You're probably right, I was thinking of the reverse operation value / <pow-2-const> 😄

I don't consider optimizing this a high-value investment.

Agreed


if (compiler->compSupports(InstructionSet_AVX2))
{
emit->emitIns_R_R(ins, emitTypeSize(TYP_SIMD32), targetReg, op1Reg);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will emit broadcast, correct?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will add a comment.

break;
case TYP_LONG:
case TYP_ULONG:
emit->emitIns_SIMD_R_R_I(INS_pshufd, emitTypeSize(TYP_SIMD16), op1Reg, op1Reg, 68);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why pshufd instead of unpcklqdq?

}
case TYP_SHORT:
case TYP_USHORT:
emit->emitIns_SIMD_R_R_I(INS_pshuflw, emitTypeSize(TYP_SIMD16), op1Reg, op1Reg, 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why pshuflw instead of unpcklwd?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just followed Clang codegen. They have the same performance on most architectures.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but the latter produces smaller code, iirc

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can investigate it later.

INST3( vextractf128, "extractf128" , 0, IUM_WR, 0, 0, SSE3A(0x19), BAD_CODE, BAD_CODE) // Extract 128-bit packed floating point values
INST3( vextracti128, "extracti128" , 0, IUM_WR, 0, 0, SSE3A(0x39), BAD_CODE, BAD_CODE) // Extract 128-bit packed integer values
INST3( vextractf128, "extractf128" , 0, IUM_WR, 0, 0, SSE3A(0x19), BAD_CODE, SSE3A(0x19)) // Extract 128-bit packed floating point values
INST3( vextracti128, "extracti128" , 0, IUM_WR, 0, 0, SSE3A(0x39), BAD_CODE, SSE3A(0x39)) // Extract 128-bit packed integer values
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed a mistake from #16957, looks like an oversight.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this was explicit, there is no rm encoding for vextractf128 or vextracti128

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests failed on that changes. emitIns_R_AR_I seems to require RM field.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is because you should be calling emitIns_AR_R_I, which is different.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do not have a emitIns_AR_R_I now...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You did not fix Avx/Avx2.ExtractVector128, the changes (removing the RM field of extractf128 and extracti128) from #16957 makes an assertion failure (emitIns_R_AR_I requires RM encoding).

Right, I missed that ExtractVector128 was also swapping operands and needed to be changed as well. However, I did fix other code that was broken because it was incorrectly trying to be R_RM, when it is actually RM_R (NI_SSE41_Extract, for example).

The overall point is that:

  • The code, without any changes, is currently wrong (because I missed something).
  • The new changes are also ultimately incorrect (because it is not handling something that is known to cause failures in similar code)

As such, it is my opinion that the code should be fixed now rather than as a later PR.

I will, however, ultimately defer to @CarolEidt.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new changes are also ultimately incorrect (because it is not handling something that is known to cause failures in similar code)

Question. How can I test the issue? Did you get ExtractVector128 failures from GCStress tests?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before #16957:

  • Code was bad, caused GCStress failures
    • When run locally with EnableIncompleteISAClass

After #16957:

  • Code is still bad, but now causes an assert for EnableIncompleteISAClass
    • This was missed because only GCStress and RegStress jobs were run in the final set of changes

After #17030 (this PR):

  • I would like to see this passing for regular tests, GCStress tests, and RegStress tests

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question. How can I test the issue?

You have to run locally with both EnableIncompleteISAClass and the appropriate Jit Stress flags (such as COMPlus_GCStress=0xC)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

COMPlus_GCStress=0xC

Yes, I tried all the GCStress values and ExtractVector128 tests all passed...

As such, it is my opinion that the code should be fixed now rather than as a later PR.

Let me try to add a emitIns_AR_R_I.

assert(varTypeIsArithmetic(baseType));
// clear the unused bits to normalize the index into the range of [0, length of Vector256<baseType>)
*indexPtr = (*indexPtr) & (32 / genTypeSize(baseType) - 1);
return (16 / genTypeSize(baseType));
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Abstract these two lines in a static function.

@fiigii
Copy link
Author

fiigii commented Mar 21, 2018

Added the function emitIns_AR_R_I to avoid swapping operands of vextracti/f128 (in the last commit).
@CarolEidt @tannergooding PTAL

@@ -194,6 +196,8 @@ IF_DEF(ARD_CNS, IS_AM_RD, AMD_CNS) // read [adr], const
IF_DEF(AWR_CNS, IS_AM_WR, AMD_CNS) // write [adr], const
IF_DEF(ARW_CNS, IS_AM_RW, AMD_CNS) // r/w [adr], const

IF_DEF(AWR_RRD_CNS, IS_AM_WR, AMD_CNS) // write [adr], reg, const
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This hsould be IS_AM_WR|IS_R1_RD? and the comment should be write [adr], read reg, const?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This hsould be IS_AM_WR|IS_R1_RD?

I am not sure, this field seems not used now.

and the comment should be write [adr], read reg, const?

Will change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think a flag being unused is a good reason to drop it. We should probably add it, in order to match the others, and it c an be removed, ignored, or consumed (whoever is appropriate) by both current or future code/changes


assert(emitGetInsAmdAny(id) == disp); // make sure "disp" is stored properly

sz = 6;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment as to why 6? I believe we have these elsewhere...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would also be good to understand why the size calculation functions can't be used. I think those should be preferred, if possible

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only vextracti/f128 needs this function, so the code size is known and we do not need to calculate it. Meanwhile, the current code-size estimation is not accurate for vextracti/f128.

Will add a comment.

@@ -7785,6 +7813,32 @@ void emitter::emitDispIns(
break;
}

case IF_AWR_RRD_CNS:
case IF_MWR_RRD_CNS:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should be handled separately.

AWR_RRD_CNS should be using emitDispAddrMode
MWR_RRD_CNS should be using emitDispClsVar

case IF_MWR_RRD_CNS:
{
assert(ins == INS_vextracti128 || ins == INS_vextractf128);
sstr = codeGen->genSizeStr(EA_ATTR(16));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we doing sstr here, I think everyone else is getting it above.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vextracti/f128 is 256-bit instruction that extracts a 128-bit value, so we have to specially treat the sstr.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments are useful for special comments are useful for special cases.

@@ -12213,6 +12267,16 @@ size_t emitter::emitOutputInstr(insGroup* ig, instrDesc* id, BYTE** dp)
sz = emitSizeOfInsDsc(id);
break;

case IF_AWR_RRD_CNS:
case IF_MWR_RRD_CNS:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should be handled separately as well. AWR goes through emitOutputAM, but MWR needs to go through emitOutputCV. The functions also deal with regcode slightly differently.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR just add emitIns_AR_R_I, shall I remove IF_MWR_RRD_CNS?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need/want both to ensure all the same code exists as for the other cases.

emitMapFmtAtoM specifically provides a mapping from one to the other.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@CarolEidt might be able to provide better input as to whether or not IF_MWR_RRD_CNS should be added SxS with IF_MWR_RRD_CNS.

@fiigii fiigii force-pushed the avxhelper branch 2 times, most recently from 3f1705c to 0aecc63 Compare March 21, 2018 19:21
@fiigii
Copy link
Author

fiigii commented Mar 21, 2018

Addressed all the feedback, please take a look @CarolEidt @tannergooding

@@ -139,6 +139,8 @@ IF_DEF(MRD_CNS, IS_GM_RD, DSP_CNS) // read [mem], const
IF_DEF(MWR_CNS, IS_GM_WR, DSP_CNS) // write [mem], const
IF_DEF(MRW_CNS, IS_GM_RW, DSP_CNS) // r/w [mem], const

IF_DEF(MWR_RRD_CNS, IS_GM_WR|IS_R1_RD, DSP_CNS) // write [mem] , read reg, const
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: space between the ] and the comma

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix, thanks.


assert(emitGetInsAmdAny(id) == disp); // make sure "disp" is stored properly

// the code size of "vextracti/f128 [add], ymm, imm8" is 6 byte
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: adr or mem would probably be better

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix, thanks.

@@ -12530,6 +12620,15 @@ size_t emitter::emitOutputInstr(insGroup* ig, instrDesc* id, BYTE** dp)
sz = emitSizeOfInsDsc(id);
break;

case IF_MWR_RRD_CNS:
Copy link
Member

@tannergooding tannergooding Mar 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the register encoding needs to happen in the case for MWR. emitOutputAM is one of the only ones that handles the register encoding in itself.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I do not quite understand.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you look at the other examples MWR cases handle the encodeReg3456 before the emit Output call. emitOutputAM is the only one that does that in the call itself

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked the above section of IF_RWR_MRD_CNS that does not handle the encodeReg3456 in if (Is4ByteSSE4OrAVXInstruction(ins)) path.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see.

A comment (or explicit assert) that Is4ByteSSE4OrAVXInstruction(ins) is always true would be beneficial.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, will do.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Added a comment blow.

Copy link

@CarolEidt CarolEidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mostly LGTM, but there's one naming & comment change that I think is worth making.

@@ -1071,6 +1090,145 @@ GenTree* Compiler::impAvxOrAvx2Intrinsic(NamedIntrinsic intrinsic,

switch (intrinsic)
{
case NI_AVX_Extract:
{
// Avx.Extract executes software implementation when the imm8 argument is not complie-time constant

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: should be "compile-time"

@@ -1058,6 +1058,25 @@ GenTree* Compiler::impSSE42Intrinsic(NamedIntrinsic intrinsic,
return retNode;
}

//------------------------------------------------------------------------
// gethalfAndNormalizedIndex: compute the middle index of a Vector256<baseType>

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This name and the name of the method don't match. I actually think that normalizeAndGetHalfIndex would be clearer; "mid" doesn't really mean anything to me.
Also, the comment needs to describe the modified indexPtr as a return value, or at least describe that it is an "out" parameter that returns the normalized index, while the actual return value is the half index.

@fiigii
Copy link
Author

fiigii commented Mar 22, 2018

test Windows_NT x64 Checked jitincompletehwintrinsic
test Windows_NT x64 Checked jitx86hwintrinsicnoavx2
test Windows_NT x64 Checked jitx86hwintrinsicnosimd

test Windows_NT x86 Checked jitincompletehwintrinsic
test Windows_NT x86 Checked jitx86hwintrinsicnoavx2
test Windows_NT x86 Checked jitx86hwintrinsicnosimd

test Ubuntu x64 Checked jitincompletehwintrinsic
test Ubuntu x64 Checked jitx86hwintrinsicnoavx2
test Ubuntu x64 Checked jitx86hwintrinsicnosimd

@fiigii
Copy link
Author

fiigii commented Mar 22, 2018

@CarolEidt Thanks for the review. I have fixed the function name and comments.

Copy link

@CarolEidt CarolEidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - thanks for the comments!

@tannergooding tannergooding merged commit 8f4a5e5 into dotnet:master Mar 22, 2018
@fiigii fiigii deleted the avxhelper branch March 22, 2018 16:48
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants