Enable and optimize AVX helper-intrinsics #17030

fiigii · 2018-03-19T07:29:31Z

This PR:

Optimizes Avx.Insert/Extract, which
1. moves the implementation from managed code to the JIT importer to get perfect codegen with compile-time constant imm8 arguments
2. changes the managed code to read/write stackalloc memory as the fallback when imm8 arguments are not compile-time constants.
Implements Avx.SetVector256 in the importer, which is much simpler than doing it in managed code or CodeGen.
Implements Avx.SetAllVector256 in CodeGen.
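
As a rough illustration of the non-constant fallback described in item 2, here is a minimal C++ sketch of the idea (the real fallback is managed code using stackalloc, which this thread does not show). `Vec256Int32`, `ExtractFallback` and `InsertFallback` are hypothetical names, and wrapping the index with `& 7` is an assumption modeled on the index normalization the JIT applies for constant indices.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for Vector256<int>: eight 32-bit lanes.
struct Vec256Int32
{
    int32_t lanes[8];
};

// Fallback Extract: spill the vector to a stack buffer and index it at run time.
int32_t ExtractFallback(const Vec256Int32& value, uint8_t index)
{
    int32_t buffer[8]; // plays the role of the stackalloc memory
    std::memcpy(buffer, value.lanes, sizeof(buffer));
    return buffer[index & 7]; // assumption: the index is wrapped into [0, 8)
}

// Fallback Insert: spill the vector, overwrite one lane, and reload it.
Vec256Int32 InsertFallback(const Vec256Int32& value, int32_t data, uint8_t index)
{
    int32_t buffer[8];
    std::memcpy(buffer, value.lanes, sizeof(buffer));
    buffer[index & 7] = data; // assumption: same wrapping as in ExtractFallback
    Vec256Int32 result;
    std::memcpy(result.lanes, buffer, sizeof(buffer));
    return result;
}
```

When the index is a compile-time constant, none of this memory traffic should be needed, which is the motivation for handling that case in the importer.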

@CarolEidt @tannergooding @AndyAyersMS PTAL

fiigii · 2018-03-19T07:44:18Z

This PR and #16955 will finish AVX intrinsics (except ZeroAll/ZeroUpper) for 2.1. After these two PRs get merged, we will stop enabling new intrinsics and start to stabilize hardware intrinsics for 2.1.

cc @eerhardt @4creators

4creators · 2018-03-19T11:02:03Z

@fiigii @eerhardt @CarolEidt @tannergooding

Which AVX2 intrinsics are still missing? If it is only a minority of them, perhaps we should implement them before ZBB. This should not prevent us from stabilizing the implementation of the existing ones.

tannergooding · 2018-03-19T15:22:44Z

src/jit/hwintrinsiccodegenxarch.cpp

+                }
+                else
+                {
+                    emit->emitIns_R_R(INS_mov_i2xmm, emitTypeSize(TYP_SIMD16), tmpXMM, op1Reg);


Shouldn't this just always be emitTypeSize(baseType)?

Unfortunately not... The current emitter for movd/movq is really messy, we need to refactor it later.

Could you log a bug tracking this?

Logged at https://github.com/dotnet/coreclr/issues/17051

This can be simplified, thanks to @mikedn's help.

CarolEidt · 2018-03-19T15:26:33Z

This should not prevent us from stabilizing implementation of existing ones.

I disagree. Any changes have the potential to destabilize, and often do. We really need to stop making changes so that we can stabilize.

CarolEidt

One optional suggestion, but otherwise LGTM

CarolEidt · 2018-03-19T18:48:09Z

src/jit/hwintrinsicxarch.cpp

+            assert(varTypeIsArithmetic(baseType));
+
+            ival                            = ival & (32 / genTypeSize(baseType) - 1); // clear the unused bits
+            int            halfIndex        = 16 / genTypeSize(baseType);


These two lines could be abstracted out for better readability (it's repeated below). It could also be made more efficient since we know that the type size is always a power of 2, though I don't think that's critical.
Both of these suggestions can be left for the future unless you are already going to do another commit on this PR, in which case you might consider extracting them to a separate method.

It could also be made more efficient since we know that the type size is always a power of 2

I would expect the native compiler to recognize <pow-2-const> / value and transform it into the appropriate shift operation.

These two lines could be abstracted out for better readability

Thanks, will change in this PR, and I will update the test cases to match #16957

I'm not sure the compiler will do that, given that genTypeSize gets its value from a table, though perhaps it will be more clever than I think. In any case, I don't consider optimizing this a high-value investment.

I'm not sure the compiler will do that, given that genTypeSize gets its value from a table

You're probably right, I was thinking of the reverse operation value / <pow-2-const> 😄

I don't consider optimizing this a high-value investment.

Agreed
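
For readers following the two suggestions in this thread, here is a small self-contained C++ sketch of the extracted helper and of the shift-based variant that was mentioned but not pursued. It is not the JIT code: `elemSize` stands in for `genTypeSize(baseType)`, and `normalizeAndGetHalfIndexShift` with its `log2ElemSize` parameter is a hypothetical illustration.

```cpp
#include <cassert>

// Hypothetical stand-in for the extracted helper: elemSize replaces
// genTypeSize(baseType) and must be 1, 2, 4 or 8 bytes.
int normalizeAndGetHalfIndex(int* indexPtr, int elemSize)
{
    assert(elemSize == 1 || elemSize == 2 || elemSize == 4 || elemSize == 8);
    // A 256-bit vector holds 32 / elemSize elements; the mask clears the unused bits.
    *indexPtr = (*indexPtr) & (32 / elemSize - 1);
    // Number of elements in one 128-bit half.
    return 16 / elemSize;
}

// Hypothetical shift-based variant: because the size is a power of two,
// both divisions can be written as right shifts by log2(elemSize).
int normalizeAndGetHalfIndexShift(int* indexPtr, int log2ElemSize)
{
    assert(log2ElemSize >= 0 && log2ElemSize <= 3);
    *indexPtr = (*indexPtr) & ((32 >> log2ElemSize) - 1);
    return 16 >> log2ElemSize;
}
```

For a 4-byte element and an incoming index of 13, both versions normalize the index to 5 and return a half index of 4.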

tannergooding · 2018-03-19T18:59:50Z

src/jit/hwintrinsiccodegenxarch.cpp

+
+            if (compiler->compSupports(InstructionSet_AVX2))
+            {
+                emit->emitIns_R_R(ins, emitTypeSize(TYP_SIMD32), targetReg, op1Reg);


This will emit broadcast, correct?

Will add a comment.

tannergooding · 2018-03-19T19:01:52Z

src/jit/hwintrinsiccodegenxarch.cpp

+                        break;
+                    case TYP_LONG:
+                    case TYP_ULONG:
+                        emit->emitIns_SIMD_R_R_I(INS_pshufd, emitTypeSize(TYP_SIMD16), op1Reg, op1Reg, 68);


Why pshufd instead of unpcklqdq?
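
To make the question concrete: with an immediate of 68 (0x44), pshufd copies the low 64 bits of the source into both halves of the destination, which is the same result as unpacking the low qword of a register with itself. The scalar model below is only an illustration of the instruction semantics; `ModelPshufd` and `ModelUnpackLowQword` are hypothetical names, not JIT functions.

```cpp
#include <array>
#include <cstdint>

using Dwords = std::array<uint32_t, 4>;

// Scalar model of "pshufd dst, src, imm8": each 2-bit field of imm8
// selects which source dword lands in the corresponding destination slot.
Dwords ModelPshufd(const Dwords& src, uint8_t imm8)
{
    Dwords dst{};
    for (int i = 0; i < 4; i++)
    {
        dst[i] = src[(imm8 >> (2 * i)) & 3];
    }
    return dst;
}

// Scalar model of a low-qword unpack: the low 64 bits of a, then the low 64 bits of b.
Dwords ModelUnpackLowQword(const Dwords& a, const Dwords& b)
{
    return Dwords{a[0], a[1], b[0], b[1]};
}
```

With `v = {1, 2, 3, 4}`, both `ModelPshufd(v, 68)` and `ModelUnpackLowQword(v, v)` produce `{1, 2, 1, 2}`.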

tannergooding · 2018-03-19T19:02:50Z

src/jit/hwintrinsiccodegenxarch.cpp

+                    }
+                    case TYP_SHORT:
+                    case TYP_USHORT:
+                        emit->emitIns_SIMD_R_R_I(INS_pshuflw, emitTypeSize(TYP_SIMD16), op1Reg, op1Reg, 0);


Why pshuflw instead of unpcklwd?

I just followed Clang codegen. They have the same performance on most architectures.

Right, but the latter produces smaller code, iirc

We can investigate it later.
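
The two instructions are not identical in general, but they agree on the low dword, which is all that matters if a later step broadcasts that dword (an assumption here, since the rest of the sequence is not shown in this hunk). A scalar model of both, with hypothetical names `ModelPshuflw` and `ModelUnpackLowWord`:

```cpp
#include <array>
#include <cstdint>

using Words = std::array<uint16_t, 8>;

// Scalar model of "pshuflw dst, src, imm8": shuffles the low four words
// by the 2-bit fields of imm8 and copies the high four words unchanged.
Words ModelPshuflw(const Words& src, uint8_t imm8)
{
    Words dst = src;
    for (int i = 0; i < 4; i++)
    {
        dst[i] = src[(imm8 >> (2 * i)) & 3];
    }
    return dst;
}

// Scalar model of a low-word unpack: interleaves the low four words of a and b.
Words ModelUnpackLowWord(const Words& a, const Words& b)
{
    return Words{a[0], b[0], a[1], b[1], a[2], b[2], a[3], b[3]};
}
```

With an immediate of 0, pshuflw fills words 0 through 3 with word 0, while unpacking a register with itself yields word 0 twice followed by word 1 twice, and so on. In both results the low dword holds two copies of word 0.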

fiigii · 2018-03-19T21:29:58Z

src/jit/instrsxarch.h

-INST3( vextractf128, "extractf128" , 0, IUM_WR, 0, 0, SSE3A(0x19),  BAD_CODE, BAD_CODE)      // Extract 128-bit packed floating point values
-INST3( vextracti128, "extracti128" , 0, IUM_WR, 0, 0, SSE3A(0x39),  BAD_CODE, BAD_CODE)      // Extract 128-bit packed integer values
+INST3( vextractf128, "extractf128" , 0, IUM_WR, 0, 0, SSE3A(0x19),  BAD_CODE, SSE3A(0x19))   // Extract 128-bit packed floating point values
+INST3( vextracti128, "extracti128" , 0, IUM_WR, 0, 0, SSE3A(0x39),  BAD_CODE, SSE3A(0x39))   // Extract 128-bit packed integer values


Fixed a mistake from #16957, looks like an oversight.

No, this was explicit, there is no rm encoding for vextractf128 or vextracti128

Tests failed with that change. emitIns_R_AR_I seems to require the RM field.

That is because you should be calling emitIns_AR_R_I, which is different.

We do not have an emitIns_AR_R_I now...

You did not fix Avx/Avx2.ExtractVector128; the changes from #16957 (removing the RM field of extractf128 and extracti128) cause an assertion failure (emitIns_R_AR_I requires RM encoding).

Right, I missed that ExtractVector128 was also swapping operands and needed to be changed as well. However, I did fix other code that was broken because it was incorrectly trying to be R_RM, when it is actually RM_R (NI_SSE41_Extract, for example).

The overall point is that:

The code, without any changes, is currently wrong (because I missed something).

The new changes are also ultimately incorrect (because it is not handling something that is known to cause failures in similar code)

As such, it is my opinion that the code should be fixed now rather than as a later PR.

I will, however, ultimately defer to @CarolEidt.

The new changes are also ultimately incorrect (because it is not handling something that is known to cause failures in similar code)

Question. How can I test the issue? Did you get ExtractVector128 failures from GCStress tests?

Before #16957:

Code was bad, caused GCStress failures

When run locally with EnableIncompleteISAClass

After #16957:

Code is still bad, but now causes an assert for EnableIncompleteISAClass

This was missed because only GCStress and RegStress jobs were run in the final set of changes

After #17030 (this PR):

I would like to see this passing for regular tests, GCStress tests, and RegStress tests

Question. How can I test the issue?

You have to run locally with both EnableIncompleteISAClass and the appropriate Jit Stress flags (such as COMPlus_GCStress=0xC)

COMPlus_GCStress=0xC

Yes, I tried all the GCStress values and ExtractVector128 tests all passed...

As such, it is my opinion that the code should be fixed now rather than as a later PR.

Let me try to add an emitIns_AR_R_I.

fiigii · 2018-03-19T21:31:00Z

src/jit/hwintrinsicxarch.cpp

+    assert(varTypeIsArithmetic(baseType));
+    // clear the unused bits to normalize the index into the range of [0, length of Vector256<baseType>)
+    *indexPtr = (*indexPtr) & (32 / genTypeSize(baseType) - 1);
+    return (16 / genTypeSize(baseType));


Abstract these two lines in a static function.

fiigii · 2018-03-21T01:35:54Z

Added the function emitIns_AR_R_I to avoid swapping operands of vextracti/f128 (in the last commit).
@CarolEidt @tannergooding PTAL

tannergooding · 2018-03-21T03:20:10Z

src/jit/emitfmtsxarch.h

@@ -194,6 +196,8 @@ IF_DEF(ARD_CNS,     IS_AM_RD,                   AMD_CNS)  // read  [adr], const
 IF_DEF(AWR_CNS,     IS_AM_WR,                   AMD_CNS)  // write [adr], const
 IF_DEF(ARW_CNS,     IS_AM_RW,                   AMD_CNS)  // r/w   [adr], const

+IF_DEF(AWR_RRD_CNS, IS_AM_WR,                   AMD_CNS)  // write  [adr], reg, const


This should be IS_AM_WR|IS_R1_RD? And the comment should be write [adr], read reg, const?

This should be IS_AM_WR|IS_R1_RD?

I am not sure; this field does not seem to be used at the moment.

and the comment should be write [adr], read reg, const?

Will change.

I don't think a flag being unused is a good reason to drop it. We should probably add it, in order to match the others, and it can be removed, ignored, or consumed (whichever is appropriate) by either current or future code/changes.

tannergooding · 2018-03-21T03:22:04Z

src/jit/emitxarch.cpp

+
+    assert(emitGetInsAmdAny(id) == disp); // make sure "disp" is stored properly
+
+    sz = 6;


Comment as to why 6? I believe we have these elsewhere...

It would also be good to understand why the size calculation functions can't be used. I think those should be preferred, if possible

Only vextracti/f128 needs this function, so the code size is known and we do not need to calculate it. Meanwhile, the current code-size estimation is not accurate for vextracti/f128.

Will add a comment.
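
For context on where the 6 comes from: in the simplest addressing form, the store form of vextractf128 consists of a 3-byte VEX prefix, one opcode byte, one ModRM byte and one imm8 byte. The sketch below is a hypothetical stand-alone encoder, not the JIT emitter; it only covers registers 0-7 and base registers that need neither a SIB byte nor a displacement, and how the emitter accounts for those extra bytes is not shown in this thread.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical encoder for "vextractf128 [base], ymmSrc, imm8".
// base and ymmSrc are register numbers in the range 0-7.
std::vector<uint8_t> EncodeVextractf128Store(int base, int ymmSrc, uint8_t imm8)
{
    assert(ymmSrc >= 0 && ymmSrc <= 7);
    // rsp (4) would need a SIB byte and rbp (5) a displacement, so exclude them.
    assert(base >= 0 && base <= 7 && base != 4 && base != 5);

    std::vector<uint8_t> code;
    code.push_back(0xC4); // 3-byte VEX escape
    code.push_back(0xE3); // inverted R/X/B = 111, opcode map = 0F3A
    code.push_back(0x7D); // W = 0, vvvv = 1111 (unused), L = 1 (256-bit), pp = 01 (66)
    code.push_back(0x19); // vextractf128 opcode
    // ModRM with mod = 00: the reg field holds the source ymm, the rm field the base.
    code.push_back(static_cast<uint8_t>((ymmSrc << 3) | base));
    code.push_back(imm8);
    return code;
}
```

Note that the memory operand sits in the rm field and the register in the reg field, with the memory operand as the destination. This is the RM_R shape discussed in the instrsxarch.h thread.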

tannergooding · 2018-03-21T03:26:02Z

src/jit/emitxarch.cpp

@@ -7785,6 +7813,32 @@ void emitter::emitDispIns(
            break;
        }

+        case IF_AWR_RRD_CNS:
+        case IF_MWR_RRD_CNS:


These should be handled separately.

AWR_RRD_CNS should be using emitDispAddrMode
MWR_RRD_CNS should be using emitDispClsVar

tannergooding · 2018-03-21T03:26:25Z

src/jit/emitxarch.cpp

+        case IF_MWR_RRD_CNS:
+        {
+            assert(ins == INS_vextracti128 || ins == INS_vextractf128);
+            sstr = codeGen->genSizeStr(EA_ATTR(16));


Why are we doing sstr here, I think everyone else is getting it above.

vextracti/f128 is a 256-bit instruction that extracts a 128-bit value, so we have to treat the sstr specially.

Comments are useful for special cases.

tannergooding · 2018-03-21T03:27:45Z

src/jit/emitxarch.cpp

@@ -12213,6 +12267,16 @@ size_t emitter::emitOutputInstr(insGroup* ig, instrDesc* id, BYTE** dp)
            sz = emitSizeOfInsDsc(id);
            break;

+        case IF_AWR_RRD_CNS:
+        case IF_MWR_RRD_CNS:


These should be handled separately as well. AWR goes through emitOutputAM, but MWR needs to go through emitOutputCV. The functions also deal with regcode slightly differently.

This PR just adds emitIns_AR_R_I; shall I remove IF_MWR_RRD_CNS?

I think we need/want both to ensure all the same code exists as for the other cases.

emitMapFmtAtoM specifically provides a mapping from one to the other.

@CarolEidt might be able to provide better input as to whether or not IF_MWR_RRD_CNS should be added SxS with IF_AWR_RRD_CNS.

fiigii · 2018-03-21T19:22:36Z

Addressed all the feedback, please take a look @CarolEidt @tannergooding

tannergooding · 2018-03-21T19:25:39Z

src/jit/emitfmtsxarch.h

@@ -139,6 +139,8 @@ IF_DEF(MRD_CNS,     IS_GM_RD,                   DSP_CNS)  // read  [mem], const
 IF_DEF(MWR_CNS,     IS_GM_WR,                   DSP_CNS)  // write [mem], const
 IF_DEF(MRW_CNS,     IS_GM_RW,                   DSP_CNS)  // r/w   [mem], const

+IF_DEF(MWR_RRD_CNS, IS_GM_WR|IS_R1_RD,          DSP_CNS)  // write [mem] , read reg, const


nit: space between the ] and the comma

Will fix, thanks.

tannergooding · 2018-03-21T19:26:05Z

src/jit/emitxarch.cpp

+
+    assert(emitGetInsAmdAny(id) == disp); // make sure "disp" is stored properly
+
+    // the code size of "vextracti/f128 [add], ymm, imm8" is 6 byte


nit: adr or mem would probably be better

Will fix, thanks.

tannergooding · 2018-03-21T19:27:13Z

src/jit/emitxarch.cpp

@@ -12530,6 +12620,15 @@ size_t emitter::emitOutputInstr(insGroup* ig, instrDesc* id, BYTE** dp)
            sz = emitSizeOfInsDsc(id);
            break;

+        case IF_MWR_RRD_CNS:


I think the register encoding needs to happen in the case for MWR. emitOutputAM is one of the only ones that handles the register encoding in itself.

Sorry, I do not quite understand.

If you look at the other examples, the MWR cases handle encodeReg3456 before the emitOutput call. emitOutputAM is the only one that does that in the call itself.

I looked at the IF_RWR_MRD_CNS section above, and it does not handle encodeReg3456 in the if (Is4ByteSSE4OrAVXInstruction(ins)) path.

Ah, I see.

A comment (or explicit assert) that Is4ByteSSE4OrAVXInstruction(ins) is always true would be beneficial.

Thanks, will do.

Done. Added a comment below.

CarolEidt

This mostly LGTM, but there's one naming & comment change that I think is worth making.

CarolEidt · 2018-03-22T00:51:25Z

src/jit/hwintrinsicxarch.cpp

@@ -1071,6 +1090,145 @@ GenTree* Compiler::impAvxOrAvx2Intrinsic(NamedIntrinsic        intrinsic,

    switch (intrinsic)
    {
+        case NI_AVX_Extract:
+        {
+            // Avx.Extract executes software implementation when the imm8 argument is not complie-time constant


typo: should be "compile-time"

CarolEidt · 2018-03-22T00:59:46Z

src/jit/hwintrinsicxarch.cpp

@@ -1058,6 +1058,25 @@ GenTree* Compiler::impSSE42Intrinsic(NamedIntrinsic        intrinsic,
    return retNode;
 }

+//------------------------------------------------------------------------
+// gethalfAndNormalizedIndex: compute the middle index of a Vector256<baseType>


This name and the name of the method don't match. I actually think that normalizeAndGetHalfIndex would be clearer; "mid" doesn't really mean anything to me.
Also, the comment needs to describe the modified indexPtr as a return value, or at least describe that it is an "out" parameter that returns the normalized index, while the actual return value is the half index.

fiigii · 2018-03-22T02:20:15Z

test Windows_NT x64 Checked jitincompletehwintrinsic
test Windows_NT x64 Checked jitx86hwintrinsicnoavx2
test Windows_NT x64 Checked jitx86hwintrinsicnosimd

test Windows_NT x86 Checked jitincompletehwintrinsic
test Windows_NT x86 Checked jitx86hwintrinsicnoavx2
test Windows_NT x86 Checked jitx86hwintrinsicnosimd

test Ubuntu x64 Checked jitincompletehwintrinsic
test Ubuntu x64 Checked jitx86hwintrinsicnoavx2
test Ubuntu x64 Checked jitx86hwintrinsicnosimd

fiigii · 2018-03-22T02:22:10Z

@CarolEidt Thanks for the review. I have fixed the function name and comments.

CarolEidt

LGTM - thanks for the comments!

fiigii force-pushed the avxhelper branch from 31aed8c to 46549e9 Compare March 19, 2018 07:51

tannergooding reviewed Mar 19, 2018

CarolEidt approved these changes Mar 19, 2018

tannergooding reviewed Mar 19, 2018

fiigii force-pushed the avxhelper branch from 46549e9 to a46d532 Compare March 19, 2018 21:16

fiigii commented Mar 19, 2018

fiigii force-pushed the avxhelper branch from a46d532 to 2d650dc Compare March 19, 2018 22:26

fiigii mentioned this pull request Mar 20, 2018

Implement more AVX/AVX2 intrinsics #16955

Merged

fiigii force-pushed the avxhelper branch from 1b6329d to 9173932 Compare March 21, 2018 00:09

tannergooding reviewed Mar 21, 2018

Add more tests for AVX Insert/Extract intrinsics

d5716e4

fiigii force-pushed the avxhelper branch 2 times, most recently from 3f1705c to 0aecc63 Compare March 21, 2018 19:21

tannergooding reviewed Mar 21, 2018

fiigii force-pushed the avxhelper branch from 0aecc63 to fb08741 Compare March 21, 2018 21:53

CarolEidt suggested changes Mar 22, 2018

FeiPengIntel added 5 commits March 21, 2018 19:11

Optimize AVX Insert/Extract intrinsics

37c716b

Implement AVX SetVector256

744940a

Implement SetAllVector256

e938ad3

fix Set tests on 32-bit platforms

0c90eb4

Add emitIns_AR_R_I for vextracti/f128

c488262

fiigii force-pushed the avxhelper branch from fb08741 to c488262 Compare March 22, 2018 02:13

tannergooding approved these changes Mar 22, 2018

CarolEidt approved these changes Mar 22, 2018

tannergooding merged commit 8f4a5e5 into dotnet:master Mar 22, 2018

fiigii deleted the avxhelper branch March 22, 2018 16:48

fiigii mentioned this pull request Mar 29, 2018

Implement AVX SetHighLow #17313

Merged

