Allow pregenerating most HW intrinsics in CoreLib #24917
Conversation
There's a slight regression when tiering is disabled, since before this change the methods that use the intrinsics would always be JITted. With tiered compilation enabled, the difference seems to be just noise.
Test program:

    using System;
    using System.Diagnostics;
    using System.Text.Unicode;

    class Program
    {
        static void Main()
        {
            string source = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis a auctor sem. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Integer non sagittis sem, ut feugiat eros. Praesent commodo ligula et magna efficitur, nec tempus ex elementum. Sed condimentum hendrerit massa ut rhoncus.";
            byte[] destination = new byte[1024];
            Utf8.FromUtf16(source, destination, out int _, out int _);

            for (int x = 0; x < 10; x++)
            {
                Stopwatch sw = Stopwatch.StartNew();
                for (int i = 0; i < 10000000; i++)
                {
                    Utf8.FromUtf16(source, destination, out int _, out int _);
                }
                Console.WriteLine(sw.ElapsedMilliseconds);
            }
        }
    }
Cc @sergiy-k
This is a follow up to dotnet#24689 that lets us pregenerate all hardware intrinsics in CoreLib. We ensure the potentially unsupported code will never be reachable at runtime on CPUs that don't support it by not reporting the `IsSupported` property as intrinsic in crossgen. This ensures the support checks are always JITted. JITting the support checks is very cheap: there is a cost in the form of an extra call and a failure to do constant propagation of the return value, but the cost is negligible in practice and gets eliminated once the tiered JIT tiers the method up. We only do this in CoreLib because user code could technically fail to guard intrinsic use with `IsSupported` checks, and pregenerating that code could lead to illegal instruction traps at runtime (instead of `PlatformNotSupportedException` throws) - a bad user experience.
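For context, the guard pattern this relies on is a minimal sketch like the following (the `Lzcnt` APIs and the `IsSupported` guard are real; the wrapper method and its software fallback are illustrative):

    using System.Runtime.Intrinsics.X86;

    static class GuardedIntrinsicExample
    {
        public static uint LeadingZeroCount(uint value)
        {
            if (Lzcnt.IsSupported)
            {
                // Crossgen leaves IsSupported as a real (JITted) call, so this
                // branch is only reachable at runtime on CPUs that have LZCNT.
                return Lzcnt.LeadingZeroCount(value);
            }

            // Software fallback for CPUs without LZCNT.
            uint count = 0;
            while (count < 32 && (value & 0x80000000) == 0)
            {
                value <<= 1;
                count++;
            }
            return count;
        }
    }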
Force-pushed from 29a34e8 to cef2048.
    if (m_pEECompileInfo->GetAssemblyModule(m_hAssembly) == m_pEECompileInfo->GetLoaderModuleForMscorlib())
    {
        m_pOpt->m_compilerFlags.Set(CORJIT_FLAGS::CORJIT_FLAG_FEATURE_SIMD);

    #if defined(_TARGET_X86_) || defined(_TARGET_AMD64_)
This could potentially cause problems. The JIT currently decides to emit the VEX or non-VEX encoding based on whether or not AVX is supported.
This isn't going to correctly handle that difference for an SSE-exclusive vs. an AVX-checked code path, which will prevent the generated code from running on some downstream processors.
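To make the encoding concern concrete, here is a sketch (the method is illustrative; the exact instructions and registers are up to the JIT):

    using System.Runtime.Intrinsics;
    using System.Runtime.Intrinsics.X86;

    static class EncodingExample
    {
        static Vector128<float> Sum(Vector128<float> a, Vector128<float> b)
        {
            // Without AVX, the JIT emits the legacy SSE encoding, e.g. "addps xmm0, xmm1".
            // With AVX, it emits the VEX encoding, e.g. "vaddps xmm0, xmm0, xmm1".
            // A pregenerated image has to commit to one encoding at compile time.
            return Sse.Add(a, b);
        }
    }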
If we keep the "do not crossgen" behavior for AVX/FMA/AVX2, would that fix the issue?
I think the safest thing to do for now would be to effectively treat the AOT codegen as if only `SSE` and `SSE2` had been reported as supported by the CPUID checks the VM does, and to treat everything else as "not supported" (we should likely also consider whether or not crossgen ignores the `COMPlus_EnableISA=0` flags).
That should mitigate the startup time regressions, and the throughput impact should fairly quickly be resolved by rejitting the hot methods.
Anything else is going to have a larger potential for a bug tail or may impact perf in interesting ways (hurting benchmark numbers, for example) and likely needs more investigation before we settle on a solution.
I confess that this hadn't occurred to me. I'd like to see us have the ability to include runtime-checked dual-version code, but I'm pretty sure the JIT can't generate both SSE and VEX encodings in the same method.
I'm not sure why we couldn't also handle the other Sse* intrinsics, with the assumption that such code will (as expected) be guarded with the appropriate `IsSupported` runtime call.
> I'm not sure why we couldn't also handle the other Sse* intrinsics, with the assumption that such code will (as expected) be guarded with the appropriate `IsSupported` runtime call.
I agree.
Long term, we should be able to support a static AOT target or one of a couple of forms of dynamic dispatch. The two most common that I am aware of are cached CPUID checks (which result in a branch per ISA target) and changing the method target (which involves fixing up which target the method points to).
But those are post-3.0 goals and likely require some fixups to support generating both VEX and non-VEX code paths, for example.
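As a rough sketch of the method-target form in C# terms (the names are hypothetical, and a real implementation would fix up the target at a lower level than a delegate):

    using System;
    using System.Runtime.Intrinsics.X86;

    static class MethodTargetDispatch
    {
        // Hypothetical: the target is selected once, at type initialization,
        // based on the CPUID-derived IsSupported result.
        private static readonly Func<uint, uint> s_leadingZeroCount =
            Lzcnt.IsSupported ? new Func<uint, uint>(LzcntImpl)
                              : new Func<uint, uint>(SoftwareImpl);

        public static uint LeadingZeroCount(uint value) => s_leadingZeroCount(value);

        private static uint LzcntImpl(uint value) => Lzcnt.LeadingZeroCount(value);

        private static uint SoftwareImpl(uint value)
        {
            // Scan from the most significant bit down.
            uint count = 0;
            for (uint bit = 0x80000000; bit != 0 && (value & bit) == 0; bit >>= 1)
            {
                count++;
            }
            return count;
        }
    }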
src/zap/zapinfo.cpp (Outdated)
    #if defined(_TARGET_X86_) || defined(_TARGET_AMD64_)
        if (attribs & CORINFO_FLG_JIT_INTRINSIC)
        {
            // Do not report the get_IsSupported method as an intrinsic. This will turn the call into a regular
This is likely going to improve startup but hurt throughput for some of the core functions. That will likely be addressed by rejitting fairly quickly, but it's still something to keep in mind.
Long term, if this type of dynamic dispatch is supported, it should likely be changed so that the startup logic caches the CPUID checks and these resolve to inlined checks of those cached results (which is how the C runtime library handles this, for example).
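In C# terms, the cached-check flavor might look like this sketch (the names are hypothetical; the point is that each call site pays an inlined load-and-branch on a cached flag rather than an opaque call):

    using System.Runtime.Intrinsics.X86;

    static class CachedCpuChecks
    {
        // Hypothetical cache: the CPUID-derived answer is computed once at
        // startup and stored in a static field.
        private static readonly bool s_hasPopcnt = Popcnt.IsSupported;

        public static uint PopCount(uint value)
        {
            if (s_hasPopcnt)
            {
                return Popcnt.PopCount(value);
            }

            // Software fallback: SWAR population count.
            value -= (value >> 1) & 0x55555555;
            value = (value & 0x33333333) + ((value >> 2) & 0x33333333);
            return (((value + (value >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;
        }
    }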
Another alternative might be to just have these be treated as intrinsic for the purposes of AOT, except for when it is a recursive call.
That is, when called from another method, the intrinsic would be jitted normally. However, the intrinsic calls themselves would be marked with "requires jitting" so that reflection and other scenarios continue working as expected.
Do you have a specific scenario in mind? I can add a benchmark for it (see the benchmark above where I picked one of the affected places). The extra call to a method that returns a constant 1 or 0 didn't hurt perf much.
It's likely dependent on how the code was set up and where the ISA check exists. If you have a helper method that was providing a software fallback path (such as `BitOperations.LeadingZeroCount`) and you are calling it in a loop, you are now going to have an additional call and branch per loop iteration, which can significantly hurt throughput.
If you just have a centralized check that handles a large block of logic, then the cost of the call is going to be amortized (as was the case for the benchmark above).
True, `BitOperations.LeadingZeroCount` in a loop is probably the most pathological case.
I used this to benchmark it:
    using System;
    using System.Diagnostics;
    using System.Numerics;
    using System.Runtime.CompilerServices;

    class Program
    {
        [MethodImpl(MethodImplOptions.AggressiveOptimization)]
        static void Main()
        {
            int sum = 0;
            for (int x = 0; x < 10; x++)
            {
                Stopwatch sw = Stopwatch.StartNew();
                for (int i = 0; i < 1000000000; i++)
                {
                    sum += BitOperations.LeadingZeroCount((uint)i);
                }
                Console.WriteLine(sw.ElapsedMilliseconds);
            }
            sum.ToString();
        }
    }
I'm not able to measure a slowdown.
| | Default | TieredCompilation=0 |
| --- | --- | --- |
| After | 511 | 381 |
| Before | 510 | 381 |
Maybe my CPU is just really good at optimizing that extra call and branch away.
I'm a bit surprised disabling tiering makes this so much faster, given I decorated the method with `AggressiveOptimization`. Cc @kouvel on that.
I wasn't able to repro this initially on a recent SDK, but eventually I was able to repro by removing the `AggressiveOptimization` flag and running with `COMPlus_ReadyToRun=0`, even though that doesn't make sense. On my machine at the moment anyway, the numbers are similar to yours but opposite, where it's faster with tiered compilation enabled. The code generated is identical, there doesn't seem to be any alignment issue in the small hot loop, and the method is slightly differently aligned. Looks like in the slow version there is a decoding stall somewhere; I haven't looked into the details.
`AggressiveOptimization` would also prevent prejitting the method, such that it gets jitted at runtime with full opts as quickly as possible.
In the second iteration, I restored the old behavior for Avx, Avx2 and FMA intrinsics (we won't pass the flags to RyuJIT and we'll prevent these from loading). Methods that use these don't show up in the PowerShell startup path anyway.
What else is required before this change can be merged?
I think I addressed all code review feedback at this point.
Looks good to me except for one nit about a comment.
src/zap/zapper.cpp (Outdated)
    m_pOpt->m_compilerFlags.Set(CORJIT_FLAGS::CORJIT_FLAG_USE_POPCNT);
    // Leaving out CORJIT_FLAGS::CORJIT_FLAG_USE_AVX, CORJIT_FLAGS::CORJIT_FLAG_USE_FMA
    // CORJIT_FLAGS::CORJIT_FLAG_USE_AVX2 on purpose - these enable JIT to use VEX encoding
    // and that's not safe.
I would change this comment to say "these require VEX encodings and the JIT doesn't support generating code for methods with mixed encodings."
I ended up writing down the expected behavior after this pull request (still have to push the commits out that do this - what's in the branch now doesn't match). This also captures what I did in the previous pull request.
That reads correct to me.
You might clarify the above to also include:
> As per #24917 (comment), the ARM `IsSupported` checks should be constant on x86, as should the x86 `IsSupported` checks on ARM.
Just to confirm: the ARM intrinsics on x86 are not intrinsic, so they are treated as regular calls - correct? (The same goes for x86 intrinsics on ARM.)
src/zap/zapinfo.cpp (Outdated)
    - // If it's IsSupported on Sse/Sse2, we can expand unconditionally since this is reliably
    + // If it's anything related to Sse/Sse2, we can expand unconditionally since this is reliably
      bool fIsX86intrinsic = strcmp(namespaceName, "System.Runtime.Intrinsics.X86") == 0;
nit: It would probably be better to say: "since this is a baseline requirement for CoreCLR".
      return attribs;

    + // Treat as a regular method call (into a JITted method).
    + return (attribs & ~CORINFO_FLG_JIT_INTRINSIC) | CORINFO_FLG_DONT_INLINE;
Is this going to be correct? The calls are recursive, and the `mustExpand` support is dependent on `CORINFO_FLG_JIT_INTRINSIC` being set.
So, if it is recursive and `CORINFO_FLG_JIT_INTRINSIC` isn't set, it will be AOT'd as if it were a regular recursive call...
What we want is the caller to treat these as regular, non-inline method calls and for the JIT to not AOT the method bodies of the intrinsics themselves.
Isn't that what `CORINFO_FLG_DONT_INLINE` will fix up?
The recursive call within the actual intrinsic method body will never go through here, because we don't AOT compile them.
We have a compiler check for "is compiling for R2R/crossgen/AOT", right?
It might be simpler to just update the logic in the `Compiler::impHWIntrinsic` method (see https://github.com/dotnet/coreclr/blob/master/src/jit/hwintrinsicxarch.cpp#L61-L70) to do the following:

      bool isIsaSupported = comp->compSupports(isa) && comp->compSupportsHWIntrinsic(isa);
      if (strcmp(methodName, "get_IsSupported") == 0)
      {
          return isIsaSupported ? NI_IsSupported_True : NI_IsSupported_False;
      }
      else if (!isIsaSupported)
      {
    +     if (isCompilingForAOT)
    +     {
    +         return nullptr;
    +     }
          return NI_Throw_PlatformNotSupportedException;
      }
I believe that would allow us to treat the AVX/AVX2/FMA/BMI1/BMI2 code paths as normal (you don't need to mark them as `DONT_INLINE` or strip the `JIT_INTRINSIC` flag).
The existing logic for not expanding method bodies would kick in here: 6b5b161#diff-04a590b55693db432b3a24d0d9a59cceR451
And the added check would ensure that unsupported ISAs (which these won't be, due to us not setting the compiler flags in the zapper here: 6b5b161#diff-efefdb418a1b8be4f78c7c9aedf91cb4L1212) will return `nullptr` (forcing them to be emitted as regular method calls).
@tannergooding is correct here - the intrinsics are recognized during importing, prior to the call being considered for inlining.
I think it makes some sense to unconditionally fail to expand (i.e. return `nullptr`) for these when pre-jitting (aka AOT compiling). I don't think the JIT can distinguish when it's compiling SPC.dll vs. something else, but I don't think it makes a difference in practice, since any intrinsics that are called under the check for `IsSupported` will fail to pre-jit anyway.
> I don't believe so. Intrinsics are special and are handled before the inlining logic.

But this line strips the intrinsic bit from the method. It should be treated as a simple method call, no?
That's what we want for BMI and friends - keep `IsSupported` untouched and reported as an intrinsic so that it expands to `false`, but treat everything else as a regular non-inlined method call. The fact that the IL is recursive shouldn't matter, because RyuJIT is not going to look at it due to the no-inline flag, no?
> the fact that the IL is recursive shouldn't matter because RyuJIT is not going to look at it due to the no inline flag, no?

The problem ends up coming in when we go to AOT those methods. You stripped the `JIT_INTRINSIC` flag, so the logic for "skip generating hardware intrinsic method bodies" won't kick in. Which means we will AOT the BMI and friends method bodies as regular recursive calls.
> You stripped the JIT_INTRINSIC flag so the logic for "skip generating hardware intrinsic method bodies" won't kick in

Attributes in that spot are retrieved from the standard JitInterface - we don't call the `getMethodAttribs` that messes with the intrinsic bit there.
Precompiling method bodies on e.g. the Bmi1 class would be wrong.
Ah, ok. That definitely wasn't clear to me. A comment calling it out might be helpful.
src/zap/zapinfo.cpp (Outdated)

    // call. We also make sure none of the hardware intrinsic method bodies get pregenerated in crossgen
    // (see ZapInfo::CompileMethod) but get JITted instead. The JITted method will have the correct
    // answer for the CPU the code is running on.
    if (fIsGetIsSupportedMethod &&
As per #24917 (comment), I think the ARM `IsSupported` checks should be constant on x86, as should the x86 `IsSupported` checks on ARM.
Isn't that what this code is doing?
This logic is going to strip the intrinsic flag - so we're making sure we only strip it for x86 intrinsics on x86 and arm64 intrinsics on arm64.
Ah, right. I misread the combination of the comment plus the if checks. 👍
    strcmp(namespaceName, "System.Runtime.Intrinsics") == 0)
    {
        // Treat as a regular method call (into a JITted method).
        return (attribs & ~CORINFO_FLG_JIT_INTRINSIC) | CORINFO_FLG_DONT_INLINE;
Same concern as I called out above. Intrinsics are special and are handled in the importer before inlining checks, and the recursive calls are dependent on `mustExpand` (which itself depends on the `JIT_INTRINSIC` flag).
It might be better to just fix up the `Compiler::impIntrinsic` method to have an `if (isCompilingForAOT)` path for the `IsSupported` checks, which returns `nullptr` rather than a `gtNewIconNode` (see https://github.com/dotnet/coreclr/blob/master/src/jit/importer.cpp#L3486-L3494, as we handle the `IsSupported` and `PlatformNotSupportedException` intrinsics a bit earlier).
That makes sense to me, though note that the JIT doesn't know whether it's SPC.dll or not.
Is that actually a concern now that we have a mechanism to handle functions which consume AVX/AVX2/FMA/BMI1/BMI2 methods, so things work even if they are called without an `IsSupported` check?
I believe the "worst case" scenario is someone doesn't use the check and calls, for example, `Sse41.DotProduct` on a processor that doesn't support it. We will have emitted the correct instruction encoding, and that will trigger a hardware exception as the instruction isn't supported.
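As a concrete sketch of that worst case (the wrapper method is illustrative; `Sse41.DotProduct` is the real API):

    using System.Runtime.Intrinsics;
    using System.Runtime.Intrinsics.X86;

    static class UnguardedExample
    {
        static float Dot(Vector128<float> a, Vector128<float> b)
        {
            // BAD: no Sse41.IsSupported guard. In a pregenerated image the
            // DPPS instruction would be emitted unconditionally, so on a CPU
            // without SSE4.1 this faults with an illegal-instruction trap
            // instead of throwing PlatformNotSupportedException.
            return Sse41.DotProduct(a, b, 0xFF).ToScalar();
        }
    }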
There is also the chance of subtle bugs if users don't provide software fallbacks, but I think it might be better to just have that be upfront in the initial release. If we ever want to make this more generally available, users will always have that scenario.
Right - I don't think it is a concern that we don't specially handle SPC.dll here.

> I believe the "worst case" scenario is someone doesn't use the check and calls, for example Sse41.DotProduct on a processor that doesn't support it.

But the code won't work in any event. I guess if they were actually relying on getting PNSE it might. I suppose we could trap the hardware exception and throw PNSE, but I'm not sure that complication is justified.
I agree, I think it would be better to encourage people to have the appropriate `IsSupported` checks 😄
Updated comments per the conversations above. I think all the open issues have been addressed at this point.
LGTM.
As I called out here, I think that all of the initial concerns around enabling this for non-S.P.CoreLib binaries have actually been addressed now.
So, I think it may be worth considering either:
A. Allowing this for everything by default (we can always flip it back off if we get feedback)
B. Providing a switch allowing users to explicitly opt into AOT'ing their methods which contain HWIntrinsics (you could also provide an opt-out switch if option A was chosen)
LGTM - thanks for your patience @MichalStrehovsky!
I would suggest that initially we offer this as an opt-in switch, and we could possibly change that in the future depending on usage & feedback.
That sounds good to me, if other people agree (and it could be done as a secondary PR). That would also allow CoreFX to use this as well, which would likely improve perf in a couple of other places.
I think we would still not be able to lift the restriction around loading the Vector64/128/256 types inside the crossgen process for things outside S.P.CoreLib, which somewhat limits the usefulness of the switch (even for CoreFX) - the same problem also applies to the switches requested in #20034. I think we block those types in crossgen because we expect an ABI-breaking change around them in the future and we don't want to invalidate older R2R images.
Yes, very good point, though I hadn't realized that we supported retaining R2R images across versions. Perhaps it's something to keep in mind for the next release, after we've added support for the vector aspects of the ABIs.
I don't think we are going to be able to keep this contract across every major version (especially when roll forward across major versions isn't supported). I think it would be fine, though: given that the release next year (the earliest we would ship the vector ABI changes) is .NET 5, where we are likely going to have some non-trivial changes anyway, I don't think there will be any negative fallout from this.
* Allow pregenerating all HW intrinsics in CoreLib This is a follow up to dotnet/coreclr#24689 that lets us pregenerate all hardware intrinsics in CoreLib. We ensure the potentially unsupported code will never be reachable at runtime on CPUs that don't support it by not reporting the `IsSupported` property as intrinsic in crossgen. This ensures the support checks are always JITted. JITting the support checks is very cheap. There is a cost in the form of an extra call and a failure to do constant propagation of the return value, but the cost is negligible in practice and gets eliminated once the tiered JIT tiers the method up. We only do this in CoreLib because user code could technically fail to guard intrinsic use with `IsSupported` checks, and pregenerating the code could lead to illegal instruction traps at runtime (instead of `PlatformNotSupportedException` throws) - it's a bad user experience. Commit migrated from dotnet/coreclr@e73c8e6