Allow pregenerating most HW intrinsics in CoreLib #24917
Conversation
There's a slight regression when tiering is disabled, since before this change the methods that use the intrinsics would always be JITted. With tiered compilation enabled, the difference seems to be just noise.
Test program:

    using System;
    using System.Diagnostics;
    using System.Text.Unicode;

    class Program
    {
        static void Main()
        {
            string source = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis a auctor sem. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Integer non sagittis sem, ut feugiat eros. Praesent commodo ligula et magna efficitur, nec tempus ex elementum. Sed condimentum hendrerit massa ut rhoncus.";
            byte[] destination = new byte[1024];
            Utf8.FromUtf16(source, destination, out int _, out int _);

            for (int x = 0; x < 10; x++)
            {
                Stopwatch sw = Stopwatch.StartNew();
                for (int i = 0; i < 10000000; i++)
                {
                    Utf8.FromUtf16(source, destination, out int _, out int _);
                }
                Console.WriteLine(sw.ElapsedMilliseconds);
            }
        }
    }
Cc @sergiy-k
This is a follow up to dotnet#24689 that lets us pregenerate all hardware intrinsics in CoreLib. We ensure the potentially unsupported code will never be reachable at runtime on CPUs that don't support it by not reporting the `IsSupported` property as intrinsic in crossgen. This ensures the support checks are always JITted. JITting the support checks is very cheap: there is a cost in the form of an extra call and a failure to do constant propagation of the return value, but the cost is negligible in practice and gets eliminated once the tiered JIT tiers the method up. We only do this in CoreLib because user code could technically fail to guard intrinsic use with `IsSupported` checks, and pregenerating that code could lead to illegal instruction traps at runtime (instead of `PlatformNotSupportedException` throws) - a bad user experience.
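For context, the guard pattern this relies on is a minimal sketch like the following (the `Lzcnt` APIs and the `IsSupported` guard are real; the wrapper method and its software fallback are illustrative):

    using System.Runtime.Intrinsics.X86;

    static class GuardedIntrinsicExample
    {
        public static uint LeadingZeroCount(uint value)
        {
            if (Lzcnt.IsSupported)
            {
                // Crossgen leaves IsSupported as a real (JITted) call, so this
                // branch is only reachable at runtime on CPUs that have LZCNT.
                return Lzcnt.LeadingZeroCount(value);
            }

            // Software fallback for CPUs without LZCNT.
            uint count = 0;
            while (count < 32 && (value & 0x80000000) == 0)
            {
                value <<= 1;
                count++;
            }
            return count;
        }
    }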
Force-pushed from 29a34e8 to cef2048.
    if (m_pEECompileInfo->GetAssemblyModule(m_hAssembly) == m_pEECompileInfo->GetLoaderModuleForMscorlib())
    {
        m_pOpt->m_compilerFlags.Set(CORJIT_FLAGS::CORJIT_FLAG_FEATURE_SIMD);

    #if defined(_TARGET_X86_) || defined(_TARGET_AMD64_)
This could potentially cause problems. The JIT currently decides to emit the VEX or non-VEX encoding based on whether or not AVX is supported.
This isn't going to correctly handle that difference for an SSE-exclusive vs. an AVX-checked code path, which will prevent the generated code from running on some downstream processors.
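To make the encoding concern concrete, here is a sketch (the method is illustrative; the exact instructions and registers are up to the JIT):

    using System.Runtime.Intrinsics;
    using System.Runtime.Intrinsics.X86;

    static class EncodingExample
    {
        static Vector128<float> Sum(Vector128<float> a, Vector128<float> b)
        {
            // Without AVX, the JIT emits the legacy SSE encoding, e.g. "addps xmm0, xmm1".
            // With AVX, it emits the VEX encoding, e.g. "vaddps xmm0, xmm0, xmm1".
            // A pregenerated image has to commit to one encoding at compile time.
            return Sse.Add(a, b);
        }
    }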
If we keep the "do not crossgen" behavior for AVX/FMA/AVX2, would that fix the issue?
I think the safest thing to do for now would be to effectively treat the AOT codegen as if only `SSE` and `SSE2` had been reported as supported by the CPUID checks the VM does, and to treat everything else as "not supported" (we should likely also consider whether or not crossgen ignores the `COMPlus_EnableISA=0` flags).
That should mitigate the startup time regressions, and the throughput impact should fairly quickly be resolved by rejitting the hot methods.
Anything else is going to have a larger potential for a bug tail or may impact perf in interesting ways (hurting benchmark numbers, for example) and likely needs more investigation before we settle on a solution.
I confess that this hadn't occurred to me. I'd like to see us have the ability to include runtime-checked dual-version code, but I'm pretty sure the JIT can't generate both SSE and VEX encodings in the same method.
I'm not sure why we couldn't also handle the other Sse* intrinsics, with the assumption that such code will (as expected) be guarded with the appropriate `IsSupported` runtime call.
> I'm not sure why we couldn't also handle the other Sse* intrinsics, with the assumption that such code will (as expected) be guarded with the appropriate `IsSupported` runtime call.
I agree.
Long term, we should be able to support a static AOT target or one of a couple of forms of dynamic dispatch. The two most common that I am aware of are cached CPUID checks (which result in a branch per ISA target) and changing the method target (which involves fixing up which target the method points to).
But those are post-3.0 goals and likely require some fixups to support generating both VEX and non-VEX code paths, for example.
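As a rough sketch of the method-target form in C# terms (the names are hypothetical, and a real implementation would fix up the target at a lower level than a delegate):

    using System;
    using System.Runtime.Intrinsics.X86;

    static class MethodTargetDispatch
    {
        // Hypothetical: the target is selected once, at type initialization,
        // based on the CPUID-derived IsSupported result.
        private static readonly Func<uint, uint> s_leadingZeroCount =
            Lzcnt.IsSupported ? new Func<uint, uint>(LzcntImpl)
                              : new Func<uint, uint>(SoftwareImpl);

        public static uint LeadingZeroCount(uint value) => s_leadingZeroCount(value);

        private static uint LzcntImpl(uint value) => Lzcnt.LeadingZeroCount(value);

        private static uint SoftwareImpl(uint value)
        {
            // Scan from the most significant bit down.
            uint count = 0;
            for (uint bit = 0x80000000; bit != 0 && (value & bit) == 0; bit >>= 1)
            {
                count++;
            }
            return count;
        }
    }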
src/zap/zapinfo.cpp (Outdated)
    #if defined(_TARGET_X86_) || defined(_TARGET_AMD64_)
        if (attribs & CORINFO_FLG_JIT_INTRINSIC)
        {
            // Do not report the get_IsSupported method as an intrinsic. This will turn the call into a regular
This is likely going to improve startup but hurt throughput for some of the core functions. That will likely be addressed by rejitting fairly quickly, but it's still something to keep in mind.
Long term, if this type of dynamic dispatch is supported, it should likely be changed so that the startup logic caches the CPUID checks and these resolve to inlined checks of those cached results (which is how the C runtime library handles this, for example).
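In C# terms, the cached-check flavor might look like this sketch (the names are hypothetical; the point is that each call site pays an inlined load-and-branch on a cached flag rather than an opaque call):

    using System.Runtime.Intrinsics.X86;

    static class CachedCpuChecks
    {
        // Hypothetical cache: the CPUID-derived answer is computed once at
        // startup and stored in a static field.
        private static readonly bool s_hasPopcnt = Popcnt.IsSupported;

        public static uint PopCount(uint value)
        {
            if (s_hasPopcnt)
            {
                return Popcnt.PopCount(value);
            }

            // Software fallback: SWAR population count.
            value -= (value >> 1) & 0x55555555;
            value = (value & 0x33333333) + ((value >> 2) & 0x33333333);
            return (((value + (value >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;
        }
    }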
Another alternative might be to just have these be treated as intrinsic for the purposes of AOT, except for when it is a recursive call.
That is, when called from another method, the intrinsic would be jitted normally. However, the intrinsic calls themselves would be marked with "requires jitting" so that reflection and other scenarios continue working as expected.
Do you have a specific scenario in mind? I can add a benchmark for it (see the benchmark above where I picked one of the affected places). The extra call to a method that returns a constant 1 or 0 didn't hurt perf much.
It's likely dependent on how the code was set up and where the ISA check exists. If you have a helper method that was providing a software fallback path (such as `BitOperations.LeadingZeroCount`) and you are calling it in a loop, you are now going to have an additional call and branch per loop iteration, which can significantly hurt throughput.
If you just have a centralized check that handles a large block of logic, then the cost of the call is going to be amortized (as was the case for the benchmark above).
True, `BitOperations.LeadingZeroCount` in a loop is probably the most pathological case.
I used this to benchmark it:
    using System;
    using System.Diagnostics;
    using System.Numerics;
    using System.Runtime.CompilerServices;

    class Program
    {
        [MethodImpl(MethodImplOptions.AggressiveOptimization)]
        static void Main()
        {
            int sum = 0;
            for (int x = 0; x < 10; x++)
            {
                Stopwatch sw = Stopwatch.StartNew();
                for (int i = 0; i < 1000000000; i++)
                {
                    sum += BitOperations.LeadingZeroCount((uint)i);
                }
                Console.WriteLine(sw.ElapsedMilliseconds);
            }
            sum.ToString();
        }
    }
I'm not able to measure a slowdown.
| | Default | TieredCompilation=0 |
| --- | --- | --- |
| After | 511 | 381 |
| Before | 510 | 381 |
Maybe my CPU is just really good at optimizing that extra call and branch away.
I'm a bit surprised disabling tiering makes this so much faster, given I decorated the method with `AggressiveOptimization`. Cc @kouvel on that.
I wasn't able to repro this initially on a recent SDK, but eventually I was able to repro by removing the `AggressiveOptimization` flag and running with `COMPlus_ReadyToRun=0`, even though that doesn't make sense. On my machine at the moment anyway, the numbers are similar to yours but opposite, where it's faster with tiered compilation enabled. The code generated is identical, there doesn't seem to be any alignment issue in the small hot loop, and the method is slightly differently aligned. Looks like in the slow version there is a decoding stall somewhere; I haven't looked into the details.
`AggressiveOptimization` would also prevent prejitting the method, such that it gets jitted at runtime with full opts as quickly as possible.
In the second iteration, I restored the old behavior for Avx, Avx2 and FMA intrinsics (we won't pass the flags to RyuJIT and we'll prevent these from loading). Methods that use these don't show up in the PowerShell startup path anyway.
What else is required before this change can be merged?
I think I addressed all code review feedback at this point.
Looks good to me except for one nit about a comment.
src/zap/zapper.cpp (Outdated)
    m_pOpt->m_compilerFlags.Set(CORJIT_FLAGS::CORJIT_FLAG_USE_POPCNT);
    // Leaving out CORJIT_FLAGS::CORJIT_FLAG_USE_AVX, CORJIT_FLAGS::CORJIT_FLAG_USE_FMA
    // CORJIT_FLAGS::CORJIT_FLAG_USE_AVX2 on purpose - these enable JIT to use VEX encoding
    // and that's not safe.
I would change this comment to say "these require VEX encodings and the JIT doesn't support generating code for methods with mixed encodings."
I ended up writing down the expected behavior after this pull request (still have to push the commits out that do this - what's in the branch now doesn't match). This also captures what I did in the previous pull request.
That reads correct to me.
You might clarify the above to also include:
> As per #24917 (comment), the ARM `IsSupported` checks should be constant on x86, as should the x86 `IsSupported` checks on ARM.
Just to confirm: the ARM intrinsics on x86 are not intrinsic, so they are treated as regular calls - correct? (The same goes for x86 intrinsics on ARM.)
src/zap/zapinfo.cpp (Outdated)
    - // If it's IsSupported on Sse/Sse2, we can expand unconditionally since this is reliably
    + // If it's anything related to Sse/Sse2, we can expand unconditionally since this is reliably
      bool fIsX86intrinsic = strcmp(namespaceName, "System.Runtime.Intrinsics.X86") == 0;
nit: It would probably be better to say: "since this is a baseline requirement for CoreCLR".
      return attribs;

    + // Treat as a regular method call (into a JITted method).
    + return (attribs & ~CORINFO_FLG_JIT_INTRINSIC) | CORINFO_FLG_DONT_INLINE;
Is this going to be correct? The calls are recursive, and the `mustExpand` support is dependent on `CORINFO_FLG_JIT_INTRINSIC` being set.
So, if it is recursive and `CORINFO_FLG_JIT_INTRINSIC` isn't set, it will be AOT'd as if it were a regular recursive call...
What we want is the caller to treat these as regular, non-inline method calls and for the JIT to not AOT the method bodies of the intrinsics themselves.
Isn't that what `CORINFO_FLG_DONT_INLINE` will fix up?
The recursive call within the actual intrinsic method body will never go through here, because we don't AOT compile them.
We have a compiler check for "is compiling for R2R/crossgen/AOT", right?
It might be simpler to just update the logic in the `Compiler::impHWIntrinsic` method (see https://github.com/dotnet/coreclr/blob/master/src/jit/hwintrinsicxarch.cpp#L61-L70) to do the following:

      bool isIsaSupported = comp->compSupports(isa) && comp->compSupportsHWIntrinsic(isa);
      if (strcmp(methodName, "get_IsSupported") == 0)
      {
          return isIsaSupported ? NI_IsSupported_True : NI_IsSupported_False;
      }
      else if (!isIsaSupported)
      {
    +     if (isCompilingForAOT)
    +     {
    +         return nullptr;
    +     }
          return NI_Throw_PlatformNotSupportedException;
      }
I believe that would allow us to treat the AVX/AVX2/FMA/BMI1/BMI2 code paths as normal (you don't need to mark them as `DONT_INLINE` or strip the `JIT_INTRINSIC` flag).
The existing logic for not expanding method bodies would kick in here: 6b5b161#diff-04a590b55693db432b3a24d0d9a59cceR451
And the added check would ensure that unsupported ISAs (which these won't be, due to us not setting the compiler flags in the zapper here: 6b5b161#diff-efefdb418a1b8be4f78c7c9aedf91cb4L1212) will return `nullptr` (forcing them to be emitted as regular method calls).
@tannergooding is correct here - the intrinsics are recognized during importing, prior to the call being considered for inlining.
I think it makes some sense to unconditionally fail to expand (i.e. return `nullptr`) for these when pre-jitting (aka AOT compiling). I don't think the JIT can distinguish when it's compiling SPC.dll vs. something else, but I don't think it makes a difference in practice, since any intrinsics that are called under the check for `IsSupported` will fail to pre-jit anyway.
> I don't believe so. Intrinsics are special and are handled before the inlining logic.

But this line strips the intrinsic bit from the method. It should be treated as a simple method call, no?
That's what we want for BMI and friends - keep `IsSupported` untouched and reported as an intrinsic so that it expands to `false`, but treat everything else as a regular non-inlined method call. The fact that the IL is recursive shouldn't matter, because RyuJIT is not going to look at it due to the no-inline flag, no?
> the fact that the IL is recursive shouldn't matter because RyuJIT is not going to look at it due to the no inline flag, no?

The problem ends up coming in when we go to AOT those methods. You stripped the `JIT_INTRINSIC` flag, so the logic for "skip generating hardware intrinsic method bodies" won't kick in. Which means we will AOT the BMI and friends method bodies as regular recursive calls.
> You stripped the JIT_INTRINSIC flag so the logic for "skip generating hardware intrinsic method bodies" won't kick in

Attributes in that spot are retrieved from the standard JitInterface - we don't call the `getMethodAttribs` that messes with the intrinsic bit there.
Precompiling method bodies on e.g. the Bmi1 class would be wrong.
Ah, ok. That definitely wasn't clear to me. A comment calling it out might be helpful.
src/zap/zapinfo.cpp (Outdated)

    // call. We also make sure none of the hardware intrinsic method bodies get pregenerated in crossgen
    // (see ZapInfo::CompileMethod) but get JITted instead. The JITted method will have the correct
    // answer for the CPU the code is running on.
    if (fIsGetIsSupportedMethod &&
As per #24917 (comment), I think the ARM `IsSupported` checks should be constant on x86, as should the x86 `IsSupported` checks on ARM.
Isn't that what this code is doing?
This logic is going to strip the intrinsic flag - so we're making sure we only strip it for x86 intrinsics on x86 and arm64 intrinsics on arm64.
Ah, right. I misread the combination of the comment plus the if checks. 👍
    strcmp(namespaceName, "System.Runtime.Intrinsics") == 0)
    {
        // Treat as a regular method call (into a JITted method).
        return (attribs & ~CORINFO_FLG_JIT_INTRINSIC) | CORINFO_FLG_DONT_INLINE;
Same concern as I called out above. Intrinsics are special and are handled in the importer before inlining checks, and the recursive calls are dependent on `mustExpand` (which itself depends on the `JIT_INTRINSIC` flag).
It might be better to just fix up the `Compiler::impIntrinsic` method to have an `if (isCompilingForAOT)` path for the `IsSupported` checks, which returns `nullptr` rather than a `gtNewIconNode` (see https://github.com/dotnet/coreclr/blob/master/src/jit/importer.cpp#L3486-L3494, as we handle the `IsSupported` and `PlatformNotSupportedException` intrinsics a bit earlier).
That makes sense to me, though note that the JIT doesn't know whether it's SPC.dll or not.
Is that actually a concern now that we have a mechanism to handle functions which consume AVX/AVX2/FMA/BMI1/BMI2 methods, so things work even if they are called without an `IsSupported` check?
I believe the "worst case" scenario is someone doesn't use the check and calls, for example, `Sse41.DotProduct` on a processor that doesn't support it. We will have emitted the correct instruction encoding, and that will trigger a hardware exception as the instruction isn't supported.
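As a concrete sketch of that worst case (the wrapper method is illustrative; `Sse41.DotProduct` is the real API):

    using System.Runtime.Intrinsics;
    using System.Runtime.Intrinsics.X86;

    static class UnguardedExample
    {
        static float Dot(Vector128<float> a, Vector128<float> b)
        {
            // BAD: no Sse41.IsSupported guard. In a pregenerated image the
            // DPPS instruction would be emitted unconditionally, so on a CPU
            // without SSE4.1 this faults with an illegal-instruction trap
            // instead of throwing PlatformNotSupportedException.
            return Sse41.DotProduct(a, b, 0xFF).ToScalar();
        }
    }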
There is also the chance of subtle bugs if users don't provide software fallbacks, but I think it might be better to just have that be upfront in the initial release. If we ever want to make this more generally available, users will always have that scenario.
Right - I don't think it is a concern that we don't specially handle SPC.dll here.

> I believe the "worst case" scenario is someone doesn't use the check and calls, for example Sse41.DotProduct on a processor that doesn't support it.

But the code won't work in any event. I guess if they were actually relying on getting PNSE it might. I suppose we could trap the hardware exception and throw PNSE, but I'm not sure that complication is justified.
I agree, I think it would be better to encourage people to have the appropriate `IsSupported` checks 😄
Updated comments per the conversations above. I think all the open issues have been addressed at this point.
LGTM.
As I called out here, I think that all of the initial concerns around enabling this for non-S.P.CoreLib binaries have actually been addressed now.
So, I think it may be worth considering either:
A. Allowing this for everything by default (we can always flip it back off if we get feedback)
B. Providing a switch allowing users to explicitly opt into AOT'ing their methods which contain HWIntrinsics (you could also provide an opt-out switch if option A was chosen)
LGTM - thanks for your patience @MichalStrehovsky!
I would suggest that initially we offer this as an opt-in switch, and we could possibly change that in the future depending on usage & feedback.
That sounds good to me, if other people agree (and it could be done as a secondary PR). That would also allow CoreFX to use this as well, which would likely improve perf in a couple of other places.
I think we would still not be able to lift the restriction around loading the Vector64/128/256 types inside the crossgen process for things outside S.P.CoreLib, which somewhat limits the usefulness of the switch (even for CoreFX) - the same problem also applies to the switches requested in #20034. I think we block those types in crossgen because we expect an ABI-breaking change around them in the future and we don't want to invalidate older R2R images.
Yes, very good point, though I hadn't realized that we supported retaining R2R images across versions. Perhaps it's something to keep in mind for the next release, after we've added support for the vector aspects of the ABIs.
I don't think we are going to be able to keep this contract across every major version (especially when roll forward across major versions isn't supported). I think it would be fine, though: given that the release next year (the earliest we would ship the vector ABI changes) is .NET 5, where we are likely going to have some non-trivial changes anyway, I don't think there will be any negative fallout from this.
* Allow pregenerating all HW intrinsics in CoreLib This is a follow up to dotnet/coreclr#24689 that lets us pregenerate all hardware intrinsics in CoreLib. We ensure the potentially unsupported code will never be reachable at runtime on CPUs that don't support it by not reporting the `IsSupported` property as intrinsic in crossgen. This ensures the support checks are always JITted. JITting the support checks is very cheap. There is a cost in the form of an extra call and a failure to do constant propagation of the return value, but the cost is negligible in practice and gets eliminated once the tiered JIT tiers the method up. We only do this in CoreLib because user code could technically fail to guard intrinsic use with `IsSupported` checks, and pregenerating the code could lead to illegal instruction traps at runtime (instead of `PlatformNotSupportedException` throws) - it's a bad user experience. Commit migrated from dotnet/coreclr@e73c8e6