Performance Regression: UTF8 Span Methods 3x slower on AMD Platforms (NetCore2.1 -> 3.1) #2251
@tannergooding Is the Ryzen CPU listed here one of the models that suffers from poor `PDEP`/`PEXT` performance? I can investigate using some of the SIMD instructions here where possible.
That's really interesting, actually. I've learned something today! Based on my reading, it looks as if this instruction is implemented in microcode on my chip, while Intel has a hardware implementation for it. I also understand why the CLR wouldn't want to be in the business of micro-optimizations like this for different architectures. In a quick experiment, the Kernel32 MultiByteToWideChar APIs were slightly quicker, but still not as fast as NetCore2.1. Thanks for the responses here.
Yes. You can see this in the
This might be a good first use case for #785. The savings on platforms with fast `PDEP`/`PEXT` would be preserved.
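To make that concrete, here is a rough sketch of the kind of check #785 would enable. This is illustrative only, not framework code: the `X86Base.CpuId` API shape is an assumption (a CpuId intrinsic of this shape did not exist when this thread was written; it later shipped in .NET 5), and the class/field names are made up.

```csharp
using System.Runtime.Intrinsics.X86;

internal static class PdepPextPolicy
{
    // Cached once at startup: true when PDEP/PEXT are believed to be fast.
    internal static readonly bool HasFastPdepPext = Detect();

    private static bool Detect()
    {
        if (!Bmi2.X64.IsSupported || !X86Base.IsSupported)
            return false;

        // Leaf 0: the vendor string "AuthenticAMD" is split across
        // EBX ("Auth"), EDX ("enti"), ECX ("cAMD").
        (_, int ebx, int ecx, int edx) = X86Base.CpuId(0, 0);
        bool isAmd = ebx == 0x68747541 && edx == 0x69746E65 && ecx == 0x444D4163;
        if (!isAmd)
            return true; // Intel implements PDEP/PEXT in hardware

        // Leaf 1: family/model. Family 0x17 is Zen 1 / Zen 2, where
        // PDEP/PEXT are microcoded (the slow case in this issue).
        // Simplified composite: extended family is nonzero only when the
        // base family is 0xF, as it is for these parts.
        (int eax, _, _, _) = X86Base.CpuId(1, 0);
        int family = ((eax >> 8) & 0xF) + ((eax >> 20) & 0xFF);
        return family != 0x17;
    }
}
```

Transcoding code could then branch on `PdepPextPolicy.HasFastPdepPext` instead of `Bmi2.IsSupported` alone. The downside, discussed at length below, is that hardcoded vendor/family tables like this go stale as new microarchitectures ship.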
@jamescourtney - Can you set the environment variable `COMPlus_EnableBMI2=0` and check whether it makes a difference? (Tanner informs me that the environment variable processing logic is case-sensitive on Linux. It shouldn't matter on Windows.)
Setting that environment variable resolves the issue completely. Write performance more than doubles over NetCore2.1.
Just adding a small something to this discussion.
The model for the hardware intrinsics is complex already. This would make it even more complex than it is today. My suggestion for how to fix this issue would be to make `Bmi2.IsSupported` return false on the affected AMD processors.
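For context, this works because of how `IsSupported` checks are compiled. A minimal sketch of the usual guarded-intrinsic pattern (illustrative; the helper name is made up): `IsSupported` is a JIT-time constant, so if the runtime reported false, every call site would transparently compile down to its software fallback with no library changes.

```csharp
using System.Runtime.Intrinsics.X86;

internal static class BitOps
{
    // Typical guarded-intrinsic pattern. Bmi2.X64.IsSupported is a
    // JIT-time constant, so the branch not taken is eliminated as dead
    // code. If the runtime reported false on the affected AMD processors,
    // callers would silently get the fallback.
    internal static ulong ZeroHighBits(ulong value, int index)
    {
        if (Bmi2.X64.IsSupported)
        {
            return Bmi2.X64.ZeroHighBits(value, (ulong)index); // BZHI
        }

        // Portable fallback: keep only the low `index` bits.
        return index >= 64 ? value : value & ((1UL << index) - 1);
    }
}
```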
I'm torn on whether the runtime lying about the intrinsic being supported is the correct approach. That seems to put the runtime into the position of making a value judgment. Up until now, the intrinsics have been only raw capability checks.
This should be a point-in-time problem. I expect AMD to fix this performance bug, and then all this will be legacy that everybody will try to forget about. The reason why I have proposed implementing the workaround in the runtime is that it will be done in just one central place instead of spread over all places that use this instruction. As with any workaround, there may be corner cases where it may not be perfect, but that's ok.
Disabling BMI2 would lose some non-flag-altering instructions (probably not high usage). However, `PDEP`/`PEXT` are between 18x and 290x slower on the affected processors. More general-purpose fallback versions could be provided, as sketched below.
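A sketch of what such software fallbacks could look like (the classic bit-twiddling loops; illustrative only, and the class/method names are made up). Each iteration handles one set bit of the mask, so the cost grows with the mask's popcount rather than being the fixed 3-cycle latency of a hardware implementation, but it avoids the microcoded worst case entirely.

```csharp
internal static class Bmi2SoftwareFallback
{
    // Software PDEP: scatter the low bits of `value` to the set-bit
    // positions of `mask`, lowest mask bit first.
    internal static ulong ParallelBitDeposit(ulong value, ulong mask)
    {
        ulong result = 0;
        for (ulong bit = 1; mask != 0; bit <<= 1)
        {
            if ((value & bit) != 0)
                result |= mask & (0UL - mask); // lowest set bit of mask
            mask &= mask - 1;                  // clear that bit
        }
        return result;
    }

    // Software PEXT: gather the bits of `value` at the set-bit positions
    // of `mask` and pack them into the low bits of the result.
    internal static ulong ParallelBitExtract(ulong value, ulong mask)
    {
        ulong result = 0;
        for (ulong bit = 1; mask != 0; bit <<= 1)
        {
            if ((value & mask & (0UL - mask)) != 0)
                result |= bit;
            mask &= mask - 1;
        }
        return result;
    }
}
```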
I would be strongly against this approach both in terms of how the hardware intrinsics have been exposed and as someone who has an AMD processor. We have provided no guarantees that using a given hardware intrinsic will be "performant"; only that it will emit the instruction a given API is mapped to. It has always been up to the user to utilize them efficiently (which includes pipelining, dealing with micro-architectural differences, etc).
I'm not convinced we need a workaround.
We should likely, first and foremost, get some numbers on how much "gain" Intel is getting via PDEP/PEXT here. My initial speculation is that they are not actually providing that great of a benefit and we would likely minimally regress Intel (while significantly boosting AMD) by just dropping their usage in the framework (at least here where it is only used "once" per call; and not in a hot loop).
I don't believe it will be significantly more complex. We are already testing the various ISA combinations. We will continue to get full coverage of code paths even if we codify a few paths to do such checks.
One of the key values of the .NET platform is simplicity. Among other things, we hide a number of OS bugs, quirks and performance issues by including workarounds for them in the platform, so that our users do not have to worry about them. These workarounds may stay in the runtime code for years, but they make the .NET platform better in the long run.
Yep, the workarounds are not perfect. As @benaadams explained, the other instructions do not seem to be that valuable, so we are not losing much.
The effect of this is nearly identical to returning false from `Bmi2.IsSupported`.
I do not think we would want to promote this pattern. It is not future-proof. It will not work well once AMD fixes this performance bug.
There are exceptions to every rule and System.Runtime.Intrinsics violates most of the standard simplicity rules the framework has. It is ~1500 hardware specific APIs (and that is practically doubling with the introduction of the ARM intrinsics). It requires the user to write unsafe code, to code multiple similar algorithms for the various paths to ensure good codegen is possible on all paths and that a software fallback is available. There can already be large performance differences for a given instruction depending on what microarchitecture you are comparing; it isn't just limited to PDEP/PEXT (which happens to be an extreme case). If you go back to pre-AVX machines; you will get severe penalties for using unaligned loads/stores. If you use non-temporal stores (even on modern hardware) for the wrong scenario, you can be looking at ~400 cycle latency rather than 1 cycle latency; but used in the right scenario (such as large memcpy), you can achieve over a 2x perf increase. If you compare "Broadwell" to the newer "Skylake-X" (https://www.agner.org/optimize/instruction_tables.pdf), there are several instructions that have gotten slower and several that have gotten faster (sometimes significantly). Developers using hardware intrinsics need to consider these scenarios; us included.
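As an illustration of the non-temporal point, here is a hedged sketch of the trade-off, not framework code. Assumptions for brevity: `dest` is 16-byte aligned, `length` is a multiple of 16, and the 256 KB cutoff is invented for illustration (real thresholds are tuned per microarchitecture).

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

internal static unsafe class StreamingStoreSketch
{
    // Fill `length` bytes at `dest` with the byte `b`, choosing between
    // cached and non-temporal stores based on buffer size.
    internal static void Fill(byte* dest, int length, byte b)
    {
        const int StreamingThreshold = 256 * 1024; // invented cutoff
        int i = 0;

        if (Sse2.IsSupported)
        {
            Vector128<byte> value = Vector128.Create(b);

            if (length >= StreamingThreshold)
            {
                // Large, write-once buffer: non-temporal stores bypass the
                // cache, avoiding pollution (the large-memcpy win above).
                for (; i <= length - 16; i += 16)
                    Sse2.StoreAlignedNonTemporal(dest + i, value);
                Sse.StoreFence(); // order streaming stores before later reads
            }
            else
            {
                // Data likely to be read back soon: regular cached stores
                // avoid the large non-temporal latency penalty on re-read.
                for (; i <= length - 16; i += 16)
                    Sse2.Store(dest + i, value);
            }
        }

        for (; i < length; i++) // scalar path for pre-SSE2 hardware
            dest[i] = b;
    }
}
```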
I believe this is going to set a bad precedent for the hardware intrinsics. Especially if this happens to a larger ISA where a couple of instructions happen to behave poorly. For example, we may eventually add AVX-512 support. That support is relatively new on Intel and doesn't exist on AMD yet, but may exist there someday. AVX-512 is split into multiple feature ISAs and exposes ranges of instructions. It may be that one of the two (Intel or AMD) has significantly worse performance on an instruction as compared to the other. Someone will likewise log a bug and then this will be looked to as an example of how it should be handled. In my mind, the correct thing is to just reiterate that this is an extremely low level feature that allows you to fine tune performance, but as part of that, the amount of code you need to write and what considerations you need to take are also greatly expanded (it is effectively writing pseudo-inline assembly after all, just without control over register allocation and spilling).
Yes, which is something developers already need to be concerned about. They already need to consider manufacturer and microarchitecture differences because what works well on "X" may not work well on "Y". There is generally some pattern that works well on both, but that is not always the case/scenario.
Neither is disabling the functionality for AMD systems. We will still have the consideration of Znver1/Znver2 vs a future ZnverX where it is faster. CPUID is meant to expose hardware-specific information such as the manufacturer and model/family.
Right. So it is a good reason to not make it worse.
I believe that this would be a good precedent for the hardware intrinsics. If there is a configuration with a broken implementation of the hardware intrinsics (whether it is broken functionality or broken performance), we should pretend that the specific hardware intrinsics do not exist in the runtime. It would not be the first time we have done such a thing. For example, we used to artificially disable vectorized
Some of our important customers are not happy with the dependency on the JIT for these paths. We have plans to implement AOT for hardware intrinsics by pregenerating multiple variants of the method (it is the runtime team's OKR). If you start allowing special-casing like this, the matrix for this is going to get out of control.
I believe that the underlying question of this discussion (and several other discussions in other issues) is whether we need to go out of our way to expose 100% of the hardware intrinsics' potential, or whether it is good enough to keep things simpler and expose just 95%. I strongly believe that 95% is good enough.
There is not an easy way to do that without disabling an entire ISA and that won't be viable for larger ISAs where performance problems exist.
Unless I'm missing something, that is already going to be the case. Once AOT comes into play, you need to have a runtime or compile time consideration for this performance difference. For the BMI2 code path, that means the AOT code needs to do one of:
(sorry, hit a shortcut key by accident)
I think it would be perfectly viable. Half-done hardware features are hard for software to deal with. It is a problem for hardware manufacturers to fix, not a problem for software to invent perfect workarounds for.
Right. The plan is to have a matrix. For example, if the method contains an `Avx2.IsSupported` check, we would pregenerate both the AVX2 and non-AVX2 variants of the method. If you allow e.g. checks on ModelId versions, the complexity of how this is detected, encoded and propagated through the system goes through the roof.
Yes, I understand this. It is very similar to the outer-loop tests we already run, where we cover all the valid ISA configurations using the `COMPlus_Enable*` environment variables.

What I don't understand is how you plan on addressing the perf of PDEP/PEXT for Intel vs AMD here. I'm still not clear on how this will be resolved in the future, where BMI2 is "fast" on some future AMD machine but is still slow on the Znver1/Znver2 machines (which will likely still be in use, in both developer machines and in the Azure VMs that utilize them, for presumably the next 7-10 years).
Having a particular instruction be slow doesn't make it half-done. Both Intel and AMD have varying performance for instructions based on a number of factors, including what is believed to be the "mainline" scenarios. There are some instructions (like
The right AOT method body to use will be picked based on the information computed by this startup code. If we went with my proposed solution, we would add the check here:

runtime/src/coreclr/src/vm/codeman.cpp, line 1456 (at commit fcd862e)
It is ok to get 2x slower - that happens all the time. It is not ok to get 100x slower - that is what I consider broken.
I don't see how this is significantly better than allowing the same check to be done on the managed side. In either case, we will eventually need to either accept the perf loss on specific hardware or limit the disablement to specific microarchitectures (e.g. only disable on AMD Znver1/Znver2 but not on AMD FutureMicroarch). The latter will also likely become possible if #785 is reviewed and approved. Even if the
I believe that it is better because it is much simpler. It is the question that I raised above - whether we need to go out of our way to expose 100% of the hardware intrinsics' potential, or whether it is good enough to keep things simpler and expose just 95%.
Would it be worthwhile for me to experiment with removing BMI2 from the transcoding code paths? Where it's not in a tight loop, we can remove the intrinsics entirely, and where it is in a tight loop, I might be able to substitute a SIMD implementation. It might regress Intel performance slightly, but it should substantially improve AMD performance.
My belief is that it goes against the principles of the hardware intrinsics. They were designed to give developers the ability to access the functionality available in the underlying hardware. While one of the main focuses is performance, there is no possible way for us to "cleanly" expose only things which are always performant (as it greatly depends on the use case, etc). Even for PDEP/PEXT (on AMD), there are some cases where it takes as few as 18 cycles and some cases where it takes almost 300. Developers already have to code support for vastly different CPUs (ARM vs x86) and for different hardware levels (SSE/SSE2 as a baseline vs SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2, FMA, BMI1, BMI2, LZCNT, POPCNT, etc; and mixtures therein). They also already need to consider microarchitectural differences to an extent (instructions may perform significantly worse on some older hardware; the cost of using unaligned vector loads/stores may be significantly greater or lesser; etc). Allowing them to explicitly check for the manufacturer and model/family isn't a great jump from there and is something that has already been requested by external customers.
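To illustrate the kind of multi-level coding being described, here is a hedged sketch of a tiered helper (names invented; not the framework's implementation). Each tier must be written, tested, and tuned separately, which is exactly the burden being debated:

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

internal static class ScanHelpers
{
    // Tiered search: AVX2 if available, else SSE2, else scalar.
    internal static unsafe bool Contains(ReadOnlySpan<byte> data, byte target)
    {
        fixed (byte* p = data)
        {
            int i = 0;

            if (Avx2.IsSupported)
            {
                Vector256<byte> t256 = Vector256.Create(target);
                for (; i <= data.Length - 32; i += 32)
                {
                    Vector256<byte> eq = Avx2.CompareEqual(Avx.LoadVector256(p + i), t256);
                    if (Avx2.MoveMask(eq) != 0) return true;
                }
            }
            else if (Sse2.IsSupported)
            {
                Vector128<byte> t128 = Vector128.Create(target);
                for (; i <= data.Length - 16; i += 16)
                {
                    Vector128<byte> eq = Sse2.CompareEqual(Sse2.LoadVector128(p + i), t128);
                    if (Sse2.MoveMask(eq) != 0) return true;
                }
            }

            // Scalar tail (and full fallback on pre-SSE2 hardware).
            for (; i < data.Length; i++)
            {
                if (p[i] == target) return true;
            }
            return false;
        }
    }
}
```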
Yes, I think we should consider this fix for .NET Core 3.1 servicing. |
And the other variant of the fix to consider for servicing is to disable Bmi2 for AMD processors. Each one of these fixes comes with a different risk profile and trade-offs. You can let servicing shiproom pick which one they like better.
@jkotas I do understand your point of view, but ultimately, once the intrinsics cat is out of the bag, a world of complexity comes with it. Sure, you could postpone supporting/dealing with this to some extent, but ultimately this
It seems very arbitrary to draw a line around what sort of complex view developers "get to see" while accepting partial solutions like the `IsSupported` properties.

As an example, let's assume that at some future point in time, CoreCLR will support AVX512 to some extent. I assume we'll all agree (?) that AVX512 support will be exposed through an imaginary `Avx512.IsSupported` property.

Querying the CPU manufacturer through CPUID, as per #785, to detect an AMD processor, or querying cache latency / line sizes / structure, shouldn't be considered as radically different than getting partial CPUID "bits" leaked out through the `IsSupported` properties.

I guess that my argument is, and perhaps this is also the problem with it, that you either accept your fate and the need to support these exotic instructions that come with their complexities, because you wish to be able to play that game and compete with the programming languages that do, or you don't. I personally don't see much of a middle ground that makes sense, and I can easily see a variety of ways where over-simplifying will ultimately fail.

Then again, I'm not the one tasked to provide and maintain this complexity, so ╮( ˘ 、 ˘ )╭
I worked with @GrabYourPitchforks and got some numbers on both AMD and Intel for the current code with BMI2 On vs Off. For the first scenario, I ran the benchmark given in the original post, but covered three input strings:
Then, we also tested a more "real-world" scenario, where the benchmark data comes from https://github.com/GrabYourPitchforks/fast-utf8/tree/master/FastUtf8Tester/SampleTexts:

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System;
using System.Buffers;
using System.IO;
using System.Reflection;
using System.Text;

namespace ConsoleAppBenchmark
{
    public class TranscodingRunner
    {
        private static readonly string SampleTextsFolder = Path.GetDirectoryName(Assembly.GetEntryAssembly().Location);

        private byte[] _dataAsUtf8;
        private char[] _dataAsUtf16;

        [Params("11.txt", "11-0.txt", "25249-0.txt")]
        public string Corpus;

        [GlobalSetup]
        public void Setup()
        {
            _dataAsUtf8 = File.ReadAllBytes(Path.Combine(SampleTextsFolder, Corpus));

            // Strip the UTF-8 BOM, if present, so it doesn't skew the results.
            if (_dataAsUtf8.AsSpan().StartsWith(Encoding.UTF8.Preamble))
            {
                _dataAsUtf8 = _dataAsUtf8[Encoding.UTF8.Preamble.Length..];
            }

            _dataAsUtf16 = Encoding.UTF8.GetChars(_dataAsUtf8);
        }

        [Benchmark]
        public void TranscodeUtf16ToUtf8()
        {
            byte[] rented = ArrayPool<byte>.Shared.Rent(_dataAsUtf8.Length);
            Encoding.UTF8.GetBytes(_dataAsUtf16, rented);
            ArrayPool<byte>.Shared.Return(rented);
        }

        [Benchmark]
        public void TranscodeUtf8ToUtf16()
        {
            char[] rented = ArrayPool<char>.Shared.Rent(_dataAsUtf16.Length);
            Encoding.UTF8.GetChars(_dataAsUtf8, rented);
            ArrayPool<char>.Shared.Return(rented);
        }

        static void Main(string[] args)
        {
            BenchmarkRunner.Run<TranscodingRunner>();
        }
    }
}
```
I can also provide numbers against a
We have dealt with our own share of complexities handling the varying performance characteristics of primitives. The SIMD intrinsics are even harder to get right. I believe that there is negative value in trying to teach people how to conditionalize those using CPUIDs. There is no way these conditions would be right, and we would just end up with a pile of problems every time a new hardware generation shows up.
There is currently no API exposed for querying this.
Right. And I do not believe that we want to add APIs like this to the core platform. If somebody believes that they really need them, they can build them on their own. |
I removed the BMI2 calls from the hot path and replaced them with SIMD instructions instead. On my Intel box it results in some small perf wins.

```
BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
Intel Core i7-6700 CPU 3.40GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-alpha1-015755
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.19.56303, CoreFX 5.0.19.56306), X64 RyuJIT
  Job-PDISED : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT
  Job-PIYPPK : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT
```
I haven't yet run the full test battery over it, so I don't know how it will respond for smaller payloads, and I don't have access to an AMD box for testing right now. But nevertheless, it's promising.

Edit: The particular change that gave these numbers is GrabYourPitchforks@f355128.
I feel bad for doing my share in hijacking this issue, so I'll answer @jkotas on #785.
Looks like PDEP/PEXT aren't a problem on the new AMD processors (Zen3)? https://twitter.com/0x22h/status/1319177288039104517
Performance of some UTF8 string methods seems to have regressed in a major way on AMD platforms in .NET Core 3.1. Others may be affected as well.
Test:
AMD system results:
Intel system results: