Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove usage of BMI2 due to regression on AMD #18949

Closed
gfoidl opened this issue Feb 11, 2020 · 7 comments
Closed

Remove usage of BMI2 due to regression on AMD #18949

gfoidl opened this issue Feb 11, 2020 · 7 comments
Labels
area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions feature-kestrel ✔️ Resolution: Fixed The bug or enhancement requested in this issue has been checked-in! Perf

Comments

@gfoidl
Copy link
Member

gfoidl commented Feb 11, 2020

For context see dotnet/runtime#2251 and #17556 (comment)

Usages of BMI2 in ASP.NET Core:

internal static class StringUtilities

Maybe put the usage of BMI2 under a toggle #if USE_BMI2 (or with a better name), so it can be switched on/off by configuration?

Of course we should do some benchmarks to validate if

  • this use case really regresses perf on AMD
  • the alternative solution doesn't regress other uses (Intel)
@gfoidl
Copy link
Member Author

gfoidl commented Feb 11, 2020

Looking at dotnet/runtime#31904, I see quite some similarities between the codes. So maybe it makes sense to build StringUtilities on top of
https://github.com/dotnet/runtime/blob/d6e5d4efedc4fd32ced35bf7593fb2a0a5f80df4/src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs#L23
? So synergies can be used...
Problem: ASCIIUtility is internal (but I'm sure there's still a way to use it).

@gfoidl
Copy link
Member Author

gfoidl commented Feb 11, 2020

I have a branch ready.
Need to do benchmarks (next days are a bit busy, so hopefully next week).

@gfoidl
Copy link
Member Author

gfoidl commented Feb 11, 2020

My last comment for today 😉

For a simple / trivial benchmark (on Intel KabyLake) the SSE2 alternative shows a perf-win over BMI2 -- as already figured out by @GrabYourPitchforks

So it seems there is no need to keep BMI2 around (as I proposed in the first comment).
Still need to run some real benchmarks.

Benchmark code
using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
using System.Text;
using BenchmarkDotNet.Attributes;

#if !DEBUG
using BenchmarkDotNet.Running;
#endif

namespace ConsoleApp3.Bench
{
    public unsafe class Bench
    {
        private readonly long _input;
        private readonly char[] _output = new char[8];

        public Bench()
        {
            byte[] input = Encoding.UTF8.GetBytes("Hello world");
            _input = Unsafe.ReadUnaligned<long>(ref input[0]);
        }

        public string Result => new string(_output);

        static void Main()
        {
#if DEBUG
            var bench = new Bench();
            bench.BMI2();
            Console.WriteLine($">{bench.Result}<");
            bench._output.AsSpan().Clear();
            bench.SSE2();
            Console.WriteLine($">{bench.Result}<");
#else
            BenchmarkRunner.Run<Bench>();
#endif
        }

#if Path_32bit
        [Benchmark(Baseline = true)]
        public void BMI2()
        {
            int value = (int)_input;

            fixed (char* output = _output)
            {
                ((uint*)output)[0] = Bmi2.ParallelBitDeposit((uint)value, 0x00FF00FFu);
                ((uint*)output)[1] = Bmi2.ParallelBitDeposit((uint)(value >> 16), 0x00FF00FFu);
            }
        }

        [Benchmark]
        public void SSE2()
        {
            int value = (int)_input;

            fixed (char* output = _output)
            {
                Vector128<sbyte> vecNarrow = Sse2.ConvertScalarToVector128Int32(value).AsSByte();
                Vector128<ulong> vecWide = Sse2.UnpackLow(vecNarrow, Vector128<sbyte>.Zero).AsUInt64();
                Unsafe.WriteUnaligned(output, Sse2.X64.ConvertToUInt64(vecWide));
            }
        }
#else
        [Benchmark(Baseline = true)]
        public void BMI2()
        {
            long value = _input;

            fixed (char* output = _output)
            {
                ((ulong*)output)[0] = Bmi2.X64.ParallelBitDeposit((ulong)value, 0x00FF00FF_00FF00FFul);
                ((ulong*)output)[1] = Bmi2.X64.ParallelBitDeposit((ulong)(value >> 32), 0x00FF00FF_00FF00FFul);
            }
        }

        [Benchmark]
        public void SSE2()
        {
            long value = _input;

            fixed (char* output = _output)
            {
                Vector128<sbyte> zero = Vector128<sbyte>.Zero;

                Vector128<sbyte> vecNarrow = Sse2.X64.ConvertScalarToVector128Int64(value).AsSByte();
                Vector128<ulong> vecWide = Sse2.UnpackLow(vecNarrow, zero).AsUInt64();
                Sse2.Store((ulong*)output, vecWide);
            }
        }
#endif
    }
}

Results for widening 4 ASCII-bytes

| Method |     Mean |     Error |    StdDev | Ratio | RatioSD |
|------- |---------:|----------:|----------:|------:|--------:|
|   BMI2 | 1.273 ns | 0.0312 ns | 0.0291 ns |  1.00 |    0.00 |
|   SSE2 | 1.068 ns | 0.0262 ns | 0.0219 ns |  0.84 |    0.03 |

Results for widening 8 ASCII-bytes

| Method |     Mean |     Error |    StdDev |   Median | Ratio | RatioSD |
|------- |---------:|----------:|----------:|---------:|------:|--------:|
|   BMI2 | 1.359 ns | 0.0660 ns | 0.1761 ns | 1.280 ns |  1.00 |    0.00 |
|   SSE2 | 1.130 ns | 0.0352 ns | 0.0329 ns | 1.133 ns |  0.74 |    0.07 |

@analogrelay
Copy link
Contributor

@gfoidl Sounds good! I take it from your comments you are able to submit a PR? That would be great!

@gfoidl
Copy link
Member Author

gfoidl commented Feb 12, 2020

you are able to submit a PR?

Yes, I'm able 😉
Will send one by next week, and need to run some more benchmarks. Maybe some more tuning of perf...

@analogrelay
Copy link
Contributor

No worries, just getting a feel for if I need folks to write some code or review some code ;P.

@gfoidl
Copy link
Member Author

gfoidl commented Feb 26, 2020

#19009 is merged, so this can be closed.

@gfoidl gfoidl closed this as completed Feb 26, 2020
@analogrelay analogrelay added the ✔️ Resolution: Fixed The bug or enhancement requested in this issue has been checked-in! label Feb 26, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Mar 27, 2020
@amcasey amcasey added area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions and removed area-runtime labels Jun 2, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions feature-kestrel ✔️ Resolution: Fixed The bug or enhancement requested in this issue has been checked-in! Perf
Projects
None yet
Development

No branches or pull requests

4 participants