Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize bit operations of BigInteger #113005

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open

Conversation

kzrnm
Copy link
Contributor

@kzrnm kzrnm commented Feb 28, 2025

This PR is a refactored version of #112632.

Benchmark


BenchmarkDotNet v0.13.12, Windows 11 (10.0.26100.3194)
13th Gen Intel Core i5-13500, 1 CPU, 20 logical and 14 physical cores
.NET SDK 10.0.100-preview.3.25125.5
  [Host]     : .NET 10.0.0 (10.0.25.12411), X64 RyuJIT AVX2
  Job-MGDTRH : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-TRYYFZ : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-BFYRFS : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX2


Method Toolchain Mean Ratio Gen0 Gen1 Gen2 Allocated Alloc Ratio
PositiveLeftShift \main\corerun.exe 13.150 ms 1.00 62.5000 62.5000 62.5000 33 MB 1.00
PositiveLeftShift \vectorize\corerun.exe 6.654 ms 0.48 156.2500 156.2500 156.2500 33 MB 1.00
PositiveLeftShift \scalar\corerun.exe 8.763 ms 0.64 156.2500 156.2500 156.2500 33 MB 1.00
NegativeLeftShift \main\corerun.exe 12.914 ms 1.00 78.1250 78.1250 78.1250 33 MB 1.00
NegativeLeftShift \vectorize\corerun.exe 6.559 ms 0.50 156.2500 156.2500 156.2500 33 MB 1.00
NegativeLeftShift \scalar\corerun.exe 8.794 ms 0.68 156.2500 156.2500 156.2500 33 MB 1.00
PositiveRightShift \main\corerun.exe 15.399 ms 1.00 125.0000 125.0000 125.0000 31 MB 1.00
PositiveRightShift \vectorize\corerun.exe 9.417 ms 0.61 187.5000 187.5000 187.5000 31 MB 1.00
PositiveRightShift \scalar\corerun.exe 11.450 ms 0.74 187.5000 187.5000 187.5000 31 MB 1.00
NegativeRightShift \main\corerun.exe 19.945 ms 1.00 125.0000 125.0000 125.0000 31 MB 1.00
NegativeRightShift \vectorize\corerun.exe 9.426 ms 0.48 187.5000 187.5000 187.5000 31 MB 1.00
NegativeRightShift \scalar\corerun.exe 11.648 ms 0.58 187.5000 187.5000 187.5000 31 MB 1.00
PositiveUnsignedRightShift \main\corerun.exe 14.067 ms 1.00 125.0000 125.0000 125.0000 31 MB 1.00
PositiveUnsignedRightShift \vectorize\corerun.exe 9.431 ms 0.67 187.5000 187.5000 187.5000 31 MB 1.00
PositiveUnsignedRightShift \scalar\corerun.exe 11.628 ms 0.83 187.5000 187.5000 187.5000 31 MB 1.00
NegativeUnsignedRightShift \main\corerun.exe 16.347 ms 1.00 125.0000 125.0000 125.0000 31 MB 1.00
NegativeUnsignedRightShift \vectorize\corerun.exe 11.230 ms 0.69 187.5000 187.5000 187.5000 31 MB 1.00
NegativeUnsignedRightShift \scalar\corerun.exe 13.608 ms 0.83 187.5000 187.5000 187.5000 31 MB 1.00
PositiveRotateLeft \main\corerun.exe 13.674 ms 1.00 125.0000 125.0000 125.0000 32 MB 1.00
PositiveRotateLeft \vectorize\corerun.exe 10.775 ms 0.79 187.5000 187.5000 187.5000 32 MB 1.00
PositiveRotateLeft \scalar\corerun.exe 13.057 ms 0.96 187.5000 187.5000 187.5000 32 MB 1.00
NegativeRotateLeft \main\corerun.exe 18.040 ms 1.00 125.0000 125.0000 125.0000 32 MB 1.00
NegativeRotateLeft \vectorize\corerun.exe 14.679 ms 0.83 187.5000 187.5000 187.5000 32 MB 1.00
NegativeRotateLeft \scalar\corerun.exe 17.039 ms 0.94 187.5000 187.5000 187.5000 32 MB 1.00
PositiveRotateRight \main\corerun.exe 13.671 ms 1.00 125.0000 125.0000 125.0000 32 MB 1.00
PositiveRotateRight \vectorize\corerun.exe 10.651 ms 0.78 187.5000 187.5000 187.5000 32 MB 1.00
PositiveRotateRight \scalar\corerun.exe 13.071 ms 0.96 187.5000 187.5000 187.5000 32 MB 1.00
NegativeRotateRight \main\corerun.exe 16.000 ms 1.00 125.0000 125.0000 125.0000 32 MB 1.00
NegativeRotateRight \vectorize\corerun.exe 12.632 ms 0.79 187.5000 187.5000 187.5000 32 MB 1.00
NegativeRotateRight \scalar\corerun.exe 14.861 ms 0.93 187.5000 187.5000 187.5000 32 MB 1.00

<details>

<summary>benchmark code</summary>

```cs
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using System.Numerics;

[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class ShiftTest
{
    const int shiftSize = (1 << 23) + 13;
    BigInteger positive, negative;

    [GlobalSetup]
    public void Setup()
    {
        var bytes = new byte[1 << 25];
        new Random(227).NextBytes(bytes);
        positive = new BigInteger(bytes, isUnsigned: true);
        negative = -positive;
    }

    [Benchmark] public BigInteger PositiveLeftShift() => positive << shiftSize;
    [Benchmark] public BigInteger NegativeLeftShift() => negative << shiftSize;

    [Benchmark] public BigInteger PositiveRightShift() => positive >> shiftSize;
    [Benchmark] public BigInteger NegativeRightShift() => negative >> shiftSize;

    [Benchmark] public BigInteger PositiveUnsignedRightShift() => positive >>> shiftSize;
    [Benchmark] public BigInteger NegativeUnsignedRightShift() => negative >>> shiftSize;

    [Benchmark] public BigInteger PositiveRotateLeft() => BigInteger.RotateLeft(positive, shiftSize);
    [Benchmark] public BigInteger NegativeRotateLeft() => BigInteger.RotateLeft(negative, shiftSize);

    [Benchmark] public BigInteger PositiveRotateRight() => BigInteger.RotateRight(positive, shiftSize);
    [Benchmark] public BigInteger NegativeRotateRight() => BigInteger.RotateRight(negative, shiftSize);
}

@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Feb 28, 2025
Copy link
Contributor

Tagging subscribers to this area: @dotnet/area-system-numerics
See info in area-owners.md if you want to be subscribed.

// AAAAAAAAAAA
// where A = ~X

int offset = 0;
ref uint start = ref MemoryMarshal.GetReference(d);

while (Vector512.IsHardwareAccelerated && d.Length - offset >= Vector512<uint>.Count)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you are touching this, it may also be worthy to tune the branch prediction of these.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What should be done? For now, I have vectorized RightShiftSelf and LeftShiftSelf. db84170

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reorganize the loops and branches like these:

if (Vector128.IsHardwareAccelerated)
{
if (Vector512.IsHardwareAccelerated && length >= (nuint)Vector512<byte>.Count)
{
nuint offset = 0;
nuint lengthToExamine = length - (nuint)Vector512<byte>.Count;
// Unsigned, so it shouldn't have overflowed larger than length (rather than negative)
Debug.Assert(lengthToExamine < length);
if (lengthToExamine != 0)
{
do
{
if (Vector512.LoadUnsafe(ref first, offset) !=
Vector512.LoadUnsafe(ref second, offset))
{
goto NotEqual;
}
offset += (nuint)Vector512<byte>.Count;
} while (lengthToExamine > offset);
}

Basically:

if (Vector512.IsHardwareAccelerated && Length >= 512)
{
    do
    {
        Process();
    }
    while (Length >= 512);
}

However, I'm not sure the branching is tuned in correct favor now. It's better to do more benchmarks for different sizes when doing this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was no effect. The IsHardwareAccelerated property has the IntrinsicAttribute, so the JIT likely treats it as a constant.


BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.3194)
13th Gen Intel Core i5-13500, 1 CPU, 20 logical and 14 physical cores
.NET SDK 10.0.100-preview.3.25125.5
  [Host]     : .NET 10.0.0 (10.0.25.12411), X64 RyuJIT AVX2
  DefaultJob : .NET 10.0.0 (10.0.25.12411), X64 RyuJIT AVX2

Job=DefaultJob  

Method N Mean Ratio Allocated Alloc Ratio
OnesComplement 100 9.004 ns 1.00 - NA
OnesComplementWithIf 100 9.093 ns 1.01 - NA
OnesComplement 1000 64.948 ns 1.00 - NA
OnesComplementWithIf 1000 65.522 ns 1.01 - NA
OnesComplement 10000 576.086 ns 1.00 - NA
OnesComplementWithIf 10000 572.086 ns 0.99 - NA
OnesComplement 100000 6,659.639 ns 1.00 - NA
OnesComplementWithIf 100000 6,785.877 ns 1.02 - NA
OnesComplement 1000000 127,892.079 ns 1.00 - NA
OnesComplementWithIf 1000000 127,843.652 ns 1.00 - NA
OnesComplement 10000000 2,394,471.143 ns 1.00 - NA
OnesComplementWithIf 10000000 2,502,400.466 ns 1.05 - NA
benchmark code
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class OnesComplementTest
{
    uint[] array;
    [Params([
        100,
        1000,
        10000,
        100000,
        1000000,
        10000000,
    ])]
    public int N;

    [GlobalSetup]
    public void Setup()
    {
        array = new uint[N];
        new Random(227).NextBytes(MemoryMarshal.AsBytes(array.AsSpan()));
    }

    [Benchmark(Baseline = true)] public void OnesComplement() => DangerousMakeOnesComplement(array);
    [Benchmark] public void OnesComplementWithIf() => DangerousMakeOnesComplementWithIf(array);

    // Do an in-place one's complement. "Dangerous" because it causes
    // a mutation and needs to be used with care for immutable types.
    public static void DangerousMakeOnesComplementWithIf(Span<uint> d)
    {
        // Given a number:
        //     XXXXXXXXXXX
        // where Y is non-zero,
        // The result of one's complement is
        //     AAAAAAAAAAA
        // where A = ~X

        int offset = 0;
        ref uint start = ref MemoryMarshal.GetReference(d);

        if (Vector512.IsHardwareAccelerated) while (d.Length - offset >= Vector512<uint>.Count)
            {
                Vector512<uint> complement = ~Vector512.LoadUnsafe(ref start, (nuint)offset);
                Vector512.StoreUnsafe(complement, ref start, (nuint)offset);
                offset += Vector512<uint>.Count;
            }

        if (Vector256.IsHardwareAccelerated) while (d.Length - offset >= Vector256<uint>.Count)
            {
                Vector256<uint> complement = ~Vector256.LoadUnsafe(ref start, (nuint)offset);
                Vector256.StoreUnsafe(complement, ref start, (nuint)offset);
                offset += Vector256<uint>.Count;
            }

        if (Vector128.IsHardwareAccelerated) while (d.Length - offset >= Vector128<uint>.Count)
            {
                Vector128<uint> complement = ~Vector128.LoadUnsafe(ref start, (nuint)offset);
                Vector128.StoreUnsafe(complement, ref start, (nuint)offset);
                offset += Vector128<uint>.Count;
            }
        for (; offset < d.Length; offset++)
        {
            d[offset] = ~d[offset];
        }
    }

    // Do an in-place one's complement. "Dangerous" because it causes
    // a mutation and needs to be used with care for immutable types.
    public static void DangerousMakeOnesComplement(Span<uint> d)
    {
        // Given a number:
        //     XXXXXXXXXXX
        // where Y is non-zero,
        // The result of one's complement is
        //     AAAAAAAAAAA
        // where A = ~X

        int offset = 0;
        ref uint start = ref MemoryMarshal.GetReference(d);

        while (Vector512.IsHardwareAccelerated && d.Length - offset >= Vector512<uint>.Count)
        {
            Vector512<uint> complement = ~Vector512.LoadUnsafe(ref start, (nuint)offset);
            Vector512.StoreUnsafe(complement, ref start, (nuint)offset);
            offset += Vector512<uint>.Count;
        }

        while (Vector256.IsHardwareAccelerated && d.Length - offset >= Vector256<uint>.Count)
        {
            Vector256<uint> complement = ~Vector256.LoadUnsafe(ref start, (nuint)offset);
            Vector256.StoreUnsafe(complement, ref start, (nuint)offset);
            offset += Vector256<uint>.Count;
        }

        while (Vector128.IsHardwareAccelerated && d.Length - offset >= Vector128<uint>.Count)
        {
            Vector128<uint> complement = ~Vector128.LoadUnsafe(ref start, (nuint)offset);
            Vector128.StoreUnsafe(complement, ref start, (nuint)offset);
            offset += Vector128<uint>.Count;
        }
        for (; offset < d.Length; offset++)
        {
            d[offset] = ~d[offset];
        }
    }
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did observe some difference in branch ordering, but it's minor now. Perhaps JIT's loop normalization is already optimizing these loops.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that branch tuning is likely to affect more for small inputs (<128, <256 and <512 cases). It shouldn't affect large input.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does "<128, <256, and <512 cases" refer to d.Length? I don't think it would have any impact, considering that even Vector512<uint>.Count is only 16.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.3194)
13th Gen Intel Core i5-13500, 1 CPU, 20 logical and 14 physical cores
.NET SDK 10.0.100-preview.3.25125.5
  [Host]     : .NET 10.0.0 (10.0.25.12411), X64 RyuJIT AVX2
  DefaultJob : .NET 10.0.0 (10.0.25.12411), X64 RyuJIT AVX2

Job=DefaultJob  

Method N Mean Ratio Allocated Alloc Ratio
OnesComplement 7 3.409 ns 1.00 - NA
OnesComplementWithIf 7 3.405 ns 1.00 - NA
OnesComplement 8 1.463 ns 1.00 - NA
OnesComplementWithIf 8 1.459 ns 1.00 - NA
OnesComplement 15 3.964 ns 1.00 - NA
OnesComplementWithIf 15 3.876 ns 0.98 - NA
OnesComplement 16 2.129 ns 1.00 - NA
OnesComplementWithIf 16 2.073 ns 0.97 - NA

@@ -2,6 +2,7 @@
// The .NET Foundation licenses this file to you under the MIT license.

using System.Diagnostics;
using System.Runtime.CompilerServices;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remove it.

{
for (int i = xd.Length - 1; i >= digitShift; i--)
// Calculate the two�fs complement. The least significant nonzero bit has already been computed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-ASCII character

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had used U+2019 , so I replaced it with U+0027 '.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-System.Numerics community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants