-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize bit operations of BigInteger #113005
base: main
Are you sure you want to change the base?
Conversation
Tagging subscribers to this area: @dotnet/area-system-numerics |
// AAAAAAAAAAA | ||
// where A = ~X | ||
|
||
int offset = 0; | ||
ref uint start = ref MemoryMarshal.GetReference(d); | ||
|
||
while (Vector512.IsHardwareAccelerated && d.Length - offset >= Vector512<uint>.Count) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you are touching this, it may also be worthy to tune the branch prediction of these.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What should be done? For now, I have vectorized RightShiftSelf and LeftShiftSelf. db84170
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reorganize the loops and branches like these:
runtime/src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs
Lines 821 to 840 in 577c36c
if (Vector128.IsHardwareAccelerated) | |
{ | |
if (Vector512.IsHardwareAccelerated && length >= (nuint)Vector512<byte>.Count) | |
{ | |
nuint offset = 0; | |
nuint lengthToExamine = length - (nuint)Vector512<byte>.Count; | |
// Unsigned, so it shouldn't have overflowed larger than length (rather than negative) | |
Debug.Assert(lengthToExamine < length); | |
if (lengthToExamine != 0) | |
{ | |
do | |
{ | |
if (Vector512.LoadUnsafe(ref first, offset) != | |
Vector512.LoadUnsafe(ref second, offset)) | |
{ | |
goto NotEqual; | |
} | |
offset += (nuint)Vector512<byte>.Count; | |
} while (lengthToExamine > offset); | |
} |
Basically:
if (Vector512.IsHardwareAccelerated && Length >= 512)
{
do
{
Process();
}
while (Length >= 512);
}
However, I'm not sure the branching is tuned in correct favor now. It's better to do more benchmarks for different sizes when doing this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was no effect. The IsHardwareAccelerated
property has the IntrinsicAttribute
, so the JIT likely treats it as a constant.
BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.3194)
13th Gen Intel Core i5-13500, 1 CPU, 20 logical and 14 physical cores
.NET SDK 10.0.100-preview.3.25125.5
[Host] : .NET 10.0.0 (10.0.25.12411), X64 RyuJIT AVX2
DefaultJob : .NET 10.0.0 (10.0.25.12411), X64 RyuJIT AVX2
Job=DefaultJob
Method | N | Mean | Ratio | Allocated | Alloc Ratio |
---|---|---|---|---|---|
OnesComplement | 100 | 9.004 ns | 1.00 | - | NA |
OnesComplementWithIf | 100 | 9.093 ns | 1.01 | - | NA |
OnesComplement | 1000 | 64.948 ns | 1.00 | - | NA |
OnesComplementWithIf | 1000 | 65.522 ns | 1.01 | - | NA |
OnesComplement | 10000 | 576.086 ns | 1.00 | - | NA |
OnesComplementWithIf | 10000 | 572.086 ns | 0.99 | - | NA |
OnesComplement | 100000 | 6,659.639 ns | 1.00 | - | NA |
OnesComplementWithIf | 100000 | 6,785.877 ns | 1.02 | - | NA |
OnesComplement | 1000000 | 127,892.079 ns | 1.00 | - | NA |
OnesComplementWithIf | 1000000 | 127,843.652 ns | 1.00 | - | NA |
OnesComplement | 10000000 | 2,394,471.143 ns | 1.00 | - | NA |
OnesComplementWithIf | 10000000 | 2,502,400.466 ns | 1.05 | - | NA |
benchmark code
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class OnesComplementTest
{
uint[] array;
[Params([
100,
1000,
10000,
100000,
1000000,
10000000,
])]
public int N;
[GlobalSetup]
public void Setup()
{
array = new uint[N];
new Random(227).NextBytes(MemoryMarshal.AsBytes(array.AsSpan()));
}
[Benchmark(Baseline = true)] public void OnesComplement() => DangerousMakeOnesComplement(array);
[Benchmark] public void OnesComplementWithIf() => DangerousMakeOnesComplementWithIf(array);
// Do an in-place one's complement. "Dangerous" because it causes
// a mutation and needs to be used with care for immutable types.
public static void DangerousMakeOnesComplementWithIf(Span<uint> d)
{
// Given a number:
// XXXXXXXXXXX
// where Y is non-zero,
// The result of one's complement is
// AAAAAAAAAAA
// where A = ~X
int offset = 0;
ref uint start = ref MemoryMarshal.GetReference(d);
if (Vector512.IsHardwareAccelerated) while (d.Length - offset >= Vector512<uint>.Count)
{
Vector512<uint> complement = ~Vector512.LoadUnsafe(ref start, (nuint)offset);
Vector512.StoreUnsafe(complement, ref start, (nuint)offset);
offset += Vector512<uint>.Count;
}
if (Vector256.IsHardwareAccelerated) while (d.Length - offset >= Vector256<uint>.Count)
{
Vector256<uint> complement = ~Vector256.LoadUnsafe(ref start, (nuint)offset);
Vector256.StoreUnsafe(complement, ref start, (nuint)offset);
offset += Vector256<uint>.Count;
}
if (Vector128.IsHardwareAccelerated) while (d.Length - offset >= Vector128<uint>.Count)
{
Vector128<uint> complement = ~Vector128.LoadUnsafe(ref start, (nuint)offset);
Vector128.StoreUnsafe(complement, ref start, (nuint)offset);
offset += Vector128<uint>.Count;
}
for (; offset < d.Length; offset++)
{
d[offset] = ~d[offset];
}
}
// Do an in-place one's complement. "Dangerous" because it causes
// a mutation and needs to be used with care for immutable types.
public static void DangerousMakeOnesComplement(Span<uint> d)
{
// Given a number:
// XXXXXXXXXXX
// where Y is non-zero,
// The result of one's complement is
// AAAAAAAAAAA
// where A = ~X
int offset = 0;
ref uint start = ref MemoryMarshal.GetReference(d);
while (Vector512.IsHardwareAccelerated && d.Length - offset >= Vector512<uint>.Count)
{
Vector512<uint> complement = ~Vector512.LoadUnsafe(ref start, (nuint)offset);
Vector512.StoreUnsafe(complement, ref start, (nuint)offset);
offset += Vector512<uint>.Count;
}
while (Vector256.IsHardwareAccelerated && d.Length - offset >= Vector256<uint>.Count)
{
Vector256<uint> complement = ~Vector256.LoadUnsafe(ref start, (nuint)offset);
Vector256.StoreUnsafe(complement, ref start, (nuint)offset);
offset += Vector256<uint>.Count;
}
while (Vector128.IsHardwareAccelerated && d.Length - offset >= Vector128<uint>.Count)
{
Vector128<uint> complement = ~Vector128.LoadUnsafe(ref start, (nuint)offset);
Vector128.StoreUnsafe(complement, ref start, (nuint)offset);
offset += Vector128<uint>.Count;
}
for (; offset < d.Length; offset++)
{
d[offset] = ~d[offset];
}
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did observe some difference in branch ordering, but it's minor now. Perhaps JIT's loop normalization is already optimizing these loops.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that branch tuning is likely to affect more for small inputs (<128, <256 and <512 cases). It shouldn't affect large input.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does "<128, <256, and <512 cases" refer to d.Length
? I don't think it would have any impact, considering that even Vector512<uint>.Count
is only 16.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.3194)
13th Gen Intel Core i5-13500, 1 CPU, 20 logical and 14 physical cores
.NET SDK 10.0.100-preview.3.25125.5
[Host] : .NET 10.0.0 (10.0.25.12411), X64 RyuJIT AVX2
DefaultJob : .NET 10.0.0 (10.0.25.12411), X64 RyuJIT AVX2
Job=DefaultJob
Method | N | Mean | Ratio | Allocated | Alloc Ratio |
---|---|---|---|---|---|
OnesComplement | 7 | 3.409 ns | 1.00 | - | NA |
OnesComplementWithIf | 7 | 3.405 ns | 1.00 | - | NA |
OnesComplement | 8 | 1.463 ns | 1.00 | - | NA |
OnesComplementWithIf | 8 | 1.459 ns | 1.00 | - | NA |
OnesComplement | 15 | 3.964 ns | 1.00 | - | NA |
OnesComplementWithIf | 15 | 3.876 ns | 0.98 | - | NA |
OnesComplement | 16 | 2.129 ns | 1.00 | - | NA |
OnesComplementWithIf | 16 | 2.073 ns | 0.97 | - | NA |
@@ -2,6 +2,7 @@ | |||
// The .NET Foundation licenses this file to you under the MIT license. | |||
|
|||
using System.Diagnostics; | |||
using System.Runtime.CompilerServices; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unused?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remove it.
{ | ||
for (int i = xd.Length - 1; i >= digitShift; i--) | ||
// Calculate the two�fs complement. The least significant nonzero bit has already been computed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
non-ASCII character
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had used U+2019 ’
, so I replaced it with U+0027 '
.
This PR is a refactored version of #112632.
Benchmark