-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Even faster modulo computation #406
Conversation
fece4fe
to
9e413c0
Compare
src/libraries/System.Private.CoreLib/src/System/Collections/HashHelpers.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Collections/HashHelpers.cs
Show resolved
Hide resolved
Would it be possible to run the Dictionary performance tests? |
8c1d54c
to
4bb0bbb
Compare
4bb0bbb
to
0927e77
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM as long as the perf tests results are shared and demonstrate the improvements.
I don't know how stable our dictionary perf tests are (@adamsitnik?), but I just tried running them comparing master against this PR (
|
I have an outstanding PR for stable keys (rather than random) dotnet/performance#938 |
@stephentoub Thank you for running the tests. Since this is a very local optimization (3 multiplications instead of 2), we should not see any big differences at all. Any difference above a couple of percent sounds suspicious to me. For instance, 15% diff in the last test, which I think don't even use the code path that I am changing (only in the GlobalSetup stage) sounds very fishy. I am going to run the tests myself on x64 and ARM64 in a few days and report results. Adding NO MERGE label for now. |
@AntonLapounov, any update? Should this be closed? Merged? |
@AntonLapounov, can you please comment on the status of this PR? |
@stephentoub Sorry, I have not had a chance to measure it due to other work. I might need some guidance on how to run those tests on ARM64. I would be quite surprised if this change had negative performance impact. |
I am fine with taking this even without confirmed throughput measurements. I agree with @AntonLapounov that it is very unlikely to make anything shower; and it makes the Dictionary code measurably smaller. For example, |
@AntonLapounov Could you please rebase or push a merge from master, so that we get a fresh run of the CI? |
@@ -90,21 +90,24 @@ public static int ExpandPrime(int oldSize) | |||
} | |||
|
|||
#if BIT64 | |||
// Returns approximate reciprocal of the divisor: ceil(2**64 / divisor) | |||
public static ulong GetFastModMultiplier(uint divisor) | |||
=> ulong.MaxValue / divisor + 1; | |||
|
|||
[MethodImpl(MethodImplOptions.AggressiveInlining)] | |||
public static uint FastMod(uint value, uint divisor, ulong multiplier) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: surely this is inlined even without the attribute..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you verified it? This method is going to be ~20 bytes of IL. It is in the gray where the JIT may choose to not inline it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are exactly right .. it is 18 bytes IL, if I remove the + 1
to make 15 bytes of IL, it is then inlined.
@jkotas Thank you for trying this change. One question that I had in mind is what would generate better code for x64 and ARM64, the current change: uint highbits = (uint)((((lowbits >> 32) + 1) * divisor) >> 32); or this variation: uint highbits = (uint)(((ulong)((uint)(lowbits >> 32) + 1) * divisor) >> 32); Here |
I think your current version is fine. I do not think that the alternative would make a difference with the current JIT, and given that this is enabled for 64-bit only anyway. |
Improved Direct Remainder Computation
While we assume 𝑁 = 32 below for simplicity, the same arguments trivially work for the general case as well. The recent work of Lemire, Kaser, and Kurz [1] presented the following algorithm for computing the modulo of two unsigned 32-bit integers:
Here the fastmod function requires either the 128-bit multiplication or the 64-bit unsigned-multiply-high instruction (e.g.,
__umulh
Visual C++ x64 intrinsic orUMULH
ARM instruction), which may not be natively available on some platforms or in some languages. Even when it is available, it may have worse performance characteristics than an ordinary 64-bit multiplication. (See, for instance, execution latency and throughput for ARM Cortex-A75 in section 4.6 of [2].) And if it is not available natively, it has to be emulated with two 64-bit multiplications, which reduces or even eliminates the potential performance win achieved by avoiding the integer division.We suggest the algorithm modification that solves the aforementioned issue. Namely, we round up lowbits to the next power of 2³², which allows to replace the unsigned-multiply-high instruction with an ordinary 32×32⭢64-bit unsigned multiplication:
Here casting
(lowbits >> 32)
touint32_t
is needed only to help the compiler to infer the correct size of the multiplication operand and generate faster code, especially when targeting 32-bit processors. Also note that our modification uses only the high half of the first multiplication, which may allow additional optimization.Let’s prove this algorithm produces the correct modulo value. The case 𝑑 ≥ 2³¹ is handled by the
if
statement: obviously the modulo is either 𝑛 itself, if 𝑛 < 𝑑, or 𝑛 − 𝑑 otherwise. If 𝑑 = 1, then we have 𝑐 = 0, lowbits = 0, and fastmod returns 0 as expected.Finally consider the case 𝑑 ∈ (1, 2³¹). We can write
𝑐 = ⌈2⁶⁴/𝑑⌉ = 2⁶⁴/𝑑 + ε,
where rational ε ∈ [0, 1), and 𝑛 = 𝑞𝑑 + 𝑟, where integer 𝑞 ≥ 0 and integer 𝑟 ∈ [0, 𝑑). Then
𝑐𝑛 = (2⁶⁴/𝑑 + ε)𝑛 = 2⁶⁴⋅𝑞 + 2⁶⁴⋅𝑟/𝑑 + ε𝑛.
Since ε𝑛𝑑 < 1⋅2³²⋅2³¹ < 2⁶⁴, we have ε𝑛 < 2⁶⁴/𝑑 and
0 ≤ 2⁶⁴⋅𝑟/𝑑 + ε𝑛 < 2⁶⁴.
That means
lowbits = 𝑐𝑛 mod 2⁶⁴ = 2⁶⁴⋅𝑟/𝑑 + ε𝑛
(in particular, that shows that the quotient 𝑞 may be calculated as 𝑐𝑛 >> 64). Rounding lowbits up may be represented as adding some rational δ ∈ (0, 1]:
(lowbits >> 32) + 1 = ⌊lowbits / 2³²⌋ + 1 = 2³²⋅𝑟/𝑑 + ε𝑛/2³² + δ.
After multiplying by 𝑑 we have
(2³²⋅𝑟/𝑑 + ε𝑛/2³² + δ)𝑑 = 2³²⋅𝑟 + (ε𝑛/2³² + δ)𝑑.
Since
0 < (ε𝑛/2³² + δ)𝑑 < (1 + 1)𝑑 < 2³²,
shifting the last multiplication result right by 32 bits produces 𝑟, which concludes the proof. □
The table below summarizes size requirements for the two multiplications performed in the original and our method for 𝑁-bit integers.
(full 2𝑁 bits)
(high 𝑁 bits only)
(high 𝑁 bits only)
(high 𝑁 bits only)
On common processors the 2𝑁×𝑁⭢3𝑁 multiplication from the original algorithm may be executed natively (as a single instruction) only in the form of a 2𝑁×2𝑁⭢4𝑁 multiplication, which imposes a limit on the size of integers that allow very fast modulo calculation. Roughly speaking, our approach makes possible to use native multiplications for twice longer numbers. As a practical example, below is the 64-bit variant of fastmod, which uses the
__uint128_t
extension type available in GCC and Clang compilers.Conclusion
We demonstrated that the unsigned-multiply-high instruction for 2𝑁-bit integers in the direct remainder computation via the multiplicative inverse may be replaced, to improve performance, with an ordinary 𝑁×𝑁⭢2𝑁-bit unsigned multiplication if the highest bit of the divisor is guaranteed to be zero. If there is no such guarantee, trivial handling of that special case adds a single branch to the common code path.
References