Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Even faster modulo computation #406

Merged
merged 3 commits into from
Jan 22, 2020
Merged

Conversation

AntonLapounov
Copy link
Member

@AntonLapounov AntonLapounov commented Nov 30, 2019

Improved Direct Remainder Computation

Summary: Lemire, Kaser, and Kurz developed a faster method for calculating the remainder of two unsigned 𝑁-bit integers if repeated integer divisions are performed with the same divisor [1]. Their method avoids an expensive 𝑁-bit integer division instruction by precomputing the 2𝑁-bit multiplicative inverse of the divisor and leveraging the unsigned-multiply-high instruction for 2𝑁-bit integers, which is, however, not universally available. We improve their method by replacing the unsigned-multiply-high instruction with an ordinary 𝑁×𝑁⭢2𝑁-bit unsigned multiplication for the frequent case when the divisor belongs to [1, 2ᴺ⁻¹) interval. For divisors outside that interval, the remainder may be calculated by simple comparison and subtraction instructions.

While we assume 𝑁 = 32 below for simplicity, the same arguments trivially work for the general case as well. The recent work of Lemire, Kaser, and Kurz [1] presented the following algorithm for computing the modulo of two unsigned 32-bit integers:

// c = ceil( (1<<64) / d )
uint64_t c = UINT64_C(0xFFFFFFFFFFFFFFFF) / d + 1;

// fastmod computes (n mod d) given precomputed c
uint32_t fastmod(uint32_t n /* , uint64_t c, uint32_t d */)
{
    uint64_t lowbits = c * n;
    return ((__uint128_t)lowbits * d) >> 64;
}

Here the fastmod function requires either the 128-bit multiplication or the 64-bit unsigned-multiply-high instruction (e.g., __umulh Visual C++ x64 intrinsic or UMULH ARM instruction), which may not be natively available on some platforms or in some languages. Even when it is available, it may have worse performance characteristics than an ordinary 64-bit multiplication. (See, for instance, execution latency and throughput for ARM Cortex-A75 in section 4.6 of [2].) And if it is not available natively, it has to be emulated with two 64-bit multiplications, which reduces or even eliminates the potential performance win achieved by avoiding the integer division.

We suggest the algorithm modification that solves the aforementioned issue. Namely, we round up lowbits to the next power of 2³², which allows to replace the unsigned-multiply-high instruction with an ordinary 32×32⭢64-bit unsigned multiplication:

uint32_t fastmod(uint32_t n /* , uint64_t c, uint32_t d */)
{
    if ((int32_t)d < 0)
        return n < d ? n : n - d;

    uint64_t lowbits = c * n;
    return ((uint64_t)((uint32_t)(lowbits >> 32) + 1) * d) >> 32;
}

Here casting (lowbits >> 32) to uint32_t is needed only to help the compiler to infer the correct size of the multiplication operand and generate faster code, especially when targeting 32-bit processors. Also note that our modification uses only the high half of the first multiplication, which may allow additional optimization.

Let’s prove this algorithm produces the correct modulo value. The case 𝑑 ≥ 2³¹ is handled by the if statement: obviously the modulo is either 𝑛 itself, if 𝑛 < 𝑑, or 𝑛 − 𝑑 otherwise. If 𝑑 = 1, then we have 𝑐 = 0, lowbits = 0, and fastmod returns 0 as expected.

Finally consider the case 𝑑 ∈ (1, 2³¹). We can write

𝑐 = ⌈2⁶⁴/𝑑⌉ = 2⁶⁴/𝑑 + ε,

where rational ε ∈ [0, 1), and 𝑛 = 𝑞𝑑 + 𝑟, where integer 𝑞 ≥ 0 and integer 𝑟 ∈ [0, 𝑑). Then

𝑐𝑛 = (2⁶⁴/𝑑 + ε)𝑛 = 2⁶⁴⋅𝑞 + 2⁶⁴⋅𝑟/𝑑 + ε𝑛.

Since ε𝑛𝑑 < 1⋅2³²⋅2³¹ < 2⁶⁴, we have ε𝑛 < 2⁶⁴/𝑑 and

0 ≤ 2⁶⁴⋅𝑟/𝑑 + ε𝑛 < 2⁶⁴.

That means

lowbits = 𝑐𝑛 mod 2⁶⁴ = 2⁶⁴⋅𝑟/𝑑 + ε𝑛

(in particular, that shows that the quotient 𝑞 may be calculated as 𝑐𝑛 >> 64). Rounding lowbits up may be represented as adding some rational δ ∈ (0, 1]:

(lowbits >> 32) + 1 = ⌊lowbits / 2³²⌋ + 1 = 2³²⋅𝑟/𝑑 + ε𝑛/2³² + δ.

After multiplying by 𝑑 we have

(2³²⋅𝑟/𝑑 + ε𝑛/2³² + δ)𝑑 = 2³²⋅𝑟 + (ε𝑛/2³² + δ)𝑑.

Since

0 < (ε𝑛/2³² + δ)𝑑 < (1 + 1)𝑑 < 2³²,

shifting the last multiplication result right by 32 bits produces 𝑟, which concludes the proof. □

The table below summarizes size requirements for the two multiplications performed in the original and our method for 𝑁-bit integers.

Mult. Op. Original Method Our Method
1 2𝑁×𝑁⭢2𝑁
(full 2𝑁 bits)
2𝑁×𝑁⭢2𝑁
(high 𝑁 bits only)
2 2𝑁×𝑁⭢3𝑁
(high 𝑁 bits only)
𝑁×𝑁⭢2𝑁
(high 𝑁 bits only)

On common processors the 2𝑁×𝑁⭢3𝑁 multiplication from the original algorithm may be executed natively (as a single instruction) only in the form of a 2𝑁×2𝑁⭢4𝑁 multiplication, which imposes a limit on the size of integers that allow very fast modulo calculation. Roughly speaking, our approach makes possible to use native multiplications for twice longer numbers. As a practical example, below is the 64-bit variant of fastmod, which uses the __uint128_t extension type available in GCC and Clang compilers.

uint64_t d = ...;  // Non-zero divisor
__uint128_t c = ~__uint128_t() / d + 1;  // 128-bit reciprocal of d

// Computes n mod d given precomputed reciprocal c
uint64_t fastmod64(uint64_t n /* , uint128_t c, uint64_t d */)
{
    // This check is needed only if d may be >= 2**63
    if ((int64_t)d < 0)
        return n < d ? n : n - d;

    __uint128_t lowbits = c * n;
    return ((__uint128_t)((uint64_t)(lowbits >> 64) + 1) * d) >> 64;
}

Conclusion

We demonstrated that the unsigned-multiply-high instruction for 2𝑁-bit integers in the direct remainder computation via the multiplicative inverse may be replaced, to improve performance, with an ordinary 𝑁×𝑁⭢2𝑁-bit unsigned multiplication if the highest bit of the divisor is guaranteed to be zero. If there is no such guarantee, trivial handling of that special case adds a single branch to the common code path.

References

  1. D. Lemire, O. Kaser, and N. Kurz, Faster Remainder by Direct Computation, 2018. https://arxiv.org/abs/1902.01961
  2. Arm Cortex-A75 Software Optimization Guide v2.0. https://developer.arm.com/docs/101398/latest/arm-cortex-a75-software-optimization-guide-v20

@danmoseley
Copy link
Member

@benadams

@danmoseley
Copy link
Member

Would it be possible to run the Dictionary performance tests?

@AntonLapounov AntonLapounov force-pushed the EvenFasterMod branch 2 times, most recently from 8c1d54c to 4bb0bbb Compare December 1, 2019 22:06
Copy link
Member

@stephentoub stephentoub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM as long as the perf tests results are shared and demonstrate the improvements.

@stephentoub
Copy link
Member

I don't know how stable our dictionary perf tests are (@adamsitnik?), but I just tried running them comparing master against this PR (dotnet run -c Release -f netcoreapp3.0 --filter System.Collections.*.Dictionary --corerun d:\coreclrtest\old\corerun.exe d:\coreclrtest\new\corerun.exe --join):

Namespace Type Method Toolchain Size Count Mean Error StdDev Median Min Max Ratio RatioSD Gen 0 Gen 1 Gen 2 Allocated
System.Collections CtorDefaultSize Dictionary \new\corerun.exe ? ? 7.971 ns 0.0806 ns 0.0673 ns 7.955 ns 7.834 ns 8.108 ns 1.01 0.03 0.0127 - - 80 B
System.Collections CtorDefaultSize Dictionary \old\corerun.exe ? ? 7.864 ns 0.1780 ns 0.1578 ns 7.812 ns 7.703 ns 8.203 ns 1.00 0.00 0.0127 - - 80 B
System.Collections CtorDefaultSize Dictionary \new\corerun.exe ? ? 21.037 ns 0.1110 ns 0.1038 ns 21.010 ns 20.898 ns 21.241 ns 1.00 0.01 0.0127 - - 80 B
System.Collections CtorDefaultSize Dictionary \old\corerun.exe ? ? 20.957 ns 0.2391 ns 0.2120 ns 20.901 ns 20.639 ns 21.438 ns 1.00 0.00 0.0127 - - 80 B
System.Collections.Concurrent IsEmpty Dictionary \new\corerun.exe 0 ? 179.471 ns 0.9346 ns 0.8743 ns 179.208 ns 178.610 ns 181.316 ns 1.01 0.00 - - - -
System.Collections.Concurrent IsEmpty Dictionary \old\corerun.exe 0 ? 178.311 ns 0.1173 ns 0.0980 ns 178.334 ns 178.195 ns 178.562 ns 1.00 0.00 - - - -
System.Collections.Concurrent IsEmpty Dictionary \new\corerun.exe 0 ? 183.243 ns 2.2765 ns 2.1294 ns 182.639 ns 181.007 ns 189.120 ns 1.03 0.01 - - - -
System.Collections.Concurrent IsEmpty Dictionary \old\corerun.exe 0 ? 178.162 ns 2.0080 ns 1.7801 ns 178.079 ns 175.092 ns 181.594 ns 1.00 0.00 - - - -
System.Collections TryAddDefaultSize Dictionary \new\corerun.exe ? 512 8,642.181 ns 129.1360 ns 120.7939 ns 8,668.149 ns 8,447.814 ns 8,850.007 ns 0.99 0.02 5.4658 0.6790 - 34496 B
System.Collections TryAddDefaultSize Dictionary \old\corerun.exe ? 512 8,760.753 ns 54.7569 ns 51.2197 ns 8,763.397 ns 8,687.479 ns 8,844.295 ns 1.00 0.00 5.4818 0.6634 - 34496 B
System.Collections TryAddDefaultSize Dictionary \new\corerun.exe ? 512 16,811.581 ns 166.5277 ns 155.7701 ns 16,844.224 ns 16,516.245 ns 17,120.856 ns 0.90 0.01 7.5840 1.2189 - 48096 B
System.Collections TryAddDefaultSize Dictionary \old\corerun.exe ? 512 18,783.389 ns 191.5275 ns 179.1549 ns 18,807.522 ns 18,421.037 ns 19,094.640 ns 1.00 0.00 7.5735 1.2500 - 48096 B
System.Collections TryAddGiventSize Dictionary \new\corerun.exe ? 512 5,132.154 ns 24.9406 ns 23.3295 ns 5,139.332 ns 5,094.976 ns 5,169.649 ns 1.05 0.01 1.6658 0.1028 - 10552 B
System.Collections TryAddGiventSize Dictionary \old\corerun.exe ? 512 4,886.334 ns 45.8165 ns 42.8567 ns 4,881.514 ns 4,837.830 ns 4,988.198 ns 1.00 0.00 1.6617 0.1173 - 10552 B
System.Collections TryAddGiventSize Dictionary \new\corerun.exe ? 512 11,921.791 ns 143.4045 ns 134.1407 ns 11,898.788 ns 11,691.920 ns 12,085.736 ns 0.89 0.01 2.3113 0.1887 - 14720 B
System.Collections TryAddGiventSize Dictionary \old\corerun.exe ? 512 13,446.358 ns 97.0206 ns 86.0062 ns 13,485.080 ns 13,259.873 ns 13,520.866 ns 1.00 0.00 2.3148 0.2153 - 14720 B
System.Collections ContainsKeyFalse<Int32, Int32> Dictionary \new\corerun.exe 512 ? 4,196.509 ns 37.6365 ns 35.2052 ns 4,191.837 ns 4,131.109 ns 4,265.272 ns 1.05 0.01 - - - -
System.Collections ContainsKeyFalse<Int32, Int32> Dictionary \old\corerun.exe 512 ? 4,005.805 ns 45.2786 ns 42.3536 ns 4,011.049 ns 3,930.971 ns 4,067.127 ns 1.00 0.00 - - - -
System.Collections ContainsKeyFalse<String, String> Dictionary \new\corerun.exe 512 ? 8,190.718 ns 113.9371 ns 106.5768 ns 8,148.785 ns 8,076.627 ns 8,386.335 ns 0.88 0.02 - - - -
System.Collections ContainsKeyFalse<String, String> Dictionary \old\corerun.exe 512 ? 9,295.734 ns 108.1640 ns 101.1767 ns 9,257.564 ns 9,167.581 ns 9,456.280 ns 1.00 0.00 - - - -
System.Collections ContainsKeyTrue<Int32, Int32> Dictionary \new\corerun.exe 512 ? 2,878.326 ns 27.9878 ns 24.8105 ns 2,876.134 ns 2,839.007 ns 2,920.624 ns 0.82 0.01 - - - -
System.Collections ContainsKeyTrue<Int32, Int32> Dictionary \old\corerun.exe 512 ? 3,529.037 ns 33.0705 ns 30.9342 ns 3,523.313 ns 3,489.681 ns 3,592.514 ns 1.00 0.00 - - - -
System.Collections ContainsKeyTrue<String, String> Dictionary \new\corerun.exe 512 ? 11,939.961 ns 190.4188 ns 178.1178 ns 11,927.492 ns 11,645.192 ns 12,306.622 ns 1.17 0.02 - - - -
System.Collections ContainsKeyTrue<String, String> Dictionary \old\corerun.exe 512 ? 10,199.234 ns 137.2107 ns 128.3469 ns 10,234.781 ns 9,984.764 ns 10,365.183 ns 1.00 0.00 - - - -
System.Collections.Concurrent Count Dictionary \new\corerun.exe 512 ? 11,109.799 ns 55.6945 ns 52.0967 ns 11,097.451 ns 11,063.839 ns 11,209.116 ns 1.00 0.01 - - - -
System.Collections.Concurrent Count Dictionary \old\corerun.exe 512 ? 11,093.136 ns 59.9762 ns 56.1018 ns 11,074.956 ns 11,033.843 ns 11,182.722 ns 1.00 0.00 - - - -
System.Collections.Concurrent Count Dictionary \new\corerun.exe 512 ? 11,059.155 ns 13.5144 ns 11.2851 ns 11,058.164 ns 11,043.449 ns 11,078.536 ns 1.99 0.01 - - - -
System.Collections.Concurrent Count Dictionary \old\corerun.exe 512 ? 5,547.308 ns 29.2962 ns 25.9704 ns 5,539.846 ns 5,521.386 ns 5,601.478 ns 1.00 0.00 - - - -
System.Collections.Concurrent IsEmpty Dictionary \new\corerun.exe 512 ? 1.942 ns 0.0164 ns 0.0146 ns 1.937 ns 1.925 ns 1.968 ns 1.02 0.01 - - - -
System.Collections.Concurrent IsEmpty Dictionary \old\corerun.exe 512 ? 1.906 ns 0.0168 ns 0.0140 ns 1.900 ns 1.894 ns 1.946 ns 1.00 0.00 - - - -
System.Collections.Concurrent IsEmpty Dictionary \new\corerun.exe 512 ? 1.983 ns 0.0230 ns 0.0204 ns 1.982 ns 1.940 ns 2.016 ns 1.11 0.01 - - - -
System.Collections.Concurrent IsEmpty Dictionary \old\corerun.exe 512 ? 1.782 ns 0.0197 ns 0.0184 ns 1.772 ns 1.766 ns 1.817 ns 1.00 0.00 - - - -
System.Collections TryGetValueFalse<Int32, Int32> Dictionary \new\corerun.exe 512 ? 3,228.310 ns 51.6579 ns 48.3208 ns 3,210.684 ns 3,167.055 ns 3,308.001 ns 0.84 0.02 - - - -
System.Collections TryGetValueFalse<Int32, Int32> Dictionary \old\corerun.exe 512 ? 3,847.236 ns 73.3052 ns 78.4358 ns 3,832.010 ns 3,741.345 ns 4,025.378 ns 1.00 0.00 - - - -
System.Collections TryGetValueFalse<String, String> Dictionary \new\corerun.exe 512 ? 8,511.985 ns 94.6626 ns 88.5474 ns 8,478.742 ns 8,400.332 ns 8,671.092 ns 0.82 0.01 - - - -
System.Collections TryGetValueFalse<String, String> Dictionary \old\corerun.exe 512 ? 10,372.818 ns 98.1868 ns 91.8440 ns 10,384.001 ns 10,231.016 ns 10,560.541 ns 1.00 0.00 - - - -
System.Collections TryGetValueTrue<Int32, Int32> Dictionary \new\corerun.exe 512 ? 3,268.947 ns 35.1876 ns 32.9145 ns 3,270.629 ns 3,205.703 ns 3,330.972 ns 0.93 0.01 - - - -
System.Collections TryGetValueTrue<Int32, Int32> Dictionary \old\corerun.exe 512 ? 3,524.350 ns 28.8635 ns 26.9989 ns 3,522.698 ns 3,479.728 ns 3,582.092 ns 1.00 0.00 - - - -
System.Collections TryGetValueTrue<String, String> Dictionary \new\corerun.exe 512 ? 11,183.808 ns 107.2321 ns 83.7198 ns 11,199.878 ns 10,963.612 ns 11,253.554 ns 0.92 0.02 - - - -
System.Collections TryGetValueTrue<String, String> Dictionary \old\corerun.exe 512 ? 12,147.178 ns 238.7773 ns 211.6699 ns 12,162.397 ns 11,922.243 ns 12,662.935 ns 1.00 0.00 - - - -
System.Collections AddGivenSize Dictionary \new\corerun.exe 512 ? 4,917.614 ns 58.8491 ns 55.0475 ns 4,902.796 ns 4,864.008 ns 5,043.991 ns 0.96 0.01 1.6759 0.1183 - 10552 B
System.Collections AddGivenSize Dictionary \old\corerun.exe 512 ? 5,098.848 ns 67.7048 ns 60.0185 ns 5,087.442 ns 5,042.253 ns 5,262.915 ns 1.00 0.00 1.6607 0.1013 - 10552 B
System.Collections AddGivenSize Dictionary \new\corerun.exe 512 ? 11,787.198 ns 135.8180 ns 127.0442 ns 11,791.783 ns 11,565.203 ns 11,985.283 ns 0.85 0.01 2.3113 0.1887 - 14720 B
System.Collections AddGivenSize Dictionary \old\corerun.exe 512 ? 13,931.366 ns 252.1318 ns 210.5415 ns 13,837.602 ns 13,706.755 ns 14,511.746 ns 1.00 0.00 2.3189 0.2208 - 14720 B
System.Collections CreateAddAndRemove Dictionary \new\corerun.exe 512 ? 12,219.999 ns 54.8920 ns 51.3460 ns 12,214.902 ns 12,143.314 ns 12,325.980 ns 0.98 0.01 5.4902 0.5882 - 34496 B
System.Collections CreateAddAndRemove Dictionary \old\corerun.exe 512 ? 12,509.785 ns 75.7809 ns 70.8855 ns 12,512.010 ns 12,395.650 ns 12,629.810 ns 1.00 0.00 5.4500 0.6500 - 34496 B
System.Collections CreateAddAndRemove Dictionary \new\corerun.exe 512 ? 28,848.349 ns 425.4446 ns 397.9611 ns 28,857.743 ns 28,160.139 ns 29,488.310 ns 0.97 0.02 7.5231 1.1574 - 48096 B
System.Collections CreateAddAndRemove Dictionary \old\corerun.exe 512 ? 29,711.881 ns 315.1039 ns 246.0124 ns 29,794.525 ns 29,185.016 ns 29,958.979 ns 1.00 0.00 7.5793 1.1660 - 48096 B
System.Collections CtorFromCollection Dictionary \new\corerun.exe 512 ? 5,328.539 ns 42.2652 ns 39.5349 ns 5,325.277 ns 5,260.674 ns 5,396.585 ns 1.01 0.01 1.6604 0.1064 - 10552 B
System.Collections CtorFromCollection Dictionary \old\corerun.exe 512 ? 5,268.343 ns 80.6653 ns 71.5077 ns 5,261.446 ns 5,141.944 ns 5,399.126 ns 1.00 0.00 1.6741 0.1033 - 10552 B
System.Collections CtorFromCollection Dictionary \new\corerun.exe 512 ? 12,155.699 ns 230.5432 ns 204.3706 ns 12,109.271 ns 11,856.386 ns 12,555.615 ns 0.99 0.03 2.3274 0.1939 - 14720 B
System.Collections CtorFromCollection Dictionary \old\corerun.exe 512 ? 12,281.335 ns 239.0049 ns 275.2384 ns 12,371.391 ns 11,924.332 ns 12,588.691 ns 1.00 0.00 2.3238 0.1936 - 14720 B
System.Collections CtorGivenSize Dictionary \new\corerun.exe 512 ? 600.400 ns 4.2618 ns 3.7779 ns 600.195 ns 594.210 ns 609.696 ns 0.99 0.01 1.6802 0.1185 - 10552 B
System.Collections CtorGivenSize Dictionary \old\corerun.exe 512 ? 604.849 ns 6.0067 ns 5.3248 ns 604.945 ns 593.972 ns 615.608 ns 1.00 0.00 1.6791 0.1199 - 10552 B
System.Collections CtorGivenSize Dictionary \new\corerun.exe 512 ? 828.336 ns 5.2881 ns 4.1286 ns 829.201 ns 820.189 ns 832.795 ns 0.98 0.01 2.3406 0.2324 - 14720 B
System.Collections CtorGivenSize Dictionary \old\corerun.exe 512 ? 845.827 ns 11.1096 ns 9.8484 ns 844.358 ns 823.553 ns 864.187 ns 1.00 0.00 2.3405 0.2323 - 14720 B
System.Collections IndexerSet Dictionary \new\corerun.exe 512 ? 3,627.391 ns 42.7597 ns 39.9974 ns 3,614.602 ns 3,578.202 ns 3,701.079 ns 0.84 0.02 - - - -
System.Collections IndexerSet Dictionary \old\corerun.exe 512 ? 4,314.844 ns 56.0820 ns 52.4592 ns 4,315.108 ns 4,196.149 ns 4,403.242 ns 1.00 0.00 - - - -
System.Collections IndexerSet Dictionary \new\corerun.exe 512 ? 12,620.999 ns 158.5170 ns 148.2769 ns 12,677.643 ns 12,302.958 ns 12,804.831 ns 1.12 0.03 - - - -
System.Collections IndexerSet Dictionary \old\corerun.exe 512 ? 11,317.209 ns 214.2848 ns 200.4422 ns 11,291.518 ns 11,084.916 ns 11,718.958 ns 1.00 0.00 - - - -
System.Collections CreateAddAndClear Dictionary \new\corerun.exe 512 ? 8,549.983 ns 148.1630 ns 138.5918 ns 8,502.677 ns 8,422.230 ns 8,920.879 ns 0.96 0.02 5.4812 0.6550 - 34496 B
System.Collections CreateAddAndClear Dictionary \old\corerun.exe 512 ? 8,901.832 ns 77.5331 ns 68.7310 ns 8,902.074 ns 8,792.957 ns 9,026.153 ns 1.00 0.00 5.4781 0.6759 - 34496 B
System.Collections CreateAddAndClear Dictionary \new\corerun.exe 512 ? 17,432.387 ns 161.1934 ns 142.8938 ns 17,474.341 ns 17,190.930 ns 17,628.844 ns 0.98 0.01 7.5758 1.0522 - 48096 B
System.Collections CreateAddAndClear Dictionary \old\corerun.exe 512 ? 17,685.794 ns 180.9287 ns 169.2408 ns 17,774.844 ns 17,462.791 ns 17,939.059 ns 1.00 0.00 7.5822 1.2046 - 48096 B
System.Collections IterateForEach Dictionary \new\corerun.exe 512 ? 2,964.574 ns 13.4741 ns 12.6037 ns 2,960.335 ns 2,948.225 ns 2,996.135 ns 1.00 0.01 - - - -
System.Collections IterateForEach Dictionary \old\corerun.exe 512 ? 2,968.804 ns 32.1897 ns 30.1103 ns 2,956.874 ns 2,915.913 ns 3,017.299 ns 1.00 0.00 - - - -
System.Collections IterateForEach Dictionary \new\corerun.exe 512 ? 2,610.703 ns 17.8546 ns 15.8276 ns 2,613.893 ns 2,578.656 ns 2,638.124 ns 1.15 0.01 - - - -
System.Collections IterateForEach Dictionary \old\corerun.exe 512 ? 2,276.244 ns 32.7461 ns 30.6307 ns 2,267.638 ns 2,238.542 ns 2,348.679 ns 1.00 0.00 - - - -

@benaadams
Copy link
Member

I don't know how stable our dictionary perf tests are

I have an outstanding PR for stable keys (rather than random) dotnet/performance#938

@AntonLapounov AntonLapounov added the NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) label Dec 5, 2019
@AntonLapounov
Copy link
Member Author

@stephentoub Thank you for running the tests. Since this is a very local optimization (3 multiplications instead of 2), we should not see any big differences at all. Any difference above a couple of percent sounds suspicious to me. For instance, 15% diff in the last test, which I think don't even use the code path that I am changing (only in the GlobalSetup stage) sounds very fishy.

I am going to run the tests myself on x64 and ARM64 in a few days and report results. Adding NO MERGE label for now.

@stephentoub
Copy link
Member

I am going to run the tests myself on x64 and ARM64 in a few days and report results

@AntonLapounov, any update? Should this be closed? Merged?

@stephentoub
Copy link
Member

stephentoub commented Jan 21, 2020

@AntonLapounov, can you please comment on the status of this PR?

@AntonLapounov
Copy link
Member Author

@stephentoub Sorry, I have not had a chance to measure it due to other work. I might need some guidance on how to run those tests on ARM64. I would be quite surprised if this change had negative performance impact.

@jkotas
Copy link
Member

jkotas commented Jan 22, 2020

I am fine with taking this even without confirmed throughput measurements. I agree with @AntonLapounov that it is very unlikely to make anything shower; and it makes the Dictionary code measurably smaller. For example, Dictionary<int,int>.TryInsert goes from 713 bytes to 685 bytes on Windows x64 (~4% improvement).

@jkotas
Copy link
Member

jkotas commented Jan 22, 2020

@AntonLapounov Could you please rebase or push a merge from master, so that we get a fresh run of the CI?

@@ -90,21 +90,24 @@ public static int ExpandPrime(int oldSize)
}

#if BIT64
// Returns approximate reciprocal of the divisor: ceil(2**64 / divisor)
public static ulong GetFastModMultiplier(uint divisor)
=> ulong.MaxValue / divisor + 1;

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static uint FastMod(uint value, uint divisor, ulong multiplier)
Copy link
Member

@danmoseley danmoseley Jan 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: surely this is inlined even without the attribute..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you verified it? This method is going to be ~20 bytes of IL. It is in the gray where the JIT may choose to not inline it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are exactly right .. it is 18 bytes IL, if I remove the + 1 to make 15 bytes of IL, it is then inlined.

@AntonLapounov
Copy link
Member Author

@jkotas Thank you for trying this change. One question that I had in mind is what would generate better code for x64 and ARM64, the current change:

uint highbits = (uint)((((lowbits >> 32) + 1) * divisor) >> 32);

or this variation:

uint highbits = (uint)(((ulong)((uint)(lowbits >> 32) + 1) * divisor) >> 32);

Here (lowbits >> 32) + 1 is guaranteed to fit in 32 bits and we can use a 32×32⭢64 multiplication; however, JIT cannot deduce that and uses a 64×64⭢64 multiplication.

@jkotas
Copy link
Member

jkotas commented Jan 22, 2020

I think your current version is fine. I do not think that the alternative would make a difference with the current JIT, and given that this is enabled for 64-bit only anyway.

@jkotas jkotas removed the NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) label Jan 22, 2020
@jkotas jkotas merged commit ea9bfd2 into dotnet:master Jan 22, 2020
@AntonLapounov AntonLapounov deleted the EvenFasterMod branch September 14, 2020 06:13
@ghost ghost locked as resolved and limited conversation to collaborators Dec 11, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants