Even faster modulo computation #406

AntonLapounov · 2019-11-30T11:33:42Z

Improved Direct Remainder Computation

Summary: Lemire, Kaser, and Kurz developed a faster method for calculating the remainder of two unsigned 𝑁-bit integers if repeated integer divisions are performed with the same divisor [1]. Their method avoids an expensive 𝑁-bit integer division instruction by precomputing the 2𝑁-bit multiplicative inverse of the divisor and leveraging the unsigned-multiply-high instruction for 2𝑁-bit integers, which is, however, not universally available. We improve their method by replacing the unsigned-multiply-high instruction with an ordinary 𝑁×𝑁⭢2𝑁-bit unsigned multiplication for the frequent case when the divisor belongs to [1, 2ᴺ⁻¹) interval. For divisors outside that interval, the remainder may be calculated by simple comparison and subtraction instructions.

While we assume 𝑁 = 32 below for simplicity, the same arguments trivially work for the general case as well. The recent work of Lemire, Kaser, and Kurz [1] presented the following algorithm for computing the modulo of two unsigned 32-bit integers:

// c = ceil( (1<<64) / d )
uint64_t c = UINT64_C(0xFFFFFFFFFFFFFFFF) / d + 1;

// fastmod computes (n mod d) given precomputed c
uint32_t fastmod(uint32_t n /* , uint64_t c, uint32_t d */)
{
    uint64_t lowbits = c * n;
    return ((__uint128_t)lowbits * d) >> 64;
}

Here the fastmod function requires either the 128-bit multiplication or the 64-bit unsigned-multiply-high instruction (e.g., __umulh Visual C++ x64 intrinsic or UMULH ARM instruction), which may not be natively available on some platforms or in some languages. Even when it is available, it may have worse performance characteristics than an ordinary 64-bit multiplication. (See, for instance, execution latency and throughput for ARM Cortex-A75 in section 4.6 of [2].) And if it is not available natively, it has to be emulated with two 64-bit multiplications, which reduces or even eliminates the potential performance win achieved by avoiding the integer division.

We suggest the algorithm modification that solves the aforementioned issue. Namely, we round up lowbits to the next power of 2³², which allows to replace the unsigned-multiply-high instruction with an ordinary 32×32⭢64-bit unsigned multiplication:

uint32_t fastmod(uint32_t n /* , uint64_t c, uint32_t d */)
{
    if ((int32_t)d < 0)
        return n < d ? n : n - d;

    uint64_t lowbits = c * n;
    return ((uint64_t)((uint32_t)(lowbits >> 32) + 1) * d) >> 32;
}

Here casting (lowbits >> 32) to uint32_t is needed only to help the compiler to infer the correct size of the multiplication operand and generate faster code, especially when targeting 32-bit processors. Also note that our modification uses only the high half of the first multiplication, which may allow additional optimization.

Let’s prove this algorithm produces the correct modulo value. The case 𝑑 ≥ 2³¹ is handled by the if statement: obviously the modulo is either 𝑛 itself, if 𝑛 < 𝑑, or 𝑛 − 𝑑 otherwise. If 𝑑 = 1, then we have 𝑐 = 0, lowbits = 0, and fastmod returns 0 as expected.

Finally consider the case 𝑑 ∈ (1, 2³¹). We can write

𝑐 = ⌈2⁶⁴/𝑑⌉ = 2⁶⁴/𝑑 + ε,

where rational ε ∈ [0, 1), and 𝑛 = 𝑞𝑑 + 𝑟, where integer 𝑞 ≥ 0 and integer 𝑟 ∈ [0, 𝑑). Then

𝑐𝑛 = (2⁶⁴/𝑑 + ε)𝑛 = 2⁶⁴⋅𝑞 + 2⁶⁴⋅𝑟/𝑑 + ε𝑛.

Since ε𝑛𝑑 < 1⋅2³²⋅2³¹ < 2⁶⁴, we have ε𝑛 < 2⁶⁴/𝑑 and

0 ≤ 2⁶⁴⋅𝑟/𝑑 + ε𝑛 < 2⁶⁴.

That means

lowbits = 𝑐𝑛 mod 2⁶⁴ = 2⁶⁴⋅𝑟/𝑑 + ε𝑛

(in particular, that shows that the quotient 𝑞 may be calculated as 𝑐𝑛 >> 64). Rounding lowbits up may be represented as adding some rational δ ∈ (0, 1]:

(lowbits >> 32) + 1 = ⌊lowbits / 2³²⌋ + 1 = 2³²⋅𝑟/𝑑 + ε𝑛/2³² + δ.

After multiplying by 𝑑 we have

(2³²⋅𝑟/𝑑 + ε𝑛/2³² + δ)𝑑 = 2³²⋅𝑟 + (ε𝑛/2³² + δ)𝑑.

Since

0 < (ε𝑛/2³² + δ)𝑑 < (1 + 1)𝑑 < 2³²,

shifting the last multiplication result right by 32 bits produces 𝑟, which concludes the proof. □

The table below summarizes size requirements for the two multiplications performed in the original and our method for 𝑁-bit integers.

Mult. Op.	Original Method	Our Method
1	2𝑁×𝑁⭢2𝑁 (full 2𝑁 bits)	2𝑁×𝑁⭢2𝑁 (high 𝑁 bits only)
2	2𝑁×𝑁⭢3𝑁 (high 𝑁 bits only)	𝑁×𝑁⭢2𝑁 (high 𝑁 bits only)

On common processors the 2𝑁×𝑁⭢3𝑁 multiplication from the original algorithm may be executed natively (as a single instruction) only in the form of a 2𝑁×2𝑁⭢4𝑁 multiplication, which imposes a limit on the size of integers that allow very fast modulo calculation. Roughly speaking, our approach makes possible to use native multiplications for twice longer numbers. As a practical example, below is the 64-bit variant of fastmod, which uses the __uint128_t extension type available in GCC and Clang compilers.

uint64_t d = ...;  // Non-zero divisor
__uint128_t c = ~__uint128_t() / d + 1;  // 128-bit reciprocal of d

// Computes n mod d given precomputed reciprocal c
uint64_t fastmod64(uint64_t n /* , uint128_t c, uint64_t d */)
{
    // This check is needed only if d may be >= 2**63
    if ((int64_t)d < 0)
        return n < d ? n : n - d;

    __uint128_t lowbits = c * n;
    return ((__uint128_t)((uint64_t)(lowbits >> 64) + 1) * d) >> 64;
}

Conclusion

We demonstrated that the unsigned-multiply-high instruction for 2𝑁-bit integers in the direct remainder computation via the multiplicative inverse may be replaced, to improve performance, with an ordinary 𝑁×𝑁⭢2𝑁-bit unsigned multiplication if the highest bit of the divisor is guaranteed to be zero. If there is no such guarantee, trivial handling of that special case adds a single branch to the common code path.

References

D. Lemire, O. Kaser, and N. Kurz, Faster Remainder by Direct Computation, 2018. https://arxiv.org/abs/1902.01961
Arm Cortex-A75 Software Optimization Guide v2.0. https://developer.arm.com/docs/101398/latest/arm-cortex-a75-software-optimization-guide-v20

src/libraries/System.Private.CoreLib/src/System/Collections/HashHelpers.cs

danmoseley · 2019-11-30T17:07:55Z

@benadams

src/libraries/System.Private.CoreLib/src/System/Collections/HashHelpers.cs

danmoseley · 2019-11-30T19:58:54Z

Would it be possible to run the Dictionary performance tests?

stephentoub

LGTM as long as the perf tests results are shared and demonstrate the improvements.

stephentoub · 2019-12-05T14:48:20Z

I don't know how stable our dictionary perf tests are (@adamsitnik?), but I just tried running them comparing master against this PR (dotnet run -c Release -f netcoreapp3.0 --filter System.Collections.*.Dictionary --corerun d:\coreclrtest\old\corerun.exe d:\coreclrtest\new\corerun.exe --join):

Namespace	Type	Method	Toolchain	Size	Count	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Gen 0	Gen 1	Gen 2	Allocated
System.Collections	CtorDefaultSize	Dictionary	\new\corerun.exe	?	?	7.971 ns	0.0806 ns	0.0673 ns	7.955 ns	7.834 ns	8.108 ns	1.01	0.03	0.0127	-	-	80 B
System.Collections	CtorDefaultSize	Dictionary	\old\corerun.exe	?	?	7.864 ns	0.1780 ns	0.1578 ns	7.812 ns	7.703 ns	8.203 ns	1.00	0.00	0.0127	-	-	80 B

System.Collections	CtorDefaultSize	Dictionary	\new\corerun.exe	?	?	21.037 ns	0.1110 ns	0.1038 ns	21.010 ns	20.898 ns	21.241 ns	1.00	0.01	0.0127	-	-	80 B
System.Collections	CtorDefaultSize	Dictionary	\old\corerun.exe	?	?	20.957 ns	0.2391 ns	0.2120 ns	20.901 ns	20.639 ns	21.438 ns	1.00	0.00	0.0127	-	-	80 B

System.Collections.Concurrent	IsEmpty	Dictionary	\new\corerun.exe	0	?	179.471 ns	0.9346 ns	0.8743 ns	179.208 ns	178.610 ns	181.316 ns	1.01	0.00	-	-	-	-
System.Collections.Concurrent	IsEmpty	Dictionary	\old\corerun.exe	0	?	178.311 ns	0.1173 ns	0.0980 ns	178.334 ns	178.195 ns	178.562 ns	1.00	0.00	-	-	-	-

System.Collections.Concurrent	IsEmpty	Dictionary	\new\corerun.exe	0	?	183.243 ns	2.2765 ns	2.1294 ns	182.639 ns	181.007 ns	189.120 ns	1.03	0.01	-	-	-	-
System.Collections.Concurrent	IsEmpty	Dictionary	\old\corerun.exe	0	?	178.162 ns	2.0080 ns	1.7801 ns	178.079 ns	175.092 ns	181.594 ns	1.00	0.00	-	-	-	-

System.Collections	TryAddDefaultSize	Dictionary	\new\corerun.exe	?	512	8,642.181 ns	129.1360 ns	120.7939 ns	8,668.149 ns	8,447.814 ns	8,850.007 ns	0.99	0.02	5.4658	0.6790	-	34496 B
System.Collections	TryAddDefaultSize	Dictionary	\old\corerun.exe	?	512	8,760.753 ns	54.7569 ns	51.2197 ns	8,763.397 ns	8,687.479 ns	8,844.295 ns	1.00	0.00	5.4818	0.6634	-	34496 B

System.Collections	TryAddDefaultSize	Dictionary	\new\corerun.exe	?	512	16,811.581 ns	166.5277 ns	155.7701 ns	16,844.224 ns	16,516.245 ns	17,120.856 ns	0.90	0.01	7.5840	1.2189	-	48096 B
System.Collections	TryAddDefaultSize	Dictionary	\old\corerun.exe	?	512	18,783.389 ns	191.5275 ns	179.1549 ns	18,807.522 ns	18,421.037 ns	19,094.640 ns	1.00	0.00	7.5735	1.2500	-	48096 B

System.Collections	TryAddGiventSize	Dictionary	\new\corerun.exe	?	512	5,132.154 ns	24.9406 ns	23.3295 ns	5,139.332 ns	5,094.976 ns	5,169.649 ns	1.05	0.01	1.6658	0.1028	-	10552 B
System.Collections	TryAddGiventSize	Dictionary	\old\corerun.exe	?	512	4,886.334 ns	45.8165 ns	42.8567 ns	4,881.514 ns	4,837.830 ns	4,988.198 ns	1.00	0.00	1.6617	0.1173	-	10552 B

System.Collections	TryAddGiventSize	Dictionary	\new\corerun.exe	?	512	11,921.791 ns	143.4045 ns	134.1407 ns	11,898.788 ns	11,691.920 ns	12,085.736 ns	0.89	0.01	2.3113	0.1887	-	14720 B
System.Collections	TryAddGiventSize	Dictionary	\old\corerun.exe	?	512	13,446.358 ns	97.0206 ns	86.0062 ns	13,485.080 ns	13,259.873 ns	13,520.866 ns	1.00	0.00	2.3148	0.2153	-	14720 B

System.Collections	ContainsKeyFalse<Int32, Int32>	Dictionary	\new\corerun.exe	512	?	4,196.509 ns	37.6365 ns	35.2052 ns	4,191.837 ns	4,131.109 ns	4,265.272 ns	1.05	0.01	-	-	-	-
System.Collections	ContainsKeyFalse<Int32, Int32>	Dictionary	\old\corerun.exe	512	?	4,005.805 ns	45.2786 ns	42.3536 ns	4,011.049 ns	3,930.971 ns	4,067.127 ns	1.00	0.00	-	-	-	-

System.Collections	ContainsKeyFalse<String, String>	Dictionary	\new\corerun.exe	512	?	8,190.718 ns	113.9371 ns	106.5768 ns	8,148.785 ns	8,076.627 ns	8,386.335 ns	0.88	0.02	-	-	-	-
System.Collections	ContainsKeyFalse<String, String>	Dictionary	\old\corerun.exe	512	?	9,295.734 ns	108.1640 ns	101.1767 ns	9,257.564 ns	9,167.581 ns	9,456.280 ns	1.00	0.00	-	-	-	-

System.Collections	ContainsKeyTrue<Int32, Int32>	Dictionary	\new\corerun.exe	512	?	2,878.326 ns	27.9878 ns	24.8105 ns	2,876.134 ns	2,839.007 ns	2,920.624 ns	0.82	0.01	-	-	-	-
System.Collections	ContainsKeyTrue<Int32, Int32>	Dictionary	\old\corerun.exe	512	?	3,529.037 ns	33.0705 ns	30.9342 ns	3,523.313 ns	3,489.681 ns	3,592.514 ns	1.00	0.00	-	-	-	-

System.Collections	ContainsKeyTrue<String, String>	Dictionary	\new\corerun.exe	512	?	11,939.961 ns	190.4188 ns	178.1178 ns	11,927.492 ns	11,645.192 ns	12,306.622 ns	1.17	0.02	-	-	-	-
System.Collections	ContainsKeyTrue<String, String>	Dictionary	\old\corerun.exe	512	?	10,199.234 ns	137.2107 ns	128.3469 ns	10,234.781 ns	9,984.764 ns	10,365.183 ns	1.00	0.00	-	-	-	-

System.Collections.Concurrent	Count	Dictionary	\new\corerun.exe	512	?	11,109.799 ns	55.6945 ns	52.0967 ns	11,097.451 ns	11,063.839 ns	11,209.116 ns	1.00	0.01	-	-	-	-
System.Collections.Concurrent	Count	Dictionary	\old\corerun.exe	512	?	11,093.136 ns	59.9762 ns	56.1018 ns	11,074.956 ns	11,033.843 ns	11,182.722 ns	1.00	0.00	-	-	-	-

System.Collections.Concurrent	Count	Dictionary	\new\corerun.exe	512	?	11,059.155 ns	13.5144 ns	11.2851 ns	11,058.164 ns	11,043.449 ns	11,078.536 ns	1.99	0.01	-	-	-	-
System.Collections.Concurrent	Count	Dictionary	\old\corerun.exe	512	?	5,547.308 ns	29.2962 ns	25.9704 ns	5,539.846 ns	5,521.386 ns	5,601.478 ns	1.00	0.00	-	-	-	-

System.Collections.Concurrent	IsEmpty	Dictionary	\new\corerun.exe	512	?	1.942 ns	0.0164 ns	0.0146 ns	1.937 ns	1.925 ns	1.968 ns	1.02	0.01	-	-	-	-
System.Collections.Concurrent	IsEmpty	Dictionary	\old\corerun.exe	512	?	1.906 ns	0.0168 ns	0.0140 ns	1.900 ns	1.894 ns	1.946 ns	1.00	0.00	-	-	-	-

System.Collections.Concurrent	IsEmpty	Dictionary	\new\corerun.exe	512	?	1.983 ns	0.0230 ns	0.0204 ns	1.982 ns	1.940 ns	2.016 ns	1.11	0.01	-	-	-	-
System.Collections.Concurrent	IsEmpty	Dictionary	\old\corerun.exe	512	?	1.782 ns	0.0197 ns	0.0184 ns	1.772 ns	1.766 ns	1.817 ns	1.00	0.00	-	-	-	-

System.Collections	TryGetValueFalse<Int32, Int32>	Dictionary	\new\corerun.exe	512	?	3,228.310 ns	51.6579 ns	48.3208 ns	3,210.684 ns	3,167.055 ns	3,308.001 ns	0.84	0.02	-	-	-	-
System.Collections	TryGetValueFalse<Int32, Int32>	Dictionary	\old\corerun.exe	512	?	3,847.236 ns	73.3052 ns	78.4358 ns	3,832.010 ns	3,741.345 ns	4,025.378 ns	1.00	0.00	-	-	-	-

System.Collections	TryGetValueFalse<String, String>	Dictionary	\new\corerun.exe	512	?	8,511.985 ns	94.6626 ns	88.5474 ns	8,478.742 ns	8,400.332 ns	8,671.092 ns	0.82	0.01	-	-	-	-
System.Collections	TryGetValueFalse<String, String>	Dictionary	\old\corerun.exe	512	?	10,372.818 ns	98.1868 ns	91.8440 ns	10,384.001 ns	10,231.016 ns	10,560.541 ns	1.00	0.00	-	-	-	-

System.Collections	TryGetValueTrue<Int32, Int32>	Dictionary	\new\corerun.exe	512	?	3,268.947 ns	35.1876 ns	32.9145 ns	3,270.629 ns	3,205.703 ns	3,330.972 ns	0.93	0.01	-	-	-	-
System.Collections	TryGetValueTrue<Int32, Int32>	Dictionary	\old\corerun.exe	512	?	3,524.350 ns	28.8635 ns	26.9989 ns	3,522.698 ns	3,479.728 ns	3,582.092 ns	1.00	0.00	-	-	-	-

System.Collections	TryGetValueTrue<String, String>	Dictionary	\new\corerun.exe	512	?	11,183.808 ns	107.2321 ns	83.7198 ns	11,199.878 ns	10,963.612 ns	11,253.554 ns	0.92	0.02	-	-	-	-
System.Collections	TryGetValueTrue<String, String>	Dictionary	\old\corerun.exe	512	?	12,147.178 ns	238.7773 ns	211.6699 ns	12,162.397 ns	11,922.243 ns	12,662.935 ns	1.00	0.00	-	-	-	-

System.Collections	AddGivenSize	Dictionary	\new\corerun.exe	512	?	4,917.614 ns	58.8491 ns	55.0475 ns	4,902.796 ns	4,864.008 ns	5,043.991 ns	0.96	0.01	1.6759	0.1183	-	10552 B
System.Collections	AddGivenSize	Dictionary	\old\corerun.exe	512	?	5,098.848 ns	67.7048 ns	60.0185 ns	5,087.442 ns	5,042.253 ns	5,262.915 ns	1.00	0.00	1.6607	0.1013	-	10552 B

System.Collections	AddGivenSize	Dictionary	\new\corerun.exe	512	?	11,787.198 ns	135.8180 ns	127.0442 ns	11,791.783 ns	11,565.203 ns	11,985.283 ns	0.85	0.01	2.3113	0.1887	-	14720 B
System.Collections	AddGivenSize	Dictionary	\old\corerun.exe	512	?	13,931.366 ns	252.1318 ns	210.5415 ns	13,837.602 ns	13,706.755 ns	14,511.746 ns	1.00	0.00	2.3189	0.2208	-	14720 B

System.Collections	CreateAddAndRemove	Dictionary	\new\corerun.exe	512	?	12,219.999 ns	54.8920 ns	51.3460 ns	12,214.902 ns	12,143.314 ns	12,325.980 ns	0.98	0.01	5.4902	0.5882	-	34496 B
System.Collections	CreateAddAndRemove	Dictionary	\old\corerun.exe	512	?	12,509.785 ns	75.7809 ns	70.8855 ns	12,512.010 ns	12,395.650 ns	12,629.810 ns	1.00	0.00	5.4500	0.6500	-	34496 B

System.Collections	CreateAddAndRemove	Dictionary	\new\corerun.exe	512	?	28,848.349 ns	425.4446 ns	397.9611 ns	28,857.743 ns	28,160.139 ns	29,488.310 ns	0.97	0.02	7.5231	1.1574	-	48096 B
System.Collections	CreateAddAndRemove	Dictionary	\old\corerun.exe	512	?	29,711.881 ns	315.1039 ns	246.0124 ns	29,794.525 ns	29,185.016 ns	29,958.979 ns	1.00	0.00	7.5793	1.1660	-	48096 B

System.Collections	CtorFromCollection	Dictionary	\new\corerun.exe	512	?	5,328.539 ns	42.2652 ns	39.5349 ns	5,325.277 ns	5,260.674 ns	5,396.585 ns	1.01	0.01	1.6604	0.1064	-	10552 B
System.Collections	CtorFromCollection	Dictionary	\old\corerun.exe	512	?	5,268.343 ns	80.6653 ns	71.5077 ns	5,261.446 ns	5,141.944 ns	5,399.126 ns	1.00	0.00	1.6741	0.1033	-	10552 B

System.Collections	CtorFromCollection	Dictionary	\new\corerun.exe	512	?	12,155.699 ns	230.5432 ns	204.3706 ns	12,109.271 ns	11,856.386 ns	12,555.615 ns	0.99	0.03	2.3274	0.1939	-	14720 B
System.Collections	CtorFromCollection	Dictionary	\old\corerun.exe	512	?	12,281.335 ns	239.0049 ns	275.2384 ns	12,371.391 ns	11,924.332 ns	12,588.691 ns	1.00	0.00	2.3238	0.1936	-	14720 B

System.Collections	CtorGivenSize	Dictionary	\new\corerun.exe	512	?	600.400 ns	4.2618 ns	3.7779 ns	600.195 ns	594.210 ns	609.696 ns	0.99	0.01	1.6802	0.1185	-	10552 B
System.Collections	CtorGivenSize	Dictionary	\old\corerun.exe	512	?	604.849 ns	6.0067 ns	5.3248 ns	604.945 ns	593.972 ns	615.608 ns	1.00	0.00	1.6791	0.1199	-	10552 B

System.Collections	CtorGivenSize	Dictionary	\new\corerun.exe	512	?	828.336 ns	5.2881 ns	4.1286 ns	829.201 ns	820.189 ns	832.795 ns	0.98	0.01	2.3406	0.2324	-	14720 B
System.Collections	CtorGivenSize	Dictionary	\old\corerun.exe	512	?	845.827 ns	11.1096 ns	9.8484 ns	844.358 ns	823.553 ns	864.187 ns	1.00	0.00	2.3405	0.2323	-	14720 B

System.Collections	IndexerSet	Dictionary	\new\corerun.exe	512	?	3,627.391 ns	42.7597 ns	39.9974 ns	3,614.602 ns	3,578.202 ns	3,701.079 ns	0.84	0.02	-	-	-	-
System.Collections	IndexerSet	Dictionary	\old\corerun.exe	512	?	4,314.844 ns	56.0820 ns	52.4592 ns	4,315.108 ns	4,196.149 ns	4,403.242 ns	1.00	0.00	-	-	-	-

System.Collections	IndexerSet	Dictionary	\new\corerun.exe	512	?	12,620.999 ns	158.5170 ns	148.2769 ns	12,677.643 ns	12,302.958 ns	12,804.831 ns	1.12	0.03	-	-	-	-
System.Collections	IndexerSet	Dictionary	\old\corerun.exe	512	?	11,317.209 ns	214.2848 ns	200.4422 ns	11,291.518 ns	11,084.916 ns	11,718.958 ns	1.00	0.00	-	-	-	-

System.Collections	CreateAddAndClear	Dictionary	\new\corerun.exe	512	?	8,549.983 ns	148.1630 ns	138.5918 ns	8,502.677 ns	8,422.230 ns	8,920.879 ns	0.96	0.02	5.4812	0.6550	-	34496 B
System.Collections	CreateAddAndClear	Dictionary	\old\corerun.exe	512	?	8,901.832 ns	77.5331 ns	68.7310 ns	8,902.074 ns	8,792.957 ns	9,026.153 ns	1.00	0.00	5.4781	0.6759	-	34496 B

System.Collections	CreateAddAndClear	Dictionary	\new\corerun.exe	512	?	17,432.387 ns	161.1934 ns	142.8938 ns	17,474.341 ns	17,190.930 ns	17,628.844 ns	0.98	0.01	7.5758	1.0522	-	48096 B
System.Collections	CreateAddAndClear	Dictionary	\old\corerun.exe	512	?	17,685.794 ns	180.9287 ns	169.2408 ns	17,774.844 ns	17,462.791 ns	17,939.059 ns	1.00	0.00	7.5822	1.2046	-	48096 B

System.Collections	IterateForEach	Dictionary	\new\corerun.exe	512	?	2,964.574 ns	13.4741 ns	12.6037 ns	2,960.335 ns	2,948.225 ns	2,996.135 ns	1.00	0.01	-	-	-	-
System.Collections	IterateForEach	Dictionary	\old\corerun.exe	512	?	2,968.804 ns	32.1897 ns	30.1103 ns	2,956.874 ns	2,915.913 ns	3,017.299 ns	1.00	0.00	-	-	-	-

System.Collections	IterateForEach	Dictionary	\new\corerun.exe	512	?	2,610.703 ns	17.8546 ns	15.8276 ns	2,613.893 ns	2,578.656 ns	2,638.124 ns	1.15	0.01	-	-	-	-
System.Collections	IterateForEach	Dictionary	\old\corerun.exe	512	?	2,276.244 ns	32.7461 ns	30.6307 ns	2,267.638 ns	2,238.542 ns	2,348.679 ns	1.00	0.00	-	-	-	-

benaadams · 2019-12-05T15:15:23Z

I don't know how stable our dictionary perf tests are

I have an outstanding PR for stable keys (rather than random) dotnet/performance#938

AntonLapounov · 2019-12-05T18:18:04Z

@stephentoub Thank you for running the tests. Since this is a very local optimization (3 multiplications instead of 2), we should not see any big differences at all. Any difference above a couple of percent sounds suspicious to me. For instance, 15% diff in the last test, which I think don't even use the code path that I am changing (only in the GlobalSetup stage) sounds very fishy.

I am going to run the tests myself on x64 and ARM64 in a few days and report results. Adding NO MERGE label for now.

…asterMod

stephentoub · 2020-01-15T12:59:00Z

I am going to run the tests myself on x64 and ARM64 in a few days and report results

@AntonLapounov, any update? Should this be closed? Merged?

stephentoub · 2020-01-21T15:40:59Z

@AntonLapounov, can you please comment on the status of this PR?

AntonLapounov · 2020-01-21T19:53:25Z

@stephentoub Sorry, I have not had a chance to measure it due to other work. I might need some guidance on how to run those tests on ARM64. I would be quite surprised if this change had negative performance impact.

jkotas · 2020-01-22T00:27:52Z

I am fine with taking this even without confirmed throughput measurements. I agree with @AntonLapounov that it is very unlikely to make anything shower; and it makes the Dictionary code measurably smaller. For example, Dictionary<int,int>.TryInsert goes from 713 bytes to 685 bytes on Windows x64 (~4% improvement).

jkotas · 2020-01-22T00:29:20Z

@AntonLapounov Could you please rebase or push a merge from master, so that we get a fresh run of the CI?

…asterMod

danmoseley · 2020-01-22T00:50:46Z

src/libraries/System.Private.CoreLib/src/System/Collections/HashHelpers.cs

@@ -90,21 +90,24 @@ public static int ExpandPrime(int oldSize)
        }

 #if BIT64
+        // Returns approximate reciprocal of the divisor: ceil(2**64 / divisor)
        public static ulong GetFastModMultiplier(uint divisor)
            => ulong.MaxValue / divisor + 1;

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static uint FastMod(uint value, uint divisor, ulong multiplier)


Nit: surely this is inlined even without the attribute..

Have you verified it? This method is going to be ~20 bytes of IL. It is in the gray where the JIT may choose to not inline it.

You are exactly right .. it is 18 bytes IL, if I remove the + 1 to make 15 bytes of IL, it is then inlined.

AntonLapounov · 2020-01-22T03:32:45Z

@jkotas Thank you for trying this change. One question that I had in mind is what would generate better code for x64 and ARM64, the current change:

uint highbits = (uint)((((lowbits >> 32) + 1) * divisor) >> 32);

or this variation:

uint highbits = (uint)(((ulong)((uint)(lowbits >> 32) + 1) * divisor) >> 32);

Here (lowbits >> 32) + 1 is guaranteed to fit in 32 bits and we can use a 32×32⭢64 multiplication; however, JIT cannot deduce that and uses a 64×64⭢64 multiplication.

jkotas · 2020-01-22T03:56:01Z

I think your current version is fine. I do not think that the alternative would make a difference with the current JIT, and given that this is enabled for 64-bit only anyway.

AntonLapounov added area-System.Collections optimization labels Nov 30, 2019

AntonLapounov self-assigned this Nov 30, 2019

AntonLapounov force-pushed the EvenFasterMod branch 2 times, most recently from fece4fe to 9e413c0 Compare November 30, 2019 11:37

tannergooding reviewed Nov 30, 2019

View reviewed changes

src/libraries/System.Private.CoreLib/src/System/Collections/HashHelpers.cs Outdated Show resolved Hide resolved

AntonLapounov requested a review from benaadams November 30, 2019 17:16

benaadams reviewed Nov 30, 2019

View reviewed changes

src/libraries/System.Private.CoreLib/src/System/Collections/HashHelpers.cs Show resolved Hide resolved

benaadams approved these changes Nov 30, 2019

View reviewed changes

AntonLapounov force-pushed the EvenFasterMod branch 2 times, most recently from 8c1d54c to 4bb0bbb Compare December 1, 2019 22:06

Even faster modulo computation: remove one of the three multiplications.

0927e77

AntonLapounov force-pushed the EvenFasterMod branch from 4bb0bbb to 0927e77 Compare December 1, 2019 22:08

stephentoub approved these changes Dec 2, 2019

View reviewed changes

AntonLapounov added the NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) label Dec 5, 2019

Merge branch 'master' of https://github.com/dotnet/runtime into EvenF…

38cfffd

…asterMod

jkotas approved these changes Jan 22, 2020

View reviewed changes

Merge branch 'master' of https://github.com/dotnet/runtime into EvenF…

2d3a8ca

…asterMod

danmoseley reviewed Jan 22, 2020

View reviewed changes

jkotas removed the NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) label Jan 22, 2020

jkotas merged commit ea9bfd2 into dotnet:master Jan 22, 2020

ycrumeyrolle mentioned this pull request Feb 17, 2020

Minor improvement on the dataLength % 3. gfoidl/Base64#114

Merged

adamsitnik mentioned this pull request Sep 2, 2020

[ARM64] Performance regression: DictionarySequentialKeys.ContainsKey #41743

Closed

AntonLapounov deleted the EvenFasterMod branch September 14, 2020 06:13

ghost locked as resolved and limited conversation to collaborators Dec 11, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Even faster modulo computation #406

Even faster modulo computation #406

AntonLapounov commented Nov 30, 2019 •

edited

Loading

danmoseley commented Nov 30, 2019

danmoseley commented Nov 30, 2019

stephentoub left a comment

stephentoub commented Dec 5, 2019

benaadams commented Dec 5, 2019

AntonLapounov commented Dec 5, 2019

stephentoub commented Jan 15, 2020

stephentoub commented Jan 21, 2020 •

edited

Loading

AntonLapounov commented Jan 21, 2020

jkotas commented Jan 22, 2020

jkotas commented Jan 22, 2020

danmoseley Jan 22, 2020 •

edited

Loading

jkotas Jan 22, 2020

danmoseley Jan 22, 2020

AntonLapounov commented Jan 22, 2020

jkotas commented Jan 22, 2020

Even faster modulo computation #406

Even faster modulo computation #406

Conversation

AntonLapounov commented Nov 30, 2019 • edited Loading

Improved Direct Remainder Computation

Conclusion

References

danmoseley commented Nov 30, 2019

danmoseley commented Nov 30, 2019

stephentoub left a comment

Choose a reason for hiding this comment

stephentoub commented Dec 5, 2019

benaadams commented Dec 5, 2019

AntonLapounov commented Dec 5, 2019

stephentoub commented Jan 15, 2020

stephentoub commented Jan 21, 2020 • edited Loading

AntonLapounov commented Jan 21, 2020

jkotas commented Jan 22, 2020

jkotas commented Jan 22, 2020

danmoseley Jan 22, 2020 • edited Loading

Choose a reason for hiding this comment

jkotas Jan 22, 2020

Choose a reason for hiding this comment

danmoseley Jan 22, 2020

Choose a reason for hiding this comment

AntonLapounov commented Jan 22, 2020

jkotas commented Jan 22, 2020

AntonLapounov commented Nov 30, 2019 •

edited

Loading

stephentoub commented Jan 21, 2020 •

edited

Loading

danmoseley Jan 22, 2020 •

edited

Loading