[SIMD] [POC] Improve performance of rounding dates in date_histogram aggregation #10392
Comments
@reta @backslasht @msfroh @kkmr Tagging folks here since you all helped me review #9727. Would love to know your thoughts on this. :)
Thanks @ketanv3 for the detailed analysis. This is impressive. +1 on the follow-up questions you have already captured. In addition to that, it will be good to understand the saw-tooth pattern observed between buckets 30-50. Also, what will be the max bucket size for the date aggregations? Would it be approximately 165 buckets (weekly aggregations for three years)?
@backslasht To your other question, we use array-based prepared rounding only up to 128 buckets. Beyond this, we fall back to recalculating the bucket key for each hit. Based on the performance improvement we get from this experiment, we can always bump up that number as long as 1) the extra cost of pre-computing and 2) the extra memory overhead are justified.
@ketanv3 I didn't find information on which JDK you ran the tests on. I assume JDK 21, right? I'm also very curious to figure out the answers to the follow-up questions, since this does not look like a win-win for all cases. Thanks a lot for prototyping this.
@reta I ran these tests with OpenJDK 20 since Gradle 8.3 didn't support JDK 21. I see that Gradle 8.4 is now released, so I'll rerun with that now.
@reta Similar micro-benchmark results with OpenJDK 21.
Makes sense!
Here's some in-depth analysis of the saw-tooth pattern observed with small arrays. As mentioned in an earlier comment, the structure of the B-tree changes as the size grows, which influences the decisions of the branch predictor. This has a huge impact on performance!

Part 1

So which branch is being mispredicted? My first suspicion was the outermost loop's breaking condition. This happens because, in order to optimize for memory requirements, we only store the required number of elements as opposed to storing a "perfect" tree padded with sentinel values. Here's how much the performance improves when I create such a B-tree, though it comes at the cost of using more memory.

You can see that the performance becomes very stable for array sizes 10...40 and follows a predictable step-down pattern in response to the B-tree height. But there are a few more problems: 1) Why is an array of size 45 slower than other, larger arrays? 2) Why does the step-down happen at 45? For a platform with 8 vector lanes, the step-down points should be 8, 80, 728, and so on.

Part 2

That's when I explored the only other branch, which I replaced with a branchless update:

```java
int o = v.compare(VectorOperators.GT, k).firstTrue(); // offset of first 'true'
int j = i + o;
int m = (res ^ j) & (~((o - 1) >> 31)); // bitmask to transform 'res' to 'j' if 'o' is non-zero.
res ^= m;
```

Here's how the performance looks now. This variant uses more instructions (slower overall) but always executes in the same amount of time for a given B-tree height. The step-down points also align with the theoretical performance.

Part 3

This analysis also reveals a flaw in my micro-benchmarks. The impact of branch misprediction is exaggerated when the distribution of keys is random. But for time-series data, the keys will likely be sorted, so consecutive hits will likely run the same number of loop iterations. It may be a non-issue after all! I need to rerun the tests with sorted keys.
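To make the bitmask trick in the Part 2 snippet above easier to follow, here is a small standalone illustration (a hypothetical helper, not taken from the POC): when `o == 0` the mask is all zeros and `res` is left unchanged; when `o > 0` the mask is all ones and the XOR rewrites `res` to `j`.

```java
public final class BranchlessSelect {
    // Hypothetical standalone helper illustrating the bitmask trick:
    // keep 'res' when o == 0, replace it with 'j' when o > 0.
    static int select(int res, int j, int o) {
        int mask = ~((o - 1) >> 31);     // 0 when o == 0, -1 (all ones) when o > 0
        return res ^ ((res ^ j) & mask); // res when mask == 0, j when mask == -1
    }

    public static void main(String[] args) {
        System.out.println(select(5, 9, 0)); // 5: o == 0, keep res
        System.out.println(select(5, 9, 3)); // 9: o > 0, take j
    }
}
```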
@ketanv3 - Thanks for the detailed analysis. Looking forward to the results with sorted keys.
@backslasht Here you go! Just FYI, the absolute values in these performance results can be a bit meaningless; the trends, on the other hand, are more meaningful. Full results: https://gist.github.com/ketanv3/4aba76b694c066ea576575b2291405c1

Random Keys

I've updated the B-tree implementation to build a complete tree (i.e., padded with sentinel values only on the right side of the array). This reduces the saw-tooth pattern to some extent without using the excessive memory of a perfect tree. The remaining saw-tooth pattern is still present due to branch misprediction. Interestingly, linear search is better than binary search even for sizes as high as 256. Vectorized B-tree search is better than linear/binary search for all array sizes greater than 1.

Sorted Keys

As expected, branch misprediction is no longer a concern and the drop in performance is more organic as the array size increases. Even the regular binary search is faster, overtaking the linear search at array size 64. Vectorized B-tree search is better than linear/binary search for all array sizes greater than 1.

This trend matches what I observed in the macro benchmark results, the only anomaly being the slight regression in the 36-month aggregation, which is still unexplained. The improvements can be substantial if VectorMask::firstTrue is intrinsified with bit-scan instructions, which could more than make up for this regression.
@ketanv3 thanks for such an in-depth analysis, do you think:
Thanks @ketanv3 for the deep analysis and the great summary. Is this analysis with a processor having 8 vector lanes?
I'm not sure what you mean by optional structure. I chose the block size equal to the number of vector lanes so that I can compare all keys of a node in a single shot. A larger block size creates a tree of smaller height, but clearly that's not the only factor to consider for performance.
Yes, this was tested on an Intel Xeon 8375C CPU, which supports full 512-bit registers. During development, I tested this on an Intel Core i7-9750H CPU, which supports 256-bit registers. Larger vectors saw better performance in general. @backslasht Yes, that's correct.
I noticed that simple linear search remains faster for a small number of buckets:
Vectorized B-tree search was marginally better for a medium number of buckets:
Top 5 hot methods:
Conclusion
@reta @backslasht Please let me know if you have any other experiments in mind. I feel that the results aren't a win-win; the improvements in some places aren't substantial enough to make up for the regressions in other places, and that comes at the cost of much higher code complexity.
I agree with you. Maybe you will find the discussions on Apache Lucene relevant to vectorization [1], [2], [3], [4] useful in explaining some of the results. I don't think we should throw away this work. What do you think about introducing a family of experimental feature flags?

[1] apache/lucene#12621
+1, I agree with @reta, we can still contribute this code behind an experimental flag for bucket counts greater than 64. It will also stand as an example for folks trying to use SIMD in the rest of the code base.
@reta @backslasht I like these suggestions, thank you! I'll raise a PR soon.
Problem Statement
In the date histogram aggregation, the timestamp value from each hit must be rounded down to the nearest interval (year, quarter, month, week, day, etc.) as defined in the search request. This rounded-down timestamp serves as the bucket key to aggregate results.
#9727 improved performance by using linear search when the array size is small (<= 64 elements). This POC evaluates if performance can be improved further using SIMD (related to #9423).
1 — Baseline
The current approach uses linear search when the search space is small, or binary search otherwise. This is because linear search is far superior for small arrays: it avoids the penalty of branch mispredictions and pipeline stalls, and it accesses memory sequentially. To simplify the remaining comparison, I'll assume that each value is equally likely to be picked (uniform distribution), so we need not consider bidirectional linear search in this POC.
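For reference, a minimal scalar sketch of the two baseline strategies (hypothetical helper names, not the actual OpenSearch code): given an ascending array of precomputed bucket boundaries, return the greatest boundary that is less than or equal to the key.

```java
import java.util.Arrays;

public final class BaselineRounding {
    // 'values' holds precomputed bucket boundaries in ascending order within
    // [0, size), and the key is assumed to be >= values[0].
    static long roundLinear(long[] values, int size, long key) {
        int i = 0;
        // Walk forward while the next boundary is still <= key.
        while (i < size - 1 && values[i + 1] <= key) {
            i++;
        }
        return values[i];
    }

    static long roundBinary(long[] values, int size, long key) {
        int idx = Arrays.binarySearch(values, 0, size, key);
        // On a miss, binarySearch returns -(insertionPoint) - 1; the round-down
        // point is the boundary just before the insertion point.
        return idx >= 0 ? values[idx] : values[-idx - 2];
    }
}
```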
2 — Vectorized Linear Search
Assume a platform that supports AVX2 instructions (i.e., 256-bit wide registers). We perform a linear search comparing 4 longs at once, taking a stride of 4 elements with each iteration. We pad the array with $\infty$ to make it an exact multiple of the number of vector lanes ($B$). The search stops once we find the first value greater than the desired key.
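Here is a sketch of what this could look like with the Java Vector API (my illustration, not the exact POC code; it assumes the array is already padded with Long.MAX_VALUE as the $\infty$ sentinel so its length is a multiple of the lane count, with at least one sentinel at the end, and that the key is >= the first element).

```java
import jdk.incubator.vector.LongVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public final class VectorizedLinearRounding {
    // SPECIES_PREFERRED picks the widest supported shape (4 lanes on AVX2,
    // 8 lanes on AVX-512) rather than hard-coding 4 lanes.
    static final VectorSpecies<Long> SPECIES = LongVector.SPECIES_PREFERRED;

    static long roundVectorizedLinear(long[] values, long key) {
        for (int i = 0; i < values.length; i += SPECIES.length()) {
            LongVector v = LongVector.fromArray(SPECIES, values, i);
            // Lane index of the first boundary strictly greater than the key,
            // or SPECIES.length() if no lane matches.
            int firstGreater = v.compare(VectorOperators.GT, key).firstTrue();
            if (firstGreater < SPECIES.length()) {
                return values[i + firstGreater - 1]; // boundary just before it
            }
        }
        throw new IllegalArgumentException("key exceeds the padded sentinel");
    }
}
```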
3 — Vectorized B-Tree Search
Here, the idea is to improve binary search using vectorization. Since binary search compares values at non-contiguous memory locations, which are non-trivial to load into a vector register, the first step is to fix the memory layout.
To find the round-down point in this layout, we start our search from the $0$-th index, comparing $B$ values at once. The position of the first set bit in the vector mask tells us which sub-tree (between $0 \ldots B$) to visit to converge to the round-down point.
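A hedged sketch of how such a search might be implemented with the Java Vector API. The layout below is my assumption (a B-way tree stored breadth-first in a flat array, padded with Long.MAX_VALUE); the actual POC layout lives in the Rounder.java link below.

```java
import jdk.incubator.vector.LongVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public final class VectorizedBTreeRounding {
    static final VectorSpecies<Long> SPECIES = LongVector.SPECIES_PREFERRED;
    static final int B = SPECIES.length(); // keys per node == number of vector lanes

    // Assumed layout: node n occupies indices [n * B, n * B + B); its (B + 1)
    // children are nodes n * (B + 1) + 1 + c for c in [0, B]. Unused key slots
    // are padded with Long.MAX_VALUE. The key is assumed to be >= the smallest
    // boundary stored in the tree.
    static long roundBTree(long[] btree, int nodeCount, long key) {
        long result = Long.MIN_VALUE;
        int node = 0;
        while (node < nodeCount) {
            LongVector v = LongVector.fromArray(SPECIES, btree, node * B);
            // Offset of the first key strictly greater than 'key' (B if none).
            int c = v.compare(VectorOperators.GT, key).firstTrue();
            if (c > 0) {
                result = btree[node * B + c - 1]; // largest key <= 'key' in this node
            }
            node = node * (B + 1) + 1 + c; // descend into child sub-tree c
        }
        return result;
    }
}
```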
POC code
Code for the above implementations (and many others) can be found here:
https://github.com/ketanv3/opensearch-simd-playground/blob/main/src/main/java/org/opensearch/Rounder.java
Results
Results are from a c6i.2xlarge EC2 instance (Intel Xeon 8375C CPU @ 2.90 GHz), which supports full 512-bit registers (8 vector lanes for long).
Micro Benchmarks
These results compare the performance of looking up round-down points for 1 million random keys picked from a uniform distribution. Since pre-computed array rounding is only used for small arrays (code), I've limited the benchmarks to array sizes up to 256.
Code for the micro benchmark harness can be found here:
https://github.com/ketanv3/opensearch-simd-playground/blob/main/src/jmh/java/org/opensearch/RounderBenchmark.java
Macro Benchmarks
These results compare the performance of executing the following aggregation with varying intervals using the noaa dataset.
These results compare the performance of executing the following aggregation with an increasing number of months.
Hot Methods Profile
To understand why the 36-month aggregation isn't seeing the expected speed-up, I took a profile of hot methods.
It turns out that VectorMask::firstTrue itself performs a linear scan on the result, which almost defeats the purpose of vectorization for this use case. While there is still some improvement, it could have been much greater if Java had intrinsified this with bit-scan instructions, which are widely available on almost all platforms.
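One possible mitigation, not part of this POC: extract the mask as packed bits with VectorMask::toLong and scan it with Long.numberOfTrailingZeros, which the JIT does intrinsify to a bit-scan instruction on x86. Whether this ends up faster than firstTrue on a given platform and JDK would need benchmarking.

```java
import jdk.incubator.vector.VectorMask;

public final class FirstTrueWorkaround {
    // Hypothetical alternative to VectorMask::firstTrue: pack the mask into a
    // long and scan it with Long.numberOfTrailingZeros (TZCNT/BSF on x86).
    static int firstTrueViaBitScan(VectorMask<Long> mask) {
        long bits = mask.toLong(); // bit i is set iff lane i is set
        return bits == 0
            ? mask.vectorSpecies().length() // mirror firstTrue(): lane count if none set
            : Long.numberOfTrailingZeros(bits);
    }
}
```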
Conclusion
TODO
Follow-Up Questions