[SIMD] [POC] Improve performance of rounding dates in date_histogram aggregation #10392

Closed
ketanv3 opened this issue Oct 5, 2023 · 19 comments · Fixed by #11194
Labels
enhancement, Search:Performance, v2.12.0, v3.0.0

Comments

@ketanv3
Contributor

ketanv3 commented Oct 5, 2023

Problem Statement

In the date histogram aggregation, the timestamp value from each hit must be rounded down to the nearest interval (year, quarter, month, week, day, etc.) as defined in the search request. This rounded-down timestamp serves as the bucket key to aggregate results.

#9727 improved performance by using linear search when the array size is small (<= 64 elements). This POC evaluates whether performance can be improved further using SIMD (related to #9423).

1 — Baseline

The current approach uses linear search when the search space is small and binary search otherwise. Linear search is far superior for small arrays because it avoids the penalty of branch mispredictions and pipeline stalls, and it accesses memory sequentially. To simplify the remaining comparison, I'll assume that each value is equally likely to be picked (uniform distribution), so we need not consider bidirectional linear search in this POC.
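To make the baseline concrete, here is a minimal sketch of that hybrid idea (illustrative only, not the actual OpenSearch implementation, which also has a bidirectional linear variant). It assumes the key is never smaller than the first rounding point:

```java
import java.util.Arrays;

// Hybrid round-down sketch: linear scan for small arrays, binary search otherwise.
static long roundDown(long[] values, int size, long key) {
    if (size <= 64) {
        long result = values[0];
        // Linear scan: keep the last rounding point that is <= key.
        for (int i = 1; i < size && values[i] <= key; i++) {
            result = values[i];
        }
        return result;
    }
    // Binary search returns (-(insertion point) - 1) when the key is not present.
    int idx = Arrays.binarySearch(values, 0, size, key);
    return idx >= 0 ? values[idx] : values[-idx - 2];
}
```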

2 — Vectorized Linear Search

Assume a platform that supports AVX2 instructions (i.e., 256-bit wide registers). We perform a linear search comparing 4 longs at once, taking a stride of 4 elements with each iteration. We pad the array with $\infty$ to make it an exact multiple of the number of vector lanes ($B$). The search stops once we find the first value greater than the desired key.

[Figure: vectorized linear search comparing B values per stride]
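Roughly, using the Panama Vector API (jdk.incubator.vector), the loop could look like the sketch below. This is illustrative rather than the exact POC code (linked further down); it uses Long.MAX_VALUE as the padding sentinel and assumes the key is never smaller than the first rounding point:

```java
import jdk.incubator.vector.LongVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// 'values' is padded with Long.MAX_VALUE up to a multiple of the lane count,
// and the caller guarantees that key >= values[0].
static long roundDownVectorizedLinear(long[] values, long key) {
    final VectorSpecies<Long> species = LongVector.SPECIES_256; // 4 long lanes with AVX2
    for (int i = 0; i < values.length; i += species.length()) {
        LongVector block = LongVector.fromArray(species, values, i);
        // Lane index of the first value strictly greater than the key (lane count if none).
        int first = block.compare(VectorOperators.GT, key).firstTrue();
        if (first < species.length()) {
            return values[i + first - 1]; // last value <= key
        }
    }
    return values[values.length - 1]; // unreachable when the padding sentinel is present
}
```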

3 — Vectorized B-Tree Search

Here, the idea is to improve binary search using vectorization. Binary search compares values at non-contiguous memory locations, which are non-trivial to load into a vector register, so the first step is to fix the memory layout.

  1. The first idea is to use the Eytzinger layout. It is an implicit binary tree packed into an array where the left and right sub-trees of a node at index $i$ are stored at indices $2i+1$ and $2i+2$, respectively. This is also how binary heaps are laid out in memory.
  2. The second idea is to store multiple values ($B$) per node, which makes the branching factor $1+B$. This is essentially a B-tree! But unlike B-trees, where the number of values is chosen to fit in a single cache line, we choose it to be equal to the number of vector lanes.
  3. The underlying array is padded to the nearest multiple of vector lanes. We do not need to provision memory for completely empty nodes. I also experimented with a complete tree (all levels are filled except the last level, which is filled from left to right) but didn't see much difference in performance.

[Figure: vectorized B-tree memory layout]

To find the round-down point in this layout, we start our search from the $0$-th index comparing $B$ values at once. The position of the first set bit in the vector mask tells us which sub-tree (between $0...B$) to visit to converge to the round-down point.

[Figure: vectorized B-tree search]
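In code, the search loop looks roughly like the sketch below (illustrative, not the exact POC implementation). It assumes a complete, sentinel-padded tree so that every addressed child exists, uses Long.MAX_VALUE as the sentinel, and assumes the key is never smaller than the smallest rounding point; the child-offset formula is the Eytzinger rule generalized to a branching factor of $B+1$:

```java
import jdk.incubator.vector.LongVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Each node stores B keys (B = lane count); the c-th child (0 <= c <= B) of the node
// starting at array offset i begins at offset i * (B + 1) + (c + 1) * B.
static long roundDownVectorizedBTree(long[] btree, long key) {
    final VectorSpecies<Long> species = LongVector.SPECIES_PREFERRED;
    final int lanes = species.length();
    long result = Long.MIN_VALUE; // caller guarantees key >= smallest rounding point
    int i = 0;
    while (i < btree.length) {
        LongVector node = LongVector.fromArray(species, btree, i);
        int o = node.compare(VectorOperators.GT, key).firstTrue(); // 0..lanes
        if (o > 0) {
            result = btree[i + o - 1]; // largest key <= search key within this node
        }
        i = i * (lanes + 1) + (o + 1) * lanes; // descend into the o-th child
    }
    return result;
}
```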

POC code

Code for the above implementations (and many others) can be found here:
https://github.com/ketanv3/opensearch-simd-playground/blob/main/src/main/java/org/opensearch/Rounder.java

Results

Results are from a c6i.2xlarge EC2 instance (Intel Xeon Platinum 8375C CPU @ 2.90 GHz), which supports full 512-bit registers (8 vector lanes for long).

Cpuinfo Version: 7.0.0
Vendor ID Raw: GenuineIntel
Hardware Raw:
Brand Raw: Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
Hz Advertised Friendly: 2.9000 GHz
Hz Actual Friendly: 3.4986 GHz
Hz Advertised: (2900000000, 0)
Hz Actual: (3498596000, 0)
Arch: X86_64
Bits: 64
Count: 16
Arch String Raw: x86_64
L1 Data Cache Size: 384 KiB (8 instances)
L1 Instruction Cache Size: 256 KiB (8 instances)
L2 Cache Size: 10 MiB (8 instances)
L2 Cache Line Size: 256
L2 Cache Associativity: 6
L3 Cache Size: 56623104
Stepping: 6
Model: 106
Family: 6
Processor Type:
Flags: 3dnowprefetch, abm, adx, aes, aperfmperf, apic, arat, arch_capabilities, avx, avx2, avx512_bitalg, avx512_vbmi2, avx512_vnni, avx512_vpopcntdq, avx512bitalg, avx512bw, avx512cd, avx512dq, avx512f, avx512ifma, avx512vbmi, avx512vbmi2, avx512vl, avx512vnni, avx512vpopcntdq, bmi1, bmi2, clflush, clflushopt, clwb, cmov, constant_tsc, cpuid, cx16, cx8, de, erms, f16c, flush_l1d, fma, fpu, fsgsbase, fxsr, gfni, ht, hypervisor, ibpb, ibrs, ibrs_enhanced, ida, invpcid, invpcid_single, lahf_lm, lm, mca, mce, md_clear, mmx, movbe, msr, mtrr, nonstop_tsc, nopl, nx, ospke, osxsave, pae, pat, pcid, pclmulqdq, pdpe1gb, pge, pku, pni, popcnt, pse, pse36, rdpid, rdrand, rdrnd, rdseed, rdtscp, rep_good, sep, sha, sha_ni, smap, smep, ss, ssbd, sse, sse2, sse4_1, sse4_2, ssse3, stibp, syscall, tme, tsc, tsc_adjust, tsc_deadline_timer, tsc_known_freq, tscdeadline, vaes, vme, vpclmulqdq, wbnoinvd, x2apic, xgetbv1, xsave, xsavec, xsaveopt, xsaves, xtopology

Micro Benchmarks

These results compare the performance of looking up round-down points for 1 million random keys picked from a uniform distribution. Since pre-computed array rounding is only used for small arrays (code), I've limited the benchmarks to array sizes up to 256.

Code for the micro benchmark harness can be found here:
https://github.com/ketanv3/opensearch-simd-playground/blob/main/src/jmh/java/org/opensearch/RounderBenchmark.java

[Chart: micro-benchmark results]
size binary linear hybrid eytzinger vectorized_linear vectorized_eytzinger unrolled_eytzinger vectorized_btree
1 295.947 323.76 301.949 338.849 244.832 204.102 253.831 234.225
2 90.548 93.462 90.892 91.012 244.903 204.151 87.779 234.251
3 71.746 76.643 76.824 74.248 244.886 204.12 106.255 234.136
4 65.067 68.915 69.974 61.451 244.737 204.287 89.979 234.172
5 55.412 65.889 66.058 57.786 244.869 204.388 79.007 234.26
6 51.574 64.382 63.812 53.777 244.905 204.282 69.455 234.156
7 48.416 61.33 62.01 50.647 244.747 204.372 65.809 234.111
8 47.702 60.355 60.123 47.504 162.481 204.162 63.483 233.679
9 45.354 59.581 59.805 46.404 121.948 148.625 56.026 114.642
10 43.744 58.828 58.401 44.667 101.297 120.654 52.898 99.97
12 40.655 56.81 56.354 42.518 82.603 92.188 53.281 83.234
14 38.731 56.113 56.068 40.256 77.681 77.288 51.574 75.347
16 37.615 54.779 54.723 37.298 74.428 72.294 50.882 73.562
18 36.908 53.505 53.931 36.632 68.818 78.296 49.609 124.952
20 35.854 52.526 52.702 36.131 65.477 82.456 49.973 107.786
22 35.123 51.663 52.095 35.515 65.466 86.55 50.202 97.138
24 34.347 50.718 51.107 34.691 62.389 89.973 50.378 88.548
26 32.883 50.147 50.473 33.915 62.359 97.614 48.697 136.888
29 30.583 48.334 49.443 32.826 60.524 100.794 46.673 115.876
32 29.595 47.177 47.937 31.706 56.593 68.765 40.408 66.419
37 29.848 46.967 46.555 31.43 57.42 72.925 39.564 77.961
41 29.811 45.541 45.27 30.857 52.363 76.323 39.398 83.197
45 29.556 44.105 43.571 29.062 50.98 79.744 39.099 85.851
49 28.931 42.526 42.466 29.751 52.5 81.896 37.444 86.226
54 27.374 40.689 41.094 28.304 51.97 85.37 37.668 87.361
60 25.949 39.313 39.48 27.921 50.749 88.148 36.952 90.138
64 25.58 37.665 38.654 27.902 49.935 90.495 35.73 90.597
74 25.846 37.815 35.785 27.282 48.425 93.44 35.207 92.326
83 25.802 36.663 35.015 27.173 44.904 88.798 33.979 107.161
90 25.667 35.771 34.606 26.868 46.483 101.417 34.684 87.596
98 25.189 35.14 34.186 26.426 45.846 84.157 34.626 74.626
108 24.39 34.062 33.866 25.769 45.428 72.028 33.819 67.345
118 23.36 32.94 33.167 25.148 45.474 63.726 33.487 61.756
128 22.822 32.137 32.764 24.592 44.685 58.101 33.241 58.654
144 22.918 31.154 29.949 24.454 44.492 54.808 32.561 55.242
159 22.866 30.141 29.577 24.587 43.129 54.998 32.347 55.568
171 22.899 29.35 29.319 24.223 42.683 55.818 32.496 57.786
187 22.614 28.353 28.732 23.867 42.226 58.316 32.375 60.868
204 22.161 27.33 28.79 23.311 41.591 60.952 32.127 62.728
229 21.273 25.934 28.433 22.11 40.616 64.439 31.989 66.387
256 20.548 24.543 27.909 21.573 40.063 66.471 31.358 68.556

Macro Benchmarks

These results compare the performance of executing the following aggregation with varying intervals on the noaa dataset.

{
  "size": 0,
  "aggs": {
    "observations_per_month": {
      "date_histogram": {
        "field": "date",
        "interval": "month"
      }
    }
  }
}
| Interval | Num Buckets | Num Hits | Baseline (ms) | Vectorized Linear (ms) | Vectorized B-Tree (ms) |
|---|---|---|---|---|---|
| year | 3 | 33659481 | 828 | 562 | 563 |
| quarter | 12 | 33659481 | 1046 | 887 | 878 |
| month | 36 | 33659481 | 1170 | 1408 | 1241 |

These results compare the performance of executing the following aggregation with an increasing number of months.

{
  "size": 0,
  "query": {
    "range": {
      "date": {
        "gte": "2014-01-01",
        "lt": "2015-01-01"
      }
    }
  },
  "aggs": {
    "observations_per_month": {
      "date_histogram": {
        "field": "date",
        "interval": "month"
      }
    }
  }
}
| Num Months | Num Hits | Baseline (ms) | Vectorized Linear (ms) | Vectorized B-Tree (ms) |
|---|---|---|---|---|
| 12 | 11392659 | 438 | 336 | 296 |
| 14 | 13187069 | 510 | 393 | 337 |
| 16 | 15073450 | 586 | 456 | 383 |
| 18 | 17009688 | 648 | 584 | 485 |
| 20 | 18957437 | 735 | 642 | 537 |
| 22 | 20865502 | 804 | 727 | 592 |
| 24 | 22745413 | 874 | 789 | 629 |
| 26 | 24548269 | 927 | 877 | 717 |
| 28 | 26395510 | 960 | 985 | 798 |
| 30 | 28239883 | 1018 | 1085 | 896 |
| 32 | 30094407 | 1090 | 1163 | 960 |
| 34 | 31896193 | 1148 | 1303 | 1162 |
| 36 | 33659481 | 1199 | 1413 | 1255 |

Hot Methods Profile

To understand why the 36-month aggregation isn't seeing the expected speed-up, I took a profile of hot methods.

70.69% jdk/incubator/vector/Long512Vector$Long512Mask.firstTrue ()I Inlined
6.63% org/opensearch/common/util/ReorganizingLongHash.find (J)J Inlined
5.33% org/opensearch/common/BidirectionalLinearSearchArrayRounding.round (J)J Inlined
3.46% jdk/incubator/vector/LongVector$LongSpecies.broadcastBits (J)Ljdk/incubator/vector/LongVector; Inlined
2.81% org/apache/lucene/codecs/lucene90/Lucene90DocValuesProducer$5.longValue ()J Inlined
1.70% org/opensearch/search/aggregations/bucket/histogram/DateHistogramAggregator$1.collect (IJ)V Inlined
1.68% org/opensearch/common/util/BigArrays$LongArrayWrapper.get (J)J Inlined
1.50% org/opensearch/search/aggregations/LeafBucketCollector.collect (I)V JIT compiled

It turns out that VectorMask::firstTrue itself performs a linear scan on the result, which almost defeats the purpose of vectorization for this use-case. While there is still some improvement, it could have been much greater had Java intrinsified this method with bit-scan instructions, which are widely available on almost all platforms.

Conclusion

TODO

Follow-Up Questions

  • What makes the vectorized implementations slower as they approach, say, 36 buckets?
  • What makes the vectorized B-tree and Eytzinger variants slow-then-fast as the size increases? (Possibly branch misprediction.)
  • How does the performance look on a platform that supports 128-bit and 256-bit vector registers? What's the minimum vector size to use?
@ketanv3 ketanv3 added the enhancement and untriaged labels on Oct 5, 2023
@ketanv3
Contributor Author

ketanv3 commented Oct 5, 2023

@reta @backslasht @msfroh @kkmr

Tagging folks here since you all helped me review #9727. Would love to know your thoughts on this. :)

@backslasht
Contributor

Thanks @ketanv3 for the detailed analysis. This is impressive.

+1 on the follow-up questions you have already captured. In addition to that, it will be good to understand the saw-tooth pattern observed between buckets 30-50.

Also, what will be the max bucket size for the data aggregations? Would it be approx. 165 buckets (weekly aggregations for three years)?

@ketanv3
Contributor Author

ketanv3 commented Oct 6, 2023

@backslasht

I ran the vectorized B-tree implementation on a c5.metal EC2 instance to gather detailed information from the hardware counters. I had to run on a bare-metal instance since virtualized instances do not expose these counters.

Size Throughput CPI IPC L1-dcache-load-misses L1-dcache-loads L1-dcache-stores L1-icache-load-misses LLC-load-misses LLC-loads LLC-store-misses LLC-stores branch-misses branches cycles dTLB-load-misses dTLB-loads dTLB-store-misses dTLB-stores iTLB-load-misses iTLB-loads instructions
1 95.874 0.725 1.38 264.827 7.03066e+06 1.0058e+06 2264.76 0.848 49.926 0.774 18.979 88.926 8.03226e+06 3.92059e+07 23.772 7.03983e+06 15.436 1.00556e+06 13.055 41.248 5.40963e+07
2 95.768 0.726 1.377 272.975 7.01967e+06 1.00447e+06 1818.57 0.954 36.504 0.462 25.627 82.201 8.02847e+06 3.94148e+07 22.036 7.03899e+06 17.835 1.00811e+06 8.848 37.011 5.42907e+07
3 96.016 0.725 1.379 290.278 7.03927e+06 1.00608e+06 2049.83 0.461 38.973 0.743 22.95 83.625 8.04175e+06 3.93464e+07 27.647 7.02184e+06 15.771 1.00653e+06 4.534 28.435 5.42687e+07
4 96.05 0.727 1.376 262.629 7.04542e+06 1.00806e+06 1838.4 0.431 41.014 1.099 18.696 99.848 8.03455e+06 3.94105e+07 21.873 7.0194e+06 13.751 1.00594e+06 14.835 17.76 5.42351e+07
5 95.775 0.728 1.374 275.355 7.02671e+06 1.00656e+06 1938.57 0.089 24.69 0 20.518 98.147 8.0234e+06 3.9417e+07 23.632 7.01597e+06 15.377 1.00233e+06 5.334 26.627 5.4156e+07
6 95.366 0.728 1.375 312.206 6.99739e+06 1.00405e+06 1897.28 0.748 50.454 0 16.933 105.634 7.99518e+06 3.92564e+07 23.021 7.02015e+06 15.392 1.00646e+06 7.659 27.643 5.39589e+07
7 95.9 0.729 1.373 264.639 7.03131e+06 1.00701e+06 1970.05 0.045 41.097 0 25.9 88.431 8.02093e+06 3.9415e+07 26.302 6.99932e+06 13.649 1.0057e+06 6.609 22.387 5.41032e+07
8 95.592 0.726 1.378 354.794 7.03617e+06 1.0071e+06 1841.84 6.987 49.819 6.673 21.289 95.309 8.02671e+06 3.92976e+07 22.931 6.99024e+06 13.048 1.00322e+06 4.419 40.414 5.41546e+07
9 63.315 0.801 1.249 617.818 9.69542e+06 1.00697e+06 3116.1 13.697 95.539 9.822 36.224 119271 8.80853e+06 5.93837e+07 18.495 9.67298e+06 24.578 1.00426e+06 36.968 81.843 7.41659e+07
10 60.09 0.863 1.159 496.856 9.44618e+06 1.00902e+06 3138.78 13.99 70.236 10.736 28.669 208233 8.64376e+06 6.29426e+07 37.339 9.3751e+06 25.867 1.0066e+06 16.936 48.455 7.29565e+07
12 54.747 0.976 1.024 681.461 9.001e+06 1.01197e+06 3210.83 11.309 89.169 13.003 50.343 359792 8.32875e+06 6.85826e+07 54.513 9.01791e+06 27.621 1.00418e+06 17.511 82.056 7.02387e+07
14 51.744 1.059 0.945 641.664 8.75148e+06 1.00728e+06 3187.01 19.211 88.146 14.691 37.982 471548 8.17846e+06 7.30514e+07 42.447 8.72178e+06 21.306 1.00888e+06 14.388 65.765 6.9002e+07
16 51.066 1.087 0.92 660.417 8.52045e+06 1.0098e+06 3354.7 13.666 96.553 15.957 44.238 503159 8.00972e+06 7.35539e+07 42.226 8.52095e+06 22.021 1.00916e+06 16.768 82.72 6.76489e+07
18 65.718 0.76 1.316 541.493 9.86976e+06 1.00859e+06 2920.04 11.216 96.883 9.458 38.745 61133.8 8.92311e+06 5.71804e+07 33.278 9.88539e+06 22.366 1.00214e+06 9.957 82.761 7.52593e+07
20 61.527 0.83 1.205 638.088 9.57528e+06 1.00767e+06 2819.85 15.317 85.453 12.718 32.352 160634 8.72312e+06 6.12435e+07 33.907 9.61102e+06 26.898 1.01134e+06 15.155 89.561 7.37905e+07
22 58.819 0.878 1.139 570.445 9.4221e+06 1.01172e+06 4452.26 24.198 451.857 11.836 37.353 245222 8.76544e+06 6.40753e+07 39.586 9.36482e+06 96.721 1.01146e+06 143.224 637.41 7.30086e+07
24 56.819 0.934 1.071 1368.66 9.17573e+06 1.01792e+06 5162.59 23.236 262.022 15.901 45.994 306082 8.45094e+06 6.67882e+07 40.819 9.24182e+06 145.618 1.18714e+06 85.607 401.309 7.15256e+07
26 68.989 0.717 1.395 472.694 1.00595e+07 1.01266e+06 2755.95 4.945 63.907 13.882 27.971 138.945 9.05603e+06 5.48559e+07 82.444 1.00307e+07 26.523 1.0174e+06 9.372 41.605 7.65271e+07
29 63.786 0.797 1.255 569.826 9.72496e+06 1.00955e+06 2994.72 11.409 72.887 11.991 33.267 114304 8.83568e+06 5.94615e+07 32.126 9.72594e+06 22.841 1.00755e+06 11.186 44.094 7.46222e+07
32 47.699 1.087 0.92 714.38 9.51346e+06 1.01037e+06 4138.21 17.772 99.721 16.964 51.521 560013 1.08223e+07 7.92444e+07 46.105 9.45747e+06 30.069 1.00898e+06 17.114 111.539 7.28692e+07
37 52.013 0.996 1.004 789.009 9.89731e+06 1.0056e+06 5028.54 17.345 157.725 18.662 50.087 410603 1.09088e+07 7.28123e+07 45.531 9.82403e+06 29.723 1.0047e+06 20.529 59.665 7.31295e+07
41 51.942 0.953 1.049 756.281 1.00409e+07 1.01251e+06 3879.46 14.015 132.619 13.823 38.005 362017 1.13224e+07 7.235e+07 42.951 1.00065e+07 21.187 1.0088e+06 91.454 66.859 7.58788e+07
45 54.019 0.948 1.055 712.197 1.00141e+07 1.02068e+06 4207.79 4.967 122.984 12.523 44.649 336803 1.09437e+07 6.93715e+07 42.139 9.92021e+06 27.952 1.00049e+06 14.055 136.775 7.31762e+07
49 53.325 0.935 1.069 708.496 1.00252e+07 1.01467e+06 3464.04 3.799 111.778 14.019 59.313 320508 1.1254e+07 7.05644e+07 42.244 9.96187e+06 20.119 1.00056e+06 6.367 74.269 7.54311e+07
54 54.642 0.927 1.078 764.913 9.97003e+06 1.00252e+06 3697.65 16.893 101.646 14.465 49.426 316160 1.10379e+07 6.89289e+07 39.374 9.92867e+06 26.214 1.00641e+06 10.418 120.21 7.43245e+07
60 56.007 0.914 1.094 610.947 9.9634e+06 999136 3647.91 16.392 86.091 13.792 44.639 277776 1.09672e+07 6.71901e+07 36.073 1.01557e+07 28.654 1.01134e+06 11.09 91.725 7.34886e+07
64 55.917 0.922 1.085 561.867 9.88767e+06 1.00131e+06 3672.99 14.724 94.465 15.566 49.989 290158 1.08959e+07 6.7401e+07 39.322 1e+07 26.615 1.00849e+06 13.575 68.233 7.31207e+07
74 57.026 0.895 1.118 678.513 1.01321e+07 1.01404e+06 3952.69 14.267 107.625 11.639 37.519 255394 1.1081e+07 6.62103e+07 38.67 9.97083e+06 29.059 1.00383e+06 8.235 63.649 7.4013e+07
83 60.974 0.797 1.255 650.349 1.02673e+07 1.00644e+06 3286.43 11.421 101.078 12.848 45.659 104655 9.18171e+06 6.2039e+07 35.549 1.03907e+07 25.322 1.01205e+06 11.561 51.604 7.78523e+07
90 54.967 0.868 1.152 963.141 1.05857e+07 1.00523e+06 3618.53 11.224 86.491 16.55 45.701 203684 9.38337e+06 6.86505e+07 39.439 1.07365e+07 28.475 1.00752e+06 16.134 86.673 7.9099e+07
98 49.989 0.931 1.074 832.424 1.08614e+07 1.0153e+06 3523.59 13.99 127.626 15.132 57.132 301198 9.55783e+06 7.51262e+07 42.485 1.08269e+07 30.065 1.01126e+06 15.989 98.075 8.06989e+07
108 47.1 0.982 1.018 901.536 1.09803e+07 1.0065e+06 3840.3 1.364 89.457 2.455 41.849 379133 9.67322e+06 8.03336e+07 43.516 1.09913e+07 32.304 1.01087e+06 12.121 59.698 8.18058e+07
118 44.95 1.016 0.984 904.969 1.12242e+07 1.01413e+06 4055.78 1.46 106.601 0.381 46.634 430835 9.7877e+06 8.40423e+07 49.364 1.11956e+07 35.587 1.01163e+06 15.27 87.935 8.26873e+07
128 42.9 1.047 0.955 1120.26 1.13429e+07 1.01437e+06 6091.54 0.865 121.749 3.593 57.215 479070 9.92116e+06 8.77383e+07 53.057 1.12421e+07 25.614 1.00708e+06 14.171 174.607 8.37715e+07

I can see a strong inverse correlation between throughput and branch misses. As I understand it, the structure of the B-tree changes as the size increases, which influences the decisions of the branch predictor. I'm trying to understand which branch (in the code) this is, and whether it can be written in a different way. This explains the saw-tooth pattern.

[Chart: throughput vs. branch misses by array size]

@ketanv3
Contributor Author

ketanv3 commented Oct 6, 2023

@backslasht To your other question, we use array-based prepared rounding only up to 128 buckets. Beyond this, we fall back to recalculating the bucket key for each hit. Based on the performance improvement we get from this experiment, we can always bump up that number as long as 1) the extra cost of pre-computing and 2) the extra memory overhead are justified.

@reta
Collaborator

reta commented Oct 6, 2023

@ketanv3 I didn't find information on which JDK you ran the tests on; I assume JDK 21, right? I'm also very curious to figure out the answers to the follow-up questions, since this does not look like a win-win for all cases. Thanks a lot for prototyping this.

@ketanv3
Contributor Author

ketanv3 commented Oct 6, 2023

@reta I ran these tests with OpenJDK 20 since Gradle 8.3 didn't support JDK 21. I see that Gradle 8.4 is now released, so I'll rerun with that now.

@ketanv3
Contributor Author

ketanv3 commented Oct 7, 2023

@reta Similar micro-benchmark result with OpenJDK 21.

size binary linear hybrid eytzinger vectorized_linear vectorized_eytzinger unrolled_eytzinger vectorized_btree
1 264.155 314.583 285.588 344.837 293.08 204.618 248.012 234.205
2 90.367 92.47 94.972 93.844 293.11 204.634 88.674 234.299
3 72.51 76.006 78.543 73.988 293.157 204.616 106.866 234.249
4 65.311 71.456 71.435 61.342 293.259 204.704 87.509 234.306
5 55.447 66.972 68.807 56.205 293.187 204.628 75.331 234.297
6 51.201 65.289 66.085 53.027 293.253 204.636 68.285 234.294
7 48.63 63.118 64.175 49.977 293.275 204.65 62.634 234.114
8 47.625 61.518 60.958 46.761 153.942 204.635 61.292 234.309
9 46.036 59.52 59.696 46.261 119.686 157.208 56.754 116.75
10 44.718 59.598 58.459 43.521 105.577 124.967 53.308 102.533
12 41.204 56.654 56.457 40.939 83.855 92.816 53.47 85.318
14 38.447 55.718 55.667 40.612 77.115 79.142 51.446 74.932
16 38.559 54.369 54.305 37.986 72.058 73.075 50.882 72.337
18 37.62 54.781 53.246 37.867 67.659 77.339 49.831 124.946
20 36.45 53.226 52.39 37.211 63.468 81.475 50.133 110.335
22 35.508 52.717 51.493 36.466 62.943 84.838 50.378 99.119
24 34.41 51.535 50.479 35.762 59.68 87.753 50.72 90.537
26 33.037 51.033 50.02 34.948 59.139 95.244 48.706 136.978
29 31.095 47.927 48.967 33.812 56.896 98.329 46.572 71.582
32 30.092 47.727 47.843 32.645 53.907 68.113 41.509 67.662
37 30.705 46.304 46.165 32.396 51.849 73.27 40.632 79.408
41 30.849 44.819 44.635 31.814 52.377 76.203 39.843 84.674
45 30.41 43.703 43.33 31.263 49.787 79.194 39.139 85.958
49 29.676 42.095 42.289 30.61 49.073 81.653 38.212 87.302
54 28.149 41.308 40.827 30.066 48.137 85.013 37.632 89.192
60 26.717 39.526 39.246 29.335 47.123 88.212 36.542 91.502
64 26.403 38.52 38.585 28.671 46.605 90.634 35.995 89.81
74 26.75 37.511 35.906 28.448 44.81 93.836 34.993 94.132
83 26.637 36.464 35.198 27.955 43.921 89.272 34.343 106.736
90 26.565 36.06 34.736 27.716 42.11 101.629 34.705 86.848
98 26.162 35.304 34.547 27.327 40.056 86.459 34.398 74.333
108 25.103 34.318 33.992 26.749 38.727 73.583 34.053 67.306
118 24.202 33.416 33.624 26.295 37.978 65.097 33.498 62.358
128 23.505 32.508 32.962 25.248 39.133 59.451 33.253 58.775
144 23.713 30.988 30.597 25.386 38.315 55.393 32.404 54.823
159 23.709 29.907 29.997 25.329 37.235 55.506 32.346 55.405
171 23.768 29.475 29.869 25.111 35.359 56.231 32.496 58.734
187 23.502 28.282 29.383 24.798 34.509 58.977 32.342 62.248
204 23.06 27.414 29.351 24.312 33.801 61.712 32.124 65.354
229 22.066 26 28.569 23.941 33.207 65.18 32.042 69.407
256 21.242 24.473 28.39 23.208 32.735 68.831 31.357 70.795

@reta
Collaborator

reta commented Oct 9, 2023

@reta Similar micro-benchmark result with OpenJDK 21.

Thank you for confirming, @ketanv3

@backslasht
Contributor

@backslasht To your other question, we use array-based prepared rounding only up to 128 buckets. Beyond this, we fall back to recalculating the bucket key for each hit. Based on the performance improvement we get from this experiment, we can always bump up that number as long as 1) the extra cost of pre-computing and 2) the extra memory overhead are justified.

Makes sense!

@ketanv3
Contributor Author

ketanv3 commented Oct 10, 2023

@backslasht @reta

Here's an in-depth analysis of the saw-tooth pattern observed with small arrays. As mentioned in an earlier comment, the structure of the B-tree changes as the size grows, which influences the decisions of the branch predictor. This has a huge impact on performance!

Part 1

So which branch is being mispredicted? My first suspicion was the outermost loop's exit condition while (i < values.length) (permalink). Depending on the position of the key in the sorted array, the lookup in a B-tree runs for a different number of iterations (though they differ by at most one level). This is best explained with a picture; the smallest key takes 3 iterations while the largest key takes 2 iterations.

[Figure: lookups for different keys run a different number of loop iterations]

This happens because, in order to optimize for memory requirements, we only store the required number of elements, as opposed to storing a "perfect" tree padded with sentinel values. Here's how much the performance improves when I create such a B-tree. But this comes at the cost of storing $(1+B)^h - 1$ values ($B$: vector lanes, $h$: height of the tree), which can be wasteful; with $B = 8$ that's 8, 80, and 728 values for heights 1, 2, and 3.

[Chart: micro-benchmark results with a perfect (sentinel-padded) B-tree]

You can see that the performance becomes very stable for array sizes 10...40 and follows a predictable step-down pattern in response to the B-tree height. But there are a few more problems: 1) Why is an array of size 45 slower than other larger arrays? 2) Why does the step-down happen at 45? For a platform with 8 vector lanes, the step-down points should be 8, 80, 728, and so on.

Part 2

That's when I explored the only other branch, res = (j > i) ? j : res; (permalink). Unfortunately, the JVM didn't convert this into a conditional move instruction, making it another point of branch misprediction. With some bit-twiddling hacks, I wrote a branchless version as follows.

int o = v.compare(VectorOperators.GT, k).firstTrue(); // offset of first 'true'
int j = i + o;
int m = (res ^ j) & (~((o - 1) >> 31)); // bitmask to transform 'res' to 'j' if 'o' is non-zero.
res ^= m;

Here's how the performance looks now. This variant executes more instructions (so each lookup does slightly more work) but always takes the same amount of time for a given B-tree height. (The trick: when o is zero, the arithmetic shift of o - 1 yields all ones and its complement is zero, so the mask is zero and res is unchanged; when o is non-zero, the mask is res ^ j and res becomes j.) The step-down points also align with the theoretical performance.

[Chart: micro-benchmark results with the branchless variant]

Part 3

This analysis also reveals a flaw in my micro-benchmarks. The impact of branch misprediction is exaggerated when the distribution of keys is random. But for time-series data, the keys will likely be sorted, so consecutive hits will likely run the same number of loop iterations. It may be a non-issue after all! I need to run the tests with sorted keys.

@backslasht
Contributor

@ketanv3 - Thanks for the detailed analysis. Looking forward to the results with sorted keys.

@ketanv3
Contributor Author

ketanv3 commented Oct 11, 2023

@backslasht Here you go! Just FYI, the absolute values in these performance results aren't very meaningful; the trends, on the other hand, are.

Full results: https://gist.github.com/ketanv3/4aba76b694c066ea576575b2291405c1

Random Keys

I've updated the B-tree implementation to build a complete tree (i.e., padded with sentinel values only on the right side of the array). This reduces the saw-tooth pattern to some extent without the excessive memory use of a perfect tree. The remaining saw-tooth pattern is still due to branch misprediction. Interestingly, linear search is better than binary search even for sizes as high as 256. Vectorized B-tree search is better than linear/binary search for all array sizes greater than 1.

[Chart: micro-benchmark results with random keys]

Sorted Keys

As expected, branch misprediction is no longer a concern and the drop in performance is more organic as the array size increases. Even the regular binary search is faster, overtaking the linear search at array size 64. Vectorized B-tree search is better than linear/binary search for all array sizes greater than 1.

[Chart: micro-benchmark results with sorted keys]

This trend matches what I observed in the macro benchmark results. The only anomaly is a slight regression in the 36-month aggregation, which is still unexplained. Improvements could be substantial if VectorMask::firstTrue were intrinsified with bit-scan instructions, which could more than make up for this regression.

@reta
Collaborator

reta commented Oct 12, 2023

@ketanv3 thanks for such an in-depth analysis. Do you think:

  • the optional structure of btree becomes actually dependent on the number of vector lanes?
  • are those benchmarks biased towards some particular architectures (CPU & co)?

@backslasht
Contributor

Thanks @ketanv3 for the deep analysis and the great summary. This analysis is with a processor having 8 vector lanes?

@ketanv3
Contributor Author

ketanv3 commented Oct 18, 2023

@reta

the optional structure of btree becomes actually dependent on the number of vector lanes?

I'm not sure what you mean by optional structure. I chose the block size equal to the number of vector lanes so that I can compare all keys of a node in a single shot. A larger block size creates a tree of smaller height, but clearly that's not the only factor affecting performance.

are those benchmarks biased towards some particular architectures (CPU & co)?

Yes, this was tested on an Intel Xeon 8375C CPU which supports full 512-bit registers. During development, I tested this on an Intel Core i7-9750H CPU which supports 256-bit registers. Larger vectors saw better performance in general.

@backslasht Yes, that's correct.

@ketanv3
Contributor Author

ketanv3 commented Oct 18, 2023

I noticed that the noaa workload isn't ingested in increasing order of time (the workload is randomized). I reindexed the data into another index with "index sorting" on the timestamp field and compared again.
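For reference, index sorting can be enabled at index-creation time with settings along these lines (a minimal sketch; it assumes the timestamp field is the `date` field used in the aggregations above):

```json
{
  "settings": {
    "index.sort.field": "date",
    "index.sort.order": "asc"
  }
}
```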

Simple linear search remains faster for a small number of buckets:

| Interval | Num Buckets | Num Hits | Baseline (ms) | Vectorized B-Tree (ms) |
|---|---|---|---|---|
| year | 3 | 33659481 | 583 | 703 |
| quarter | 12 | 33659481 | 711 | 768 |
| month | 36 | 33659481 | 823 | 851 |

Vectorized B-tree was marginally better for a medium number of buckets:

| Num Months | Num Hits | Baseline (ms) | Vectorized B-Tree (ms) |
|---|---|---|---|
| 12 | 11392659 | 281 | 303 |
| 14 | 13187069 | 328 | 344 |
| 16 | 15073450 | 378 | 381 |
| 18 | 17009688 | 456 | 446 |
| 20 | 18957437 | 507 | 495 |
| 22 | 20865502 | 561 | 551 |
| 24 | 22745413 | 612 | 592 |
| 26 | 24548269 | 655 | 647 |
| 28 | 26395510 | 704 | 691 |
| 30 | 28239883 | 754 | 739 |
| 32 | 30094407 | 800 | 795 |
| 34 | 31896193 | 847 | 828 |
| 36 | 33659481 | 755 | 812 |

Top 5 hot methods:

70.17% jdk/incubator/vector/Long512Vector$Long512Mask.firstTrue ()I Inlined
6.63% org/opensearch/common/util/ReorganizingLongHash.find (J)J Inlined
5.12% org/opensearch/common/BidirectionalLinearSearchArrayRounding.round (J)J Inlined
3.61% org/opensearch/search/aggregations/bucket/histogram/DateHistogramAggregator$1.collect (IJ)V JIT compiled
3.20% org/apache/lucene/codecs/lucene90/Lucene90DocValuesProducer$5.longValue ()J Inlined

Conclusion

  • It's hard to beat the performance of a simple linear search for a small number of buckets. While some workloads show improvement, others may regress. The drop in performance is also unpredictable.
  • For a large number of buckets, the vectorized B-tree is better than regular binary search. We can consider using SIMD only when the number of buckets is greater than 64 and the platform supports a minimum number of vector lanes (see the sketch after this list).
  • The gap in performance is slimmer when the input is sorted.
  • Improvements can be substantial if VectorMask::firstTrue is intrinsified with bit-scan instructions, possibly making the vectorized B-tree faster than the linear search approach.
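A minimal sketch of what that gating check could look like (illustrative only, not the actual OpenSearch API; the >64 and >=4-lane thresholds are assumptions based on the observations above and would need tuning):

```java
import jdk.incubator.vector.LongVector;

// Hypothetical gating check: use the vectorized B-tree rounding only when the array is
// large enough and the platform offers enough vector lanes; otherwise keep the existing
// linear/binary approach.
static boolean useVectorizedRounding(int numBuckets) {
    int lanes = LongVector.SPECIES_PREFERRED.length();
    return numBuckets > 64 && lanes >= 4;
}
```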

@reta @backslasht Please let me know if you have any other experiments in mind.

I feel that the results aren't a win-win; the improvements in some places aren't substantial enough to make up for the regressions in other places, and that too with much higher code complexity.

@reta
Collaborator

reta commented Oct 18, 2023

I feel that the results aren't a win-win; the improvements in some places aren't substantial enough to make up for the regressions in other places, and that too with much higher code complexity.

I agree with you. Maybe you will find the relevant (to vectorization) discussions on Apache Lucene [1], [2], [3], [4] useful in explaining some of the results. I don't think we should throw away this work; what do you think about introducing a family of experimental feature flags, opensearch.experimental.simd.*.enabled, to explore (and give the community an opportunity to improve) the implementation of SIMD-related optimizations?

[1] apache/lucene#12621
[2] apache/lucene#12694
[3] apache/lucene#12681
[4] apache/lucene#12632

@backslasht
Contributor

+1, I agree with @reta. We can still contribute this code behind the experimental flag for a number of buckets greater than 64. It will also stand as an example for folks trying to use SIMD in the rest of the code base.

@ketanv3
Contributor Author

ketanv3 commented Oct 20, 2023

@reta @backslasht I like these suggestions, thank you! I'll raise a PR soon.

@reta reta added the v3.0.0 and v2.12.0 labels and removed the untriaged label on Jan 18, 2024