[SIMD] [POC] Improve performance of rounding dates in date_histogram aggregation #10392
Comments
@reta @backslasht @msfroh @kkmr Tagging folks here since you all helped me review #9727. Would love to know your thoughts on this. :)
Thanks @ketanv3 for the detailed analysis. This is impressive. +1 on the follow-up questions you have already captured. In addition to that, it will be good to understand the saw-tooth pattern observed between buckets 30-50. Also, what will be the max bucket size for the date aggregations? Would it be approximately 165 buckets (weekly aggregations for three years)?
@backslasht To your other question, we use array-based prepared rounding only up to 128 buckets. Beyond this, we fall back to recalculating the bucket key for each hit. Based on the performance improvement we get from this experiment, we can always bump up that number as long as 1) the extra cost of pre-computing and 2) the extra memory overhead are justified.
@ketanv3 I didn't find information on which JDK you ran the tests on. I assume JDK 21, right? I'm also very curious to figure out the answers to the follow-up questions, since this does not look like a win-win for all cases. Thanks a lot for prototyping this.
@reta I ran these tests with OpenJDK 20 since Gradle 8.3 didn't support JDK 21. I see that Gradle 8.4 is now released, so I'll rerun with that now.
@reta Similar micro-benchmark results with OpenJDK 21.
Makes sense!
Here's some in-depth analysis of the saw-tooth pattern observed with small arrays. As mentioned in an earlier comment, the structure of the B-tree changes as the size grows, which influences the decisions of the branch predictor. This has a huge impact on performance!

Part 1

So which branch is being mispredicted? My first suspicion was the outermost loop's breaking condition. This happens because, in order to optimize for memory requirements, we only store the required number of elements as opposed to storing a "perfect" tree padded with sentinel values. Here's how much the performance improves when I create such a B-tree, though it comes at the cost of using more memory.

You can see that the performance becomes very stable for array sizes 10...40 and follows a predictable step-down pattern in response to the B-tree height. But there are a few more problems: 1) Why is an array of size 45 slower than other, larger arrays? 2) Why does the step-down happen at 45? For a platform with 8 vector lanes, the step-down points should be 8, 80, 728, and so on.

Part 2

That's when I explored the only other branch, which I replaced with a branchless update:

```java
int o = v.compare(VectorOperators.GT, k).firstTrue(); // offset of first 'true'
int j = i + o;
int m = (res ^ j) & (~((o - 1) >> 31)); // bitmask to transform 'res' to 'j' if 'o' is non-zero.
res ^= m;
```

Here's how the performance looks now. This variant uses more instructions (slower overall) but always executes in the same amount of time for a given B-tree height. The step-down points also align with the theoretical performance.

Part 3

This analysis also reveals a flaw in my micro-benchmarks. The impact of branch misprediction is exaggerated when the distribution of keys is random. But for time-series data, the keys will likely be sorted, so consecutive hits will likely run the same number of loop iterations. It may be a non-issue after all! I need to rerun the tests with sorted keys.
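To make the bitmask trick in the Part 2 snippet above easier to follow, here is a small standalone illustration (a hypothetical helper, not taken from the POC): when `o == 0` the mask is all zeros and `res` is left unchanged; when `o > 0` the mask is all ones and the XOR rewrites `res` to `j`.

```java
public final class BranchlessSelect {
    // Hypothetical standalone helper illustrating the bitmask trick:
    // keep 'res' when o == 0, replace it with 'j' when o > 0.
    static int select(int res, int j, int o) {
        int mask = ~((o - 1) >> 31);     // 0 when o == 0, -1 (all ones) when o > 0
        return res ^ ((res ^ j) & mask); // res when mask == 0, j when mask == -1
    }

    public static void main(String[] args) {
        System.out.println(select(5, 9, 0)); // 5: o == 0, keep res
        System.out.println(select(5, 9, 3)); // 9: o > 0, take j
    }
}
```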
@ketanv3 - Thanks for the detailed analysis. Looking forward to the results with sorted keys.
@backslasht Here you go! Just FYI, the absolute values in these performance results can be a bit meaningless; the trends, on the other hand, are more meaningful. Full results: https://gist.github.com/ketanv3/4aba76b694c066ea576575b2291405c1

Random Keys

I've updated the B-tree implementation to build a complete tree (i.e., padded with sentinel values only on the right side of the array). This reduces the saw-tooth pattern to some extent without using the excessive memory of a perfect tree. The remaining saw-tooth pattern is still present due to branch misprediction. Interestingly, linear search is better than binary search even for sizes as high as 256. Vectorized B-tree search is better than linear/binary search for all array sizes greater than 1.

Sorted Keys

As expected, branch misprediction is no longer a concern and the drop in performance is more organic as the array size increases. Even the regular binary search is faster, overtaking the linear search at array size 64. Vectorized B-tree search is better than linear/binary search for all array sizes greater than 1.

This trend matches what I observed in the macro benchmark results, the only anomaly being the slight regression in the 36-month aggregation, which is still unexplained. The improvements can be substantial if VectorMask::firstTrue is intrinsified with bit-scan instructions, which could more than make up for this regression.
@ketanv3 thanks for such an in-depth analysis, do you think:
Thanks @ketanv3 for the deep analysis and the great summary. Is this analysis with a processor having 8 vector lanes?
I'm not sure what you mean by optional structure. I chose the block size equal to the number of vector lanes so that I can compare all keys of a node in a single shot. A larger block size creates a tree of smaller height, but clearly that's not the only factor to consider for performance.
Yes, this was tested on an Intel Xeon 8375C CPU, which supports full 512-bit registers. During development, I tested this on an Intel Core i7-9750H CPU, which supports 256-bit registers. Larger vectors saw better performance in general. @backslasht Yes, that's correct.
I noticed that simple linear search remains faster for a small number of buckets:
Vectorized B-tree search was marginally better for a medium number of buckets:
Top 5 hot methods:
Conclusion
@reta @backslasht Please let me know if you have any other experiments in mind. I feel that the results aren't a win-win; the improvements in some places aren't substantial enough to make up for the regressions in other places, and that comes at the cost of much higher code complexity.
I agree with you. Maybe you will find the discussions on Apache Lucene relevant to vectorization [1], [2], [3], [4] useful in explaining some of the results. I don't think we should throw away this work. What do you think about introducing a family of experimental feature flags?

[1] apache/lucene#12621
+1, I agree with @reta, we can still contribute this code behind an experimental flag for bucket counts greater than 64. It will also stand as an example for folks trying to use SIMD in the rest of the code base.
@reta @backslasht I like these suggestions, thank you! I'll raise a PR soon.
Problem Statement
In the date histogram aggregation, the timestamp value from each hit must be rounded down to the nearest interval (year, quarter, month, week, day, etc.) as defined in the search request. This rounded-down timestamp serves as the bucket key to aggregate results.
#9727 improved performance by using linear search when the array size is small (<= 64 elements). This POC evaluates if performance can be improved further using SIMD (related to #9423).
1 — Baseline
The current approach uses linear search when the search space is small, or binary search otherwise. This is because linear search is far superior for small arrays: it avoids the penalty of branch mispredictions and pipeline stalls, and it accesses memory sequentially. To simplify the remaining comparison, I'll assume that each value is equally likely to be picked (uniform distribution), so we need not consider bidirectional linear search in this POC.
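For reference, a minimal scalar sketch of the two baseline strategies (hypothetical helper names, not the actual OpenSearch code): given an ascending array of precomputed bucket boundaries, return the greatest boundary that is less than or equal to the key.

```java
import java.util.Arrays;

public final class BaselineRounding {
    // 'values' holds precomputed bucket boundaries in ascending order within
    // [0, size), and the key is assumed to be >= values[0].
    static long roundLinear(long[] values, int size, long key) {
        int i = 0;
        // Walk forward while the next boundary is still <= key.
        while (i < size - 1 && values[i + 1] <= key) {
            i++;
        }
        return values[i];
    }

    static long roundBinary(long[] values, int size, long key) {
        int idx = Arrays.binarySearch(values, 0, size, key);
        // On a miss, binarySearch returns -(insertionPoint) - 1; the round-down
        // point is the boundary just before the insertion point.
        return idx >= 0 ? values[idx] : values[-idx - 2];
    }
}
```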
2 — Vectorized Linear Search
Assume a platform that supports AVX2 instructions (i.e., 256-bit wide registers). We perform a linear search comparing 4 longs at once, taking a stride of 4 elements with each iteration. We pad the array with $\infty$ to make it an exact multiple of the number of vector lanes ($B$). The search stops once we find the first value greater than the desired key.
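Here is a sketch of what this could look like with the Java Vector API (my illustration, not the exact POC code; it assumes the array is already padded with Long.MAX_VALUE as the $\infty$ sentinel so its length is a multiple of the lane count, with at least one sentinel at the end, and that the key is >= the first element).

```java
import jdk.incubator.vector.LongVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public final class VectorizedLinearRounding {
    // SPECIES_PREFERRED picks the widest supported shape (4 lanes on AVX2,
    // 8 lanes on AVX-512) rather than hard-coding 4 lanes.
    static final VectorSpecies<Long> SPECIES = LongVector.SPECIES_PREFERRED;

    static long roundVectorizedLinear(long[] values, long key) {
        for (int i = 0; i < values.length; i += SPECIES.length()) {
            LongVector v = LongVector.fromArray(SPECIES, values, i);
            // Lane index of the first boundary strictly greater than the key,
            // or SPECIES.length() if no lane matches.
            int firstGreater = v.compare(VectorOperators.GT, key).firstTrue();
            if (firstGreater < SPECIES.length()) {
                return values[i + firstGreater - 1]; // boundary just before it
            }
        }
        throw new IllegalArgumentException("key exceeds the padded sentinel");
    }
}
```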
3 — Vectorized B-Tree Search
Here, the idea is to improve binary search using vectorization. Since binary search compares values at non-contiguous memory locations, which are non-trivial to load into a vector register, the first step is to fix the memory layout.
To find the round-down point in this layout, we start our search from the $0$-th index, comparing $B$ values at once. The position of the first set bit in the vector mask tells us which sub-tree (between $0 \ldots B$) to visit to converge to the round-down point.
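A hedged sketch of how such a search might be implemented with the Java Vector API. The layout below is my assumption (a B-way tree stored breadth-first in a flat array, padded with Long.MAX_VALUE); the actual POC layout lives in the Rounder.java link below.

```java
import jdk.incubator.vector.LongVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public final class VectorizedBTreeRounding {
    static final VectorSpecies<Long> SPECIES = LongVector.SPECIES_PREFERRED;
    static final int B = SPECIES.length(); // keys per node == number of vector lanes

    // Assumed layout: node n occupies indices [n * B, n * B + B); its (B + 1)
    // children are nodes n * (B + 1) + 1 + c for c in [0, B]. Unused key slots
    // are padded with Long.MAX_VALUE. The key is assumed to be >= the smallest
    // boundary stored in the tree.
    static long roundBTree(long[] btree, int nodeCount, long key) {
        long result = Long.MIN_VALUE;
        int node = 0;
        while (node < nodeCount) {
            LongVector v = LongVector.fromArray(SPECIES, btree, node * B);
            // Offset of the first key strictly greater than 'key' (B if none).
            int c = v.compare(VectorOperators.GT, key).firstTrue();
            if (c > 0) {
                result = btree[node * B + c - 1]; // largest key <= 'key' in this node
            }
            node = node * (B + 1) + 1 + c; // descend into child sub-tree c
        }
        return result;
    }
}
```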
POC code
Code for the above implementations (and many others) can be found here:
https://github.com/ketanv3/opensearch-simd-playground/blob/main/src/main/java/org/opensearch/Rounder.java
Results
Results are from a c6i.2xlarge EC2 instance (Intel Xeon 8375C CPU @ 2.90 GHz), which supports full 512-bit registers (8 vector lanes for long).
Micro Benchmarks
These results compare the performance of looking up round-down points for 1 million random keys picked from a uniform distribution. Since pre-computed array rounding is only used for small arrays (code), I've limited the benchmarks to array sizes up to 256.
Code for the micro benchmark harness can be found here:
https://github.com/ketanv3/opensearch-simd-playground/blob/main/src/jmh/java/org/opensearch/RounderBenchmark.java
Macro Benchmarks
These results compare the performance of executing the following aggregation with varying intervals using the noaa dataset.
These results compare the performance of executing the following aggregation with an increasing number of months.
Hot Methods Profile
To understand why the 36-month aggregation isn't seeing the expected speed-up, I took a profile of hot methods.
It turns out that VectorMask::firstTrue itself performs a linear scan on the result, which almost defeats the purpose of vectorization for this use case. While there is still some improvement, it could have been much greater if Java had intrinsified this with bit-scan instructions, which are widely available on almost all platforms.
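One possible mitigation, not part of this POC: extract the mask as packed bits with VectorMask::toLong and scan it with Long.numberOfTrailingZeros, which the JIT does intrinsify to a bit-scan instruction on x86. Whether this ends up faster than firstTrue on a given platform and JDK would need benchmarking.

```java
import jdk.incubator.vector.VectorMask;

public final class FirstTrueWorkaround {
    // Hypothetical alternative to VectorMask::firstTrue: pack the mask into a
    // long and scan it with Long.numberOfTrailingZeros (TZCNT/BSF on x86).
    static int firstTrueViaBitScan(VectorMask<Long> mask) {
        long bits = mask.toLong(); // bit i is set iff lane i is set
        return bits == 0
            ? mask.vectorSpecies().length() // mirror firstTrue(): lane count if none set
            : Long.numberOfTrailingZeros(bits);
    }
}
```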
Conclusion
TODO
Follow-Up Questions