[Contrib][Sort] Faster Top-K Implementation #13599
Merged
Summary:
This is a simple rewrite of the hand-coded top-k function used for CPU targets.
The old implementation sorted each axis and then took the k largest elements.
The new implementation makes a single pass over each axis, keeping a min-heap that stores the top-k elements seen up to that point.
If n is the size of the array and we want the top k elements, the old implementation runs in O(n log n) time and needs O(n) additional memory to store the sorted array. The new implementation runs in O(n log k) time, is probably closer to O(n / k * log k) amortized in many practical scenarios, and requires only O(k) additional memory. Note that n >> k most of the time.
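The two strategies described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the kernel from this PR: the function names `topk_sort` and `topk_heap` are hypothetical, it handles a single 1-D axis, and it returns values only (no indices).

```python
import heapq


def topk_sort(values, k):
    """Old strategy as described: sort the whole axis, keep the k largest.

    O(n log n) time, O(n) extra memory for the sorted copy.
    """
    return sorted(values, reverse=True)[:k]


def topk_heap(values, k):
    """New strategy as described: one pass with a min-heap of at most k items.

    O(n log k) time in the worst case, O(k) extra memory.
    """
    if k <= 0:
        return []
    heap = []  # heap[0] is always the smallest of the current top-k
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, v)
        elif v > heap[0]:
            # v beats the weakest element kept so far, so swap it in.
            heapq.heapreplace(heap, v)
        # Otherwise v cannot be in the top-k; skip it with one comparison.
    return sorted(heap, reverse=True)


data = [5, 1, 9, 3, 7, 9, 2]
assert topk_heap(data, 3) == topk_sort(data, 3) == [9, 9, 7]
```

Most elements fail the `v > heap[0]` check when n >> k, so they cost a single comparison rather than a heap operation, which is consistent with the large speedups reported for small k.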
In practice, the new kernel gave a 20x speedup over the existing one. On a Xeon Platinum 8370C CPU @ 2.80GHz, for input shape [1, 3050] with k = 15, latency went from 200 us to ~10 us. There is probably room to shave off a few more microseconds, but I have determined that it is not worth the effort.
This change, however, is probably worth committing.
I ran benchmarks on my M1 Mac and on a Xeon Platinum 8370C CPU @ 2.80GHz with 8 cores.
Data:
All data is collected along axis=1.
M1:
Xeon:
As can be seen, there are significant speedups across almost all conditions, with one pathological exception (k ~ axis_size). Surprisingly, on the M1 even this case shows a speedup.
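One plausible reading of the k ~ axis_size case, which the PR does not spell out, is that the heap then has to hold nearly every element, so almost every input triggers a heap operation and the single-pass advantage disappears. The sketch below counts heap modifications for a small and a large k. The helper `count_heap_updates` and the random input are assumptions made for illustration; this is not the benchmark used for the data above.

```python
import heapq
import random


def count_heap_updates(values, k):
    """Count pushes and replacements done by a size-k min-heap top-k pass."""
    if k <= 0:
        return 0
    heap = []
    updates = 0
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, v)
            updates += 1
        elif v > heap[0]:
            heapq.heapreplace(heap, v)
            updates += 1
    return updates


rng = random.Random(0)  # fixed seed so the run is repeatable
axis = [rng.random() for _ in range(3050)]

small_k = count_heap_updates(axis, 15)    # k << axis_size
large_k = count_heap_updates(axis, 3050)  # k == axis_size

# With k == axis_size every element is pushed onto the heap.
assert large_k == len(axis)
# With a small k, far fewer elements ever touch the heap.
assert small_k < large_k
```

When k equals the axis size the pass degenerates into building a full heap and then sorting it, so no gain over a plain sort should be expected in that regime.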
Other Changes: