[FEA][Follow on] Improve performance of min_by and max_by #11412

thirtiseven · 2024-08-30T04:02:11Z

Is your feature request related to a problem? Please describe.
Currently, the min_by and max_by aggregations pack the ordering and value columns into a struct column and use a min/max aggregation to find the result. #11371 (comment) shows that the performance is not ideal because cuDF does not deal well with min/max on a struct of two values.
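
For readers unfamiliar with the struct trick, here is a minimal CPU-side sketch in plain Python, not the actual plugin or cuDF code. The function name `min_by_struct` and the sample data are hypothetical. It packs each row into an (ordering, value) pair and takes the minimum pair per group, so every comparison has to look at a two-field struct. Tie-breaking and null handling are ignored here.

```python
from collections import defaultdict

def min_by_struct(keys, values, orderings):
    """Grouped min_by: pack (ordering, value) pairs, then take the min pair per group."""
    groups = defaultdict(list)
    for k, v, o in zip(keys, values, orderings):
        # Ordering goes first so it dominates the lexicographic comparison.
        groups[k].append((o, v))
    # min() compares whole pairs; the value is then unpacked from the winner.
    return {k: min(pairs)[1] for k, pairs in groups.items()}

keys = ["a", "a", "b", "b"]
values = [10, 20, 30, 40]
orderings = [3, 1, 2, 5]
result = min_by_struct(keys, values, orderings)
```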

Describe the solution you'd like
If we went with argmin followed by a gather, we would see a huge performance improvement for many fixed-width types. If they are not fixed width, we would likely not see much improvement, though there might be a very small gain simply because there would be less data to look at.
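
A rough sketch of the argmin + gather idea in plain Python, under the assumption that ties and nulls can be ignored. The function name `min_by_argmin_gather` and the sample data are hypothetical, and this is not how a GPU kernel would be written. It only illustrates that the comparison touches the ordering column alone, and the value column is read once at the end.

```python
def min_by_argmin_gather(keys, values, orderings):
    """Grouped min_by: find the row index of the smallest ordering per group, then gather."""
    best = {}  # group key -> row index of the smallest ordering seen so far
    for i, (k, o) in enumerate(zip(keys, orderings)):
        # Only the ordering column is compared; values are never inspected here.
        if k not in best or o < orderings[best[k]]:
            best[k] = i
    # Gather step: look up the value column at the winning row indices.
    return {k: values[i] for k, i in best.items()}

keys = ["a", "a", "b", "b"]
values = ["w", "x", "y", "z"]
orderings = [3, 1, 2, 5]
gathered = min_by_argmin_gather(keys, values, orderings)
```

Because the winning row is tracked as an index, the value column can be of any type without affecting the comparison.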

Additional context

There is a closed cuDF PR that uses argmin + gather: rapidsai/cudf#16163. However, it still uses sort-based aggregation, because we have to pass a struct column to the cuDF aggregation framework and hash-based aggregation is not yet supported for argmin/argmax; see rapidsai/cudf#14412 (comment).

min_by and max_by are Spark-specific aggregate functions, so it would be better to add the kernel support in JNI if needed. The open problem is how to integrate a JNI implementation with the cuDF aggregation framework. Related to rapidsai/cudf#16633.

thirtiseven added the "? - Needs Triage", "feature request", and "performance" labels Aug 30, 2024

thirtiseven mentioned this issue Aug 30, 2024

Support MinBy and MaxBy for non-float ordering #11371

Merged

mattahrens removed the "? - Needs Triage" label Sep 17, 2024
