[FEA][Follow on] Improve performance of min_by and max_by #11412

Open
thirtiseven opened this issue Aug 30, 2024 · 0 comments
Labels
feature request New feature or request performance A performance related task/issue

Is your feature request related to a problem? Please describe.
Currently the min_by and max_by aggregations pack the ordering and value into a struct column and use a min/max aggregation to find the result. #11371 (comment) shows that the performance is not ideal because cuDF does not deal well with min/max on a struct of two values.
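
For readers unfamiliar with the current approach, here is a rough CPU-side sketch in plain Python, not the plugin's or cuDF's actual code. The function name `min_by_struct` is hypothetical, tuples stand in for struct rows, and nulls are assumed absent to keep it short.

```python
def min_by_struct(keys, values, orderings):
    """Sketch of min_by via a min over packed (ordering, value) pairs."""
    best = {}
    for key, value, ordering in zip(keys, values, orderings):
        # Pack ordering first so the comparison is driven by the ordering,
        # the way a struct min compares fields left to right.
        packed = (ordering, value)
        if key not in best or packed < best[key]:
            best[key] = packed
    # Unpack the value field from the winning struct of each group.
    return {key: packed[1] for key, packed in best.items()}


result = min_by_struct(
    keys=["a", "b", "a", "b"],
    values=[10, 20, 30, 40],
    orderings=[3, 1, 2, 5],
)
```

Every comparison here touches the whole pair, which is a loose analogy for why a min/max over a two-field struct is more expensive than a min/max over a single fixed-width column.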

Describe the solution you'd like
If we switched to an argmin followed by a gather, we would see a huge performance improvement for many fixed-width types. If the types are not fixed width, we would likely not see much improvement, though we might see a small one simply because there would be less data to look at.
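
A minimal sketch of the argmin + gather idea, again in plain Python rather than cuDF. The name `min_by_argmin_gather` is hypothetical, lists stand in for GPU columns, and the null handling (rows with a null ordering are skipped, and a group with only null orderings yields None) is an assumption for illustration.

```python
def min_by_argmin_gather(keys, values, orderings):
    """Sketch of min_by as a per-group argmin followed by a gather."""
    # Step 1 (argmin): per group, track the row index of the smallest ordering.
    best_row = {}
    for row, (key, ordering) in enumerate(zip(keys, orderings)):
        best_row.setdefault(key, None)
        if ordering is None:
            continue  # assumption: null orderings never win
        current = best_row[key]
        if current is None or ordering < orderings[current]:
            best_row[key] = row
    # Step 2 (gather): fetch the value column at the winning row indices.
    return {
        key: (None if row is None else values[row])
        for key, row in best_row.items()
    }


result = min_by_argmin_gather(
    keys=["a", "b", "a", "b", "c"],
    values=[10, 20, 30, 40, 50],
    orderings=[3, 1, 2, 5, None],
)
```

Only the ordering column is compared in step 1, and the value column is read once in step 2, which is where the savings for fixed-width types would be expected to come from. In this sketch ties keep the first row seen.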

Additional context

A closed cuDF PR, rapidsai/cudf#16163, uses argmin + gather, but it still relies on sort-based aggregation because we have to pass a struct column to the cuDF aggregation framework, and hash-based aggregation is not yet supported for argmin/argmax; see rapidsai/cudf#14412 (comment).

min_by and max_by are Spark-specific aggregate functions, so it would be better to add the kernel support in JNI if needed. The open problem is how to integrate a JNI implementation with the cuDF aggregation framework. Related to rapidsai/cudf#16633.

@thirtiseven thirtiseven added ? - Needs Triage Need team to review and classify feature request New feature or request performance A performance related task/issue labels Aug 30, 2024
@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Sep 17, 2024