Is your feature request related to a problem? Please describe.
Currently, the min_by and max_by aggregations pack the ordering and value columns into a struct column and use a min/max aggregation to find the result. #11371 (comment) shows that the performance is not ideal because cuDF does not handle min/max on a struct of two values well.
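To make the current approach concrete, here is a minimal CPU-side sketch in plain Python, not the actual plugin or cuDF code. It assumes the struct is laid out as (ordering, value) and that struct comparison is lexicographic; the function name `min_by_via_struct` and the sample data are hypothetical, and null handling is ignored.

```python
def min_by_via_struct(keys, values, orderings):
    """Emulate group-wise min_by by taking the min over (ordering, value) pairs."""
    best = {}
    for k, v, o in zip(keys, values, orderings):
        packed = (o, v)  # stand-in for one row of the struct column
        # Tuple comparison is lexicographic, so both fields may be compared.
        if k not in best or packed < best[k]:
            best[k] = packed
    # Unpack the value field from the winning struct of each group.
    return {k: packed[1] for k, packed in best.items()}


struct_result = min_by_via_struct(
    keys=["a", "a", "b", "b"],
    values=[10, 20, 30, 40],
    orderings=[3, 1, 2, 5],
)
```

Note that every comparison here operates on the full pair, which is the part the linked comment identifies as slow on the GPU.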
Describe the solution you'd like
If we instead used argmin followed by a gather, we would see a huge performance improvement for many fixed-width types. If the types are not fixed width, we would likely not see much improvement, though there might be a very small one simply because there would be less data to look at.
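A rough sketch of the argmin + gather idea, written in plain Python for illustration only. The helper names `group_argmin`, `gather`, and `min_by_via_argmin` are hypothetical and do not correspond to cuDF APIs; ties keep the first row seen, which is an assumption of this sketch, and nulls are ignored.

```python
def group_argmin(keys, orderings):
    """Return, per group key, the row index of the smallest ordering value."""
    arg = {}
    for i, (k, o) in enumerate(zip(keys, orderings)):
        # Only the ordering column is compared; the value column is untouched.
        if k not in arg or o < orderings[arg[k]]:
            arg[k] = i
    return arg


def gather(column, indices):
    """Pick the rows of `column` at the given indices."""
    return [column[i] for i in indices]


def min_by_via_argmin(keys, values, orderings):
    arg = group_argmin(keys, orderings)
    group_keys = list(arg)
    picked = gather(values, [arg[k] for k in group_keys])
    return dict(zip(group_keys, picked))


argmin_result = min_by_via_argmin(
    keys=["a", "a", "b", "b"],
    values=[10, 20, 30, 40],
    orderings=[3, 1, 2, 5],
)
```

The reduction step only reads the ordering column and produces one index per group, and the value column is read once during the gather. That is where the expected saving for fixed-width types would come from.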
Additional context
A closed cuDF PR that uses argmin + gather: rapidsai/cudf#16163. It still uses sort-based aggregation, however, because we have to pass a struct column to the cuDF aggregation framework and hash-based aggregation is not yet supported for argmin/argmax; see rapidsai/cudf#14412 (comment).
min_by and max_by are Spark-specific aggregation functions, so we should add the kernel support in JNI if needed. The open problem is how to integrate a JNI implementation with the cuDF aggregation framework. Related to rapidsai/cudf#16633.