-
Notifications
You must be signed in to change notification settings - Fork 930
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor dask_cudf groupby to use apply_concat_apply #11571
Conversation
@charlesbluca - This PR will change the HLG layer(s) produced for a groupby aggregations. Do you expect this to break anything here or in dask-sql? |
I wouldn't expect it to change anything here, I might need to make some changes to #10853 but don't expect those to be too extensive. I can pull this PR in with dask-sql to see if anything ends up breaking there |
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## branch-22.10 #11571 +/- ##
===============================================
Coverage ? 86.39%
===============================================
Files ? 145
Lines ? 22963
Branches ? 0
===============================================
Hits ? 19840
Misses ? 3123
Partials ? 0 Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
Was able to check this PR against the dask-sql groupby tests, looks like the only test that's failing is |
Thank you for taking the time to test this with dask-sql @charlesbluca - That is super helpful! |
Update: I used the proposed This PR
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor nits
@gpucibot merge |
Dask-cudf currently maintains a specialized groupby-aggregation code, this code is faster for cudf-based data than the upstream (
dask.dataframe
) code path. However, the custom implementation does not take advantage of Dask'sapply_concat_apply
function, even though the tree-reduction aspect of the algorithm is the same.This PR refactors the dask_cudf groupby-aggregation code to use
apply_concat_apply
. This reduces the amount of code we will need to maintain in cudf, and should improve graph optimizations (like fusion).Checklist