-
Notifications
You must be signed in to change notification settings - Fork 932
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor collect_set
to use cudf::distinct
and cudf::lists::distinct
#11228
Refactor collect_set
to use cudf::distinct
and cudf::lists::distinct
#11228
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Java approval
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor nitpicks. Looks good, otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
@gpucibot merge |
This PR completely removes `cudf::lists::drop_list_duplicates`. It is replaced by the new API `cudf::list::distinct` which has a simpler implementation but better performance. The replacements for internal cudf usage have all been merged before thus there is no side effect or breaking for the existing APIs in this work. Closes #11114, #11093, #11053, #11034, and closes #9257. Depends on: * #11228 * #11149 * #11234 * #11233 Authors: - Nghia Truong (https://github.com/ttnghia) Approvers: - Jordan Jacobelli (https://github.com/Ethyling) - Robert Maynard (https://github.com/robertmaynard) - Vukasin Milovanovic (https://github.com/vuule) - Bradley Dice (https://github.com/bdice) URL: #11236
The current groupby/reducttion
collect_set
aggregations uselists::drop_list_duplicates
to generate set(s) of distinct elements. This PR changes that to usecudf::distinct
andcudf::lists::distinct
instead, which have some advantages including:O(n)
instead ofO(nlogn)
) by internally using hash table instead of segmented sort.This also enables nested types support for
collect_set
in spark-rapids (issue NVIDIA/spark-rapids#5508).The changes in Java code here are only to fix unit tests. Previously, they were implemented with the assumption that the
collect_set
results are sorted, now they fail when the results are no longer sorted.