[FEA] Use `cudf::distinct` in Python and Java when it has support for `duplicate_keep_option` #11089
Comments
This adds `duplicate_keep_option` to `cudf::distinct`, letting callers specify a `keep` option that selects which of the duplicate elements to keep. It paves the way for many drop-duplicates applications to achieve `O(n)` performance. A `KEEP_ANY` option is also added to `duplicate_keep_option`; this was attempted in #9417 but did not get merged. Partially addresses #11050 and #11053.

----

Main implementation: https://github.com/rapidsai/cudf/pull/11052/files#diff-4c2d4268b3c50000ae845ba15a890bb743709c30e5cab4847af7ad633c5a2823R47

Follow-up work:
* #11053
* #11089
* #11092

Authors:
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- Vyas Ramasubramani (https://github.com/vyasr)
- Yunsong Wang (https://github.com/PointKernel)
- Bradley Dice (https://github.com/bdice)

URL: #11052
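To make the `keep` semantics concrete, here is a minimal sketch in plain Python. It is not cudf's implementation and does not call any cudf API; the function name `distinct_indices` and the string values for `keep` are hypothetical stand-ins for `KEEP_ANY`, `KEEP_FIRST`, `KEEP_LAST`, and `KEEP_NONE`. It only illustrates how a hash-based pass can select rows in expected `O(n)` time.

```python
from collections import Counter


def distinct_indices(keys, keep="any"):
    """Return the row indices to keep, using hash-based passes (expected O(n)).

    Hypothetical helper for illustration only. As with a hash-based distinct,
    the order of the returned indices is not guaranteed to be ascending.
    """
    if keep not in {"any", "first", "last", "none"}:
        raise ValueError(f"unknown keep option: {keep!r}")

    if keep == "none":
        # Keep only the rows whose key occurs exactly once.
        counts = Counter(keys)
        return [i for i, k in enumerate(keys) if counts[k] == 1]

    chosen = {}
    for i, k in enumerate(keys):
        # "last" overwrites on every occurrence; "first" and "any" keep the
        # index seen first. For "any", every occurrence would be a valid
        # answer; this sketch simply happens to return the first one.
        if keep == "last" or k not in chosen:
            chosen[k] = i
    return list(chosen.values())


keys = ["a", "b", "a", "c", "b"]
first_rows = distinct_indices(keys, keep="first")
last_rows = distinct_indices(keys, keep="last")
unique_rows = distinct_indices(keys, keep="none")
any_rows = distinct_indices(keys, keep="any")
```

With `keep="any"` the caller promises not to care which duplicate survives, which is what gives an implementation the freedom to skip any ordering work.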
This issue has been labeled |
@ttnghia did you have any specific ideas of places where we could leverage this new feature, or do we just need to do a search for usage of |
I have done the necessary work for the Java side. For Python, I started #11230, but then @brandon-b-miller took it over with #11656, which should close this issue (that PR is not exactly the same, but it covers my work). @brandon-b-miller do you have any update?
got a little sidetracked but hoping to come back to this this week 👍
So both the Java and Python bindings have been addressed. I think we can close this.
Now `cudf::distinct` is going to support `duplicate_keep_option` (#11052). When that PR is merged, we should update the Python and Java sides to use it instead of `cudf::stable_sort_by_key` + `cudf::unique`. This will boost performance from `O(n log n)` to `O(n)`.
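The complexity difference between the two pipelines can be sketched in plain Python. This is only an analogy for the sort-then-unique approach versus a hash-based distinct, not the cudf code paths themselves; the function names `dedupe_sort_unique` and `dedupe_hash` are hypothetical.

```python
def dedupe_sort_unique(keys):
    """Sort-based pipeline: stable sort of row indices, then adjacent unique."""
    # Stable sort of row indices by key: O(n log n).
    order = sorted(range(len(keys)), key=lambda i: keys[i])
    # Keep the first row of every run of equal adjacent keys: O(n).
    return [
        row
        for pos, row in enumerate(order)
        if pos == 0 or keys[order[pos - 1]] != keys[row]
    ]


def dedupe_hash(keys):
    """Hash-based pipeline: one pass with a set, expected O(n)."""
    seen = set()
    kept = []
    for row, key in enumerate(keys):
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


keys = [3, 1, 3, 2, 1]
sorted_result = dedupe_sort_unique(keys)
hashed_result = dedupe_hash(keys)
```

Both functions keep the first occurrence of each key, so they select the same rows and differ only in output order and in cost: the sort dominates the first pipeline, while the second never orders the data at all.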