[FEA] Use `cudf::distinct` in Python and Java when it has support for `duplicate_keep_option` #11089

ttnghia · 2022-06-09T22:17:26Z

`cudf::distinct` is going to support `duplicate_keep_option` (#11052). Once that PR is merged, we should update the Python and Java sides to use it instead of `cudf::stable_sort_by_key` + `cudf::unique`. This will improve the complexity of dropping duplicates from O(n log n) to O(n).
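To make the complexity argument concrete, here is a minimal pure-Python sketch (not cudf code) that contrasts the two strategies: a stable sort by key followed by dropping adjacent duplicates, and a single hash-based pass. The function names `distinct_by_sort` and `distinct_by_hash` are hypothetical and exist only for illustration; both return the row indices of the first occurrence of each key.

```python
def distinct_by_sort(keys):
    """Sort-then-unique: O(n log n) because of the sort."""
    # sorted() is stable, so rows with equal keys keep their original order.
    order = sorted(range(len(keys)), key=keys.__getitem__)
    # Keep only the first row in each run of equal keys.
    return [
        row
        for pos, row in enumerate(order)
        if pos == 0 or keys[row] != keys[order[pos - 1]]
    ]


def distinct_by_hash(keys):
    """Hash-based distinct: expected O(n), one pass with a hash set."""
    seen = set()
    kept = []
    for row, key in enumerate(keys):
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


keys = [3, 1, 3, 2, 1]
# Both strategies select the same rows; only the output order differs.
print(sorted(distinct_by_sort(keys)), distinct_by_hash(keys))
```

The sort-based version returns rows in key order, while the hash-based version returns them in input order, so the results are compared after sorting the indices.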


This adds `duplicate_keep_option` to `cudf::distinct`, which lets callers specify a `keep` option that selects which of the duplicate elements to keep. It paves the way for many drop-duplicates applications to achieve `O(n)` performance. A `KEEP_ANY` option is also added to `duplicate_keep_option`; this was attempted in #9417 but did not get merged in the end. Partially addresses #11050 and #11053.

----

Main implementation: https://github.com/rapidsai/cudf/pull/11052/files#diff-4c2d4268b3c50000ae845ba15a890bb743709c30e5cab4847af7ad633c5a2823R47

Follow-up work:
* #11053
* #11089
* #11092

Authors:
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- Vyas Ramasubramani (https://github.com/vyasr)
- Yunsong Wang (https://github.com/PointKernel)
- Bradley Dice (https://github.com/bdice)

URL: #11052
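The semantics of the `keep` option described in the PR summary above can be sketched in plain Python. This is an illustration under stated assumptions, not the libcudf implementation: the function `distinct_rows` and its string values are hypothetical, chosen to mirror what I understand to be the `KEEP_FIRST`, `KEEP_LAST`, `KEEP_NONE`, and `KEEP_ANY` members of `duplicate_keep_option`.

```python
def distinct_rows(keys, keep="any"):
    """Return the sorted row indices kept for each distinct key.

    One hash-based pass records the first row, last row, and count per key,
    so the expected cost stays O(n) before the final ordering of results.
    """
    first, last, count = {}, {}, {}
    for row, key in enumerate(keys):
        first.setdefault(key, row)  # remember only the first occurrence
        last[key] = row             # overwritten until the last occurrence
        count[key] = count.get(key, 0) + 1

    if keep == "first":
        chosen = first.values()
    elif keep == "last":
        chosen = last.values()
    elif keep == "none":
        # Drop every key that occurs more than once.
        chosen = (row for key, row in first.items() if count[key] == 1)
    elif keep == "any":
        # Any occurrence is a valid answer; this sketch happens to pick the first.
        chosen = first.values()
    else:
        raise ValueError(f"unknown keep option: {keep!r}")
    return sorted(chosen)


keys = ["a", "b", "a", "c", "b"]
for option in ("first", "last", "none", "any"):
    print(option, distinct_rows(keys, keep=option))
```

Because "any" places no constraint on which duplicate survives, an implementation is free to pick whichever occurrence is cheapest to produce, which is why that option is the natural fit for an unordered hash-based algorithm.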

github-actions · 2022-07-15T22:02:49Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions · 2022-10-13T22:02:59Z

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

vyasr · 2022-10-21T15:45:16Z

@ttnghia did you have any specific ideas of places where we could leverage this new feature, or do we just need to do a search for usage of cudf::unique calls to see where we are sorting first?

ttnghia · 2022-10-24T20:40:46Z

I have done the necessary work for the Java side. For Python, I started #11230, but then @brandon-b-miller took it over with #11656, which should close this issue (that PR is not exactly the same, but it covers my work).

@brandon-b-miller do you have any update?

brandon-b-miller · 2022-10-25T18:05:07Z

got a little sidetracked but hoping to come back to this this week 👍

bdice · 2023-05-31T04:53:26Z

The Python side is using cudf::stable_distinct as of #11656, which closes half of this issue, but perhaps there is still work needed in Java? @ttnghia Can you offer an update here? Maybe we need to search the JNI or spark-rapids code for “cudf::unique”?

ttnghia · 2023-05-31T04:59:05Z

Thanks @bdice. As far as I'm aware, Java doesn't need `unique`.
`distinct` was added to the Java bindings almost a year ago: #11232.

ttnghia · 2023-05-31T04:59:51Z

So both the Java and Python bindings have been addressed. I think we can close this.

ttnghia added the feature request (New feature or request) and Needs Triage (Need team to review and classify) labels Jun 9, 2022

ttnghia mentioned this issue Jun 10, 2022

Support duplicate_keep_option in cudf::distinct #11052

Merged

ttnghia self-assigned this Jun 15, 2022

github-actions bot added the inactive-30d label Jul 15, 2022

github-actions bot added the inactive-90d label Oct 13, 2022

github-actions bot removed the inactive-90d label Oct 21, 2022

ttnghia mentioned this issue Apr 6, 2023

[FEA] Support drop_duplicates on Series containing list objects #6784

Closed

ttnghia closed this as completed May 31, 2023
