Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Use cudf::distinct in Python and Java when it has support for duplicate_keep_option #11089

Closed
ttnghia opened this issue Jun 9, 2022 · 8 comments
Assignees
Labels
feature request New feature or request improvement Improvement / enhancement to an existing function Java Affects Java cuDF API. non-breaking Non-breaking change Performance Performance related issue Python Affects Python cuDF API. Spark Functionality that helps Spark RAPIDS

Comments

@ttnghia
Copy link
Contributor

ttnghia commented Jun 9, 2022

Now cudf::distinct is going to support duplicate_keep_option (#11052). When that PR is merged, we should update Python and Java sides to use it instead of cudf::stable_sort_by_key + cudf::unique. This will boost performance from O(nlogn) to O(n).

@ttnghia ttnghia added feature request New feature or request Needs Triage Need team to review and classify labels Jun 9, 2022
@ttnghia ttnghia self-assigned this Jun 15, 2022
@ttnghia ttnghia added Python Affects Python cuDF API. Java Affects Java cuDF API. Spark Functionality that helps Spark RAPIDS improvement Improvement / enhancement to an existing function non-breaking Non-breaking change not a bug Performance Performance related issue and removed Needs Triage Need team to review and classify labels Jun 15, 2022
rapids-bot bot pushed a commit that referenced this issue Jun 22, 2022
This adds `duplicate_keep_option` to `cudf::distinct`, allowing to specify a `keep` option for selecting which of the duplicate elements to keep. It paves the way for many drop duplicate applications to achieve `O(n)` performance.

A `KEEP_ANY` option is also added to `duplicate_keep_option`, which was an attempt in #9417 but didn't get in eventually.

Partially addresses #11050 and #11053.

----
Main implementation: https://github.com/rapidsai/cudf/pull/11052/files#diff-4c2d4268b3c50000ae845ba15a890bb743709c30e5cab4847af7ad633c5a2823R47

Follow up work:
 * #11053
 * #11089
 * #11092

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Yunsong Wang (https://github.com/PointKernel)
  - Bradley Dice (https://github.com/bdice)

URL: #11052
@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@github-actions
Copy link

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@vyasr
Copy link
Contributor

vyasr commented Oct 21, 2022

@ttnghia did you have any specific ideas of places where we could leverage this new feature, or do we just need to do a search for usage of cudf::unique calls to see where we are sorting first?

@ttnghia
Copy link
Contributor Author

ttnghia commented Oct 24, 2022

I have done the necessary work for the Java side. For Python, I've started #11230 but then @brandon-b-miller took over it by #11656 which should close this issue (that PR is not exactly the same, but covers my work).

@brandon-b-miller do you have any update?

@brandon-b-miller
Copy link
Contributor

got a little sidetracked but hoping to come back to this this week 👍

@bdice
Copy link
Contributor

bdice commented May 31, 2023

The Python side is using cudf::stable_distinct as of #11656, which closes half of this issue, but perhaps there is still work needed in Java? @ttnghia Can you offer an update here? Maybe we need to search the JNI or spark-rapids code for “cudf::unique”?

@ttnghia
Copy link
Contributor Author

ttnghia commented May 31, 2023

Thanks @bdice. Java doesn't need unique as I'm aware of.
distinct has been added to Java binding almost a year ago: #11232.

@ttnghia
Copy link
Contributor Author

ttnghia commented May 31, 2023

So both Java + Python binding have been addressed. I think we can close this.

@ttnghia ttnghia closed this as completed May 31, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request improvement Improvement / enhancement to an existing function Java Affects Java cuDF API. non-breaking Non-breaking change Performance Performance related issue Python Affects Python cuDF API. Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

No branches or pull requests

4 participants