Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Implement lists::drop_list_duplicates using cudf::distinct #11053

Closed
ttnghia opened this issue Jun 4, 2022 · 2 comments · Fixed by #11236
Closed

[FEA] Implement lists::drop_list_duplicates using cudf::distinct #11053

ttnghia opened this issue Jun 4, 2022 · 2 comments · Fixed by #11236
Labels
feature request New feature or request

Comments

@ttnghia
Copy link
Contributor

ttnghia commented Jun 4, 2022

I just realized that lists::drop_list_duplicates can be implemented using cudf::distinct by using the following approach:

  1. Peel out child column of the input lists column
  2. Generate labels for the child elements
  3. Add the labels column into the keys table that input into cudf::distinct
  4. The output labels of unique elements need to be grouped and maintain their group order, this can be achieved easily by changing cudf::distinct to use a gather map generated from a sequence of linear indices.
  5. The labels output from cudf::distinct are used to reconstruct the final lists column without duplicates.

Such approach can be implemented if we have cudf::distinct supports:

The only concern here is how to enforce step 4 above. Adding parameters is easy, even that is breaking change. Guarantee some internal implementation is more difficult, as devs may accidentally + silently change it. Maybe explicitly writing this into the public interface doxygen and adding unit tests can help.

@ttnghia ttnghia added feature request New feature or request Needs Triage Need team to review and classify labels Jun 4, 2022
@ttnghia
Copy link
Contributor Author

ttnghia commented Jun 12, 2022

Using this approach, my local implementation has only about 50 LOC, instead of >600 LOC as in the current implementation.

rapids-bot bot pushed a commit that referenced this issue Jun 22, 2022
This adds `duplicate_keep_option` to `cudf::distinct`, allowing to specify a `keep` option for selecting which of the duplicate elements to keep. It paves the way for many drop duplicate applications to achieve `O(n)` performance.

A `KEEP_ANY` option is also added to `duplicate_keep_option`, which was an attempt in #9417 but didn't get in eventually.

Partially addresses #11050 and #11053.

----
Main implementation: https://github.com/rapidsai/cudf/pull/11052/files#diff-4c2d4268b3c50000ae845ba15a890bb743709c30e5cab4847af7ad633c5a2823R47

Follow up work:
 * #11053
 * #11089
 * #11092

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Yunsong Wang (https://github.com/PointKernel)
  - Bradley Dice (https://github.com/bdice)

URL: #11052
rapids-bot bot pushed a commit that referenced this issue Jun 23, 2022
This adds `nan_equality` parameter to `cudf::distinct`, allowing to specify the desired behavior when dealing with floating-point data: `NaN` should be compared equally to other `NaN` or not.

Depends on #11052 (built on top of it).
Closes #11092.
This is a blocker for set-like operations (#11043) and also the last blocker for #11053.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #11118
@sameerz
Copy link
Contributor

sameerz commented Jul 8, 2022

Depends on #11149

@ttnghia ttnghia linked a pull request Jul 14, 2022 that will close this issue
rapids-bot bot pushed a commit that referenced this issue Jul 22, 2022
This PR completely removes `cudf::lists::drop_list_duplicates`. It is replaced by the new API `cudf::list::distinct` which has a simpler implementation but better performance. The replacements for internal cudf usage have all been merged before thus there is no side effect or breaking for the existing APIs in this work.

Closes #11114, #11093, #11053, #11034, and closes #9257.

Depends on:
 * #11228
 * #11149
 * #11234
 * #11233

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Jordan Jacobelli (https://github.com/Ethyling)
  - Robert Maynard (https://github.com/robertmaynard)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Bradley Dice (https://github.com/bdice)

URL: #11236
@bdice bdice removed the Needs Triage Need team to review and classify label Mar 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request
Projects
None yet
3 participants