-
Notifications
You must be signed in to change notification settings - Fork 932
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEA] Implement lists::drop_list_duplicates
using cudf::distinct
#11053
Labels
feature request
New feature or request
Comments
This was referenced Jun 6, 2022
Using this approach, my local implementation has only about 50 LOC, instead of >600 LOC as in the current implementation. |
rapids-bot bot
pushed a commit
that referenced
this issue
Jun 22, 2022
This adds `duplicate_keep_option` to `cudf::distinct`, allowing to specify a `keep` option for selecting which of the duplicate elements to keep. It paves the way for many drop duplicate applications to achieve `O(n)` performance. A `KEEP_ANY` option is also added to `duplicate_keep_option`, which was an attempt in #9417 but didn't get in eventually. Partially addresses #11050 and #11053. ---- Main implementation: https://github.com/rapidsai/cudf/pull/11052/files#diff-4c2d4268b3c50000ae845ba15a890bb743709c30e5cab4847af7ad633c5a2823R47 Follow up work: * #11053 * #11089 * #11092 Authors: - Nghia Truong (https://github.com/ttnghia) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) - Yunsong Wang (https://github.com/PointKernel) - Bradley Dice (https://github.com/bdice) URL: #11052
rapids-bot bot
pushed a commit
that referenced
this issue
Jun 23, 2022
This adds `nan_equality` parameter to `cudf::distinct`, allowing to specify the desired behavior when dealing with floating-point data: `NaN` should be compared equally to other `NaN` or not. Depends on #11052 (built on top of it). Closes #11092. This is a blocker for set-like operations (#11043) and also the last blocker for #11053. Authors: - Nghia Truong (https://github.com/ttnghia) Approvers: - Bradley Dice (https://github.com/bdice) - Yunsong Wang (https://github.com/PointKernel) URL: #11118
Depends on #11149 |
This was referenced Jul 8, 2022
rapids-bot bot
pushed a commit
that referenced
this issue
Jul 22, 2022
This PR completely removes `cudf::lists::drop_list_duplicates`. It is replaced by the new API `cudf::list::distinct` which has a simpler implementation but better performance. The replacements for internal cudf usage have all been merged before thus there is no side effect or breaking for the existing APIs in this work. Closes #11114, #11093, #11053, #11034, and closes #9257. Depends on: * #11228 * #11149 * #11234 * #11233 Authors: - Nghia Truong (https://github.com/ttnghia) Approvers: - Jordan Jacobelli (https://github.com/Ethyling) - Robert Maynard (https://github.com/robertmaynard) - Vukasin Milovanovic (https://github.com/vuule) - Bradley Dice (https://github.com/bdice) URL: #11236
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
I just realized that
lists::drop_list_duplicates
can be implemented usingcudf::distinct
by using the following approach:cudf::distinct
cudf::distinct
to use a gather map generated from a sequence of linear indices.cudf::distinct
are used to reconstruct the final lists column without duplicates.Such approach can be implemented if we have
cudf::distinct
supports:duplicate_keep_option
parameter ([FEA] Supportduplicate_keep_option
in places that use unordered map #11050), andnan_equality
parameter, andThe only concern here is how to enforce step 4 above. Adding parameters is easy, even that is breaking change. Guarantee some internal implementation is more difficult, as devs may accidentally + silently change it. Maybe explicitly writing this into the public interface doxygen and adding unit tests can help.
The text was updated successfully, but these errors were encountered: