-
Notifications
You must be signed in to change notification settings - Fork 932
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support set operations #11043
Support set operations #11043
Conversation
This comment was marked as off-topic.
This comment was marked as off-topic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like it has the potential to be a very large PR. I would strongly advocate for breaking this up into smaller, easier to review ones. Likely be doing one of the set_
algorithms at a time.
I'm not sure. If you read my algorithm descriptions (from here) then the implementation for each set-op should be very short. I will try to have smaller dependent PRs if I figured out what to extract from here. Currently, I see some overlap in implementation for this PR and the |
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ttnghia I have some comments attached. I think this is some of your best work I've reviewed. Really great job! The way the set algorithms all fall back to contains
is very nice -- I love when we can write concise higher-order algorithms like these.
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few minor suggestions, otherwise LGTM!
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great!
I've learned a lot by going through your PRs in the chain and here we are finally! Thanks @ttnghia for your persistence and effort put into this work.
Signed-off-by: Nghia Truong <[email protected]>
Co-authored-by: Bradley Dice <[email protected]>
@gpucibot merge |
This PR add Java binding for the set-like operations: * `lists::have_overlap` * `lists::intersect_distinct` * `lists::union_distinct` * `lists::difference_distinct` Depends on: * #11043 * #11220 New Java APIs start here: https://github.com/rapidsai/cudf/pull/11143/files#diff-50ba2711690aca8e4f28d7b491373a4dd76443127c8b452a77b6c1fe2388d9e3R3545 Authors: - Nghia Truong (https://github.com/ttnghia) Approvers: - Robert (Bobby) Evans (https://github.com/revans2) URL: #11143
This PR adds the following APIs for set operations:
lists::have_overlap
lists::intersect_distinct
lists::union_distinct
lists::difference_distinct
Name Convention
Except for the first API (
lists::have_overlap
) that returns a boolean column, the suffix_distinct
of the rest APIs denotes that their results will be lists columns in which all list rows have been post-processed to remove duplicates. As such, their results are actually "set" columns in which each row is a "set" of distinct elements.Depends on:
static_multimap::pair_contains
NVIDIA/cuCollections#175duplicate_keep_option
incudf::distinct
#11052nan_equality
incudf::distinct
#11118semi_anti_join
#11100lists::distinct
andcudf::detail::stable_distinct
#11149Closes #10409.