-
Notifications
You must be signed in to change notification settings - Fork 932
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix stream compaction's drop_duplicates
API to use stable sort
#9417
Fix stream compaction's drop_duplicates
API to use stable sort
#9417
Conversation
It may not be worth adding the |
Thanks! I agree, but until that issue is addressed this PR is still relevant to fix sorting to use stable sort. In addition, |
I agree with @davidwendt that I looked at #9345 to understand how |
Using stable sort everywhere in
I believe that |
@ttnghia Thanks for the info. I think the scope of this PR should be limited to changing unstable sort to stable sort, since that's clearly a bug.
I saw this a little differently. It sounds like |
Stable/unstable segmented sorts all use
|
The benchmark shows that the performance difference between stable and unstable sorts (using thrust) is very small, less than 5%. I'm totally fine to switch to use stable sort all the time, but just feel that it is necessary. Need more vote @jrhemstad @davidwendt. If using stable sort all the time, we can remove the |
LGTM, save for the removal of |
Removed the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ttnghia Great, thanks for the update on this. Would it be possible to add a test of some kind to catch this bug? I recognize that the test might also pass on the previous code, based on what you wrote in the PR description, but it would be helpful to ensure that we are meeting our requirement of stability.
I tried that but couldn't. In order to catch the bug, we have to generate an array (of structs) that So this is a potential bug, which does not show up right away and it's difficult to catch. |
I looked over |
Codecov Report
@@ Coverage Diff @@
## branch-21.12 #9417 +/- ##
================================================
- Coverage 10.79% 10.78% -0.01%
================================================
Files 116 117 +1
Lines 18869 19445 +576
================================================
+ Hits 2036 2097 +61
- Misses 16833 17348 +515
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, this.
@gpucibot merge |
This adds `duplicate_keep_option` to `cudf::distinct`, allowing to specify a `keep` option for selecting which of the duplicate elements to keep. It paves the way for many drop duplicate applications to achieve `O(n)` performance. A `KEEP_ANY` option is also added to `duplicate_keep_option`, which was an attempt in #9417 but didn't get in eventually. Partially addresses #11050 and #11053. ---- Main implementation: https://github.com/rapidsai/cudf/pull/11052/files#diff-4c2d4268b3c50000ae845ba15a890bb743709c30e5cab4847af7ad633c5a2823R47 Follow up work: * #11053 * #11089 * #11092 Authors: - Nghia Truong (https://github.com/ttnghia) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) - Yunsong Wang (https://github.com/PointKernel) - Bradley Dice (https://github.com/bdice) URL: #11052
Currently, stream compaction API
drop_duplicates
uses unstable sort for all of its internal sorting. This is wrong since we may want to have stable sorting results so we can choose to keep the first or the last duplicate element from the repeated sequences.This PR does two things:
duplicate_keep_option
:KEEP_ANY
. This option allows the user to choose to keep one element from the repeated sequence at any position.Note that the issue did not show up by any failed test because thrust default (unstable) sort, which is called internally in
drop_duplicate
, still produces the same results as thrust stable sort most of the time (but this is not guaranteed). As such, the currentdrop_duplicate
still produces correct results in its tests.This PR blocks #9345.