Skip to content

Commit

Permalink
Fix stream compaction's drop_duplicates API to use stable sort (#9417)
Browse files Browse the repository at this point in the history
Currently, stream compaction API `drop_duplicates` uses unstable sort for all of its internal sorting. This is wrong since we may want to have stable sorting results so we can choose to keep the first or the last duplicate element from the repeated sequences. 

This PR does two things:
 * Fixes the issue mentioned above by using stable sort if the input option is to keep the first or last duplicate element, and
 * Adds a new keep option into the enum class `duplicate_keep_option`:  `KEEP_ANY`. This option allows the user to choose to keep one element from the repeated sequence at any position.

Note that the issue did not show up by any failed test because thrust default (unstable) sort, which is called internally in `drop_duplicate`, still produces the same results as thrust stable sort most of the time (but this is not guaranteed). As such, the current `drop_duplicate` still produces correct results in its tests.

This PR blocks #9345.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - MithunR (https://github.com/mythrocks)

URL: #9417
  • Loading branch information
ttnghia authored Oct 15, 2021
1 parent 13e9ec0 commit b66f14e
Show file tree
Hide file tree
Showing 2 changed files with 6 additions and 5 deletions.
6 changes: 3 additions & 3 deletions cpp/include/cudf/stream_compaction.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -208,9 +208,9 @@ std::unique_ptr<table> apply_boolean_mask(
* @brief Choices for drop_duplicates API for retainment of duplicate rows
*/
enum class duplicate_keep_option {
KEEP_FIRST = 0, ///< Keeps first duplicate row and unique rows
KEEP_LAST, ///< Keeps last duplicate row and unique rows
KEEP_NONE ///< Keeps only unique rows are kept
KEEP_FIRST = 0, ///< Keeps first duplicate element and unique elements
KEEP_LAST, ///< Keeps last duplicate element and unique elements
KEEP_NONE ///< Keeps only unique elements
};

/**
Expand Down
5 changes: 3 additions & 2 deletions cpp/src/stream_compaction/drop_duplicates.cu
Original file line number Diff line number Diff line change
Expand Up @@ -137,8 +137,9 @@ column_view get_unique_ordered_indices(cudf::table_view const& keys,
null_order null_precedence,
rmm::cuda_stream_view stream)
{
// sort only indices
auto sorted_indices = sorted_order(
// Sort only the indices.
// Note that stable sort must be used to maintain the order of duplicate elements.
auto sorted_indices = stable_sorted_order(
keys,
std::vector<order>{},
std::vector<null_order>{static_cast<uint64_t>(keys.num_columns()), null_precedence},
Expand Down

0 comments on commit b66f14e

Please sign in to comment.