[FEA] `drop_duplicates` and `distinct_count` behavior/implementation is very inefficient
#9413
Comments
Why not rename the new ordered version of
Just for reference in thinking about naming and semantics, I also note that
I think we discussed this in the past, and decided that due to the cost of materializing columns/tables, it's better to provide an algorithm that actually counts the unique items with a scan than to require the user to build their own as in method 2. The hashing approach you mention would be similar to approach 1 and would probably be faster in many cases where the input is not pre-sorted. I can't remember why we decided to call it
The Pandas API is explicit about not having sorting behaviour:

```
Signature: s.unique() -> 'ArrayLike'
Docstring:
Return unique values of Series object.

Uniques are returned in order of appearance. Hash table-based unique,
therefore does NOT sort.
```

So for Python, we'll likely have to do the regular trick of tacking on a sequence column and later sorting by it, at least until an
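The "tack on a sequence column" trick mentioned here can be sketched in host-side C++ (a simplified stand-in for libcudf's column machinery; `unique_stable` is a name invented for this sketch): pair each value with its original row index, deduplicate with a hash map that keeps the first index seen, then sort the survivors by that index to recover order of appearance, matching pandas' hash-based `unique`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// Hash-based unique that preserves order of first appearance, mimicking
// pandas.Series.unique on the host. The original-index column plays the
// role of the "sequence column" discussed above.
std::vector<int> unique_stable(std::vector<int> const& in) {
  std::unordered_map<int, std::size_t> first_seen;  // value -> first index
  for (std::size_t i = 0; i < in.size(); ++i) {
    first_seen.emplace(in[i], i);  // emplace keeps the earliest index only
  }
  std::vector<std::pair<std::size_t, int>> tagged;
  for (auto const& [v, i] : first_seen) tagged.emplace_back(i, v);
  // Sorting by the sequence column restores order of appearance.
  std::sort(tagged.begin(), tagged.end());
  std::vector<int> out;
  for (auto const& [i, v] : tagged) out.push_back(v);
  return out;
}
```

The extra sort here runs over only the already-deduplicated rows, which is the appeal of the trick when duplicates are common.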
One more thing: I've found that for a very small subset (single column, non-nullable, integer representation) sorting is actually much faster than hashing. But I suppose that can become an implementation detail in
Given the Python observation, why not call
I think it was actually called

Other side note: it is regarded as a mistake that we called our containers:

instead of:

The "sortedness" is an extra requirement that the user should only have to think about if they want it. "Unsortedness" should be the default (Python got it right with
I don't know of such classes in libcudf. "we"?
But now Jake is suggesting changing the non-ordering semantics to match
But the behavior of
Hmmm, perhaps I misunderstood @shwina's message. What does Pandas `unique` return for `[4, 4, 0, 1, 2, 2, 4, 4]`? `std::unique` would return `[4, 0, 1, 2, 4]`.
https://pandas.pydata.org/docs/reference/api/pandas.unique.html
The fact that it calls out that it uses a hash table means it's doing a total unique.
Pandas would return `[4, 0, 1, 2]`, the unique values in order of first appearance.
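For concreteness, the `std::unique` side of this comparison can be checked with a few lines of standard C++ (`adjacent_unique` is just an illustrative wrapper, not a cudf API):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// std::unique drops only *consecutive* duplicates, so in unsorted input a
// value that reappears after a gap survives (here the trailing run of 4s).
std::vector<int> adjacent_unique(std::vector<int> v) {
  v.erase(std::unique(v.begin(), v.end()), v.end());
  return v;
}
```

On `[4, 4, 0, 1, 2, 2, 4, 4]` this yields `[4, 0, 1, 2, 4]`, whereas a hash-based, order-of-appearance unique yields `[4, 0, 1, 2]`.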
"We" refers to the C++ community at large. And it was just a general comment about the "unordered_" prefix that is used in the C++ Standard Library.
Let's make sure to make our docs more clear. Understanding what our APIs do shouldn't require inference from implementation details. |
As a promising anecdote, @gaohao95 has already experimented with a hash-based solution of
@jrhemstad Is there any draft PR yet? I am implementing
This PR adds support for `cudf::contains` so we can check whether a structs column contains a scalar struct element. Partially addresses #8965. This does not support checking if structs given in a structs column exist in another structs column. Such cases will be supported when the new data structure mentioned in #9413 is merged into cudf.

Authors:
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- Mike Wilson (https://github.com/hyperbolic2346)
- MithunR (https://github.com/mythrocks)

URL: #9929
Related to #9413. This PR adds `unordered_drop_duplicates`/`unordered_distinct_count` APIs using hash-based algorithms. It doesn't close the original issue, since adding a `std::unique`-like `drop_duplicates` is not addressed in this PR. It involves several changes:

- [x] Change the behavior of the existing `distinct_count`: count the number of consecutive groups of equivalent rows instead of the total number of unique rows.
- [x] Add hash-based `unordered_distinct_count`: this new API counts unique rows across the whole table by using a hash map. It requires a newer version of `cuco` with bug fixes: NVIDIA/cuCollections#132 and NVIDIA/cuCollections#138.
- [x] Add hash-based `unordered_drop_duplicates`: similar to `drop_duplicates`, but this API doesn't support the `keep` option and its output is in an unspecified order.
- [x] Replace all the cpp-side `drop_duplicates`/`distinct_count` use cases with the `unordered_` versions.
- [x] Update and replace the existing compaction benchmark with `nvbench`.

Authors:
- Yunsong Wang (https://github.com/PointKernel)

Approvers:
- https://github.com/brandon-b-miller
- Bradley Dice (https://github.com/bdice)
- Nghia Truong (https://github.com/ttnghia)
- Robert Maynard (https://github.com/robertmaynard)

URL: #10030
Closes #9413. Depends on #10387. There are several changes involved in this PR:

- Refactors `cudf::drop_duplicates` to match `std::unique`'s behavior and renames it `cudf::unique`. `cudf::unique` creates a table by removing duplicate rows in each consecutive group of equivalent rows of the input.
- Renames `cudf::unordered_drop_duplicates` to `cudf::distinct`. `cudf::distinct` creates a table by keeping unique rows across the whole input table. Unique rows in the new table are in an unspecified order due to the nature of hash-based algorithms.
- Renames `cudf::unordered_distinct_count` to `cudf::distinct_count`: the count of `cudf::distinct`.
- Renames `cudf::distinct_count` to `cudf::unique_count`: the count of `cudf::unique`.
- Updates corresponding tests and benchmarks.
- Updates related JNI/Cython bindings. In order not to break the existing behavior in Java and Python, the JNI and Cython bindings of `drop_duplicates` are updated to stably sort the input table first and then call `cudf::unique`.

Performance hints for `cudf::unique` and `cudf::distinct`:

- If the input is pre-sorted, use `cudf::unique`.
- If the input is **not** pre-sorted and the behavior of `pandas.DataFrame.drop_duplicates` is desired:
  - If `keep` control (keep the first, last, or none of the duplicates) doesn't matter, use the hash-based `cudf::distinct`.
  - If `keep` control is required, stable-sort the input and then call `cudf::unique`.

Authors:
- Yunsong Wang (https://github.com/PointKernel)

Approvers:
- Bradley Dice (https://github.com/bdice)
- https://github.com/brandon-b-miller
- MithunR (https://github.com/mythrocks)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #10370
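The renamed pair of counting APIs can be illustrated with host-side standard-library analogues (illustrative only: the real `cudf::unique_count`/`cudf::distinct_count` operate on device tables, and these helper names are made up for the sketch):

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// unique_count analogue: number of consecutive groups of equal values.
// A single linear scan, ideal when the input is already sorted.
std::size_t unique_count(std::vector<int> const& v) {
  std::size_t n = 0;
  for (std::size_t i = 0; i < v.size(); ++i)
    if (i == 0 || v[i] != v[i - 1]) ++n;
  return n;
}

// distinct_count analogue: number of unique values across the whole input.
// Hash-based, so no sort is required and no order is implied.
std::size_t distinct_count(std::vector<int> const& v) {
  return std::unordered_set<int>(v.begin(), v.end()).size();
}
```

On unsorted data the two disagree, which is exactly why the old overloaded naming was confusing: for `[4, 4, 0, 1, 2, 2, 4, 4]` the group count is 5 but the distinct count is 4.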
**Is your feature request related to a problem? Please describe.**
#9411 made me take a closer look at `cudf::drop_duplicates` and `cudf::distinct_count`. Unlike `std::unique`, both of these APIs will drop/count all unique rows across the entire table (as opposed to only contiguous equivalent rows). On the surface, this seems convenient for a user because they don't have to worry about sorting their dataframe if they want to get only the unique rows. However, it has an insidious impact on performance.

Imagine if you were to call `distinct_count` and then `drop_duplicates`. The way these functions are currently implemented, they both require doing a sort of the inputs. Instead, if `distinct_count` and `drop_duplicates` worked like `std::unique` and the user had to first sort the input, then only one sort would be needed. Alternatively, the data may already be sorted (as is the case with Python indexes), in which case no sort would be necessary. The current behavior is very sub-optimal for performance as it can require 2 redundant multi-column sorts. Multi-column sorts are among the most expensive operations in libcudf, so this is a bad thing.
(Furthermore, if the data isn't already sorted, a sort-based implementation is likely to be much slower than a hash-based one, so we should look at refactoring these implementations.)
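The "sort once, reuse it" argument above can be sketched with the standard library (a host-side analogy; `count_then_dedup` is a made-up helper name): with `std::unique`-style semantics the caller pays for a single sort, after which both the distinct count and the deduplicated output fall out of linear scans.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <utility>
#include <vector>

// Sort once up front; both operations then reuse the same ordering instead
// of each API sorting internally, which is the redundancy the issue
// complains about.
std::pair<std::size_t, std::vector<int>> count_then_dedup(std::vector<int> v) {
  std::sort(v.begin(), v.end());                // the only sort
  auto last = std::unique(v.begin(), v.end());  // linear scan on sorted data
  std::size_t n = static_cast<std::size_t>(std::distance(v.begin(), last));
  v.erase(last, v.end());                       // dedup reuses the same scan
  return {n, v};
}
```

In libcudf terms, the sort is a multi-column table sort, so eliminating even one redundant pass is a significant win.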
**Describe the solution you'd like**

I'd like to do two things:

1. Update `drop_duplicates`/`distinct_count` to work like `std::unique`, i.e., only consider contiguous equivalent elements. See `groupby` for an example of how we've handled this there.
2. Add new `unordered_drop_duplicates` and `unordered_distinct_count` APIs that behave like the current APIs. Python should then use the `unordered_*` algorithms as it will likely be much faster.

**Describe alternatives you've considered**
One alternative would be keeping the current total-deduplication semantics and only switching to a hash-based implementation. While this is certainly an option, I think that the behavior I described above would be more canonical for C++.
If the data is already sorted, there's a good chance a sort-based implementation could be faster than a hash-based one, but we can test that and see.