[FEA] `drop_duplicates` and `distinct_count` behavior/implementation is very inefficient
#9413
Comments
Why not rename the new ordered version of
Just for reference in thinking about naming and semantics, I also note that
I think we discussed this in the past, and decided that due to the cost of materializing columns/tables, it's better to provide an algorithm that actually counts the unique items with a scan than to require the user to build their own as in method 2. The hashing approach you mention would be similar to approach 1 and would probably be faster in many cases where the input is not pre-sorted. I can't remember why we decided to call it
The Pandas API is explicit about not having sorting behaviour:

```
Signature: s.unique() -> 'ArrayLike'
Docstring:
Return unique values of Series object.

Uniques are returned in order of appearance. Hash table-based unique,
therefore does NOT sort.
```

So for Python, we'll likely have to do the regular trick of tacking on a sequence column and later sorting by it, at least until an
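The "tack on a sequence column" trick mentioned here can be sketched in host-side C++ (a simplified stand-in for libcudf's column machinery; `unique_stable` is a name invented for this sketch): pair each value with its original row index, deduplicate with a hash map that keeps the first index seen, then sort the survivors by that index to recover order of appearance, matching pandas' hash-based `unique`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// Hash-based unique that preserves order of first appearance, mimicking
// pandas.Series.unique on the host. The original-index column plays the
// role of the "sequence column" discussed above.
std::vector<int> unique_stable(std::vector<int> const& in) {
  std::unordered_map<int, std::size_t> first_seen;  // value -> first index
  for (std::size_t i = 0; i < in.size(); ++i) {
    first_seen.emplace(in[i], i);  // emplace keeps the earliest index only
  }
  std::vector<std::pair<std::size_t, int>> tagged;
  for (auto const& [v, i] : first_seen) tagged.emplace_back(i, v);
  // Sorting by the sequence column restores order of appearance.
  std::sort(tagged.begin(), tagged.end());
  std::vector<int> out;
  for (auto const& [i, v] : tagged) out.push_back(v);
  return out;
}
```

The extra sort here runs over only the already-deduplicated rows, which is the appeal of the trick when duplicates are common.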
One more thing: I've found that for a very small subset (single column, non-nullable, integer representation) sorting is actually much faster than hashing. But I suppose that can become an implementation detail in
Given the Python observation, why not call
I think it was actually called

Other side note: it is regarded as a mistake that we called our containers:

instead of:

The "sortedness" is an extra requirement that the user should only have to think about if they want it. "Unsortedness" should be the default (Python got it right with
I don't know of such classes in libcudf. "we"?
But now Jake is suggesting changing the non-ordering semantics to match
But the behavior of
Hmmm, perhaps I misunderstood @shwina's message. What does Pandas `unique` return for `[4, 4, 0, 1, 2, 2, 4, 4]`? `std::unique` would return `[4, 0, 1, 2, 4]`.
https://pandas.pydata.org/docs/reference/api/pandas.unique.html
The fact that it calls out that it uses a hash table means it's doing a total unique.
Pandas would return `[4, 0, 1, 2]`, the unique values in order of first appearance.
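For concreteness, the `std::unique` side of this comparison can be checked with a few lines of standard C++ (`adjacent_unique` is just an illustrative wrapper, not a cudf API):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// std::unique drops only *consecutive* duplicates, so in unsorted input a
// value that reappears after a gap survives (here the trailing run of 4s).
std::vector<int> adjacent_unique(std::vector<int> v) {
  v.erase(std::unique(v.begin(), v.end()), v.end());
  return v;
}
```

On `[4, 4, 0, 1, 2, 2, 4, 4]` this yields `[4, 0, 1, 2, 4]`, whereas a hash-based, order-of-appearance unique yields `[4, 0, 1, 2]`.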
"We" refers to the C++ community at large. And it was just a general comment about the "unordered_" prefix that is used in the C++ Standard Library.
Let's make sure to make our docs more clear. Understanding what our APIs do shouldn't require inference from implementation details. |
As a promising anecdote, @gaohao95 has already experimented with a hash-based solution of
@jrhemstad Is there any draft PR yet? I am implementing
This PR adds support for `cudf::contains` so we can check whether a structs column contains a scalar struct element. Partially addresses #8965. This does not support checking if structs given in a structs column exist in another structs column. Such cases will be supported when the new data structure mentioned in #9413 is merged into cudf.

Authors:
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- Mike Wilson (https://github.com/hyperbolic2346)
- MithunR (https://github.com/mythrocks)

URL: #9929
Related to #9413. This PR adds `unordered_drop_duplicates`/`unordered_distinct_count` APIs using hash-based algorithms. It doesn't close the original issue, since adding a `std::unique`-like `drop_duplicates` is not addressed in this PR. It involves several changes:

- [x] Change the behavior of the existing `distinct_count`: count the number of consecutive groups of equivalent rows instead of the total number of unique rows.
- [x] Add hash-based `unordered_distinct_count`: this new API counts unique rows across the whole table by using a hash map. It requires a newer version of `cuco` with bug fixes: NVIDIA/cuCollections#132 and NVIDIA/cuCollections#138.
- [x] Add hash-based `unordered_drop_duplicates`: similar to `drop_duplicates`, but this API doesn't support the `keep` option and its output is in an unspecified order.
- [x] Replace all the cpp-side `drop_duplicates`/`distinct_count` use cases with the `unordered_` versions.
- [x] Update and replace the existing compaction benchmark with `nvbench`.

Authors:
- Yunsong Wang (https://github.com/PointKernel)

Approvers:
- https://github.com/brandon-b-miller
- Bradley Dice (https://github.com/bdice)
- Nghia Truong (https://github.com/ttnghia)
- Robert Maynard (https://github.com/robertmaynard)

URL: #10030
Closes #9413. Depends on #10387. There are several changes involved in this PR:

- Refactors `cudf::drop_duplicates` to match `std::unique`'s behavior and renames it `cudf::unique`. `cudf::unique` creates a table by removing duplicate rows in each consecutive group of equivalent rows of the input.
- Renames `cudf::unordered_drop_duplicates` to `cudf::distinct`. `cudf::distinct` creates a table by keeping unique rows across the whole input table. Unique rows in the new table are in an unspecified order due to the nature of hash-based algorithms.
- Renames `cudf::unordered_distinct_count` to `cudf::distinct_count`: the count of `cudf::distinct`.
- Renames `cudf::distinct_count` to `cudf::unique_count`: the count of `cudf::unique`.
- Updates corresponding tests and benchmarks.
- Updates related JNI/Cython bindings. In order not to break the existing behavior in Java and Python, the JNI and Cython bindings of `drop_duplicates` are updated to stably sort the input table first and then call `cudf::unique`.

Performance hints for `cudf::unique` and `cudf::distinct`:

- If the input is pre-sorted, use `cudf::unique`.
- If the input is **not** pre-sorted and the behavior of `pandas.DataFrame.drop_duplicates` is desired:
  - If `keep` control (keep the first, last, or none of the duplicates) doesn't matter, use the hash-based `cudf::distinct`.
  - If `keep` control is required, stable-sort the input and then call `cudf::unique`.

Authors:
- Yunsong Wang (https://github.com/PointKernel)

Approvers:
- Bradley Dice (https://github.com/bdice)
- https://github.com/brandon-b-miller
- MithunR (https://github.com/mythrocks)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #10370
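The renamed pair of counting APIs can be illustrated with host-side standard-library analogues (illustrative only: the real `cudf::unique_count`/`cudf::distinct_count` operate on device tables, and these helper names are made up for the sketch):

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// unique_count analogue: number of consecutive groups of equal values.
// A single linear scan, ideal when the input is already sorted.
std::size_t unique_count(std::vector<int> const& v) {
  std::size_t n = 0;
  for (std::size_t i = 0; i < v.size(); ++i)
    if (i == 0 || v[i] != v[i - 1]) ++n;
  return n;
}

// distinct_count analogue: number of unique values across the whole input.
// Hash-based, so no sort is required and no order is implied.
std::size_t distinct_count(std::vector<int> const& v) {
  return std::unordered_set<int>(v.begin(), v.end()).size();
}
```

On unsorted data the two disagree, which is exactly why the old overloaded naming was confusing: for `[4, 4, 0, 1, 2, 2, 4, 4]` the group count is 5 but the distinct count is 4.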
**Is your feature request related to a problem? Please describe.**
#9411 made me take a closer look at `cudf::drop_duplicates` and `cudf::distinct_count`. Unlike `std::unique`, both of these APIs will drop/count all unique rows across the entire table (as opposed to only contiguous equivalent rows). On the surface, this seems convenient for a user because they don't have to worry about sorting their dataframe if they want to get only the unique rows. However, it has an insidious impact on performance.

Imagine if you were to call `distinct_count` and then `drop_duplicates`. The way these functions are currently implemented, they both require doing a sort of the inputs. Instead, if `distinct_count` and `drop_duplicates` worked like `std::unique` and the user had to first sort the input, then only one sort would be needed. Alternatively, the data may already be sorted (as is the case with Python indexes), in which case no sort would be necessary. The current behavior is very sub-optimal for performance as it can require 2 redundant multi-column sorts. Multi-column sorts are among the most expensive operations in libcudf, so this is a bad thing.
(Furthermore, if the data isn't already sorted, a sort-based implementation is likely to be much slower than a hash-based one, so we should look at refactoring these implementations.)
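The "sort once, reuse it" argument above can be sketched with the standard library (a host-side analogy; `count_then_dedup` is a made-up helper name): with `std::unique`-style semantics the caller pays for a single sort, after which both the distinct count and the deduplicated output fall out of linear scans.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <utility>
#include <vector>

// Sort once up front; both operations then reuse the same ordering instead
// of each API sorting internally, which is the redundancy the issue
// complains about.
std::pair<std::size_t, std::vector<int>> count_then_dedup(std::vector<int> v) {
  std::sort(v.begin(), v.end());                // the only sort
  auto last = std::unique(v.begin(), v.end());  // linear scan on sorted data
  std::size_t n = static_cast<std::size_t>(std::distance(v.begin(), last));
  v.erase(last, v.end());                       // dedup reuses the same scan
  return {n, v};
}
```

In libcudf terms, the sort is a multi-column table sort, so eliminating even one redundant pass is a significant win.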
**Describe the solution you'd like**

I'd like to do two things:

1. Update `drop_duplicates`/`distinct_count` to work like `std::unique`, i.e., only consider contiguous equivalent elements. See `groupby` for an example of how we've handled this there.
2. Add new `unordered_drop_duplicates` and `unordered_distinct_count` APIs that behave like the current APIs. Python should then use the `unordered_*` algorithms as it will likely be much faster.

**Describe alternatives you've considered**
One alternative would be keeping the current total-deduplication semantics and only switching to a hash-based implementation. While this is certainly an option, I think that the behavior I described above would be more canonical for C++.
If the data is already sorted, there's a good chance a sort-based implementation could be faster than a hash-based one, but we can test that and see.