-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed up DistinctCountAccumulator
#5472
Comments
I'm not intimately familiar with this operator, but if it needs to support tuples or nested types the row format might be an option. You can implement the operator using rows and convert to OwnedRow to store in a HashSet |
FYI @comphead For dictionary implementation see e.g. https://docs.rs/arrow-array/34.0.0/src/arrow_array/builder/primitive_dictionary_builder.rs.html#82 |
Hi @Dandandan I was crawling through the dictionary builder implementation. Please elaborate if we can try a simple |
I think we could make COUNT DISTINCT go much faster if implemented via |
I wonder how a A way to make it even faster is to use Also there are probably more opportunities to rewrite queries (i.e. rewrite |
I was thinking something like the following (templated on primitive type): struct CountDistinctGroupsAccumulator<T: ArrowPrimitiveType> {
values: HashSet<T>,
} Clearly we could then take it a step farther to use the Then use
that is also quite interesting |
I tried to prototype a solution. I think we need more than just storing a I think an efficient way to store the data for primitives could be the following:
|
Storing chained values seems like a neat idea -- if you plan to use it for the Join maybe we could make it its own data structure and reuse in both places. 🤔 |
Since #8827 has been merged(including the support for both Binary/LargeBinary), I think it's okay to close this one. |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Currently the code uses a
HashSet<ScalarValue>
to track unique keys, which is slow and uses more memory than needed.Describe the solution you'd like
Use a typed array for storing the entries & keep a hashmap.
We can use the same approach as present in the dictionary builders (
PrimitiveDictionaryBuilder
) or parquetInterner
contributed by @tustvold.Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.
Additional context
Add any other context or screenshots about the feature request here.
The text was updated successfully, but these errors were encountered: