-
Notifications
You must be signed in to change notification settings - Fork 369
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use faster hashing approach for first CategoricalVector grouping key #1565
Conversation
Since hash() takes as an argument the hash for the previous keys, we cannot in general reuse precomputed hashes. However, this is possible for the first grouping key since the previous hash is always zero in that case. Since the number of keys is generally small, it makes a large difference to improve performance for this case.
Why can't we hash |
Yes it's a problem when joining if we have |
@nalimilan Given h2oai/db-benchmark#30 (comment) what do you think should be our sequence of actions in:
so that we can give @jangorecki a clear recommendation on the recommended workflow after we merge the relevant changes? CC @pdeffebach |
good point. But we could have a special path in code checking for this (probably this is not a top priority). |
I think we can merge this, then I'll open a PR to make further improvements. #1520 is quite orthogonal even if related. JuliaData/DataFramesMeta.jl#101 is done, but indeed we should be able to make things quite faster in DataFramesMeta by passing a subset of columns as a tuple to an anonymous function. I'm not sure where @pdeffebach still plans to work on this, but it shouldn't be hard to do. Anyway we should probably wait until all of this is merged and released before pinging @jangorecki again. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
I'm going to try and avoid PRs until after December 4th because of grad apps. But after then I'm open to working on whatever is needed. I think there remain some open questions about where "find which variables are used in the command". For DataFramesMeta, we collect a Dict of all symbols anyways, so it's easier there. DataFramesMeta is probably a good place to start for this approach and the see what can be ported to DataFrames. proper. |
Since
hash
takes as an argument the hash for the previous keys, we cannot in general reuse precomputed hashes. However, this is possible for the first grouping key since the previous hash is always zero in that case. Since the number of keys is generally small, it makes a large difference to improve performance for this case.This makes
groupby
twice faster on a simple benchmark withCategoricalString
. Of course the difference is much smaller withCategoricalValue{Int}
.I suspect it would be possible to make this much faster by taking advantage of the fact that
CategoricalArray
references already represent groupings. But that would require making the code slightly more complex, by special-casing single-key groupings, and distinguishing grouping (where comparison happens with the same column) from joining (where the second column may not be categorical).