Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use faster hashing approach for first CategoricalVector grouping key #1565

Merged
merged 2 commits into from
Nov 1, 2018

Conversation

nalimilan
Copy link
Member

Since hash takes as an argument the hash for the previous keys, we cannot in general reuse precomputed hashes. However, this is possible for the first grouping key since the previous hash is always zero in that case. Since the number of keys is generally small, it makes a large difference to improve performance for this case.

This makes groupby twice faster on a simple benchmark with CategoricalString. Of course the difference is much smaller with CategoricalValue{Int}.

using DataFrames, BenchmarkTools
v = categorical(["AZERT", "AERDD", "DGGI", "RRFKLK", "4FGFD"][rand(1:5, 10_000)])

df = DataFrame(rand(10_000, 10))
df.key = v

# Master
julia> @btime groupby(df, :key);
  316.541 μs (49 allocations: 365.33 KiB)

# This PR
julia> @btime groupby(df, :key);
  153.849 μs (51 allocations: 365.47 KiB)

I suspect it would be possible to make this much faster by taking advantage of the fact that CategoricalArray references already represent groupings. But that would require making the code slightly more complex, by special-casing single-key groupings, and distinguishing grouping (where comparison happens with the same column) from joining (where the second column may not be categorical).

Since hash() takes as an argument the hash for the previous keys, we cannot in general
reuse precomputed hashes. However, this is possible for the first grouping key since the
previous hash is always zero in that case. Since the number of keys is generally small,
it makes a large difference to improve performance for this case.
@bkamins
Copy link
Member

bkamins commented Oct 30, 2018

Why can't we hash v.refs directly? I understand that then we get a different hash value in case of CategoricalString than in case we converted it to a string (same withe CategoricalValue), but is it a problem?

@nalimilan
Copy link
Member Author

Yes it's a problem when joining if we have String in one table and CategoricalString in another one. But I'm working on a PR which uses the refs directly for grouping (and not for joining).

@bkamins
Copy link
Member

bkamins commented Nov 1, 2018

@nalimilan Given h2oai/db-benchmark#30 (comment) what do you think should be our sequence of actions in:

so that we can give @jangorecki a clear recommendation on the recommended workflow after we merge the relevant changes?

CC @pdeffebach

@bkamins
Copy link
Member

bkamins commented Nov 1, 2018

it's a problem when joining if we have String in one table and CategoricalString in another one

good point. But we could have a special path in code checking for this (probably this is not a top priority).

@nalimilan
Copy link
Member Author

I think we can merge this, then I'll open a PR to make further improvements. #1520 is quite orthogonal even if related. JuliaData/DataFramesMeta.jl#101 is done, but indeed we should be able to make things quite faster in DataFramesMeta by passing a subset of columns as a tuple to an anonymous function. I'm not sure where @pdeffebach still plans to work on this, but it shouldn't be hard to do.

Anyway we should probably wait until all of this is merged and released before pinging @jangorecki again.

Copy link
Member

@bkamins bkamins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@pdeffebach
Copy link
Contributor

I'm going to try and avoid PRs until after December 4th because of grad apps. But after then I'm open to working on whatever is needed.

I think there remain some open questions about where "find which variables are used in the command". For DataFramesMeta, we collect a Dict of all symbols anyways, so it's easier there. DataFramesMeta is probably a good place to start for this approach and the see what can be ported to DataFrames. proper.

@nalimilan nalimilan merged commit 9bea863 into master Nov 1, 2018
@nalimilan nalimilan deleted the nl/hash branch November 1, 2018 20:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants