Use faster hashing approach for first CategoricalVector grouping key #1565

nalimilan · 2018-10-13T20:56:27Z

Since hash takes as an argument the hash for the previous keys, we cannot in general reuse precomputed hashes. However, this is possible for the first grouping key since the previous hash is always zero in that case. Since the number of keys is generally small, it makes a large difference to improve performance for this case.
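To illustrate the point about seeds, here is a minimal sketch using only Base. The `levels` and `refs` vectors are hypothetical stand-ins for a categorical column's pool and integer codes; this is not the code from the PR. It only shows why a per-level hash table is reusable when the seed is a constant zero.

```julia
# Hypothetical stand-in for a categorical column: a small pool of levels
# plus one integer reference per row (not the actual CategoricalArrays types).
levels = ["AZERT", "AERDD", "DGGI", "RRFKLK", "4FGFD"]
refs = rand(1:length(levels), 10_000)

# First grouping key: the seed is always zero, so each level is hashed once...
level_hashes = [hash(l, zero(UInt)) for l in levels]
# ...and every row just looks up the precomputed value through its reference.
row_hashes = level_hashes[refs]

# Reference result: hash the value of every row directly with the same seed.
direct = [hash(levels[r], zero(UInt)) for r in refs]
@assert row_hashes == direct

# For a second key the seed would be the hash of the previous keys, which
# varies from row to row, so a table computed with a zero seed no longer applies.
```

With five levels and 10,000 rows, the string hashing work drops from 10,000 calls to 5, which is presumably why the gain is large for strings and small for integers.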

This makes groupby twice as fast on a simple benchmark with CategoricalString. Of course, the difference is much smaller with CategoricalValue{Int}.

using DataFrames, BenchmarkTools
v = categorical(["AZERT", "AERDD", "DGGI", "RRFKLK", "4FGFD"][rand(1:5, 10_000)])

df = DataFrame(rand(10_000, 10))
df.key = v

# Master
julia> @btime groupby(df, :key);
  316.541 μs (49 allocations: 365.33 KiB)

# This PR
julia> @btime groupby(df, :key);
  153.849 μs (51 allocations: 365.47 KiB)

I suspect it would be possible to make this much faster by taking advantage of the fact that CategoricalArray references already represent groupings. But that would require making the code slightly more complex, by special-casing single-key groupings, and distinguishing grouping (where comparison happens with the same column) from joining (where the second column may not be categorical).
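A rough sketch of what "references already represent groupings" could mean for the single-key case, assuming the references are integers in `1:ngroups`. The function name `group_by_refs` is hypothetical and this is a plain counting sort, not necessarily what a follow-up PR would implement.

```julia
# Group rows by integer reference without hashing at all (counting sort).
# `refs` must contain values in 1:ngroups; the name is illustrative only.
function group_by_refs(refs::Vector{Int}, ngroups::Int)
    # Count how many rows fall in each group.
    counts = zeros(Int, ngroups)
    for r in refs
        counts[r] += 1
    end
    # First position of each group in the sorted row order.
    starts = cumsum(counts) .- counts .+ 1
    # Fill the permutation that puts rows of the same group next to each other.
    next = copy(starts)
    perm = Vector{Int}(undef, length(refs))
    for (i, r) in enumerate(refs)
        perm[next[r]] = i
        next[r] += 1
    end
    return perm, starts, counts
end

perm, starts, counts = group_by_refs([2, 1, 2, 3, 1], 3)
# perm   == [2, 5, 1, 3, 4]
# starts == [1, 3, 5]
# counts == [2, 2, 1]
```

This only works when both sides of the comparison share the same pool, which is the case for grouping but not for joining.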

bkamins · 2018-10-30T13:05:34Z

Why can't we hash v.refs directly? I understand that we would then get a different hash value for a CategoricalString than if we converted it to a string (same with CategoricalValue), but is that a problem?

nalimilan · 2018-11-01T14:00:18Z

Yes, it's a problem when joining if we have String in one table and CategoricalString in the other. But I'm working on a PR which uses the refs directly for grouping (and not for joining).
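A small sketch of the joining problem, using plain vectors as hypothetical stand-ins for the two tables' key columns (real references are typically unsigned integers, `Int` is used here for simplicity):

```julia
# Left table: a categorical-like key stored as a pool plus integer references.
levels = ["AZERT", "DGGI"]
refs = [2, 1, 2]
# Right table: the same key values stored as plain strings.
plain = ["DGGI", "AZERT", "DGGI"]

ref_hashes = [hash(r) for r in refs]            # hashes of the integer codes
value_hashes = [hash(levels[r]) for r in refs]  # hashes of the underlying strings
plain_hashes = [hash(s) for s in plain]

# Hashing the underlying values agrees with the plain String column...
@assert value_hashes == plain_hashes
# ...whereas hashes of the codes are hashes of unrelated integers, so equal
# keys would not be expected to land in the same bucket (barring collisions).
codes_match = ref_hashes == plain_hashes
```

So a hash table built from references can only be probed with references from the same pool.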

bkamins · 2018-11-01T15:41:13Z

@nalimilan Given h2oai/db-benchmark#30 (comment), what do you think our sequence of actions should be for:

- this PR
- Improve performance of by() using NamedTuples #1520
- the follow-up to Change the way grouped transforms work DataFramesMeta.jl#101 (is this finished, or is something still pending there, and should we make a release of DataFramesMeta.jl?)

so that we can give @jangorecki a clear recommendation on the workflow to use after we merge the relevant changes?

CC @pdeffebach

bkamins · 2018-11-01T15:43:03Z

> it's a problem when joining if we have String in one table and CategoricalString in another one

Good point. But we could have a special code path that checks for this (this is probably not a top priority).

nalimilan · 2018-11-01T15:59:18Z

I think we can merge this, then I'll open a PR to make further improvements. #1520 is quite orthogonal even if related. JuliaData/DataFramesMeta.jl#101 is done, but indeed we should be able to make things considerably faster in DataFramesMeta by passing a subset of columns as a tuple to an anonymous function. I'm not sure whether @pdeffebach still plans to work on this, but it shouldn't be hard to do.
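As a rough illustration of the "subset of columns as a tuple" idea, here is a sketch in plain Julia with no DataFrames or DataFramesMeta calls. The names `cols`, `needed` and `f` are hypothetical.

```julia
# Hypothetical table stored as a NamedTuple of column vectors.
cols = (a = [1.0, 2.0, 3.0], b = [4.0, 5.0, 6.0], c = ["x", "y", "z"])

# The user-supplied function only touches columns a and b.
f = nt -> sum(nt.a .* nt.b)

# Pass just the columns the function needs; their element types are part of
# the NamedTuple type, so the compiler can specialize the function on them.
needed = (a = cols.a, b = cols.b)
result = f(needed)  # 1*4 + 2*5 + 3*6 == 32.0
```

The open question is how to find out which columns a given function uses, so that only those are passed.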

Anyway we should probably wait until all of this is merged and released before pinging @jangorecki again.

bkamins

Thanks!

pdeffebach · 2018-11-01T17:27:48Z

I'm going to try to avoid PRs until after December 4th because of grad apps. After that, I'm open to working on whatever is needed.

I think there remain some open questions about where the "find which variables are used in the command" step should live. For DataFramesMeta, we collect a Dict of all symbols anyway, so it's easier there. DataFramesMeta is probably a good place to start for this approach, and then we can see what can be ported to DataFrames proper.

nalimilan mentioned this pull request Oct 13, 2018

WIP: Make hash(::CategoricalValue) faster by pre-computing hashes JuliaData/CategoricalArrays.jl#61

Open

bkamins approved these changes Nov 1, 2018

Hoist CategoricalArrays.index(v.pool) call

759c2e9

nalimilan merged commit 9bea863 into master Nov 1, 2018

nalimilan deleted the nl/hash branch November 1, 2018 20:32

nalimilan mentioned this pull request Nov 15, 2018

Add fast grouping method for CategoricalArray keys #1600

Merged
