Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of by() using NamedTuples #1520

Merged
merged 19 commits into from
Nov 8, 2018
Merged

Conversation

nalimilan
Copy link
Member

@nalimilan nalimilan commented Sep 20, 2018

Remove GroupApplied and deprecate combine in favor of map(f, ::GroupedDataFrame). This avoids storing a copy of the per-group data returned by the user-provided function. Take advantage of this by allowing that function to return a NamedTuple. Introduce two completely different code paths depending on whether the first returned object is DataFrame or a NamedTuple, as the latter allows for more efficient operation by assuming that it represents a single row. Use the same progressive eltype widening approach as Base.map so that we fill column vectors whose types are known inside the kernel functions. This does not eliminate the type unstability due to the fact that the user-provided function takes a DataFrame, but ensuring type stability for half of the operations still improves performance significantly.

Also parameterize GroupedDataFrame on the type of data frame it wraps, and make its column index
have a concrete type. Deprecate an old map method for SubDataFrame. Fix a type unstability in hcat!.

The new approach is about 10× faster for a simple sum when the number of groups is large. There's a slight slowdown the first time map is called, but it disappears for the second call even for a different function (anyway each anonymous function is unique). EDIT: that's incorrect, I had benchmarked the wrong functions on master: the new approach is always faster even including compilation.

df = DataFrame(a = repeat(1:40000, outer=[20]),
               b = repeat(20000:-1:1, outer=[40]),
               c = randn(800000))
gd = groupby(df, :a)
using BenchmarkTools, Statistics

# Master

julia> @time combine(map(d -> mean(d[:c]), gd)); # Warm up using a different function
  3.185602 seconds (11.40 M allocations: 638.859 MiB, 12.20% gc time)

julia> @time combine(map(d -> sum(d[:c]), gd));
  0.369755 seconds (2.19 M allocations: 143.901 MiB, 27.63% gc time)

julia> @btime combine(map(d -> sum(d[:c]), gd));
  293.117 ms (2117577 allocations: 140.35 MiB)

julia> @time combine(map(d -> (s=sum(d[:c]), m=mean(d[:b]), std=std(d[:b])), gd));
  0.936598 seconds (3.88 M allocations: 239.738 MiB, 20.60% gc time)

julia> @btime combine(map(d -> (s=sum(d[:c]), m=mean(d[:b]), std=std(d[:b])), gd));
  368.254 ms (2437577 allocations: 169.04 MiB)

julia> @time combine(map(d -> DataFrame(x=sum(d[:c])), gd));
  0.444001 seconds (2.34 M allocations: 148.652 MiB, 32.80% gc time)

julia> @btime combine(map(d -> DataFrame(x=sum(d[:c])), gd));
  315.268 ms (2197577 allocations: 141.57 MiB)

# This PR

julia> @time combine(d -> mean(d[:c]), gd); # Warm up using a different function
  0.795786 seconds (2.59 M allocations: 135.423 MiB, 14.48% gc time)

julia> @time combine(d -> sum(d[:c]), gd);
  0.075818 seconds (472.99 k allocations: 28.381 MiB, 4.06% gc time)

julia> @btime combine(d -> sum(d[:c]), gd);
  15.243 ms (399571 allocations: 24.72 MiB)

julia> @time combine(d -> (s=sum(d[:c]), m=mean(d[:b]), std=std(d[:b])), gd);
  0.443895 seconds (1.95 M allocations: 109.046 MiB, 4.72% gc time)

julia> @btime combine(d -> (s=sum(d[:c]), m=mean(d[:b]), std=std(d[:b])), gd);
  51.372 ms (879580 allocations: 53.40 MiB)

julia> @time combine(d -> DataFrame(x=sum(d[:c])), gd);
  0.521432 seconds (2.68 M allocations: 172.153 MiB, 5.70% gc time)

julia> @btime combine(d -> DataFrame(x=sum(d[:c])), gd);
  202.689 ms (1999599 allocations: 137.80 MiB)

CC: @quinnj @piever who are familiar with these eltype widening tricks

Fixes #1472, #1532, #1520.

@nalimilan nalimilan force-pushed the nl/grouping2 branch 2 times, most recently from 671e69a to d69e3f2 Compare September 20, 2018 12:01
m = length(first)
n = length(gd)
idx = Vector{Int}(undef, n)
initialcols = ntuple(i -> Vector{typeof(first[i])}(undef, n), m)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about CategoricalValues here? Don't we want to use some kind of similar construct? In Tables.jl, we have allocatecolumn which can be overloaded by custom scalars to provide an alternative AbstractVector type. It'd be nice possibly have something like vectortype(scalar::T) = Vector{T} in Base eventually so this could be used throughout the package ecosystem.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. It's ironic I missed that point. :-)

I've added a commit to use similar for the data frame case, but as you note it can't be solved for the scalar case without calling allocatecolumn. I guess we can wait until your Tables.jl PR is merged before merging this one. Do you think it would be OK for Tables.jl to implement the necessary method? It could use Require.jl to avoid a hard dependency on CategoricalArrays.

I also agree it would be useful to have something like this in Base.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, Tables.jl already defines the right methods for both CategoricalValue & WeakRefString.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect. So I can update the PR once you've merged yours.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. That allowed simplifying constructors a bit too.


function _combine(first::AbstractDataFrame, f::Function, gd::GroupedDataFrame)
m = size(first, 2)
idx = Vector{Int}()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Int[]?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather keep it that way for consistency with the method above.

Copy link
Member

@quinnj quinnj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't really look into in-depth "correctness" details here as I'm not very familiar w/ the existing code, but the approach looks solid. It's great how a little bit of type uncertainty can actually be really great for performance.

@piever
Copy link

piever commented Sep 20, 2018

I haven't looked in depth to this specific case but I wanted to say that I would generally be in favor of having "universally useful" methods in Tables and relying or them (to avoid code duplication).

In particular there are buildcolumns(::Nothing, itr) in Tables (which is an improved version of collect_columns from IndexedTables) and collect_colums_flattened here in IndexedTables to collect an iterable of iterables of NamedTuples into a table with columnar storage. I wonder if some of that could be useful here. It'd be nicer if at least for basic things (collecting grouped data into normal data) the machinery was shared between here and IndexedTables. Vice versa, if the method here is superior, it be nice to keep the "basic tools" in Tables so that IndexedTables can rely on them (to the extent possible of course).

@nalimilan
Copy link
Member Author

Yeah, I've thought about reusing code too, but AFAICT the situation is too specific here. Note that I couldn't even share the code between the data frame method and the named tuple method. Also the user-provided function doesn't operate on a single row, but on a whole table, and it can return multiple types. Maybe this function could be moved to Tables.jl or another generic package, but it would remain specialized for grouping operations (even if it worked on any kind of table).

@nalimilan
Copy link
Member Author

One point I should have mentioned is the choice to replace combine with map for GroupedDataFrames. I think it makes sense since mapping a function to each group in a GroupedDataFrame cannot return a GroupedDataFrame. It could return a vector with the results, but that's not very useful.

@nalimilan
Copy link
Member Author

I've pushed two new commits. The first one allows using names in a different order from one group to the other. This is currently supported even if it affects performance a bit.

The second one reintroduces combine, and changes map to return a GroupedDataFrame (where it currently returns a GroupApplied). I think this makes sense as the grouping information is preserved, allowing for further operations if one wishes. The code is mostly the same between the two functions anyway, it's more a matter of API. In the longer term, we could imagine making GroupedDataFrame an AbstractDataFrame, which would allow behaviors similar to dplyr.

Unfortunately, the new code crashes Julia with "Illegal instruction" (JuliaLang/julia#29430).

This was referenced Oct 4, 2018
@bkamins
Copy link
Member

bkamins commented Oct 5, 2018

@nalimilan Is this ready for a review (I am asking because tests are failing)

@nalimilan
Copy link
Member Author

I think so, or at least for a discussion of the design, but it can't be merged until the Julia bug is fixed and released (or we find a workaround).

@bkamins
Copy link
Member

bkamins commented Oct 5, 2018

@nalimilan maybe you will want to comment more on h2oai/db-benchmark#30

@@ -87,7 +90,7 @@ function groupby(df::AbstractDataFrame, cols::Vector;
permute!(df_groups.starts, group_perm)
Base.permute!!(df_groups.stops, group_perm)
end
GroupedDataFrame(df, cols, df_groups.rperm,
GroupedDataFrame(df, DataFrames.index(df)[cols], df_groups.rperm,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is DataFrames. needed here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also index(df) assumes that cols are symbols and they can be integers or vector of bools.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT integers and Bools are accepted. Good point about the unnecessary prefix.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right - I do not know why it threw an error when I checked this.

wrap(A::Matrix) = convert(DataFrame, A)
wrap(s::Any) = DataFrame(x1 = s)
wrap(s::Union{AbstractVector, Tuple}) = DataFrame(x1 = s)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the idea of wrapping a tuple like this?

As the code gives us:

julia> DataFrame(x=(1,2,3))
1×1 DataFrame
│ Row │ x         │
│     │ Tuple…    │
├─────┼───────────┤
│ 1   │ (1, 2, 3) │

(why do we prefer DataFrame to NamedTuple in this case?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. I think I incorrectly assumed that the existing code treated tuples like vectors, but clearly it doesn't. I'll change that to use NamedTuple as it's more efficient.

@bkamins
Copy link
Member

bkamins commented Nov 6, 2018

map should also accept Type not only Function.

@bkamins bkamins mentioned this pull request Nov 6, 2018
Remove GroupApplied and deprecate combine in favor of map(f, ::GroupedDataFrame). This avoids
storing a copy of the per-group data returned by the user-provided function. Take advantage of
this by allowing that function to return a NamedTuple. Introduce two completely different
code paths depending on whether the first returned object is DataFrame or a NamedTuple, as
the latter allows for more efficient operation by assuming that it represents a single row.
Use the same progressive eltype widening approach as Base.map so that we fill column vectors
whose types are known inside the kernel functions. This does not eliminate the type unstability
due to the fact that the user-provided function takes a DataFrame, but ensuring type stability for
half of the operations still improves performance significantly.

Also parameterize GroupedDataFrame on the type of data frame it wraps, and make its column index
have a concrete type. Deprecate an old map method for SubDataFrame. Fix a type unstability in hcat!.
@nalimilan
Copy link
Member Author

Fixed. Good to go now?

Copy link
Member

@bkamins bkamins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to merge (apart from a single comment on documentation I have left).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants