-
Notifications
You must be signed in to change notification settings - Fork 369
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve performance of by() using NamedTuples #1520
Conversation
671e69a
to
d69e3f2
Compare
src/groupeddataframe/grouping.jl
Outdated
m = length(first) | ||
n = length(gd) | ||
idx = Vector{Int}(undef, n) | ||
initialcols = ntuple(i -> Vector{typeof(first[i])}(undef, n), m) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about CategoricalValues here? Don't we want to use some kind of similar
construct? In Tables.jl, we have allocatecolumn
which can be overloaded by custom scalars to provide an alternative AbstractVector type. It'd be nice possibly have something like vectortype(scalar::T) = Vector{T}
in Base eventually so this could be used throughout the package ecosystem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. It's ironic I missed that point. :-)
I've added a commit to use similar
for the data frame case, but as you note it can't be solved for the scalar case without calling allocatecolumn
. I guess we can wait until your Tables.jl PR is merged before merging this one. Do you think it would be OK for Tables.jl to implement the necessary method? It could use Require.jl to avoid a hard dependency on CategoricalArrays.
I also agree it would be useful to have something like this in Base.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, Tables.jl already defines the right methods for both CategoricalValue & WeakRefString.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect. So I can update the PR once you've merged yours.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. That allowed simplifying constructors a bit too.
|
||
function _combine(first::AbstractDataFrame, f::Function, gd::GroupedDataFrame) | ||
m = size(first, 2) | ||
idx = Vector{Int}() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Int[]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather keep it that way for consistency with the method above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't really look into in-depth "correctness" details here as I'm not very familiar w/ the existing code, but the approach looks solid. It's great how a little bit of type uncertainty can actually be really great for performance.
I haven't looked in depth to this specific case but I wanted to say that I would generally be in favor of having "universally useful" methods in Tables and relying or them (to avoid code duplication). In particular there are |
Yeah, I've thought about reusing code too, but AFAICT the situation is too specific here. Note that I couldn't even share the code between the data frame method and the named tuple method. Also the user-provided function doesn't operate on a single row, but on a whole table, and it can return multiple types. Maybe this function could be moved to Tables.jl or another generic package, but it would remain specialized for grouping operations (even if it worked on any kind of table). |
One point I should have mentioned is the choice to replace |
1884bb8
to
657123c
Compare
c6a27f4
to
8a67a81
Compare
8a67a81
to
256bd1d
Compare
I've pushed two new commits. The first one allows using names in a different order from one group to the other. This is currently supported even if it affects performance a bit. The second one reintroduces Unfortunately, the new code crashes Julia with "Illegal instruction" (JuliaLang/julia#29430). |
@nalimilan Is this ready for a review (I am asking because tests are failing) |
I think so, or at least for a discussion of the design, but it can't be merged until the Julia bug is fixed and released (or we find a workaround). |
@nalimilan maybe you will want to comment more on h2oai/db-benchmark#30 |
src/groupeddataframe/grouping.jl
Outdated
@@ -87,7 +90,7 @@ function groupby(df::AbstractDataFrame, cols::Vector; | |||
permute!(df_groups.starts, group_perm) | |||
Base.permute!!(df_groups.stops, group_perm) | |||
end | |||
GroupedDataFrame(df, cols, df_groups.rperm, | |||
GroupedDataFrame(df, DataFrames.index(df)[cols], df_groups.rperm, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is DataFrames.
needed here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also index(df)
assumes that cols
are symbols and they can be integers or vector of bools.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAICT integers and Bools are accepted. Good point about the unnecessary prefix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right - I do not know why it threw an error when I checked this.
src/groupeddataframe/grouping.jl
Outdated
wrap(A::Matrix) = convert(DataFrame, A) | ||
wrap(s::Any) = DataFrame(x1 = s) | ||
wrap(s::Union{AbstractVector, Tuple}) = DataFrame(x1 = s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the idea of wrapping a tuple like this?
As the code gives us:
julia> DataFrame(x=(1,2,3))
1×1 DataFrame
│ Row │ x │
│ │ Tuple… │
├─────┼───────────┤
│ 1 │ (1, 2, 3) │
(why do we prefer DataFrame
to NamedTuple
in this case?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. I think I incorrectly assumed that the existing code treated tuples like vectors, but clearly it doesn't. I'll change that to use NamedTuple
as it's more efficient.
|
Remove GroupApplied and deprecate combine in favor of map(f, ::GroupedDataFrame). This avoids storing a copy of the per-group data returned by the user-provided function. Take advantage of this by allowing that function to return a NamedTuple. Introduce two completely different code paths depending on whether the first returned object is DataFrame or a NamedTuple, as the latter allows for more efficient operation by assuming that it represents a single row. Use the same progressive eltype widening approach as Base.map so that we fill column vectors whose types are known inside the kernel functions. This does not eliminate the type unstability due to the fact that the user-provided function takes a DataFrame, but ensuring type stability for half of the operations still improves performance significantly. Also parameterize GroupedDataFrame on the type of data frame it wraps, and make its column index have a concrete type. Deprecate an old map method for SubDataFrame. Fix a type unstability in hcat!.
6c5db1c
to
4aea51a
Compare
Fixed. Good to go now? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good to merge (apart from a single comment on documentation I have left).
Remove
GroupApplied
and deprecate combine in favor ofmap(f, ::GroupedDataFrame)
. This avoids storing a copy of the per-group data returned by the user-provided function. Take advantage of this by allowing that function to return aNamedTuple
. Introduce two completely different code paths depending on whether the first returned object isDataFrame
or aNamedTuple
, as the latter allows for more efficient operation by assuming that it represents a single row. Use the same progressive eltype widening approach asBase.map
so that we fill column vectors whose types are known inside the kernel functions. This does not eliminate the type unstability due to the fact that the user-provided function takes aDataFrame
, but ensuring type stability for half of the operations still improves performance significantly.Also parameterize
GroupedDataFrame
on the type of data frame it wraps, and make its column indexhave a concrete type. Deprecate an old map method for
SubDataFrame
. Fix a type unstability in hcat!.The new approach is about 10× faster for a simple sum when the number of groups is large.
There's a slight slowdown the first timeEDIT: that's incorrect, I had benchmarked the wrong functions on master: the new approach is always faster even including compilation.map
is called, but it disappears for the second call even for a different function (anyway each anonymous function is unique).CC: @quinnj @piever who are familiar with these eltype widening tricks
Fixes #1472, #1532, #1520.