Improve performance of by() using NamedTuples #1520

nalimilan · 2018-09-20T11:56:30Z

Remove GroupApplied and deprecate combine in favor of map(f, ::GroupedDataFrame). This avoids storing a copy of the per-group data returned by the user-provided function. Take advantage of this by allowing that function to return a NamedTuple. Introduce two completely different code paths depending on whether the first returned object is a DataFrame or a NamedTuple, as the latter allows for more efficient operation by assuming that it represents a single row. Use the same progressive eltype widening approach as Base.map so that we fill column vectors whose types are known inside the kernel functions. This does not eliminate the type instability due to the fact that the user-provided function takes a DataFrame, but ensuring type stability for half of the operations still improves performance significantly.

Also parameterize GroupedDataFrame on the type of data frame it wraps, and make its column index have a concrete type. Deprecate an old map method for SubDataFrame. Fix a type instability in hcat!.
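
For readers unfamiliar with progressive eltype widening, here is a minimal sketch of the general idea using only Base. It is not the code from this PR: the names map_widen and fill_widen! are made up for illustration, and typejoin is used as a simple widening rule where Base.map relies on its own internal rule. The point is that the inner loop runs with a concrete, known eltype until a value stops fitting, at which point the filled part is copied into a wider vector and the loop restarts on that vector.

```julia
# Hypothetical helpers for illustration only; not the PR's implementation.

# Fill `dest` from index `i` on. While values fit eltype(dest), the loop is
# type-stable. On the first value that does not fit, copy the filled part to
# a wider vector and continue there (one dynamic dispatch per widening).
function fill_widen!(dest::Vector{T}, f, xs::Vector, i::Int) where {T}
    while i <= length(xs)
        v = f(xs[i])
        if v isa T
            dest[i] = v
            i += 1
        else
            S = typejoin(T, typeof(v))   # assumption: simple widening rule
            newdest = Vector{S}(undef, length(dest))
            copyto!(newdest, 1, dest, 1, i - 1)
            newdest[i] = v
            return fill_widen!(newdest, f, xs, i + 1)
        end
    end
    return dest
end

function map_widen(f, xs::Vector)
    isempty(xs) && return Any[]
    v1 = f(xs[1])                        # the first result decides the initial eltype
    dest = Vector{typeof(v1)}(undef, length(xs))
    dest[1] = v1
    return fill_widen!(dest, f, xs, 2)
end

ints  = map_widen(x -> x, [1, 2, 3])                      # stays Vector{Int}
mixed = map_widen(x -> x > 2 ? x / 2 : x, [1, 2, 3, 4])   # widened once
```

In the PR the same pattern is applied per column, with the first value returned by the user-provided function determining the initial column types.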

The new approach is about 10× faster for a simple sum when the number of groups is large. ~~There's a slight slowdown the first time map is called, but it disappears for the second call even for a different function (anyway each anonymous function is unique).~~ EDIT: that's incorrect, I had benchmarked the wrong functions on master: the new approach is always faster even including compilation.

using DataFrames, BenchmarkTools, Statistics

df = DataFrame(a = repeat(1:40000, outer=[20]),
               b = repeat(20000:-1:1, outer=[40]),
               c = randn(800000))
gd = groupby(df, :a)

# Master

julia> @time combine(map(d -> mean(d[:c]), gd)); # Warm up using a different function
  3.185602 seconds (11.40 M allocations: 638.859 MiB, 12.20% gc time)

julia> @time combine(map(d -> sum(d[:c]), gd));
  0.369755 seconds (2.19 M allocations: 143.901 MiB, 27.63% gc time)

julia> @btime combine(map(d -> sum(d[:c]), gd));
  293.117 ms (2117577 allocations: 140.35 MiB)

julia> @time combine(map(d -> (s=sum(d[:c]), m=mean(d[:b]), std=std(d[:b])), gd));
  0.936598 seconds (3.88 M allocations: 239.738 MiB, 20.60% gc time)

julia> @btime combine(map(d -> (s=sum(d[:c]), m=mean(d[:b]), std=std(d[:b])), gd));
  368.254 ms (2437577 allocations: 169.04 MiB)

julia> @time combine(map(d -> DataFrame(x=sum(d[:c])), gd));
  0.444001 seconds (2.34 M allocations: 148.652 MiB, 32.80% gc time)

julia> @btime combine(map(d -> DataFrame(x=sum(d[:c])), gd));
  315.268 ms (2197577 allocations: 141.57 MiB)

# This PR

julia> @time combine(d -> mean(d[:c]), gd); # Warm up using a different function
  0.795786 seconds (2.59 M allocations: 135.423 MiB, 14.48% gc time)

julia> @time combine(d -> sum(d[:c]), gd);
  0.075818 seconds (472.99 k allocations: 28.381 MiB, 4.06% gc time)

julia> @btime combine(d -> sum(d[:c]), gd);
  15.243 ms (399571 allocations: 24.72 MiB)

julia> @time combine(d -> (s=sum(d[:c]), m=mean(d[:b]), std=std(d[:b])), gd);
  0.443895 seconds (1.95 M allocations: 109.046 MiB, 4.72% gc time)

julia> @btime combine(d -> (s=sum(d[:c]), m=mean(d[:b]), std=std(d[:b])), gd);
  51.372 ms (879580 allocations: 53.40 MiB)

julia> @time combine(d -> DataFrame(x=sum(d[:c])), gd);
  0.521432 seconds (2.68 M allocations: 172.153 MiB, 5.70% gc time)

julia> @btime combine(d -> DataFrame(x=sum(d[:c])), gd);
  202.689 ms (1999599 allocations: 137.80 MiB)

CC: @quinnj @piever who are familiar with these eltype widening tricks

Fixes #1472, #1532, #1520.

quinnj · 2018-09-20T13:34:34Z

src/groupeddataframe/grouping.jl

+    m = length(first)
+    n = length(gd)
+    idx = Vector{Int}(undef, n)
+    initialcols = ntuple(i -> Vector{typeof(first[i])}(undef, n), m)


What about CategoricalValues here? Don't we want to use some kind of similar construct? In Tables.jl, we have allocatecolumn which can be overloaded by custom scalars to provide an alternative AbstractVector type. It'd be nice to eventually have something like vectortype(scalar::T) = Vector{T} in Base so this could be used throughout the package ecosystem.

nalimilan · 2018-09-20

Right. It's ironic I missed that point. :-)

I've added a commit to use similar for the data frame case, but as you note it can't be solved for the scalar case without calling allocatecolumn. I guess we can wait until your Tables.jl PR is merged before merging this one. Do you think it would be OK for Tables.jl to implement the necessary method? It could use Require.jl to avoid a hard dependency on CategoricalArrays.

I also agree it would be useful to have something like this in Base.

quinnj · 2018-09-20

Yeah, Tables.jl already defines the right methods for both CategoricalValue & WeakRefString.

nalimilan · 2018-09-20

Perfect. So I can update the PR once you've merged yours.

nalimilan · 2018-09-21

Done. That allowed simplifying constructors a bit too.
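
To make the vectortype idea from this thread concrete, here is a rough sketch of how such an overloadable hook could look. Everything in it is hypothetical: vectortype only exists as a suggestion in the comment above, and Celsius, CelsiusVector and allocate are invented for the example. It is not the Tables.jl allocatecolumn implementation, only the same pattern of letting a scalar type choose its container.

```julia
# Hypothetical hook: the default container for a scalar type T is Vector{T}.
vectortype(::Type{T}) where {T} = Vector{T}

# A made-up scalar type with its own, more compact column type.
struct Celsius
    deg::Float64
end

struct CelsiusVector <: AbstractVector{Celsius}
    data::Vector{Float64}          # raw storage, wrapped on access
end
CelsiusVector(::UndefInitializer, n::Integer) = CelsiusVector(Vector{Float64}(undef, n))

Base.size(v::CelsiusVector) = size(v.data)
Base.IndexStyle(::Type{CelsiusVector}) = IndexLinear()
Base.getindex(v::CelsiusVector, i::Int) = Celsius(v.data[i])
Base.setindex!(v::CelsiusVector, x::Celsius, i::Int) = (v.data[i] = x.deg; v)

# The custom scalar opts in by overloading the hook.
vectortype(::Type{Celsius}) = CelsiusVector

# Generic code allocates columns without knowing about any specific scalar type.
allocate(::Type{T}, n::Integer) where {T} = vectortype(T)(undef, n)

plain  = allocate(Int, 3)        # a Vector{Int}
custom = allocate(Celsius, 2)    # a CelsiusVector
custom[1] = Celsius(21.5)
```

With a hook like this, the initialcols line in the diff above could ask the scalar type for its container instead of hard-coding Vector.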

quinnj · 2018-09-20T13:34:55Z

src/groupeddataframe/grouping.jl

+
+function _combine(first::AbstractDataFrame, f::Function, gd::GroupedDataFrame)
+    m = size(first, 2)
+    idx = Vector{Int}()


nalimilan · 2018-09-20

I'd rather keep it that way for consistency with the method above.

quinnj

I didn't really look into in-depth "correctness" details here as I'm not very familiar w/ the existing code, but the approach looks solid. It's great how a little bit of type uncertainty can actually be really great for performance.

piever · 2018-09-20T14:01:16Z

I haven't looked in depth into this specific case, but I wanted to say that I would generally be in favor of having "universally useful" methods in Tables and relying on them (to avoid code duplication).

In particular, there are buildcolumns(::Nothing, itr) in Tables (which is an improved version of collect_columns from IndexedTables) and collect_columns_flattened in IndexedTables to collect an iterable of iterables of NamedTuples into a table with columnar storage. I wonder if some of that could be useful here. It'd be nicer if, at least for basic things (collecting grouped data into normal data), the machinery was shared between here and IndexedTables. Vice versa, if the method here is superior, it would be nice to keep the "basic tools" in Tables so that IndexedTables can rely on them (to the extent possible, of course).

nalimilan · 2018-09-20T14:04:50Z

Yeah, I've thought about reusing code too, but AFAICT the situation is too specific here. Note that I couldn't even share the code between the data frame method and the named tuple method. Also the user-provided function doesn't operate on a single row, but on a whole table, and it can return multiple types. Maybe this function could be moved to Tables.jl or another generic package, but it would remain specialized for grouping operations (even if it worked on any kind of table).

nalimilan · 2018-09-20T17:55:31Z

One point I should have mentioned is the choice to replace combine with map for GroupedDataFrames. I think it makes sense since mapping a function to each group in a GroupedDataFrame cannot return a GroupedDataFrame. It could return a vector with the results, but that's not very useful.

nalimilan · 2018-09-29T16:17:19Z

I've pushed two new commits. The first one allows using names in a different order from one group to the other. This is currently supported even if it affects performance a bit.

The second one reintroduces combine, and changes map to return a GroupedDataFrame (where it currently returns a GroupApplied). I think this makes sense as the grouping information is preserved, allowing for further operations if one wishes. The code is mostly the same between the two functions anyway, it's more a matter of API. In the longer term, we could imagine making GroupedDataFrame an AbstractDataFrame, which would allow behaviors similar to dplyr.

Unfortunately, the new code crashes Julia with "Illegal instruction" (JuliaLang/julia#29430).
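
As a small illustration of the first commit mentioned above (names in a different order from one group to the other), here is one way to normalize NamedTuple field order in plain Julia. This is only a sketch of the idea, not the code in the commit; colnames, row1, row2 and reorder are made-up names. It relies on the Base constructor NamedTuple{names}(nt), which picks the fields of nt in the order given by names.

```julia
# Reference field order, e.g. taken from the first group's result.
colnames = (:s, :m)

row1 = (s = 1.0, m = 2.0)
row2 = (m = 4.0, s = 3.0)    # same fields, different order

# Select the fields of `nt` in the reference order.
reorder(nt::NamedTuple, names::Tuple) = NamedTuple{names}(nt)

rows = [reorder(r, colnames) for r in (row1, row2)]
```

Since the reordering needs an extra step per row, it is plausible that this is where the small performance cost mentioned above comes from, though the actual implementation may differ.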

bkamins · 2018-10-05T12:19:51Z

@nalimilan Is this ready for a review? (I am asking because tests are failing.)

nalimilan · 2018-10-05T12:23:53Z

I think so, or at least for a discussion of the design, but it can't be merged until the Julia bug is fixed and released (or we find a workaround).

bkamins · 2018-10-05T12:58:52Z

@nalimilan maybe you will want to comment more on h2oai/db-benchmark#30

bkamins · 2018-10-06T16:50:25Z

src/groupeddataframe/grouping.jl

@@ -87,7 +90,7 @@ function groupby(df::AbstractDataFrame, cols::Vector;
        permute!(df_groups.starts, group_perm)
        Base.permute!!(df_groups.stops, group_perm)
    end
-    GroupedDataFrame(df, cols, df_groups.rperm,
+    GroupedDataFrame(df, DataFrames.index(df)[cols], df_groups.rperm,


Is the DataFrames. prefix needed here?

Also, index(df) assumes that cols are symbols, but they can also be integers or a vector of Bools.

nalimilan · 2018-10-06

AFAICT integers and Bools are accepted. Good point about the unnecessary prefix.

bkamins · 2018-10-06

Right - I do not know why it threw an error when I checked this.

bkamins · 2018-10-06T16:59:32Z

src/groupeddataframe/grouping.jl

 wrap(A::Matrix) = convert(DataFrame, A)
-wrap(s::Any) = DataFrame(x1 = s)
+wrap(s::Union{AbstractVector, Tuple}) = DataFrame(x1 = s)


What is the idea of wrapping a tuple like this?

As written, the code gives us:

julia> DataFrame(x=(1,2,3))
1×1 DataFrame
│ Row │ x         │
│     │ Tuple…    │
├─────┼───────────┤
│ 1   │ (1, 2, 3) │

(Why do we prefer DataFrame to NamedTuple in this case?)

nalimilan · 2018-10-06

Hmm. I think I incorrectly assumed that the existing code treated tuples like vectors, but clearly it doesn't. I'll change that to use NamedTuple as it's more efficient.

bkamins · 2018-11-06T20:22:31Z

map should also accept Type not only Function.
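
A short sketch of why this matters: in Julia a type used as a constructor is callable, but it is not a subtype of Function, so a method restricted to f::Function rejects it. The example below uses made-up names (Total, apply_groups) and a vector of vectors as a stand-in for the groups; it only illustrates the Union{Function, Type} signature, not the actual method in the PR.

```julia
# A made-up result type whose constructor summarizes one group.
struct Total
    value::Float64
end
Total(xs::AbstractVector) = Total(sum(xs))

# Hypothetical stand-in for a per-group map: accepts functions and types alike.
apply_groups(f::Union{Function, Type}, groups) = [f(g) for g in groups]

groups = [[1.0, 2.0], [3.0]]

sums   = apply_groups(sum, groups)     # a regular function
totals = apply_groups(Total, groups)   # a type used as a constructor

# Types are callable but are not Functions, hence the Union in the signature.
type_is_function = Total isa Function
```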

nalimilan · 2018-11-08T19:00:40Z

Fixed. Good to go now?

bkamins

Good to merge (apart from a single comment on documentation I have left).

nalimilan force-pushed the nl/grouping2 branch 2 times, most recently from 671e69a to d69e3f2 Compare September 20, 2018 12:01

quinnj reviewed Sep 20, 2018

nalimilan mentioned this pull request Sep 20, 2018

Review row vs. column orientation of API #1514

Closed

6 tasks

nalimilan force-pushed the nl/grouping2 branch 2 times, most recently from 1884bb8 to 657123c Compare September 21, 2018 16:32

nalimilan mentioned this pull request Sep 22, 2018

RFC: add mapdf(), restore the default Base.map() behaviour #1049

Closed

nalimilan force-pushed the nl/grouping2 branch 2 times, most recently from c6a27f4 to 8a67a81 Compare September 22, 2018 17:20

This was referenced Sep 23, 2018

DataFrame "by()" causes stackoverflow #1532

Closed

split-apply-combine on large dataframe segfaults #1472

Closed

pdeffebach mentioned this pull request Sep 26, 2018

Make describe work for a grouped dataframe #1443

Open

nalimilan mentioned this pull request Sep 29, 2018

Change the way grouped transforms work JuliaData/DataFramesMeta.jl#101

Merged

nalimilan force-pushed the nl/grouping2 branch from 8a67a81 to 256bd1d Compare September 29, 2018 15:29

bkamins mentioned this pull request Oct 3, 2018

Add describe for GroupedDataFrame #1539

Closed

This was referenced Oct 4, 2018

sanitized by function #1555

Closed

Ungrouping #1438

Closed

bkamins mentioned this pull request Oct 5, 2018

Add Julia h2oai/db-benchmark#30

Closed

bkamins reviewed Oct 6, 2018

bkamins reviewed Oct 6, 2018

bkamins mentioned this pull request Nov 6, 2018

cleanup map function #1588

Merged

nalimilan added 18 commits November 8, 2018 19:47

Work around julia#15276

87d3878

Fix CategoricalArray handling

63f7086

Remove unneeded new SubDataFrame methods

9da4e96

Use Tables.allocatecolumn to fix CategoricalArray handling

7bfeef3

Switch to Tables.allocatecolumns

c0c4b19

Small cleanup

85e5d15

Test behavior when input is empty

66b0030

More fixes

6f3f56f

Allow columns in different orders

34ed595

Reinstate combine(), change map() to return a GroupedDataFrame

702fb52

Fix key name in show

59ce161

Review fixes

ac8cce0

Uncomment tests that crash Julia

5d021c4

Work around crash

e428b89

Review fixes

2e57ec0

Rename functions to avoid variable name conflict on 0.7

fdb7293

Fixes

4aea51a

nalimilan force-pushed the nl/grouping2 branch from 6c5db1c to 4aea51a Compare November 8, 2018 19:00

bkamins reviewed Nov 8, 2018

bkamins approved these changes Nov 8, 2018

Fix signature

3f4e815

nalimilan merged commit bd3e1b4 into master Nov 8, 2018

nalimilan deleted the nl/grouping2 branch November 8, 2018 21:28

bkamins mentioned this pull request Nov 9, 2018

Unify eachcol and columns functions #1590

Merged

nalimilan mentioned this pull request Nov 17, 2018

Support type-stable map and combine on GroupedDataFrame #1601

Merged

nalimilan mentioned this pull request Oct 18, 2019

Get values of grouped columns #1908

Merged
