Indexing of a GroupedDataFrame by group #1693

pdeffebach · 2019-01-22T19:14:33Z

Should we allow indexing of a grouped DataFrame by the grouping variables? gd(:a == 1 && :b == 2) which would return a grouped dataframe of all the groups matching those values (say there is a third group, :c.

The text was updated successfully, but these errors were encountered:

bkamins · 2019-01-25T13:01:03Z

The syntax you propose fits more to DataFramesMeta.jl than DataFrames.jl. I guess what you propose should be rather handled by a custom filter function.

nalimilan · 2019-01-25T17:42:07Z

I've thought many times that GroupedDataFrame should behave like an AbstractDataFrame in most situations. That would fix this use case, among others.

bkamins · 2019-01-25T18:28:45Z

@nalimilan. So is your proposal the following:

make it a subtype of AbstractDataFrame
make all non-mutating functions defined on AbstractDataFrame work on GroupedDataFrame per row (i.e. as if DataFrame(gd) were called)
make it a functor so that a call like gd(idxs) returns a new GroupedDataFrame with groups that meet the idxs index;
add a function like getgroupcols(gd, idx) retrieving a NamedTuple or DataFrameRow of values defining a given group (i.e. all cols that are grouping cols with their respective values)? (the last point could be added anyway as I think it would be useful as currently it is a bit clumsy to get them)

nalimilan · 2019-01-25T19:13:31Z

* make it a subtype of `AbstractDataFrame`

* make all non-mutating functions defined on `AbstractDataFrame` work on `GroupedDataFrame` per row (i.e. as if `DataFrame(gd)` were called)

Agreed.

make it a functor so that a call like gd(idxs) returns a new GroupedDataFrame with groups that meet the idxs index;

No, I don't think using puns with this syntax is a good idea. Standard indexing should be OK (with two dimensions).

add a function like getgroupcols(gd, idx) retrieving a NamedTuple or DataFrameRow of values defining a given group (i.e. all cols that are grouping cols with their respective values)? (the last point could be added anyway as I think it would be useful as currently it is a bit clumsy to get them)

When would that be useful?

bkamins · 2019-01-25T19:27:23Z

No, I don't think using puns with this syntax is a good idea

So how do you want the user to select some specific group from a GroupedDataFrame?

When would that be useful?

You select group 3 from a GroupedDataFrame and want to learn the values of grouping variables in this group. Now you have to write gd[3][1, cols] to get a DataFrameRow with this information.

bkamins · 2019-01-25T19:29:56Z

of course assuming that you know cols used to create a GroupedDataFrame

bkamins · 2019-01-25T20:36:35Z

Having thought about it a bit maybe a better startegy would be to leave GroupedDataFrame as is as a view, but rather equip DataFrame with an index that would be created on as-needed basis and GroupedDataFrame would just use it to give a different view to the DataFrame and also in many cases we would know that it got invalidated when DataFrame is mutated.

The only drawback would be that now you can create GroupedDataFrame on a SubDataFrame without calculating the index for the whole parent, but I guess this is a minor issue.

bkamins · 2019-02-01T13:26:50Z

The decision on this should also mean what we return with keys(::GroupedDataFrame). Now it is undefined.

nalimilan · 2019-02-04T21:51:16Z

So how do you want the user to select some specific group from a GroupedDataFrame?

I imagine we could support something like gd[(1, 2)] or even gd[(a=1, b=2)].

You select group 3 from a GroupedDataFrame and want to learn the values of grouping variables in this group. Now you have to write gd[3][1, cols] to get a DataFrameRow with this information.

Yes but is that an actual use case? I mean, that sounds reasonable, but in what concrete operations does one need this? Anyway, I agree using keys for this is appealing, especially if we accept tuples in indexing (like for Dict).

Having thought about it a bit maybe a better startegy would be to leave GroupedDataFrame as is as a view, but rather equip DataFrame with an index that would be created on as-needed basis and GroupedDataFrame would just use it to give a different view to the DataFrame and also in many cases we would know that it got invalidated when DataFrame is mutated.
The only drawback would be that now you can create GroupedDataFrame on a SubDataFrame without calculating the index for the whole parent, but I guess this is a minor issue.

Sorry, I don't follow. What kind of index would that be?

bkamins · 2019-02-04T22:43:12Z

I imagine we could support something like gd[(1, 2)] or even gd[(a=1, b=2)].

We could do it (but this is breaking, so we should do it sooner than later). However, I think that instead of gd[(a=1, b=2)] it would be better to implement a performant filter function that would do this job.

Yes but is that an actual use case?

In interactive use it is not very useful. But if you write a script that iterates groups it could be useful to get information about the values of grouping variables, something like:

for (i, sdf) in enumerate(gd)
     if predicate(keys(gd)[i])
          # do something
     end
end

instead of what you currently have to do:

for sdf in gd
     if predicate(sdf[1, groupvars(gd)])
          # do something
     end
end

Looking at this what we have now is not that bad, but maybe not an obvious pattern. I am not sure if it is worth to add.

bkamins · 2019-02-04T22:43:58Z

What kind of index would that be?

The same kind of index that e.g. JuliaDB uses for indexing.

nalimilan · 2019-02-05T08:53:36Z

We could do it (but this is breaking, so we should do it sooner than later). However, I think that instead of gd[(a=1, b=2)] it would be better to implement a performant filter function that would do this job.

Why is it breaking? AFAICT that's an error now.

In interactive use it is not very useful. But if you write a script that iterates groups it could be useful to get information about the values of grouping variables, something like:

OK, makes sense.

The same kind of index that e.g. JuliaDB uses for indexing.

So what you propose is to have groupby return a DataFrame with grouping information stored somewhere?

bkamins · 2019-02-05T09:54:54Z

AFAICT that's an error now.

This is true. I mean that changing normal indexing into GroupedDataFrame is significantly breaking.

with grouping information stored somewhere

Yes. Actually you could add this information "in place" to the same DataFrame. Even you could imagine a dictionary of multiple indices (not sure if this would be very useful). Then you could retain GroupedDataFrame as an object indicating the selected grouping. The benefit would be that e.g. calling by multiple times on the same DataFrame with the same columns as grouping could reuse this case.

This is something I think was mentioned when discussing H2O benchmarks that data.table plans to implement (i.e. if you run a group-by operation on it you can opt-in to cache grouping metadata).

I do not want to say that this is a must, but I just wanted to explore the possibilities before we make a decision.

nalimilan · 2019-02-05T18:55:06Z

This is true. I mean that changing normal indexing into GroupedDataFrame is significantly breaking.

I still don't understand. What's the "normal indexing"?

Yes. Actually you could add this information "in place" to the same DataFrame. Even you could imagine a dictionary of multiple indices (not sure if this would be very useful). Then you could retain GroupedDataFrame as an object indicating the selected grouping. The benefit would be that e.g. calling by multiple times on the same DataFrame with the same columns as grouping could reuse this case.

This is something I think was mentioned when discussing H2O benchmarks that data.table plans to implement (i.e. if you run a group-by operation on it you can opt-in to cache grouping metadata).

I do not want to say that this is a must, but I just wanted to explore the possibilities before we make a decision.

Yeah, that's interesting. A related thing that we could do is to support primary keys and/or keeping track of whether a DataFrame has been sorted on some columns. That would allow GroupedDataFrame to drop the permutation vector (or just use a range object and parameterize on that).

OTOH it's not very typical of Julia to discretely mutate the argument in a function which doesn't end with ! (like gropuby or by), even if that's just for optimization. AFAIK in Julia we rather tend to be explicit by having users call groupby first if they intend to do multiple operations with the same grouping. So there would probably be an explicit mutating operation like groupby! or setkey!.

bkamins · 2019-02-05T19:11:20Z

With "normal indexing" I mean that gd[int] would return int-th column of gd after the change right? This would be breaking.

Regarding the index issue - I agree that an explicit ! in methods should be used. And also keeping track if DataFrame is sorted. Actually even if we are not sure if it is sorted then - as discussed some time ago - we can run issorted on it as it should be fast relative to all other operations and use the result of issorted to optimize things.

nalimilan · 2019-02-05T20:25:01Z

With "normal indexing" I mean that gd[int] would return int-th column of gd after the change right? This would be breaking.

Ah, yes, that's the main (or only) conflict with the AbstractDataFrame API. But gd[(1, 2)] and gd[(a=1, b=2)] are totally orthogonal to that issue AFAICT.

bkamins · 2019-02-05T21:27:45Z

Yes. In general (of course no pressure and no rush 😄) it would be best it at some point to write down the contract around these issues related to GroupedDataFrame (like I try to do for indexing and broadcasting) as then we would have a clear head around the design.

bkamins · 2020-02-12T12:23:35Z

Now we support indexing by values.

pdeffebach mentioned this issue Feb 1, 2019

Group Indices function #1704

Closed

jlumpe mentioned this issue Aug 4, 2019

Get values of grouped columns #1908

Merged

bkamins closed this as completed Feb 12, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Indexing of a GroupedDataFrame by group #1693

Indexing of a GroupedDataFrame by group #1693

pdeffebach commented Jan 22, 2019

bkamins commented Jan 25, 2019

nalimilan commented Jan 25, 2019

bkamins commented Jan 25, 2019

nalimilan commented Jan 25, 2019

bkamins commented Jan 25, 2019 •

edited

Loading

bkamins commented Jan 25, 2019

bkamins commented Jan 25, 2019

bkamins commented Feb 1, 2019

nalimilan commented Feb 4, 2019

bkamins commented Feb 4, 2019 •

edited

Loading

bkamins commented Feb 4, 2019

nalimilan commented Feb 5, 2019

bkamins commented Feb 5, 2019

nalimilan commented Feb 5, 2019

bkamins commented Feb 5, 2019

nalimilan commented Feb 5, 2019

bkamins commented Feb 5, 2019

bkamins commented Feb 12, 2020

Indexing of a GroupedDataFrame by group #1693

Indexing of a GroupedDataFrame by group #1693

Comments

pdeffebach commented Jan 22, 2019

bkamins commented Jan 25, 2019

nalimilan commented Jan 25, 2019

bkamins commented Jan 25, 2019

nalimilan commented Jan 25, 2019

bkamins commented Jan 25, 2019 • edited Loading

bkamins commented Jan 25, 2019

bkamins commented Jan 25, 2019

bkamins commented Feb 1, 2019

nalimilan commented Feb 4, 2019

bkamins commented Feb 4, 2019 • edited Loading

bkamins commented Feb 4, 2019

nalimilan commented Feb 5, 2019

bkamins commented Feb 5, 2019

nalimilan commented Feb 5, 2019

bkamins commented Feb 5, 2019

nalimilan commented Feb 5, 2019

bkamins commented Feb 5, 2019

bkamins commented Feb 12, 2020

bkamins commented Jan 25, 2019 •

edited

Loading

bkamins commented Feb 4, 2019 •

edited

Loading