Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Indexing of a GroupedDataFrame by group #1693

Closed
pdeffebach opened this issue Jan 22, 2019 · 18 comments
Closed

Indexing of a GroupedDataFrame by group #1693

pdeffebach opened this issue Jan 22, 2019 · 18 comments

Comments

@pdeffebach
Copy link
Contributor

Should we allow indexing of a grouped DataFrame by the grouping variables? gd(:a == 1 && :b == 2) which would return a grouped dataframe of all the groups matching those values (say there is a third group, :c.

@bkamins
Copy link
Member

bkamins commented Jan 25, 2019

The syntax you propose fits more to DataFramesMeta.jl than DataFrames.jl. I guess what you propose should be rather handled by a custom filter function.

@nalimilan
Copy link
Member

I've thought many times that GroupedDataFrame should behave like an AbstractDataFrame in most situations. That would fix this use case, among others.

@bkamins
Copy link
Member

bkamins commented Jan 25, 2019

@nalimilan. So is your proposal the following:

  • make it a subtype of AbstractDataFrame
  • make all non-mutating functions defined on AbstractDataFrame work on GroupedDataFrame per row (i.e. as if DataFrame(gd) were called)
  • make it a functor so that a call like gd(idxs) returns a new GroupedDataFrame with groups that meet the idxs index;
  • add a function like getgroupcols(gd, idx) retrieving a NamedTuple or DataFrameRow of values defining a given group (i.e. all cols that are grouping cols with their respective values)? (the last point could be added anyway as I think it would be useful as currently it is a bit clumsy to get them)

@nalimilan
Copy link
Member

* make it a subtype of `AbstractDataFrame`

* make all non-mutating functions defined on `AbstractDataFrame` work on `GroupedDataFrame` per row (i.e. as if `DataFrame(gd)` were called)

Agreed.

make it a functor so that a call like gd(idxs) returns a new GroupedDataFrame with groups that meet the idxs index;

No, I don't think using puns with this syntax is a good idea. Standard indexing should be OK (with two dimensions).

add a function like getgroupcols(gd, idx) retrieving a NamedTuple or DataFrameRow of values defining a given group (i.e. all cols that are grouping cols with their respective values)? (the last point could be added anyway as I think it would be useful as currently it is a bit clumsy to get them)

When would that be useful?

@bkamins
Copy link
Member

bkamins commented Jan 25, 2019

No, I don't think using puns with this syntax is a good idea

So how do you want the user to select some specific group from a GroupedDataFrame?

When would that be useful?

You select group 3 from a GroupedDataFrame and want to learn the values of grouping variables in this group. Now you have to write gd[3][1, cols] to get a DataFrameRow with this information.

@bkamins
Copy link
Member

bkamins commented Jan 25, 2019

of course assuming that you know cols used to create a GroupedDataFrame

@bkamins
Copy link
Member

bkamins commented Jan 25, 2019

Having thought about it a bit maybe a better startegy would be to leave GroupedDataFrame as is as a view, but rather equip DataFrame with an index that would be created on as-needed basis and GroupedDataFrame would just use it to give a different view to the DataFrame and also in many cases we would know that it got invalidated when DataFrame is mutated.

The only drawback would be that now you can create GroupedDataFrame on a SubDataFrame without calculating the index for the whole parent, but I guess this is a minor issue.

@bkamins
Copy link
Member

bkamins commented Feb 1, 2019

The decision on this should also mean what we return with keys(::GroupedDataFrame). Now it is undefined.

@nalimilan
Copy link
Member

So how do you want the user to select some specific group from a GroupedDataFrame?

I imagine we could support something like gd[(1, 2)] or even gd[(a=1, b=2)].

You select group 3 from a GroupedDataFrame and want to learn the values of grouping variables in this group. Now you have to write gd[3][1, cols] to get a DataFrameRow with this information.

Yes but is that an actual use case? I mean, that sounds reasonable, but in what concrete operations does one need this? Anyway, I agree using keys for this is appealing, especially if we accept tuples in indexing (like for Dict).

Having thought about it a bit maybe a better startegy would be to leave GroupedDataFrame as is as a view, but rather equip DataFrame with an index that would be created on as-needed basis and GroupedDataFrame would just use it to give a different view to the DataFrame and also in many cases we would know that it got invalidated when DataFrame is mutated.
The only drawback would be that now you can create GroupedDataFrame on a SubDataFrame without calculating the index for the whole parent, but I guess this is a minor issue.

Sorry, I don't follow. What kind of index would that be?

@bkamins
Copy link
Member

bkamins commented Feb 4, 2019

I imagine we could support something like gd[(1, 2)] or even gd[(a=1, b=2)].

We could do it (but this is breaking, so we should do it sooner than later). However, I think that instead of gd[(a=1, b=2)] it would be better to implement a performant filter function that would do this job.

Yes but is that an actual use case?

In interactive use it is not very useful. But if you write a script that iterates groups it could be useful to get information about the values of grouping variables, something like:

for (i, sdf) in enumerate(gd)
     if predicate(keys(gd)[i])
          # do something
     end
end

instead of what you currently have to do:

for sdf in gd
     if predicate(sdf[1, groupvars(gd)])
          # do something
     end
end

Looking at this what we have now is not that bad, but maybe not an obvious pattern. I am not sure if it is worth to add.

@bkamins
Copy link
Member

bkamins commented Feb 4, 2019

What kind of index would that be?

The same kind of index that e.g. JuliaDB uses for indexing.

@nalimilan
Copy link
Member

We could do it (but this is breaking, so we should do it sooner than later). However, I think that instead of gd[(a=1, b=2)] it would be better to implement a performant filter function that would do this job.

Why is it breaking? AFAICT that's an error now.

In interactive use it is not very useful. But if you write a script that iterates groups it could be useful to get information about the values of grouping variables, something like:

OK, makes sense.

The same kind of index that e.g. JuliaDB uses for indexing.

So what you propose is to have groupby return a DataFrame with grouping information stored somewhere?

@bkamins
Copy link
Member

bkamins commented Feb 5, 2019

AFAICT that's an error now.

This is true. I mean that changing normal indexing into GroupedDataFrame is significantly breaking.

with grouping information stored somewhere

Yes. Actually you could add this information "in place" to the same DataFrame. Even you could imagine a dictionary of multiple indices (not sure if this would be very useful). Then you could retain GroupedDataFrame as an object indicating the selected grouping. The benefit would be that e.g. calling by multiple times on the same DataFrame with the same columns as grouping could reuse this case.

This is something I think was mentioned when discussing H2O benchmarks that data.table plans to implement (i.e. if you run a group-by operation on it you can opt-in to cache grouping metadata).

I do not want to say that this is a must, but I just wanted to explore the possibilities before we make a decision.

@nalimilan
Copy link
Member

This is true. I mean that changing normal indexing into GroupedDataFrame is significantly breaking.

I still don't understand. What's the "normal indexing"?

Yes. Actually you could add this information "in place" to the same DataFrame. Even you could imagine a dictionary of multiple indices (not sure if this would be very useful). Then you could retain GroupedDataFrame as an object indicating the selected grouping. The benefit would be that e.g. calling by multiple times on the same DataFrame with the same columns as grouping could reuse this case.

This is something I think was mentioned when discussing H2O benchmarks that data.table plans to implement (i.e. if you run a group-by operation on it you can opt-in to cache grouping metadata).

I do not want to say that this is a must, but I just wanted to explore the possibilities before we make a decision.

Yeah, that's interesting. A related thing that we could do is to support primary keys and/or keeping track of whether a DataFrame has been sorted on some columns. That would allow GroupedDataFrame to drop the permutation vector (or just use a range object and parameterize on that).

OTOH it's not very typical of Julia to discretely mutate the argument in a function which doesn't end with ! (like gropuby or by), even if that's just for optimization. AFAIK in Julia we rather tend to be explicit by having users call groupby first if they intend to do multiple operations with the same grouping. So there would probably be an explicit mutating operation like groupby! or setkey!.

@bkamins
Copy link
Member

bkamins commented Feb 5, 2019

With "normal indexing" I mean that gd[int] would return int-th column of gd after the change right? This would be breaking.

Regarding the index issue - I agree that an explicit ! in methods should be used. And also keeping track if DataFrame is sorted. Actually even if we are not sure if it is sorted then - as discussed some time ago - we can run issorted on it as it should be fast relative to all other operations and use the result of issorted to optimize things.

@nalimilan
Copy link
Member

With "normal indexing" I mean that gd[int] would return int-th column of gd after the change right? This would be breaking.

Ah, yes, that's the main (or only) conflict with the AbstractDataFrame API. But gd[(1, 2)] and gd[(a=1, b=2)] are totally orthogonal to that issue AFAICT.

@bkamins
Copy link
Member

bkamins commented Feb 5, 2019

Yes. In general (of course no pressure and no rush 😄) it would be best it at some point to write down the contract around these issues related to GroupedDataFrame (like I try to do for indexing and broadcasting) as then we would have a clear head around the design.

@bkamins
Copy link
Member

bkamins commented Feb 12, 2020

Now we support indexing by values.

@bkamins bkamins closed this as completed Feb 12, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants